I'm trying to learn how to use RCurl (or some other suitable R package if I'm wrong about RCurl being the right tool) to automate the process of submitting search terms to a web form and placing the search results in a data file. The specific problem I'm working on is as follows:
I have a data file giving license plate number (LPN) and vehicle identification number (VIN) for several automobiles. The California Department of Motor Vehicles (DMV) has a web page search form where you enter the LPN and the last five digits of the VIN, and it returns the vehicle license fee (VLF) payment for either 2010 or 2009 (there's a selector for that on the input form as well). (FYI: This is for a research project to look at the distribution of VLF payments by vehicle make, model and model year)
I could go through the tedious process of manually entering data for each vehicle and then manually typing the result into a spreadsheet. But this is the 21st Century and I'd like to try and automate the process. I want to write a script that will submit each LPN and VIN to the DMV web form and then put the result (the VLF payment) in a new VLF variable in my data file, doing this repeatedly until it gets to the end of the list of LPNs and VINs. (The DMV web form is here by the way:
https://www.dmv.ca.gov/FeeCalculatorWeb/vlfForm.do).
My plan was to use getHTMLFormDescription() (in the RHTMLForms package) to find out the names of the input fields and then use getForm() or postForm() (in the RCurl package) to retrieve the output. Unfortunately, I got stuck at the very first step. Here's the R command I used and the output:
> forms = getHTMLFormDescription("https://www.dmv.ca.gov/FeeCalculatorWeb/vlfForm.do")
Error in htmlParse(url, ...) :
File https://www.dmv.ca.gov/FeeCalculatorWeb/vlfForm.do does not exist
Unfortunately, being relatively new to R and almost completely new to HTTP and web-scraping, I'm not sure what to do next.
First, does anyone know why I'm getting an error on my getHTMLFormDescription() call? Alternatively, is there another way to figure out the names of the input fields?
Second, can you suggest some sample code to help me get started on actually submitting LPNs and VINs and retrieving the output? Is getForm() or postForm() the right approach or should I be doing something else? If it would help to have some real LPN-VIN combinations to submit, here are three:
LPN VIN
5MXH018 30135
4TOL562 74735
5CWR968 11802
Finally, since you can see I'm a complete novice at this, do you have suggestions on what I need to learn in order to become adept at web scraping of this sort and how to go about learning it (in R or in another language)? Specific suggestions for web sites, books, listservs, other StackOverflow questions, etc. would be great.
Thanks for your help.
See Question&Answers more detail:
os