With ParseHub, you can scrape information from many vehicle dealership websites, including those listing cars, vans, trucks and motorcycles.
In this tutorial, you will learn how to scrape details such as price, mileage or VIN from each of the listings on a car dealership website. To demonstrate, we will scrape data from www.autolist.com.
Scrape data from the listings page
1. Open ParseHub, click on "New Project" and enter the URL you would like to scrape data from. In this case, we are using this URL, which already specifies location and vehicle type. Click on "Start project on this URL".
2. Once the website has loaded, click on the + sign next to "Select page" and choose a Select command. The Select command allows you to select elements on the page.
3. Use your Select command to click on the first listing name which should be highlighted in green.
4. Similar elements will be highlighted in yellow. Click on the second listing name and you should see every listing name highlighted in green and the number of selections on the left hand side. If this is not the case, click on any unselected listing names until all of them are selected.
5. You can double-click on "selection1" to rename it to something such as "Listing".
6. You have the option to select other information that appears on the listing page by using Relative Select commands. Relative Select commands relate data - for example, the listing name to the listing price, or the listing name to the listing location. To add a Relative Select command, click on the + sign next to "Begin new entry in Listing" and choose "Relative Select".
7. Click on the first listing name and then click on the first listing price to relate the two. You should see an arrow going from each listing name to its associated price. You can double-click on "selection2" and rename it to "price".
8. You can preview your data on the bottom panel. Your project's JSON preview should look like this:
And your project's CSV/Excel preview should look like this:
If you are not interested in scraping the Listing_url or the Listing_price_url (these are extracted by default if you select links), you can:
- Click on the x that appears when hovering over "Extract url" to remove the Listing_url
- Click on the + sign next to "Relative price", go to "Advanced" and choose an Extract command to remove the Listing_price_url
9. Currently, the project will scrape all listing names and prices from the first page only. To have it click through to the next page, click on the + sign next to "Select page", choose a Select command and use it to click on the "next" button. You can rename your selection to "next".
10. Click on the + sign next to "Select & Extract next" and choose a Click command.
11. The pop-up that appears will ask you if this is a "next page" button. Since it is, click on "Yes", which should default to "Repeat Current Template". Click on "Repeat Current Template".
Your project is now set to scrape all listing names and prices on every results page for your URL.
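Conceptually, the finished project behaves like a loop: extract every listing on the current page, then follow the "next" button and run the same template again until there is no next page. The Python sketch below mimics that logic on made-up in-memory pages. It is only an illustration and not how ParseHub is implemented; the page data, the field names and the scrape_all_pages function are all hypothetical.

```python
# Hypothetical in-memory stand-in for a paginated results site (not real data).
# Each page holds some listings and the key of the next page (None on the last page).
PAGES = {
    "page1": {
        "listings": [
            {"name": "Example car A", "price": "$10,000"},
            {"name": "Example car B", "price": "$12,500"},
        ],
        "next": "page2",
    },
    "page2": {
        "listings": [{"name": "Example car C", "price": "$8,900"}],
        "next": None,
    },
}


def scrape_all_pages(pages, start):
    """Collect name/price entries from every page by following 'next' links."""
    results = []
    visited = set()
    current = start
    while current is not None and current not in visited:
        visited.add(current)  # guard against a "next" link that loops back
        page = pages[current]
        # Roughly "Select Listing" plus the relative "price": one entry per listing.
        for listing in page["listings"]:
            results.append({"name": listing["name"], "price": listing["price"]})
        # Roughly "Click next" plus "Repeat Current Template".
        current = page["next"]
    return results


all_listings = scrape_all_pages(PAGES, "page1")
print(len(all_listings), "listings collected")
```

The loop stops when a page has no next link, which corresponds to ParseHub no longer finding a "next" button to click.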
Note that the autolist.com website has an initial loading screen, which may cause ParseHub to try to scrape results before they have had time to load. To prevent this, click on "Select Listing" and enable the "Wait up to 60 seconds for elements to appear" option.
Scrape data from within each listing
If you would like to click into each listing to scrape data from within that listing's page, you can follow the instructions below.
1. Click on the + sign next to "Begin new entry in Listing" and choose a Click command.
2. The pop-up will ask you again whether this is a "next page" button. This time click on "No" and you will be prompted to "Create New Template" which you can call something such as "listing_details". This template will specify the information you would like from each individual listing's page.
3. This should automatically open the first listing page, with your new listing_details template shown on the left-hand side.
4. Within this template, you can use a new Select command for each piece of data that you would like to extract. For each piece of data (e.g. mileage or photo), click on the + sign next to "Select page", choose a Select command and click on that information. For example, we could extract the three fields under "Buyer Intelligence" as follows:
Click to reveal more details on a template
For some websites, you may need to click on a link to view more information. For example, the listing page for autolist.com has a "Detailed vehicle info" link which expands to reveal more specifications. To have ParseHub click on this link:
1. Click on the + sign next to "Select page", choose a Select command and use it to select the "Detailed vehicle info" link:
2. Click on the + sign next to your selection and choose a Click command. In the pop-up, choose "No" when asked if this is a "next page" button and then select the option to "Continue executing the current template".
Scrape unordered vehicle specifications
Vehicle specifications commonly appear in a different order from one listing to the next, depending on what information is available for that vehicle. For example, "Trim" may be the first specification for one vehicle but the third for another. To resolve this issue, we can follow the instructions below, which are based on this tutorial.
1. Click on the + sign next to "Select page" and choose a Select command. Click on the first specification label (e.g. "Trim") and then on one or two others until all of the labels are selected in green.
2. Hover over "Begin new entry in labels" and click on the "x" to delete this command. This will remove "Begin new entry in labels" and "Extract name".
3. Click on the + sign next to "Select labels", click on "Advanced" and choose a Conditional command. Conditionals allow you to specify a criterion that, if met, will execute the commands nested below it.
4. In the "Expression" text box for your condition, enter the following:
$e.text.contains("Label you are interested in")
In this snippet, "$e" stands for "element" (the label you are currently selecting), "text" is the text for that element and "contains" checks that it contains the text within the quotation marks. So if we wanted to extract the VIN, we would write $e.text.contains("VIN") (note that this is case sensitive).
5. Click on the + sign next to "If $e.text.contains("VIN")" and choose a Relative Select command. Use this to relate the "VIN" label to the "VIN" number and call your relative selection "vin".
6. To scrape other specifications, hover over "Select labels", hold down the Shift key so that the + sign appears and repeat steps 3-5 above for each piece of data you would like to extract.
You can also copy and paste commands by clicking on a command to select it (make sure you're not clicked into the option to edit the condition's text), then using Ctrl (Windows) or Command (Mac) + C to copy the command and Ctrl (Windows) or Command (Mac) + V to paste the command. You can click and drag it to make sure it's nested below "Select labels" at the same level as your other conditions.
Your project should look similar to this:
Each condition checks all of the labels and, whenever it finds one that contains the text in your condition, executes the command nested below it, which uses a Relative Select to select the element to the right of that label.
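To make the logic of the conditions concrete, here is a minimal Python sketch of the same idea: walk over every label, test whether its text contains the string you care about, and if so take the value paired with it. This is an analogy rather than ParseHub's own code; the sample rows, the CONDITIONS mapping and the extract_specs function are hypothetical, and Python's "in" operator is used here as a stand-in for the case-sensitive contains check.

```python
# Hypothetical (label, value) pairs for two vehicles (made-up values).
# The order differs between them, so picking "the second row" would be unreliable.
vehicle_1 = [("Trim", "LX"), ("VIN", "EXAMPLEVIN0000001"), ("Mileage", "42,000")]
vehicle_2 = [("VIN", "EXAMPLEVIN0000002"), ("Mileage", "15,300"), ("Trim", "Sport")]

# One entry per Conditional command: text to look for -> name of the extracted field.
CONDITIONS = {"VIN": "vin", "Trim": "trim", "Mileage": "mileage"}


def extract_specs(rows, conditions):
    """Return the values whose labels contain one of the condition strings."""
    record = {}
    for label, value in rows:  # like "Select labels": visit every label
        for needle, field in conditions.items():
            if needle in label:  # case-sensitive substring test
                record[field] = value  # like the Relative Select to the value
    return record


print(extract_specs(vehicle_1, CONDITIONS))
print(extract_specs(vehicle_2, CONDITIONS))
```

Both calls yield the same three fields even though the rows arrive in a different order. A label written as "vin" would not match the "VIN" condition, which mirrors the case sensitivity noted in step 4.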
Note that each website will be slightly different, so some of the suggestions for individual listings above may not apply to your car dealership website. If you run into any trouble, please feel free to contact us for support.