Now that you've downloaded ParseHub, opened your first project and have an idea of how the ParseHub tools, templates and commands work from the previous three lessons, you're ready to try your first sample project!
Many websites share a similar layout: one page containing a list of results and another page containing details once you've clicked into one of those results. This is the case for directories, listings, e-commerce sites, classifieds sites, real estate listings, dealerships, blogs, news sources, etc.
In this lesson, we'll scrape details from Society6 as an example of how you can scrape information from this type of layout.
Be sure to save your project regularly by clicking on "Save" under the project options icon.
Extract every product name and url
1. Open the ParseHub client and click on New Project [No "New Project" button? Check out this troubleshooting guide]. We are going to scrape directly from the "Wall Tapestries" category and the URL we'll be using is https://society6.com/tapestries - enter this into the text box and click on Start project on this URL.
2. This will load the page in the page view area. On the sidebar you will find a main_template for this page layout and an "Empty selection1" command, which you can use to select information on the page. In this case, click on the very first product name. It should be highlighted in green once selected, and the other product names should be highlighted in yellow to indicate that ParseHub has identified them as similar elements.
3. Click on the second product name and notice how the number of elements next to "Select selection1" command has increased (it now shows "Select selection1 (42)") - if it hasn't yet selected all of the products on the page, click on another product name until all of the product names on the page are highlighted in green.
Our selection "Select selection1" is currently selecting 42 elements in our project (the number of elements on the page may vary on your project), all of which will be highlighted in green on the page.
4. If you double click on the text "selection1", you can rename this to something more descriptive, such as "Product". Names may only contain letters, numbers and underscores (_).
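The naming rule in step 4 can be checked mechanically before you rename a selection. The following is a minimal sketch, assuming the rule means "one or more ASCII letters, digits or underscores"; `is_valid_name` is a hypothetical helper written for illustration and is not part of ParseHub.

```python
import re

# Hypothetical helper mirroring the naming rule stated in step 4
# (letters, numbers and underscores only). This is not a ParseHub API.
_NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def is_valid_name(name: str) -> bool:
    """Return True if `name` uses only letters, digits and underscores."""
    return _NAME_PATTERN.fullmatch(name) is not None


print(is_valid_name("Product"))      # True
print(is_valid_name("next_button"))  # True
print(is_valid_name("next button"))  # False (contains a space)
```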
Once you've selected more than one element, as we have done in step 3 above, ParseHub will automatically add a Begin new entry command (hidden under the list icon), which ensures that each of the products selected will be on its own CSV row or have its own scope in JSON.
ParseHub has also automatically added Extract commands for both the name and the url, which you can preview in the bottom pane.
You could always delete one or both of these commands and your preview data would update accordingly.
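To make "one CSV row or one JSON scope per product" concrete, here is a rough sketch of how such data could look and be flattened. The keys `Product`, `name` and `url` follow the names used in this lesson; the sample values, the `to_csv` helper and the exact column names are assumptions for illustration, and a real export may be laid out differently.

```python
import csv
import io
import json

# Hypothetical sample data: one scope (dict) per selected product.
# The values are made up; only the key names come from this lesson.
results = {
    "Product": [
        {"name": "Sample Tapestry A", "url": "https://example.com/product-a"},
        {"name": "Sample Tapestry B", "url": "https://example.com/product-b"},
    ]
}


def to_csv(data: dict) -> str:
    """Flatten the per-product entries into CSV text, one row per product."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "url"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(data["Product"])
    return buffer.getvalue()


print(json.dumps(results, indent=2))  # each product has its own JSON scope
print(to_csv(results))                # each product is on its own CSV row
```

Deleting one of the Extract commands corresponds to dropping that key from every entry, which is why the preview updates for all products at once.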
5. In order to have ParseHub scrape not only results on the first page but also results on other pages, you'll need to add pagination. In this case, click on the + sign next to "Select page" and choose a new Select command.
6. A new "Empty selection1" command will appear. Click on the "Next" button on the website, which should be highlighted in green, and double-click on the name "selection1" to rename it to "nextButton". It should show (1) element selected, which refers to the "Next" button.
7. In order to teach ParseHub how to click on the element we just selected, click on the + sign next to "Select & Extract nextButton" and choose a Click command.
8. This will bring up a pop-up asking you what you would like to do once the "Next" button has been clicked. If you click on "Yes" when it asks if this is a "next page" button, it will default to "Repeat the current template" as ParseHub should repeat everything we did on page 1 on the results for every subsequent page.
9. To recap what we've done so far, ParseHub will follow the commands in our project in this order:
- Select page: load and select the whole page
- Select Product: select all of the product names on the page
- Extract name: extract the name of the product into that product's entry
- Extract url: extract the url of the product into that product's entry
- Select nextButton: select the "next" button
- Click and go to main_template: click the "next" button and return to this template to repeat the above actions on the next page.
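The control flow in this recap is essentially a loop that repeats the same template until there is no "Next" button left. The sketch below imitates that flow in plain Python. `PAGES` is a hypothetical in-memory stand-in for a paginated site and `run_main_template` is an illustrative function name; neither reflects how ParseHub is implemented internally.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for a paginated site: each page lists products and
# names the page its "Next" button leads to (None on the last page).
PAGES = {
    "page1": {
        "products": [{"name": "A", "url": "/a"}, {"name": "B", "url": "/b"}],
        "next": "page2",
    },
    "page2": {"products": [{"name": "C", "url": "/c"}], "next": None},
}


def run_main_template(start: str) -> List[Dict[str, str]]:
    """Repeat the same steps on every page until no next page remains."""
    entries: List[Dict[str, str]] = []
    current: Optional[str] = start
    while current is not None:
        page = PAGES[current]                # Select page
        for product in page["products"]:     # Select Product
            entries.append({
                "name": product["name"],     # Extract name
                "url": product["url"],       # Extract url
            })
        current = page["next"]               # Select nextButton, then Click
    return entries


print(run_main_template("page1"))
```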
Extract additional product information from the results page
10. If we wanted to extract more than just the name and URL for each product, we could do so by using a Relative Select command which associates data. Click on the + sign next to "Select Product" and choose a Relative Select command which will add a "Relative selection1" to your commands.
11. First click on the main item, in this case the product name (you can click on any one of the products). As you move your cursor away from the product name, you'll see an arrow stemming from that selection. Click on the item that you would like to extract, for example the artist, and rename "selection1" to "artist".
12. This should automatically include arrows from all other product names to each of their associated artists but, if that's not the case, click on a product name which isn't doing this and then click on its artist to teach ParseHub to include that element as well.
13. You can repeat steps 10 - 12 to extract any other information that appears on this page such as the price, number of likes or sale price. There are some useful tricks that you can use such as zooming in and out to select elements or scraping ratings and reviews but if you're having trouble extracting the data you need, you can always contact us.
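The point of a Relative Select is that each extra value stays attached to the product it belongs to, rather than being collected as a separate list. A minimal sketch of that idea follows. The `cards` data and the `relative_select` function are hypothetical, and the choice to represent a missing artist as `None` is an assumption of this sketch rather than documented ParseHub behaviour.

```python
from typing import Dict, List, Optional

# Hypothetical product cards as they might appear on a results page.
# The second card deliberately has no artist.
cards = [
    {"name": "A", "artist": "Artist One"},
    {"name": "B"},
    {"name": "C", "artist": "Artist Three"},
]


def relative_select(items: List[dict], field: str) -> List[Dict[str, Optional[str]]]:
    """Look up `field` inside each product's own card, keeping rows aligned."""
    return [{"name": item["name"], field: item.get(field)} for item in items]


rows = relative_select(cards, "artist")
for row in rows:
    print(row)
```

Because the lookup happens per product, a card with a missing value does not shift the remaining artists onto the wrong products.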
Click into each product page to extract more details
14. If we wanted to click into each product to get more information from that product's listing page, we could add a Click command to each product entry. To do this, click on the + sign next to "Begin new entry in Product" and choose Click.
15. This will open a pop-up asking whether what we are clicking on is a "next page" button. In this case, we will choose "No", create a new template and call that template something like "details", since it will apply to the layout of each product's individual page.
16. This will open up the page for the first product and our new "details" template with an "Empty selection1". We can click on the first piece of data we're interested in extracting - for example, the product's number of reviews - and rename "selection1" to "reviews".
17. For each new piece of data we wish to extract, we can click on the + sign next to "Select page", choose a Select command and click on that new item, which will result in multiple Select commands.
If you wish to move between templates, you can always open the page corresponding to that template (by going to the browser tab with that page, entering it in the URL or navigating to it in Browse mode) and double-click on the template name to open that template.
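Steps 14 to 17 amount to two templates calling one another: the main template visits each product, and the "details" template runs on that product's own page and adds its data to the same entry. The sketch below models this with two plain functions. `DETAIL_PAGES`, the function names and the sample values are all hypothetical, chosen to match the template and selection names used in this lesson.

```python
from typing import Dict, List

# Hypothetical stand-in for individual product pages, keyed by URL.
DETAIL_PAGES = {
    "/a": {"reviews": "12"},
    "/b": {"reviews": "3"},
}


def details_template(url: str) -> Dict[str, str]:
    """Run on a single product page and extract the fields selected there."""
    page = DETAIL_PAGES[url]              # Select page (product page)
    return {"reviews": page["reviews"]}   # Select reviews


def main_template(products: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Create one entry per product, then click through for more details."""
    entries = []
    for product in products:              # Begin new entry in Product
        entry = {"name": product["name"], "url": product["url"]}
        entry.update(details_template(product["url"]))  # Click, go to "details"
        entries.append(entry)
    return entries


print(main_template([{"name": "A", "url": "/a"}, {"name": "B", "url": "/b"}]))
```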