Now that you've downloaded ParseHub, opened your first project and have an idea of how the ParseHub tools, templates and commands work from the previous three lessons, you're ready to try your first sample project!
Many websites share a similar layout: one page containing a list of results, and another page containing the details shown after you click into one of those results. This is the case for directories, listings, e-commerce sites, classifieds sites, real estate listings, dealerships, blogs, news sources, etc.
For this example, we'll scrape details from Society6 to show how you can extract information from this type of layout.
Tip! Be sure to save your project regularly by clicking on "Save" under the project options icon.
Extract every product name and url
1. Open the ParseHub client and click on New Project [No "New Project" button? Check out this troubleshooting guide]. We are going to scrape directly from the "Wall Tapestries" category and the URL we'll be using is https://society6.com/tapestries - enter this into the text box and click on Start project on this URL.
2. This will load the page in the page view area. In the sidebar you will find a main_template for this page layout and an "Empty selection1" command, which you can use to select information on the page. In this case, click on the very first product name. It should be highlighted in green once selected, and the other product names should be highlighted in yellow to indicate that ParseHub has identified them as similar elements.
3. Click on the second product name and notice how the number of elements next to "Select selection1" command has increased (it now shows "Select selection1 (42)") - if it hasn't yet selected all of the products on the page, click on another product name until all of the product names on the page are highlighted in green.
Our selection "Select selection1" is currently selecting 42 elements (the number may vary in your project), all of which will be highlighted in green on the page.
4. If you double click on the text "selection1", you can rename this to something more descriptive, such as "Product". Names may only contain letters, numbers and underscores (_).
Once you've selected more than one element, as we have done in step 3 above, ParseHub will automatically add a Begin new entry command (hidden under the list icon), which ensures that each selected product gets its own CSV row or its own scope in JSON.
ParseHub has also automatically added Extract commands for both the name and the url, which you can preview in the bottom pane:
You could always delete one or both of these commands and your preview data would update accordingly.
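To illustrate what "its own CSV row or its own scope in JSON" means, here is a minimal Python sketch. It assumes the JSON results group the entries under the selection name ("Product"), with one object per entry holding the `name` and `url` fields. The exact key layout, the `entries_to_csv` helper and the sample values are assumptions made for illustration, not actual ParseHub output.

```python
import csv
import io
import json

# Hypothetical results: key names follow the commands in this tutorial,
# the values are made-up placeholders.
sample_json = """
{
  "Product": [
    {"name": "Example Tapestry A", "url": "https://example.com/product-a"},
    {"name": "Example Tapestry B", "url": "https://example.com/product-b"}
  ]
}
"""

def entries_to_csv(data, list_key="Product"):
    """Write each entry under `list_key` as its own CSV row."""
    entries = data[list_key]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "url"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()

data = json.loads(sample_json)   # JSON view: one object (scope) per product
csv_text = entries_to_csv(data)  # CSV view: one row per product
print(csv_text)
```

Deleting one of the Extract commands would correspond to dropping that key from every entry, and therefore that column from the CSV.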
5. To scrape additional details about each product (price, for example), we can use the Relative Select command to relate each product to its corresponding price. Click on the + sign next to "Select Product" and choose the Relative Select command.
6. To use the Relative Select command, click on the orange highlight that is around one of the product names. An arrow will appear when you do this. Click on the price of the product using this arrow to relate each product on this page to its corresponding price. Rename the command to "Relative price".
7. You can repeat steps 5 and 6 for any other pieces of information you'd like to scrape from each product (e.g. the artist).
8. In order to have ParseHub scrape not only results on the first page but also results on other pages, you'll need to add pagination. In this case, click on the + sign next to "Select page" and choose a new Select command.
9. A new "Empty selection1" command will appear. Click on the "Next" button on the website, which should be highlighted in green, then double-click on the name "selection1" to rename it to "nextButton". It should show (1) element selected, which refers to the "Next" button.
10. In order to teach ParseHub how to click on the element we just selected, click on the + sign next to "Select & Extract nextButton" and choose a Click command.
11. This will bring up a pop-up asking what you would like to do once the "Next" button has been clicked. If you click on "Yes" when it asks whether this is a "next page" button, it will default to "Repeat the current template", since ParseHub should repeat everything we did on page 1 for the results on every subsequent page. You can also set how many more times you want ParseHub to repeat this template. To continue until it reaches the last page, keep it at 0 (repeats the template an unlimited number of times).
12. To recap what we've done so far, our project has the following commands:
Following these commands, ParseHub will:
- Select page: load and select the whole page
- Select Product: select all of the product names on the page
- Extract name: extract the name of the product into that product's entry
- Extract url: extract the url of the product into that product's entry
- Relative price: select all of the prices and connect them to their corresponding products
- Select nextButton: select the "next" button
- Click each nextButton item: click on each "next" button that "Select nextButton" is selecting
- and go to main_template: repeats the main_template after each click
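The recap above can also be read as a loop. The following Python sketch is only a mental model of that flow, not ParseHub's actual implementation: `PAGES` is a made-up stand-in for the site, and `run_template` and `max_repeats` are hypothetical names. It assumes, as in step 11, that a repeat setting of 0 means no limit.

```python
# Made-up stand-in for the site: each page lists (name, price) pairs
# and says whether a "Next" button is present.
PAGES = [
    {"products": [("Tapestry A", "$40"), ("Tapestry B", "$45")], "has_next": True},
    {"products": [("Tapestry C", "$50")], "has_next": True},
    {"products": [("Tapestry D", "$55")], "has_next": False},
]

def run_template(pages, max_repeats=0):
    """Mimic main_template: collect entries, then follow "Next".

    max_repeats=0 is treated as "no limit", mirroring step 11.
    """
    results = []
    page_index = 0
    repeats = 0
    while True:
        page = pages[page_index]
        # Select Product + Begin new entry + Relative price
        for name, price in page["products"]:
            results.append({"name": name, "price": price})
        # Select nextButton: stop when the button is absent
        if not page["has_next"]:
            break
        # Stop once the allowed number of repeats is used up
        if max_repeats and repeats >= max_repeats:
            break
        # Click nextButton and go to main_template
        page_index += 1
        repeats += 1
    return results

all_pages = run_template(PAGES)                 # walks every page
first_two = run_template(PAGES, max_repeats=1)  # page 1 plus one repeat
print(len(all_pages), len(first_two))
```

With the sample data, the unlimited run collects entries from all three pages, while `max_repeats=1` stops after the second page.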
Click into each product page to scrape more details
13. If we wanted to click into each product to get more information from that product's listing page, we could add a Click command to each product entry. To do this, click on the + sign next to "Begin new entry in Product" and choose Click.
14. This will open a pop-up asking whether what we are clicking on is a next page button. In this case, we will choose "No", create a new template, and name it something like "details", since this template applies to the layout of each product's individual page.
15. This will open up the page for the first product and our new "details" template with an "Empty selection1". We can click on the first piece of data we're interested in extracting - for example, the product's number of reviews - and rename "selection1" to "reviews".
16. For each new piece of data we wish to extract, we can click on the + sign next to "Select page", choose a Select command and click on that new item, which will result in multiple Select commands.
If you wish to move between templates, you can always open the page corresponding to that template (by going to the browser tab with that page, entering its URL or navigating to it in Browse mode) and double-click on the template name to open that template.
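To picture how data from the "details" template could end up alongside the listing data, the sketch below treats each product entry as a dictionary and adds the fields selected on the product page to it. This is an illustrative assumption about the shape of the results; `DETAIL_PAGES`, `scrape_details`, `click_into_each` and all sample values are hypothetical.

```python
# Hypothetical listing entries as produced by main_template.
products = [
    {"name": "Tapestry A", "url": "https://example.com/a"},
    {"name": "Tapestry B", "url": "https://example.com/b"},
]

# Made-up stand-in for what the "details" template finds on each page.
DETAIL_PAGES = {
    "https://example.com/a": {"reviews": "12"},
    "https://example.com/b": {"reviews": "3"},
}

def scrape_details(url):
    """Pretend to open a product page and run the "details" template."""
    return dict(DETAIL_PAGES.get(url, {}))

def click_into_each(entries):
    """For every entry, follow its url and attach the detail fields."""
    merged = []
    for entry in entries:
        combined = dict(entry)                          # keep name and url
        combined.update(scrape_details(entry["url"]))   # add e.g. reviews
        merged.append(combined)
    return merged

detailed = click_into_each(products)
print(detailed[0])
```

Each additional Select command added to the "details" template in step 16 would correspond to one more key in the dictionary returned by `scrape_details`.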