Lesson 4: Building your First Sample Project

Now that you've downloaded ParseHub, opened your first project, and learned from the previous three lessons how the ParseHub tools, templates and commands work, you're ready to try your first sample project!

Many websites share a similar layout: one page containing a list of results, and another page containing the details shown after you click into one of those results. This is the case for directories, listings, e-commerce sites, classifieds sites, real estate listings, dealerships, blogs, news sources and so on.

In this lesson, we'll scrape details from Yelp as an example of how you can scrape information from this type of layout.



Be sure to save your project regularly by clicking on "Save" under the project options icon.


Extract every restaurant name and URL

1. Open the ParseHub client and click on New Project. To keep this example simple, we'll go directly to the URL containing the search results for "Restaurants in Toronto" on Yelp, but it is also possible to start from the homepage and teach ParseHub how to input search criteria. The URL we'll be using is https://www.yelp.ca/search?find_desc=Restaurants&find_loc=Toronto%2C+ON&ns=1 - enter this into the text box and click on Start project on this URL.
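
The search URL in step 1 is an ordinary query string, so the same address can be assembled from its parts. The following is a minimal Python sketch and is not part of ParseHub: the helper name build_search_url is made up for illustration, and the ns=1 parameter is simply copied from the URL above without assuming what it does.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.yelp.ca/search"  # taken from the URL in step 1


def build_search_url(description, location):
    """Return a search URL shaped like the one in step 1 (illustrative helper)."""
    params = {
        "find_desc": description,  # what to search for
        "find_loc": location,      # where to search
        "ns": 1,                   # copied as-is from the lesson's URL; meaning not assumed
    }
    # urlencode percent-encodes the comma and turns the space into "+"
    return BASE_URL + "?" + urlencode(params)


print(build_search_url("Restaurants", "Toronto, ON"))
```

Changing the two arguments gives a starting URL for a different search, which you could paste into the New Project text box in the same way.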



2. This will load the page in the page view area. In the sidebar you will find a main_template for this page layout and an "Empty selection1" command, which you can use to select information on the page. Click on the very first restaurant name: it should be highlighted in green once you've selected it, and the other restaurant names should be highlighted in yellow to indicate that ParseHub has identified them as similar elements.



3. Click on the second restaurant name and notice how the number of elements next to the "Select selection1" command has increased (it now shows "Select selection1 (13)"). If ParseHub hasn't yet selected all of the restaurants on the page, click on another restaurant name until all of the restaurant names on the page are highlighted in green.



Our "Select selection1" command is currently selecting 13 elements (the number of elements on the page may differ in your project), all of which will be highlighted in green on the page.


4. If you double-click on the text "selection1", you can rename it to something more descriptive, such as "Restaurant". Names may only contain letters, numbers and underscores (_).
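
If you plan your selection names ahead of time, the naming rule in step 4 is easy to check yourself. The sketch below is only an illustration in Python: it assumes "letters" means the unaccented letters A to Z, and ParseHub's own validation may differ.

```python
import re

# Letters, numbers and underscores only, as stated in step 4.
# Assumption: "letters" is read here as plain A-Z / a-z.
NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def is_valid_name(name):
    """Illustrative check; not ParseHub's actual validation."""
    return NAME_PATTERN.fullmatch(name) is not None


for candidate in ["Restaurant", "phone_number", "next button", "price-range"]:
    print(candidate, is_valid_name(candidate))
```

Names with spaces or hyphens fail the check, which is why this lesson uses names such as "nextButton" and "phoneNumber".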


Once you've selected more than one element, as we did in step 3 above, ParseHub will automatically add a Begin new entry command, which ensures that each selected restaurant gets its own CSV row or its own scope in JSON.

ParseHub has also automatically added Extract commands for both the name and the url, which you can preview in the bottom pane.


You can delete one or both of these commands at any time, and your preview data will update accordingly.
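
To picture what "its own CSV row" and "its own scope in JSON" mean, here is a small Python sketch. The JSON shape is an assumption made for illustration (a "Restaurant" list holding one object per entry), and the names and URLs are made-up placeholders rather than real scraped data.

```python
import csv
import io
import json

# Hypothetical results: the key "Restaurant" matches the selection name from step 4.
results_json = """
{
  "Restaurant": [
    {"name": "Example Bistro", "url": "https://example.com/bistro"},
    {"name": "Sample Diner", "url": "https://example.com/diner"}
  ]
}
"""

entries = json.loads(results_json)["Restaurant"]  # one scope (dict) per restaurant


def entries_to_csv(rows):
    """Write one CSV row per entry, mirroring the 'own CSV row' behaviour."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "url"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(entries_to_csv(entries))
```

Each Extract command corresponds to one field here, so deleting the url command would be like dropping the "url" key and its CSV column.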


5. In order to have ParseHub scrape not only results on the first page but also results on other pages, you'll need to add pagination. In this case, click on the + sign next to "Select page" and choose a new Select command. 



6. A new "Empty selection1" command will appear. Click on the "Next" button on the website, which should be highlighted in green, then double-click on the name "selection1" to rename it to "nextButton". It should show (1) element selected, which refers to the "Next" button.



7. In order to teach ParseHub how to click on the element we just selected, click on the + sign next to "Select & Extract nextButton" and choose a Click command.



8. This will bring up a pop-up asking what you would like to do once the "Next" button has been clicked. In this case, choose "Go to Existing Template" and select "main_template" from the drop-down menu, since ParseHub should repeat everything we did on page 1 for the results on every subsequent page.



9. To recap, here are the commands our project has so far and what ParseHub will do when following each of them:

  • Select page: load and select the whole page
    • Select Restaurant: select all of the restaurant names on the page
      • Begin new entry in Restaurant: start a new entry for each restaurant
        • Extract name: extract the name of the restaurant into that restaurant's entry
        • Extract url: extract the url of the restaurant into that restaurant's entry
  • Select nextButton: select the "Next" button
    • Click and go to main_template: click the "Next" button and return to this template to repeat the above actions on the next page. 
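
The command tree above can also be read as a small program. The Python sketch below mimics it with made-up in-memory "pages" instead of a real website; it illustrates the control flow only and is not how ParseHub is implemented.

```python
# Made-up stand-in for a website: each page lists restaurants and may link to a next page.
PAGES = {
    "page1": {"restaurants": [("Example Bistro", "/biz/example-bistro")], "next": "page2"},
    "page2": {"restaurants": [("Sample Diner", "/biz/sample-diner")], "next": None},
}


def main_template(page_id, results):
    """Mimic the command tree: select restaurants, extract fields, then follow 'Next'."""
    page = PAGES[page_id]                      # Select page
    for name, url in page["restaurants"]:      # Select Restaurant
        entry = {}                             # Begin new entry in Restaurant
        entry["name"] = name                   # Extract name
        entry["url"] = url                     # Extract url
        results.append(entry)
    next_page = page["next"]                   # Select nextButton
    if next_page is not None:                  # no "Next" button on the last page, so stop
        main_template(next_page, results)      # Click and go to main_template
    return results


print(main_template("page1", []))
```

The call back into main_template is what "Go to Existing Template" expresses: the same commands run again on each new page until there is no "Next" button left to click.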


Extract additional restaurant information from the results page

10. If we want to extract more than just the name and URL for each restaurant, we can do so by using a Relative Select command, which associates one piece of data with another. Click on the + sign next to "Begin new entry in Restaurant" and choose a Relative Select command, which will add a "Relative selection1" to your commands.



11. First click on the main item, in this case the restaurant name (you can click on any one of the restaurants). As you move your cursor away from the restaurant name, you'll see an arrow stemming from that selection. Click on the item that you would like to extract, for example the phone number, and rename your "selection1" to "phoneNumber".



12. This should automatically draw arrows from all of the other restaurant names to their associated phone numbers. If that's not the case, click on a restaurant name that has no arrow and then click on its phone number to teach ParseHub to include that element as well.


13. You can repeat steps 10 - 12 to extract any other information that appears on this page, such as the address, number of reviews or types of cuisine. There are some useful tricks that you can use, such as zooming in and out to select elements or scraping ratings and reviews, but if you're having trouble extracting the data you need, you can always contact us.
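
The reason for using Relative Select in steps 10 to 12 is that each phone number is looked up relative to its own restaurant, so a listing without a phone number does not shift the values of the others. The Python sketch below shows the idea on made-up, simplified markup; the real page's structure will differ, and this is not ParseHub's internal logic.

```python
import xml.etree.ElementTree as ET

# Made-up, simplified markup: the second listing has no phone number.
SNIPPET = """
<ul>
  <li><a href="/biz/example-bistro">Example Bistro</a><span class="phone">555-0100</span></li>
  <li><a href="/biz/sample-diner">Sample Diner</a></li>
  <li><a href="/biz/test-cafe">Test Cafe</a><span class="phone">555-0199</span></li>
</ul>
"""


def extract_with_phone(markup):
    """Look up each phone number inside its own listing, not across the whole page."""
    entries = []
    for listing in ET.fromstring(markup).findall("li"):
        phone = listing.find("span[@class='phone']")  # searched inside this listing only
        entries.append({
            "name": listing.find("a").text,
            "phoneNumber": phone.text if phone is not None else None,
        })
    return entries


for entry in extract_with_phone(SNIPPET):
    print(entry)
```

Had the names and phone numbers been collected as two separate page-wide lists, the third restaurant's number would have lined up with the second restaurant.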


Click into each restaurant page to extract more details

14. If we wanted to click into each restaurant to get more information from that restaurant's listing page, we could add a Click command to each restaurant's entry. To do this, click on the + sign next to "Begin new entry in Restaurant" and choose Click.



15. This will open a pop-up asking what we want to do once we've clicked on the restaurant. In this case, we will choose "Create New Template" and call that template something like "details", since it will apply to the layout of each restaurant's individual page.



16. This will open up the page for the first restaurant and our new "details" template with an "Empty selection1". We can click on the first piece of data we're interested in extracting - for example, the restaurant's website - and rename "selection1" to "website".



17. For each new piece of data we wish to extract, we can click on the + sign next to "Select page", choose a Select command and click on that new item, which will result in multiple Select commands.



If you wish to move between templates, you can always open the page corresponding to that template (by going to the browser tab with that page, entering its URL or navigating to it in Browse mode) and double-click on the template name to open that template.
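
The two-template flow from steps 14 to 17 can be pictured with the Python sketch below. It uses made-up data in place of real pages and assumes that the fields gathered on a restaurant's own page end up in the same entry as its name and url; treat it as an illustration of the idea rather than a description of ParseHub's internals.

```python
# Made-up data standing in for the results page and each restaurant's own page.
RESULTS_PAGE = [
    {"name": "Example Bistro", "url": "/biz/example-bistro"},
    {"name": "Sample Diner", "url": "/biz/sample-diner"},
]
DETAIL_PAGES = {
    "/biz/example-bistro": {"website": "https://bistro.example.com"},
    "/biz/sample-diner": {"website": "https://diner.example.com"},
}


def details_template(url):
    """Stand-in for the 'details' template: one Select command per field on that page."""
    page = DETAIL_PAGES[url]
    return {"website": page["website"]}


def main_template():
    """Stand-in for main_template with a Click on each restaurant (step 14)."""
    entries = []
    for restaurant in RESULTS_PAGE:
        entry = dict(restaurant)                            # name and url from the results page
        entry.update(details_template(restaurant["url"]))   # fields gathered after the click
        entries.append(entry)
    return entries


print(main_template())
```

Adding another Select command to the "details" template, as in step 17, would correspond to returning one more key from details_template.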

