Many users are interested in scraping larger directories such as Yelp or the Yellow Pages. The tutorial below demonstrates how you can scrape directories such as Yelp using ParseHub.
Please note that you may need to enable IP Rotation to successfully scrape Yelp.
Extract every restaurant name and URL
1. Open the ParseHub client and click on New Project. To keep this example simple, we are going directly to the URL containing the search results for "Restaurants in Toronto" on Yelp, but it is possible to start from the homepage and teach ParseHub how to input search criteria. The URL we'll be using is https://www.yelp.ca/search?find_desc=Restaurants&find_loc=Toronto%2C+ON&ns=1 - enter this into the text box and click on Start project on this URL.
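The search URL in step 1 is just a base address plus three query parameters, so a starting URL for another search or city can be assembled the same way. The snippet below is a minimal sketch using Python's standard library; the function name `build_search_url` is made up for this illustration, and the assumption that other values for `find_desc` and `find_loc` behave like the Toronto example is not something this tutorial verifies.

```python
from urllib.parse import urlencode

# Base address taken from the example URL in step 1.
BASE_URL = "https://www.yelp.ca/search"


def build_search_url(description: str, location: str) -> str:
    """Build a search-results URL shaped like the one used in step 1.

    Hypothetical helper for illustration only; it simply URL-encodes
    the same three query parameters that appear in the example URL.
    """
    query = urlencode({"find_desc": description, "find_loc": location, "ns": 1})
    return f"{BASE_URL}?{query}"


url = build_search_url("Restaurants", "Toronto, ON")
print(url)
```

The resulting string can be pasted into the text box before clicking on Start project on this URL.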
2. This will load the page in the page view area. In the sidebar you will find a main_template for this page layout and an "Empty selection1" command, which you can use to select information on the page. Click on the very first restaurant name. It should turn green once selected, and the other restaurant names should be highlighted in yellow to indicate that ParseHub has identified them as similar elements.
3. Click on the second restaurant name and notice how the number of elements next to the "Select selection1" command has increased (it now shows "Select selection1 (30)"). If ParseHub hasn't yet selected all of the restaurants on the page, keep clicking on unselected restaurant names until all of them are highlighted in green.
Our selection "Select selection1" is currently selecting 30 elements in our project (the number of elements on the page may vary on your project), all of which will be highlighted in green on the page.
4. If you double click on the text "selection1", you can rename this to something more descriptive, such as "Restaurant". Names may only contain letters, numbers and underscores (_).
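If you plan your command names ahead of time, the naming rule in step 4 is easy to check. This is a small sketch rather than ParseHub's own validation: `is_valid_name` is a hypothetical helper, and it assumes "letters" means the ASCII letters A-Z and a-z.

```python
import re

# Letters, numbers and underscores only, as stated in step 4.
# Assumption: "letters" means ASCII letters.
_NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def is_valid_name(name: str) -> bool:
    """Return True if the whole name uses only allowed characters."""
    return _NAME_PATTERN.fullmatch(name) is not None


for candidate in ["Restaurant", "nextButton", "phone number", "price-range"]:
    print(candidate, is_valid_name(candidate))
```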
Once you've selected more than one element, as we have done in step 3 above, ParseHub will automatically add a Begin new entry command (hidden under a list icon), which ensures that each of the selected restaurants will be on its own CSV row or have its own scope in JSON.
ParseHub has also automatically added Extract commands for both the name and the url, which you can preview in the bottom pane.
You could always delete one or both of these commands and your preview data would update accordingly.
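To make the "own CSV row or own scope in JSON" idea concrete, here is a rough sketch of how one entry per restaurant maps to one row. The sample data is invented for illustration and only assumes that the JSON groups entries in a list named after the "Restaurant" selection; the column names in ParseHub's own CSV export may differ from the ones produced here.

```python
import csv
import io
import json

# Hypothetical sample: two made-up entries, one JSON scope per restaurant.
sample_json = """
{
  "Restaurant": [
    {"name": "Example Bistro", "url": "https://www.yelp.ca/biz/example-bistro"},
    {"name": "Sample Diner", "url": "https://www.yelp.ca/biz/sample-diner"}
  ]
}
"""


def entries_to_csv(data: dict, list_key: str = "Restaurant") -> str:
    """Write each entry under list_key as one CSV row."""
    entries = data.get(list_key, [])
    # Collect column names in first-seen order across all entries.
    fieldnames = []
    for entry in entries:
        for key in entry:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


csv_text = entries_to_csv(json.loads(sample_json))
print(csv_text)
```

Deleting one of the Extract commands corresponds to dropping that key from every entry, and therefore that column from every row.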
5. In order to have ParseHub scrape not only results on the first page, but also results on other pages, you'll need to add pagination. In this case, click on the + sign next to "Select page" and choose a new Select command.
6. A new "Empty selection1" command will appear. Click on the "Next" button on the website, which should highlight in green, and double-click on the name "selection1" to rename it to "nextButton". The command should now show (1) element selected, which is the "Next" button.
7. In order to teach ParseHub how to click on the element we just selected, click on the + sign next to "Select & Extract nextButton" and choose a Click command.
8. This will bring up a pop-up asking what you would like to do once the "Next" button has been clicked. Choose "Yes" when asked if this is a "next page" button. The action will default to "Repeat the Current Template", which is what we want: ParseHub should repeat everything we did on page 1 for every subsequent page of results. Leave the click set to repeat 0 times, which means ParseHub will keep repeating until it reaches the last page.
9. To recap what we've done so far, our project has the following commands:
Following these commands, ParseHub will:
- Select page: load and select the whole page
  - Select Restaurant: select all of the restaurant names on the page
    - Begin new entry in Restaurant: start a new entry for each restaurant
      - Extract name: extract the name of the restaurant into that restaurant's entry
      - Extract url: extract the url of the restaurant into that restaurant's entry
  - Select nextButton: select the "Next" button
    - Click each nextButton item: click each selection under "Select nextButton" one by one
      - and go to main_template: repeat the main template
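The control flow of these commands can be pictured as a simple loop: run the template on the current page, then click "Next" and run it again until there is no next page. The sketch below simulates that with an in-memory list instead of a real website. The page contents, the function name `run_main_template` and the `max_clicks` parameter are all invented for illustration and are not part of ParseHub.

```python
# Hypothetical stand-in for three pages of search results.
PAGES = [
    ["Example Bistro", "Sample Diner"],
    ["Placeholder Grill", "Demo Cafe"],
    ["Test Kitchen"],
]


def run_main_template(pages, max_clicks=0):
    """Simulate 'Select Restaurant' plus a repeating 'Next' click.

    max_clicks=0 mirrors leaving the click set to repeat 0 times,
    i.e. keep going until the last page is reached.
    """
    results = []
    page_index = 0
    clicks = 0
    while True:
        # Select Restaurant / Begin new entry / Extract name
        for name in pages[page_index]:
            results.append({"name": name})
        # Select nextButton: is there a next page at all?
        has_next = page_index + 1 < len(pages)
        if not has_next:
            break
        if max_clicks and clicks >= max_clicks:
            break
        # Click nextButton and go to main_template
        clicks += 1
        page_index += 1
    return results


all_entries = run_main_template(PAGES)
print(len(all_entries))
```

Passing a non-zero `max_clicks` in this sketch stops early, which is a handy way to think about limiting a test run to the first few pages.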
Extract additional restaurant information from the results page
10. If we wanted to extract more than just the name and URL for each restaurant, we could do so with a Relative Select command, which associates additional data with each element that is already selected. Click on the + sign next to "Select Restaurant" and choose a Relative Select command, which will add a "Relative selection1" to your commands.
11. First click on the main item, in this case the restaurant name (you can click on any one of the restaurants). As you move your cursor away from the restaurant name, you'll see an arrow stemming from that selection. Click on the item that you would like to extract, for example the phone number, and rename your "selection1" to "phoneNumber".
12. This should automatically include arrows from all other restaurant names to each of their associated phone numbers, but if that's not the case, click on a restaurant name which isn't doing this and then click on the phone number to teach ParseHub to include that element as well.
13. You can repeat steps 10 - 12 to extract any other information that appears on this page, such as the address, number of reviews or types of cuisine. There are some useful tricks that you can use, such as zooming in and out to select elements or scraping ratings and reviews. If you're having trouble extracting the data you need, you can always contact us.
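Because a Relative Select attaches its value to the restaurant it was drawn from, each extra field ends up inside that restaurant's own entry. The sketch below shows one way to read such entries. The sample data is invented, and it assumes that a listing with no phone number simply has no "phoneNumber" key, which you should confirm against your own results.

```python
# Hypothetical entries after adding the "phoneNumber" Relative Select.
entries = [
    {"name": "Example Bistro", "phoneNumber": "(416) 555-0100"},
    {"name": "Sample Diner"},  # assumed: no phone number shown on the page
]


def phone_directory(restaurants, missing="n/a"):
    """Map each restaurant name to its phone number, if one was extracted."""
    return {r["name"]: r.get("phoneNumber", missing) for r in restaurants}


directory = phone_directory(entries)
print(directory)
```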
Click into each restaurant page to extract more details
14. If we wanted to click into each restaurant to get more information from that restaurant's listing page, we could add a Click command to each restaurant's entry. To do this, click on the + sign next to "Select Restaurant" and choose Click.
15. This will open a pop-up asking what we want to do once we've clicked on the restaurant. Select "No" when asked if this is a "next page" button. The action should default to "Create New Template"; we can call the new template something like "details", since it will apply to the layout of each restaurant's individual page.
16. This will open up the page for the first restaurant and our new "details" template with an "Empty selection1". We can click on the first piece of data we're interested in extracting - for example, the restaurant's website - and rename "selection1" to "website".
17. For each new piece of data we wish to extract, we can click on the + sign next to "Select page", choose a Select command and click on that new item, which will result in multiple Select commands.
If you wish to move between templates, you can always open the page corresponding to that template (by going to the browser tab with that page, entering it in the URL or navigating to it in Browse mode) and double-click on the template name to open that template.