In this video tutorial, I'm going to show you how you can scrape data from a directory-type website by using ParseHub.
Creating your project
To begin, open your ParseHub client and click on "New Project". From here, you can enter the URL of the website you would like to scrape data from. For this example, we'll be using the Yellow Pages, which you can find at www.yellowpages.com. We're going to click on the option to start a new project on this URL.
The ParseHub tool
When the page loads, you'll notice that there are three main sections on the ParseHub tool:
- On the left hand side is where you have your commands and your settings.
- In the middle is the interactive view of the website.
- At the bottom is where you'll be able to preview your data in either CSV or JSON.
Your project will automatically be in Select mode, with an empty Select command already available for you. If this is not the case, you can always click on the + sign and choose a Select command.
The Yellow Pages homepage
For this page, we want to teach ParseHub how to input the business that we're looking for, how to input the city and then how to click on the search button.
Using our Select command, the first thing that we'll do is click into the first box. You'll notice that by default ParseHub has already identified that it's an input field and added an Input command. In this Input command, we're going to type in the word that we want to search for. In this case, we're searching for "web developers". As you can see, ParseHub is already typing "web developers" into the box. The command is named selection1 by default. However, we can change it to something more descriptive such as "business".
For the location we'll do something similar. From where it says "Select page", we'll click on the + sign and choose Select. We'll select the input box for the location, the Input command will automatically appear for us, and from here we can type "New York City, NY". We can also rename our selection1 to something more descriptive such as "location".
Finally, we'll choose a third Select command and we'll click on the search button. We can rename this search button to something more descriptive such as "search_button". However, in this case, because it's not an input field, what we want to do is teach ParseHub how to click on it. To do so, we'll click on the + sign that appears next to "Select & Extract search_button" and we'll choose a Click command.
The Click command will automatically load a pop-up such as this one which asks you what you want to do once the button has been clicked. In this case we're going to create a new template for the new page and we can call this something such as "results".
The Yellow Pages results page
Once we click on "Create New Template", the results page will load and the new template that we just created will appear on the left hand side. You can see the results template in the commands area.
On this page, we want ParseHub to select the name, the address and the phone number. To select the name, we'll click on the + button next to "Select page", choose Select and then click on the name of the first business. When I click on it, the other businesses will be highlighted in yellow to show that ParseHub has identified them as the same type of element. If I click on the second one, the third one should also be highlighted in green. If this is not the case, you can click on it to teach ParseHub that it is a similar element too. As you scroll down the page, all of the titles should now be highlighted in green.
At the bottom of the page you'll see that we now have a preview in JSON of our first selection, including the name and the URL. If you'd like to see this in CSV or Excel, you can click on the corresponding button. This shows the name and the URL of our first selection, each in its own column. If you want to see more data, you can click on the "See more data" button.
You'll notice that our Select command automatically created a Begin new entry command. Each one of these new entries creates a new row in your CSV file. ParseHub has also automatically added Extract commands for the name and the URL. If we're not interested in extracting the URL, we can click on the X sign to remove that command. As you can see below, we now only have the name. We can also rename selection1 to something more descriptive. In this case, we'll rename it to "business".
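To make the link between entries and CSV rows concrete, here is a minimal Python sketch. It assumes the JSON preview is an object whose "business" key holds a list of entries; that shape and the sample names are placeholders for illustration, not output copied from ParseHub.

```python
import csv
import io
import json

# Hypothetical JSON preview: the key "business" matches the renamed selection,
# and the entries are made-up placeholders, not real scraped data.
preview_json = """
{
  "business": [
    {"name": "Example Web Studio"},
    {"name": "Sample Dev Shop"}
  ]
}
"""

def entries_to_csv(json_text, list_key):
    """Write one CSV row per entry found under list_key."""
    entries = json.loads(json_text)[list_key]
    # Collect column names in first-seen order across all entries.
    columns = []
    for entry in entries:
        for field in entry:
            if field not in columns:
                columns.append(field)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()

csv_text = entries_to_csv(preview_json, "business")
print(csv_text)
```

Each object in the list becomes one row, and each extracted field becomes one column, which is roughly what the CSV preview shows.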
To choose the address we're going to use what's called a Relative Select command. Relative Select commands allow you to associate data; in this case we're associating the name with the address. To do this, click on the + sign that appears next to "Begin new entry in business", choose the Relative Select command, click on the name and then click on the address below it. ParseHub should now automatically do this for every one of your results. You can also view the addresses below in the new column that's been created. We can once again rename selection1 to something more descriptive such as "address".
Repeat the same process for the phone number, which is to click on the + sign next to "Begin new entry in business", choose a Relative Select command, click on the name of the business and click on the phone number. Once again, ParseHub should automatically do this for all of the phone numbers. However, if this is not the case, you can re-train ParseHub by clicking on any missing ones and by doing so it should be able to identify the rest of the phone numbers on the page.
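One way to see why the address and phone number are selected relative to the name, rather than as independent selections, is the toy Python model below. It is only an illustration of the idea under made-up data, not a description of how ParseHub works internally.

```python
# Toy model of a results page: each listing is one block, and the second
# listing has no phone number. All values are made-up placeholders.
listings = [
    {"name": "Example Web Studio", "phone": "(555) 010-0001"},
    {"name": "Sample Dev Shop"},
    {"name": "Placeholder Labs", "phone": "(555) 010-0003"},
]

# Selecting names and phones independently gives two flat lists...
names = [item["name"] for item in listings]
phones = [item["phone"] for item in listings if "phone" in item]

# ...and pairing them by position attaches the wrong phone to the second name.
paired_by_position = list(zip(names, phones))

# Selecting the phone relative to each name keeps the pair inside one entry,
# so a missing phone stays empty instead of shifting the rest.
paired_relative = [(item["name"], item.get("phone", "")) for item in listings]

print(paired_by_position)
print(paired_relative)
```

In the positional pairing the second business picks up the third business's phone number, while the relative pairing leaves it blank and keeps every other row aligned.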
Clicking through to the details page for each result
We now have three columns containing our name, our address and our phone number. However, if you wanted to extract information from within the result, that's possible as well. To do so, we're going to use a Click command to click into the result. To add a Click command, click on the + sign next to "Begin new entry in business" and choose a Click command. Once again, this pop-up will appear asking us what to do with our Click command and in this case we'll create another new template and this will be called "details". When we click on "Create New Template", the details page will load as well as a new template on the left hand side.
Selecting data from the details page
On this page, I can use multiple Select commands to choose the data that I'm interested in. For instance, I could go to "Select page", click on the + sign and choose Select to pick the number of years that the business has been operating. I can rename that selection to "years_in_business" to make it more descriptive.
I may also be interested in extracting the business' description. To do so, I'll click on the + sign next to "Select page", choose the Select command, click on the description and rename that to "description". As you can see, we now have one column for the years in business and one for the description.
Adding pagination to the results template
There's one last thing that we need to do to make our project complete. This is back on the previous tab where we had our results.
If we scroll all the way to the bottom of the page, you'll notice that this is not the only page of results; there are several more. What we want to do is teach ParseHub to click on the next button every time it reaches the end of one page of results.
To do this we'll click on the + sign that appears next to "Select page", choose a Select command, click on the "Next" button, rename this to something more descriptive such as "next" and then choose a Click command next to "Select & Extract next" to add a click.
On this page we want to continue executing the results template. This is because the second page of results has exactly the same layout as the first, so we want to execute everything in exactly the same way as we did on the first page. All you need to do now is click on "Go To Existing Template" and our project will be finalized.
We've now created three templates: our main_template, which searches for web developers in New York City; our results template, which goes through each result and extracts the name, address and phone number; and our details template, which extracts the years in business and a description for each of these businesses. The results template then goes on to the next page and continues to do the same thing until we have reached all of the results.
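The way the three templates hand over to each other can be sketched as a small Python simulation. Everything here is a hypothetical stand-in: the page data, the function names and the max_pages parameter are assumptions for illustration, and the real project runs inside ParseHub rather than as code.

```python
# Hypothetical stand-in for the site: two result pages and a details lookup
# per business. None of this is real Yellow Pages data.
RESULT_PAGES = [
    {"names": ["Example Web Studio", "Sample Dev Shop"], "has_next": True},
    {"names": ["Placeholder Labs"], "has_next": False},
]
DETAILS = {
    "Example Web Studio": {"years_in_business": "12", "description": "Placeholder A"},
    "Sample Dev Shop": {"years_in_business": "3", "description": "Placeholder B"},
    "Placeholder Labs": {"years_in_business": "7", "description": "Placeholder C"},
}

def details_template(name):
    """Extract the fields chosen on the details page."""
    return dict(DETAILS[name])

def results_template(page_index, max_pages):
    """Extract every entry on a page, then follow 'next' back into this template."""
    entries = []
    while page_index < len(RESULT_PAGES) and page_index < max_pages:
        page = RESULT_PAGES[page_index]
        for name in page["names"]:
            entry = {"name": name}
            entry.update(details_template(name))  # click into the result
            entries.append(entry)
        if not page["has_next"]:
            break
        page_index += 1  # "Go To Existing Template": rerun on the next page
    return entries

def main_template(max_pages=5):
    """Fill in the search inputs, click search and hand over to results."""
    search = {"business": "web developers", "location": "New York City, NY"}
    return {"search": search, "business": results_template(0, max_pages)}

run = main_template()
print(len(run["business"]), "entries collected")
```

The point of the sketch is the control flow: the click on "next" re-enters the same results logic instead of defining a new template, which is why the results pages all get handled identically.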
Testing our project
If we want to test our project, we can simply click on "Get Data" and do a test run. Here you have the option to either click on the first button on the left to go through the project step by step, or click on the play button to run through the whole project. If we click on "Play", you'll see that ParseHub starts to do everything automatically: searching for web developers in New York City, going into the results page, finding all the results (which are extracted as JSON below) and repeating this for up to five pages.
The complete run will contain all of the data, but the test run only covers five pages. You can preview this data in JSON just below, so we can see all the different businesses and the information for each one.
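If you copy the JSON from a test run, a short script can help you check it before starting a complete run. The sketch below is a hedged example: it assumes the output is an object with a "business" list, that the phone selection was named "phone" (the tutorial does not name it), and that details fields sit alongside the name in each entry. The sample values are made up.

```python
import json
from collections import Counter

# Hypothetical test-run output; field names other than those renamed in the
# tutorial (for example "phone") are assumptions, and all values are made up.
sample_output = """
{
  "business": [
    {"name": "Example Web Studio", "address": "1 Placeholder St",
     "phone": "(555) 010-0001", "years_in_business": "12",
     "description": "Placeholder description"},
    {"name": "Sample Dev Shop", "address": "2 Placeholder Ave",
     "description": "Another placeholder"}
  ]
}
"""

EXPECTED_FIELDS = ["name", "address", "phone", "years_in_business", "description"]

def summarize(json_text, list_key="business"):
    """Return the entry count and how many entries lack each expected field."""
    entries = json.loads(json_text).get(list_key, [])
    missing = Counter()
    for entry in entries:
        for field in EXPECTED_FIELDS:
            if not entry.get(field):
                missing[field] += 1
    return len(entries), dict(missing)

total, missing_counts = summarize(sample_output)
print(f"{total} entries; missing fields: {missing_counts}")
```

A field that is missing from many entries may simply be absent on the site, or it may mean a selection needs re-training on the results or details page.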
Need more help?
In this tutorial, I have shown you how you can scrape data from a directory-type website such as the Yellow Pages using ParseHub.
If you have any questions at all about your project, you can always contact us at email@example.com. We're always happy to help.
Download this Project
You can download the project that we just created here: Yellow_Pages_-_Scraping_Directories.phj
To open the project in your account, open ParseHub, go to My Projects, click on Import Project and select the file. Note that this project will work on the Yellow Pages only. However, you can customize the project to include your own keyword and location, extract different data from the results page and add more commands.
Please also see this written tutorial on scraping directories with the Realtor website as an example.