For some websites, you may want to input two lists of data into two search fields to scrape the results. For example, you may have a list of keywords that you want to cross-reference with a list of locations:
- Keywords: Plumbers, Locksmiths, Dentists, Doctors...
- Locations: "Buffalo, NY","Portland, OH","Miami, FL"
Combining the two lists, your searches would be:
- "Buffalo, NY" Plumbers
- "Buffalo, NY" Locksmiths
- "Buffalo, NY" Dentists
- "Buffalo, NY" Doctors
- "Portland, OH" Plumbers
- "Portland, OH" Locksmiths
- ... and so on, for every pairing of a location with a keyword.
You could perform these searches on the Yellow Pages website, for example, which is what we will be doing in this tutorial.
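To see how many searches two lists produce, the combinations can be sketched in a few lines of Python. This is only an illustration of the pairing, not something ParseHub runs; the variable names are our own.

```python
from itertools import product

# Example lists copied from the tutorial text.
keywords = ["Plumbers", "Locksmiths", "Dentists", "Doctors"]
locations = ["Buffalo, NY", "Portland, OH", "Miami, FL"]

# product() pairs every location with every keyword,
# changing the keyword fastest, as in the list above.
searches = [
    f'"{location}" {keyword}'
    for location, keyword in product(locations, keywords)
]

for search in searches:
    print(search)
```

With 3 locations and 4 keywords this gives 12 searches, so the number of runs grows quickly as either list gets longer.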
Creating Your Lists
This is the format of a JSON list that you can use in ParseHub:
{
"keywords":["Plumbers","Locksmiths","Dentists","Doctors"]
}
When you combine the two, your lists should be in this format:
{
"keywords":["Plumbers","Locksmiths","Dentists","Doctors"],
"locations":["Buffalo, NY","Portland, OH","Miami, FL"]
}
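Before pasting the lists into ParseHub, it can help to confirm that the text is valid JSON. The sketch below is one way to do that in Python; the function name check_start_value is hypothetical and the check itself is our own, not a ParseHub feature.

```python
import json

# The combined lists from the example above.
start_value_text = """
{
"keywords":["Plumbers","Locksmiths","Dentists","Doctors"],
"locations":["Buffalo, NY","Portland, OH","Miami, FL"]
}
"""

def check_start_value(text, required_lists=("keywords", "locations")):
    """Parse text as JSON and confirm each required name holds a list of strings."""
    data = json.loads(text)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("the top level must be a JSON object")
    for name in required_lists:
        values = data.get(name)
        if not isinstance(values, list) or not all(isinstance(v, str) for v in values):
            raise ValueError(f"{name!r} must be a list of strings")
    return data

data = check_start_value(start_value_text)
print(len(data["keywords"]), "keywords and", len(data["locations"]), "locations")
```

A missing comma or an unclosed quotation mark will make json.loads raise an error that points at the problem position.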
We recommend using a tool like Mr. Data Converter to convert a list of words into a JSON list.
In the "Input" section, type your list name (e.g. "keywords") into the first row, followed by each list item on a separate row. Change the Output to "JSON - Column Arrays" and copy the JSON list from the Output field.
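If you prefer a script to an online converter, the same conversion can be sketched in Python. This assumes the same layout described above (list name in the first row, one item per row after it); the helper name column_to_json_list is hypothetical.

```python
import json

def column_to_json_list(rows):
    """Treat the first non-empty row as the list name and the rest as its items."""
    name, *items = [row.strip() for row in rows if row.strip()]
    return {name: items}

keyword_rows = ["keywords", "Plumbers", "Locksmiths", "Dentists", "Doctors"]
location_rows = ["locations", "Buffalo, NY", "Portland, OH", "Miami, FL"]

# Merge both named lists into one JSON object.
start_value = {
    **column_to_json_list(keyword_rows),
    **column_to_json_list(location_rows),
}

print(json.dumps(start_value, indent=2))
```

The printed text has the same two-list shape as the combined example shown earlier.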
Building Your Project
1. Open your ParseHub client, click on "New Project" and input the URL you would like to scrape data from. For this example we will be using the Yellow Pages; you can type https://www.yellowpages.com/ into your project if you would like to follow along. Click on "Start project on this URL".
2. Click on the gear icon at the top left corner and choose "Settings" from the menu.
3. In the "Starting Value" box, paste the JSON lists that you created in the first part of this tutorial. Both lists will appear in the preview section at the bottom of ParseHub.
4. Click the "Back to Commands" option on your project. Then click on the "+" button next to "Select page" and click on the "Advanced" arrow to show more tools.
5. Choose the "Loop" command. The loop command iterates through a list and is good for repeating commands multiple times.
6. In the text boxes, change "item" to "keyword" and type in "keywords" in the list text box (without the quotation marks).
- You can change "item" to anything you want. The item represents one keyword in your list of keywords.
- Make sure the list name is the same as your list name in JSON. If you typed in {"keywords":....} make sure to keep the text in the text box as keywords (this is case sensitive).
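As a rough mental model, the Loop command behaves like a for loop over the named list in your starting value. The Python sketch below is only an analogy, not ParseHub's internals; loop_items is a hypothetical helper that shows why the list name has to match exactly.

```python
start_value = {
    "keywords": ["Plumbers", "Locksmiths", "Dentists", "Doctors"],
}

def loop_items(values, list_name):
    """Return the list stored under list_name; the lookup is case sensitive."""
    if list_name not in values:
        raise KeyError(f"no list named {list_name!r} in the starting value")
    return values[list_name]

visited = []
# "keyword" plays the role of the item name, "keywords" the list name.
for keyword in loop_items(start_value, "keywords"):
    visited.append(keyword)

print(visited)
```

In this sketch, asking for "Keywords" with a capital K fails, because that name does not appear in the JSON.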
7. Click on the "+" button next to "For each keyword in keywords", click on the "Advanced" arrow to show all the commands and select a "Begin New Entry" command. Now the results for each one of the keywords will go into a separate row in Excel and a separate scope in JSON. If you don't use this command anywhere in your project, the results scraped for each keyword will overwrite one another.
8. Rename the "list1" name that appears next to "Begin new entry" to something else like "jobs". Make sure not to name the list command the same as the list that holds your keywords. The list command should have a unique name.
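The effect of "Begin New Entry" can be pictured with a small Python analogy. This is an assumption about how to think of it, not a description of ParseHub's implementation; the field name "keyword" is made up for the illustration.

```python
keywords = ["Plumbers", "Locksmiths", "Dentists", "Doctors"]

# Without a new entry per keyword, every pass writes to the same record,
# so only the last keyword survives.
single_record = {}
for keyword in keywords:
    single_record["keyword"] = keyword

# With a new entry per keyword, each pass gets its own record in the "jobs" list.
jobs = []
for keyword in keywords:
    entry = {}          # begin new entry in jobs
    jobs.append(entry)
    entry["keyword"] = keyword

print(single_record)
print(len(jobs), "entries")
```

The first loop ends with one record holding the last keyword, while the second keeps one record per keyword.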
9. Click on the "+" button next to "Begin new entry in jobs" (or "Begin new entry in list1" if you did not rename it in the previous step) and choose a Select command.
10. Click on the left-hand search box on the Yellow Pages (which says "Search by business or keyword"). ParseHub will automatically create an Input command for you. Instead of typing the actual keyword, just type in "keyword". This will tell ParseHub to add in the current keyword in your list of keywords. Also, ensure that you select "expression" in the "Input type" drop-down menu so that ParseHub will read the text as an expression instead of just plain text.
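The difference between the "expression" and plain text input types can be sketched as follows. This is a simplified analogy that treats an expression as a plain variable lookup; resolve_input and its arguments are hypothetical names, not part of ParseHub.

```python
def resolve_input(value, input_type, scope):
    """Return what would be typed into the search box in this simplified model."""
    if input_type == "expression":
        # Look the name up among the current loop variables.
        return scope[value]
    # Plain text is typed exactly as written.
    return value

scope = {"keyword": "Plumbers"}  # current item of the keywords loop

as_expression = resolve_input("keyword", "expression", scope)
as_text = resolve_input("keyword", "text", scope)

print(as_expression)
print(as_text)
```

Under this model, leaving the input type as plain text would search for the literal word "keyword" on every pass of the loop.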
11. [Optional] Now you have the option to extract each keyword in your list along with the related results. This step will add a new column for the keywords that you provided as the starting value.
If you would like to do this, click on the "+" button next to "Begin new entry in jobs" (or "Begin new entry in list1" if you did not rename the command), click on the "Advanced" arrow to show all the commands and select an Extract command. Instead of $location.href enter "keyword" and rename the Extract command to "currentkeyword". In your final results, you will have a column containing the keyword associated with each result.
12. To nest our loops, we will now repeat steps 6-11 for our locations list, placing the locations loop inside the keywords loop.
Click on the "+" button next to "Begin new entry in jobs" (or "Begin new entry in list1" if you did not rename the command), click on the "Advanced" arrow to show all the commands and select another "Loop" command. In the text boxes, change "item" to "location" and type in "locations" in the list text box (without the quotation marks).
- You can change "item" to anything you want. The item represents one location in your list of locations.
- Make sure the list name is the same as your list name in JSON. If you typed in {"locations":....} make sure to keep the text in the text box as locations (this is case sensitive).
Click on the "+" button next to "For each location in locations", click on the "Advanced" arrow to show all the commands and select a "Begin New Entry" command. Now the results for each one of the locations will go into a separate row in Excel and a separate scope in JSON. Rename the "list2" name that appears next to "Begin new entry" to something else like "cities". Make sure not to name the list command the same as the list that holds your locations. The list command should have a unique name.
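The nested structure built so far can be pictured as two for loops, one inside the other, each starting its own entry. The Python below is an analogy under that assumption; the field name "location" is made up, and "jobs" and "cities" are the example list names used in this tutorial.

```python
keywords = ["Plumbers", "Locksmiths", "Dentists", "Doctors"]
locations = ["Buffalo, NY", "Portland, OH", "Miami, FL"]

jobs = []
for keyword in keywords:                      # For each keyword in keywords
    job_entry = {"keyword": keyword, "cities": []}
    jobs.append(job_entry)                    # Begin new entry in jobs
    for location in locations:                # For each location in locations
        city_entry = {"location": location}
        job_entry["cities"].append(city_entry)  # Begin new entry in cities

total = sum(len(job["cities"]) for job in jobs)
print(total, "keyword and location pairs")
```

Because the inner list has its own name, each keyword keeps a separate set of city entries instead of sharing one.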
13. Click on the "+" button next to "Begin new entry in cities" (or "Begin new entry in list2" if you did not rename it in the previous step) and choose a Select command. Click on the right-hand search box on the Yellow Pages (which should have a current location such as "Fort Lauderdale, FL"). ParseHub will automatically create an Input command for you. Instead of typing the actual location, just type in "location". This will tell ParseHub to add the current location in your list of locations. Also, ensure that you select "expression" in the "Input type" drop-down menu so that ParseHub will read the text as an expression instead of just plain text.
14. [Optional] Now you have the option to extract each location in your list along with the related results. This step will add a new column for the locations that you provided as the starting value.
If you would like to do this, click on the "+" button next to "Begin new entry in cities" (or "Begin new entry in list2" if you did not rename the command), click on the "Advanced" arrow to show all the commands and select an Extract command. Instead of $location.href enter "location" and rename the Extract command to "currentlocation". In your final results, you will have a column containing the location associated with each result.
15. To select and click on the search button, click on the "+" button next to "Begin new entry in cities" (or "Begin new entry in list2" if you did not rename the command) and choose a Select command. Select the search button on the Yellow Pages website.
16. Click on the "+" button next to "Select & Extract selection3" and choose a Click command. The Click command lets you click on anything on the page to open dropdowns, tabs, etc., or to click on buttons that will take you to another page.
17. Every time you add a Click command to your template, ParseHub will ask whether it is a "Next page" button. Since it is not, choose "No". Clicking on the search button loads a new page of results, so you need a new template with its own set of instructions: select the "Create New Template" option, which you can call "results". Remember, you should use a new template for every page that looks different. Click on "Create New Template".
18. On the new template you can go ahead and select and extract any of the results that you want to scrape. ParseHub will repeat the instruction of searching for the keyword and scraping results for all of the keywords and locations you added to the "Starting value" in the project settings.
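Once a run finishes, you may want one flat row per result with its keyword and location attached. The sketch below shows one way to do that in Python. The shape of sample_results is an assumption made for illustration (including the "results" and "name" fields and both Extract command names), and the business names are placeholders rather than scraped data; adjust the keys to match your own project.

```python
# Hypothetical sample of nested output; not real scraped data.
sample_results = {
    "jobs": [
        {
            "currentkeyword": "Plumbers",
            "cities": [
                {
                    "currentlocation": "Buffalo, NY",
                    "results": [
                        {"name": "Example Business A"},
                        {"name": "Example Business B"},
                    ],
                },
                {
                    "currentlocation": "Miami, FL",
                    "results": [{"name": "Example Business C"}],
                },
            ],
        },
    ]
}

def flatten(results):
    """Turn the nested jobs/cities/results structure into a list of flat rows."""
    rows = []
    for job in results.get("jobs", []):
        for city in job.get("cities", []):
            for result in city.get("results", []):
                rows.append({
                    "keyword": job.get("currentkeyword"),
                    "location": city.get("currentlocation"),
                    "name": result.get("name"),
                })
    return rows

rows = flatten(sample_results)
for row in rows:
    print(row)
```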
Download this Project
You can download the project that we just created here: Yellow_Pages_-_Nested_Loops.phj
To open the project in your account, open ParseHub, go to My Projects, click on Import Project and select the file. Note that this project will work on the Yellow Pages only. However, you can customize the project to include your own keyword lists, extract different data from the results page and add more commands. This video tutorial has an example of how to extract more information from the Yellow Pages.