There are many cases in which you need a list of keywords to scrape your target website. For example:
- Adding a list of keywords to be searched through one by one in a search box.
- Scraping data from a series of pages with similar URLs (e.g. example.com/123, example.com/456, example.com/789... etc.).
One option is to use JSON, as outlined in the "Enter multiple keywords into a search box" and "Navigate to a list of URLs (using an ID value in the URL)" tutorials. Alternatively, you can keep a list of keywords in Google Sheets, which ParseHub can scrape and then use, as explained in this tutorial.
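To make the two use cases above concrete, here is a minimal Python sketch that turns a list of IDs into a series of similar URLs and packs a list of keywords into a JSON object. The function names, the sample keywords, and the "keywords" key are assumptions made for illustration; check the JSON tutorials for the exact shape ParseHub expects.

```python
import json


def build_urls(base_url, ids):
    """Return one URL per ID, e.g. https://example.com/123."""
    return [f"{base_url.rstrip('/')}/{page_id}" for page_id in ids]


def keywords_to_json(keywords):
    """Serialize keywords as a JSON object holding a single list.

    The key name "keywords" is an assumption chosen for this sketch.
    """
    return json.dumps({"keywords": list(keywords)})


urls = build_urls("https://example.com", [123, 456, 789])
payload = keywords_to_json(["laptop", "headphones"])

print(urls)
print(payload)
```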
This tutorial will demonstrate how to create a list of keywords on Google Sheets and then search for each keyword on Amazon's search bar. You can follow along using Amazon or you can apply the instructions to your own website and search bar.
Note: You might not be able to get all the data you need, as ParseHub's Free tier only offers a maximum of 200 pages per run. If you need to scrape more, consider upgrading to one of our premium or enterprise plans!
If you only need to input a single keyword into a search box, follow the instructions in this tutorial.
Creating a Google Sheet
1) Log in to your Google account and choose Drive from the menu in the top-right corner.
2) Click on NEW and choose "Google Sheets".
3) Rename the Google sheet to whatever you like, and enter your list of keywords on this sheet.
4) To make this sheet accessible to the public, and therefore to ParseHub, you need to publish it. Click "File" in the menu and choose "Publish to the Web".
5) A pop-up will appear that lets you choose the publish options and publish the sheet.
6) Once you confirm, Google generates a URL for your sheet, which you can use in the ParseHub project. Copy this link and save it for the next section.
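If you want to check what the published sheet contains before building the project, the following sketch may help. It assumes you have the sheet's contents as CSV text (for example, by choosing the comma-separated values format when publishing) and that the keywords sit in the first column. The function name and sample data are hypothetical, and nothing is downloaded here.

```python
import csv
import io


def parse_keywords(csv_text):
    """Return the non-empty values found in the first column of the CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    return [row[0].strip() for row in reader if row and row[0].strip()]


# Sample data standing in for a published sheet with one keyword per row.
sample = "laptop\nheadphones\n\nusb cable\n"
keywords = parse_keywords(sample)

print(keywords)
```

Blank rows are skipped, so stray empty cells in the sheet do not turn into empty searches.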
Building the project on ParseHub
1) Choose "Create a new project" from the toolbar and enter the Google sheet URL that you copied from the previous section.
2) If you are not already in select mode, click on the + button to the right of the "Select page" command and choose the "Select" tool from the tool menu. Select the first keyword on the page.
3) "Select" the second keyword from the list (highlighted in yellow) to select all the keywords. If this does not select all the keywords, click on more yellow-highlighted keywords to train ParseHub on more selections.
Once you've selected all the keywords, ParseHub will create Begin New Entry, Extract name and Extract url commands automatically. You can keep the Extract name command in order to additionally extract the text of each keyword associated with the data in the final results. You can also change the selection text to "keywords" and the extract text to "keyword" (note that this extract text is what we will use in steps 8 to 12).
4) Click on the + button next to "Select page", click on "Advanced" to show more commands and select the "Go To Template" command. Choose the "Go to URL" radio option and enter the URL of the page where you would like to use your keywords (in this case 'http://www.amazon.com/'; don't forget the quotation marks, either single or double). Create a new template for your keywords (in this case we will create the template "keyword_search") and click on "Create New Template".
5) Your main_template template should look like this:
6) First, we want to create a loop that iterates through our list of keywords. Loops are useful for repeating commands multiple times. Next to "Select page", click on the + button, click the "Advanced" arrow to show more tools and choose the "Loop" tool.
7) In the text boxes, leave "item" as it is and type "keywords" into the list text box (without the quotation marks).
- You can change "item" to anything you want. The item represents one keyword in your list of keywords.
- Make sure the list name is exactly the same as your list name in step 3.
8) Click on the + button on the right side of the "For each item in keywords" command. Click on the "Advanced" arrow to show all of the tools. From the tool box choose the "Begin New Entry" tool. Now the results for each one of the keywords will go into a separate row in Excel and a separate scope in JSON. If you don't use the list tool anywhere in your project, the results scraped for each keyword will override one another.
9) Rename the "list1" command to something else like "products". Make sure not to name the list command the same as the list that holds your keywords. The list command should have a unique name.
10) Now we can apply our keywords to our Amazon search. First, click on the + button next to "Begin new entry in products" and click on "Select". We'll use the Select command to select the Amazon search bar, which will automatically allow us to input information.
11) From the input command options, select "Expression" from the dropdown. ParseHub will read the text as an expression instead of just plain text. Instead of typing the actual keyword, just type in "item.keyword" (which is what we named each of our keywords in step 3). This will tell ParseHub to add in the keyword that is represented by the item in the list of your keywords on the main_template.
12) Click on the + button next to "Begin new entry in products" and click on "Select". We'll use the Select command to select the search button.
13) Then we will use a "Click" command to click the search button. When asked if this is a "next page" button, click "No" and you will be prompted to create a new template (in this case we will create the template "search_results") for this click. Our new template will contain the actions we want to take once we've reached our list of results. Remember, you should use a new template for every page that looks different (has a different structure).
14) On the new template, you can go ahead and select and extract any of the results that you want to scrape. ParseHub will repeat the search and scrape the results for every keyword you added to your Google sheet.
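The logic of the finished project can be sketched in Python as an analogy. This is not ParseHub's implementation or its exact output format: `run_project` and `fake_search` are hypothetical names, and the search function is a stand-in for typing the keyword into the search bar and scraping the results page.

```python
def run_project(keywords, search):
    """Mimic the project: one new entry in "products" per keyword."""
    products = []
    for item in keywords:                            # "For each item in keywords"
        entry = {}                                   # "Begin new entry in products"
        entry["keyword"] = item["keyword"]           # the expression item.keyword
        entry["results"] = search(item["keyword"])   # the search_results template
        products.append(entry)
    return {"products": products}


def fake_search(keyword):
    """Hypothetical stand-in for searching and scraping a results page."""
    return [f"{keyword} result {n}" for n in (1, 2)]


# Each keyword scraped from the sheet is represented as a small record.
keywords = [{"keyword": "laptop"}, {"keyword": "headphones"}]
output = run_project(keywords, fake_search)

print(output)
```

Because a fresh entry is started on every pass through the loop, each keyword keeps its own results instead of overwriting the previous keyword's data.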