Sometimes you need to scrape data from a series of pages whose URLs differ only by an ID number or extension. This tutorial shows you how to navigate to each URL in such a series and scrape data from it, starting from a list of those IDs.
In this tutorial, I'll be scraping stock data from http://www.barchart.com/stocks/sp500.php.
Setting up the list of IDs to input
1. While working on a project, click the "Settings" button at the top of the left-side commands tab.
2. In the Starting Values text box, you can import a list of IDs from a CSV or JSON file, or paste a JSON list directly into the text box.
For example, if you have a CSV file in the following format (where the header is "stocks" and each stock ticker is on a different row):
Using the "Import from CSV/JSON" button next to "Starting Value" will convert it to a JSON list like the one below:
If you have a JSON list, you can also paste it in directly. I used this JSON object to get the top 25 stock tickers of the S&P 500 Index:
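As a rough illustration of the conversion described above, the Python sketch below turns a one-column CSV (a header followed by one value per row) into a JSON object keyed by that header. The function name `csv_column_to_json` and the three tickers are placeholders chosen for this example, not the list used in this tutorial, and the exact shape that ParseHub's importer produces is an assumption based on the "stocks" list referenced later.

```python
import csv
import io
import json


def csv_column_to_json(csv_text: str) -> str:
    """Convert a one-column CSV (header + one value per row) to a JSON object.

    Hypothetical helper for illustration; it is not part of ParseHub.
    """
    reader = csv.reader(io.StringIO(csv_text))
    # Skip blank lines and rows whose first cell is empty.
    rows = [row for row in reader if row and row[0].strip()]
    header = rows[0][0].strip()
    values = [row[0].strip() for row in rows[1:]]
    # Assumed target shape: {"<header>": [value, value, ...]}
    return json.dumps({header: values})


# Placeholder tickers, for illustration only.
sample_csv = "stocks\nAAPL\nMSFT\nXOM\n"
print(csv_column_to_json(sample_csv))
```

If you prepare the JSON yourself this way, you can paste the printed result into the text box instead of importing the CSV file.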
Navigating to each URL
3. Click the "Commands" button at the top of the left-hand tab.
4. On the main_template, click the plus button to the right of the Select page command. Add a Loop command from the "Advanced" menu.
5. Input "for each stock in stocks". This will cause ParseHub to perform the subsequent commands for each of the 25 stock tickers in the stocks list of the Starting Values.
6. Click the plus button to the right of the new Loop command. Add a Go to Template command from the "Advanced" menu.
7. Input the following expression into the Go to URL box in the pop-up that appears:
This appends the "stock" variable, which we defined in the Loop command, to the end of the quoted URL.
You can use this for any URL. If the ID you need to input is in the middle of a URL, format it like this:
8. Choose to Go to a new template, and input a name such as stock_template. Then, press the green button to add the command.
9. On the new page that loads, select any data you need to scrape. ParseHub will collect this data from each page whose URL contains one of the IDs. Be sure to add a Begin New Entry command at the start of this template; otherwise, each new piece of data will overwrite the previous one in the same scope.
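To make the logic of steps 5 through 9 concrete, here is a minimal Python sketch of the same idea: loop over the IDs, build a URL for each one (with the ID at the end or in the middle), and start a fresh entry per page. It is only an analogy for what the ParseHub commands do, not ParseHub's implementation. The `example.com` URLs, the function names, and the `begin_new_entry` flag are all hypothetical.

```python
from typing import Dict, List

# Hypothetical URL pieces; substitute the real site and path.
BASE_URL = "https://example.com/quotes/"
MIDDLE_SUFFIX = "/overview"


def url_with_id_at_end(stock: str) -> str:
    """Append the ID to the end of a fixed URL string."""
    return BASE_URL + stock


def url_with_id_in_middle(stock: str) -> str:
    """Place the ID between two fixed URL strings."""
    return BASE_URL + stock + MIDDLE_SUFFIX


def collect(stocks: List[str], begin_new_entry: bool) -> List[Dict[str, str]]:
    """Visit one URL per ID and record data for it.

    With begin_new_entry=False, every iteration writes into the same
    entry, so later values overwrite earlier ones.
    """
    results: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for stock in stocks:  # mirrors "for each stock in stocks"
        if begin_new_entry:
            current = {}  # start a fresh entry for this page
            results.append(current)
        current["ticker"] = stock
        current["url"] = url_with_id_at_end(stock)
    if not begin_new_entry:
        results.append(current)
    return results


tickers = ["AAPL", "MSFT", "XOM"]  # placeholder IDs
print(len(collect(tickers, begin_new_entry=True)))
print(len(collect(tickers, begin_new_entry=False)))
```

The second call keeps only the last ticker's data, which is the overwriting that the Begin New Entry command is meant to prevent.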