Sometimes, you need to scrape data from a series of pages with similar URLs. This tutorial will show you how to navigate to, and scrape data from, a list of URLs when you have a list of changing ID numbers or extensions.
If you have a list of full URLs, you can follow this tutorial, and if you need to input a list of keywords into a text box on the page, follow this one.
In this tutorial, I'll be scraping stock data from http://www.barchart.com/stocks/sp500.php.
Setting up the list of IDs to input
1. While working on a project, click the "Settings" button at the top of the left-side commands tab.
2. In the Starting Values text box, you can import a list of IDs from a CSV or JSON file or paste a JSON list directly into the text box.
For example, suppose you have a CSV file in which the header is "stocks" and each stock ticker is on its own row. Using the "Import from CSV/JSON" button next to "Starting Value" will convert it to a JSON list in the same shape as the example below.
If you have a JSON list, you can also paste it in directly. I used this JSON object to get the top 25 stock tickers of the S&P 500 Index: {"stocks":["XOM","GE","MSFT","BP","C","PG","WMT","PFE","HBC","TM","JNJ","BAC","AIG","TOT","NVS","MO","GSK","MTU","JPM","RDS.A","CVX","SNY","VOD","INTC","IBM"]}
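If you would rather prepare the JSON yourself before pasting it in, the conversion described above can be sketched in a few lines of Python. This is only an illustration of the CSV-to-JSON shape, not ParseHub's own import code; the function name csv_column_to_json and the three-ticker sample are assumptions made for the example.

```python
import csv
import io
import json


def csv_column_to_json(csv_text: str) -> str:
    """Turn a one-column CSV (header row, then one value per row) into a JSON object string.

    Hypothetical helper for illustration; it is not part of ParseHub.
    """
    # Read every non-empty row; each row is a list with a single cell.
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header = rows[0][0].strip()
    values = [row[0].strip() for row in rows[1:]]
    # Compact separators give the same no-spaces layout as the example above.
    return json.dumps({header: values}, separators=(",", ":"))


# A shortened sample: the header is "stocks" and each ticker is on its own row.
sample_csv = "stocks\nXOM\nGE\nMSFT\n"
print(csv_column_to_json(sample_csv))
# prints {"stocks":["XOM","GE","MSFT"]}
```

The printed string can be pasted straight into the Starting Values text box.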
Navigating to each URL
3. Click the "Commands" button at the top of the left-hand tab.
4. On the main_template, click the plus button to the right of the Select page command. Add a Loop command from the "Advanced" menu.
5. Input "stock" instead of item and choose your list from the List dropdown: "stocks". This will cause ParseHub to perform the subsequent commands for each of the 25 stock tickers in the stocks list of the Starting Values.
6. To create a list in your results file and extract the values for each stock, add a Begin New Entry command. Click the plus button next to the Loop command and choose the Begin New Entry command from the "Advanced" menu.
You can rename your list to results.
7. Click the plus button to the right of the Begin New Entry command. Add a Go to Template command from the "Advanced" menu.
In the pop-up that appears, input the following expression into the Go to URL box: "http://www.barchart.com/quotes/stocks/"+stock
This appends the "stock" variable, which we defined in the Loop command, to the end of the quoted URL.
You can use this for any URL. If the ID you need to input is in the middle of a URL, format it like this: "http://firstpartoftheurl.com/"+item+"/second_part"
8. Choose to Go to a new template, and input a name such as stock_template. Then, press the green button to add the command.
9. On the new page that loads, you can select any data you need to scrape. ParseHub will extract this data from the page for each ID in your list.
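To see what the Loop, Begin New Entry, and Go to URL steps amount to, here is a minimal Python sketch of the same logic. It only builds the URLs and one result entry per ticker; it does not fetch any pages and is not how ParseHub is implemented. The function names and the "url" field are assumptions made for the example, and the ticker list is shortened to three items.

```python
import json

# The Starting Values JSON, shortened to three tickers for the example.
starting_value = json.loads('{"stocks":["XOM","GE","MSFT"]}')


def url_with_id_at_end(stock: str) -> str:
    """Same expression as in the Go to URL box: quoted URL + variable."""
    return "http://www.barchart.com/quotes/stocks/" + stock


def url_with_id_in_middle(item: str) -> str:
    """The variant for an ID that sits in the middle of the URL."""
    return "http://firstpartoftheurl.com/" + item + "/second_part"


def build_results(stocks):
    """Mimic the Loop + Begin New Entry structure: one entry per ticker."""
    results = []
    for stock in stocks:  # Loop command: "stock" takes each value in "stocks"
        entry = {  # Begin New Entry: start a fresh item in the results list
            "stock": stock,
            "url": url_with_id_at_end(stock),  # Go to URL expression
        }
        results.append(entry)
    return results


for entry in build_results(starting_value["stocks"]):
    print(entry["stock"], entry["url"])
```

Running it prints one line per ticker with the URL that would be visited, which is a quick way to check your expression before running the project.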