VIDEO TRANSCRIPTION
In this tutorial, I'm going to show you how you can extract data from one or more tables on a website.
Creating your project
To begin, open ParseHub and click on "New Project" and input the URL of the website that you would like to extract data from.
For this tutorial, we'll be using the Bloomberg Stocks page. If you would like to follow along, you can also enter www.bloomberg.com/markets/stocks into your ParseHub account and click on "Start project on this URL".
The ParseHub tool
As the page loads, you'll notice that there are three different areas within the ParseHub client:
- On the left-hand side, we have your commands section and your settings
- In the middle we have the interactive view of the website
- And at the very bottom is where we're going to be able to preview our data in either JSON or CSV format
When you start working on ParseHub, your project is already going to be in Select mode and there's already going to be an empty Select command for you. However, if this is not the case, you can always click on the "+" sign and choose the Select command.
What the Select command does is it allows you to select elements on the page that you would like to extract.
The Bloomberg Stocks page
As we scroll down the page for the Bloomberg Stocks, we can see that there are three tables: one with the American stocks, one with EMEA stocks and one with Asia Pacific stocks.
Extracting data from one table
For the first part of this tutorial, we're going to be focused on extracting data from the first table alone.
As we're already in Select mode, we can already proceed to selecting the data on the first column. If you hover over these two elements you'll notice that you can select either/or but not both of them. If you'd like to select both, there's a trick you can use which is to select either Command or Ctrl (if you're using a Windows [device]) and then 1 to zoom out or Command or Ctrl and 2 to zoom back in. In this way you can zoom all the way out and all the way in on the element that you would like to extract.
In this case I'm going to click on this first element and ParseHub is going to automatically highlight other elements that it considers to be the same in yellow. I'm going to select the second element that's the same on this table to train it and it should already be able to identify the remaining elements on that first column.
If we look at our preview below, we can switch it to CSV/Excel to see that what it's done is to put the name on those stocks into this first column and their URL into the second column. We can also expand that to see the remaining ones.
We can see on the side that our Select command now has the default name selection1, it's added automatically a Begin New Entry command - this is the one that allows us to create a new entry per CSV row for each of the data that we've selected. It's extracted the name and it's extracted the URL. If we don't want the URL, we can always click on the X sign to remove that. We can also rename "selection1" to something more descriptive such as "stock".
To associate the data in the first column with the data in the second column, we're going to to use what's called a Relative Select command which does precisely that: it associates data. To do so, we'll click on the "+" sign next to "Begin new entry in stock", click on "Relative Select", click on the item in the first column and associate it with the item in the second column. ParseHub should automatically figure this out for the remaining items on the second column and add them to your CSV file in the second column. We can then rename that "selection2" to another more descriptive name, such as "value".
We can do the same thing for the third column by clicking on the "+" sign next to "Begin new entry in stock", Relative Select, the first column and the third column to select that data. You'll notice that only two element have been selected here, it hasn't selected the negative ones because it doesn't consider them to be the same type of data. This is quite easy to resolve, all you need to do is click on the second element and on the third column for the second element to train ParseHub to understand that you want the entire column full of data. We can once again rename that to something more descriptive such as "net_change" and now we have our preview of our table which has our stock name, it's value and it's net change.
We could continue doing this for the remaining columns but it would be exactly the same as the prior two steps where we click on the "+" sign next to "Begin new entry in stock", choose a Relative Select command, choose the first column and then choose that new column.
Extracting data from multiple tables
What I want to show you during the second part of this tutorial is how you can scrape data from all three tables in the one go.
The first thing I'll do is delete this ["Select selection1"] so we can start from a clean slate and you'll notice that once again we have an empty Select command.
This time what we want to to is select the title of the table, so I'm going to click on "Americas", which is going to be highlighted in green and now appears in my first column in the preview below and I'm going to scroll down and see that indeed it's identified "Europe, Middle East & Africa" as a similar element. If I click on it, I'm training it to identify all of these elements as similar elements and it's now captured "Asia Pacific" as well and all three appear below.
Once again, I can give that a more descriptive name, and you'll see the Begin New Entry command has automatically been added as well as extract name and url, which I can remove if I'm not interested.
To associate the header of the first table with the data on the first table, I'm going to use another Relative Select command which I'll select from next to "Begin new entry in region". If I click on Relative Select, I'll associate the header with the first item in the first column. To tell it that I want all items in that column, I'll scroll up again, select the header, and associate it with the second item in that column.
You can now see that not only has it selected all the items in the Americas table on that first column, but it's also done the same for the EMEA table and the APAC table.
If I want to associate data in the other columns as well, I'll just repeat what I did in the previous section. Here I can rename that to something more descriptive such as "stock", I can choose to remove the "Extract url" and next to where it says "Begin new entry in stock", I can choose a Relative Select command to associate the first column with the second column, rename that to "value" and associate the first column with the third column, train it to identify the negative values and rename that to "net_change".
Testing our project
You can see now that I have the data for all three tables and the first three columns in each of those tables. This appears in the preview, but if I wanted to test that, I could also click on "Get Data", perform a test run and open a test run and then click on the play button so that it'll run through that project and show me the preview in JSON of what's going to be extracted.
Note that on this test run, you also have the option on the first hand to go step-by-step through the project and this will show you that data in JSON.
If you wanted to run the full project, you would simply stop running, go to "Get Data", click on Run and "Save and Run". As it starts running you'll see the progress and once it's finished you'll be able to download your data in CSV or JSON.
Need more help?
In this tutorial I have shown you how you can scrape data from one or more tables by using ParseHub. If you have any questions at all regarding your own project you can always contact us at hello@parsehub.com. We're always happy to help!
Happy parsing!
Please also see this written tutorial on scraping tables with the NBL Canada stats website as an example.