ParseHub automatically extracts the text and, when possible, the URL of any element that you select.
You can refine this extraction and tell ParseHub to extract any HTML attribute.
What you can extract from an element:
- href Attribute - the URL (if you selected a link previously)
- src Attribute - the URL of an image (if you selected an image previously)
- Full HTML
- Inner HTML
- Page URL - the URL of the current page that you have associated with the template
- Text
- class Attribute - best used to get information about images and icons, such as the product rating behind the stars
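To make the difference between these options concrete, here is a minimal Python sketch that runs outside ParseHub. It is only an illustration: the markup in `SNIPPET` and the function name `extraction_options` are hypothetical, and the snippet is assumed to be well-formed so that the standard library XML parser can read it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a selected product link.
SNIPPET = '<a class="product-title" href="/item/123"><b>Running</b> Shoe</a>'


def extraction_options(markup: str) -> dict:
    """Return roughly what each extraction option would yield for one element."""
    el = ET.fromstring(markup)
    # Inner HTML: the element's own leading text plus every serialized child
    # (ElementTree includes the text that trails each child).
    inner = (el.text or "") + "".join(
        ET.tostring(child, encoding="unicode") for child in el
    )
    return {
        "text": "".join(el.itertext()),                    # visible text only
        "href": el.get("href"),                            # href Attribute
        "class": el.get("class"),                          # class Attribute
        "full_html": ET.tostring(el, encoding="unicode"),  # element with its own tag
        "inner_html": inner,                               # content without the outer tag
    }


options = extraction_options(SNIPPET)
for name, value in options.items():
    print(f"{name}: {value}")
```

Full HTML includes the selected element's own tag and attributes, while Inner HTML contains only what sits inside it.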
How to change what ParseHub is extracting:
1. Select similar elements on the page by clicking the + button on the Select page command. Next, we will change the extraction option for the selections we made. If you want to learn how to scrape e-commerce websites like the Amazon storefront below, follow this tutorial.
2. Click on the Extract name command in the command bar on the left-hand side to reveal its settings. From the dropdown in the extract command options, select the option for what you need to extract.
Example 1: Get the product rating behind the stars
For this example, go to a list of products on Amazon - https://www.amazon.ca/s?k=shoes&crid=1C28VKJD0ZDFL&sprefix=shoes%2Caps%2C160&ref=nb_sb_noss_1
1. Click on the + button located to the right of the "Select page" command.
2. From the toolbox, choose the "Select" tool.
3. Click on the star ratings.
4. In this case, a star rating of 4.1 out of 5 stars is extracted for you. Rename the selection "Stars".
5. Click on the button with the arrow to expand the command and reveal the extract command. Click on the Extract command to reveal its settings.
6. From the dropdown in the extract command options, select "Inner HTML".
Your CSV sample results should look like this:
Your JSON sample results should look like this:
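Separately from the sample results, the inner HTML you export often needs a small post-processing step to become a number. The following Python sketch is not part of ParseHub. It assumes the extracted inner HTML contains a phrase such as "4.1 out of 5 stars"; the markup in `sample`, its class name, and the function name `parse_rating` are hypothetical.

```python
import re
from typing import Optional

# Matches phrases like "4.1 out of 5" anywhere in the extracted markup.
RATING_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s+out of\s+(\d+)")


def parse_rating(inner_html: str) -> Optional[float]:
    """Return the rating as a float, or None if no rating phrase is found."""
    match = RATING_PATTERN.search(inner_html)
    return float(match.group(1)) if match else None


# Hypothetical inner HTML for a "Stars" selection; real pages may differ.
sample = '<span class="rating-label">4.1 out of 5 stars</span>'
rating = parse_rating(sample)
print(rating)
```

Returning `None` when nothing matches keeps products without a rating from breaking the rest of a results file.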
Example 2: Get the full HTML from behind the product title
For this example, go to a list of products on Amazon - https://www.amazon.ca/s?k=shoes&crid=1C28VKJD0ZDFL&sprefix=shoes%2Caps%2C160&ref=nb_sb_noss_1
1. Click on the + button located to the right of the "Select page" command.
2. From the toolbox, choose the "Select" tool.
3. Click on the product title.
4. In this case, the product title text and the URL will be extracted for you. Rename the selection "title_html".
5. Click on the selection to expand the command. This will reveal the extract commands and their settings.
6. From the dropdown in the first extract command's options, select "Full HTML". Notice that the HTML extraction now appears.
Your CSV sample results should look like this:
Your JSON sample results should look like this:
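If you later need the title text and link back out of the exported full HTML, a short script can recover them. The Python sketch below is one possible approach using only the standard library, not a ParseHub feature. The markup in `title_html`, the class `LinkExtractor`, and the function `split_title_html` are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the first link target and all text found in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.href = None
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Remember only the first <a> tag's href.
        if tag == "a" and self.href is None:
            self.href = dict(attrs).get("href")

    def handle_data(self, data):
        self.text_parts.append(data)

    @property
    def text(self):
        # Collapse line breaks and repeated spaces into single spaces.
        return " ".join("".join(self.text_parts).split())


def split_title_html(full_html):
    """Return (title_text, url) recovered from a full HTML fragment."""
    parser = LinkExtractor()
    parser.feed(full_html)
    parser.close()
    return parser.text, parser.href


# Hypothetical full HTML for a "title_html" selection; real pages may differ.
title_html = '<a class="title-link" href="/item/123"><span>Running\n  Shoe</span></a>'
title, url = split_title_html(title_html)
print(title, url)
```

`html.parser` is used here because real-world HTML is often not well-formed enough for a strict XML parser.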