When possible, ParseHub automatically extracts the text and the URL of any element that you select.
You can refine this extraction and tell ParseHub to extract any HTML attribute.
What you can extract from an element:
- href Attribute - the URL (if you selected a link previously)
- src Attribute - the URL of an image (if you selected an image previously)
- Full HTML
- Inner HTML
- Page URL - the URL of the current page that you have associated with the template
- Text
- class Attribute - best used to get information about images and icons, such as the product rating behind the stars
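To make the options above concrete, here is a rough Python sketch of what each one corresponds to in the underlying markup. It is not how ParseHub works internally; the snippet, its class names, and its URLs are hypothetical, and the standard-library XML parser is used only because the snippet is well-formed. Page URL is not shown because it comes from the page itself rather than from the selected element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a selected product link.
snippet = (
    '<a class="product-link" href="/en/ip/coffee-maker/123">'
    '<img src="/img/cm.png" class="stars rating-4" />Coffee Maker</a>'
)

link = ET.fromstring(snippet)
image = link.find("img")

# Inner HTML = the element's own leading text plus its serialized children
# (ET.tostring includes each child's trailing text).
inner_html = (link.text or "") + "".join(
    ET.tostring(child, encoding="unicode") for child in link
)

extracted = {
    "href": link.get("href"),                            # href Attribute
    "src": image.get("src"),                             # src Attribute
    "class": image.get("class"),                         # class Attribute
    "text": "".join(link.itertext()).strip(),            # Text
    "inner_html": inner_html,                            # Inner HTML
    "full_html": ET.tostring(link, encoding="unicode"),  # Full HTML
}

for name, value in extracted.items():
    print(f"{name}: {value}")
```

Note that the image carries no text of its own, which is why the class attribute is the useful thing to extract from it.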
How to change what ParseHub is extracting:
1. Create an extract command by clicking on the + button of the selection that you want to extract. Click on the "Advanced" button and choose "Extract". Even though ParseHub already created an extraction for you, you need to create a new extraction to be able to refine it.
2. From the dropdown in the command options, select the option for what you need to extract.
Example 1: Get the product rating behind the stars
For this example go to a list of products on Walmart - http://www.walmart.ca/en/appliances/small-appliances/coffee-maker/N-658
1. Click on the + button located to the right of the "Select page" command.
2. From the tool box choose the "Select" tool.
3. Click on the star ratings.
4. In this case, nothing will be extracted for you because there is no text on the page, only images of stars. Rename the selection "stars".
5. Click on the + button of the "stars" selection & extraction command.
6. Choose the "Extract" tool from the tool box.
7. From the dropdown in the extraction command options select "class Attribute".
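Once the class attribute has been extracted, the rating usually still has to be read out of the class string. The sketch below assumes a hypothetical format such as "star-rating rating-4-5" (meaning 4.5 stars); the class names on the actual page may differ, so the pattern would need adjusting to whatever your results contain. The function name rating_from_class is also just an illustration.

```python
import re
from typing import Optional

# Assumed (hypothetical) class format: "rating-4" or "rating-4-5" for 4.5.
RATING_PATTERN = re.compile(r"rating-(\d)(?:-(\d))?\b")


def rating_from_class(class_attr: str) -> Optional[float]:
    """Return the star rating encoded in a class string, or None if absent."""
    match = RATING_PATTERN.search(class_attr)
    if match is None:
        return None
    whole, fraction = match.group(1), match.group(2) or "0"
    return float(f"{whole}.{fraction}")


if __name__ == "__main__":
    for value in ["star-rating rating-4-5", "star-rating rating-3", "no-reviews"]:
        print(value, "->", rating_from_class(value))
```

Returning None for unrecognized class strings keeps products without a rating distinguishable from products rated zero.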
Your CSV sample results should look like this:
Your JSON sample results should look like this:
Example 2: Get the full HTML from behind the product title
For this example go to a list of products on Walmart - http://www.walmart.ca/en/appliances/small-appliances/coffee-maker/N-658
1. Click on the + button located to the right of the "Select page" command.
2. From the tool box choose the "Select" tool.
3. Click on the product title.
4. In this case, the product title text and the URL will be extracted for you. Rename the selection "title_html".
5. Click on the + button of the "title_html" selection & extraction command.
6. Choose the "Extract" tool from the tool box.
7. From the dropdown in the extraction command options select "Full HTML". Notice how only the HTML extraction remains. The title and URL that were automatically extracted before were erased to make room for your new extraction.
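Because the full HTML replaces the automatic title and URL extractions, you may want to recover those two values from the HTML afterwards. The following is a minimal sketch using Python's standard html.parser module; the sample markup and the names LinkParts and split_title_html are hypothetical, and it assumes the extracted HTML contains a single link.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class LinkParts(HTMLParser):
    """Collect the first link's href and all visible text from an HTML fragment."""

    def __init__(self) -> None:
        super().__init__()
        self.href: Optional[str] = None
        self.text_parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        # Remember only the first <a> tag's href.
        if tag == "a" and self.href is None:
            self.href = dict(attrs).get("href")

    def handle_data(self, data):
        self.text_parts.append(data)


def split_title_html(full_html: str) -> Dict[str, Optional[str]]:
    """Return the title text and URL contained in an extracted HTML string."""
    parser = LinkParts()
    parser.feed(full_html)
    parser.close()
    # Collapse runs of whitespace so the title reads as plain text.
    title = " ".join("".join(parser.text_parts).split())
    return {"title": title, "url": parser.href}


if __name__ == "__main__":
    sample = '<a href="/en/ip/example/123"><span>Example</span> Coffee Maker</a>'
    print(split_title_html(sample))
```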
Your CSV sample results should look like this:
Your JSON sample results should look like this: