With ParseHub, you can scrape directory pages, such as listings of businesses, professionals and stores.
In this tutorial, you will learn how to scrape directories and extract all the details for each of the items (members) in the directory.
For this example, we will scrape the Realtor website in order to extract real estate agents in Toronto.
1. Click on the + button next to the select page, then choose the Select tool. Next, select the first agent's name:
2. Select the second agent's name (highlighted in yellow) in order to select all the listed agents:
3. When you select all the agents, ParseHub creates a Begin New Entry node (hidden under the list icon) and extracts the name and the URL of each agent. If you are not interested in the URL, you can hover over the extract node and remove it by clicking on the trash icon.
4. You can continue by selecting the relevant information for each agent on the same page. Click on the + button next to Select Agents and choose the Relative Select tool:
5. Select the target information (for example, Brokerage name) for the first agent. If the same information is not selected for the second agent, train ParseHub on more selections by selecting the second agent's brokerage name as well:
6. ParseHub automatically extracts the text and the URL of each brokerage. If you want to remove the automatic URL extraction, click on the + button and add an extract node that extracts only the text of the selection:
7. If you want to extract more information for each agent, repeat steps 4 to 6.
8. Some of the information you need might be available only on the agent's profile page. In that case, ParseHub has to visit each agent's profile page to extract it. To set this up, click on the + button next to Select Agents and choose the Click tool:
9. When you choose the Click tool, a configuration pop-up appears. When it asks you whether this is a "next page" button, click "No". Since the website loads a new page (new URL) for each agent, you should create a new template:
10. After you choose to create a new template, ParseHub takes you to the Profile template and loads the agent's profile page. Now that you are on the agent's profile page, you can separately select each piece of information that you are interested in. Note that each new selection creates a new column in your final results.
11. Note that the main_template (starting page) has more than one page of results. To scrape the results from all the pages, you need to add a pagination step to the main_template. If there is a "next" button (which takes you to the next page) shown on the starting page, please follow this article. However, if the starting page shows only page numbers (linking to specific pages of the results), please follow this one.
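To make the overall flow of the steps above easier to picture, here is a minimal Python sketch of the same logic: walk each page of the listing, read the fields available on the listing itself, visit each agent's profile for the remaining fields, then follow the "next" link. It is only an illustration and not how ParseHub works internally. The in-memory pages, the URLs, the field names (`name`, `brokerage`, `phone`) and the functions `scrape_profile` and `scrape_directory` are all hypothetical stand-ins.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for the listing pages (the main_template).
# Each page lists some agents and may point to a "next" page.
LISTING_PAGES: Dict[str, dict] = {
    "/agents?page=1": {
        "agents": [
            {"name": "Agent A", "url": "/agent/a", "brokerage": "Brokerage X"},
            {"name": "Agent B", "url": "/agent/b", "brokerage": "Brokerage Y"},
        ],
        "next": "/agents?page=2",
    },
    "/agents?page=2": {
        "agents": [
            {"name": "Agent C", "url": "/agent/c", "brokerage": "Brokerage X"},
        ],
        "next": None,
    },
}

# Hypothetical stand-in for the profile pages (the Profile template).
PROFILE_PAGES: Dict[str, dict] = {
    "/agent/a": {"phone": "000-0001"},
    "/agent/b": {"phone": "000-0002"},
    "/agent/c": {"phone": "000-0003"},
}


def scrape_profile(url: str) -> dict:
    """Return the fields that exist only on an agent's profile page."""
    return dict(PROFILE_PAGES[url])


def scrape_directory(start_url: str) -> List[dict]:
    """Collect one row per agent across all listing pages."""
    rows: List[dict] = []
    url: Optional[str] = start_url
    while url is not None:
        page = LISTING_PAGES[url]
        for agent in page["agents"]:
            # Fields selected on the listing page itself.
            row = {"name": agent["name"], "brokerage": agent["brokerage"]}
            # Fields selected on the profile page; each becomes a new column.
            row.update(scrape_profile(agent["url"]))
            rows.append(row)
        # Pagination: follow the "next" link until there is none.
        url = page["next"]
    return rows


if __name__ == "__main__":
    for row in scrape_directory("/agents?page=1"):
        print(row)
```

The inner loop corresponds to the selections and the Click command on each agent, and the outer loop corresponds to the pagination step added to the main_template.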
Please also see this video tutorial on scraping directories with the Yellow Pages website as an example.
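Since each new selection becomes a column in the final results, it can help to see how a list of per-agent entries maps onto a table. The sketch below assumes the results were downloaded as JSON with a single list under a key named `agents`; that key, the field names and the sample values are assumptions for illustration, so adjust them to match the selection names in your own project. `results_to_csv` is a hypothetical helper, not part of ParseHub.

```python
import csv
import io
import json

# Hypothetical sample of downloaded results; real key names depend on
# how the selections in your project are named.
SAMPLE_RESULTS = json.loads("""
{
  "agents": [
    {"name": "Agent A", "brokerage": "Brokerage X", "phone": "000-0001"},
    {"name": "Agent B", "brokerage": "Brokerage Y"}
  ]
}
""")


def results_to_csv(results: dict, list_key: str = "agents") -> str:
    """Turn a list of entries into CSV text, one column per extracted field."""
    items = results.get(list_key, [])
    # Build the column list in order of first appearance across all entries,
    # so a field missing from some entries still gets its own column.
    columns = []
    for item in items:
        for key in item:
            if key not in columns:
                columns.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()


if __name__ == "__main__":
    print(results_to_csv(SAMPLE_RESULTS))
```

An entry that lacks a field, such as an agent with no phone number on the profile page, simply gets an empty cell in that column.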