ParseHub can be used to get data from websites that load their listings as you scroll down the page (also known as infinite scroll or lazy loading). However, some websites need hundreds of scrolls to load all of their listings, and this can cause ParseHub to run out of memory.
To avoid running out of memory, we can scroll, extract the loaded listings, remove them from the web page, and then repeat the same commands.
In this example, we will demonstrate the technique by scraping the YellowPages website.
How to get data from an undetermined number of scrolls
- Click on the + button next to "Select page", click on advanced, and then choose an Extract command. Name this command "listingValue", then clear the contents of the extract configuration box and type 0 to set the value of "listingValue" to 0. We will use this value later to decide whether or not we need to scroll.
- Click on the + button next to "Select page" and choose a Select command. Move your cursor onto the first listing, then hold the CTRL key (CMD on Mac) and press 1 to expand the selection until it covers the whole container of the listing. Click on the first listing to select its container. The other listings will be highlighted in yellow; click on the next one as well to select all of the listings. You may need to do this multiple times to select all of them.
- When you select all the listings, ParseHub creates a Begin New Entry (list) node, which is hidden on the Select container command, and extracts the text within the container (name). If you are not interested in the name, you can hover over the extract node and remove it by clicking on the trash icon.
- Expand the Begin New Entry command by clicking on the list icon. We can now select and extract the data from each listing by clicking on the + button next to Begin New Entry (listing) and choosing Relative Select. Click on the container that we selected earlier and use the arrow to select any element you want to extract.
- Hover over the "Select listing" command and hold the Shift key. Click on the + button that appears and choose an Extract command. Rename this command to "remove", and choose "Delete element from page" from the Extract dropdown menu.
- Hover over the "Select listing" command and hold the Shift key. Click on the + button that appears and choose an Extract command. Rename this command "listingValue" (the same name as above), then clear the contents of the extract configuration box and type 1. This command will execute and set the value of listingValue to 1 once all the products from the list are extracted and removed.
Make sure that your Extract commands are not nested within your Begin new entry command! Otherwise, the "listingValue" variable that you are setting to 1 will belong to a new scope and cannot be referenced as "listingValue" outside of that scope. The commands should be in line with the Begin new entry command in your command structure.
- Click on the + button next to "Select page", click advanced, and choose a Conditional command. In the conditional command, just write listingValue. This condition will be true when listingValue equals 1 (i.e. when we have finished the previous steps' commands of extracting and removing).
- Select the container that holds all of the content that you want to scroll. To do this you need to refine what is highlighted on the page. Hover over any element contained in the scrollable content, such as a product title. Hold down the CTRL key (or the CMD key on a Mac) and press the number 1 on your keyboard. Press 1 again and again until you see a highlight around all of the loaded content. Without letting go of CTRL/CMD, click on the highlight to make a selection. If you zoom too far, press the number 2 on your keyboard to shrink the selection.
- Click on the + button next to the "Select container" command and add a Scroll command. In the configuration, choose the "Align to bottom" radio button. If listingValue is 1, the conditional command will proceed and scroll down the page to load more products.
- To tell ParseHub to repeat this template, hold the Shift key while hovering over the Scroll command and click the + button that appears. Choose a Go to template command and set it up so that it stays on the current page and repeats the template that contains the scroll (i.e. the main_template).
- If we ran our project as is, ParseHub would stop after the first scroll because, by default, ParseHub does not revisit pages that it has already scraped. We can disable this behavior by going to our template settings and disabling "No duplicates".
Now ParseHub will select, extract, and remove the data from the HTML and scroll down to load more listings.
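The loop above can be sketched in code. The following is a minimal, purely illustrative Python simulation (not ParseHub's actual implementation): the hypothetical LazyPage class stands in for a lazy-loading page that loads a batch of listings per scroll, and scrape() mirrors the extract-remove-scroll cycle, so memory never holds more than one batch of listings at a time.

```python
# Conceptual simulation of the scroll, extract, and remove loop.
# LazyPage is a hypothetical stand-in for a lazy-loading web page:
# each scroll() loads up to `batch` more listings into the "DOM".

class LazyPage:
    def __init__(self, total, batch):
        self.remaining = total   # listings not yet loaded by the site
        self.batch = batch       # listings loaded per scroll
        self.visible = []        # listings currently on the page
        self._next_id = 0

    def scroll(self):
        loaded = min(self.batch, self.remaining)
        self.remaining -= loaded
        for _ in range(loaded):
            self.visible.append(f"listing-{self._next_id}")
            self._next_id += 1

    def remove(self, listing):
        # Mirrors the "Delete element from page" Extract command.
        self.visible.remove(listing)

def scrape(page):
    results = []
    page.scroll()                # initial page load
    while page.visible:
        # listingValue starts at 0: extract and remove every loaded listing.
        for listing in list(page.visible):
            results.append(listing)
            page.remove(listing)
        # listingValue is now 1: the conditional passes, so scroll again.
        page.scroll()            # loads the next batch, if any remain
    return results

data = scrape(LazyPage(total=25, batch=10))
print(len(data))  # 25 listings extracted, at most 10 in memory at once
```

Because each batch is removed before the next scroll, the page (and ParseHub's memory use) stays small no matter how many listings the site has.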
If this method does not work for your particular website (i.e. you are getting duplicate data), you can try our other method of infinite scrolling.