Infinite scrolls for a large set of listings

ParseHub can get data from websites that use infinite scrolling to load their listings. However, some websites need hundreds of scrolls to load every listing, and this can make ParseHub run out of memory.

To avoid running out of memory, we can extract the loaded elements, remove them from the web page, and scroll to load more, repeating these commands with a Loop command.
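To see why removing elements helps, here is a minimal Python sketch that simulates the idea outside of ParseHub. FakePage and scrape are hypothetical names made up for this illustration and are not part of ParseHub; the sketch only counts how many listings sit in the page at once, as a rough stand-in for memory use.

```python
from dataclasses import dataclass, field


@dataclass
class FakePage:
    """Toy stand-in for an infinite-scroll page (hypothetical, not ParseHub's API)."""
    total: int       # listings the site would eventually serve
    batch: int       # listings loaded by each scroll
    served: int = 0  # listings handed out so far
    elements: list = field(default_factory=list)  # listings currently in the page

    def scroll(self) -> None:
        """Load the next batch of listings into the page, if any remain."""
        n = min(self.batch, self.total - self.served)
        self.elements.extend(f"listing-{self.served + i}" for i in range(n))
        self.served += n


def scrape(page: FakePage, remove_after_extract: bool):
    """Return (extracted listings, peak number of elements held in the page)."""
    extracted, peak = [], 0
    cursor = 0     # index of the first element that has not been extracted yet
    page.scroll()  # assume the first batch is present when the page opens
    while cursor < len(page.elements):
        peak = max(peak, len(page.elements))
        extracted.extend(page.elements[cursor:])
        if remove_after_extract:
            page.elements.clear()  # plays the role of $e.remove() on each advert
            cursor = 0
        else:
            cursor = len(page.elements)
        page.scroll()
    return extracted, peak


if __name__ == "__main__":
    _, peak_kept = scrape(FakePage(total=1000, batch=20), remove_after_extract=False)
    _, peak_removed = scrape(FakePage(total=1000, batch=20), remove_after_extract=True)
    print(peak_kept, peak_removed)  # prints "1000 20"
```

Both runs extract the same 1000 listings, but without removal the page grows to hold all of them, while with removal it never holds more than one batch.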

For this example, we will scrape Razoo.com.

How to get the data from hundreds of scrolls

1. Click on the + button next to the Select page command, click on Advanced, then choose the Loop command. In the loop configuration, enter the $createArray(x) function in the In field to create an array that defines how many times the commands will repeat. Set x to the number of scrolls the website needs to load all the listings.


2. Click on the + button next to the For each item command, click on Advanced, then choose the Extract command. Clear the contents of the extract configuration box and type 0. Name this command listingValue, so that listingValue is set to 0. We will use this value later to check whether we need to scroll.

 


3. Click on the + button next to the Select page command, then choose the Select tool. Move your cursor over the first advert, hold Command (Mac) or Control, and press 1 to expand the selection until it covers the whole container of the advert. Click on the first advert to select its container. The remaining adverts will be highlighted in yellow; click on the next one to select all the adverts.


4. Once all the adverts are selected, ParseHub creates a Begin New Entry (list) node and extracts the name and the URL of each advert. If you are not interested in the URL or the name, hover over the corresponding extract node and remove it.

5. To extract more data from each advert, click on the + button next to Begin New Entry (adverts) and choose the Relative Select command. Click on the image, then point the arrow to any element you want to extract.


6. Click on the Select adverts command node and hold Shift on your keyboard so that the + button appears. Click on the + button and add an Extract command. In the extract configuration box, remove the existing text, $e.parentProp("href"), and enter this function: $e.remove(). This is the remove command; it runs for each advert and removes them from the HTML one by one. Because this command comes after the extract commands, ParseHub first scrapes the information for each advert and then removes it.


7. Click on the + button next to the For each item in $createArray(1000) loop command and add an Extract command. In the extract configuration box, remove the existing text, $e.parentProp("href"), and enter 1. Give this command the same name as before, listingValue. It runs once all the adverts in the list have been extracted and removed, and sets listingValue to 1.


8. Click on the + button next to the For each item in $createArray(1000) command and add a Conditional command. In the conditional command, enter listingValue. The condition is true when listingValue equals 1, that is, once the extracting and removing commands from the previous steps have finished.


9. Select the container that holds all of the content that you want to scroll. To do this you need to refine what is highlighted on the page. Hover over any element contained in the scrollable content, such as a product title. Hold down the CTRL key (or the CMD key on a Mac) and press the number 1 on your keyboard. Press 1 again and again until you see a highlight around all of the loaded content. Without letting go of CTRL/CMD, click on the highlight to make a selection. If you zoom too far, press the number 2 on your keyboard to shrink the selection.


10. Click on the + button next to the Select container command and add a Scroll command. In the configuration, check the Align to bottom box. When listingValue is 1, the conditional command proceeds and scrolls down the page to load more adverts. This process repeats once for each item in the array created by $createArray(x).

Now ParseHub will select, extract, and remove the data from the HTML, then scroll down to load more adverts.
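The order in which the commands run can be sketched in Python as follows. This is only an illustration of the control flow described in the steps, not how ParseHub works internally; scrolls_needed and run_project are hypothetical helpers, and the sketch assumes you know roughly how many listings exist and how many load per scroll when choosing x.

```python
import math


def scrolls_needed(total_listings: int, listings_per_scroll: int) -> int:
    """Rough estimate of x for $createArray(x): one loop pass per batch of listings."""
    return math.ceil(total_listings / listings_per_scroll)


def run_project(batches, x):
    """Mimic the command order: Loop, Extract 0, Select adverts, Extract 1, Conditional, Scroll.

    `batches` is a list of lists. batches[0] is assumed to be on the page
    already, and each scroll loads the next batch.
    """
    page = list(batches[0]) if batches else []
    next_batch = 1
    results = []
    scrolls = 0
    for _ in range(x):                     # For each item in $createArray(x)
        listing_value = 0                  # step 2: listingValue = 0
        for advert in list(page):          # step 3: select every advert on the page
            results.append(advert)         # steps 4-5: extract its data
            page.remove(advert)            # step 6: $e.remove()
        listing_value = 1                  # step 7: listingValue = 1
        if listing_value:                  # step 8: conditional on listingValue
            if next_batch < len(batches):  # steps 9-10: scrolling loads more adverts
                page.extend(batches[next_batch])
                next_batch += 1
            scrolls += 1
    return results, scrolls


if __name__ == "__main__":
    batches = [["a0", "a1", "a2"], ["a3", "a4", "a5"], ["a6", "a7"]]
    x = scrolls_needed(total_listings=8, listings_per_scroll=3)
    results, scrolls = run_project(batches, x)
    print(x, len(results), scrolls)  # prints "3 8 3"
```

In this simplified sketch the flag is always 1 by the time it is checked; it is kept only to mirror the listingValue commands from the steps. Setting x too low leaves listings unscraped, so rounding the estimate up is the safer choice.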


 

 

 

 
