You may see differences between the results on your project's test run and the results you receive when you run your project. For example:
- The data scraped on the run is different from the test run (e.g. different prices)
- Data is missing, or no data appears at all, on the run, even though data is scraped on the test run
- The data wasn't scraped in the same order on the run as it was on the test run
There are a few reasons why you may be experiencing these issues:
- IP addresses: the test run runs locally on your device, while the run runs on our servers
- Speed: runs will typically be faster than test runs
- Parallelization: on our paid plans, the order in which pages are scraped differs from the test run
Below we discuss each of these in further detail.
IP addresses
When you test run your project, it runs locally on your computer using your IP address. However, when you run the project, it runs on our server and will either use one static IP address or IP Rotation (from a pool of proxies).
Our IP addresses and proxies are based in North America, and some websites may change their content based on your location. Therefore, if you are building a project to scrape prices in GBP on a .co.uk website, for example, and that website detects the US IP address and updates its prices to USD accordingly, you may see prices extracted in USD.
You can use the Server Snapshot command to view what our servers see when they scrape the page. If you would like to use an IP address from a particular location, you have the option to use custom proxies on any of our paid plans.
Speed
Sometimes when you test run the project in play mode, you'll see that ParseHub tries to scrape the data, does not find it because it has not had time to load, and therefore ends the project.
However, even if this is not the case, runs on our servers are much faster than test runs on your local machine. Therefore, ParseHub may try to scrape the page before it has had time to load the element you are trying to extract.
The solution, in both cases, is to enable the "Wait up to 60 seconds for elements to appear" setting on your selection:
Note that it will only wait the full time if the element does not appear. We recommend enabling this setting only on the first element of each template that is guaranteed to appear; enabling it on elements which aren't always present on the page can slow down your project considerably.
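To illustrate why this matters, here is a minimal sketch (in Python, purely conceptual, not ParseHub's internals) of how a "wait for element" setting typically behaves: it returns as soon as the element appears, but it only gives up after the full timeout when the element never shows up, which is why enabling it on elements that aren't always present is costly.

```python
import time

def wait_for_element(find_element, timeout=60, poll_interval=0.5):
    """Poll a hypothetical `find_element` callable until it returns an
    element or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        element = find_element()
        if element is not None:
            return element          # element loaded: returns almost immediately
        time.sleep(poll_interval)
    return None                     # element never appeared: costs the full timeout

# An element that loads after ~2 seconds costs ~2 seconds of waiting; an
# element that is simply not on the page costs the entire 60 seconds,
# once per page visited.
```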
Parallelization (Paid Plans)
On the test run, the project is scraped in the order in which its commands appear. However, on the actual run, ParseHub may scrape your data in a different order.
Running multiple projects at the same time
On our paid plans you have access to multiple "workers" on our servers, and you can assign more than one worker to a run. This increases speed, and distributing workers also allows you to have multiple runs going at the same time. You can view how many workers are assigned to a run by clicking on the "little person icon" in the top right-hand corner and going to "My Runs".
This link has more detailed information on the number of workers and parallelizing projects.
Order in which pages are scraped
When multiple workers are running on a project, they will typically start by scraping all of the top-level pages and will then be distributed across lower-level pages.
For example, if you are scraping a website that has a list of results (e.g. an e-commerce website, real estate listings, dealership lists, directories, etc.) where you are clicking into each result, ParseHub will first collect all of the results and will then start clicking into each one.
Therefore, if you have 20 pages of results with 15 results per page (300 items to click into) and you set a limit of 150 pages, ParseHub will first scrape all 20 pages of results and then use the remaining 130 pages to click into items, so only 130 of the 300 items will be scraped. If you have a small limit like 15 pages, ParseHub will scrape 15 results pages but is unlikely to click into any of them.
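As a rough illustration of this page budget (a sketch of the arithmetic above, not ParseHub's actual scheduler):

```python
results_pages = 20       # pages of search results on the site
items_per_page = 15      # results listed on each page
max_pages = 150          # "Max Pages" limit set on the run

total_items = results_pages * items_per_page              # 300 items exist
pages_used_on_results = min(results_pages, max_pages)     # results pages are scraped first
budget_left = max_pages - pages_used_on_results           # pages left for clicking into items
items_scraped = min(total_items, budget_left)             # each item click uses one page

print(items_scraped)  # 130 -> only 130 of the 300 items are clicked into
# With max_pages = 15, budget_left is 0, so the run spends its whole budget
# on results pages and never clicks into a single item.
```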
Troubleshooting your run
The following settings allow you to limit the number of pages or clicks on a run, which can be useful for troubleshooting certain issues without having to run the entire project.
Max Pages
If you would like to limit the number of pages that will be scraped on a run, you can set a maximum by opening your project, clicking on the Settings tab and entering a number in the "Max Pages" field (0 means there is no limit):
Please note that clicks, scrolls and visits to new URLs all count toward the page total. So, for example, if you set your "Max Pages" to "5" and your project has to click on 5 links before it starts scraping data, no data will be extracted.
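As a tiny illustration of that accounting (the action names here are hypothetical, not ParseHub commands):

```python
max_pages = 5
# Five navigation clicks are needed before the extraction step is reached.
actions = ["click", "click", "click", "click", "click", "extract"]

pages_used = 0
for action in actions:
    if action in ("click", "scroll", "goto"):   # each navigation counts as a page
        pages_used += 1
        if pages_used >= max_pages:
            break                                # budget exhausted before "extract" runs
print(pages_used)  # 5 -> the page budget is gone, so no data is extracted
```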
Max depth
On any Click command you will see a "Max depth" setting below it that allows you to specify how many times you would like to click on that element.
For example, in the screenshot below "nextButton" selects the ">" button on the site (highlighted in green). There are 100 pages of results on this website but, because we've set the Max depth to 8, ParseHub will only click on the "next" button 8 times, resulting in a total of 9 pages of results scraped.
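A simple way to reason about this setting (an informal rule of thumb, not a ParseHub formula): the number of result pages scraped equals the starting page plus one page per click.

```python
max_depth = 8              # "Max depth" set on the nextButton Click command
pages_scraped = 1 + max_depth
print(pages_scraped)       # 9 pages of results, even though the site has 100
```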