Some websites ask you to solve a Captcha in order to access their data.
Please note that, at this time, only Captchas that show an image that needs to be translated into characters are solvable by ParseHub. ParseHub cannot currently solve reCaptcha v2.
In this article, we will show you how to add a Captcha solver to your template in order to scrape Captcha-enabled websites.
There are two groups of websites that generate Captcha images. The first group shows the Captcha image on each page (this article's example), and you need to solve the Captcha to access the data. The second group shows you a Captcha only when it detects ParseHub as a bot that is sending the requests. This group sends the Captcha at random points during a run on our servers (a good example is Amazon), so you might not notice it while testing the project locally.
After running the project on our servers, you will be able to see the Captcha image in the server snapshots (available on the run page). To add a Captcha solver for these websites, first open the server snapshot that shows the Captcha image (similar to the image below) from the run page. Once you have the server snapshot open in one of the browser's tabs, you can navigate to any of your templates and follow the steps below to add the Captcha solver.
1. If not already in select mode, click on the + button next to the "Select page" command and choose the "Select" tool in the menu.
2. Select the Captcha image. You can rename this selection to "image".
3. Click on the + button next to the "Select & Extract image" command, then click on Advanced options and choose the "Extract" command.
In the Extract drop down menu, choose the "Solve Captcha" option.
This is an internal function which will automatically solve the Captcha during the run.
You can also rename the Extract command to "captcha".
4. Please note that the Captcha solver does not run automatically while you are building the project or doing a test run. During test runs, ParseHub asks you to answer the Captcha manually in order to proceed. Once the project runs on ParseHub servers, the Captcha solver will work properly.
Now that you have added the Captcha solver, you can select the answer field and enter the solution via a ParseHub expression.
Click on the + button next to the "Select page" command and choose the Select tool. Select the answer field. An "Input" command will be created automatically. Change the Input format to "expression" from the drop down menu that appears at the bottom of the command, and enter captcha (without quotation marks). This value is the solution from the Captcha solver, which was extracted as "captcha" in the previous step.
5. Normally there is a submit button available on the page that you can select to submit the Captcha solution.
Click on the + button next to the "Select page" command and choose the Select tool. Select the "Submit" button.
While you are building the project, you must enter the Captcha solution manually. Before adding the next command, go to "Browse" mode by clicking on the grey "Browse" button at the top of the template. Next, enter the Captcha solution manually on the website.
6. Click on the + button next to the "Select & Extract submit" command and choose the "Click" command.
The Click command's configuration pop up will appear. You can either choose to repeat the same template or you can create a new template in case the website is loading the results on a different page.
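To summarize how the commands from the steps above fit together, here is a small Python sketch. It is only an illustration and not ParseHub's internal format or API: the `Command` class, the `TEMPLATE` list, the `run_template` function, and the sample solution "X7K2Q" are all hypothetical names and values made up for this example. It shows why the Input command uses the expression captcha: the Extract command stores the solver's answer under that name, and the expression looks it up later.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Command:
    """Hypothetical stand-in for one command in a template."""
    kind: str         # "select", "extract", "input", or "click"
    name: str = ""    # selection or extraction name, e.g. "image" or "captcha"
    option: str = ""  # extra setting, e.g. "Solve Captcha" or an expression


# Hypothetical model of the template built in steps 1 to 6.
TEMPLATE: List[Command] = [
    Command("select", name="image"),                             # step 2
    Command("extract", name="captcha", option="Solve Captcha"),  # step 3
    Command("select", name="answer"),                            # step 4
    Command("input", option="captcha"),                          # expression, not literal text
    Command("select", name="submit"),                            # step 5
    Command("click"),                                            # step 6
]


def run_template(template: List[Command],
                 solve: Callable[[str], str]) -> Dict[str, str]:
    """Walk the commands in order and report what would be typed and clicked.

    `solve` stands in for whatever produces the Captcha solution: a person
    typing it during a test run, or the solver during a server run.
    """
    scope: Dict[str, str] = {}   # extracted values, keyed by extraction name
    result: Dict[str, str] = {}
    selected = ""                # name of the most recent selection
    for cmd in template:
        if cmd.kind == "select":
            selected = cmd.name
        elif cmd.kind == "extract":
            # Store the solution under the extraction name ("captcha").
            scope[cmd.name] = solve(selected)
        elif cmd.kind == "input":
            # The expression is resolved against earlier extractions.
            result["typed_into"] = selected
            result["typed_value"] = scope[cmd.option]
        elif cmd.kind == "click":
            result["clicked"] = selected
    return result


if __name__ == "__main__":
    # "X7K2Q" is a made-up solution used only for this illustration.
    print(run_template(TEMPLATE, lambda selection: "X7K2Q"))
```

In this sketch, renaming the Extract command without also updating the Input expression would make the lookup fail, which is why the two names need to match in your template.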
If you need more help with your project, please email us at hello@parsehub.com. We would be happy to help you.