Enter a list of URLs to crawl

With ParseHub you can navigate between links and categories on a website automatically. Sometimes, you may want to add hundreds of links directly into ParseHub, instead of selecting links on the website.

You can add a list of URLs in JSON format to the "Starting value" field in the "Settings" tab of the project.

Follow the instructions below to enter a list of URLs into your project.

1. Open your project and load any page of the website.

2. Go to the "Settings" tab of the project.

3. In the "Starting value" text box, add your links in the following format.

  • The "urls" name can be changed to anything you want, such as "links" or something more descriptive like "shoes" or "brands".
  • You can enter as many links in the structure as you want. This example has 3, but you can keep adding more links, each inside quotation marks and separated by commas.
  {
    "urls": [
      "https://www.amazon.ca/DADAWEN-Canvas-Lace-up-Oxford-Shoes-Black/dp/B013YBWV1W/ref=sr_1_1?ie=UTF8&qid=1458158940&sr=8-1&keywords=shoes",
      "https://www.amazon.ca/DADAWEN-Leather-Lace-up-Bussiness-shoes-Black/dp/B00Y2HNSNI/ref=sr_1_2?ie=UTF8&qid=1458158940&sr=8-2&keywords=shoes",
      "https://www.amazon.ca/Gleader-Casual-High-top-Waterproof-Sneakers/dp/B00X9GBXTE/ref=sr_1_3?ie=UTF8&qid=1458158940&sr=8-3&keywords=shoes"
    ]
  }


 
4. Go back to the "Commands" tab of your project.
 
5. Click on the + button on the right side of the "Select page" command.
 
6. From the tool box, choose the "Loop" tool, which is inside "Advanced". The Loop tool iterates over a list, so it is useful for repeating the same commands for every item in that list.
 
 
7. In the text boxes below your commands, leave "item" as it is and type "urls" in the list text box (without the quotation marks).
  • You can change "item" to any name you want. The name represents one URL in your list of URLs.
  • The list name must be exactly the same as your list name in the JSON. If you put in {"urls":....}, make sure the text in the text box is "urls".

8. Click on the + button on the right side of the "For each item in urls" command.

9. From the tool box, choose the "Begin New Entry" tool. Now the results for each of the URLs will go into a separate row in CSV and a separate object in JSON. If you didn't use the "Begin New Entry" tool anywhere in your project, the result scraped for each URL would overwrite the previous one.

10. Rename the "list1" command to something else, such as "links". Do not give the Begin New Entry command the same name as the list that holds your URLs; the Begin New Entry command should have a unique name.

11. Click on the + button on the right side of the "links" command (or the "list1" command, if you didn't rename it).

 

12. Choose the "Go To Template" tool from the tool box. The Go To Template command lets you specify which URL you want to go to and which type of page you want to open.

13. On the pop up window, choose the "Go to URL" option instead of the "Stay on the Same Page" option.

14. In the text box type in "item" without quotation marks, assuming you didn't name it something else.

15. In the "Create New Template" text box, type in the name of a new template you want to open for each link, such as "results". Click "Create New Template". You should now be taken to the first URL in your JSON list, and a new template should be created for you.

16. In this new template, continue adding commands. They will be applied to each of the URLs in your list in turn.
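
As a mental model of steps 5 to 16, the Loop, Begin New Entry, and Go To Template commands behave roughly like the Python sketch below. This is only an illustration, not how ParseHub is implemented: `scrape_template` is a hypothetical stand-in for the commands in your new template, `run_loop` is a made-up helper name, and the example.com links are placeholders.

```python
import json

# Same shape as the "Starting value" in the "Settings" tab (placeholder URLs).
starting_value = json.loads("""
{
  "urls": [
    "https://example.com/product/1",
    "https://example.com/product/2",
    "https://example.com/product/3"
  ]
}
""")


def scrape_template(url):
    """Hypothetical stand-in for the commands in the "results" template."""
    return {"url": url}


def run_loop(value, list_name="urls"):
    """Visit each URL in value[list_name] and collect one entry per URL."""
    if list_name not in value:
        # Mirrors step 7: the list name must match the JSON name exactly.
        raise KeyError(f'no list named "{list_name}" in the starting value')
    links = []  # plays the role of the renamed Begin New Entry list (step 10)
    for item in value[list_name]:      # "For each item in urls"
        entry = scrape_template(item)  # Go To Template with the URL "item"
        links.append(entry)            # Begin New Entry: one entry per URL
    return {"links": links}


results = run_loop(starting_value)
print(json.dumps(results, indent=2))
```

Appending one entry per URL is what keeps each result separate. Without that step, a single result would be replaced on every pass through the loop, as described in step 9.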

Bonus Tip:

If your list of links is in an Excel spreadsheet, you can easily convert it into JSON with the handy Mr. Data Converter tool.

1. Copy your list of links from Excel or any other text file.

2. Go to Mr. Data Converter.

3. In the first box enter all of your links. Make sure to type in a heading name at the top of the column such as "urls".

4. From the dropdown select JSON - column arrays.

5. Copy and paste your finished JSON into the "Starting value" text box in the "Settings" tab of the project.
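
If you would rather do the conversion yourself, a few lines of Python can produce JSON in the same shape as the "Starting value" example in step 3 of the first list. This is a minimal sketch, assuming your links are pasted one per line. The function name `links_to_starting_value` and the example.com links are made up for illustration.

```python
import json


def links_to_starting_value(text, list_name="urls"):
    """Turn one-link-per-line text into JSON for the "Starting value" box."""
    # Drop blank lines and surrounding whitespace left over from copying.
    links = [line.strip() for line in text.splitlines() if line.strip()]
    return json.dumps({list_name: links}, indent=2)


# Placeholder links, as they might look when copied from a spreadsheet column.
pasted = """
https://example.com/page-1
https://example.com/page-2
"""

print(links_to_starting_value(pasted))
```

Pass a different `list_name` if you want a name other than "urls", and use that same name in the list text box of the Loop command.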

 
