You may want to schedule some projects to run daily and scrape only that day's results. If each listing has a date or a timestamp, you can do this by following the instructions below.
For this example, we will be using the Financial Times' news feed located at https://www.ft.com/news-feed. We will divide this into four sections:
- Setting up the basics of our project
- Getting today's date and time in the correct format
- Scrape articles only if they have been published today
- Stop paginating once we've moved past today's articles
Setting up the basics of our project
1. Open ParseHub and start a new project with the URL https://www.ft.com/news-feed
2. Click on the "+" sign next to "Select page" and choose a Select command. Click on the timestamp of the first article, then on one or two more article timestamps, until all of them are selected and highlighted in green. You can rename your selection to "Article" and, where it says "Extract name", change "name" to "time".
3. To add pagination, click on the "+" sign next to "Select page", choose a Select command and click on the "Next" button. You can rename that selection to "next". Click on the "+" sign next to your "next" selection and choose a Click command. When asked if this is a "next page" button, choose "Yes" which should default to "Repeat the Current Template".
Getting today's date and time in the correct format
We are going to amend our basic project above by adding on two conditions:
- One checks whether the article was published today and, if so, scrapes the information for that article (e.g. title, author).
- The other checks the last article on the page and, if that article is dated before today, stops paginating so that the project stops running.
Dates can appear in many different formats - e.g. "Wednesday, 8 November, 2017", "November 8, 2017", "08-11-2017" or "Nov 8, '17" - so we need to ensure that ours matches the one used on the website.
You can get today's date in ParseHub using the $date.toString() method in an Extract command. Between the parentheses, specify how you would like the date formatted, using the standard date and time format specifiers in the table below:
Format | Description | Example |
---|---|---|
s | The seconds of the minute between 0-59. | "0" to "59" |
ss | The seconds of the minute with leading zero if required. | "00" to "59" |
m | The minute of the hour between 0-59. | "0" to "59" |
mm | The minute of the hour with leading zero if required. | "00" to "59" |
h | The hour of the day between 1-12. | "1" to "12" |
hh | The hour of the day with leading zero if required. | "01" to "12" |
H | The hour of the day between 0-23. | "0" to "23" |
HH | The hour of the day with leading zero if required. | "00" to "23" |
d | The day of the month between 1 and 31. | "1" to "31" |
dd | The day of the month with leading zero if required. | "01" to "31" |
ddd | Abbreviated day name. Date.CultureInfo.abbreviatedDayNames. | "Mon" to "Sun" |
dddd | The full day name. Date.CultureInfo.dayNames. | "Monday" to "Sunday" |
M | The month of the year between 1-12. | "1" to "12" |
MM | The month of the year with leading zero if required. | "01" to "12" |
MMM | Abbreviated month name. Date.CultureInfo.abbreviatedMonthNames. | "Jan" to "Dec" |
MMMM | The full month name. Date.CultureInfo.monthNames. | "January" to "December" |
yy | Displays the year as a two-digit number. | "99" or "07" |
yyyy | Displays the full four digit year. | "1999" or "2007" |
t | Displays the first character of the A.M./P.M. designator. Date.CultureInfo.amDesignator or Date.CultureInfo.pmDesignator. | "A" or "P" |
tt | Displays the A.M./P.M. designator. Date.CultureInfo.amDesignator or Date.CultureInfo.pmDesignator. | "AM" or "PM" |
S | The ordinal suffix ("st", "nd", "rd" or "th") of the current day. | "st", "nd", "rd" or "th" |
So, for example, these would be the codes for the following dates:
- "Wednesday, 8 November, 2017" - $date.toString("dddd, d MMMM, yyyy")
- "November 8, 2017" - $date.toString("MMMM d, yyyy")
- "08-11-2017" - $date.toString("dd-MM-yyyy")
- "Nov 8, '17" - $date.toString("MMM d, 'yy")
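To see what strings these format codes produce, here is a minimal Python sketch that mimics a subset of the specifiers in the table (day, month and year only). It is an illustration and not ParseHub's own implementation; the function name format_date and the English name lists are assumptions made for this example.

```python
import re
from datetime import datetime

# English names, assumed here for illustration only.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
MONTH_NAMES = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

# Longer tokens come first so that "dddd" is not read as four separate "d" tokens.
TOKEN_PATTERN = re.compile(r"dddd|ddd|dd|d|MMMM|MMM|MM|M|yyyy|yy")


def format_date(dt, fmt):
    """Expand a subset of the format specifiers from the table above."""
    values = {
        "dddd": DAY_NAMES[dt.weekday()],
        "ddd": DAY_NAMES[dt.weekday()][:3],
        "dd": f"{dt.day:02d}",
        "d": str(dt.day),
        "MMMM": MONTH_NAMES[dt.month - 1],
        "MMM": MONTH_NAMES[dt.month - 1][:3],
        "MM": f"{dt.month:02d}",
        "M": str(dt.month),
        "yyyy": f"{dt.year:04d}",
        "yy": f"{dt.year % 100:02d}",
    }
    # Every matched token is replaced; all other characters pass through unchanged.
    return TOKEN_PATTERN.sub(lambda m: values[m.group(0)], fmt)


sample = datetime(2017, 11, 8)
print(format_date(sample, "dddd, d MMMM, yyyy"))  # Wednesday, 8 November, 2017
print(format_date(sample, "dd-MM-yyyy"))          # 08-11-2017
```

Because every character must match the website's date exactly, it is worth checking punctuation and leading zeros as carefully as the tokens themselves.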
Scrape articles only if they have been published today
1. Work out the code for today's date using the date and time formatting guide above. If you would like to extract it, click on the "+" sign next to "Select page", go to "Advanced", choose an Extract command and enter your code. You can drag this command to the top of the template so that it appears first in your results.
2. Hover over "Select Article" or hold down your Shift key until the "+" sign appears. Click on it, go to "Advanced" and choose a Conditional command. In the box for the condition, compare today's date (in the same format as the website) with the text of your selection ($e.text). For example:
$date.toString("dddd, d MMMM, yyyy") == $e.text
3. Drag your conditional command and nest it between your selection and your "Begin new entry" command so that an entry is only created if the condition is met. Make sure your "Begin new entry" is to the right of the selection (nested below).
4. Now you can add more details to your entry using Relative Select commands by clicking on the "+" sign next to your "Begin new entry" command and relating your timestamp to other data such as the article headline or preview.
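The effect of the conditional in the steps above can be sketched outside ParseHub as a simple equality filter. This is a hedged Python illustration only: the article dictionaries, their "time" and "headline" keys and the function name todays_articles are hypothetical stand-ins for the selections made in the project.

```python
def todays_articles(articles, today):
    """Keep only the articles whose timestamp text equals today's formatted date."""
    # Exact string equality, mirroring a condition of the form
    # $date.toString(...) == $e.text
    return [article for article in articles if article["time"] == today]


# Hypothetical listing data, invented for this example.
articles = [
    {"time": "Wednesday, 8 November, 2017", "headline": "First story"},
    {"time": "Wednesday, 8 November, 2017", "headline": "Second story"},
    {"time": "Tuesday, 7 November, 2017", "headline": "Yesterday's story"},
]

kept = todays_articles(articles, "Wednesday, 8 November, 2017")
print([article["headline"] for article in kept])  # ['First story', 'Second story']

# A date in a different format matches nothing, even though it is the same day.
print(todays_articles(articles, "November 8, 2017"))  # []
```

The second call shows why the format has to match the website: the comparison is made on text, not on the underlying date.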
Stop paginating once we've moved past today's articles
1. Click on the "+" sign next to "Select page" and choose a Select command. Use it to click on the date of the last article on the page. Drag it above your "Select next" command and rename the selection to "pageDate".
2. Click on the icon next to "Select & Extract pageDate" to expand the command and reveal its Extract command. In the drop-down menu for the Extract command, choose the option that holds the article's date. In this case, this is the "title Attribute".
3. To remove the time (as we'll just be comparing dates to see if both are the same) we can use regex. In this example, we can click on the "Use regex" checkbox below and enter (.*)\d:
4. Hover over your "Select next" command or hold down Shift so that the "+" sign appears. Click on it, go to "Advanced" and choose a Conditional command. Enter a condition that compares the pageDate we've selected with today's date in the same format (see above for instructions). In this case our condition will be pageDate == $date.toString("MMMM d, yyyy"). Drag your Click command so that it's nested below the conditional.
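The stop condition built in the steps above can be sketched in Python as a loop that only moves on to the next page while the last article's date still equals today's date. This is an illustration under stated assumptions, not ParseHub's behaviour: the title values such as "November 8, 2017 4:35 pm" are invented, the names strip_time and pages_to_scrape are hypothetical, and the pattern used to drop the time is a swapped-in one chosen for these invented values rather than the regex entered in the project.

```python
import re

# Hypothetical pattern: keep everything up to and including the four-digit year.
DATE_ONLY = re.compile(r"^(.*\d{4})")


def strip_time(title):
    """Return the date part of a title value, or the value unchanged if no year is found."""
    match = DATE_ONLY.match(title)
    return match.group(1) if match else title


def pages_to_scrape(pages, today):
    """Count how many pages are visited before pagination stops."""
    visited = 0
    for page in pages:
        visited += 1
        # Stop on an empty page, or when the last article is no longer from today.
        if not page or strip_time(page[-1]) != today:
            break
    return visited


# Each inner list holds the title values of one page, newest first (invented data).
pages = [
    ["November 8, 2017 4:35 pm", "November 8, 2017 2:10 pm"],
    ["November 8, 2017 9:01 am", "November 7, 2017 11:45 pm"],
    ["November 7, 2017 6:00 pm"],
]

print(strip_time("November 7, 2017 11:45 pm"))      # November 7, 2017
print(pages_to_scrape(pages, "November 8, 2017"))   # 2
```

The third page is never opened, because the last article on the second page is already dated before today.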
And that's it! Your project should now extract entries only for today's articles, and move on to the next page only if the last entry on the current page still has today's date.