You can use regular expressions within ParseHub to home in on specific text within an element, and to filter out any text that doesn't match the pattern.
You can also use the Regex Cheatsheet for a condensed list of helpful expressions.
How to use Regular Expressions
1. Click on the + button next to the "Select page" command, or any other command in the template that you want to add an extraction to.
2. Choose the "Select" tool from the tool box.
3. Select the element on the page that you want to extract and modify.
4. The element text is extracted for you automatically. We need to modify this extraction.
5. Click on the + button next to the "Select & Extract selection1" command.
6. From the tool box select the "Extract" tool. A new "Extract selection1" command will appear for you.
7. Check the "Use regex" check box in the extraction command options panel.
8. Enter your regex in the text box. You must wrap any text you want included in your results in a capture group "()". Optionally, you can turn on "extract all occurrences", which makes the Extract command return a list of matches. This is useful for breaking up text into parts, for example splitting a full name into a first and last name.
9. Check your results by looking at the output in the results pane.
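To illustrate how capture groups and the "extract all occurrences" option behave, here is a minimal sketch in Python. It uses Python's re module only as a stand-in, so ParseHub's own regex engine may differ in details. The helper names extract_first and extract_all are hypothetical and are not part of ParseHub.

```python
import re
from typing import List, Optional

def extract_first(pattern: str, text: str) -> Optional[str]:
    """Return the text captured by the first "()" group of the first match."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

def extract_all(pattern: str, text: str) -> List[str]:
    """Return the first "()" group of every match, similar in spirit to
    turning on "extract all occurrences" (an assumption about its behavior)."""
    return [m.group(1) for m in re.finditer(pattern, text)]

full_name = "Ada Lovelace"

# Without "extract all occurrences": only the first match is kept.
print(extract_first(r"(\w+)", full_name))  # Ada

# With "extract all occurrences": every match is returned as a list.
print(extract_all(r"(\w+)", full_name))    # ['Ada', 'Lovelace']
```

Only the text inside the parentheses is returned, so anything matched outside the capture group is dropped from the result.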
Regular Expression Examples
Get the number of the price without the currency (dollar sign)
Sometimes, you will want to clean up your pricing data and remove the dollar sign beside the number.
1. Select the price on the page (or all of the prices on the page).
2. Enter the following RegEx into the text box:
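As a hedged illustration, one pattern that would capture the number without the dollar sign is sketched below in Python. The pattern is an assumption made for this example and is not necessarily the one ParseHub recommends. The name extract_price is hypothetical, and Python's re module may differ from ParseHub's regex engine.

```python
import re
from typing import Optional

# Assumed pattern: a dollar sign, optional whitespace, then a captured number
# made of digits, optional thousands separators, and an optional decimal part.
PRICE_PATTERN = r"\$\s*([0-9][0-9,]*(?:\.[0-9]+)?)"

def extract_price(text: str) -> Optional[str]:
    """Return the numeric part of the first dollar price found in text."""
    match = re.search(PRICE_PATTERN, text)
    return match.group(1) if match else None

print(extract_price("$1,299.99"))   # 1,299.99
print(extract_price("Price: $45"))  # 45
```

The dollar sign sits outside the parentheses, so it is matched but not included in the result.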
Get an email address from a block of text
It can be very tricky to parse every type of email address from a block of text, but with RegEx you can at least capture addresses in the most common formats.
1. Select the text that may contain an email address.
2. Enter the following RegEx into the text box:
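As a rough sketch, a pattern along the following lines captures addresses in common formats. It is an assumed pattern written for this example, it deliberately ignores unusual but valid addresses, and it is shown in Python, whose re module may differ from ParseHub's regex engine. The name find_emails is hypothetical.

```python
import re
from typing import List

# Assumed pattern: a local part, an "@", a domain, and a top-level domain
# of at least two letters. The whole address is inside one capture group.
EMAIL_PATTERN = r"([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})"

def find_emails(text: str) -> List[str]:
    """Return every substring of text that looks like an email address."""
    return re.findall(EMAIL_PATTERN, text)

block = "Contact us at sales@example.com or support@example.org."
print(find_emails(block))  # ['sales@example.com', 'support@example.org']
```

Returning a list here mirrors turning on "extract all occurrences"; without it, only the first address would be expected.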
Remove the label text beside a value
If you have a label beside another piece of text, you can remove it.
For example, if you have "Tel: 553-235-23453" or "Postal Code: 102323434" - you can remove the written text before the number.
1. Select the text and label that you want to modify.
2. Enter one of the following RegEx patterns into the text box, depending on the label:
Tel:(.*) for the telephone example, or Postal Code:(.*) for the postal code example.
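The sketch below shows in Python what these label patterns capture. Note that (.*) also captures the space after the colon. Adding \s* before the group, a small variation that is not from the original patterns, leaves that space out. The name strip_label is hypothetical, and Python's re module may differ from ParseHub's regex engine.

```python
import re
from typing import Optional

def strip_label(pattern: str, text: str) -> Optional[str]:
    """Return whatever the pattern's "()" group captures after the label."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

# The pattern as given keeps the space that follows the colon.
print(repr(strip_label(r"Tel:(.*)", "Tel: 553-235-23453")))     # ' 553-235-23453'

# Variation (an assumption): \s* consumes the whitespace before capturing.
print(repr(strip_label(r"Tel:\s*(.*)", "Tel: 553-235-23453")))  # '553-235-23453'
print(repr(strip_label(r"Postal Code:\s*(.*)", "Postal Code: 102323434")))  # '102323434'
```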