CHAPTER 4
In the previous chapter, we looked at the various options for matching characters within a search text; however, the examples assumed that we knew the exact number of characters we were searching for. In this chapter, we will explore how we can search for patterns of varying length.
The curly bracket metacharacters allow you to specify the number of times a particular character set occurs. For a simple example, let’s take our euro regex from the previous chapter and make it a more general-purpose currency search.
First, we want a symbol, which in this case is the euro, British pound, or dollar sign: [\u20AC\u00A3$]. The symbol is followed by at least one digit, but no more than five digits, then a period and two digits: \d{1,5}\.\d\d
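The {1,5} following the \d character indicates we need at least one, but no more than five, digits after the currency symbol and before the period. Our options with the curly brackets are {n} (exactly n times), {n,} (at least n times, with no upper limit), and {n,m} (between n and m times).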
Note that using {,n} will not search for a value between 0 and “n” times, but will instead be interpreted as literal text. If you can accept 0 occurrences, you need to specify the lower bound explicitly, i.e. {0,n}.
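As a minimal sketch of the currency pattern, the snippet below uses Python's re module as a stand-in for the Microsoft engine this book targets (an assumption, as are the made-up sample strings). One engine difference worth knowing: Python treats {,n} as {0,n} rather than as literal text, so writing the lower bound explicitly keeps a pattern portable.

```python
import re

# Currency symbol (euro, pound, or dollar), 1 to 5 digits, a period, two digits.
CURRENCY = re.compile(r"[\u20AC\u00A3$]\d{1,5}\.\d\d")

# Hypothetical sample text, not taken from the book.
text = "Lunch was \u20AC12.50, the laptop cost $1499.99, and the fee is \u00A30.75."

amounts = CURRENCY.findall(text)
print(amounts)

# Six digits before the period exceed the {1,5} limit, so nothing matches.
print(CURRENCY.search("$123456.00"))
```

The second call returns no match because the quantifier caps the run of digits at five before the period must appear.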
Here are some of the examples from the previous chapter, revisited using quantifiers.
Social Security number
Our SSN checker can be simplified from [0-9][0-9][0-9][-\.][0-9][0-9][-\.][0-9][0-9][0-9][0-9] to \d{3}[-.]\d{2}[-.]\d{4}, or to \d{3}-\d{2}-\d{4} if only dashes should be accepted as separators.
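To check that a quantified pattern behaves like the long form, a small comparison can help. This is only a sketch in Python (a stand-in for the Microsoft engine), the sample strings are hypothetical, and the short form here keeps the dash-or-period separator of the long form by using [-.].

```python
import re

# Long form from the previous chapter: one character class per digit.
LONG_FORM = re.compile(r"[0-9][0-9][0-9][-\.][0-9][0-9][-\.][0-9][0-9][0-9][0-9]")
# Quantified form; [-.] keeps the dash-or-period separator of the long form.
SHORT_FORM = re.compile(r"\d{3}[-.]\d{2}[-.]\d{4}")

# Hypothetical inputs: two well-formed numbers and two malformed ones.
samples = ["123-45-6789", "123.45.6789", "12-345-6789", "123-45-678"]

for s in samples:
    long_ok = LONG_FORM.fullmatch(s) is not None
    short_ok = SHORT_FORM.fullmatch(s) is not None
    print(f"{s!r}: long={long_ok} short={short_ok}")
```

Both patterns agree on every sample: the first two strings are accepted and the last two are rejected.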
Zip code
Our Postal ZIP+4 code checker can be improved from \d\d\d\d\d[-][0-9][0-9][0-9][0-9] to \d{5}-{0,1}\d{0,4}. However, even though this expression is improved and will handle both five-digit ZIP codes and ZIP+4, it is not perfect: it would also match five digits followed by a dash alone, or five digits followed by a dash and fewer than four digits.
Tip: Whenever you build a regex pattern, it is always advisable to run it in a test program to ensure it matches what you’d expect. As patterns get more and more complex, it is likely you’ll find some unexpected matches.
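Following the tip, here is a minimal test harness for the ZIP pattern, sketched in Python as a stand-in for the Microsoft engine. The candidate strings are hypothetical, and fullmatch is used so that the whole string must fit the pattern.

```python
import re

ZIP = re.compile(r"\d{5}-{0,1}\d{0,4}")

# Hypothetical inputs: valid codes first, then the questionable ones.
candidates = ["12345", "12345-6789", "12345-", "12345-67", "123456789", "1234"]

results = {c: ZIP.fullmatch(c) is not None for c in candidates}
for c, ok in results.items():
    print(f"{c!r:14} {'match' if ok else 'no match'}")
```

Only the four-digit input is rejected. The trailing dash, the short extension, and nine digits with no dash are all accepted, which is exactly the kind of unexpected match a quick test run brings to light.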
Now combining our first two chapters, let’s put together a regex pattern to see if we can extract the time from our messages. We know that a time starts with a number between 1 and 12, followed by a colon and then two digits, the first between 0 and 5 and the second between 0 and 9. Finally, it should end with AM or PM, case insensitive, optionally preceded by a whitespace character.
Table 5: Time Pattern
English rule | Regex pattern |
|---|---|
Might start with 0 or 1 | [0-1]{0,1} |
Need a number | \d |
And a colon | : |
A number between 0 and 5 | [0-5] |
A number between 0 and 9 | [0-9] |
An optional whitespace character | \s{0,1} |
Case insensitive a or p | [aApP] |
Case insensitive m | [Mm] |
The resulting regex pattern is [0-1]{0,1}\d:[0-5][0-9]\s{0,1}[aApP][Mm] and will match times such as 7:30 pm, 12:45 PM, 6:45 am, etc. (times written with periods, such as 7:30 p.m., are not covered by this pattern). While it is not perfect, it is pretty close (and will get closer in later chapters). Unfortunately, it will also match 19:00 pm, which is an invalid time, but still looks alright according to the defined pattern rules.
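The time pattern can be exercised the same way. The sketch below uses Python as a stand-in for the Microsoft engine, with hypothetical sample strings, and requires each whole string to fit the pattern.

```python
import re

TIME = re.compile(r"[0-1]{0,1}\d:[0-5][0-9]\s{0,1}[aApP][Mm]")

# Hypothetical inputs: valid times, one invalid hour, one invalid minute,
# and one time with no AM/PM suffix.
samples = ["7:30 pm", "12:45 PM", "6:45am", "19:00 pm", "7:65 pm", "7:30"]

matched = [s for s in samples if TIME.fullmatch(s)]
print(matched)
```

The invalid 19:00 pm slips through because the pattern checks each digit position separately rather than the hour as a whole, while 7:65 pm is rejected by the [0-5] class.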
In addition to using the curly brackets to specify the number of occurrences you’ll accept, there are other quantifiers available. However, to use these quantifiers, it is important to distinguish between greedy and lazy quantifiers. For example, let’s assume we are parsing an HTML file and we want to remove the <> tags from the strings. The first line we encounter in our text is <b>Financial Report</b>. We need to write a regular expression pattern to find (and eventually remove) these tags.
Our first attempt might look like this <.{0,}>, which says find a literal < sign, followed by 0 or more of any character, followed by a literal > sign. However, when processing that regex, we get the following result: <b>Financial Report</b>. In this case, the entire string was returned. This is an example of a greedy quantifier. A greedy quantifier takes the entire string and sees if it can match the pattern. Since the entire string matches the pattern, the regex returns the entire string (it begins with a < sign, contains any number of characters, and ends with a > sign).
By default, all quantifiers in curly brackets are greedy: the engine takes the entire string and sees if it matches. If not, it backtracks a character at a time until it finds a match.
Greedy quantifiers are quite resource intensive. They take the entire string and test it. If it does not match, the process backs up a character and checks to see if a match is found. The process also keeps track of where it is at, in case it needs to go back to a prior location to pick up the search.
If our example was missing the final > character, the regex engine would check all of the following strings against the pattern.
While this example shows 20 different strings being looked at, imagine if your regex method were attempting to process an entire HTML file.
A lazy quantifier operates using the opposite approach. It starts at the beginning of the text and matches as few characters as it can. We can make any curly bracket quantifier lazy by following it with a question mark. So if we take our pattern <.{0,}> and change it to <.{0,}?> we get a different result, in this case two matches: <b> and </b>. The regex engine found the <, the letter b, and the >, so it found a match. Since we used the global option, the engine kept reading, and eventually found another <, followed by the two characters /b, followed by the > sign.
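The difference between the greedy and lazy forms can be seen in a short sketch. Python's re module is used here as a stand-in for the Microsoft engine, with findall playing the role of the global option (both are assumptions about how you might test this yourself).

```python
import re

line = "<b>Financial Report</b>"

# Greedy: one match that spans the whole line.
greedy = re.findall(r"<.{0,}>", line)
# Lazy: each tag is matched separately.
lazy = re.findall(r"<.{0,}?>", line)
# Removing the tags with the lazy pattern leaves only the text.
stripped = re.sub(r"<.{0,}?>", "", line)

print(greedy)
print(lazy)
print(stripped)
```

With the lazy pattern, substituting an empty string for every match leaves just the text between the tags.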
A possessive quantifier operates just like a greedy quantifier, except that it doesn’t backtrack. If the entire string matches the pattern, a match is returned. If not, the match fails. In engines that support them, you add a + to the end of a quantifier to make it a possessive quantifier. The Microsoft regex engine does not support possessive quantifiers; however, you can instruct a regex group not to backtrack, making it operate like a possessive quantifier.
Possessive quantifiers offer a substantial speed improvement over greedy quantifiers when the match fails. A greedy quantifier says, “If at first you don’t succeed, try, try again.” A possessive quantifier says, “If at first you don’t succeed, give up, no sense being a fool about it.”
While you can use the {} symbols to specify the number of matches, there are three commonly used quantifier symbols as well. Remember that by default they are greedy, trying to match as much as possible.
Table 6: Common Quantifiers
Quantifier | Matches | Same as |
|---|---|---|
* | Match zero or more times | {0,} |
+ | Match one or more times | {1,} |
? | Match zero or one time | {0,1} |
To make any of the above lazy, add the question mark immediately after the quantifier. To review our example, let’s look at how the <b>Financial Report</b> string will be matched. This assumes the global option, i.e. all possible matches are returned.
Table 7: Financial Report String
Pattern | Result |
|---|---|
<.*> | <b>Financial Report</b> (the entire string) |
<.+> | <b>Financial Report</b> (the entire string) |
<.?> | <b> only |
<.*?> | <b> and </b> |
<.+?> | <b> and </b> |
<.??> | <b> only |
Notice that using the question mark quantifier (same as {0,1}) will not match the final </b> tag since it contains two characters between the < and > characters.
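The six patterns can be run side by side to confirm which tags each one finds. This is a sketch in Python, used as a stand-in for the Microsoft engine, with findall standing in for the global option.

```python
import re

line = "<b>Financial Report</b>"
patterns = [r"<.*>", r"<.+>", r"<.?>", r"<.*?>", r"<.+?>", r"<.??>"]

# Collect every match each pattern finds in the line.
table = {p: re.findall(p, line) for p in patterns}
for p, found in table.items():
    print(f"{p:8} {found}")
```

Both question mark forms find only the opening tag, since at most one character is allowed between the < and > signs.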
Note: Be careful with the * quantifier. \d* might look like it should match any digit, but it will match everything, since a pattern of 0 digits will always be found in a search string. When the string does not begin with a digit, the match object will return a Success property of true, but the Value property will be an empty string and Index and Length will both be zero.
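The empty match is easy to reproduce. The sketch below uses Python as a stand-in for the Microsoft engine; group(), start(), and end() on Python's match object roughly play the role of the Value, Index, and Index plus Length values described in the note, and the input string is hypothetical.

```python
import re

text = "abc123"  # hypothetical input that does not start with a digit

star = re.search(r"\d*", text)  # zero or more digits
plus = re.search(r"\d+", text)  # one or more digits

# The * form succeeds immediately with an empty match at position 0.
print(repr(star.group()), star.start(), star.end())
# The + form skips ahead to the first actual run of digits.
print(repr(plus.group()), plus.start(), plus.end())
```

Switching to + (or {1,}) is usually the fix when an empty match is not what you want.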
In this chapter, we covered the quantifier characters, to allow you to search for strings of varying lengths. Table 8 below summarizes the quantifiers.
Table 8: Quantifiers
Character | Description |
|---|---|
{n} | Preceding pattern occurs exactly n times |
{n,} | At least n times, but no upper limit |
{n,m} | Between n and m times |
* | Match zero or more times, same as {0,} |
+ | Match one or more times, same as {1,} |
? | Match zero or one time, same as {0,1} |