CHAPTER 6
Alternation is the ability in a regex to match one of a list of choices. For example, if we want to write a regex to match a URL, we might tackle it as shown in Table 10:
Table 10: Alternation Rules
English rule | Regex pattern |
|---|---|
Begins with www | www |
Need a period | \. |
Then any number of word characters | \w{1,} |
Another period | \. |
One of the following TLDs (net, com, org, edu, info) | (com|net|org|edu|info) |
With the proliferation of TLDs (top-level domains), the English rules above are limited, but they illustrate the concept of alternation. Alternation provides a list of alternatives to consider, enclosed in parentheses and separated by the pipe character |. The choices in the list do not have to be the same size.
The search text needs to contain one of the listed alternatives. In its simplest form, the list is simply a list of words to match. If we wanted to expand the URL pattern search to include FTP sites as well as websites, we could replace the www in the pattern with (www|ftp).
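As a rough illustration, the rules in Table 10, with (www|ftp) swapped in for www, can be tried out with Python's re module. This is only a sketch: the choice of Python, the helper name find_urls, and the sample text are assumptions made for demonstration.

```python
import re

# Pattern assembled from Table 10, with (www|ftp) in place of www.
URL_PATTERN = re.compile(r"(www|ftp)\.\w{1,}\.(com|net|org|edu|info)")


def find_urls(text):
    """Return every substring of text that matches URL_PATTERN (hypothetical helper)."""
    return [m.group(0) for m in URL_PATTERN.finditer(text)]


# Made-up sample text: two hosts the pattern should find, one it should skip.
sample = "Visit www.example.com or ftp.example.org, but not mail.example.com."
print(find_urls(sample))  # ['www.example.com', 'ftp.example.org']
```

The third host is skipped because it starts with neither www nor ftp, the only two choices the first alternation allows.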
In addition, the items between the pipe characters do not simply have to be words; they can be patterns to search for. Imagine we want a regex pattern to find our assigned help desk person, who we know can be John, Sue, or Bill. One approach to find the technician might be to list all of the variations of their names: (John|Jon|Sue|Susan|Bill|Will). While such an approach can work, we can also use patterns to improve the likelihood of a match.
As a simple example, let’s find John, Jon, Sue, Susie, Suzie, Bill, or Will.
Table 11: Alternation Patterns
English rule | Regex pattern |
|---|---|
John or Jon | Joh{0,1}n |
OR | | |
Sue | Sue |
OR | | |
Susie or Suzie | Su[sz]ie |
OR | | |
Will or Bill | [WB]ill |
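Joined with the pipe character, the rows of Table 11 form a single pattern. The sketch below tries that pattern in Python. The helper is_technician and the list of test names are hypothetical, and using fullmatch (so the whole name must match) is an assumption about how the pattern would be applied.

```python
import re

# The four rows of Table 11, joined by the alternation metacharacter.
NAME_PATTERN = re.compile(r"Joh{0,1}n|Sue|Su[sz]ie|[WB]ill")


def is_technician(name):
    """True when the entire name matches one of the alternatives."""
    return NAME_PATTERN.fullmatch(name) is not None


# Hypothetical names: the first seven are the ones the chapter wants to find.
names = ["John", "Jon", "Sue", "Susie", "Suzie", "Bill", "Will", "Susan", "Jill"]
matched = [n for n in names if is_technician(n)]
print(matched)  # ['John', 'Jon', 'Sue', 'Susie', 'Suzie', 'Bill', 'Will']
```

Each alternative is a small pattern in its own right, so seven spellings are covered by four choices.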
The regex alternation pattern becomes Joh{0,1}n|Sue|Su[sz]ie|[BW]ill. While the ability to add patterns in addition to literal strings is very powerful, it also can introduce some subtle issues. Looking at the URL pattern we wrote at the beginning of this chapter:
www\.\w{1,}\.(net|com|org|edu)
We could have written it without the parentheses: www\.\w{1,}\.net|com|org|edu. However, the regex engine would interpret this differently. Without the parentheses, the pipe splits the entire pattern into four alternatives:
Find www, followed by a period, one or more word characters, another period, and the word net; OR find the word com; OR the word org; OR the word edu. So every one of the following lines would contain a match, which is probably not what we were expecting.
www.facebook.com
www.yahoo.net
comic books
comedy
By adding the parentheses, the regex www\.\w{1,}\.(net|com|org|edu) returns the expected result: only the two websites match, and comic books and comedy no longer do.
www.facebook.com
www.yahoo.net
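The difference between the two forms can be checked directly. The following Python sketch runs both patterns against the four sample lines and records the text each one matches. The helper name matched_text is hypothetical, and re.search is assumed as the way the pattern is applied.

```python
import re

samples = ["www.facebook.com", "www.yahoo.net", "comic books", "comedy"]

# Without parentheses the pipe splits the whole pattern into four alternatives.
ungrouped = re.compile(r"www\.\w{1,}\.net|com|org|edu")
# With parentheses the alternation applies only to the TLD.
grouped = re.compile(r"www\.\w{1,}\.(net|com|org|edu)")


def matched_text(pattern, text):
    """Return the first substring matched by pattern, or None if there is none."""
    m = pattern.search(text)
    return m.group(0) if m else None


ungrouped_hits = [matched_text(ungrouped, s) for s in samples]
grouped_hits = [matched_text(grouped, s) for s in samples]

print(ungrouped_hits)  # ['com', 'www.yahoo.net', 'com', 'com']
print(grouped_hits)    # ['www.facebook.com', 'www.yahoo.net', None, None]
```

Note that the ungrouped pattern matches only the letters com in www.facebook.com, not the whole address, because the full URL rule is tied to net alone.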
Note: Using parentheses in an expression creates a group, which we will cover in more detail in a later chapter. Groups come in handy when we want to use a regular expression to break a larger string into components and manipulate the components. We’ve been focusing on searching with regex in these first few chapters, where the group is less important.
One caveat when working with alternation is to make sure your choices can be reached. Alternation will generally be resolved left to right. If some of the leftmost patterns are always matched, the alternation will stop rather than looking for a better match later in the list. For example, if we wrote the following pattern:
\d*|[A-Fa-f]*\d*|0X\d*
It might look like a valid pattern to find either a numeric value or a hex number. However, the \d* at the beginning means zero or more digits. Any string at all contains zero digits at its first position, so the first alternative always succeeds, and the pattern always resolves to the leftmost item in the list without ever trying the others.
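A short Python sketch makes the caveat concrete. It writes the second alternative as [A-Fa-f]*\d* (a valid character class with an unescaped asterisk, which is an assumption about the intended pattern) and compares the original order with a reordered version that puts the most specific choice first. The names original, reordered, and first_match are hypothetical.

```python
import re

# Order as written in the text: \d* comes first and can match nothing at all.
original = re.compile(r"\d*|[A-Fa-f]*\d*|0X\d*")
# Same alternatives, most specific first (an illustrative reordering).
reordered = re.compile(r"0X\d*|[A-Fa-f]*\d*|\d*")


def first_match(pattern, text):
    """Return what the pattern matches at the start of text."""
    return pattern.match(text).group(0)


print(repr(first_match(original, "0X12")))   # '0'  (stops at the first alternative)
print(repr(first_match(original, "FF")))     # ''   (zero digits still counts as a match)
print(repr(first_match(reordered, "0X12")))  # '0X12'
print(repr(first_match(reordered, "FF")))    # 'FF'
```

Even the reordered pattern still matches an empty string on arbitrary input, because its second alternative can also match zero characters. Reordering only fixes which alternative wins; it does not make the pattern strict.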
Alternation lets you choose a match among a delimited list of patterns. The alternation metacharacter is simply the | (vertical pipe), which separates the patterns in the list. Sometimes, you’ll need to wrap your alternation in parentheses to make sure your intended logic is clear.