CHAPTER 5
One of the problems we might run into with simple character sets and quantifiers in a regex is that the search string might be embedded within a larger string, and we might not want this. For example, if we were to use ‘com’ to search for an internet URL extension, it would also find comedy, comic, become, etc.
Let’s assume we got this message from our help system.
Hey Suzie,
I called March 26th about Internet issues on hulu.com and ebay.com. I can’t connect to either one, so I am missing my favorite sitcoms. I also want to sell some of my comic books, but I can’t get on eBay either. Call me at (619) 555-1212.
Thanks
Tom
If we asked the regex engine to find the pattern com (global and case insensitive), we might get the result below:
I called March 26th about Internet issues on hulu.com and ebay.com. I can’t connect to either one, so I am missing my favorite sitcoms. I also want to sell some of my comic books, but I can’t get on eBay either.
That’s not quite what we were expecting.
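To make the unwanted hits concrete, here is a minimal Python sketch of the same search. The variable names are illustrative only, and the text is the body of the sample message typed with plain apostrophes.

```python
import re

# Body of the sample help-desk message from this chapter.
text = (
    "I called March 26th about Internet issues on hulu.com and ebay.com. "
    "I can't connect to either one, so I am missing my favorite sitcoms. "
    "I also want to sell some of my comic books, but I can't get on eBay either."
)

# re.findall returns every non-overlapping match (the "global" behavior);
# re.IGNORECASE makes the search case insensitive.
all_hits = re.findall("com", text, re.IGNORECASE)

# List the whitespace-delimited chunks that contain a hit.
chunks_with_hit = [w for w in text.split() if re.search("com", w, re.IGNORECASE)]

print(len(all_hits))
print(chunks_with_hit)
```

The pattern hits inside sitcoms and comic as well as the two domain names, which is the problem anchors are meant to solve.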
An anchor metacharacter allows us to restrict which matches are acceptable, based on the position in the string and/or the characters around the match.
For example, the ^ (when outside of square brackets) and the $ are anchor metacharacters with special meaning.
Although our examples so far have dealt with searching a larger string, regex patterns are also often used to validate that an entire string matches a pattern. For example, if we designed an input form asking for the user’s social security number, we could use the ^\d{3}-\d{2}-\d{4}$ pattern to ensure not only that what they entered looks like a valid SSN, but also that nothing else was entered in the text field.
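As a rough illustration of validation with ^ and $, the following Python sketch wraps the SSN pattern in a helper. The function name looks_like_ssn is hypothetical, and the check only tests the shape of the input, not whether the number is real.

```python
import re

# The pattern from the text: three digits, dash, two digits, dash, four digits,
# with ^ and $ forcing the match to span the whole input.
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")

def looks_like_ssn(value: str) -> bool:
    """Return True when the entire input has the SSN shape."""
    return SSN_PATTERN.search(value) is not None

# Without the anchors, extra text around the number is silently accepted.
unanchored = re.search(r"\d{3}-\d{2}-\d{4}", "SSN: 123-45-6789 thanks") is not None

print(looks_like_ssn("123-45-6789"))       # shape matches
print(looks_like_ssn("SSN: 123-45-6789"))  # leading text is rejected
print(looks_like_ssn("123-45-67890"))      # extra digit is rejected
print(unanchored)
```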
By default, the ^ and $ anchors refer to the first and last positions in a string; however, the regex option multi-line changes that behavior, so that the ^ and $ anchors refer to the beginning and ending of a line, instead of the entire string.
Note: Anchor characters do not take up a position in the string; they simply indicate to the regex engine what is expected around the search string.
In our help system, the last word of the message is fairly likely to be the person’s name, so we can use the regex pattern \w{1,}\s{0,}$ to find the last word of the text. The pattern says to find a word of any length, \w{1,}, followed by any number of whitespace characters, \s{0,}, and then the end of the string. Often, the final character of a string is a line-feed, even if it is not visible in the text. So the expression \w{1,}$ might appear valid, but it won’t match if whitespace exists after the final word. (Some flavors, including .NET, also let $ match just before a single line-feed at the very end of the string, but any other trailing whitespace still defeats the shorter pattern.)
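The effect of trailing whitespace can be sketched in Python as follows. The message below is a shortened stand-in for the sample text, with a space and a line-feed deliberately left after the name.

```python
import re

# Shortened version of the sample message; note the trailing space and line-feed.
message = "Hey Suzie,\nCall me at (619) 555-1212.\nThanks\nTom \n"

# Last word, allowing any amount of whitespace before the end of the string.
tolerant = re.search(r"\w{1,}\s{0,}$", message)
last_word = tolerant.group().strip() if tolerant else None

# The shorter pattern fails here because of the space after "Tom".
strict = re.search(r"\w{1,}$", message)

print(last_word)
print(strict)
```

The match itself includes the trailing whitespace, so the sketch strips it before using the name.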
Finding the first word in a string is simpler, using the pattern ^\w{1,}, which says to look for one or more word characters starting at the beginning of the string. If you use this in conjunction with the multi-line regex option, you can find the first word of every line.
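A small Python sketch of both cases, assuming a shortened version of the sample message; re.MULTILINE is Python’s name for the multi-line option.

```python
import re

# Shortened version of the sample message.
message = "Hey Suzie,\nI called March 26th about Internet issues.\nThanks\nTom"

# Default behavior: ^ matches only at the start of the whole string.
first_word = re.search(r"^\w{1,}", message).group()

# Multi-line option: ^ also matches at the start of every line.
line_starts = re.findall(r"^\w{1,}", message, re.MULTILINE)

print(first_word)
print(line_starts)
```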
Tip: Some flavors of regex (including Microsoft .NET) support the \A and \Z anchors, which match the beginning and end of the string regardless of the multi-line setting. However, JavaScript and some other regex flavors do not support these anchors.
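Python’s re module is one of the flavors that supports \A and \Z, so it can serve for a quick sketch of the difference. One caveat: Python’s \Z matches only at the very end of the string, whereas .NET’s \Z also matches just before a final line-feed, so the behavior shown here should not be assumed for every flavor.

```python
import re

message = "Hey Suzie,\nThanks\nTom"

# With the multi-line option, ^ matches at the start of every line ...
caret_hits = re.findall(r"^\w+", message, re.MULTILINE)

# ... while \A and \Z stay pinned to the start and end of the whole string.
string_start = re.findall(r"\A\w+", message, re.MULTILINE)
string_end = re.findall(r"\w+\Z", message, re.MULTILINE)

print(caret_hits)
print(string_start)
print(string_end)
```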
Another very common anchor is the \b metacharacter, which is used to indicate a word boundary. In our previous example, if we searched for \bcom\b rather than just com, we would find only the com in the hulu.com and ebay.com websites. You will often see this referred to as a whole word search. The \b does not take up a character position in the search string; it is only used to “look around” the search text.
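The whole word search can be tried with a short Python sketch. The five-character look-back used to print context is only a convenience for this particular text, not part of the pattern.

```python
import re

text = (
    "I called March 26th about Internet issues on hulu.com and ebay.com. "
    "I can't connect to either one, so I am missing my favorite sitcoms. "
    "I also want to sell some of my comic books, but I can't get on eBay either."
)

# \b on both sides: com must stand alone as a whole word.
whole_word_hits = list(re.finditer(r"\bcom\b", text, re.IGNORECASE))

# Show each hit with the five characters before it for context.
contexts = [text[m.start() - 5:m.end()] for m in whole_word_hits]

print(contexts)
```

Only the two domain names remain; sitcoms and comic are no longer reported.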
A word boundary occurs before the first character in a string (if that character is a word character), after the last character in the string (if it is a word character), or between two characters where one is a word character (\w) and the other isn’t (\W).
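One way to see where the boundaries fall is to ask the engine for every position at which \b matches. This is a sketch in Python; the helper name boundary_positions is made up for the illustration.

```python
import re

def boundary_positions(s: str) -> list[int]:
    """Return the index of every position in s where \\b matches."""
    # \b matches an empty string, so each match reports a position, not a character.
    return [m.start() for m in re.finditer(r"\b", s)]

print(boundary_positions("hulu.com"))  # boundaries around "hulu" and around "com"
print(boundary_positions(".com."))     # no boundary at either end of the string
```

The second call shows that the start and end of a string count as boundaries only when the adjacent character is a word character.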
Anchors do not need to be used in pairs; for example, using the pattern \bcom would find any reference to com by itself or at the beginning of a word. So .com and comic would match, but sitcoms would not.
You can use the \B anchor, the opposite of \b, to find patterns embedded within words. For example, \Bcom\B would find only the com in sitcoms in our example text, since it is the only occurrence of com that has a word character on either side, even though those word characters are not considered part of the found result.
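The two patterns \bcom and \Bcom\B can be compared side by side with a Python sketch. The helper containing_chunks is hypothetical; it simply widens each match to the surrounding run of non-space characters so the hits are easier to read.

```python
import re

text = (
    "I called March 26th about Internet issues on hulu.com and ebay.com. "
    "I can't connect to either one, so I am missing my favorite sitcoms. "
    "I also want to sell some of my comic books, but I can't get on eBay either."
)

def containing_chunks(pattern: str, s: str) -> list[str]:
    """Return the space-delimited chunk around each match of pattern in s."""
    chunks = []
    for m in re.finditer(pattern, s, re.IGNORECASE):
        start = s.rfind(" ", 0, m.start()) + 1   # just after the previous space
        end = s.find(" ", m.end())               # next space, or -1 if none
        chunks.append(s[start:len(s) if end == -1 else end])
    return chunks

starts_word = containing_chunks(r"\bcom", text)     # com alone or at the start of a word
inside_word = containing_chunks(r"\Bcom\B", text)   # com with word characters on both sides

print(starts_word)
print(inside_word)
```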
Anchors differ in a subtle way from using other characters to help isolate words. Referring to our sample text from above:
I called March 26th about Internet issues on hulu.com and ebay.com. I can’t connect to either one, so I am missing my favorite sitcoms. I also want to sell some of my comic books, but I can’t get on eBay either.
If we use the pattern \bcom\b, we will find the com domain name for hulu.com and ebay.com, but the delimiting characters (dot at the beginning and either whitespace or dot at the end) are not part of the result.
If we use the pattern \Wcom\W, we will find the same two locations, except that this time the leading dot and the trailing space or dot will be included in the result string. While this distinction might not be important for searching, it can become an issue in later chapters as we try to manipulate the values returned by the regex engine.
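The difference in the returned text is easy to see in a short Python sketch. The last check uses a made-up string, "visit hulu.com", to show one further consequence: \W must consume a real character, so it cannot match at the very end of a string, while \b can.

```python
import re

text = (
    "I called March 26th about Internet issues on hulu.com and ebay.com. "
    "I can't connect to either one, so I am missing my favorite sitcoms. "
    "I also want to sell some of my comic books, but I can't get on eBay either."
)

# Anchors: the delimiters are checked but not returned.
with_anchors = re.findall(r"\bcom\b", text)

# Character classes: the delimiters are consumed and become part of the result.
with_classes = re.findall(r"\Wcom\W", text)

# At the end of a string there is no character for the trailing \W to consume.
anchor_at_end = re.search(r"\bcom\b", "visit hulu.com") is not None
class_at_end = re.search(r"\Wcom\W", "visit hulu.com") is not None

print(with_anchors)
print(with_classes)
print(anchor_at_end, class_at_end)
```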
In this chapter, we covered the anchor metacharacters. While they do not become part of the matched text, they help ensure you find the result in the location where you expect it. Table 9 shows the anchor characters.
Table 9: Anchor Characters
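Character | Description |
^ | Beginning of string or line, depending on options (when outside []) |
$ | End of string or line (depending on options) |
\b | Word boundary (position between a word character and a non-word character or string edge) |
\B | Not a word boundary (position where \b does not match) |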