CHAPTER 11
Look-arounds allow a regular expression to find patterns that are either followed or preceded by another pattern. Look-arounds are not matches or groups themselves; they are assertions that must hold before the rest of the pattern can match, and they do not consume any characters. You can use look-arounds to check that patterns are or are not found before or after the pattern you are searching for.
Look-aheads are used to test whether your pattern is followed (or not followed) by another pattern. For example, suppose we have a list of cities and state codes. We want to find any city that is in New Jersey, Pennsylvania, or Delaware, and ignore all other states. Our sample data is shown below:
Baltimore MD
Trenton, NJ
Philadelphia PA
Rochester, NY
Claymont, DE
Our English and regex rules are shown below. We are using the positive look ahead pattern of (?=<pattern>).
Table 14: Positive Look Ahead
English rule | Regex pattern |
Start a named group | (?<city> |
Any word | \w+ |
Close group | ) |
One or more commas or spaces | [, ]+ |
Must be followed by NJ, PA, or DE | (?=NJ|PA|DE |
Close the look ahead | ) |
If we use the regex pattern (?<city>\w+)[, ]+(?=NJ|PA|DE), the following lines are matched. Note that while the comma or space is matched as part of the search, it is not part of the city group's captured text, and the state code checked by the look-ahead is not part of the match at all.
Trenton, NJ
Philadelphia PA
Claymont, DE
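The positive look-ahead above can be tried out with a short script. This is a minimal sketch in Python, which is an assumption of this example rather than the engine used in the text; Python spells named groups as (?P<name>...) instead of (?<name>...), but the look-ahead syntax is the same. The variable names are illustrative only.

```python
import re

# Sample data from the text, one city per line.
SAMPLE = """Baltimore MD
Trenton, NJ
Philadelphia PA
Rochester, NY
Claymont, DE"""

# Python writes a named group as (?P<city>...); the look-ahead (?=...) is unchanged.
CITY_PATTERN = re.compile(r"(?P<city>\w+)[, ]+(?=NJ|PA|DE)")

# The named group holds only the city name.
cities = [m.group("city") for m in CITY_PATTERN.finditer(SAMPLE)]

# The full match includes the separator but stops before the state code,
# because the look-ahead checks the text without consuming it.
matched_text = [m.group(0) for m in CITY_PATTERN.finditer(SAMPLE)]

print(cities)
print(matched_text)
```

Printing both lists side by side makes the difference between the captured group and the overall match visible.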
The regex object will create a group containing the city name we found. We could add a second named group to get the state code as well, using the following pattern:
(?<city>\w+)[, ]+(?=(?<state>NJ|PA|DE))
This type of pattern is called a positive look-ahead: the engine looks at the text after the potential match and accepts the match only if the look-ahead pattern is found there.
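A group placed inside a look-ahead still captures text even though the look-ahead itself consumes nothing. The sketch below illustrates this in Python (an assumed host language, using its (?P<name>...) spelling for named groups) with the city and state pattern.

```python
import re

SAMPLE = """Baltimore MD
Trenton, NJ
Philadelphia PA
Rochester, NY
Claymont, DE"""

# The state group sits inside the look-ahead, so it is captured but not consumed.
CITY_STATE = re.compile(r"(?P<city>\w+)[, ]+(?=(?P<state>NJ|PA|DE))")

pairs = [(m.group("city"), m.group("state")) for m in CITY_STATE.finditer(SAMPLE)]
print(pairs)

# The overall match still ends before the state code.
first = CITY_STATE.search(SAMPLE)
print(repr(first.group(0)))
```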
One approach to understand look-aheads is to consider it a conditional expression. We are saying you can accept this pattern IF and ONLY if you find an additional pattern in the string as well. Let’s see how we might use this to ensure a password matches the business rules.
Our password rules are very simple: the password must be between 6 and 15 characters long and contain at least one uppercase letter and at least one number. Of course, our definition of simple and our users' definition might differ a tad. Since we expect to test the entire string, we will use anchors as well.
We start with a rule to create a password group, only if we find a string of word characters (letters, digits, or underscores) between 6 and 15 characters long.
^(?<pwd>(?=\w{6,15}$).*)
Essentially, our group will match all characters (.*) as long as the condition of length between 6 and 15 characters is met. However, we need additional conditions, so we add another look ahead pattern to the expression. This one ensures that in addition to being between 6 and 15 characters, at least one digit is found.
^(?<pwd>(?=\w{6,15}$)(?=.*?\d).*)
Our last rule requires an upper case letter, so we add that look-ahead pattern as well.
^(?<pwd>(?=\w{6,15}$)(?=.*?\d)(?=.*?[A-Z]).*)
Our final expression says “accept all characters as long as the three look-ahead conditions are satisfied.” You can have multiple look-aheads in your pattern to make the search rules as detailed as you need them to be.
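The three stacked look-aheads can be exercised with a small validation function. This is a sketch in Python (an assumption; the named group is written (?P<pwd>...) there), and the function name is_valid and the sample passwords are hypothetical.

```python
import re

# Length check, digit check, and uppercase check, each as its own look-ahead.
PASSWORD = re.compile(r"^(?P<pwd>(?=\w{6,15}$)(?=.*?\d)(?=.*?[A-Z]).*)")


def is_valid(candidate: str) -> bool:
    """Return True when the candidate satisfies all three look-ahead conditions."""
    return PASSWORD.match(candidate) is not None


# Hypothetical sample passwords, one per rule.
checks = {
    "Secret1": is_valid("Secret1"),                      # meets every rule
    "secret1": is_valid("secret1"),                      # no uppercase letter
    "Secrets": is_valid("Secrets"),                      # no digit
    "Ab1": is_valid("Ab1"),                              # too short
    "Abcdefghijklmno1": is_valid("Abcdefghijklmno1"),    # 16 characters, too long
}
for password, ok in checks.items():
    print(password, ok)
```

Because each rule lives in its own look-ahead, a rule can be added or removed without touching the others.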
Now, let’s imagine I won the lottery and decided to buy myself a Lexus. However, I don’t want a hybrid (call me eco-unfriendly). So I want to scan the online ads for a Lexus and skip any ad for a hybrid. In English terms: find any Lexus model UNLESS it is a hybrid. To build the expression, we use the (?!<pattern>) negative look-ahead, as Table 15 shows.
Table 15: Negative Look Ahead
English rule | Regex pattern |
The word Lexus | Lexus |
One or more whitespace characters | \s+ |
Start a named group | (?<model> |
Two uppercase letters | [A-Z]{2} |
Close the group | ) |
Not followed by a space and Hybrid | (?!\sHybrid) |
Now, we find a classified ad with the following text:
I am selling my 2013 Lexus CT Hybrid, stop by for a test drive.
The regex pattern will not find a match in the above text. However, it will match the following ad:
I am selling a 2012 Lexus RX with low mileage and a great price!
This ad will be matched by the pattern Lexus\s+(?<model>[A-Z]{2})(?!\sHybrid) and the model group will contain RX.
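Both classified ads can be run through the negative look-ahead to confirm the behavior. The following is a sketch in Python (an assumed language, with the named group written (?P<model>...)); the helper name find_model is hypothetical.

```python
import re

# Negative look-ahead: the two-letter model must not be followed by " Hybrid".
LEXUS = re.compile(r"Lexus\s+(?P<model>[A-Z]{2})(?!\sHybrid)")

HYBRID_AD = "I am selling my 2013 Lexus CT Hybrid, stop by for a test drive."
REGULAR_AD = "I am selling a 2012 Lexus RX with low mileage and a great price!"


def find_model(ad: str):
    """Return the model code for a non-hybrid Lexus ad, or None if no match."""
    m = LEXUS.search(ad)
    return m.group("model") if m else None


print(find_model(HYBRID_AD))
print(find_model(REGULAR_AD))
```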
Now, let’s revisit our password regex. Users don’t like the uppercase requirement, so we decide to remove that. Our simpler regex pattern is now this:
^(?<pwd>(?=\w{6,15}$)(?=.*?\d).*)
However, we don’t want to accept passwords that are too easy to guess, so we are going to add some conditional logic to reject certain passwords. We do this with a negative look-ahead. Our new pattern is:
^(?<pwd>(?=\w{6,15}$)(?=.*?\d)(?!abc123|123456|password\d).*)
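Our pattern now says the password must be between 6 and 15 characters long, contain at least one digit, AND cannot begin with abc123, 123456, or the word password followed by a digit. Because the negative look-ahead is tested at the start of the string, it also rejects longer passwords that merely start with one of these sequences.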
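A quick check shows how the negative look-ahead interacts with the two positive ones. This sketch assumes Python (named group spelled (?P<pwd>...)); the function name is_valid and the sample passwords are hypothetical.

```python
import re

# Two positive look-aheads (length, digit) plus one negative look-ahead
# that rejects a few easily guessed openings.
PASSWORD = re.compile(
    r"^(?P<pwd>(?=\w{6,15}$)(?=.*?\d)(?!abc123|123456|password\d).*)"
)


def is_valid(candidate: str) -> bool:
    """Return True when the candidate passes every look-ahead condition."""
    return PASSWORD.match(candidate) is not None


for candidate in ("hunter42", "abc123", "123456", "password1",
                  "abc123xyz", "mypassword1"):
    print(candidate, is_valid(candidate))
```

The last two candidates show that the rejection is tied to the start of the string: abc123xyz is refused, while mypassword1 is accepted because the forbidden text does not appear at the beginning.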
Look-behinds work similarly to look-aheads, except they check for patterns that precede the search pattern. The look-behind syntax is:
(?<= <pattern> ) Positive look-behind, pattern must be found before result text
(?<! <pattern> ) Negative look-behind, pattern must not be found before result text
As an example, let’s consider a JSON file that contains address information for several people.
{
  "firstName": Michelle,
  "lastName": Obama,
  "age": 50,
  "address": {
    "streetAddress": 1600 Pennsylvania Ave,
    "city": Washington,
    "state": DC,
    "zipCode": 20500
  },
}
{
  "firstName": Mary,
  "lastName": Bachman,
  "age": 24,
  "address": {
    "streetAddress": 105 Fayette Street,
    "city": Conshohocken,
    "state": PA,
    "zipCode": 19428
  },
}
Our application needs to extract everyone who lives in Pennsylvania from this JSON file, so we need to confirm the current record has a state code of PA or the state spelled out. However, we don’t simply want to search for the state name, because there is a rather famous building on Pennsylvania Avenue in Washington DC.
The regex pattern to find Pennsylvania residents is simply:
(?<state>PA|Pennsylvania)
The pattern means find any PA or Pennsylvania and place it in a group called state. Of course, in this example, the JSON entry for Michelle Obama will also show up, since she lives on Pennsylvania Ave.
To ensure we find only state codes from the “state” entry in the JSON file, we use a positive look-behind. It is constructed as shown below:
Table 16: Positive Look-Behinds
English rule | Regex pattern |
Must find “state”: first | (?<=\"state\":) |
Now a space | [ ] |
Any number of characters | .* |
Start a named group | (?<state> |
Either PA or Pennsylvania | PA|Pennsylvania |
Close the group | ) |
Our resulting regex pattern is defined below:
(?<=\"state\":)[ ].*(?<state>PA|Pennsylvania)
This will find Mary’s record since she lives in the state of PA and will skip Michelle’s record since she lives in DC. We could also add a look ahead, in this case, if we want to make sure a zip code field can also be found:
(?<=\"state\":)[ ].*(?<state>PA|Pennsylvania)[, ].*(?=\"zipCode\")
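Note: This pattern requires the Single Line option, which allows the period character to match the new line character as well. Be aware that with this option the .* portions can also run across record boundaries, so a match may begin at the “state” entry of an earlier record.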
You can freely combine look-aheads and look-behinds to focus on the exact text you are looking for.
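The three state patterns discussed above can be compared side by side. The sketch below assumes Python (named group spelled (?P<state>...), Single Line expressed as re.DOTALL) and, as a further assumption, applies each pattern to one record at a time so that .* cannot wander from one person's record into the next. The helper find_state is hypothetical.

```python
import re

MICHELLE = """{
  "firstName": Michelle,
  "lastName": Obama,
  "age": 50,
  "address": {
    "streetAddress": 1600 Pennsylvania Ave,
    "city": Washington,
    "state": DC,
    "zipCode": 20500
  },
}"""

MARY = """{
  "firstName": Mary,
  "lastName": Bachman,
  "age": 24,
  "address": {
    "streetAddress": 105 Fayette Street,
    "city": Conshohocken,
    "state": PA,
    "zipCode": 19428
  },
}"""

# Plain search: also finds the street name.
NAIVE = re.compile(r"(?P<state>PA|Pennsylvania)")
# Positive look-behind: only text after "state": qualifies.
LOOK_BEHIND = re.compile(r'(?<="state":)[ ].*(?P<state>PA|Pennsylvania)')
# Look-behind plus look-ahead; DOTALL lets the period cross line breaks.
COMBINED = re.compile(
    r'(?<="state":)[ ].*(?P<state>PA|Pennsylvania)[, ].*(?="zipCode")',
    re.DOTALL,
)


def find_state(pattern, record):
    """Return the captured state for one record, or None when nothing matches."""
    m = pattern.search(record)
    return m.group("state") if m else None


for name, record in (("Michelle", MICHELLE), ("Mary", MARY)):
    print(name,
          find_state(NAIVE, record),
          find_state(LOOK_BEHIND, record),
          find_state(COMBINED, record))
```

The plain pattern reports Pennsylvania for Michelle's record because of the street address, while both look-around versions skip her record and report PA only for Mary's.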
We could use a similar pattern to find the person’s age, matching a two digit number, as long as it was preceded by the keyword “age.”
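One possible form of that age pattern is sketched below in Python (an assumed language). The pattern itself, the group name age, and the shortened sample text are illustrative assumptions, not taken from the original examples.

```python
import re

# Shortened, single-line version of the two records (illustrative only).
TEXT = ('{ "firstName": Michelle, "age": 50, "streetAddress": 1600 Pennsylvania Ave } '
        '{ "firstName": Mary, "age": 24, "streetAddress": 105 Fayette Street }')

# Positive look-behind: a two-digit number counts only when "age": precedes it.
AGE = re.compile(r'(?<="age": )(?P<age>\d{2})')

ages = [m.group("age") for m in AGE.finditer(TEXT)]
print(ages)
```

The street numbers 1600 and 105 are ignored because they are not preceded by the "age" key.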
A negative look-behind says find a pattern that is not preceded by another pattern. The regex syntax is (?<! followed by a pattern and a closing parenthesis. For example, imagine we want to find all IP addresses in a Windows hosts file, but want to exclude those on comment lines. The sample hosts file might look like the following:
# For any domain names you would rather not see, simply add a line
# that reads "127.0.0.1 machine.domain.tld". This will redirect any
# requests to that host to your own computer.
#<localhost>
127.0.0.1 localhost
127.0.0.1 localhost.localdomain
255.255.255.255 broadcasthost
127.0.0.1 local
127.0.0.1 www.badStuff.net
#</localhost>
We now want a simple regex pattern to find IP addresses within the file, unless they are preceded by a # comment marker. The regex pattern would look like this:
(?<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})(?<!#.*)
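The first group looks for a very simple IP address (it does not limit each number to 255, as an actual IP address check would need to do). The second part is not a group at all but a negative look-behind, a zero-width assertion evaluated at the end of the IP address: accept the match only if the text leading up to this point on the same line is not a comment character followed by any number of characters. Because the period does not match a new line by default, the check never reaches back past the start of the current line. Not every regex engine accepts a look-behind of variable length such as this one. The pattern as shown contains no $ anchor; if you add one to tie the match to the end of a line, use the multi-line option so that $ represents end of line.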
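Python's built-in re module only accepts fixed-width look-behinds, so the pattern above cannot be used there unchanged. The sketch below plainly swaps in a different technique, a negative look-ahead anchored at the start of each line, to get a similar result. It is an approximation under stated assumptions: it rejects any line that contains a # anywhere, including one after the IP address, which is stricter than the look-behind version.

```python
import re

HOSTS = '''# For any domain names you would rather not see, simply add a line
# that reads "127.0.0.1 machine.domain.tld". This will redirect any
# requests to that host to your own computer.
#<localhost>
127.0.0.1 localhost
127.0.0.1 localhost.localdomain
255.255.255.255 broadcasthost
127.0.0.1 local
127.0.0.1 www.badStuff.net
#</localhost>'''

# Simple IP shape, with no look-around: also picks up the address in the comment.
PLAIN_IP = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")

# Substitute technique: at each line start, (?!.*#) refuses lines containing '#'.
# MULTILINE makes ^ match at the beginning of every line.
ACTIVE_IP = re.compile(
    r"^(?!.*#).*?(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})",
    re.MULTILINE,
)

all_ips = PLAIN_IP.findall(HOSTS)
active_ips = [m.group("ip") for m in ACTIVE_IP.finditer(HOSTS)]

print(len(all_ips))
print(active_ips)
```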
Look-arounds are useful patterns that let you perform conditional logic as part of your search pattern: accept a match only if one or more other patterns can be found (or not found). They are a very powerful addition to regex pattern-matching logic.