Regular Expressions Succinctly®
by Joseph D. Booth

CHAPTER 10

Groups


In the previous chapters, we explored searching with regex, but did not do very much with the results; we simply checked to see if the conditions were met, and possibly used the result to illustrate in the RTF control where the text was found. In the next few chapters, we are going to focus on using the results for further processing, such as data cleaning, text replacements, etc.

Regex Groups

A group in a regular expression is an embedded regular expression within a larger one. You can name the groups or refer to them by numeric position in a collection. By using groups, you can perform multiple operations on a string and gather multiple results. You can also refer to other groups from within the expression, for example to search for duplicate words.

Capturing Groups

To create a group to capture a pattern, you must enclose a regular expression within parentheses. This will create a subexpression pattern within the overall pattern. The subexpression can be any valid regular expression pattern inside the parentheses. For example, if we take a very simple phone number pattern of \({0,1}\d{3}\){0,1}[/\\ -]\d{3}-\d{4}, it will match phone numbers of the following forms:

  • 610-555-1212
  • (215) 555-1212
  • 609\555-1212

However, our requirement is to split the phone number into area code and phone number. While the pattern will detect valid numbers, we won’t know what the area code and the phone number are. By using groups, we can both search for phone numbers and split them into two component pieces.

\({0,1}(\d{3})\){0,1}[/\\ -](\d{3}-\d{4})

By placing parentheses around the area code and the phone number patterns, we now not only get a valid phone number, but the phone number is also broken into two pieces, one for each sub-pattern. Each component piece is referred to as a group, and the groups are numbered from left to right, starting at the index of 1.

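Note: Capturing groups are numbered starting at 1, rather than 0 as most .NET collections are; index 0 of the Groups collection holds the entire match (see Group 0 later in this chapter). Keep this in mind when working with regex.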

When we now process this regex (using Match to obtain the match object), the match object will contain a collection property called Groups. In this example, there will be two capturing groups: Group 1 containing the area code digits, and Group 2 containing the actual phone number.
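As a minimal sketch of reading the Groups collection, the following C# snippet runs the phone number pattern above against a sample string. The sample input and the variable names are assumptions made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Phone pattern from the text: group 1 is the area code, group 2 the number.
string pattern = @"\({0,1}(\d{3})\){0,1}[/\\ -](\d{3}-\d{4})";

// Sample input is an assumption for this sketch.
Match match = Regex.Match("Call (215) 555-1212 today", pattern);

// Capturing groups are read by their 1-based number.
string areaCode = match.Groups[1].Value;
string number = match.Groups[2].Value;

if (match.Success)
{
    Console.WriteLine($"Area code: {areaCode}");   // Area code: 215
    Console.WriteLine($"Number: {number}");        // Number: 555-1212
}
```

If the match fails, both groups simply report an empty Value, so checking Success first is the safer habit.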

Naming Groups 

The Microsoft .NET framework (and other regex engines) allows you to assign names to the groups within a sub-expression. To assign a name to a group, you can use either of the following two methods:

  • (?<name>regular expression)
  • (?'name'regular expression)

You can refer to the group by name in the expression using the \k<name> or \k'name' syntax. We can add names to our phone number expression, as shown below:

\({0,1}(?<areacode>\d{3})\){0,1}[/\\ -](?<number>\d{3}-\d{4})

When using the named approach, the Groups property of the match object can be referenced by either name (case sensitive) or index number. In the example above, Groups[1] and Groups["areacode"] both refer to the same capturing group. One caveat, though: unnamed groups are numbered first, ahead of any named groups, so if we used the following pattern:

\({0,1}(?<areacode>\d{3})\){0,1}[/\\ -](\d{3}-\d{4})

The phone number would be Group 1, and the group named areacode would be Group 2. To avoid this problem, I recommend being consistent with group naming: either name all groups or none. Group names can also make your regular expression more readable, particularly when most regex engines support up to 99 groups within a pattern.
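The numbering caveat is easy to check for yourself. The sketch below applies the mixed pattern (one named group, one unnamed group) to a sample number; the input and variable names are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Mixed pattern from the text: areacode is named, the phone number is not.
Regex mixed = new Regex(@"\({0,1}(?<areacode>\d{3})\){0,1}[/\\ -](\d{3}-\d{4})");
Match match = mixed.Match("610-555-1212");

string byName = match.Groups["areacode"].Value;   // named lookup
string groupOne = match.Groups[1].Value;          // unnamed group is numbered first
string groupTwo = match.Groups[2].Value;          // named group is numbered afterward
int areaCodeNumber = mixed.GroupNumberFromName("areacode");

Console.WriteLine(byName);          // 610
Console.WriteLine(groupOne);        // 555-1212
Console.WriteLine(groupTwo);        // 610
Console.WriteLine(areaCodeNumber);  // 2
```

Looking groups up by name, as in the first lookup, sidesteps the numbering question entirely.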

Note: Naming of groups is one area that regex engines often vary in. Python’s regex engine uses the ?P<name> syntax to name groups and the ?P=name to reference them. Python was the first to offer named groups, but no standard was set, leaving .NET engineers to define their own syntax for group names.

Multiple Groups

You can create a regex using groups and quantifiers; however, the regex engine will only put the last match into the group. For example, the following expression asks for the function name and all parameters from a SQL function call (for simplicity, we have removed the parameter data types).

(?<Method>\w*\()(?<Parameter>@\w*[,)])+

If we run our code against this example function:

Function GetName(@ObjId,@runDate) as INT

BEGIN

END

We will get two groups back. The first one, called <Method>, will contain GetName( and the second group, <Parameter>, will contain @runDate). Only the last parameter will be assigned to the group.

Fortunately, the .NET regex engine also saves every piece of text that matched the group. While only the last one is returned as the value of the group object, all of them are available in the Captures collection for the group.

Tip: The Captures collection is zero-based, while capturing groups in the Groups collection are numbered from one (index 0 holds the entire match). Keep this in mind if you need to work with both collections.
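To see the difference between a group's value and its Captures collection, here is a short sketch that runs the Method/Parameter pattern against the first line of the example function. The variable names are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string pattern = @"(?<Method>\w*\()(?<Parameter>@\w*[,)])+";
string input = "Function GetName(@ObjId,@runDate) as INT";

Match match = Regex.Match(input, pattern);
Group parameters = match.Groups["Parameter"];

// The group itself only reports the last repetition.
string lastValue = parameters.Value;

// Captures keeps every repetition, in order, starting at index 0.
List<string> allValues = new List<string>();
foreach (Capture capture in parameters.Captures)
{
    allValues.Add(capture.Value);
}

Console.WriteLine(match.Groups["Method"].Value);   // GetName(
Console.WriteLine(lastValue);                      // @runDate)
Console.WriteLine(string.Join(" ", allValues));    // @ObjId, @runDate)
```

Each captured value still carries its trailing comma or parenthesis, because that delimiter is inside the group in the pattern.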

Back References

Your regular expression can make references to capturing groups using a back reference. Since each group is numbered (starting left to right at one), you can refer to a group by a backslash followed by a group number, or the \k<name> syntax for named groups.

As an example, let’s consider how we might write a regex pattern to detect duplicate words in a sentence. To describe this in English, we would say: get the first word, then the next word. If the next word is the same as the previous word, report it.

Table 13: Back References

English rule              Regex pattern
Find a word               (\w+)
A space or comma          [\s,]
The last word we found    (\1)

Suppose we enter the following sentences in our test program:

She thought her test was very, very unfair

However, her professor said the the test was actually pretty easy

Applying our regex pattern of (\w+)[\s,](\1) finds the duplicate "the the" in the second sentence. The repeated "very, very" in the first sentence is separated by two characters (a comma and a space), so the single-character class [\s,] does not match it; writing the separator as [\s,]+ reports that duplicate as well.

You don’t need to make the backreference a group, as the pattern (\w+)[\s,]\1 would give the same results; however, no information would be available about the second duplicate word in the string.
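The duplicate-word search can be sketched in C# as follows. The first pattern is the one from the text; the second, with a + after the separator class, is a variation added here (not from the text) so that a comma followed by a space is also accepted between the two words.

```csharp
using System;
using System.Text.RegularExpressions;

string fromText = @"(\w+)[\s,](\1)";    // pattern from the text
string widened = @"(\w+)[\s,]+(\1)";    // variation: one or more separators

string first = "She thought her test was very, very unfair";
string second = "However, her professor said the the test was actually pretty easy";

Match theThe = Regex.Match(second, fromText);
Match veryStrict = Regex.Match(first, fromText);
Match veryWide = Regex.Match(first, widened);

Console.WriteLine(theThe.Value);          // the the
Console.WriteLine(veryStrict.Success);    // False
Console.WriteLine(veryWide.Value);        // very, very

// Because the back reference is wrapped in parentheses, group 2
// reports where the second copy of the word sits in the string.
Console.WriteLine(theThe.Groups[2].Value);   // the
```

Note that \w+ may also match the tail of one word and the head of the next; adding \b word boundaries around the group is one way to tighten the pattern if that matters for your data.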

Group 0

The regex engine has a group 0, which holds the text matched by the entire regex pattern (as though you had put parentheses around the entire expression). If you are using groups within your pattern, you probably don’t need to reference group 0 at all, but it can be convenient for getting the captures evaluated by the entire expression.

Note: In our regex tester program, we hide group 0 from the tree view display, but you can replace the if (groupName != "0") with if (true) to display the group 0 details and capture information in the tree view.

Non-Existing Groups

If you try to reference a non-existing group, the group object will return a value of false for the Success property. You might also have an empty group: a group that was requested in the pattern, but never found in the search string. For example, if we were to validate a phone number, we might want to look for a left parenthesis to start the area code, which would look like the following:

(\()?

This pattern looks for the parenthesis character zero or one times. If the left parenthesis is found, it will be placed in a numbered group. You need to capture the group so you can later find out whether the parenthesis was found (and hence look for the matching right parenthesis).
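A small sketch of checking the Success property follows. The \d{3} appended to the optional parenthesis group, the sample inputs, and the group index 5 are assumptions chosen only to make the example self-contained.

```csharp
using System;
using System.Text.RegularExpressions;

// Optional left parenthesis (group 1) followed by three digits.
Regex regex = new Regex(@"(\()?\d{3}");

Match withParen = regex.Match("(215");
Match withoutParen = regex.Match("610");

bool found = withParen.Groups[1].Success;       // group 1 captured "("
bool empty = withoutParen.Groups[1].Success;    // group 1 exists but captured nothing
bool missing = withoutParen.Groups[5].Success;  // the pattern has no group 5

Console.WriteLine(found);     // True
Console.WriteLine(empty);     // False
Console.WriteLine(missing);   // False
```

Both the empty group and the non-existing group report an empty string for Value, so Success is the property to test.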

Testing for Groups

You can use the regex conditional pattern to test whether a group has a value or not. The pattern is indicated by (?(group number or name)<pattern>). This construct says, if the group is found, apply the pattern indicated. If not, then do nothing. You can also use the | character to act as an else statement, in case you want to use one pattern if the group is found and another pattern if it is not. The general syntax is as follows:

(?(group name or number)<true-pattern>|<false-pattern>)
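As a sketch of the conditional construct, the pattern below requires a closing parenthesis after the area code only when an opening parenthesis was captured in group 1. The pattern, the anchors, and the sample inputs are assumptions for illustration, not the phone pattern used earlier in the chapter.

```csharp
using System;
using System.Text.RegularExpressions;

// (\()?       optional left parenthesis, captured as group 1
// (?(1)\))    if group 1 was found, require a right parenthesis
Regex phone = new Regex(@"^(\()?\d{3}(?(1)\))[ -]\d{3}-\d{4}$");

bool balanced = phone.IsMatch("(215) 555-1212");
bool noParens = phone.IsMatch("215-555-1212");
bool missingClose = phone.IsMatch("(215-555-1212");
bool strayClose = phone.IsMatch("215) 555-1212");

Console.WriteLine(balanced);       // True
Console.WriteLine(noParens);       // True
Console.WriteLine(missingClose);   // False
Console.WriteLine(strayClose);     // False
```

Here the false-pattern is omitted, so nothing extra is required when the left parenthesis is absent.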

Forward References

A forward reference is a group reference number that is found in the pattern prior to the group it references. The regex engine can handle forward references as well as backward references.
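Forward references are mostly useful inside a repeated group, where the referenced group has already captured something on an earlier pass. The sketch below uses a commonly cited illustrative pattern (an assumption here, not from the text) in which \2 appears before group 2 is defined.

```csharp
using System;
using System.Text.RegularExpressions;

// Group 1 is the outer repeated group; group 2 is (one).
// On the first pass \2 has no value, so the engine takes the (one) branch.
// On the second pass \2 holds "one", so "\2two" can match "onetwo".
Regex forward = new Regex(@"(\2two|(one))+");

Match match = forward.Match("oneonetwo");

Console.WriteLine(match.Value);             // oneonetwo
Console.WriteLine(match.Groups[2].Value);   // one
```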

Non-Capturing Groups

Generally, when you create sub-patterns using the group construct, you want to do something with the groups. However, you may need a group purely for searching purposes, with no plans to use its result. When this occurs, you should use a non-capturing group, which is a group that takes part in the search but cannot be referenced in the pattern. A non-capturing group will not be returned in the regex group collection.

Creating a Non-Capturing Group

To create a non-capturing group, use the (?:<pattern>) syntax. You should create non-capturing groups if you will never reference the group (either through a back reference or in the group collection in the regex object).

There is some memory overhead associated with saving groups, so using a non-capturing group will (very slightly) reduce the memory used by the pattern.
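Here is a minimal sketch of a non-capturing group in use. The title/surname pattern and the sample input are assumptions for illustration: the alternation needs parentheses to group the titles, but only the surname is worth capturing.

```csharp
using System;
using System.Text.RegularExpressions;

// (?:Mr|Mrs|Ms) groups the alternatives without creating a numbered group,
// so the surname is group 1.
Regex title = new Regex(@"(?:Mr|Mrs|Ms)\. (\w+)");

Match match = title.Match("Dear Mrs. Smith");

string surname = match.Groups[1].Value;
int groupCount = match.Groups.Count;   // group 0 plus one capturing group

Console.WriteLine(surname);      // Smith
Console.WriteLine(groupCount);   // 2
```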

Summary

Groups of sub-patterns within a regex come in very handy to slice out parts of the string and are very useful to replace and reformat strings. You should name groups you need and set the non-capturing option for the groups which are needed for processing, but not needed by your code.
