Regular Expressions Succinctly®
by Joseph D. Booth

CHAPTER 17

Other Regex Options

Microsoft’s regex object has a few additional options beyond the ones we’ve covered in this book. Two of these options are particularly useful when working with international text.

CultureInvariant

When the regex engine performs case-insensitive comparisons, it uses the casing rules of the current culture to map equivalent uppercase and lowercase letters. If you need a particular culture, you can use the following code to switch the culture used for processing the regex pattern.

  CultureInfo defaultCulture = Thread.CurrentThread.CurrentCulture;
  Thread.CurrentThread.CurrentCulture = new CultureInfo("fr-FR");
  // Process the regex, using fr-FR (French) casing rules.
  Thread.CurrentThread.CurrentCulture = defaultCulture;  // Restore default culture.

Note: You will need to include the System.Globalization and System.Threading namespaces to use the above code.

For most languages, culture-sensitive casing is acceptable; however, in some languages the comparison might not yield the expected results. In particular, Turkish has two lowercase i's, one dotted (i) and one dotless (ı). Each has its own uppercase form: the dotless ı becomes I, and the dotted i becomes İ, which keeps the dot. If your regex pattern doesn't account for this, you might not get the matches you expected.
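
The casing difference can be seen outside of regex with ordinary string conversion. The following is a minimal sketch, assuming a runtime with full culture data available (not globalization-invariant mode); the class name TurkishCasingDemo and the helper UpperIn are hypothetical names chosen for this illustration.

```csharp
using System;
using System.Globalization;

public static class TurkishCasingDemo
{
    // Hypothetical helper: uppercases text using the casing rules of the given culture.
    public static string UpperIn(CultureInfo culture, string text)
    {
        return text.ToUpper(culture);
    }

    public static void Main()
    {
        // Invariant rules map "i" to the ASCII capital letter.
        string invariant = UpperIn(CultureInfo.InvariantCulture, "i");
        Console.WriteLine("U+{0:X4}", (int)invariant[0]);   // U+0049

        // Turkish rules are expected to map "i" to the dotted capital (U+0130)
        // when Turkish culture data is available on the machine.
        string turkish = UpperIn(new CultureInfo("tr-TR"), "i");
        Console.WriteLine("U+{0:X4}", (int)turkish[0]);
    }
}
```

The code points are printed rather than the letters themselves so that the result does not depend on the console's encoding.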

CultureInvariant Regex Option

By adding the CultureInvariant regex option, you request that the invariant culture be used for processing the regex instead of the current culture. The invariant culture is a culture-insensitive pseudo-culture (associated with the English language, but not with any country or region) that is exposed as CultureInfo.InvariantCulture in the System.Globalization namespace. Because its casing rules do not change with the user's settings, it avoids problems such as the Turkish i and other cultural variations. In addition to being used for regex processing, it is handy for any data that needs to be stored and retrieved without regard to the culture in effect at the time.
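
The effect of the option can be sketched as follows. This example assumes Turkish culture data is available on the machine; the class name CultureInvariantDemo and the helper MatchesUnder are hypothetical and exist only for illustration.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;
using System.Threading;

public static class CultureInvariantDemo
{
    // Hypothetical helper: runs a case-insensitive match while the given
    // culture is current, then restores the previous culture.
    public static bool MatchesUnder(CultureInfo culture, string input,
                                    string pattern, RegexOptions options)
    {
        CultureInfo saved = Thread.CurrentThread.CurrentCulture;
        Thread.CurrentThread.CurrentCulture = culture;
        try
        {
            return Regex.IsMatch(input, pattern, options | RegexOptions.IgnoreCase);
        }
        finally
        {
            Thread.CurrentThread.CurrentCulture = saved;  // Restore previous culture.
        }
    }

    public static void Main()
    {
        CultureInfo turkish = new CultureInfo("tr-TR");

        // Culture-sensitive: under Turkish rules "I" and "i" are not a case pair,
        // so this is expected to print False when Turkish culture data is present.
        Console.WriteLine(MatchesUnder(turkish, "FILE", "file", RegexOptions.None));

        // Culture-invariant: "I" and "i" are a case pair, so this prints True.
        Console.WriteLine(MatchesUnder(turkish, "FILE", "file", RegexOptions.CultureInvariant));
    }
}
```

The try/finally block is a design choice for the sketch: it restores the original culture even if the match throws.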

RightToLeft

Some languages, such as Arabic and Hebrew, are read from right to left. For text in these languages, you may want your regular expression to search the same way, starting at the right. Adding the RightToLeft regex option causes pattern matching to start at the rightmost character of the input and move towards the left.

While this is useful for Arabic and Hebrew text, you can also use it to find the last item in a list of items. For example, if we had a text file of birthdays and wanted to get the year associated with the first one, we could use the following regex pattern, which matches a space or slash followed by four digits: [ /]\d{4}.

March 3rd, 1926

July 25th, 1933

June 21st, 1957

July 5th, 1958

9/12/1959

3/3/1961      

With regular processing, the first match contains the year 1926. With the RightToLeft option set, the first match contains the year 1961 instead. You could use the same approach to find the last <p> tag in an HTML document.

Note: The regex object has a Boolean property, RightToLeft, that indicates whether the RightToLeft option has been set.
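
The birthday example can be sketched in code as follows. The class name RightToLeftDemo and the helper FindYear are hypothetical, and the capture group around \d{4} is an addition to the pattern from the text so that the leading space or slash is not part of the returned year.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RightToLeftDemo
{
    // The sample birthday list from the text, one entry per line.
    public const string Birthdays =
        "March 3rd, 1926\nJuly 25th, 1933\nJune 21st, 1957\n" +
        "July 5th, 1958\n9/12/1959\n3/3/1961";

    // Hypothetical helper: returns the year from the first match the engine finds,
    // or an empty string if there is no match.
    public static string FindYear(string text, RegexOptions options)
    {
        Regex regex = new Regex(@"[ /](\d{4})", options);
        Match match = regex.Match(text);
        return match.Success ? match.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        Console.WriteLine(FindYear(Birthdays, RegexOptions.None));         // 1926
        Console.WriteLine(FindYear(Birthdays, RegexOptions.RightToLeft));  // 1961

        // The RightToLeft property reports whether the option was set.
        Regex rtl = new Regex(@"[ /]\d{4}", RegexOptions.RightToLeft);
        Console.WriteLine(rtl.RightToLeft);                                // True
    }
}
```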

ECMAScript

ECMAScript is a scripting language standard that is often used for client-side scripting, most notably in JavaScript. ECMAScript has support for regular expressions, but some of the features differ from the Microsoft regex implementation. You can use the ECMAScript regex option to cause the regex object to follow the ECMAScript implementation. The notable differences are described below:

Regex Options

When the ECMAScript option is enabled, it can be combined with only a few other options; Microsoft's documentation lists IgnoreCase, Multiline, and Compiled. Trying to combine other options, such as Singleline or RightToLeft, with ECMAScript will result in an ArgumentOutOfRangeException.
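
A minimal sketch of this restriction follows. The class name EcmaOptionsDemo and the helper CanConstruct are hypothetical; the exception is raised by the Regex constructor, before any matching takes place.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EcmaOptionsDemo
{
    // Hypothetical helper: returns true if a Regex can be constructed with
    // the given options, false if the option combination is rejected.
    public static bool CanConstruct(string pattern, RegexOptions options)
    {
        try
        {
            new Regex(pattern, options);
            return true;
        }
        catch (ArgumentOutOfRangeException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // A permitted combination.
        Console.WriteLine(CanConstruct(@"\d+",
            RegexOptions.ECMAScript | RegexOptions.IgnoreCase | RegexOptions.Multiline)); // True

        // Singleline is not permitted together with ECMAScript.
        Console.WriteLine(CanConstruct(@"\d+",
            RegexOptions.ECMAScript | RegexOptions.Singleline));                          // False
    }
}
```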

Character Classes

With the ECMAScript option, the shorthand character classes are limited to ASCII: \w is equivalent to [a-zA-Z_0-9] and \d is equivalent to [0-9]. Unicode letters and digits outside those ranges are therefore not matched by these classes. If we have Japanese text, such as こんにちは, the pattern \w+ will match it using a default Microsoft regex, but will fail to match if we use the ECMAScript option.
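
The Japanese example can be checked with a short sketch like the one below. The class name EcmaWordClassDemo and the helper HasWord are hypothetical names for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EcmaWordClassDemo
{
    // Hypothetical helper: reports whether \w+ finds a match in the text
    // under the given options.
    public static bool HasWord(string text, RegexOptions options)
    {
        return Regex.IsMatch(text, @"\w+", options);
    }

    public static void Main()
    {
        string japanese = "こんにちは";

        // Default behavior: \w includes Unicode letters.
        Console.WriteLine(HasWord(japanese, RegexOptions.None));        // True

        // ECMAScript behavior: \w is limited to [a-zA-Z_0-9].
        Console.WriteLine(HasWord(japanese, RegexOptions.ECMAScript));  // False

        // ASCII text matches in both modes.
        Console.WriteLine(HasWord("hello", RegexOptions.ECMAScript));   // True
    }
}
```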

Back References

A back reference is a \ character followed by a number. However, an octal escape sequence can also be a \ followed by a number. The regex engine needs to decide how to interpret such a sequence, as either a back reference or an octal escape. The following table compares the two behaviors.

Table 18: Back References

| Pattern | Microsoft Regex | ECMAScript |
|---|---|---|
| \0dd | As long as dd is a valid octal value, treat as an octal escape | Same |
| \[1-9] | Always treat as a back reference | Back reference if the group exists; otherwise, interpret as a literal value |
| \[1-9]dddd | If the group exists, assume a back reference; otherwise, interpret up to 3 digits as octal and the remaining digits as literal values | Interpret as a back reference by using as many digits as match a group; if none match, assume octal for the first 3 digits and literal values beyond that |
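
The single-digit case (a \ followed by one digit from 1 to 9) can be explored with a sketch like the following. The class name BackreferenceDemo and the helper Parses are hypothetical. The last result reflects the documented ECMAScript behavior of falling back to a literal when the group does not exist, so treat it as an expectation to verify on your own runtime rather than a guarantee.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackreferenceDemo
{
    // Hypothetical helper: returns true if the pattern parses under the given options.
    public static bool Parses(string pattern, RegexOptions options)
    {
        try
        {
            new Regex(pattern, options);
            return true;
        }
        catch (ArgumentException)
        {
            // Pattern parse errors derive from ArgumentException.
            return false;
        }
    }

    public static void Main()
    {
        // With a capturing group present, \1 is a back reference in both modes.
        Console.WriteLine(Regex.IsMatch("aa", @"(a)\1"));                           // True
        Console.WriteLine(Regex.IsMatch("aa", @"(a)\1", RegexOptions.ECMAScript));  // True

        // Without group 1, the default engine rejects the pattern.
        Console.WriteLine(Parses(@"a\1", RegexOptions.None));                       // False

        // ECMAScript mode is documented to accept it and treat \1 as a literal.
        Console.WriteLine(Parses(@"a\1", RegexOptions.ECMAScript));
    }
}
```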

While the differences are minor between the two implementations, you could see some issues when a valid Microsoft regex pattern works differently in JavaScript. By using this option, you can at least simulate to a degree how JavaScript is likely to handle a particular regex pattern.
