CHAPTER 17
Microsoft’s regex object has a few additional options beyond the ones we’ve covered in this book. Two of these options are particularly useful when working with international text.
When the regex engine performs case-insensitive comparisons, it uses the casing rules of the current culture to map equivalent uppercase and lowercase letters. If you need the rules of a particular culture, you can use the following code to switch the culture used for processing the regex pattern.
CultureInfo defaultCulture = Thread.CurrentThread.CurrentCulture;
Thread.CurrentThread.CurrentCulture = new CultureInfo("fr-FR");
// Process the regex here, using fr-FR (French) casing rules.
Thread.CurrentThread.CurrentCulture = defaultCulture; // Restore the default culture.
Note: The above code needs using directives for the System.Globalization and System.Threading namespaces.
For most languages, the current culture's casing rules are acceptable; however, there are some languages where the comparison might not yield the expected results. In particular, the Turkish language has two lowercase i's, one dotted (i) and one dotless (ı). Each has its own uppercase form: the dotless one becomes I, and the dotted one keeps its dot and becomes İ. If your regex pattern doesn't account for this, you might not get the matches you expected.
By adding the CultureInvariant regex option, you request that the special invariant culture be used for processing the regex. The invariant culture is an artificial, culture-insensitive culture (based on English, but not tied to any country or region) that is exposed as CultureInfo.InvariantCulture in the System.Globalization namespace. Because its casing rules do not change with the user's settings, it avoids the Turkish i issue and other cultural variations. In addition to being used for regex processing, it is handy for any data that needs to be stored and retrieved without regard to the culture in effect at the time.
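The following is a minimal sketch of the Turkish i problem, not a definitive test. It assumes the tr-TR culture is available on the machine; the class name CultureInvariantDemo and the helper MatchesFile are made up for this illustration. The result of the culture-sensitive comparison can depend on the runtime version and its globalization settings, so only the invariant comparison is described as certain.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;
using System.Threading;

public static class CultureInvariantDemo
{
    // Hypothetical helper: tests whether the whole input is "file",
    // compared under the supplied regex options.
    public static bool MatchesFile(string input, RegexOptions options)
    {
        return Regex.IsMatch(input, "^file$", options);
    }

    public static void Main()
    {
        CultureInfo savedCulture = Thread.CurrentThread.CurrentCulture;
        try
        {
            // Assumption: the Turkish (Turkey) culture is installed.
            Thread.CurrentThread.CurrentCulture = new CultureInfo("tr-TR");

            // Culture-sensitive comparison: under Turkish casing rules,
            // "I" and "i" are not a case pair, so this may report False.
            Console.WriteLine(MatchesFile("FILE", RegexOptions.IgnoreCase));

            // Invariant comparison: "FILE" and "file" are treated as equal.
            Console.WriteLine(MatchesFile("FILE",
                RegexOptions.IgnoreCase | RegexOptions.CultureInvariant));
        }
        finally
        {
            // Restore the original culture even if an exception occurs.
            Thread.CurrentThread.CurrentCulture = savedCulture;
        }
    }
}
```

Wrapping the culture switch in try/finally is a design choice for this sketch: it guarantees the thread's culture is restored whatever happens in between.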
Another option addresses the fact that some languages, such as Arabic and Hebrew, are read from right to left. In these languages, you would want your regular expression to search the same way, starting at the right. Adding the RightToLeft regex option causes pattern matching to start at the rightmost character and move toward the left.
While this is useful for Arabic and Hebrew text, you can also use it to find the last item in a list of items. For example, if we had a text file of birthdays like the one below and wanted to get the year associated with the last one, we could use the following regex pattern: [ /]\d{4}.
March 3rd, 1926
July 25th, 1933
June 21st, 1957
July 5th, 1958
9/12/1959
3/3/1961
With regular processing, the year 1926 would be returned. By setting the RightToLeft option, we get the year 1961 back instead. (In both cases, the match also includes the leading space or slash.) You could use the same approach to find the last <p> tag added to an HTML document, and so on.
Note: The regex object has a Boolean property RightToLeft to determine if the Right to Left option has been set.
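The birthday example can be sketched as follows. The class name RightToLeftDemo and the helper FindYear are made up for this illustration, and the birthday list is held in a string rather than read from a file. The helper trims the leading space or slash that the pattern [ /]\d{4} includes in each match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RightToLeftDemo
{
    // The sample birthday list, one entry per line.
    public const string Birthdays =
        "March 3rd, 1926\nJuly 25th, 1933\nJune 21st, 1957\n" +
        "July 5th, 1958\n9/12/1959\n3/3/1961";

    // Hypothetical helper: returns the year from the first match the
    // engine finds, with the leading space or slash removed.
    public static string FindYear(string text, RegexOptions options)
    {
        Regex regex = new Regex(@"[ /]\d{4}", options);
        Match match = regex.Match(text);
        return match.Success ? match.Value.TrimStart(' ', '/') : string.Empty;
    }

    public static void Main()
    {
        // Left to right: the first match is in the first entry.
        Console.WriteLine(FindYear(Birthdays, RegexOptions.None));        // 1926

        // Right to left: the first match found is in the last entry.
        Console.WriteLine(FindYear(Birthdays, RegexOptions.RightToLeft)); // 1961

        // The RightToLeft property reports whether the option was set.
        Regex rtl = new Regex(@"[ /]\d{4}", RegexOptions.RightToLeft);
        Console.WriteLine(rtl.RightToLeft);                               // True
    }
}
```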
ECMAScript is a scripting language standard that is often used for client-side scripting, most notably in JavaScript. ECMAScript has support for regular expressions, but some of the features differ from the Microsoft regex implementation. You can use the ECMAScript regex option to cause the regex object to follow the ECMAScript implementation. The notable differences are described below:
When the ECMAScript option is enabled, it can be combined only with a small set of other options: Microsoft's documentation lists IgnoreCase, Multiline, and Compiled. Trying to combine any other option with ECMAScript results in an ArgumentOutOfRangeException.
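A short sketch of this restriction is shown here. The class name EcmaOptionsDemo and the helper IsAllowedWithEcmaScript are made up for this illustration; the helper simply tries to construct a regex and reports whether the constructor rejected the option combination.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EcmaOptionsDemo
{
    // Hypothetical helper: returns false if combining the extra option
    // with ECMAScript makes the Regex constructor throw.
    public static bool IsAllowedWithEcmaScript(RegexOptions extra)
    {
        try
        {
            new Regex("a", RegexOptions.ECMAScript | extra);
            return true;
        }
        catch (ArgumentOutOfRangeException)
        {
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(IsAllowedWithEcmaScript(RegexOptions.IgnoreCase));  // True
        Console.WriteLine(IsAllowedWithEcmaScript(RegexOptions.Multiline));   // True
        Console.WriteLine(IsAllowedWithEcmaScript(RegexOptions.Singleline));  // False
        Console.WriteLine(IsAllowedWithEcmaScript(RegexOptions.RightToLeft)); // False
    }
}
```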
With the ECMAScript option, the shorthand character classes are not Unicode-aware: \w, for example, is equivalent to [a-zA-Z_0-9] and so matches only ASCII letters, digits, and the underscore. If we have Japanese text, such as こんにちは, the pattern \w+ will match it using a Microsoft regex, but will fail to match if we use the ECMAScript option.
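The difference can be sketched with the Japanese greeting from the text. The class name EcmaWordCharDemo and the helper HasWord are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EcmaWordCharDemo
{
    // Hypothetical helper: reports whether \w+ finds anything in the
    // input under the supplied options.
    public static bool HasWord(string input, RegexOptions options)
    {
        return Regex.IsMatch(input, @"\w+", options);
    }

    public static void Main()
    {
        const string greeting = "こんにちは";

        // Default behavior: \w includes Unicode letters.
        Console.WriteLine(HasWord(greeting, RegexOptions.None));       // True

        // ECMAScript behavior: \w is limited to [a-zA-Z_0-9].
        Console.WriteLine(HasWord(greeting, RegexOptions.ECMAScript)); // False

        // ASCII text matches either way.
        Console.WriteLine(HasWord("hello", RegexOptions.ECMAScript));  // True
    }
}
```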
A backreference is a \ character followed by a number. However, an octal escape sequence can also be a \ followed by digits. The regex engine needs to decide how to interpret such a sequence, as either a backreference or an octal escape.
Table 18: Back References
Pattern | Microsoft Regex | ECMAScript |
\0dd | As long as dd is a valid octal, treat as octal escape | Same |
\[1-9] | Always treat as a backreference (an error if the group does not exist) | Backreference if the group exists; otherwise, interpret as a literal value |
\[1-9]dddd | If group exists, assume backreference; otherwise, interpret up to 3 digits as octal and remaining digits as literal value | Interpret as back reference by using as many digits as match; if no match, assume octal for first 3 digits and literal values beyond that |
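The single-digit row of the table can be sketched as follows. The class name BackreferenceDemo and the helper Parses are made up for this illustration. The sketch relies on Microsoft's documented behavior that, without the ECMAScript option, a reference such as \9 to a group that does not exist is rejected with an ArgumentException when the pattern is parsed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackreferenceDemo
{
    // Hypothetical helper: returns true if the pattern can be parsed
    // under the supplied options, false if the parser rejects it.
    public static bool Parses(string pattern, RegexOptions options)
    {
        try
        {
            new Regex(pattern, options);
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // No ninth group exists, so the default parser rejects \9.
        Console.WriteLine(Parses(@"\9", RegexOptions.None));

        // With ECMAScript, \9 falls back to a literal instead.
        Console.WriteLine(Parses(@"\9", RegexOptions.ECMAScript));

        // When the group exists, both modes treat \1 as a backreference.
        Console.WriteLine(Regex.IsMatch("abab", @"(ab)\1", RegexOptions.None));
        Console.WriteLine(Regex.IsMatch("abab", @"(ab)\1", RegexOptions.ECMAScript));
    }
}
```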
While the differences are minor between the two implementations, you could see some issues when a valid Microsoft regex pattern works differently in JavaScript. By using this option, you can at least simulate to a degree how JavaScript is likely to handle a particular regex pattern.