CHAPTER 14
Unicode is a computing standard for representing characters in almost every language in the world. There are well over 100,000 characters represented by the Unicode standard. The Unicode standard is a complex but very powerful set of rules to represent characters. For the purpose of this chapter, we are going to look at some very simple Unicode features and how to use regex to search for them.
Although it is a bit more complex in practice, you can think of a code point as a four-digit hex number representing a character. If you run the regex tester program and enter the number 0936 while holding the Alt key, you’ll get the Greek capital letter Psi (Ψ). If you use the regex pattern \u03A8, you’ll get a match. 03A8 is the UTF-16 hex code for the Psi character (936 in decimal). You can use .NET to determine the hex value using the following code example. Remember, when using the \u pattern, the hex number must be exactly four digits long, including the leading zero.
int Dec = 936;
string HexC = Dec.ToString("X4"); // "03A8"; X4 pads to the four digits that \u requires
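To tie the conversion and the \u pattern together, here is a minimal sketch that builds the pattern from the decimal value and tests it against Psi. The variable names are illustrative only, and the "X4" format specifier is used so that the leading zero is kept.

```csharp
using System;
using System.Text.RegularExpressions;

// Convert the decimal code point to a four-digit hex string for use with \u.
int codePoint = 936;
string hex = codePoint.ToString("X4");   // "03A8" (plain "X" would give "3A8")

// Build the regex pattern \u03A8; the verbatim string keeps the backslash literal.
string pattern = @"\u" + hex;

// "\u03A8" here is a C# escape that produces the Psi character itself.
string psi = "\u03A8";
bool matched = Regex.IsMatch(psi, pattern);

Console.WriteLine($"{pattern} matches Psi: {matched}");   // \u03A8 matches Psi: True
```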
You can also visit the Unicode Character Search site to look up the hex codes for various Unicode code points.
Although you may need to search for individual characters, you can also search for categories. Each character in Unicode belongs to a category, such as letter or numeric symbol. To specify a Unicode category, you use the \p{ category } syntax. The lowercase \p matches Unicode characters within the category, and the uppercase \P matches characters that are not in the category. Some common categories are described as follows.
The \p{L} category represents characters that are considered letters. If you use this regex pattern against the Psi symbol from above, it will find a match since Psi is considered a letter in Unicode.
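As a quick illustration of \p and its negated form \P, the sketch below checks Psi and a digit against the letter category. The variable names are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

string psi = "\u03A8";   // Greek capital letter Psi

// \p{L} matches any character in the letter category.
bool psiIsLetter = Regex.IsMatch(psi, @"\p{L}");

// \P{L} matches any character that is NOT a letter.
bool psiIsNonLetter = Regex.IsMatch(psi, @"\P{L}");
bool digitIsNonLetter = Regex.IsMatch("7", @"\P{L}");

Console.WriteLine(psiIsLetter);        // True
Console.WriteLine(psiIsNonLetter);     // False
Console.WriteLine(digitIsNonLetter);   // True
```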
You can also distinguish between lowercase and uppercase letters: \p{Ll} matches lowercase letters and \p{Lu} matches uppercase letters.
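For example, assuming the standard Unicode subcategories Lu (uppercase letter) and Ll (lowercase letter), a short sketch that counts each kind in a mixed Greek and Latin string might look like this. The sample text is invented for the illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Capital Psi, small psi, a space, then Latin "A" and "b".
string text = "\u03A8\u03C8 Ab";

// \p{Lu} matches uppercase letters in any script; \p{Ll} matches lowercase ones.
int upperCount = Regex.Matches(text, @"\p{Lu}").Count;
int lowerCount = Regex.Matches(text, @"\p{Ll}").Count;

Console.WriteLine($"Uppercase: {upperCount}, lowercase: {lowerCount}");
// Uppercase: 2, lowercase: 2
```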
The \p{Z} category represents characters that are considered separators or white space.
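One place the separator category can help is splitting text that contains unusual spaces. The sketch below is only an illustration with invented sample text. It also shows that the tab character is classified by Unicode as a control character rather than a separator, so \p{Z} does not match it.

```csharp
using System;
using System.Text.RegularExpressions;

// Words separated by a no-break space (U+00A0) and an em space (U+2003).
string text = "alpha\u00A0beta\u2003gamma";

// Split on one or more separator characters.
string[] parts = Regex.Split(text, @"\p{Z}+");
Console.WriteLine(string.Join(",", parts));   // alpha,beta,gamma

// A tab belongs to the control category (Cc), not to \p{Z}.
bool tabIsSeparator = Regex.IsMatch("\t", @"\p{Z}");
Console.WriteLine(tabIsSeparator);            // False
```

If tabs and line breaks also need to be treated as white space, the \s class is usually the better fit.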
The \p{S} category represents symbols, such as math, box-drawing, currency, etc. You can drill down further using \p{Sm} for math symbols, \p{Sc} for currency symbols, \p{Sk} for modifier symbols, and \p{So} for other symbols.
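To make the symbol category concrete, here is a small sketch that counts symbols in an invented sample string, using the standard Unicode subcategories Sm (math), Sc (currency), and So (other).

```csharp
using System;
using System.Text.RegularExpressions;

// Sample text: "Total: 5 + 3 = €8 ©" written with escapes for the non-ASCII symbols.
string text = "Total: 5 + 3 = \u20AC8 \u00A9";

int allSymbols = Regex.Matches(text, @"\p{S}").Count;    // +, =, euro sign, copyright sign
int mathSymbols = Regex.Matches(text, @"\p{Sm}").Count;  // + and =
int currency = Regex.Matches(text, @"\p{Sc}").Count;     // euro sign
int other = Regex.Matches(text, @"\p{So}").Count;        // copyright sign

Console.WriteLine($"{allSymbols} {mathSymbols} {currency} {other}");   // 4 2 1 1
```

Note that the colon is not counted, because Unicode classifies it as punctuation rather than a symbol.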
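The \p{N} category represents numeric symbols, such as digits and things like Roman numerals. You can break these down further using \p{Nd} for decimal digits, \p{Nl} for letter-like numbers such as Roman numerals, and \p{No} for other numbers such as fractions and superscripts.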
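As an illustration of the numeric subcategories, the sketch below uses one character from each: a decimal digit (Nd), the Roman numeral eight at U+2167 (Nl), and the one-half fraction at U+00BD (No). The sample string is invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// "7", Roman numeral eight (U+2167), and the fraction one half (U+00BD).
string text = "7 \u2167 \u00BD";

int allNumbers = Regex.Matches(text, @"\p{N}").Count;
int decimalDigits = Regex.Matches(text, @"\p{Nd}").Count;
int letterNumbers = Regex.Matches(text, @"\p{Nl}").Count;
int otherNumbers = Regex.Matches(text, @"\p{No}").Count;

Console.WriteLine($"{allNumbers} {decimalDigits} {letterNumbers} {otherNumbers}");   // 3 1 1 1
```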
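The \p{P} category represents punctuation symbols, such as dashes, brackets, etc. You can break these down further using \p{Pd} for dashes, \p{Ps} for opening brackets, \p{Pe} for closing brackets, \p{Pi} and \p{Pf} for initial and final quotation marks, \p{Pc} for connectors such as the underscore, and \p{Po} for other punctuation.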
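A short sketch of the punctuation subcategories, using an invented sample string and the standard Unicode names Pd (dash), Ps (open), Pe (close), and Po (other):

```csharp
using System;
using System.Text.RegularExpressions;

string text = "(well-known)!";

int allPunctuation = Regex.Matches(text, @"\p{P}").Count;   // ( - ) !
bool hasDash = Regex.IsMatch(text, @"\p{Pd}");              // the hyphen
bool hasOpen = Regex.IsMatch(text, @"\p{Ps}");              // the opening parenthesis
bool hasClose = Regex.IsMatch(text, @"\p{Pe}");             // the closing parenthesis
bool hasOther = Regex.IsMatch(text, @"\p{Po}");             // the exclamation mark

Console.WriteLine(allPunctuation);                                   // 4
Console.WriteLine($"{hasDash} {hasOpen} {hasClose} {hasOther}");     // True True True True
```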
Each Unicode code point is also assigned to a script. A script is the set of code points used in a particular writing system. For example, the Psi symbol is part of the Greek script, and there are also Arabic, Cyrillic, and many other scripts. You can use the \p{ script } syntax to test whether a character or word belongs to a script. If we applied \p{IsGreek} to our text of the Psi symbol, it would match, since Psi is a Greek character.
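The \p{IsGreek} check can be sketched as follows. The Greek sample word and the anchored pattern are additions for illustration; the anchors ^ and $ with the + quantifier require every character in the input to be Greek.

```csharp
using System;
using System.Text.RegularExpressions;

string psi = "\u03A8";                       // Greek capital letter Psi
string greekWord = "\u03A8\u03C5\u03C7\u03AE"; // a word made only of Greek letters
string latinWord = "Psi";

// A single Greek character matches \p{IsGreek}.
bool psiIsGreek = Regex.IsMatch(psi, @"\p{IsGreek}");

// Anchoring the pattern checks that the whole word is Greek.
bool wordIsGreek = Regex.IsMatch(greekWord, @"^\p{IsGreek}+$");
bool latinIsGreek = Regex.IsMatch(latinWord, @"^\p{IsGreek}+$");

Console.WriteLine($"{psiIsGreek} {wordIsGreek} {latinIsGreek}");   // True True False
```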
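Microsoft refers to these groupings as named blocks and supports a large number of them, based on the Unicode 4.0 standard. Strictly speaking, a named block is a contiguous range of code points rather than a full script. Some common ones include IsBasicLatin (U+0000 to U+007F), IsGreek (U+0370 to U+03FF), IsCyrillic (U+0400 to U+04FF), IsHebrew (U+0590 to U+05FF), and IsArabic (U+0600 to U+06FF).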
Note that named block names are case sensitive, and an error will occur if you use the wrong case or misspell the name. You can find the complete list of the Unicode named blocks supported by the Microsoft .NET regex framework here: Unicode Character Blocks.
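The error surfaces when the pattern is parsed, as an ArgumentException (or a type derived from it, depending on the .NET version). The sketch below uses a hypothetical helper named Compiles to show which spellings are accepted.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns true if the pattern can be parsed by the Regex class.
bool Compiles(string pattern)
{
    try
    {
        _ = new Regex(pattern);
        return true;
    }
    catch (ArgumentException)
    {
        // Thrown for an unknown or wrongly cased block name.
        return false;
    }
}

Console.WriteLine(Compiles(@"\p{IsGreek}"));    // True
Console.WriteLine(Compiles(@"\p{isgreek}"));    // False (wrong case)
Console.WriteLine(Compiles(@"\p{IsGreeek}"));   // False (misspelled)
```

Validating a pattern this way can be useful when block names come from user input or configuration.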
The Unicode standard allows many language systems to be represented by computer systems. Using the .NET regex object, you can search for individual Unicode characters, Unicode character classes, and Unicode writing systems.