
Regular Expressions Succinctly®
by Joseph D. Booth

CHAPTER 14

Unicode


Unicode is a computing standard for representing characters in almost every language in the world. There are well over 100,000 characters represented by the Unicode standard. The Unicode standard is a complex but very powerful set of rules to represent characters. For the purpose of this chapter, we are going to look at some very simple Unicode features and how to use regex to search for them.

Code Points

Although the reality is a bit more complex, you can think of a code point as a four-digit hex number representing a character. If you run the regex tester program and type 0936 while holding the Alt key, you’ll get the Greek capital letter Psi (Ψ). If you use the regex pattern \u03A8, you’ll get a match, because 03A8 is the UTF-16 hex code for the Psi character (decimal 936). You can use .NET to determine the hex value using the following code example. Remember, when using the \u pattern, the hex number must be exactly four digits long, including any leading zero.

      int Dec = 936;

      // "X4" pads the result to four hex digits ("03A8"), the length \u requires.
      string HexC = Dec.ToString("X4");
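
To tie the escape syntax to an actual match, here is a minimal sketch. The class name CodePointDemo and its two methods are hypothetical names chosen for illustration; only Regex.IsMatch and ToString("X4") come from .NET itself.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CodePointDemo
{
    // Returns true when the input contains the Greek capital letter Psi (U+03A8).
    public static bool ContainsPsi(string input)
    {
        // The \u escape must be followed by exactly four hex digits.
        return Regex.IsMatch(input, @"\u03A8");
    }

    // Builds the \u escape for a single UTF-16 character, padded to four hex digits.
    public static string ToUnicodeEscape(char c)
    {
        return @"\u" + ((int)c).ToString("X4");
    }
}
```

Calling CodePointDemo.ToUnicodeEscape on the Psi character should produce a pattern that can be passed straight to Regex.IsMatch.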

You can also visit the Unicode Character Search site to look up the hex codes for various Unicode code points.

Unicode Categories

Although you may need to search for individual characters, you can also search for whole categories. Each character in Unicode belongs to a category, such as letter, number, or symbol. To specify a Unicode category, you use the \p{category} syntax, where category is the category name written without spaces inside the braces. The lowercase \p matches Unicode characters within the category, and the uppercase \P matches characters that are not in the category. Some common categories are described as follows.
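
The difference between \p and \P can be sketched as follows. CategoryDemo and its method names are hypothetical, and the example assumes the input contains only characters that fit in a single UTF-16 code unit.

```csharp
using System.Text.RegularExpressions;

public static class CategoryDemo
{
    // Keeps only letters by deleting every character that is NOT in category L.
    public static string LettersOnly(string input)
    {
        return Regex.Replace(input, @"\P{L}", "");
    }

    // Counts the characters that ARE in category L.
    public static int CountLetters(string input)
    {
        return Regex.Matches(input, @"\p{L}").Count;
    }
}
```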

Letter Categories

The \p{L} category represents characters that are considered letters. If you use this regex pattern against the Psi symbol from above, it will find a match since Psi is considered a letter in Unicode.

You can also distinguish between lower case and upper case letters using the following:

  • \p{Ll} - Lowercase letter
  • \p{Lu} - Uppercase letter
  • \p{Lt} - Titlecase letter (a single character that combines an uppercase letter with a lowercase one, such as ǅ)
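
As a rough illustration of the letter subcategories, the sketch below classifies a single character. LetterCaseDemo and Classify are hypothetical names, and the returned labels are just descriptive strings made up for this example.

```csharp
using System.Text.RegularExpressions;

public static class LetterCaseDemo
{
    // Reports which letter subcategory a character falls into.
    public static string Classify(char c)
    {
        string s = c.ToString();
        if (Regex.IsMatch(s, @"\p{Lu}")) return "uppercase";
        if (Regex.IsMatch(s, @"\p{Ll}")) return "lowercase";
        if (Regex.IsMatch(s, @"\p{Lt}")) return "titlecase";
        // Letters without case, such as Hebrew letters, still match \p{L}.
        if (Regex.IsMatch(s, @"\p{L}")) return "other letter";
        return "not a letter";
    }
}
```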

Separator Categories

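The \p{Z} category represents characters that are considered separators: space characters, line separators, and paragraph separators. Be aware that control characters such as tab and line feed are classified as control characters (\p{Cc}), so \p{Z} does not match them even though \s does.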
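
A short sketch of how \p{Z} differs from the \s shorthand. SeparatorDemo and its method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class SeparatorDemo
{
    // True when the character belongs to the Unicode separator category.
    public static bool IsSeparator(char c)
    {
        return Regex.IsMatch(c.ToString(), @"\p{Z}");
    }

    // True when the character matches the regex white-space shorthand.
    public static bool IsWhiteSpace(char c)
    {
        return Regex.IsMatch(c.ToString(), @"\s");
    }
}
```

A tab, for example, counts as white space for \s but is a control character rather than a separator, so the two methods disagree on it.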

Symbol Categories

The \p{S} category represents symbols, such as math, box-drawing, currency, etc. You can drill down further using the following:

  • \p{Sm} - Math symbol
  • \p{Sc} - Currency symbol
  • \p{So} - Other symbol, not currency or math
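
The symbol subcategories can be tried out with a sketch like this one. SymbolDemo and its methods are hypothetical names used only for illustration.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class SymbolDemo
{
    // Collects every currency symbol found in the input, in order of appearance.
    public static string[] CurrencySymbols(string input)
    {
        return Regex.Matches(input, @"\p{Sc}")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    // True when the character is classified as a math symbol.
    public static bool IsMathSymbol(char c)
    {
        return Regex.IsMatch(c.ToString(), @"\p{Sm}");
    }
}
```

One thing to watch for: the plus sign is a math symbol, but the ordinary hyphen-minus is classified as dash punctuation, so \p{Sm} does not match it.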

Number Categories

The \p{N} category represents numeric symbols, such as digits and things like Roman numerals. You can break these down further using:

  • \p{Nd} - Decimal digits zero through nine, in any script
  • \p{Nl} - Numbers that look like letters, e.g., Roman numerals
  • \p{No} - Other numbers, such as superscript and subscript digits or fractions
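
Here is a small sketch that sorts characters into the number subcategories. NumberDemo, Classify, and the returned labels are hypothetical and exist only for this example.

```csharp
using System.Text.RegularExpressions;

public static class NumberDemo
{
    // Reports which number subcategory a character falls into.
    public static string Classify(char c)
    {
        string s = c.ToString();
        if (Regex.IsMatch(s, @"\p{Nd}")) return "decimal digit";
        if (Regex.IsMatch(s, @"\p{Nl}")) return "letter number";
        if (Regex.IsMatch(s, @"\p{No}")) return "other number";
        return "not a number";
    }
}
```

Note that the ASCII letter V is not a number, while the dedicated Roman numeral characters (such as U+2167, Roman numeral eight) are.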

Punctuation Categories

The \p{P} category represents punctuation symbols, such as dashes, brackets, etc. You can break these down further using:

  • \p{Pd} - Dash or hyphen characters
  • \p{Ps} - Opening bracket characters
  • \p{Pe} - Closing bracket characters
  • \p{Pi} - Opening quote characters
  • \p{Pf} - Closing quote characters
  • \p{Pc} - Connecting, such as underscores that connect words
  • \p{Po} - Other punctuation symbols
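
The punctuation categories are handy for cleaning up text. The following is a minimal sketch; PunctuationDemo and its method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class PunctuationDemo
{
    // Removes every punctuation character, whatever its subcategory.
    public static string StripPunctuation(string input)
    {
        return Regex.Replace(input, @"\p{P}", "");
    }

    // Counts dash-like characters, including the hyphen-minus and the en dash.
    public static int CountDashes(string input)
    {
        return Regex.Matches(input, @"\p{Pd}").Count;
    }
}
```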

Unicode Scripts

Each Unicode code point is assigned to a script. A script is the set of code points used in a particular writing system. For example, the Psi symbol is part of the Greek script, but there are also Arabic scripts, Cyrillic scripts, etc. You can use the \p{script} syntax to find out whether a character is part of a script. If we applied \p{IsGreek} to our text of the Psi symbol, it would match, since Psi is a Greek character.

Microsoft refers to these Unicode scripts as named blocks and supports a large number of them based on the Unicode 4.0 standard. Strictly speaking, a named block is a contiguous range of code points rather than a script. Some common ones are described below:

  • IsBasicLatin: Basic Latin code points, used by English and many European languages (range 0000-007F).
  • IsGreek: Greek language code points (range 0370-03FF).
  • IsHebrew: Hebrew language code points (range 0590-05FF).
  • IsCurrencySymbols: Currency symbol code points (range 20A0-20CF).
  • IsMathematicalOperators: Math symbol code points (range 2200-22FF).

Note that named block names are case sensitive, and an error will occur if you use the wrong case or misspell the name. You can find the complete list of the Unicode named blocks supported by the Microsoft .NET regex framework here: Unicode Character Blocks.
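
The sketch below shows a named block match and one way to guard against a misspelled name. NamedBlockDemo and its methods are hypothetical. The guard assumes the error surfaces as an ArgumentException (or a type derived from it) when the Regex object is constructed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedBlockDemo
{
    // True when the input contains at least one character from the Greek block.
    public static bool ContainsGreek(string input)
    {
        return Regex.IsMatch(input, @"\p{IsGreek}");
    }

    // True when the regex engine accepts the given category or block name.
    public static bool IsRecognizedName(string name)
    {
        try
        {
            // The pattern is parsed in the constructor, so a bad name fails here.
            new Regex(@"\p{" + name + "}");
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }
}
```

Checking a name up front like this may be useful when the block name comes from configuration or user input rather than being typed into the pattern directly.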

Summary

The Unicode standard allows many language systems to be represented by computer systems. Using the .NET regex object, you can search for individual Unicode characters, Unicode character classes, and Unicode writing systems.
