
Regular Expressions Succinctly®
by Joseph D. Booth

CHAPTER 3

Character Sets

A regex pattern is a string expression that describes how to find matching text. The simplest pattern is simply a letter or number we want to match. For example, in our help system text, we could find the time by using the regex pattern 7pm. While searching specifically for 7pm is not very useful, it illustrates the first rule of a regex.

Literal Characters

Regex has a number of characters with special meaning, called metacharacters. We will explore how to use some of them later in this chapter. The metacharacters are:

^ $ . | {  } [ ] ( ) * + ? \

All other characters in the pattern are literals. The regex search engine expects to find the exact literal character somewhere in the string. In our example above, 7pm are three literal characters, and will only match the exact text 7pm in the search string. Note that, by default, regex is case sensitive, so 7PM would not match. Although we can add an option to have the regex engine ignore case, let’s look at a slightly different approach to solving the problem.
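As a quick illustration of literal matching, here is a minimal sketch using Python's standard re module. The sample sentence is invented for this example; it is not the help system text referred to above.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "Support is available until 7pm. Calls after 7PM go to voicemail."

# A pattern made only of literal characters matches exactly that text.
literal = re.compile(r"7pm")

# Only the lowercase occurrence is found.
matches = literal.findall(text)

# Matching is case sensitive by default, so "7PM" is not found.
upper_found = literal.search("7PM") is not None
```

Here matches holds the single lowercase occurrence, and upper_found is False.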

Square Brackets []

We are going to introduce our first metacharacters, the square brackets []. The square brackets are used to provide a list of potential matching characters at a position in the search text. So, to search for 7pm, case insensitive, we could use the regex pattern 7[Pp][Mm]. This pattern tells the regex to find the number 7, followed by either an upper case or lower case P, followed by an upper or lower case M. Using this pattern, 7pm, 7Pm, 7pM, or 7PM would all be found. However, the [] will only match one character in the list, so 7PPM would not match the pattern.
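The behavior described above can be checked with a short Python sketch. The variable names are our own and the candidate strings are the ones listed in the paragraph.

```python
import re

# Each bracket pair matches exactly one character from its list.
seven_pm = re.compile(r"7[Pp][Mm]")

candidates = ["7pm", "7Pm", "7pM", "7PM", "7PPM"]

# Map each candidate to whether the pattern is found in it.
results = {c: seven_pm.search(c) is not None for c in candidates}
```

The first four candidates map to True. 7PPM maps to False, because the second bracket pair consumes only one P and the next character is not an M.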

We can also adjust our pattern to find any time between 1 and 9 a.m. or 1 and 9 p.m. (we will cover how to get 10, 11, or 12 in a later chapter). We are now going to tweak our regex pattern to find a number between 1 and 9, followed by an upper or lower case A or P, followed by an upper or lower case letter M. The resulting pattern is [123456789][aApP][Mm]. This pattern will handle any time beginning with a 1 through 9, followed by AM or PM (regardless of case).

Ranges

In addition to listing out all the possible characters within the brackets, you can include a dash character to indicate a range of characters. For example, we can change our time search pattern from [123456789][aApP][Mm] to [1-9][aApP][Mm]. This shorthand notation makes the regex a bit easier to write. You can also combine ranges between the brackets. For example, the following regex will find a two digit hex number [0-9A-F][0-9A-F]. We could add a-f if we wanted to include lower case letters as well.
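To see that the listed and ranged forms behave the same way, consider this small Python sketch. The sample strings are made up for the illustration.

```python
import re

listed = re.compile(r"[123456789][aApP][Mm]")
ranged = re.compile(r"[1-9][aApP][Mm]")

# Hypothetical sample text; "0am" should not match because 0 is outside 1-9.
sample = "Open 9AM to 5pm; closed at 0am."
listed_hits = listed.findall(sample)
ranged_hits = ranged.findall(sample)

# Two combined ranges (plus a-f for lower case) match a two-digit hex number.
hex_pair = re.compile(r"[0-9A-Fa-f][0-9A-Fa-f]")
hex_hits = hex_pair.findall("FF 7a G1")
```

Both time patterns return the same two matches, 9AM and 5pm. The hex pattern finds FF and 7a but skips G1, since G is outside the A-F range.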

You can also combine ranges and characters together, so the regex [0-9ABCDEF] is the same as [0-9A-F] to return a single hex digit.

Note that if you want to include the dash within the brackets (as a literal character), then it must come at the beginning or end of the list inside the brackets; anywhere else, it is read as a range operator. To test for a +, -, or digit, you could use the pattern [-+0-9], which says accept a dash (literal character), a plus sign, or any digit between 0 and 9.
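The following Python sketch shows why the position of the dash matters. The input strings are invented for the example.

```python
import re

# Dash first: it is a literal character.
sign_or_digit = re.compile(r"[-+0-9]")
found = sign_or_digit.findall("x = -4 + 12")

# Dash last: also a literal character, so the result is the same.
dash_last = re.compile(r"[+0-9-]")
found_dash_last = dash_last.findall("x = -4 + 12")

# Dash in the middle: "+-9" is read as the range from "+" to "9",
# which also happens to contain the comma.
accidental_range = re.compile(r"[+-9]")
comma_matches = accidental_range.search(",") is not None
```

Both found and found_dash_last contain the dash, the plus sign, and the three digits. comma_matches is True, which is almost certainly not what the author of such a pattern intended.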

Non-printable characters

Certain characters cannot be typed directly into a regex pattern, such as a tab or a new line character. Character escapes represent these characters, and allow just about any character to be searched for within a text.

  • \a - Matches the bell character.
  • \b - Matches a backspace (when used between brackets).
  • \t - Matches a tab character.
  • \r - Matches a carriage return.
  • \n - Matches a new line; most Windows files use \r\n to indicate the end of a line.
  • \f - Matches a form feed.
  • \v - Matches a vertical tab.
  • \xNN - Matches the ASCII character represented by the hex value NN.
  • \cX - Matches an ASCII control character; for example, \cC is Ctrl-C.

When the \ is followed by a character that is not part of a recognized escape sequence, it matches that character literally, even if the character normally has special meaning in the regex. For example, as we will see later, the asterisk has a special meaning inside a regex pattern. However, \* overrides that special meaning and matches a literal asterisk character.
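A few of these escapes can be tried out with the Python sketch below. The sample strings are invented. Note that Python's re module does not accept the \cX form, so that escape is left out here.

```python
import re

# Hypothetical line of text with a tab and a Windows-style line ending.
line = "name\tvalue\r\n"

has_tab = re.search(r"\t", line) is not None
has_windows_eol = re.search(r"\r\n", line) is not None

# \x41 is the hex value of the letter A.
letter_a_hits = re.findall(r"\x41", "ABBA")

# Inside brackets, \b means a backspace character.
has_backspace = re.search(r"[\b]", "oops\b") is not None

# The backslash turns the asterisk metacharacter into a literal.
asterisks = re.findall(r"\*", "2*3*4")
```

All three boolean checks come out True, letter_a_hits contains the two A characters, and asterisks contains the two literal asterisks.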

Some other characters do not have ASCII equivalents; however, you can use the Unicode value to search for those characters. For example, if we were looking for the Euro symbol €, we could use the following regex pattern \u20AC. Not all regex implementations support the \u metacharacter, although JavaScript and the .NET framework do support it. You can use the following Unicode symbols to find various currency codes in a text:

  • Euro € - \u20AC
  • British pound £ - \u00A3
  • Yen ¥ - \u00A5
  • Dollar sign $ - \$ or \u0024 or \x24

To find any of the above currency symbols, the pattern [\u20AC\u00A3\u00A5\$] should do the trick. Note that the $ is a metacharacter in regex, which is why we escape it with the backslash character.
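Python's re module is another implementation that accepts the \u escape in text patterns, so the currency pattern can be sketched as follows. The price list is invented, and the symbols in it are written as escapes to keep the source ASCII-only.

```python
import re

currency = re.compile(r"[\u20AC\u00A3\u00A5\$]")

# Hypothetical text containing a euro, pound, yen, and dollar amount.
text = "Prices: \u20ac10, \u00a38, \u00a51200 or $11"

symbols = currency.findall(text)
```

symbols holds the four currency symbols in the order they appear in the text.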

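ECMAScript is a scripting language specification that web client languages, such as JavaScript, are based on. One notable difference is that ECMAScript regex processing has more limited Unicode support: the \u escape shown above works, but shortcut classes such as \d and \w (covered later in this chapter) match only ASCII characters.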

Negation

When you build a character set within the brackets, it is sometimes easier to specify what values you want to exclude. This is what the ^ metacharacter within the square brackets allows you to do. The pattern [aeiou] matches any lowercase vowel, and the pattern [^aeiou] matches any character that is not a lowercase vowel. That includes consonants, but also digits, punctuation, white space, and uppercase letters.

The negation character applies to everything within the square brackets. For example, if you negate the 0-9A-F pattern, i.e. [^0-9A-F], you will match any character that is not a valid (uppercase) hex digit.
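The Python sketch below, with invented sample strings, shows that a negated set matches every character outside the list, not just letters.

```python
import re

non_vowels = re.compile(r"[^aeiou]")
# The space and the digits are returned along with the consonants.
non_vowel_hits = non_vowels.findall("regex 101")

non_hex = re.compile(r"[^0-9A-F]")
# The dash and the lowercase g are not uppercase hex digits.
non_hex_hits = non_hex.findall("1F-g7")
```

non_vowel_hits contains r, g, x, the space, and the three digits. non_hex_hits contains only the dash and the g.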

Shortcuts

In addition to ranges and lists of characters, there are also a number of shortcut metacharacters available for some common character patterns.

Digits

The \d metacharacter matches any single digit; it is the same as [0-9]. You can negate it by using the uppercase \D, which matches any non-digit, the same as [^0-9].

Alphanumeric (word characters)

The \w metacharacter matches any word character (any uppercase or lowercase letter, any digit, and the underscore). It is the same as [a-zA-Z0-9_]. The \W metacharacter negates that, and matches any non-word character (i.e. [^a-zA-Z0-9_]).

White space

The \s metacharacter means to match any white space character in a string. White space includes carriage return, line feed, tab, form feed, and the space character, i.e. [\r\n\t\f ]. The \S metacharacter is the negation, and matches any non-white space character.

The shortcut metacharacters can be used outside of the square brackets, so \d will find a digit, while 0-9 (without brackets) will look for the exact pattern 0-9.
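The shortcuts can be sketched in Python as shown below. One assumption to flag: Python's shortcuts also match non-ASCII digits, letters, and spaces by default, so the re.ASCII flag is passed here to make them behave like the bracket expressions given above. The sample strings are invented.

```python
import re

# Hypothetical sample: letters, underscore, digits, a space, a tab, and "!".
sample = "id_42 ok\t!"

digits = re.findall(r"\d", sample, re.ASCII)
non_word = re.findall(r"\W", sample, re.ASCII)
spaces = re.findall(r"\s", sample, re.ASCII)

# Without brackets, 0-9 is three literal characters, not a range.
literal_range = re.findall(r"0-9", "range 0-9 or 5")
```

digits holds 4 and 2, non_word holds the space, the tab, and the exclamation mark, and spaces holds the space and the tab. literal_range finds only the exact text 0-9 and ignores the lone 5.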

The dot

The dot metacharacter matches any single character (with the exception of the linefeed, \n). When regex was first designed, it was targeted toward lines from text files, which is why the linefeed character is not matched. In .NET, the single line regex option causes the dot to also match the linefeed (\n) character.

To actually match a dot or period, you need to escape it with the backslash. For example, a pattern to match a decimal number with two decimal places might look like this: \d\d\.\d\d. If you didn’t escape the dot, the pattern would still match such a number, but it would also match a five-digit number without a decimal point, two digits followed by a dash and two more digits, and so on.
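The difference between the escaped and unescaped dot can be sketched in Python as follows. The re.DOTALL flag used at the end is Python's counterpart to the .NET single line option mentioned above; the sample strings are invented.

```python
import re

loose = re.compile(r"\d\d.\d\d")    # dot matches any character
strict = re.compile(r"\d\d\.\d\d")  # escaped dot matches only a period

samples = ["12.34", "12345", "12-34"]
loose_ok = [s for s in samples if loose.fullmatch(s)]
strict_ok = [s for s in samples if strict.fullmatch(s)]

# By default the dot does not match a linefeed; with DOTALL it does.
newline_default = re.search(r"a.b", "a\nb") is not None
newline_dotall = re.search(r"a.b", "a\nb", re.DOTALL) is not None
```

The loose pattern accepts all three samples, while the strict pattern accepts only 12.34.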

Some examples

The character class regex patterns can be handy when you want to match a string of the same length as the pattern. In the next chapter, we will see how to match varying lengths. Let’s review a couple of examples.

Social Security Number

A social security number is a nine digit number, grouped into three digits, a dash, two digits, another dash, and then four final digits. To express this pattern as a regular expression, we could use the following: [0-9][0-9][0-9][-][0-9][0-9][-][0-9][0-9][0-9][0-9].
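A minimal Python sketch of this pattern is shown below. The number in the sample text is an obviously made-up placeholder.

```python
import re

ssn = re.compile(r"[0-9][0-9][0-9][-][0-9][0-9][-][0-9][0-9][0-9][0-9]")

# Hypothetical text: one dashed number and one run of nine digits.
text = "On file: 123-45-6789 (entered without dashes as 123456789)"

ssn_hits = ssn.findall(text)
```

Only the dashed form is found. The pattern checks the layout of the text, not whether the number is a valid social security number.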

ZIP+4 code

A ZIP+4 code is a postal code used by the United States to provide a very specific mailing address location. The pattern consists of five digits, followed by a dash or space, followed by four more digits: \d\d\d\d\d[-\s][0-9][0-9][0-9][0-9]
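The ZIP+4 pattern can be tried against a few invented inputs with this Python sketch. fullmatch is used so that the whole string must fit the pattern.

```python
import re

zip4 = re.compile(r"\d\d\d\d\d[-\s][0-9][0-9][0-9][0-9]")

# Hypothetical inputs: two well-formed codes and two malformed ones.
inputs = ["12345-6789", "12345 6789", "123456789", "1234-56789"]
checks = {s: zip4.fullmatch(s) is not None for s in inputs}
```

The first two inputs pass and the last two fail. Keep in mind that \s accepts any white space character, so a tab between the two groups would pass as well.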

Euro price

This example shows how to find a price expressed in euros in a text string.

[\u20AC]\d\d\.[0-9][0-9]

This pattern looks for the euro symbol, followed by two digits, a decimal point, and two more digits.
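Here is a short Python sketch of the euro price pattern. The sentence is invented, and the euro symbol in it is written as an escape to keep the source ASCII-only.

```python
import re

euro_price = re.compile(r"[\u20AC]\d\d\.[0-9][0-9]")

# Hypothetical text with one two-digit price and one single-digit price.
text = "Lunch was \u20ac12.50 and coffee was \u20ac3.20"

prices = euro_price.findall(text)
```

Only the first price is found. The second has a single digit before the decimal point, which shows the limit of fixed-length patterns and why matching varying lengths is useful.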

Summary

In this chapter, we covered literal characters and metacharacters. Literal characters match exactly, while metacharacters have special meaning. The metacharacters we covered are described in Table 4.

Table 4: Meta-characters

| Character | Description |
| --- | --- |
| [] | Provide a list of possible characters to match |
| \a | Bell character |
| \b | Backspace (when between brackets) |
| \t | Tab character |
| \r | Carriage return |
| \n | New line |
| \f | Form feed |
| \v | Vertical tab |
| \xNN | ASCII character NN |
| \cX | Control character |
| \uXXXX | Unicode character |
| ^ | Negates characters within brackets |
| \d | Any single digit |
| \D | Any non-digit |
| \w | Alphanumeric (word character) |
| \W | Not a word character |
| \s | White space |
| \S | Non white space |
| . | Matches anything (except linefeed) |
