Regular Expressions Succinctly®
by Joseph D. Booth

CHAPTER 1

Text Processing


Computers are very good at dealing with structured data, but they cannot (yet) come anywhere close to how well people deal with text. For example, suppose the following text were recorded in a help desk system:

Hi,

I called Jon on Tuesday, March 25th at 7pm and expressed a concern about my slow times accessing www.cnn.com. He said he would fix it, but I never heard back. Can someone contact me at Kellie.Booth@if.com ASAP?   What does Ctrl-F5 mean, by the way?

Thanks

Kellie

Without much effort, a person can read that text and easily find the person Kellie talked to, the problem website, the date, and the email address. However, such a simple task for a person is rather difficult for a computer. What happens if the date were written as Mar-25 instead? Or if the website didn’t include the “www” prefix? Or if it included “http:” in the address?

Patterns

How do people do it so well? Essentially, we understand patterns and can easily handle missing information. We know a date may or may not have the day as part of it, and a month name might be spelled out or might be abbreviated. We know that a website typically is a word or more (without spaces), followed by a period and a common abbreviation (com, net, org, etc.).
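To make these rules concrete, here is a minimal Python sketch of the two patterns just described: a month name that may be abbreviated and may or may not be followed by a day, and a website made of one or more words followed by a period and a common abbreviation. The choice of Python and the names MONTH, DATE, and SITE are our own assumptions for illustration. These are rough first attempts, not finished patterns, and they will miss plenty of real-world variations.

```python
import re

# Month names, spelled out or abbreviated (illustrative, English only).
MONTH = (
    r"Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|"
    r"Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?"
)

# A month, optionally followed by a space or hyphen and a day such as 25 or 25th.
DATE = re.compile(r"\b(" + MONTH + r")\b(?:[ -](\d{1,2})(?:st|nd|rd|th)?\b)?")

# An optional http:// or https://, then one or more dot-separated words
# ending in com, net, or org. The lookbehind skips the domain part of
# an email address such as someone@if.com.
SITE = re.compile(
    r"(?<![@\w.-])(?:https?://)?((?:[A-Za-z0-9-]+\.)+(?:com|net|org))\b"
)

for text in ("Tuesday, March 25th at 7pm", "Mar-25", "sometime in March"):
    print(DATE.search(text).groups())

for text in ("www.cnn.com", "http://cnn.com", "livedates.net after 7:30PM"):
    print(SITE.search(text).group(1))
```

Each rule we stated in English became one small piece of a pattern. The day is wrapped in an optional group because we know it may be missing, and the “http:” prefix is optional for the same reason.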

Constructing a regex pattern is a matter of understanding the rules of what some text looks like and then describing those rules in the regex pattern language. Throughout the rest of this book, we are going to work with the above example, a help system that contains a date when a message was received and the text contents of the message. We will explore using regex patterns to parse these text messages and extract dates, times, URLs, email addresses, etc.

Here is another example message from our help system:

Hey jerks,

I called on Mar 15, spoke to Suzie, and complained about not be able to access livedates.net after 7:30PM. What the heck is up? Get back to me, [email protected]. Or call me at 619-555-1212. Hopefully, somebody will read this…

Very upset, Fred!

Our ongoing task through this book will be to use regular expressions to figure out how to read these messages, and determine when the call was made, who the call was made to, what website they are having issues with, and their email address or phone number.

Building Patterns

As we work through the book, we will see a variety of ways that a regex pattern can match a result. The best way to build your pattern is to step away from the computer and write out, in English, the steps you follow to recognize the pattern. Once you have described your search rules in English, you convert those rules to the regex patterns that match the logic you would employ to search for the text yourself.

For example, let’s say we are looking for price information in an online ad. We want to find a car for sale, but we don’t want to spend more than $10,000. Here are the rules we might employ in English to find the price.

Table 1: Pattern Rules

Rule                                   Regex Pattern
Match all text up to a $ symbol.       .*\$
Now we need some numbers.              \d+
There may be a comma.                  ,?
And some more numbers.                 \d+

While you employ these steps without much thought, you’ll need to figure them out in order to convert them to the regex pattern language to teach the computer what you do intuitively.

Note: Don’t worry about the regex pattern side yet; those characters will make much more sense in a few chapters.
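For readers who want to see the rules from Table 1 in action right away, here is a minimal Python sketch that joins the pieces into one pattern, with the final piece written as `\d+`. The function name find_price, the sample ad text, and the parentheses that capture the number are our own additions for illustration. This is a rough sketch under those assumptions, not a robust price parser.

```python
import re

# The pieces from Table 1 joined in order: .*\$  \d+  ,?  \d+
# The parentheses (our addition) capture the digits after the $ symbol.
PRICE = re.compile(r".*\$(\d+,?\d+)")

def find_price(ad_text):
    """Return the price as an int, or None when no price is found."""
    match = PRICE.search(ad_text)
    if match is None:
        return None
    # Drop the optional comma before converting to a number.
    return int(match.group(1).replace(",", ""))

ad = "2012 Honda Civic, low miles, asking $9,500 or best offer"
price = find_price(ad)
print(price)            # 9500
print(price <= 10_000)  # True, so the car is within our budget
```

Note that the pattern as written expects at least two digits after the $ symbol and at most one comma, so a price such as $5 would be missed. Gaps like this are exactly what writing the rules out in English first helps you notice.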

Our Help Desk

For the sake of our example, we’ll assume our help desk is staffed by three people: John, Susan, and Bill. SMS text messages or emails sent to the help system are automatically logged to a database table holding only two fields: the date and time the message was received, and the text of the message. The company wants to start reporting a summary of these text messages: to whom the message was sent and when, the recipient’s contact information, and other relevant information.

Figure 1: Help Desk Workflow

A person can easily read the messages from the help desk messages table and extract the information into the help desk reporting table. As we work through this book, our aim is to be able to extract most of the information from the help desk messages table to create an entry in the reporting table.
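As a rough picture of where we are headed, the following Python sketch models the two tables as simple records and fills a reporting entry from one message. Every name here (HelpDeskMessage, ReportEntry, to_report, STAFF_ALIASES) is hypothetical, the timestamp is made up, and treating “Jon” and “Suzie” as John and Susan is our own assumption. The patterns are deliberately simple placeholders.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class HelpDeskMessage:
    """One row of the messages table: just two fields."""
    received: datetime
    text: str

@dataclass
class ReportEntry:
    """One row of the reporting table (hypothetical columns)."""
    received: datetime
    staff: Optional[str]
    email: Optional[str]
    phone: Optional[str]

# Assumed spellings customers might use for the three staff members.
STAFF_ALIASES = {"John": "John", "Jon": "John", "Susan": "Susan",
                 "Suzie": "Susan", "Bill": "Bill"}
STAFF = re.compile(r"\b(" + "|".join(STAFF_ALIASES) + r")\b")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

def first_match(pattern, text):
    """Return the first matching text, or None when nothing matches."""
    match = pattern.search(text)
    return match.group(0) if match else None

def to_report(msg):
    """Build a reporting entry from a raw help desk message."""
    name = first_match(STAFF, msg.text)
    return ReportEntry(
        received=msg.received,
        staff=STAFF_ALIASES.get(name),
        email=first_match(EMAIL, msg.text),
        phone=first_match(PHONE, msg.text),
    )

msg = HelpDeskMessage(
    received=datetime(2014, 3, 26, 9, 0),  # made-up timestamp
    text="I called Jon on Tuesday, March 25th at 7pm. "
         "Can someone contact me at Kellie.Booth@if.com ASAP?",
)
print(to_report(msg))
```

Fields that cannot be found are left as None, which mirrors our stated aim of extracting most, rather than all, of the information.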
