Regular Expressions Succinctly®
by Joseph D. Booth

CHAPTER 7

Searching Summary

In the previous few chapters, we covered a number of regex patterns for finding strings in text. In this chapter, we work through a few examples that use those patterns to solve text processing problems.

Help System Parser

In the beginning of the book, we described the help system we wanted to process messages from. We can now use our various regex patterns to create an object to process those help messages. The purpose of the object is to take a text message and fill the following properties if they can be extracted:

  • AssignedTo: Help desk rep who took the message.
  • UserDate: Date the user indicated they called.
  • UserTime: Time of the call.
  • EmailAddr: Email address, if one is found.
  • PhoneNumber: Phone number, if found.
  • ComplainURL: Website the user complained about.

To build the parser, we first create the class and declare a few variables.

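using System.Collections.Generic;
using System.Text.RegularExpressions;

public class HelpDeskParser
{
     // Regex patterns, keyed by the name of the field each one fills.
     private Dictionary<string, string> RegExprList = new Dictionary<string, string>();

     public string AssignedTo;
     public string UserDate;
     public string UserTime;
     public string EmailAddr;
     public string PhoneNumber;
     public string ComplainURL;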

We are keeping the class very simple, using public variables rather than properties. In our constructor, we load the list of regular expressions. It is often a good idea to keep the regular expressions in one spot, in this case a dictionary, so they are easier to update or enhance as testing progresses.

     public HelpDeskParser()
     {
         // Load the patterns into the RegExprList.
         RegExprList.Add("ASSIGNEDTO", "(Joh{0,1}n|Sue|Su[sz]ie|[BW]ill)");
         RegExprList.Add("USERDATE", @"((sun(day)?|mon(day)?|tue(sday)?|wed(nesday)?|thu(rsday)?|fri(day)?|sat(urday)?),?\s)?((jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|jun(e)?|jul(y)?|aug(ust)?|sep(tember)?|oct(ober)?|nov(ember)?|dec(ember)?)\s)+((0?[1-9]|[1-2][0-9]|3[0-1]),?(\s|st|nd|rd|th))+([1-2][0-9][0-9][0-9]){0,1}");
         // The hour is 1-12, with an optional leading zero on single digits.
         RegExprList.Add("USERTIME", @"\b(1[0-2]|0?[1-9])((:[0-5][0-9]){0,1}\s{0,}[aApP][mM])\b");
         RegExprList.Add("EMAILADDRESS", @"\b([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)\b");
         // The word boundary sits after the optional area code so that "(215) 555-1212" is captured whole.
         RegExprList.Add("PHONENUMBER", @"((\(\d{3}\) ?)|(\b\d{3}-))?\b\d{3}-\d{4}\b");
         // \b ends the match cleanly after the top-level domain.
         RegExprList.Add("COMPLAINURL", @"(www\.){0,1}\w{1,}\.(net|com|org|edu|info)\b");
     }

Finally, we include a method that takes the message string and extracts whatever pieces it can find. There is not much code in the method; the regex engine does most of the work.

     public bool ExtractHelpDeskInfo(string msg)
     {
         // Clear out previous values.
         AssignedTo  = "";
         UserDate    = "";
         UserTime    = "";
         EmailAddr   = "";
         PhoneNumber = "";
         ComplainURL = "";

         // Try each pattern in turn and store its first match in the matching field.
         foreach (KeyValuePair<string, string> entry in RegExprList)
         {
             Match theMatch = Regex.Match(msg, entry.Value, RegexOptions.IgnoreCase);
             if (theMatch.Success)
             {
                 switch (entry.Key)
                 {
                     case "ASSIGNEDTO":   AssignedTo  = theMatch.Value; break;
                     case "USERDATE":     UserDate    = theMatch.Value; break;
                     case "USERTIME":     UserTime    = theMatch.Value; break;
                     case "EMAILADDRESS": EmailAddr   = theMatch.Value; break;
                     case "PHONENUMBER":  PhoneNumber = theMatch.Value; break;
                     case "COMPLAINURL":  ComplainURL = theMatch.Value; break;
                     default: break;
                 }
             }
         }
         return true;
     }
}

While there are improvements we can make (and will in later chapters), this simple class illustrates how powerful regular expressions can be when dealing with large amounts of unstructured text.
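The dictionary-driven loop can be tried on its own with a cut-down sketch. The class name HelpDeskDemo, the Extract() method, and the simplified phone pattern are assumptions made for this illustration only; the ASSIGNEDTO pattern is the one from the constructor above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical, cut-down version of the parser: two patterns, same loop.
public static class HelpDeskDemo
{
    private static readonly Dictionary<string, string> Patterns =
        new Dictionary<string, string>
        {
            // Taken from the chapter's constructor.
            { "ASSIGNEDTO", "(Joh{0,1}n|Sue|Su[sz]ie|[BW]ill)" },
            // Simplified for this sketch: only the 999-999-9999 form.
            { "PHONENUMBER", @"\b\d{3}-\d{3}-\d{4}\b" }
        };

    // Returns the first match for each pattern that is found in the message.
    public static Dictionary<string, string> Extract(string msg)
    {
        var found = new Dictionary<string, string>();
        foreach (KeyValuePair<string, string> entry in Patterns)
        {
            Match m = Regex.Match(msg, entry.Value, RegexOptions.IgnoreCase);
            if (m.Success)
            {
                found[entry.Key] = m.Value;
            }
        }
        return found;
    }
}
```

Calling HelpDeskDemo.Extract("Sue took a call; the user can be reached at 215-555-1212.") fills ASSIGNEDTO with Sue and PHONENUMBER with 215-555-1212. A key whose pattern finds nothing is simply absent from the result, which is one possible alternative to the empty strings used by the full class.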

Named Entity Parsing

Natural language processing (NLP) is a branch of artificial intelligence concerned with making a computer appear to understand natural-language text, such as English. The complexities of the English language make this an extremely difficult task for computers to handle. One component of NLP is extracting “named entities” from text; named entities could be phone numbers, currency amounts, city and state names, people, etc. Extraction typically relies on lists, grammar rules, and often some regular expressions.

For example, if the following ad were found online, the named entities we want to extract might be the year (2011), the type of car (Honda Civic), and the price ($10,300).

2011 Honda Civic excellent condition color BLUE 56K miles for questions ask for George 215 555 1212 from 8AM TO 5PM THANK YOU... Firm $10,300.

To find the type of car, you would most likely keep a list of car makes and models and do a simple string search for each one. The year and price, however, can easily be handled with regex patterns.

By combining the list lookup and the regex patterns, writing a program to “read” online classified ads and report on year, type of car, and price would not be very difficult.

An example CarSearch class is illustrated below:

    using System;
    using System.Text.RegularExpressions;

    public class CarSearch
    {
        private string[] CarMakes = { "ford", "honda", "toyota", "pontiac" };

        public string CarMake;
        public int? Year;
        public double? Price;

        // A price such as $950 or $10300, or one with a thousands separator such as $10,300.
        private Regex FindPrice = new Regex(@"\$\d{3,5}|\$\d{1,2},\d{3}");
        // A four-digit year from 2000 on, or a two-digit year written as '11.
        private Regex FindYear = new Regex(@"\b20\d\d\b|'\d\d\b");

        public Boolean ReadAd(string AdText)
        {
            Boolean FoundCar = false;
            CarMake = "";
            Year = null;
            Price = null;

            // Search for a car make in the ad text.
            foreach (string Car in CarMakes)
            {
                string CarPattern = @"\b" + Regex.Escape(Car) + @"\b";
                if (Regex.IsMatch(AdText, CarPattern, RegexOptions.IgnoreCase))
                {
                    CarMake = Car;
                    FoundCar = true;
                    break;
                }
            }

            // Try to get year and price.
            if (FoundCar)
            {
                // Can we find a price?
                Match PriceMatch = FindPrice.Match(AdText);
                if (PriceMatch.Success)
                {
                    Price = Convert.ToDouble(PriceMatch.Value.Replace(",", "").Replace("$", ""));
                }

                // Can we find the year? A two-digit year such as '11 becomes 2011.
                Match YearMatch = FindYear.Match(AdText);
                if (YearMatch.Success)
                {
                    Year = Convert.ToInt32(YearMatch.Value.Replace("'", "20"));
                }
            }
            return FoundCar;
        }
    }

The class provides a method called ReadAd(), which takes the ad content as a parameter. It searches for one of several makes, and if a make is found, it attempts to extract the price and the year (2000 or later) of the car.

Note: The ? after the int and double type declarations allows us to place a null value in those fields, which is more appropriate when we don’t find a match in the message text. It is shorthand for Nullable<int> and Nullable<double>.
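To see the year and price patterns in isolation, here is a minimal sketch that applies them to the sample ad. The AdEntities class and its two method names are hypothetical, and the year pattern adds word boundaries so that a mileage figure such as 120000 is not read as the year 2000; treat it as an illustration rather than a finished parser.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper that isolates the two regex-based lookups.
public static class AdEntities
{
    private static readonly Regex FindPrice =
        new Regex(@"\$\d{3,5}|\$\d{1,2},\d{3}");
    private static readonly Regex FindYear =
        new Regex(@"\b20\d\d\b|'\d\d\b");

    // Returns the first year found, or null when the ad has none.
    public static int? ExtractYear(string adText)
    {
        Match m = FindYear.Match(adText);
        if (!m.Success) return null;
        // '11 becomes 2011; a four-digit year is unchanged.
        return int.Parse(m.Value.Replace("'", "20"), CultureInfo.InvariantCulture);
    }

    // Returns the first price found, or null when the ad has none.
    public static double? ExtractPrice(string adText)
    {
        Match m = FindPrice.Match(adText);
        if (!m.Success) return null;
        string digits = m.Value.Replace("$", "").Replace(",", "");
        return double.Parse(digits, CultureInfo.InvariantCulture);
    }
}
```

For the Honda Civic ad shown earlier, ExtractYear() returns 2011 and ExtractPrice() returns 10300. For an ad with neither a year nor a price, both return null.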

Validation

One caveat with regular expressions is that they will sometimes find strings that fit the pattern but are not actually valid. While regex searching by itself is a very useful tool, you should also check whether the returned string is valid, not just that it matches a pattern. Let’s look at some examples.

Credit Cards

The following regex will identify most credit card numbers (the Visa, MasterCard, and American Express formats):

^((4\d{3})|(5[1-5]\d{2}))(-?|\040?)(\d{4}(-?|\040?)){3}|^(3[47]\d{2})(-?|\040?)\d{6}(-?|\040?)\d{5}

However, a value that matches the pattern is not necessarily a valid credit card number. Credit card numbers include a check digit, computed with the Luhn algorithm, so we can confirm not only that the pattern looks right but also that the digits form a valid combination. The following C# function performs the checksum and returns a Boolean value indicating whether the credit card number is valid.

public bool CC_Check(string CCcard)
{
     // Strip the separators the regex allows, then build an array of digits.
     CCcard = CCcard.Replace("-", "").Replace(" ", "");
     int[] digits = new int[CCcard.Length];
     for (int len = 0; len < CCcard.Length; len++)
     {
         digits[len] = Int32.Parse(CCcard.Substring(len, 1));
     }

     // Luhn algorithm: walk from the rightmost digit, doubling every second digit.
     int RunningTotal = 0;
     bool DoubleDigit = false;
     int nextDigit;
     for (int i = digits.Length - 1; i >= 0; i--)
     {
         nextDigit = digits[i];
         if (DoubleDigit)
         {
             nextDigit *= 2;
             // A two-digit product is reduced to the sum of its digits.
             if (nextDigit > 9) { nextDigit -= 9; }
         }
         RunningTotal += nextDigit;
         DoubleDigit = !DoubleDigit;
     }

     // Must be divisible by 10.
     return RunningTotal % 10 == 0;
}

If you need to validate credit card numbers, use the regex in combination with the CC_Check() function to ensure you’ve got a valid card number.

Note: Even after both checks, the number may be valid but never issued. Confirming that a card has actually been issued would require some sort of API communication with the credit card issuer.
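A minimal sketch of the two checks working together is shown here. The CardValidator class and its method names are hypothetical; the pattern is the chapter's, with the American Express prefix written as 3[47], and the card numbers mentioned afterward are widely published test numbers, not real accounts.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper: pattern check first, Luhn checksum second.
public static class CardValidator
{
    private static readonly Regex CardPattern = new Regex(
        @"^((4\d{3})|(5[1-5]\d{2}))(-?|\040?)(\d{4}(-?|\040?)){3}|^(3[47]\d{2})(-?|\040?)\d{6}(-?|\040?)\d{5}");

    // Luhn checksum over a string that contains only ASCII digits.
    public static bool PassesLuhn(string digits)
    {
        int total = 0;
        bool doubleIt = false;
        for (int i = digits.Length - 1; i >= 0; i--)
        {
            int d = digits[i] - '0';
            if (doubleIt)
            {
                d *= 2;
                if (d > 9) d -= 9;
            }
            total += d;
            doubleIt = !doubleIt;
        }
        return total % 10 == 0;
    }

    // True only when the input fits the pattern and the checksum holds.
    public static bool IsPlausibleCard(string input)
    {
        Match m = CardPattern.Match(input);
        if (!m.Success) return false;
        string digits = m.Value.Replace("-", "").Replace(" ", "");
        return PassesLuhn(digits);
    }
}
```

With this sketch, 4111-1111-1111-1111 passes both checks, while 4111-1111-1111-1112 fits the pattern but fails the checksum, which is exactly the case the regex alone cannot catch.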

EAN-8 Number

The EAN-8 number is a simple 8-digit number, typically used with a barcode on smaller products. The regex is trivial, since an EAN-8 number is just 8 digits: \d{8}. This is a prime example of a regular expression that will very likely find many more matches than just EAN-8 barcodes. Fortunately, the last digit is a checksum; the code below is a C# function that validates it.

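        public bool EAN8_check(string EAN8)
        {
            // Create an array of digits from the code.
            int[] digits = new int[8];
            int theSum = 0;
            int lastDigit;

            for (int x = 0; x <= 7; x++)
            {
                digits[x] = Int32.Parse(EAN8.Substring(x, 1));
                // The 1st, 3rd, 5th, and 7th digits carry a weight of 3.
                if ((x % 2) == 0)
                {
                    digits[x] *= 3;
                }
                // Sum the first seven weighted digits; the eighth is the check digit.
                if (x < 7) { theSum += digits[x]; }
            }

            lastDigit = digits[7];
            // The weighted sum plus the check digit must be divisible by 10.
            return ((theSum + lastDigit) % 10 == 0);
        }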
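As a rough illustration of using the pattern as a first pass and the checksum as the filter, the sketch below scans free text for 8-digit runs and keeps only those whose check digit holds. The Ean8Finder class, its method names, and the sample sentence are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: regex finds candidates, checksum filters them.
public static class Ean8Finder
{
    public static bool HasValidChecksum(string code)
    {
        if (code.Length != 8) return false;
        int sum = 0;
        for (int i = 0; i < 8; i++)
        {
            int d = code[i] - '0';
            if (d < 0 || d > 9) return false;   // not an ASCII digit
            // Digits 1, 3, 5, and 7 are weighted 3; the rest are weighted 1.
            sum += (i % 2 == 0) ? d * 3 : d;
        }
        return sum % 10 == 0;
    }

    // Returns every standalone 8-digit run that passes the checksum.
    public static List<string> FindCodes(string text)
    {
        var codes = new List<string>();
        foreach (Match m in Regex.Matches(text, @"\b\d{8}\b"))
        {
            if (HasValidChecksum(m.Value))
            {
                codes.Add(m.Value);
            }
        }
        return codes;
    }
}
```

In the text "Order 12345678 ships with item 73513537." both numbers match \d{8}, but only 73513537 survives the checksum, so FindCodes() returns that single value.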

Email

Email is often validated using regular expressions; a quick search for regex email validation will yield dozens of patterns. However, most patterns are either too broad (e.g., \w[-._\w]*\w@\w[-._\w]*\w\.\w{2,3}) or too strict, rejecting some valid addresses. For example, many expressions don’t accept a + sign in the email address, even though it is a valid character. Email addresses are defined by RFC (Request for Comments) 5321 (and others), so a true regex pattern to validate email should handle that standard. You can view such a pattern at the website http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html. However, the pattern is over 6,000 bytes long.

Note: The RFC-based pattern at the link above works only in Perl, not in .NET, but it illustrates how complex a pattern to validate email can become.

If you are creating an application that requires email validation, you should visit Dominic Sayers’s website, http://www.isemail.info/, which provides a great email validator that handles addresses many regex patterns don’t. Of course, it is probably better to accept a questionable email address from a potential customer than to reject it with an error message and lose that customer.
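The + sign problem is easy to reproduce with the EMAILADDRESS search pattern from the help desk parser earlier in this chapter. The sketch below is only an illustration; the EmailSearchDemo class and the FirstEmail() method are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo of how a stricter pattern treats a + address.
public static class EmailSearchDemo
{
    // The EMAILADDRESS pattern from the HelpDeskParser constructor.
    private const string EmailPattern =
        @"\b([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)\b";

    // Returns the first substring that matches, or "" when nothing does.
    public static string FirstEmail(string text)
    {
        Match m = Regex.Match(text, EmailPattern, RegexOptions.IgnoreCase);
        return m.Success ? m.Value : "";
    }
}
```

Calling FirstEmail("Reply to first+tag@example.com today") returns tag@example.com: because + is not in the allowed character set, the search silently drops the first part of a valid address instead of reporting a failure. This is one more reason to treat a regex match as a candidate to be checked rather than as proof of validity.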

Summary

In the first half of this book, we focused on how to search with regular expressions and covered a fair amount of what can be done with the regex literal characters and metacharacters. It is important to keep in mind that while regex is a powerful tool for searching text, you should also consider additional validations, such as checksums and API lookups. You can use the regex to determine whether a string is worth running a checksum against or sending to an API, but don’t rely on it alone to confirm that the data is valid.

In the next section of the book, we are going to explore some additional regex features, and while we continue to focus on searching, we will also begin to explore what can be done with the resulting string or group(s) that a regex finds. For example, we might want some code that not only extracts phone numbers, but also slices each phone number into area code, main number, and possible extension. So if you feel comfortable using regex patterns for searching, read on to see what else can be done.
