CHAPTER 13
A popular saying among computer professionals is “garbage in, garbage out,” which simply means that the output of a computer application can only be as good as the data put in. People are very flexible when it comes to understand text in various forms, while computer system generally work better with structured data. For example, it is easy to recognize the area code portion of a phone number written in any of the following manners.
Often, programmers will be asked to “clean-up” or normalize data coming into a system. Regular expressions can be a very useful tool in a programmer’s toolbox when faced with such a task.
We are going to combine our group pattern with the replace to extract the area code and number from any of the phone number strings and place into a consistent format. Our phone number regex is described below:
Table 17: Phone Number Rules
English rule | Regex pattern |
Might begin with a one and some punctuation | (?:1[-\s.])? |
Possibly a left parenthesis | (\)? |
The areacode | (?<areacode>\d{3}) |
If group 1 has left parenthesis, get right | (?(1)\)) |
Space, dash, slash, or period | [-\s.\/]? |
Optional space | \s? |
Phone number (split into two groups) | (?<prefix>\d{3})[-\s.] (?<number>\d{4}) |
The resulting pattern is:
(?:1[-\s.])?(\()?(?<areacode>\d{3})(?(1)\))[-\s.\/]?\s?(?<prefix>\d{3})[-\s.](?<number>\d{4}) |
This pattern will handle each of the phone format examples from above. For any successful match, we will have the area code, the prefix, and the four digit number. We can now apply the replace method to create a consistently formatted phone number. Our goal is to produce the following strings from the above phone numbers.
Our code will create a regex object and assign the pattern to the object. We then loop through our phone numbers, building replacement, and nicely formatted strings.
string[] PhoneNumbers = { "1-215-555-1212", "(610) 867-5309", "484/ 555-1234", "203.514.2213" }; string CleanedPhone = ""; Regex thePhone = new Regex(@"(?:1[-\s.])?(\()?(?<areacode>\d{3})(?(1)\))[-\s.\/]?\s?(?<prefix>\d{3})[-\s.](?<number>\d{4})"); Regex FinalFormat = new Regex(@"(\d{3}) \d{3}-\d{4}"); foreach (string OnePhone in PhoneNumbers) { CleanedPhone = thePhone.Replace(OnePhone, "(${areacode}) ${prefix}-${number}"); if (CleanedPhone==OnePhone) { // Nothing changed. if (FinalFormat.IsMatch(OnePhone)) { // Phone number is already formatted as expected. } else { // The phone number string we got didn't match pattern and is not // already in our final format, should probably log for manual review. } } else { // Update the phone number database. } } |
If a phone number looks the same after the replace, it is either already in the correct format (which we check for) or cannot be converted, so we should probably review the output. Most likely, when you run this type of clean-up, you’ll get some non-matched input phone numbers. By reviewing the list, and tweaking the pattern, you can probably achieve a pretty high ratio of phone number clean-ups.
Names are notoriously difficult to handle, because of the tremendous variety of names. We are going to look at a simple name parser. It will handle common titles, such as Dr, Miss, Mrs, etc. It will also get first names and last names that are either full words or begin with an uppercase letter followed by an apostrophe. This is by no means a comprehensive name checking, but it could have a reasonable number of names. The pattern is as follows:
^(?<title>Mrs|Mrs\.|Mr|Mr\.|Miss|Ms|Dr|Dr\.)\s?(?<first>\w+)\s+(?<last>([A-Z]')?\w+)?$
When we process the names, we are going to break the name into three fields and build a SQL update statement to update the record in a MySQL database table. Our sample data looks like this:
For simplicity, we will load out list of names into an array of strings and build our SQL statement from that list.
string[] ListOfNames = {"Mr. Joe Booth", "Mr Steve Boatman", "Dr. Peter Deitz", "Dr John Hoffler", "Mrs. Janet Green", "Mrs Jaspreet Kaur", "Ms Kellie Helene", "Miss Mary Bachman", "Mr. Joe D'Angelo" }; StringBuilder sb = new StringBuilder(); Regex NameParse = new Regex(@"^(?<title>Mrs|Mrs\.|Mr|Mr\.|Miss|Ms|Dr|Dr\.) \s?(?<first>\w+)\s+(?<last>([A-Z]')?\w+)?$", RegexOptions.IgnoreCase | RegexOptions.Multiline); foreach (string FullName in ListOfNames) { sb.AppendLine(NameParse.Replace(FullName, "UPDATE PeopleTable " + "SET firstName = `${first}`,LastName=`${last}`,Title=`${title}` " + "WHERE SourceName = `$0`")); } // The string builder variable sb now contains the SQL to perform the updates. |
Although regular expressions are a powerful tool, they are not the only tool available. Often, a programmer will be tasked with reading data from a comma separated file (CSV), and regular expressions seem like great way to go, until you start to try it. The rules for the CSV file layout are more complex than they seem at first glance. For example, when should quotes be used? How are imbedded commas handled?