CHAPTER 15
A complicated regex pattern that is processing a large amount of text can be quite a resource hog, both in terms of memory and processing time. In this chapter, we'll consider a few things you can do to speed up your expressions. Keep in mind that the more you know about the data you are searching through, the better you'll be able to optimize your regex performance.
Also, keep in mind that regex processing is very fast, and these optimizations will most likely not be noticeable unless you are trying to save milliseconds. If you spend a few hours optimizing a regex pattern that is used once a month, it will take a long time for those saved milliseconds to recoup your development effort. Consider these optimizations as you design your regex; then you won't need to come back and revisit the code.
Although the processing difference is minuscule, it makes sense to put the more likely occurrences at the beginning of an alternation list. For example, if you are writing a regex to validate top-level domains for websites, you should list com, then net, then org, and so on. This is a general rule of thumb for any sequential search: put the most likely matches at the beginning of the list.
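Here is a minimal C# sketch of this idea, assuming a .NET project using System.Text.RegularExpressions; the pattern and the set of top-level domains are illustrative, not a complete validator.

using System;
using System.Text.RegularExpressions;

class AlternationOrder
{
    static void Main()
    {
        // The most common endings (com, net, org) come first in the alternation,
        // so the engine usually succeeds on one of the early alternatives.
        var domain = new Regex(@"^[a-z0-9-]+\.(?:com|net|org|edu|gov|io)$",
                               RegexOptions.IgnoreCase);

        Console.WriteLine(domain.IsMatch("stackoverflow.com")); // True
        Console.WriteLine(domain.IsMatch("example.xyz"));       // False
    }
}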
Lazy quantifiers can potentially run quite a bit faster than greedy quantifiers if the search string is relatively small and close to the beginning of the text. For example, if we use the greedy pattern <H3>.*</H3> to look for tags in a large HTML document, the .* will scan to the end of the entire document before backtracking to find the closing tag. Trying to find an <H3> pattern in the source code of my Stack Overflow profile took about 8 milliseconds using the greedy * quantifier. Using the lazy pattern <H3>.*?</H3>, the same search ran in less than 0.5 milliseconds.
Of course, the reverse holds true if you are searching for a pattern that you expect to span the entire string, or at least most of it. In that case, stick with the greedy quantifier, since a match is more likely to be found quickly by looking at the entire string first.
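The following C# sketch illustrates the difference; the test string and the Stopwatch timings are only illustrative, and the actual savings will depend on the size of your document.

using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

class GreedyVersusLazy
{
    static void Main()
    {
        // A stand-in for a large HTML document; the padding makes the size difference visible.
        string html = "<H3>First heading</H3>" + new string('x', 1_000_000) + "<H3>Last heading</H3>";

        var greedy = new Regex("<H3>.*</H3>", RegexOptions.Singleline);
        var lazy   = new Regex("<H3>.*?</H3>", RegexOptions.Singleline);

        var sw = Stopwatch.StartNew();
        Match g = greedy.Match(html);   // .* scans to the end, then backtracks to the last </H3>
        Console.WriteLine($"Greedy match length: {g.Value.Length}, {sw.Elapsed.TotalMilliseconds} ms");

        sw.Restart();
        Match l = lazy.Match(html);     // .*? stops at the first </H3> it reaches
        Console.WriteLine($"Lazy match length: {l.Value.Length}, {sw.Elapsed.TotalMilliseconds} ms");
    }
}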
If you are relying on groups for your processing, capture only the groups you actually plan to use in the expression.
(?: <pattern> )
The above syntax creates a non-capturing group for the <pattern> indicated. For example, consider the following pattern:
(?<first>\w+)\s(?:\w+)\s
Using this pattern to match the first two words of a sentence, capturing only the first and discarding the second, runs about a millisecond faster than capturing and naming the second group as well.
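A minimal C# sketch of this pattern follows; the sample sentence is just for illustration.

using System;
using System.Text.RegularExpressions;

class NonCapturingGroups
{
    static void Main()
    {
        // Capture the first word by name; match but do not capture the second word.
        var firstWord = new Regex(@"(?<first>\w+)\s(?:\w+)\s");

        Match m = firstWord.Match("The quick brown fox");
        if (m.Success)
        {
            Console.WriteLine(m.Groups["first"].Value);  // "The"
        }
    }
}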
If you turn on the Microsoft Compiled option (RegexOptions.Compiled), the regex will first be compiled and stored in the cache. The first time the regex is encountered, the regex engine compiles it and puts it into the cache; subsequent calls using the same regex pattern will run much more quickly.
The default cache size is 15 regular expressions. You can change it with the CacheSize property, a static property on the Regex class that can be set to 0 to clear the cache or to any other number to change the cache size.
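Here is a short C# sketch showing both settings; note that the static cache applies to the static Regex methods such as Regex.IsMatch, and the pattern itself is only a placeholder.

using System;
using System.Text.RegularExpressions;

class CompiledAndCached
{
    static void Main()
    {
        // RegexOptions.Compiled costs more up front but speeds up repeated matching.
        var word = new Regex(@"^\w+$", RegexOptions.Compiled);
        Console.WriteLine(word.IsMatch("optimize"));             // True

        // The static cache applies to static method calls such as Regex.IsMatch(input, pattern).
        Regex.CacheSize = 30;                                     // raise the cache from its default of 15 entries
        Console.WriteLine(Regex.IsMatch("optimize", @"^\w+$"));  // True; the parsed pattern is now cached

        Regex.CacheSize = 0;                                      // setting the size to 0 clears the cache
    }
}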
Backtracking is a process the regex engine uses to try alternative paths when the first branch fails. If you know that backtracking will not succeed, you can instruct the regex engine not to bother trying. For example, suppose we need a simple regex pattern to check whether a file contains valid HTML syntax, that is, whether the text begins with <html> and ends with </html>.
(<html>.*</html>)
If you use the pattern above, the regex engine will check that the first tag is <html> and the last tag is </html>. If not, the regex engine will try other paths through the text to see if it can find a match. However, we know that if the first attempt fails, the file is not valid for our purposes.
By adding the ?> non-backtracking (atomic group) syntax to our expression, it becomes the following:
(?><html>.*</html>)
This instructs the engine to give up if its first attempt fails, which is exactly what we want here: if the tags are not where we expect them, the file is not valid for our purposes anyway. On a good-sized text, I see a savings of 3 to 7 milliseconds by eliminating the backtracking.
Note: If you use the .* syntax, remember to include the Singleline option, since the . will normally not match line-feed characters, which will almost certainly be present in your text file.
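A minimal C# sketch of the atomic group, with the Singleline option applied, might look like the following; the sample strings are stand-ins for the contents of a file.

using System;
using System.Text.RegularExpressions;

class AtomicGroupCheck
{
    static void Main()
    {
        // Singleline lets . match the line feeds that a real file will contain.
        var validHtml = new Regex(@"(?><html>.*</html>)", RegexOptions.Singleline);

        string goodFile = "<html>\n<body>Hello</body>\n</html>";
        string badFile  = "<html>\n<body>Hello</body>\n";

        Console.WriteLine(validHtml.IsMatch(goodFile)); // True
        Console.WriteLine(validHtml.IsMatch(badFile));  // False: no closing tag, so the attempt fails
    }
}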
Another optimization is to use anchors where possible, particularly the start-of-string and end-of-string (or line) anchors. Anchors save the regex engine from scanning the entire string: if the anchors say the pattern must be found at the beginning or end, and it is not there, the engine can fail quickly without trying the pattern at every other position.
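For example, the following C# sketch compares an anchored and an unanchored pattern; the five-digit ZIP-code pattern is only an illustration.

using System;
using System.Text.RegularExpressions;

class AnchorCheck
{
    static void Main()
    {
        var anchored   = new Regex(@"^\d{5}$");
        var unanchored = new Regex(@"\d{5}");

        Console.WriteLine(anchored.IsMatch("90210"));           // True
        Console.WriteLine(anchored.IsMatch("zip is 90210"));    // False: fails immediately at the anchored position
        Console.WriteLine(unanchored.IsMatch("zip is 90210"));  // True, but only after scanning forward through the string
    }
}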
Optimization of regex processing is a habit you should get into from the start, rather than trying to optimize the regex after it has been tested. The better you know your data, the more likely you can apply optimizations that allow the regex to find (or fail to find) the result more quickly.