
Data Capture and Extraction with C# Succinctly®
by Ed Freitas


CHAPTER 4

Extracting Meaning from Text



Introduction

In previous chapters we’ve seen how text can be extracted from emails, screenshots, and the web. We have also examined how emails containing a particular set of words can trigger an automatic response, simulating (in a very basic and rudimentary way) that the context of the email was understood: given a predefined set of words, each email is checked for a match, and the response that correlates to the match found is sent back.

It is one thing to extract text, but quite another to understand what that text means in a given context. Extracting meaning from text is a broad field that falls mostly under Natural Language Processing (NLP), which includes subcategories such as Sentiment Analysis, Named Entity Recognition, and Relationship Extraction.

Sentiment Analysis consists of analyzing the set of words found within text and determining, based on particular rules, a polarity about a specific subject. For example, when analyzing the words in an email, Sentiment Analysis can be used to determine the tone of the email, e.g., polite, rude, adequate, offensive, negative, positive, etc. The extracted words are measured against standards that determine their sentiment.

Named Entity Recognition consists of determining how words map to proper names, i.e., places, people, organizations, etc. In some languages, capitalization also helps identify particular entities, and two or more words may together identify a single entity, e.g., Las Ventas (the bullring in Madrid, Spain).

Relationship Extraction consists of determining the relationship between two or more entities, e.g., which person is the main subject of a news article or who is married to whom.

NLP is not limited to these three areas. It has a very broad scope. These three are just a few of the most common categories in which NLP is widely used and applied. In this chapter we will focus on the first two: Sentiment Analysis and Named Entity Recognition.

Being able to use NLP to make sense of text is a great asset for any developer, and, from a business perspective, this knowledge can help companies streamline and improve business processes such as automation of personnel recruitment (resume/CV parsing and profile matching), classification and categorization, and email spam detection.

By the end of this chapter, you should be able to use some of these techniques to perform Naïve Bayes classification and Named Entity Recognition on any given text using C#.

Understanding contextualization

In order to categorize or classify items into particular categories, we must understand how particular words fit into a certain context. For our purposes, context is the part of a written or spoken statement that precedes or follows a specific word or passage, usually influencing its meaning or effect. Contextualization is the backbone of Sentiment Analysis, Text Classification and Categorization.

Text Classification and Categorization can be used to categorize a document or block of text. There are three commonly used techniques: Naïve Bayes classification, Support Vector Machines, and Latent Semantic Indexing. In this chapter we will only focus on the first.

Naïve Bayes is a machine learning technique that can predict to which category a particular data case belongs. It is accurate, robust, and relatively easy to implement, and is based on Bayes’ theorem, named after the British mathematician Thomas Bayes. In essence, we want to know P(A|X), usually read as “the probability of A given independent variable values X,” where X is one or more attributes. The term “naïve” indicates that all X attributes are assumed to be mathematically independent.

In order to visualize this more clearly, let’s look at it as a mathematical equation in which P represents probability.

P(A|X) = (P(X|A) * P(A)) / PP(X)

Imagine a scenario in which you have three groups of people (Groups 1, 2, and 3) involved in a project, each tackling a subset of it. Group 1 is working on 50% of the modules (attribute 1) and is producing a 3% flaw rate (attribute F). Group 2 is spending 30% of the effort on project management (attribute 2) and runs a 4% chance of failure (attribute F). Group 3 is doing the remaining 20% in support (attribute 3) and, because the project is mostly live, its chance of failing to find a solution to a reported problem is roughly 5% (exposure to a flaw, attribute F). Our question: what would be the partial probability of running into a flaw (attribute F) across all the groups (Groups 1, 2, and 3)?

The partial probability (PP) of running into a flaw (attribute F) in strict mathematical terms for all the groups would be as follows:

PP(X) = P(1)*P(F|1) + P(2)*P(F|2) + P(3)*P(F|3)

Where P(F|1) can be read as “the probability of a flaw given Group 1.”

PP(X) = (50%)*(3%) + (30%)*(4%) + (20%)*(5%) = 3.7%

PP(X) implies a 3.7% chance of running into a flaw across all the groups. This 3.7% represents PP in the Naïve Bayes equation.

However, the real power of Bayesian classification lies in going the other way: calculating the probability that a randomly selected flaw came from each of the groups. Using the Naïve Bayes equation, we get the following calculation for Group 1:

P(1|F) = ((3%) * (50%)) / 3.7% = 40.5%

For Group 2: P(2|F) = ((4%) * (30%)) / 3.7% = 32.4%

For Group 3: P(3|F) = ((5%) * (20%)) / 3.7% = 27%

This means that there is a 40.5% chance that a randomly selected flaw comes from Group 1, a 32.4% chance that it comes from Group 2, and a 27% chance that it comes from Group 3.
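These posterior figures are easy to verify in a few lines of C#. The sketch below is our own illustration (the FlawProbability class and its method names are hypothetical, not part of this book’s engine):

```csharp
using System;

// Hypothetical helper for verifying the worked example above.
public static class FlawProbability
{
    // PP(X): weighted sum of each group's flaw rate.
    public static double PartialProbability(double[] share, double[] flawRate)
    {
        double pp = 0;
        for (int i = 0; i < share.Length; i++)
            pp += share[i] * flawRate[i];
        return pp;
    }

    // Bayes: probability that a randomly selected flaw came from a group.
    public static double Posterior(double share, double flawRate, double pp)
    {
        return (flawRate * share) / pp;
    }

    public static void Main()
    {
        double[] share = { 0.50, 0.30, 0.20 };
        double[] flaw = { 0.03, 0.04, 0.05 };

        double pp = PartialProbability(share, flaw);
        Console.WriteLine("PP(X) = {0:0.0%}", pp);      // ≈ 3.7%

        for (int i = 0; i < share.Length; i++)
            Console.WriteLine("P({0}|F) = {1:0.0%}", i + 1,
                Posterior(share[i], flaw[i], pp));      // ≈ 40.5%, 32.4%, 27.0%
    }
}
```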

Let’s now work with the following data set in a database table.

Department | Gender | Age | Role
Finance    | Female | 32  | Assistant Controller
Finance    | Female | 36  | Senior Controller
Finance    | Male   | 46  | Finance Director
IT         | Male   | 40  | IT Manager
Finance    | Male   | 30  | Financial Lead

Table 4: Sample Data Table

If we want to determine the probability of finding a female (A) given attributes X (where X specifies Department = finance, Age = less than 40, Role = senior), the equation would look like this:

P(A|X) = [P(X|A) * P(A)] / PP(X)

In this case, our groups are male and female.

Breaking X into each attribute, we get the following:

P(A|X) = [P(finance | A) * P(<40 | A) * P(senior | A) * P(A)] / PP(X)

P(female | X) = [P(finance | female) * P(<40 | female) * P(senior | female) * P(female)] / [PP(female | X) + PP(male | X)]

Because PP(X) considers all groups, the denominator becomes PP(female | X) + PP(male | X). In this example, because the Gender column contains only two distinct values, the only two possible groups are female and male.

In the previous example, the individual probabilities were given to us directly. In this case, we need to determine each one from the counts in the table. Hence:

P(finance | female) = count(finance & female) / count(female)

P(<40 | female) = count(<40 & female) / count(female)

P(senior | female) = count(senior & female) / count(female)

P(female) = count(female) / count (gender)

PP(female | X) = P(finance | female) * P(<40 | female) * P(senior | female) * P(female)

Because we need to determine P(male | X), the same calculations made for female must be made for male for each attribute X.

P(finance | male) = count(finance & male) / count(male)

P(<40 | male) = count(<40 & male) / count(male)

P(senior | male) = count(senior & male) / count(male)

P(male) = count(male) / count(gender)

PP(male | X) = P(finance | male) * P(<40 | male) * P(senior | male) * P(male)

There is always a possibility that one of these calculations would give zero as a result. For example:

P(>50 | female) = count(>50 & female) / count(female)

Because there are no females above the age of 50 in our dataset, the result of this equation would be zero, and because the partial probabilities are multiplied together, a single zero joint count wipes out the entire result. To avoid this situation, you simply add 1 to all joint counts. Although this might look like a dubious trick, it has a solid mathematical basis and is known as add-one smoothing, a specific type of Laplacian smoothing.

With smoothing, the previous equations would appear as follows:

P(finance | female) = (count(finance & female) + 1) / (count(female) + 3)

P(>50 | female) = (count(>50 & female) + 1) / (count(female) + 3)

P(senior | female) = (count(senior & female) + 1) / (count(female) + 3)

Smoothing can be summarized and applied as follows (the 3 added to each denominator above is the number of X attributes):

P(X|A) = (count(X & A) + 1) / (count(A) + number of X attributes)

Notice that smoothing is only applied to calculations that involve counts.

As we have seen, logically organizing the extracted text content (words) is important. The data set presented in Table 4 was extracted as text and organized into a logical structure (a database table) precisely so that Naïve Bayes can be applied successfully.

Let’s recap here. Naïve Bayes can be calculated as follows:

P(female | X) = [PP(female | X)] / [PP(female | X) + PP(male | X)]

PP(female | X) = P(finance | female) * P(<40 | female) * P(senior | female) * P(female)

PP(male | X) = P(finance | male) * P(<40 | male) * P(senior | male) * P(male)

Where female and male are the groups (described as A) and finance, <40, and senior are attributes (described as X).
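To make these formulas concrete, here is the arithmetic with the counts from Table 4 plugged in (our own worked numbers, not from the original text; add-one smoothing is applied throughout, since count(senior & male) is zero):

P(finance | female) = (2 + 1) / (2 + 3) = 0.6

P(<40 | female) = (2 + 1) / (2 + 3) = 0.6

P(senior | female) = (1 + 1) / (2 + 3) = 0.4

P(female) = 2 / 5 = 0.4

PP(female | X) = 0.6 * 0.6 * 0.4 * 0.4 = 0.0576

P(finance | male) = (2 + 1) / (3 + 3) = 0.5

P(<40 | male) = (1 + 1) / (3 + 3) ≈ 0.3333

P(senior | male) = (0 + 1) / (3 + 3) ≈ 0.1667

P(male) = 3 / 5 = 0.6

PP(male | X) = 0.5 * 0.3333 * 0.1667 * 0.6 ≈ 0.0167

P(female | X) = 0.0576 / (0.0576 + 0.0167) ≈ 0.78

In other words, a person from Table 4 who works in finance, is under 40, and holds a senior role is roughly 78% likely to be female.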

With this theory behind us, let’s write a simple Naïve Bayes engine in C# that includes add-one smoothing. As input, the engine requires the same attributes (X) that the equation expects, together with the groups to be evaluated.

Code Listing 20: A Naïve Bayes Engine with Smoothing

// Bayes engine, with & without smoothing.
using System;
using System.Collections.Generic;

namespace TextProcessing
{
    public class BayesEngine : IDisposable
    {
        protected bool disposed;

        // Columns of the dataset: column name -> column values.
        public List<KeyValuePair<string, string[]>> dataset = null;

        // false (default): count substring matches; true: exact matches only.
        public bool StrictCounting { get; set; }

        public BayesEngine()
        {
            dataset = new List<KeyValuePair<string, string[]>>();
            StrictCounting = false;
        }

        ~BayesEngine()
        {
            this.Dispose(false);
        }

        protected bool ExistsAinXRow(int xRow, string A, string aCol = "")
        {
            bool res = false;

            if (dataset != null)
            {
                foreach (KeyValuePair<string, string[]> column in dataset)
                {
                    if (aCol == String.Empty ||
                        column.Key.ToUpper().Contains(aCol.ToUpper()))
                    {
                        if (StrictCounting)
                            res = column.Value[xRow].ToUpper() == A.ToUpper();
                        else
                        {
                            if (column.Value[xRow].ToUpper().Contains(" "))
                                res = column.Value[xRow].ToUpper().
                                      Contains(A.ToUpper());
                            else
                                res = column.Value[xRow].ToUpper() ==
                                      A.ToUpper();
                        }

                        if (res) break;
                    }
                }
            }

            return res;
        }

        // Count(finance & male)
        public double CountXA(string xAttribute, string A, string xCol = "",
            string aCol = "")
        {
            double res = 0;

            if (dataset != null)
            {
                foreach (KeyValuePair<string, string[]> xColumn in dataset)
                {
                    if (xCol == String.Empty ||
                        xColumn.Key.ToUpper().Contains(xCol.ToUpper()))
                    {
                        int xRow = 0;

                        foreach (string x in xColumn.Value)
                        {
                            if (StrictCounting)
                            {
                                if (x.ToUpper() == xAttribute.ToUpper() &&
                                    ExistsAinXRow(xRow, A, aCol))
                                    res++;
                            }
                            else
                            {
                                if (x.ToUpper().Contains(" "))
                                {
                                    if (x.ToUpper().
                                        Contains(xAttribute.ToUpper()) &&
                                        ExistsAinXRow(xRow, A, aCol))
                                        res++;
                                }
                                else if (x.ToUpper() == xAttribute.ToUpper() &&
                                         ExistsAinXRow(xRow, A, aCol))
                                    res++;
                            }

                            xRow++;
                        }
                    }
                }
            }

            return res;
        }

        // Count(female), where female is a group.
        public double CountA(string A, string col = "")
        {
            double res = 0;

            if (dataset != null)
            {
                foreach (KeyValuePair<string, string[]> column in dataset)
                {
                    if (col == String.Empty ||
                        column.Key.ToUpper().Contains(col.ToUpper()))
                    {
                        foreach (string wrd in column.Value)
                        {
                            if (StrictCounting)
                            {
                                if (wrd.ToUpper() == A.ToUpper())
                                    res++;
                            }
                            else
                            {
                                if (wrd.ToUpper().Contains(" "))
                                {
                                    if (wrd.ToUpper().Contains(A.ToUpper()))
                                        res++;
                                }
                                else if (wrd.ToUpper() == A.ToUpper())
                                    res++;
                            }
                        }
                    }
                }
            }

            return res;
        }

        // Count(gender), where gender is a column—i.e., all groups.
        public double CountCol(string col)
        {
            double res = 0;

            if (dataset != null)
            {
                foreach (KeyValuePair<string, string[]> column in dataset)
                {
                    if (col != String.Empty &&
                        column.Key.ToUpper().Contains(col.ToUpper()))
                    {
                        res = column.Value.Length;
                        break;
                    }
                }
            }

            return res;
        }

        // P(male) = count(male) / count(gender)
        public double ProbA(string A, string aCol)
        {
            return CountA(A, aCol) / CountCol(aCol);
        }

        // P(finance | male) = count(finance & male) / count(male)
        public double ProbXA(string xAttribute, string A, string xCol = "",
            string aCol = "")
        {
            return CountXA(xAttribute, A, xCol, aCol) / CountA(A, aCol);
        }

        // P(finance | male) = (count(finance & male) + 1) / (count(male) + 3)
        // (add-one smoothing).
        public double SmoothingProbXA(string xAttribute, string A,
            int numAttributes, string xCol = "", string aCol = "")
        {
            return (CountXA(xAttribute, A, xCol, aCol) + 1) /
                   (CountA(A, aCol) + numAttributes);
        }

        // Decides whether to use ProbXA or SmoothingProbXA.
        public double CalcProbXA(bool smoothing, string xAttribute, string A,
            int numAttributes = 0, string xCol = "", string aCol = "")
        {
            double res = ProbXA(xAttribute, A, xCol, aCol);

            return (res == 0 || smoothing) ?
                SmoothingProbXA(xAttribute, A, numAttributes, xCol, aCol) :
                res;
        }

        // PP(male | X) = P(finance | male) * P(<40 | male) *
        //                P(senior | male) * P(male)
        public double PProbAX(bool smoothing, string A, string[] xAttributes,
            string[] xColls, string aCol = "")
        {
            double res = 0;

            if (xAttributes != null && xAttributes.Length > 0)
            {
                int i = 0;
                List<double> rlts = new List<double>();

                foreach (string xAttrib in xAttributes)
                {
                    string xCol = (xColls != null && xColls.Length > 0 &&
                                   xColls.Length == xAttributes.Length) ?
                                   xColls[i] : String.Empty;

                    rlts.Add(CalcProbXA(smoothing, xAttrib, A,
                        xAttributes.Length, xCol, aCol));
                    i++;
                }

                rlts.Add(ProbA(A, aCol));

                // Multiply all partial probabilities together.
                res = 1;
                foreach (double r in rlts)
                    res *= r;
            }

            return res;
        }

        // P(female | X) = PP(female | X) /
        //                 [PP(female | X) + PP(male | X)]
        public double BayesAX(string A, string[] G, string[] gColls,
            string[] xAttributes, string[] xColls, string aCol = "",
            bool smoothing = true)
        {
            double res = 0;
            double numerator = PProbAX(smoothing, A, xAttributes, xColls,
                aCol);
            double denominator = 0;

            if (G != null && G.Length > 0 && gColls != null &&
                gColls.Length > 0 && G.Length == gColls.Length)
            {
                int i = 0;

                foreach (string group in G)
                {
                    denominator += PProbAX(smoothing, group, xAttributes,
                        xColls, gColls[i]);
                    i++;
                }
            }

            if (denominator > 0)
                res = numerator / denominator;

            return res;
        }

        public virtual void Dispose(bool disposing)
        {
            if (!this.disposed)
            {
                if (disposing)
                    dataset = null;
            }

            this.disposed = true;
        }

        public void Dispose()
        {
            this.Dispose(true);
            GC.SuppressFinalize(this);
        }
    }
}

In Code Listing 20, the Bayes engine’s principal method is BayesAX, which first calculates the numerator, i.e., PP(female | X), then the denominator, i.e., [PP(female | X) + PP(male | X)], and returns the final result, i.e., P(female | X) = PP(female | X) / [PP(female | X) + PP(male | X)].

Method BayesAX has the following parameters: A, G, gColls, xAttributes, xColls, aCol, and smoothing. A represents the group for which the probability will be calculated (in the given example, female). G is a string array of the possible groups (in the given example, female and male). The parameter gColls is a string array with the name of the column in which each group is found (in the given example, the column name for both female and male is Gender). The parameter xAttributes is a string array of the X values used to calculate the probability (in the given example, finance, senior, and <40). xColls is a string array with the name of the column for each X attribute (in the given example, Department, Role, and Age, respectively). Finally, aCol is the name of the column that holds the groups (Gender).

The Bayes engine counts the xAttributes per group, calculates PP(X) for each group, and then determines the overall probability. The class BayesEngine has a very important property: StrictCounting. When set to false (the default), counting is done on substring matches using Contains(); when set to true, counting is done on exact string matches (using the == operator). In both cases, string comparison is not case sensitive.

BayesEngine also caters for add-one smoothing, which is applied by default (smoothing set to true). If smoothing is set to false, smoothing is still applied automatically whenever a joint count is zero. For best accuracy, it is recommended to keep smoothing enabled.

In order to better understand these concepts, let’s consider a scenario in which there are two groups: one group of 24 males and one group of 16 females. Job, Handed, and Height are X attributes. The possible values for Job are Admin, Const, Edu, and Tech. The possible values for Handed are Right and Left. The possible values for Height are Short, Tall, and Medium.

We can use this information to create the dataset in Table 5.

Males       | Females
Admin = 2   | Admin = 7
Const = 5   | Const = 0
Edu = 2     | Edu = 4
Tech = 15   | Tech = 5
Left = 7    | Left = 2
Right = 17  | Right = 14
Short = 1   | Short = 6
Medium = 19 | Medium = 8
Tall = 4    | Tall = 2

Table 5: Sample Dataset

Using Table 5’s dataset, we can calculate P(male | X) and P(female | X) to obtain the results in Table 6.

Formula           | Result with Smoothing | Result without Smoothing
P(Edu | Male)     | 0.1111   | 0.0833
P(Right | Male)   | 0.6667   | 0.7083
P(Tall | Male)    | 0.1852   | 0.1667
P(Male)           | 0.6      | 0.6
P(Edu | Female)   | 0.2632   | 0.2500
P(Right | Female) | 0.7895   | 0.8750
P(Tall | Female)  | 0.1579   | 0.1250
P(Female)         | 0.4      | 0.4
PP(Male | X)      | 0.008230 | 0.005903
PP(Female | X)    | 0.013121 | 0.010938
P(Male | X)       | 0.3855   | 0.3505
P(Female | X)     | 0.6145   | 0.6495

Table 6: Results from Preceding Sample Dataset

In order to verify that our BayesEngine can produce the same results for this dataset, let’s create a wrapper class around it.

Code Listing 21: A Bayes Engine Wrapper around the Sample Dataset

// BayesExample: a BayesEngine wrapper.
using System;
using System.Collections.Generic;

namespace TextProcessing
{
    public class BayesExample
    {
        public static void BayesEx()
        {
            using (BayesEngine b = new BayesEngine())
            {
                b.dataset.Add(new KeyValuePair<string, string[]>("Gender",
                    new string[]
                    { "male", "male", "male", "male", "male", "male",
                      "male", "male", "male", "male", "male", "male",
                      "male", "male", "male", "male", "male", "male",
                      "male", "male", "male", "male", "male", "male",
                      "female", "female", "female", "female", "female",
                      "female", "female", "female", "female", "female",
                      "female", "female", "female", "female", "female",
                      "female" }));

                b.dataset.Add(new KeyValuePair<string, string[]>("Job",
                    new string[]
                    { "tech", "tech", "tech", "tech", "tech", "tech",
                      "tech", "tech", "tech", "tech", "tech", "tech",
                      "tech", "tech", "tech",
                      "const", "const", "const", "const", "const",
                      "admin", "admin", "edu", "edu",
                      "admin", "admin", "admin", "admin", "admin",
                      "admin", "admin", "edu", "edu", "edu", "edu",
                      "tech", "tech", "tech", "tech", "tech" }));

                b.dataset.Add(new KeyValuePair<string, string[]>("Handed",
                    new string[]
                    { "left", "left", "left", "left", "left", "left",
                      "left",
                      "right", "right", "right", "right", "right", "right",
                      "right", "right", "right", "right", "right", "right",
                      "right", "right", "right", "right", "right",
                      "left", "left",
                      "right", "right", "right", "right", "right", "right",
                      "right", "right", "right", "right", "right", "right",
                      "right", "right" }));

                b.dataset.Add(new KeyValuePair<string, string[]>("Height",
                    new string[]
                    { "short", "tall", "tall", "tall", "tall", "medium",
                      "medium", "medium", "medium", "medium", "medium",
                      "medium", "medium", "medium", "medium", "medium",
                      "medium", "medium", "medium", "medium", "medium",
                      "medium", "medium", "medium",
                      "short", "short", "short", "short", "short", "short",
                      "tall", "tall",
                      "medium", "medium", "medium", "medium", "medium",
                      "medium", "medium", "medium" }));

                // P(male | edu, right, tall) with smoothing.
                double r1 = b.BayesAX("male",
                    new string[] { "male", "female" },
                    new string[] { "Gender", "Gender" },
                    new string[] { "edu", "right", "tall" },
                    new string[] { "Job", "Handed", "Height" },
                    "Gender");

                // P(male | edu, right, tall) without smoothing.
                double r2 = b.BayesAX("male",
                    new string[] { "male", "female" },
                    new string[] { "Gender", "Gender" },
                    new string[] { "edu", "right", "tall" },
                    new string[] { "Job", "Handed", "Height" },
                    "Gender", false);

                // P(female | edu, right, tall) with smoothing.
                double r3 = b.BayesAX("female",
                    new string[] { "male", "female" },
                    new string[] { "Gender", "Gender" },
                    new string[] { "edu", "right", "tall" },
                    new string[] { "Job", "Handed", "Height" },
                    "Gender");

                // P(female | edu, right, tall) without smoothing.
                double r4 = b.BayesAX("female",
                    new string[] { "male", "female" },
                    new string[] { "Gender", "Gender" },
                    new string[] { "edu", "right", "tall" },
                    new string[] { "Job", "Handed", "Height" },
                    "Gender", false);

                Console.WriteLine(
                    "P(male | edu, right, tall) with smoothing: " + r1);
                Console.WriteLine(
                    "P(male | edu, right, tall) without smoothing: " + r2);
                Console.WriteLine(
                    "P(female | edu, right, tall) with smoothing: " + r3);
                Console.WriteLine(
                    "P(female | edu, right, tall) without smoothing: " + r4);
            }
        }
    }
}

// Main program that calls BayesExample.
using TextProcessing;

namespace DataCaptureExtraction
{
    class Program
    {
        static void Main(string[] args)
        {
            BayesExample.BayesEx();
        }
    }
}

Running this code produces the results shown in Figure 13.

Figure 13: Result from the BayesEngine Wrapper around the Sample Dataset

As you can see, BayesEx produces the same results (in fact, slightly more precise results) as were manually calculated and described in Table 6.

Naïve Bayes is a fantastic method for determining the probability that an item belongs to a given group (A) based on one or more attributes (X). It can be used as a barometer for determining whether one or more words belong to certain categories, which makes classification and categorization easier.

Common data types & RegEx

In order to give meaning to text, we must know exactly which string data types are going to be extracted, and we must know how they can be identified and extracted from a set of specific words.

A Regular Expression (RegEx) is nothing more than a sequence of characters that defines a search pattern for a particular string data type, e.g., an email address, a post code, or anything else that has a specific, formatted pattern.

The most basic RegEx consists of a single literal character, such as ‘o.’ RegEx will match the first occurrence of that character in the string. For example, if the string is ‘John is a pilot,’ RegEx will match the ‘o’ after the ‘J.’

RegEx can match the second ‘o’ as well, but only if you tell the RegEx engine to continue searching through the string after the first match.
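In C#, that “continue searching after the first match” step can be sketched with Match.NextMatch(). This is a minimal example of our own, not one of the book’s listings:

```csharp
using System;
using System.Text.RegularExpressions;

public static class LiteralMatchExample
{
    public static void Main()
    {
        string input = "John is a pilot";

        // First occurrence of the literal 'o' (the one in "John").
        Match match = Regex.Match(input, "o");
        Console.WriteLine(match.Index);   // 1

        // Tell the engine to keep searching after the first match.
        match = match.NextMatch();
        Console.WriteLine(match.Index);   // 13 (the 'o' in "pilot")
    }
}
```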

There are 12 characters, called meta-characters, that have special meanings in RegEx, and it is important that you know these items:

  • the backslash ‘\’
  • the caret ‘^’
  • the dollar sign ‘$’
  • the period or dot ‘.’
  • the vertical bar or pipe symbol ‘|’
  • the question mark ‘?’
  • the asterisk or star ‘*’
  • the plus sign ‘+’
  • the opening parenthesis ‘(’
  • the closing parenthesis ‘)’
  • the opening square bracket ‘[’
  • the opening curly brace ‘{’

If any of these characters are used as a literal in RegEx, they need to be escaped with a backslash ‘\’ character. If we want to match 1+1=2, the correct RegEx is 1\+1=2. Otherwise, the ‘+’ sign would have a special meaning. A great tutorial to get you started with RegEx can be found at RegExOne.
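As a quick sketch (our own example, not from the book), here is how the escaped and unescaped patterns behave in C#; Regex.Escape() can also do the escaping for you:

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeExample
{
    public static void Main()
    {
        string input = "1+1=2";

        // Escaped: the '+' is treated as a literal character.
        Console.WriteLine(Regex.IsMatch(input, @"1\+1=2"));   // True

        // Unescaped: '+' means "one or more", so the match fails.
        Console.WriteLine(Regex.IsMatch(input, "1+1=2"));     // False

        // Regex.Escape() escapes all metacharacters in a literal string.
        Console.WriteLine(Regex.Escape("1+1=2"));             // 1\+1=2
    }
}
```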

Let’s quickly explore how we can implement RegEx with C#.                 

Code Listing 22: A RegEx C# Example

using System;
using System.Text.RegularExpressions;

namespace DataCaptureExtraction
{
    class Program
    {
        public static void RegExExample()
        {
            // First we see the input string.
            string input = "/server/thisUrl.html";

            // Here we call Regex.Match.
            Match match = Regex.Match(input,
                @"server/([A-Za-z0-9\-]+)\.html$",
                RegexOptions.IgnoreCase);

            // Here we check the Match instance.
            if (match.Success)
            {
                // Finally, we get the Group value and display it.
                string key = match.Groups[1].Value;
                Console.WriteLine(key);
            }
        }

        static void Main(string[] args)
        {
            RegExExample();
        }
    }
}

Note that running the RegEx “server/([A-Za-z0-9\-]+)\.html$” against the string "/server/thisUrl.html" extracts the word “thisUrl” from the URL.

Here is some common and useful RegEx code:

User Name: ^[a-z0-9_-]{3,16}$

this-us3r_n4m3 would be a match. However, a string longer than 16 characters would not match.

Password: ^[a-z0-9_-]{6,18}$

thisp4ssw0rd would be a match. However, a string shorter than 6 characters would not.

Hex Value: ^#?([a-f0-9]{6}|[a-f0-9]{3})$

#a3c113 would be a match. However, #h3c113 would not because the letter ‘h’ is included.

Email: ^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$

[email protected] would match. However, [email protected] would not because it is too long.

Full URL: ^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$

http://subdomain.vito.com/about would match. However, http://vito.com/some/page!.html  would not because it contains the ‘!’ character.

IP Address: ^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$

71.48.125.121 would match. However, 256.58.125.121 would not.

Let’s slightly adjust the code to test one of the preceding patterns (the user name RegEx) and display the value when a match is produced.

Code Listing 23: Adjusted RegEx C# Example

public static void RegExExample()
{
    // First we see the input string.
    string input = "this-us3r_n4m3";

    // Here we call Regex.Match.
    Match match = Regex.Match(input, @"^[a-z0-9_-]{3,16}$",
                              RegexOptions.IgnoreCase);

    // Here we check the Match instance.
    if (match.Success)
    {
        Console.WriteLine(match.Value);
    }
}

You may use any of the previous examples with the preceding code in order to verify that the matches provided work as expected.
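For instance, several of the listed patterns and their sample inputs can be checked in a single loop. Here is a quick sketch (the pattern/sample pairs are taken straight from the examples above, including one deliberately failing hex value):

```csharp
using System;
using System.Text.RegularExpressions;

class PatternTests
{
    static void Main()
    {
        // Pattern / sample pairs from the examples above.
        var samples = new (string Pattern, string Input)[]
        {
            (@"^[a-z0-9_-]{3,16}$", "this-us3r_n4m3"),                  // user name
            (@"^#?([a-f0-9]{6}|[a-f0-9]{3})$", "#a3c113"),              // hex value
            (@"^#?([a-f0-9]{6}|[a-f0-9]{3})$", "#h3c113"),              // invalid hex
            (@"^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}" +
             @"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$",
             "71.48.125.121")                                           // IP address
        };

        foreach (var (pattern, input) in samples)
        {
            Console.WriteLine($"{input} -> {Regex.IsMatch(input, pattern)}");
        }
    }
}
```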

Other common string data types that need to be extracted from text are emails, post codes, social security numbers, driver’s licenses, fiscal identification numbers, bank account numbers, phone numbers, area codes, etc. The site RegExLib contains a wealth of popular RegEx that can be applied in C# projects. Another useful site is RexEgg.
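When scanning free-form text for such data types, the ^ and $ anchors used in the validation patterns above get in the way; word boundaries (\b) work better. The following sketch (with made-up email addresses) uses Regex.Matches to extract every email found in a sentence:

```csharp
using System;
using System.Text.RegularExpressions;

class ExtractEmails
{
    static void Main()
    {
        string text = "Contact john@example.com or sales@demo.org for details.";

        // A simplified email pattern with word boundaries instead of ^...$
        // anchors, so it can find matches inside larger text.
        string pattern = @"\b[a-z0-9_\.-]+@[\da-z\.-]+\.[a-z\.]{2,6}\b";

        foreach (Match m in Regex.Matches(text, pattern, RegexOptions.IgnoreCase))
        {
            Console.WriteLine(m.Value);  // john@example.com, then sales@demo.org
        }
    }
}
```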

Identifying entities

Named Entity Recognition (NER) consists of locating and classifying subsets of text elements (Named Entities) into predefined categories such as the names of persons, organizations, locations, amounts, monetary values, percentages, etc.

Classifying words into categories using a probabilistic method like Naïve Bayes certainly allows us to place a set of words into specific categories based on a training dataset, but this use of NER remains very basic. In order to have very accurate NER, we must use a well-trained dataset model.
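As a purely illustrative sketch of that probabilistic idea (and not how the Stanford NER works internally), the following toy example scores a single word against two categories using smoothed, Naïve Bayes-style log probabilities. The tiny training set is invented for the example:

```csharp
using System;
using System.Linq;

class NaiveBayesSketch
{
    static void Main()
    {
        // Hypothetical training data: word -> category.
        var training = new (string Word, string Category)[]
        {
            ("london", "LOCATION"), ("paris", "LOCATION"), ("madrid", "LOCATION"),
            ("john", "PERSON"), ("mary", "PERSON")
        };

        string word = "paris";
        string best = null;
        double bestScore = double.MinValue;

        foreach (var group in training.GroupBy(t => t.Category))
        {
            int total = group.Count();
            int hits = group.Count(t => t.Word == word);

            // log P(category) + log P(word | category), Laplace-smoothed
            // so unseen words do not produce a zero probability.
            double score = Math.Log((double)total / training.Length)
                         + Math.Log((hits + 1.0) / (total + 2.0));

            if (score > bestScore) { bestScore = score; best = group.Key; }
        }

        Console.WriteLine($"{word} -> {best}");  // paris -> LOCATION
    }
}
```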

A popular NER implementation for .NET and C# is the Stanford Named Entity Recognizer (NER) for .NET, which is also available via NuGet.


Figure 14: Stanford NER Installed as a NuGet Package

After you install the NuGet package, you will need to download the classifier library definitions, which can be found here: http://nlp.stanford.edu/software/stanford-ner-2015-04-20.zip. Further information can also be found here: http://stanfordnlp.github.io/CoreNLP/download.html.

When the library definitions have been downloaded, unzip and place the files into a folder on your local hard drive. You will need to reference this location in your code.

Let’s examine how to quickly implement NER with the Stanford NER for .NET.

Code Listing 24: A C# Stanford NER Implementation Program

// Stanford NER C# Implementation
using System;
using System.Collections.Generic;
using System.Linq;
using edu.stanford.nlp.ie.crf;
using edu.stanford.nlp.pipeline;
using edu.stanford.nlp.util;

namespace TextProcessing
{
    public class NER : IDisposable
    {
        protected bool disposed;
        protected CRFClassifier Classifier = null;

        protected string[] ParseResult(string txt)
        {
            List<string> res = new List<string>();
            string[] tmp = txt.Split(' ');

            if (tmp != null && tmp.Length > 0)
            {
                foreach (string t in tmp)
                {
                    if (t.Count(x => x == '/') == 2)
                    {
                        res.Add(t.Substring(0, t.LastIndexOf("/") - 1));
                    }
                }
            }

            return res.ToArray();
        }

        public NER()
        {
            string root = @"D:\Temp\NER\classifiers";
            Classifier = CRFClassifier.getClassifierNoExceptions(
                root + @"\english.all.3class.distsim.crf.ser.gz");
        }

        ~NER()
        {
            this.Dispose(false);
        }

        public string[] Recognize(string txt)
        {
            return ParseResult(Classifier.classifyToString(txt));
        }

        public virtual void Dispose(bool disposing)
        {
            if (!this.disposed)
            {
                if (disposing)
                {
                    Classifier = null;
                }
            }

            this.disposed = true;
        }

        public void Dispose()
        {
            this.Dispose(true);
            GC.SuppressFinalize(this);
        }
    }
}

// Wrapper class around the Stanford NER Implementation.
using System;

namespace TextProcessing
{
    public class NerExample
    {
        public static void nerExample()
        {
            using (NER n = new NER())
            {
                string[] res = n.Recognize("I went to Stanford, " +
                                           "which is located in California");

                if (res != null && res.Length > 0)
                {
                    foreach (string r in res)
                    {
                        Console.WriteLine(r);
                    }
                }
            }
        }
    }
}

// Main Program
using System;
using TextProcessing;

namespace DataCaptureExtraction
{
    class Program
    {
        static void Main(string[] args)
        {
            NerExample.nerExample();
        }
    }
}

The most important part of the code is the call to CRFClassifier.getClassifierNoExceptions, which is passed the physical location on disk of the classifier definitions file (english.all.3class.distsim.crf.ser.gz).

Within Recognize, the Classifier.classifyToString method of the Stanford NER is invoked, and the results are parsed. This produces the output we see in Figure 15.


Figure 15: Stanford NER C# Implementation Output

Using the input string “I went to Stanford, which is located in California,” the Stanford NER C# program can recognize two named entities: Stanford (which is an organization) and California (which is a location).

Summary

Extracting meaning from text is a fascinating topic, whether we are examining how to extract specific data types, recognize entities, or classify words within text. When you are able to make sense of extracted data, you have access to a powerful tool that can help you improve, accelerate, and automate business processes. In fact, there is an unlimited potential of processes—from spam filters to text classification and beyond—that organizations can streamline and improve. We’ve only scratched the surface of what is possible with powerful C# code implementations.

Keep in mind that the techniques I have presented in this book are recommended for concept testing rather than production usage. We have focused on quick implementation of what might be achieved from a conceptual point of view, and these techniques do not compete with or undermine any commercial offerings. I encourage you to also consider the diverse range of commercial products that have powerful APIs and are professionally supported.

Thank you for reading. I hope this material has helped broaden your view on data capture and extraction with C#.

The complete Visual Studio project source code can be downloaded from this URL:

https://bitbucket.org/syncfusiontech/data-capture-and-extraction-with-c-succinctly
