CHAPTER 2
“To iterate is human, to recurse divine.”
L. Peter Deutsch
All the parsing of the CSCS language that we are going to develop in this book is based on the Split-and-Merge algorithm to parse a custom string. We will start with pure mathematical expressions, and then generalize them into any custom CSCS language expression.
The description that I provide here is based on the Split-and-Merge algorithm previously published in the MSDN Magazine,[11] ACCU Overload,[12] and in the CODE Magazine.[13]
Note that in the discussion that follows, we suppose that the parsing string doesn’t contain any extra blank spaces, tabs, new lines, etc. If this is not the case, we need first to get rid of all these extra characters. (The code that gets rid of these extra characters is available in the accompanying source code).
We could have removed these unneeded blanks at the same time that we parsed the script, but this would have unnecessarily made the algorithm look more complex than it should be.
The algorithm consists of two steps: as the algorithm name implies, the first part is a Split, and the second one is a Merge.
In the first step, the string containing an expression is split into a list of so-called “variables.” Each variable consists, primarily, of a number or a string and an action that must be applied to it. The numbers are internally represented as doubles, so in particular, they can be integers or Booleans.
For numbers, the actions can be all possible operations we can do with the numbers, for example, +, –, *, /, %, or ^ (power, 2^3 means 2 to the power of 3; that is, 8) and Boolean comparisons (&&, ||).
For strings, the actions can be a + (meaning string concatenation) or Boolean comparisons.
For convenience, we denote the action of the last variable in the list as ). It could’ve been any character, but we chose ) to be sure that it doesn’t represent any other action. We call this special action a “null” action. It has the lowest priority.
The separation criteria for splitting the string into tokens are mathematical and logical operators or parentheses. Note that the priority of the operators does not matter in the first step.
Let’s see an example of applying the first step to some expressions:
Split (“3+5”) = {Variable(3, “+“), Variable(5, “)“)}
Split (“11-2*5”) = {Variable(11, “-“), Variable(2, “*“), Variable(5, “)“)}
Easy! Now, what about parentheses? As soon as we get an opening parentheses in the expression string, we recursively apply the whole Split-and-Merge algorithm to the expression in parentheses and replace the expression in parentheses with the calculated result. For example:
Split (“3*(2+2)”) = {Variable(3, “*“),Variable(SplitAndMerge(“2+2”), “)“)} ={Variable(3, “*“), Variable(4, “)“)}
We apply the same procedure for any function in the expression to be parsed: first we apply recursively the whole Split-and-Merge algorithm to the function, replacing it with the calculated result.
For example, consider parsing an expression 1+sin(3-3). As soon as we get the opening parenthesis in the (3-3) sub-expression, we apply the whole algorithm to 3-3, which yields 0 as a result. Then our expression is reduced to 1+sin(0). Then we calculate the sine of 0, which is 0, reducing the expression to 1+0. Splitting this expression into a list of variables leads to two variables: Variable(1, “+”) and Variable (0, “)”) as a result.
Note that at the end of the first step, all of the parentheses and functions will be eliminated (by possibly recursive calls to the whole Split-and-Merge algorithm).
The priorities of the actions start counting in the second step, when we merge the list of variables created in the previous step. Code Listing 1Error! Reference source not found. contains the priorities of some actions that I defined.
Code Listing 1: Priorities of the actions
static int GetPriority(string action) |
Note that the last variable action ) has the lowest priority of 0, since it isn’t included in the switch statement.
The list of variables is merged one by one, starting from the first variable in the list. Merging means applying the action of the left variable to the values of the left and right variables.
The resulting variable will have the same action as the right variable. The variables can only be merged if the priority of the action of the left variable is greater or equal than the priority of the action of the right variable. As soon as we have only one variable left, its value will be the final result of the whole calculation.
For example, since the priority of the + action is greater than the priority of the null action, we have:
Merge {Variable(3, “+“), Variable(5, “)“)} = Variable(3 + 5, “)”) = Variable(8, “)”) = 8.
Now what happens if two variables cannot be merged because the priority of the left variable is lower than the priority of the right variable? Then we temporarily move to the next-right variable, in order to try to merge it with the variable next to it, and so on, recursively. As soon as the right variable has been merged with the variable on its right, we return back to the original, left variable, and try to remerge it with the newly created right variable.
Eventually we will be able to merge the whole list, since sooner or later we will reach the last variable of the list that has the lowest priority, and therefore, can be merged with any variable on its left.
Let’s see an example of merging a list of variables (11, “-“), (2, “*“), (5, “)“) from the Split section. Note that the priority of * is greater than the priority of - and therefore, we cannot merge the first and second variables right away, but must first merge the second and the third variables:
Merge {Variable(11, “-“), Variable(2, “*“), Variable(5, “)“)} =
Merge {Variable(11, “-“), Merge {Variable(2, “*“), Variable(5, “)“)} =
Merge {Variable(11, “-“), Variable(2 * 5, “)“) } =
Merge {Variable(11, “-“), Variable(10, “)“) } =
Variable(11 - 10, “)“) } = Variable(1, “)“) } = 1.
Let’s see the implementation of the Split-and-Merge algorithm in C#.
First we need to define a few auxiliary data structures. Code Listing 2 contains the ParsingScript class (the definition is incomplete there—we are going to see more functions and methods from the ParsingString class later on). It is a wrapper over the expression being parsed. It contains the actual string being parsed (m_data) and the pointer to the character where we currently are (m_from). It also has a few convenience functions for accessing and manipulating the string being parsed.
Code Listing 2: The ParsingScript class
public class ParsingScript public ParsingScript(string data, int from = 0) } public char Current { get { return m_data[m_from]; } } public bool StillValid() { return m_from < m_data.Length; } public char CurrentAndForward() { return m_data[m_from++]; } int maxChars = Constants.MAX_CHARS_TO_SHOW) { public char TryCurrent() { public char TryPrev() { public char TryPrevPrev() { public void MoveForwardIf(char expected, char expected2 = Constants.EMPTY) { } |
The Constants.Empty character is defined as ‘\0’.
The main task of the first step is to split the string into a list of variables. The Variable class is defined in Code Listing 3.
Basically, a variable is a wrapper over either a number (including integers, doubles, and Booleans), a string, or a list of other variables.
Code Listing 3: The Variable class
public class Variable public Variable() { Type = VarType.NONE; }
public void SetAsArray() { public static Variable EmptyInstance = new Variable(); } |
Code Listing 4 shows the main function to split an expression into a list of variables.
Code Listing 4: The Split part of the algorithm
static List<Variable> Split(ParsingScript script, char[] to) StringBuilder item = new StringBuilder(); do { // Main processing cycle of the first part. to, script, ref action); ch, ref action); // This happens when called recursively inside of a math expression: script.MoveForwardIf(Constants.END_ARG); // END_ARG = ')' return listToMerge; |
The item variable holds the current token, adding characters to it one by one as soon as they are read from the string being parsed. To figure out whether we need to add the next character to the current token, the StillCollecting method is invoked. We show it in Code Listing 5.
Code Listing 5: The StillCollecting method to check if we are still collecting the current token
static bool StillCollecting(string item, char[] to, ParsingScript script, ref string action) |
The StillCollecting method uses the following constants defined in the Constants class:
const char START_ARG = '(';
const char START_GROUP = '{';
Obviously, you are free to redefine them as you wish. The condition next == Constants.EMPTY means that there are no more characters in the script being parsed; that is, the ch argument is already the last character in the script.
The StillCollecting method in Code Listing 5 also invokes the Utils.VaildAction method to find the action that follows the token that we just extracted.
We show the VailidAction and UpdateAction methods in Code Listing 6.
Code Listing 6: The methods to return an action if the passed string starts with it
public static string ValidAction(string rest) foreach (string action in Constants.ACTIONS) { static string UpdateAction(ParsingScript script, char[] to) int advance = action == null ? 0 : action.Length; |
The Constants.ACTIONS is a string array containing all actions we want to define:
public static string[] OPER_ACTIONS = { "+=", "-=", "*=", "/=",
"%=", "&=", "|=", "^="};
public static string[] MATH_ACTIONS = { "&&", "||", "==", "!=", "<=", ">=",
"++", "--", "%", "*", "/", "+", "-", "^", "<", ">", "="};
// Actions: always decreasing by the number of characters.
public static string[] ACTIONS=(OPER_ACTIONS.Union(MATH_ACTIONS)).ToArray();
The actual dealing with every extracted token in the Split method (Code Listing 4) is in the following 2 lines of code:
ParserFunction func = new ParserFunction(script, token, ch, ref action);
Variable current = func.GetValue(script);
If the extracted token is not a function previously registered with the parser, this code will try to convert the extracted token to a double (and throw an exception if the conversion is not possible).
Otherwise, a previously registered function will be called, which can in turn recursively invoke the the whole Split-and-Merge algorithm. The appropriate ParserFunction will be selected based on the token being processed. How do we select the appropriate function? This is what we’ll explore next.
To choose the appropriate function, I decided to use the virtual constructor idiom. Virtual constructor? Is it legal? James Coplien described a trick that simulates a virtual constructor. He described this idiom to be used in C++, but it can be easily used in any other object-oriented language as well, so here we are going to use this idiom in C#.
In C# the same functionality is often implemented using a factory method that uses an additional factory class to produce the required object. But Coplien’s design pattern doesn’t need an extra class, and produces instead the required object on the fly. According to Coplien:[14]
“The virtual constructor idiom is used when the type of an object needs to be determined from the context in which the object is constructed.”
This is exactly the case when we associate each token being parsed with a function at runtime. The main advantage of this idiom is that we can easily add a new function to the parser without modifying the main algorithm implementation code.
The main idea of this idiom is to use the ParserFunction constructor to create the m_impl member of the object, which is a ParserFunction itself:
private ParserFunction m_impl;
So the constructor of ParserFunction initializes the m_impl member with the appropriate class, deriving from ParserFunction. See the implementation in Code Listing 7.
Code Listing 7: The ParserFunction base class
public class ParserFunction { char ch, ref string action) { (ch == Constants.START_ARG || !script.StillValid ())) { m_impl = GetFunction(token); string.IsNullOrWhiteSpace(token)) { public Variable GetValue(ParsingScript script) public static ParserFunction GetFunction(string item) if (s_functions.TryGetValue(item, out impl)) { // Derived classes may want to return a new instance in order to public static void RegisterFunction(string name, ParserFunction function) { } private static Dictionary<string, ParserFunction> s_functions = new Dictionary<string, ParserFunction>(); private static StringOrNumberFunction s_strOrNumFunction = |
Notice that the ParserFunction has a missing curly brace. This is because the class listing continues later on, in particular in Listings 8 and 9.
The Evaluate method (called from the base class GetValue method) will use the actual ParserFunction object.
So each class deriving from the ParserFunction class is expected to override the Evaluate method: this is where the real action takes place. We’ll see plenty examples later on.
Different parser functions can be registered with the parser using the RegisterFunction method in Code Listing 7. A dictionary is used to map all of the parser functions to their names:
private static Dictionary<string, ParserFunction> s_functions =
new Dictionary<string, ParserFunction>();
There are two special functions deriving from the ParserFunction class:
private static StringOrNumberFunction s_strOrNumFunction =
new StringOrNumberFunction();
private static IdentityFunction s_idFunction =
new IdentityFunction();
They are used explicitly in the ParserFunction constructor, and their implementation is provided in Code Listing 8.
Code Listing 8: IdentityFunction and StringOrNumberFunction implementations
class IdentityFunction : ParserFunction class StringOrNumberFunction : ParserFunction public string Item { private get; set; } |
The IdentityFunction is used when we have an expression in parentheses. The whole Split-and-Merge algorithm is called on that expression in parentheses, because the ParsingScript.ExecuteTo function is just a wrapper over, it as you can see in Code Listing 9.
Code Listing 9: The implementation of ParsingScript Execute, ExecuteTo, and ExecuteFrom functions
public Variable ExecuteTo(char to = '\0') public Variable ExecuteFrom(int from, char to = '\0') public Variable Execute(char[] toArray) |
The StringOrNumberFunction is a “Catch All” function that is called when no function is found that corresponds to the current token. Therefore, it yields either a number or a string in quotes. Otherwise, an exception is thrown.
A typical implementation of a function deriving from the ParserFunction is shown in Code Listing 10.
In order to use this function with the parser, one needs to register it using the RegisterFunction static method (refer to Code Listing 7).
Code Listing 10: An example of a ParserFunction: the exponential function
class ExpFunction : ParserFunction |
Basically, to register a function with the parser, there are three steps:
public const string EXP = "exp";
ParserFunction.RegisterFunction(Constants.EXP, new ExpFunction());
That’s it! After completing these steps, you can add any function you wish and use it with the parser. For example, we can now use the exponential function that we defined previously in expressions like 1+exp(10), and so on.
In the second part of the algorithm, we merge the list of variables created in the previous step one by one, applying recursion if we can’t merge items right away.
See Code Listing 11 for the implementation details.
From the outside, the Merge method is called with the mergeOneOnly argument set to false, so it will not return before completing the whole merging part.
When the Merge method is called recursively (when the left and right cells cannot be merged with each other because of their priorities), the Merge method will be called recursively with the mergeOneOnly parameter set to true.
This is because we want to return to where we were as soon as we complete one merge in the MergeCells method (“cell” is a synonym of “variable” in our context).
Code Listing 11: The implementation of the Merge part of the algorithm
static Variable Merge(Variable current, ref int index, List<Variable> listToMerge, bool mergeOneOnly = false) while (index < listToMerge.Count) { return current; |
We merge variables one by one, and for each pair we invoke the CanMergeCells method, shown in Code Listing 12. This method just compares priorities of the actions of each cell. We saw the priorities of each action in Code Listing 1.
Code Listing 12 also shows the implementation of the MergeCells method. There is a special treatment of the variables Break and Continue. It consists in not doing nothing. We will see the Break and Continue variables in more detail in the next chapter.
Code Listing 12: The implementation of the CanMerge and MergeCells methods
Depending on the cells being merged, they will be merged either as strings or as numbers. Code Listing 13 shows the implementation of the MergeNumbers method.
Code Listing 13: The implementation of the MergeNumbers method
Code Listing 14 shows the implementation of the MergeStrings method. Note that this method may result in a variable of the Number type, in case the action is a comparison operator (for example, “a” > “z” leads to a variable of type Number and value 0). Note that we represent Boolean values internally as integers with 0 corresponding to “false” and 1 corresponding to “true.”
Code Listing 14: The implementation of the MergeStrings method
The MergeStrings method heavily relies on the Variable.AsString method, which will be shown later on (you can jump to Code Listing 26 to check it out).
Note that at the end of the Merge method, only one Variable will be left and returned back. That Variable will contain the final result of the calculation.
We have seen how to parse a mathematical expression, containing any number of functions (previously registered with the parser) and any number of nested parentheses.
Note that the code we developed in this chapter is still not quite bullet-proof. Some of the items I took away in order to make the algorithm look easier and concentrate on the remaining items in later chapters. Some of these missing items that we are going to catch later on are:
Prefix and postfix operators, assignments, etc.