13 October 2009

Two or Three Ways to Do Things

Throughout my career and daily life as a programmer, I often run into the problem that there are simply too many ways to build a system. The related problem is that there really isn't a single correct way to build any specific system; the means to an end often follows trends. To clarify: I have been trying to write a decent text-based calculator for some time now. I began in Java and moved to .NET because it was a little simpler and higher level for me. I'll likely go back to Java because it suits my needs a little better, since I need a tool that can run anywhere. Let me backtrack a little. Since my first attempt at the calculator, it has seen at least six incarnations: two as a Java application and four as a .NET application.
  1. My first attempt involved simply trying to parse out numbers and symbols, transform them into tokens, and finally solve the given equation using the Shunting-yard algorithm. That first attempt failed miserably, and I actually got feedback on it to prove it. One of my biggest problems was, has been, and likely always will be negation. Though it isn't much of a problem now, it always looms as an issue to be resolved late in the development cycle (usually last, after completing the basic system).
  2. My second attempt was to do the same thing in .NET, only a little better (when you redo something, you try to learn from your initial mistakes). This time I thought more about something one of my contacts suggested I try, namely regular expressions. I also thought I'd use Linq to see if I could make it work. Basically, I used regular expressions to see what was coming next in the expression and to remove the next item from it. As in the first attempt, I first turned the whole string into an array of chars, so when I was looking for tokens like numbers, I was reconstructing each number from individual characters. Ultimately I ended up with two clunky types: a Posfixator class, which also parsed the expression, and an Evaluator class, which took the postfixated expression and evaluated it for an answer. Some of the difficulties I had with this incarnation included parentheses and functions. One very difficult thing was implementing multiple-argument functions, like Power (Math.Pow(double x, double y) in .NET). I had reserved the comma for that purpose, but it wasn't really working.
  3. The third attempt at a good calculator involved going back to Java. The reason was to try to get the low-level concepts clarified. Since Java doesn't have frameworks like Linq, and its enumerable framework isn't quite as developed (in my opinion, or at least it has taken a different route from .NET), I thought the challenge would help me figure out where I could improve. I continued to use regular expressions to match patterns at the beginning (left side) of the expression and remove those bits from it, but this time I thought: why do I need to break the expression down into an array of chars, and why do I need to process whitespace? So I stripped the whitespace from the expression before tokenizing it and left it as a string. The result was a little bit cleaner.
  4. Right away I went back to .NET to fully clean up the design. I followed the same pattern of first removing whitespace and then, upon finding tokens based on regular expressions, removing them from the expression to be evaluated. This time I also decided to separate all of the possible tokens into their own classes that contained their regular expressions. It turned out to be a very manageable system, though it felt somewhat cluttered to have 28 classes in a single project. I also started to work on the idea of having matrices and variables in the system, which required a State class to keep track of the internal heap. I perfected (at least as much as I could) the notion of negative numbers: I kept track of the last token I read, and if it was either nothing or a number, then the hyphen was a subtraction symbol; but if I had read a parenthesis or another operator, then I was pretty certain I was dealing with negation, and I would set myself up to negate the next number. That worked very well. I still had the function problem, and didn't find a way to resolve it at that point.
  5. My fifth attempt took a wild turn. With all of this experience behind me, I determined to figure out whether leaning more heavily on regular expressions would be worthwhile. I remained in .NET this time. I decided to use recursion and regular expressions to evaluate sub-expressions hierarchically; then, instead of worrying about postfixation of the tokens, I could immediately know the answers to individual complete components of the whole. I had a couple of giant regular expressions to test for things like decimal and hexadecimal numbers and sub-expressions delimited by parentheses. About this time I had discovered Perl more intimately, which has its own operator for the power function, **. I immediately included it in my own evaluation and didn't have to worry much anymore about multiple-argument functions. I also used regular expressions to find functions quickly, evaluate their contents first, and then the functions themselves. I overcame a lot of previous hurdles in this attempt; however, it ultimately was just a learning session. The regular-expression-recursion-based calculator, the one I had initially wanted to build but hadn't known how to, turned out to be a flop. The evaluation of simple expressions such as
    24 + 12
    was taking far too long. I thought I had reduced complexity a little by getting rid of the Shunting-yard algorithm altogether and by evaluating the expression immediately rather than lazily. I mentioned these facts about this particular incarnation in my previous post. So I decided to try again to make a text-based calculator tool that I could use on a daily basis, one that would be relatively fast and simple in design and structure. I wanted to ensure that it would be easy to add new functionality and that I could prove its correctness.
  6. Enter the sixth incarnation of Personal Calculator. The latest and truly greatest attempt was somewhat inspired by another project, totally unrelated to calculators, that I happened to notice in my Visual Studio RSS feed reader:
    Introducing Elevate. Elevate is a Boost-like library for .NET that features many utilities and high-performance tools that may be of value to developers.
    I didn't really know what Boost was, and I thought, "What can it hurt? I'll take a look." So I did, and I found something in Elevate that I really liked. Elevate is an extension project for .NET that extends classes with the utilities that make programming easier and more sensible in other languages, like Ruby's "everything is an object" mentality or Haskell's currying. Instantly I thought about how I had parsed tokens out of a mathematical expression in the past using regular expressions. And I thought about the real differences between the String.Split methods found in many languages and the old StringTokenizer class in Java. That class never really did what I expected; all it does is provide an enumerable collection of tokens from a string using a delimiter. Isn't that what String.Split does? So I created a string extension method for .NET that tokenizes the current string based on an array of regular expressions. It follows the same pattern as the third and fourth attempts at parsing: look for specific tokens at the beginning of the string and remove each whole token upon finding it. It doesn't differentiate between types of tokens, though; that is done later in the calculator. It worked like a charm. Here is the implementation:
    // Requires System.Collections.Generic and System.Text.RegularExpressions.
    public static string[] Tokenize(this string s, string[] patterns)
    {
        List<string> tokens = new List<string>();
        bool noMatchesFound = false;
        while (!String.IsNullOrEmpty(s) && !noMatchesFound)
        {
            for (int index = 0; index < patterns.Length; index++)
            {
                Regex pattern = new Regex(patterns[index]);
                if (pattern.IsMatch(s))
                {
                    // Consume the matched token from the front of the string
                    // and restart from the first pattern.
                    string match = pattern.Match(s).Value;
                    tokens.Add(match);
                    s = s.Substring(match.Length);
                    index = -1;
                }

                if (index == patterns.Length - 1)
                {
                    noMatchesFound = true;
                }
            }
        }

        return tokens.ToArray();
    }
    And here is an example of usage:
    string[] tokens = expression.Tokenize(new string[] {
        "^[\\d]+([.]{1}[\\d]+){0,1}",
        "^[-]{0,1}[\\d]+([.]{1}[\\d]+){0,1}",
        "^[+]{1}",
        "^[-]{1}",
        "^(\\*\\*){1}",
        "^[*]{1}",
        "^[/]{1}"
    });
    All in all, this worked out great for reducing complexity in the calculator, because I didn't have to worry about tokenizing anymore. Then I could focus on evaluation, which would still require the Shunting-yard algorithm; but I already knew that my implementation was relatively fast, and I still didn't really need to keep track of token types, because I could always look them up, which in a hashtable is a one-to-one lookup, so O(1) is all the processor time I'd need for that.
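For anyone unfamiliar with it, the Shunting-yard conversion and postfix evaluation that these attempts keep coming back to can be sketched in a few dozen lines of Java. This is just an illustration with made-up names (ShuntingYardSketch, toPostfix, evaluate), handling only the binary operators above; it is not my actual ExpressionEvaluator or PrecedenceMap:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; names do not match the calculator's classes.
public class ShuntingYardSketch {

    // Higher number = binds tighter.
    private static final Map<String, Integer> PRECEDENCE =
            Map.of("+", 1, "-", 1, "*", 2, "/", 2, "**", 3);

    // Shunting-yard: convert an infix token list to postfix order.
    public static List<String> toPostfix(List<String> tokens) {
        List<String> output = new ArrayList<>();
        Deque<String> operators = new ArrayDeque<>();
        for (String token : tokens) {
            if (PRECEDENCE.containsKey(token)) {
                // Pop operators that must be applied before this one.
                while (!operators.isEmpty() && popsFirst(operators.peek(), token)) {
                    output.add(operators.pop());
                }
                operators.push(token);
            } else {
                output.add(token); // a number goes straight to the output
            }
        }
        while (!operators.isEmpty()) {
            output.add(operators.pop());
        }
        return output;
    }

    // ** is right-associative, so it only yields to strictly higher precedence.
    private static boolean popsFirst(String onStack, String incoming) {
        int stacked = PRECEDENCE.get(onStack);
        int coming = PRECEDENCE.get(incoming);
        return incoming.equals("**") ? stacked > coming : stacked >= coming;
    }

    // Evaluate a postfix token list with a single value stack.
    public static double evaluate(List<String> postfix) {
        Deque<Double> stack = new ArrayDeque<>();
        for (String token : postfix) {
            if (PRECEDENCE.containsKey(token)) {
                double right = stack.pop();
                double left = stack.pop();
                switch (token) {
                    case "+": stack.push(left + right); break;
                    case "-": stack.push(left - right); break;
                    case "*": stack.push(left * right); break;
                    case "/": stack.push(left / right); break;
                    default: stack.push(Math.pow(left, right)); break; // "**"
                }
            } else {
                stack.push(Double.parseDouble(token));
            }
        }
        return stack.pop();
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("24", "+", "12", "*", "2");
        System.out.println(evaluate(toPostfix(tokens))); // prints 48.0
    }
}
```

With 24 + 12 * 2 tokenized as above, toPostfix yields 24 12 2 * + and evaluate returns 48, because * outranks + in the precedence map.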
As you can see, I haven't gotten around to implementing parentheses or functions yet in this version, but it only took me six hours to write. I've been able to abstract a good portion of the program out into a separate project that I call Nathandelane.System, and I can now accept command-line arguments to set the initial state of the calculator. My classes are also more acceptable: Calculator, which contains state and pattern matching; ExpressionEvaluator, which puts the expression into postfix format and evaluates it; PrecedenceMap, which defines precedence for operators and functions; TokenPatterns, which defines all of the patterns for individual tokens; and TokenType, which is a simple enumeration to keep track of the token type during evaluation.

So with six different methods of conquering the same task, some worked better than others, some were more complex than others, and some took more processing power than others. I guess programming boils down to experience. I have learned a lot doing this project. I probably would have done better if I had based my project fully on somebody else's project. If I can extract common patterns that make sense, which I have, then I can provide a better outcome in the end.
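The negation rule from attempt 4 is simple enough to sketch as well. This is an illustrative Java version that encodes exactly the rule described there (nothing or a number before the hyphen means subtraction; an opening parenthesis or another operator means negation); the class and method names are mine, not the calculator's:

```java
// Sketch of the negation heuristic from attempt 4. Names are illustrative.
public class NegationSketch {

    private static boolean isOperator(String token) {
        return token.equals("+") || token.equals("-") || token.equals("*")
                || token.equals("/") || token.equals("**");
    }

    // Look at the last token read: after nothing or a number the hyphen is
    // subtraction; after "(" or another operator it negates what follows.
    public static boolean isNegation(String previousToken) {
        if (previousToken == null || previousToken.isEmpty()) {
            return false; // per the rule above, a leading hyphen subtracts
        }
        return previousToken.equals("(") || isOperator(previousToken);
    }

    public static void main(String[] args) {
        System.out.println(isNegation("3"));  // false: "3 - 5" subtracts
        System.out.println(isNegation("("));  // true:  "(-5" negates
        System.out.println(isNegation("*"));  // true:  "3 * -5" negates
    }
}
```

When isNegation returns true, the tokenizer simply marks the next number to be negated instead of emitting a subtraction operator.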