03 December 2009

Code Quality

As I program more and more, the quantity of files, classes, enumerations, structs, and fields increases dramatically, and with that increase comes an ever-growing possibility of defects. When I program I try to follow a set of guidelines known as a programming style guide. Style guides help us be consistent in whatever we're doing. Style guides are probably more commonly associated with artistic endeavors like creating a magazine, a newspaper, a textbook, or even a web site: somewhere readability and usability matter. But wait, don't readability and usability matter in code as well? I believe they absolutely do.

IDEs

1. Using an IDE for all of your coding is one good way to increase code quality.
A lot of style is handled easily by using an IDE (Integrated Development Environment). IDEs like Eclipse, NetBeans, and Visual Studio .NET take care of indentation, tabs, closing curly braces, and many other important details while programming. One feature that I really enjoy about most IDEs (even Vim and Emacs do this) is the ability to ensure that tabs are always tab characters while letting their displayed width in spaces be configured per developer. This ensures consistency in the way code is stored, but allows each developer to retain his own look, or rather the look of his code, for his own purposes.

Code Complete

2. Reading about best and good, sound practices is another great way to boost the quality of your code.
Code Complete[1] is a great book that I read recently. Some programmers will say that Code Complete is good for some programming languages but not for others, but I don't think that is correct. While Code Complete was written around C and C++ programming, the practices described in it are sound when coding in any language, and most of them can be applied anywhere.
The practices described in Code Complete entail things like when to write a method, how to formulate field names, how to ensure that you are using fields in the correct scope, how to write self-documenting code, and how to keep your code from being too confusing to other programmers, or you after a year or two away from it.
I advise all people pursuing programming as a career or as a hobby (or both, as in my case) to read this book. If you can't afford the $50 ($40 if you buy it from Barnes and Noble online) to get the book, then check out your city library; they are very likely to have it. If they don't, then you can always stop into your favorite major bookseller, sit down, and take a few minutes to read it every couple of days. I recommend saving up and buying it, though.

Design Patterns

3. Learn and use design patterns.
For a very long time now I have been aware of design patterns. What I was not aware of until recently is that design patterns are an incredible way for programmers to communicate ideas about projects. For example, instead of saying,
"I only want to have one global instance of this class available at any given time, so I'm going to put it into a settings class, and hopefully nobody will try to make their own instance",
you can say,
"I only want to have one instance of this class available at any given time, so I'll make it into a Singleton."
Singleton is one of those design patterns. But the second statement does two things for the programmer and his colleagues that the first one does not:
  1. The first statement conveys worry that somebody will make multiple instances of the class, while the second one, by stating that the Singleton design pattern will be used, guarantees that nobody will be able to make new instances of it.
  2. The first statement does little to convey intent, while the second says the class will be a Singleton, so programmers who know that pattern already understand how he will prevent multiple instances and make the one instance globally available. Notice that the second statement never even used the word "global".
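As a minimal sketch of what that promise looks like in code, here is the classic Singleton shape in Java (the class name Settings is just an illustration, not from any particular project):

```java
// Classic Singleton sketch: the private constructor prevents anyone from
// creating extra instances, and the static accessor hands out the one
// shared instance, making it effectively global without saying "global".
public final class Settings {
    private static final Settings INSTANCE = new Settings();

    private Settings() {
        // nobody outside this class can call new Settings()
    }

    public static Settings getInstance() {
        return INSTANCE; // always the same object
    }
}
```

Every caller of Settings.getInstance() receives the same object, so the "only one instance" intent is enforced by the compiler rather than by hope.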
Aside from knowing when to create classes or methods, design patterns are probably the programmer's best toolkit for communicating intent and reasoning about why something should be written a certain way. A good book about some design patterns that I am currently reading as a refresher is "Head First Design Patterns"[2] from O'Reilly. It hits several of the very common design patterns and highlights design principles, such as
"Program to an interface, not an implementation"
which help you keep the quality of your code in high standing.

Code Reviews

4. Ask for code reviews and do them for your peers.
Very few things keep a programmer from writing messy code better than peer reviews. We did it in high school, we did it in college, and it wasn't just a good idea or busy work; we should do it professionally as well. If we don't do code reviews professionally, then we are missing out on a time-tested method of increasing code quality. It all boils down to this: if your peers can't understand what's going on, then how can you?
The next time you write a class, pass it on to your neighbor in the next cubicle in an email or link him to your code and let him take a look to see if what you did makes sense. Ideally you probably shouldn't need to document every single line. If it's good and clean then your cube-neighbor will likely be able to see your intent and see the paths through your code.
Likewise ask to view others' code. One way to get better is to see how others program, share ideas, and take an active role in improving your code quality.

Conclusion

Nobody should be afraid to ask for help in ensuring their code is high quality. By following the above four principles in your coding, your code quality will improve drastically. I guarantee it, and if it doesn't, then you're not trying hard enough. Here are the four principles again:
  1. Using an IDE for all of your coding is one good way to increase code quality.
  2. Reading about best and good, sound practices is another great way to boost the quality of your code.
  3. Learn and use design patterns.
  4. Ask for code reviews and do them for your peers.
If you have some other ideas, please share them.

Bibliography
1. "Code Complete", by Steve McConnell, Microsoft Press, (C) 1993-2007 Steven C. McConnell.
2. "Head First Design Patterns", by Eric Freeman, Elisabeth Freeman, Kathy Sierra, and Bert Bates, O'Reilly, (C) 2004 O'Reilly Media, Inc.

13 October 2009

Two or Three Ways to Do Things

As I continually program throughout my career and daily life, I often come across the problem that there are simply too many ways to program a system. The other problem is that there really isn't a single correct way to program any specific system; the means to an end often follow trends. To clarify, I have been trying to program a decent text-based calculator for some time now. I began in Java and moved to .NET because it was a little simpler and higher level for me. I'll likely go back to Java because it suits my needs better, since I need a tool that can run anywhere. Let me backtrack a little. Since my first attempt at the calculator, it has seen at least six incarnations: two as a Java application and four as a .NET application.
  1. My first attempt involved simply trying to parse out numbers and symbols, transform them into tokens, and finally solve the given equation using the Shunting-yard algorithm. That first attempt failed miserably, and I actually got feedback on it to prove it. One of my biggest problems was, has been, and likely always will be negation. Though it isn't much of a problem now, it always looms as an issue to be resolved late in the development cycle (usually last, after completing the basic system).
  2. My second attempt was to try to do the same thing in .NET, only a little better (usually when you redo something, you try to learn from your initial mistakes). This time I thought more about something one of my contacts said I should try, namely using regular expressions. I also thought I'd use Linq to see if I could make it work. Basically I used regular expressions to see what was coming next in the expression and to remove the next item from the original expression. Like in the first attempt, I first turned the whole string into an array of chars, so when I was looking for tokens like numbers, I was reconstructing the number from individual characters. Ultimately I ended up with two clunky types: a Posfixator class, which also parsed the expression, and an Evaluator class, which took the postfixated expression and evaluated it for an answer. Some of the difficulties I had with this incarnation included parentheses and functions. One very difficult thing was implementing multiple-argument functions, like Power (defined in many languages with a signature like Math.Pow(double x, double y)). I had reserved the comma for that purpose, but it wasn't really working.
  3. The third attempt to make a good calculator involved going back to Java. The reason for this was to try and get the low-level concepts clarified. Since Java doesn't have frameworks like Linq, and its enumerable framework isn't quite as developed (in my opinion, or at least it has taken a different route from .NET), I thought the challenge would help me figure out where I could improve. I continued to use the idea of regular expressions to match patterns at the beginning (left side) of the expression and remove those bits from the expression, but this time I thought, why do I need to break the expression down into an array of chars, and why do I need to process whitespace? So I got rid of whitespace in the expression before tokenizing it, and left it as a string. The result was a little bit cleaner.
  4. Right away I went back to .NET to fully clean up the design, following the same pattern of first removing whitespace and then, upon finding tokens based on regular expressions, removing them from the expression to be evaluated. This time I also decided to separate all of the possible tokens into their own classes that contained their regular expressions. It turned out to be a very manageable system, though at the same time it felt somewhat cluttered to have 28 classes all in a single project. I also started to work on the idea of having matrices and variables in the system, which required making a State class to keep track of the internal heap. I perfected (at least as much as I could) the handling of negative numbers: I kept track of the last token I read, and if it was a number (or a closing parenthesis) then the hyphen was a subtraction symbol, but if it was nothing, an opening parenthesis, or another operator, then I was pretty certain I was dealing with negation, and I would set myself up to negate the next number. That worked very well. I still had the function problem, and didn't find a way to resolve it at that point.
  5. My fifth attempt took a wild turn. With all of this experience behind me I determined to figure out whether using regular expressions more heavily would be worthwhile. I remained in .NET this time. I decided that I would use recursion and regular expressions to evaluate sub-expressions hierarchically; then, instead of needing to worry about postfixation of the tokens, I would immediately know the answers to individual complete components of the whole. This worked conceptually. I had a couple of giant regular expressions to test for things like decimal and hexadecimal numbers and sub-expressions defined by what was contained in parentheses. About this time I had discovered Perl more intimately, which has its own operator for the power function, **. I immediately included it in my own evaluation and didn't have to worry much anymore about multiple-argument functions. I also used regular expressions to find functions quickly, evaluating their contents first and then the functions themselves. I overcame a lot of previous hurdles in this attempt; however, it ultimately was just a learning session. The regular-expression-recursion-based calculator, the one I had initially wanted to build but didn't know how to yet, turned out to be a flop. The evaluation of simple expressions such as
    24 + 12
    was taking far too long. I thought I had reduced complexity a little by getting rid of the Shunting-yard algorithm altogether and by evaluating the expression immediately rather than using lazy evaluation. I mentioned these facts about this particular incarnation of the calculator in my previous post. So I decided that I needed to try again to make a text-based calculator tool that I would be able to use on a daily basis, one that would be relatively fast and simple in design and structure. I wanted to ensure that it would be easy to add new functionality, and that I could prove its correctness.
  6. Enter the sixth incarnation of the Personal Calculator. The latest and truly greatest attempt was somewhat inspired by another project, totally unrelated to calculators, that I happened to notice in my Visual Studio RSS Feed Reader:
    Introducing Elevate. Elevate is a Boost-like library for .NET that features many utilities and high performance tools that may be of value to developers.
    I didn't really know what Boost was, and I thought, "What can it hurt? I'll take a look." So I did, and I found something in Elevate that I really liked. Elevate is an extension project for .NET that extends classes with the utilities that make programming easier and more sensible in other languages, like Ruby's "everything is an object" mentality, or Haskell's currying. Instantly I thought about how I had done parsing in the past to pull tokens from a mathematical expression using regular expressions. And I thought about the real differences between the newer String.Split methods used in many languages and the old StringTokenizer class in Java. That class never really did what I expected it to, as all it does is create an enumerable collection of tokens from a string using a delimiter. Isn't that what String.Split does? So I created a string extension method for .NET that tokenizes the current string based on an array of regular expressions. It follows the same pattern as the third and fourth attempts at parsing: look for specific tokens at the beginning of the string and remove each whole token upon finding it. It doesn't differentiate between types of tokens; that is done later in the calculator. It worked like a charm. Here is the implementation:
    public static string[] Tokenize(this string s, string[] patterns)
    {
        List<string> tokens = new List<string>();
        bool noMatchesFound = false;
        while (!String.IsNullOrEmpty(s) && !noMatchesFound)
        {
            for (int index = 0; index < patterns.Length; index++)
            {
                Regex pattern = new Regex(patterns[index]);
                if (pattern.IsMatch(s))
                {
                    string match = pattern.Matches(s)[0].Value;
                    tokens.Add(match);
                    s = s.Substring(match.Length);
                    index = -1; // restart with the first pattern
                }

                if (index == patterns.Length - 1)
                {
                    noMatchesFound = true;
                }
            }
        }

        return tokens.ToArray();
    }
    And here is an example of usage:
    string[] tokens = expression.Tokenize(new string[] {
        "^[\\d]+([.]{1}[\\d]+){0,1}",
        "^[-]{0,1}[\\d]+([.]{1}[\\d]+){0,1}",
        "^[+]{1}",
        "^[-]{1}",
        "^(\\*\\*){1}",
        "^[*]{1}",
        "^[/]{1}"
    });
    All in all this worked out great for reducing complexity in the calculator, because I didn't have to worry about tokenizing anymore. Then I could focus on evaluation, which would still require the Shunting-yard algorithm; but I already knew that my implementation was relatively fast, and I still didn't really need to keep track of token types, because I could always look them up, which in a hashtable is an O(1) operation, so that's all the processor time I'd need for it.
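The negation rule from the fourth attempt above, deciding whether a hyphen means subtraction or negation by looking at the previous token, can be sketched like this in Java (the class and method names are mine, not from the actual calculator):

```java
// Sketch of the hyphen rule: '-' is subtraction only when the previous token
// was a number or a closing parenthesis; after nothing, an opening
// parenthesis, or another operator, it marks negation of the next number.
public class HyphenRule {
    public static boolean isNegation(String previousToken) {
        if (previousToken == null) {
            return true; // '-' at the very start of an expression negates
        }
        boolean previousWasNumber = previousToken.matches("\\d+(\\.\\d+)?");
        boolean previousWasCloseParen = previousToken.equals(")");
        return !(previousWasNumber || previousWasCloseParen);
    }
}
```

So in "5 - 3" the hyphen follows a number and subtracts, while in "5 * -3" it follows an operator and negates.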
As you can see I haven't gotten around to implementing parentheses or functions yet in this version, but it only took me six hours to write. I've been able to abstract a good portion of the program out into a separate project that I call Nathandelane.System, and I am now able to accept command-line arguments to set the initial state of the calculator. Also, my classes are more reasonable: Calculator, which contains state and pattern matching; ExpressionEvaluator, which puts the expression into postfix format and evaluates it; PrecedenceMap, which defines precedence for operators and functions; TokenPatterns, which defines all of the patterns for individual tokens; and TokenType, which is a simple enumeration to keep track of the token type during evaluation. So with six different methods of conquering the same task, some worked better than others, some were more complex than others, and some ended up taking more processing power than others. I guess programming boils down to experience. I have learned a lot doing this project. I probably would have done better if I had based my work fully on somebody else's project, but by extracting common patterns that make sense, which I have, I can provide a better outcome in the end.
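The PrecedenceMap idea, with operator information recovered by O(1) lookup rather than stored with each token, can be sketched in Java with a plain HashMap (the names here are illustrative, not the actual Nathandelane.System classes):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative precedence table: instead of tagging every token with its
// type, the evaluator looks the operator up on demand; HashMap.get is an
// O(1) operation on average.
public class Precedence {
    private static final Map<String, Integer> TABLE = new HashMap<>();

    static {
        TABLE.put("+", 1);
        TABLE.put("-", 1);
        TABLE.put("*", 2);
        TABLE.put("/", 2);
        TABLE.put("**", 3); // the Perl-style power operator
    }

    public static int of(String operator) {
        return TABLE.getOrDefault(operator, 0); // 0 means "not an operator"
    }
}
```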

22 September 2009

The Programmer's Calculator

A while back I determined that I wanted an easy to use text-mode calculator that could perform all of the calculations I required in my job as a web programmer. Often these are simple arithmetic. As I set out to write my first calculator that was more than a simple GUI with buttons to click, I began asking questions about the best method to use in the process. I already knew about postfix notation, which is where you introduce operands first and the function last, as in
384 1024 +
This is the addition function with the two operands 384 and 1024 for its left and right. In standard mathematical infix notation this would simply look like 384 + 1024. To get from infix to postfix notation, the simplest method is to use what is known as the Shunting-yard algorithm. This algorithm takes all of the tokens of a mathematical expression and divides them into four (or more) types: operators/functions, numbers, function operand separators (commas), and parentheses. Using these tokens, the algorithm pushes operators and functions onto a stack while sending numbers straight to the output, popping operators off to the output according to precedence and parentheses. Ultimately what comes out is the expression rewritten in postfix order, which can now be evaluated linearly to produce a correct result. As an example take the following infix expression:
-(32 + 12) * 19 / 3 + (-1 * 92)
When converted to postfix format, the expression would appear as follows:
32 12 + - 19 * 3 / -1 92 * +
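The conversion just described can be sketched compactly in Java. This version handles numbers, the four basic operators, and parentheses on space-separated input, and leaves out unary minus and functions to keep the idea clear:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Minimal Shunting-yard sketch: numbers go straight to the output while
// operators wait on a stack until an operator of lower precedence (or a
// closing parenthesis) forces them out.
public class ShuntingYard {
    private static final Map<String, Integer> PREC = new HashMap<>();

    static {
        PREC.put("+", 1);
        PREC.put("-", 1);
        PREC.put("*", 2);
        PREC.put("/", 2);
    }

    public static String toPostfix(String infix) {
        StringBuilder output = new StringBuilder();
        Deque<String> operators = new ArrayDeque<>();
        for (String token : infix.split("\\s+")) {
            if (token.matches("-?\\d+(\\.\\d+)?")) {
                output.append(token).append(' ');
            } else if (token.equals("(")) {
                operators.push(token);
            } else if (token.equals(")")) {
                while (!operators.peek().equals("(")) {
                    output.append(operators.pop()).append(' ');
                }
                operators.pop(); // discard the "("
            } else { // a binary operator
                while (!operators.isEmpty() && !operators.peek().equals("(")
                        && PREC.get(operators.peek()) >= PREC.get(token)) {
                    output.append(operators.pop()).append(' ');
                }
                operators.push(token);
            }
        }
        while (!operators.isEmpty()) {
            output.append(operators.pop()).append(' ');
        }
        return output.toString().trim();
    }
}
```

For instance, ShuntingYard.toPostfix("( 32 + 12 ) * 19 / 3") returns "32 12 + 19 * 3 /".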
You then evaluate the postfix expression from left to right, pushing numbers onto a stack and applying operators as you encounter them. I list the operations:
  1. Put 32 into a stack
  2. Put 12 into a stack
  3. Found +, pop two numbers from the stack, right = 12, left = 32, and add them
  4. Put 44 into a stack
  5. Found -, try to pop two numbers, but find only one, so this is a negation; negate 44
  6. Put -44 into a stack
  7. Put 19 into a stack
  8. Found *, pop two numbers from the stack, right = 19, left = -44, and multiply them
  9. Put -836 into a stack
  10. Put 3 into a stack
  11. Found /, pop two numbers from the stack, right = 3, left = -836, and divide them
  12. Put -278.66 into a stack
  13. Put -1 into a stack
  14. Put 92 into a stack
  15. Found *, pop two numbers from the stack, right = 92, left = -1, and multiply them
  16. Put -92 into a stack
  17. Found +, pop two numbers from the stack, right = -92, left = -278.66, and add them
  18. Put -370.66 into a stack
  19. Nothing else to evaluate, so pop the only number on the stack, -370.66 is your answer
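The steps above can be sketched as a small stack evaluator. Note that the treatment of - in step 5 (only one value on the stack means negation) mirrors this particular walkthrough and is not a fully general negation rule:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Stack-based postfix evaluator following the walkthrough above: numbers are
// pushed; operators pop a right then a left operand and push the result; a
// '-' that finds only one value on the stack is treated as negation.
public class PostfixEvaluator {
    public static double evaluate(String postfix) {
        Deque<Double> stack = new ArrayDeque<>();
        for (String token : postfix.split("\\s+")) {
            if (token.matches("-?\\d+(\\.\\d+)?")) {
                stack.push(Double.parseDouble(token));
            } else if (token.equals("-") && stack.size() == 1) {
                stack.push(-stack.pop()); // step 5: negation
            } else {
                double right = stack.pop();
                double left = stack.pop();
                switch (token) {
                    case "+": stack.push(left + right); break;
                    case "-": stack.push(left - right); break;
                    case "*": stack.push(left * right); break;
                    case "/": stack.push(left / right); break;
                }
            }
        }
        return stack.pop(); // step 19: the lone value is the answer
    }
}
```

Running it on the postfix expression above gives -370.666..., matching the hand evaluation (the -370.66 in the steps is truncated).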
Just for the sake of checking the answer:
  1. -(32 + 12) * 19 / 3 + (-1 * 92)
  2. -44 * 19 / 3 + -92
  3. -836 / 3 + -92
  4. -278.66 + -92
  5. -370.66
Fewer operations to do it by hand, but we end up with the same answer. So it works. My first few attempts at creating my calculator relied on this method of solving mathematical expressions. I wrote one in Java, and one in C#, and both were fast and efficient. One of the people I talked to suggested I try evaluating using regular expressions. So most recently I took on the challenge. I downloaded myself a decent regular expression tester (RegExr), and began working on what makes up each of the tokens. I would then extract the tokens based on their matches. In the first method I also used regular expressions to find numbers, functions and operators, but I only had to know about each token individually. The case with the regular expression based calculator would be that I would focus on expression snippets or sub-expressions. First I went to the regexp that I had originally used to define a number:
[\d.]+
I found that this was not exactly accurate, because with it I could have numbers like 24.1.0 or .1.1.1.2, which are not valid decimal values. So I dug a little deeper and came up with:
(([-]{0,1}[\d]+([.]{0,1}[\d]+){0,1})([d]{0,1}))
The individual components of a decimal number are: first, an optional negative sign, [-]{0,1}; then one or more digits, [\d]+; possibly followed by a decimal point and one or more digits, ([.]{0,1}[\d]+){0,1}; and finally I added the possibility of suffixing the number with the letter d, [d]{0,1}. So the following values would be valid decimal numbers:
  • 1
  • -1
  • 24
  • 19
  • 236
  • 1.12
  • 0.98
  • 20d
  • 3.1412356487897
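That pattern can be checked directly in Java, where Pattern.matches anchors the expression to the whole string (the class name here is just for illustration):

```java
import java.util.regex.Pattern;

// Checks candidate strings against the decimal-number pattern from the post:
// an optional minus sign, digits, an optional fractional part, and an
// optional 'd' suffix.
public class DecimalPattern {
    public static final String DECIMAL =
        "(([-]{0,1}[\\d]+([.]{0,1}[\\d]+){0,1})([d]{0,1}))";

    public static boolean isDecimal(String candidate) {
        return Pattern.matches(DECIMAL, candidate); // anchored at both ends
    }
}
```

All nine of the values listed above pass this check, while strings like 24.1.0 are rejected because the leftover characters prevent a whole-string match.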
I didn't want to allow for commas because I wanted to reserve commas for function operand separators in case I had a function that would take more than a single value (like trig functions). I then set out defining the most basic arithmetic operator patterns including numbers. I will refer to the pattern for decimal numbers by the letter n. I came up with the following patterns:
  • Addition, n([+]{1})n
  • Subtraction, n([-]{1})n
  • Multiplication, n([*]{1})n
  • Division, n([/]{1})n
  • Power, n((\*\*){1})n
  • XOR, n([\^]{1})n
  • Modulus, n([%]{1})n
  • AND, n([&]{1})n
  • OR, n([|]{1})n
To shorten this process I combined all of the operators together to look for arithmetic sub-expressions and then determined order of operations after that. This sort of evaluation allowed me to evaluate chunks of the expression one time. With the Shunting-yard algorithm I had to find all of the tokens first and put them in the correct order, and then evaluate them. With the regular expression algorithm I was able to evaluate recursively. The one downfall of the regular expression algorithm, even though it was cool, is that it generally takes five or more times longer to evaluate simple expressions:
  • Regular Expression algorithm, 24 + 12, five seconds
  • Shunting-yard algorithm, 24 + 12, less than one second
What have I learned? Without digging deeply into the possibility that recursion is to blame,
  1. Regular expressions tend to be convenient for locating sub-expressions
  2. Evaluating everything with regular expressions tends to be slower than parsing once and evaluating once
  3. The code complexity of the regular expression algorithm was lower because of recursion (almost a recursion anti-pattern, since recursion generally adds complexity)
  4. The Shunting-yard algorithm has been time tested and well thought out
  5. The regular expression algorithm took a long time matching large regular expressions, like the sub-expression pattern shown here:
    ([\(]{1})((([-]{0,1}[\dABCDEF]+)([h]{1}))|(([-]{0,1}[\d]+([.]{0,1}[\d]+){0,1})([d]{0,1}
    ))|(([-]{0,1}[01234567]+)([o]{1}))|(([-]{0,1}[01]+)([b]{1}))|((\*\*|[+-/*\(\)^%&|])
    {1})|([-]{0,1}(e|pi|\$){1})|(([-]{0,1}([bdho]|cos|sin|tan){1})([\(]{1})((([-]{0,1}[\dAB
    CDEF]+)([h]{1}))|(([-]{0,1}[\d]+([.]{0,1}[\d]+){0,1})([d]{0,1}))|(([-]{0,1}[01234567]+)
    ([o]{1}))|(([-]{0,1}[01]+)([b]{1}))|((\*\*|[+-/*\(\)^%&|]){1})|([-]{0,1}([bdho]{1}|
    cos|sin|tan){1})|([-]{0,1}(e|pi|\$){1}))+([\)]{1})))+([\)]{1})
    Nice, huh!
I have written four different calculators using Java and C#, and I have found that both languages handle both algorithms equally well. I have found several satisfactory ways to implement the Shunting-yard algorithm, and one way to do the regular expression algorithm well, though it proved significantly slower. So what's next? I think I'm going to try using something like Shunting-yard to do prefix-based calculations; whether there is already a name for that, I don't know. Finding the cleanest and most minimal way of using the Shunting-yard algorithm is still beyond me. With my second Shunting-yard calculator (my first in C#), I found that creating a class for each token worked out well, but with the regular expression calculator I found that I really didn't need to keep track of tokens; I just injected solutions to sub-expressions back into the original expression and re-evaluated. Cheers.