Software Science: September 2009

A while back I determined that I wanted an easy to use text-mode calculator that could perform all of the calculations I required in my job as a web programmer. Often these are simple arithmetic. As I set out to write my first calculator that was more than a simple GUI with buttons to click, I began asking questions about the best method to use in the process. I already knew about postfix notation, which is where you introduce operands first and the function last, as in

384 1024 +

This is the addition function with the two operands 384 and 1024 for its left and right. In standard mathematical infix notation this would simply look like 384 + 1024. To get from infix to postfix notation, the simplest method is to use what is known as the Shunting-yard Algorithm. This algorithm takes all of the tokens of a mathematical equation and divides them into four (or more) types: operators/functions, numbers, function operand separators (commas), and parentheses. Using these tokens the algorithm puts operators and functions in one stack and numbers in another stack. Ultimately what comes out is a hierarchical tree representing the expression given as input that can now be parsed linearly to produce a correct result. As an example take the following infix expression:

-(32 + 12) * 19 / 3 + (-1 * 92)

When converted to postfix format, the expression would appear as follows:

32 12 + - 19 * 3 / -1 92 * +

You begin to evaluate this then from left to right, first gathering arguments and then operators. I list the operations.

Put 32 into a stack
Put 12 into a stack
Found +, pop two numbers from the stack, right = 12, left = 32, and add them
Put 44 into a stack
Found -, try to pop two number, but found one, so this is a negation, then negate 44
Put -44 into a stack
Put 19 into a stack
Found *, pop two numbers from the stack, right = 19, left = -44, and multiply them
Put -836 into a stack
Put 3 into a stack
Found /, pop two numbers from the stack, right = 3, left = -836, and divide them
Put -278.66 into a stack
Put -1 into a stack
Put 92 into a stack
Found *, pop two numbers from the stack, right = 92, left = -1, and multiply them
Put -92 into a stack
Found +, pop two numbers from the stack, right = -92, left = -278.66, and add them
Put -370.66 into a stack
Nothing else to evaluate, so pop the only number on the stack, -370.66 is your answer

Just for the sake of checking the answer:

-(32 + 12) * 19 / 3 + (-1 * 92)
-44 * 19 / 3 + -92
-836 / 3 + -92
-278.66 + -92
-370.66

Fewer operations to do it by hand, but we end up with the same answer. So it works. My first few attempts at creating my calculator relied on this method of solving mathematical expressions. I wrote one in Java, and one in C#, and both were fast and efficient. One of the people I talked to suggested I try evaluating using regular expressions. So most recently I took on the challenge. I downloaded myself a decent regular expression tester (RegExr), and began working on what makes up each of the tokens. I would then extract the tokens based on their matches. In the first method I also used regular expressions to find numbers, functions and operators, but I only had to know about each token individually. The case with the regular expression based calculator would be that I would focus on expression snippets or sub-expressions. First I went to the regexp that I had originally used to define a number:

[\d.]+

I found that this was not exactly accurate, because with it I could have numbers like 24.1.0 or .1.1.1.2, which are not valid decimal values. So I dug a little deeper and came up with:

(([-]{0,1}[\d]+([.]{0,1}[\d]+){0,1})([d]{0,1}))

The individual components of a decimal number are first, the possibility of it starting with a negative sign [-]{0,1} then one or more digits [\d]+ possibly followed by a decimal and then one or more digits again ([.]{0,1}[\d]+){0,1} and then I added the possibility that you could suffix your number with the letter d. So the following values would be correct decimal numbers:

1
-1
24
19
236
1.12
0.98
20d
3.1412356487897

I didn't want to allow for commas because I wanted to reserve commas for function operand separators in case I had a function that would take more than a single value (like trig functions). I then set out defining the most basic arithmetic operator patterns including numbers. I will refer to the pattern for decimal numbers by the letter n. I came up with the following patterns:

Addition, n([+]{1})n
Subtraction, n([-]{1})n
Multiplication, n([*]{1})n
Division, n([/]{1})n
Power, n((\*\*){1})n
XOR, n([^]{1})n
Modulus, n([%]{1})n
AND, n([&]{1})n
OR, n([|]{1})n

To shorten this process I combined all of the operators together to look for arithmetic sub-expressions and then determined order of operations after that. This sort of evaluation allowed me to evaluate chunks of the expression one time. With the Shunting-yard algorithm I had to find all of the tokens first and put them in the correct order, and then evaluate them. With the regular expression algorithm I was able to evaluate recursively. The one downfall of the regular expression algorithm, even though it was cool, is that it generally takes five or more times longer to evaluate simple expressions:

Regular Expression algorithm, 24 + 12, five seconds
Shunting-yard algorithm, 24 + 12, less than one second

What have I learned? Without digging deeply into the possibility that recursion is to blame,

Regular expressions tend to be more efficient at locating sub-expressions
Evaluating once by regular expressions tends to be slower than parsing once and evaluating once
Complexity of the regular expression algorithm was lower because of recursion (almost a recursion anti-pattern, because recursion is generally more complex)
Shunting-yard algorithm seems to have been time tested and well thought out

Regular expression algorithm took a long time to find large regular expressions like the sub-expression regular expression, shown here:

([\(]{1})((([-]{0,1}[\dABCDEF]+)([h]{1}))|(([-]{0,1}[\d]+([.]{0,1}[\d]+){0,1})([d]{0,1}
))|(([-]{0,1}[01234567]+)([o]{1}))|(([-]{0,1}[01]+)([b]{1}))|((\*\*|[+-/*\(\)^%&|])
{1})|([-]{0,1}(e|pi|\$){1})|(([-]{0,1}([bdho]|cos|sin|tan){1})([\(]{1})((([-]{0,1}[\dAB
CDEF]+)([h]{1}))|(([-]{0,1}[\d]+([.]{0,1}[\d]+){0,1})([d]{0,1}))|(([-]{0,1}[01234567]+)
([o]{1}))|(([-]{0,1}[01]+)([b]{1}))|((\*\*|[+-/*\(\)^%&|]){1})|([-]{0,1}([bdho]{1}|
cos|sin|tan){1})|([-]{0,1}(e|pi|\$){1}))+([\)]{1})))+([\)]{1})

Nice, huh!

I have written four different calculators using Java and C#, and I have found that both languages handle both algorithms equally well. I have found several different ways to do the Shunting-yard algorithm that are satisfactory, and one way to do the Regular Expression algorithm well, but found it to be significantly slow. So what's next? I think I'm going to try using something like Shunting-yard to do prefix based calculations. Whether there is already a name for that, I don't know. Finding the most clean and minimal method of using the Shunting-yard algorithm is still beyond me. With my second Shunting-yard calculator (first C# calculator), I found that creating a class for each token worked out well, but with the Regular Expression calculator I found that I really didn't need to keep track of tokens, I just injected solutions to sub-expressions into the original expression and re-evaluated. Cheers.

Software Science

22 September 2009

The Programmer's Calculator

Labels

Blog Archive