Expression evaluation

In TRM, the values in the registers (including the PC) and in memory uniquely determine the state of the computer. So it stands to reason that printing registers and scanning memory should, in principle, let us debug every problem. For ease of use, though, we also want the simple debugger to compute expressions involving registers and memory for us, which is why you need to add expression evaluation functionality to the simple debugger. For simplicity, let's consider the evaluation of mathematical expressions first.

Mathematical expression evaluation

You are given an expression in the form of a string:

"5 + 4 * 3 / 2 - 1"

How do you find its value? Expression evaluation is such a classic problem that there are many ways to solve it. Weighing the knowledge required against the difficulty, we use the following approach here:

  1. First, identify all the units (tokens) in the expression.
  2. Then, recursively evaluate the expression according to its inductive definition.

Lexical Analysis

"Lexical analysis" is a fancy word for doing the first thing above, "identifying all the units in an expression". A "unit" in this case is a substring with its own meaning, formally known as a token. Specifically, we need to identify the tokens 5, +, 4, *, 3, /, 2, -, and 1 in the above expression. You might think that's a pretty simple thing to do, but consider the following expression.

"0x80100000+   ($a0 +5)*4 - *(  $t1 + 8) + number"

It contains many more features, such as hexadecimal integers (0x80100000), parentheses, register accesses ($a0), pointer dereferencing (the second *), and variable accesses (number). In fact, such complex expressions are often used during debugging, and you need to recognize tokens correctly even when they are separated by a variable number of spaces (zero or more). Of course, you can still do all this manually (if you enjoy challenging work), but a quicker and easier way is to use regular expressions. Regular expressions make it easy to match complex patterns and are a must-have skill for programmers. If you've never encountered regular expressions before, please STFW. For this experiment, you only need to understand some basics of regular expressions (such as metacharacters).

Once you've learned how to use simple regular expressions, you can start thinking about how to use them to recognize tokens. Let's start with a simple case: arithmetic expressions, where only the following token types are allowed in the expression to be evaluated:

  • decimal integers
  • +, -, *, /
  • (, )
  • runs of spaces (one or more)

First we need to write rules, using regular expressions, to recognize each of these token types. In the framework code, a rule is a pair (implemented as a C struct) consisting of a regular expression and a token type. The framework code already provides the rules for + and for runs of spaces. The token type for a run of spaces is TK_NOTYPE, since spaces take no part in evaluation and can be discarded once recognized; the token type for + is simply '+'. In fact, a token type is just an integer; all that matters is that different token types are encoded as different integers. There is also a rule in the framework code that recognizes the double equals sign, but we can ignore it for now.
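As a concrete sketch, the rules table has roughly the following shape. The TK_NUM rule and its name are an illustrative addition of ours, a hint of what your own rules might look like; note how + must be escaped because it is a regex metacharacter:

static struct rule {
  const char *regex;   // the pattern to match
  int token_type;      // the integer encoding of this token type
} rules[] = {
  {" +", TK_NOTYPE},   // runs of spaces, discarded once recognized
  {"\\+", '+'},        // escaped: '+' is a regex metacharacter
  {"==", TK_EQ},       // the double-equals rule mentioned above
  {"[0-9]+", TK_NUM},  // decimal integers -- our illustrative addition
};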

These rules are compiled by init_regex() during the initialization of the simple debugger into internal data structures for pattern matching, which the library functions then use over and over again; you don't have to worry about how they are organized. However, if the compilation of a regular expression fails, NEMU will trigger an assertion failure, in which case you should check that the rules you've written conform to regular expression syntax.

Given an expression to be evaluated, we first need to recognize the tokens in it; this is the job of the make_token() function. It works in a very straightforward way: it uses the position variable to track how far it has processed, and tries to match the string at the current position against all the rules in order. When a rule matches successfully and the matched substring starts exactly at position, a token has been recognized, and the Log() macro outputs a message saying so. All you need to do is save the information of the recognized token (with the exception of space strings); we use the Token structure to save token information:

typedef struct token {
  int type;
  char str[32];
} Token;

The type member records the type of the token. For most tokens, e.g. +, -, *, /, recording the type alone is sufficient, but not for all of them: if we only recorded the type of a decimal integer token, we would have no idea at evaluation time which integer it was. For such tokens we also record the corresponding substring, and the str member is used for that. Note that str has a finite length, so when you realize the buffer is about to overflow, you should handle it accordingly (think about it: how would you handle it?); otherwise it will cause hard-to-find bugs. The tokens array stores the recognized tokens in order, and nr_token indicates how many have been recognized.
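For instance, the recording step inside make_token()'s match loop might be sketched as follows; the names substr_start and substr_len are assumptions here, so adapt them to what the framework actually provides around the Log() call:

switch (rules[i].token_type) {
  case TK_NOTYPE: break;   /* spaces: recognized but not recorded */
  default:
    /* KISS: fail loudly instead of silently truncating an overlong token */
    assert(substr_len < (int)sizeof(tokens[nr_token].str));
    tokens[nr_token].type = rules[i].token_type;
    strncpy(tokens[nr_token].str, substr_start, substr_len);
    tokens[nr_token].str[substr_len] = '\0';
    nr_token++;
    break;
}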

If none of the rules matches at the current position, recognition fails, and the framework code outputs the current position (when the expression is too long and wraps across terminal lines, the ^ marker may not point at the right place; it is better to locate the offending token by the printed value of position). This is usually the result of an ill-formed expression; make_token() returns false to indicate that lexical analysis failed.

Implementing lexical analysis of arithmetic expressions

You need to accomplish the following.

  • Add rules for the various token types in an arithmetic expression. You need to be aware of the presence of escape characters in C strings and the function of metacharacters in regular expressions.
  • After successfully recognizing the token, record the information about the token into the tokens array in order.

Debugging axioms

  • The machine is always right.
    • Corollary: If the program does not produce the desired output, it is the programmer's fault.
  • Every line of untested code is always wrong.
    • Corollary: Mistakes are likely to appear in the "must-be-correct" code.

These two axioms mean: it's no use complaining, accept that the code is buggy, and be patient with debugging.

Jyy used to present them as facts. In fact, countless programmers (including seniors at your school) have proven them correct over and over again in practice, so here we present them as axioms.

How to debug

  • Don't use "gaze debugging"; think about how to use the right tools and methods to help with debugging.
    • In your programming class you could stare at a few dozen lines of code and simulate their execution in your brain, just like NEMU does; but once the program gets bigger, you'll soon give up: your brain can't simulate a huge state machine.
    • We learn about computers to understand how they work, not to work mechanically like a computer ourselves.
  • Use assert() to set up checkpoints that intercept unintended situations.
    • For example, assert(p != NULL) can block segmentation faults caused by dereferencing a null pointer.
  • Use printf() to see how the program is executing, combined with an understanding of its expected behavior (note the line breaks in the strings).
    • A printf() that outputs an arbitrary message can check code reachability: the message is output if and only if the corresponding block of code is executed.
    • A printf() that outputs the value of a variable can check how and why that variable changed.
  • Use GDB to observe arbitrary state and behavior of a program.
    • Print variables, breakpoints, watchpoints, function call stacks...

If the above reads like news to you, it means you didn't get the training you needed in your programming class.

Why does the output of printf() need line breaks?

What might happen if there were no line breaks? You can try it out in code, think about why, and then STFW to compare your ideas.

The Golden Rule of System Design -- The KISS Rule

KISS here is an abbreviation of "Keep It Simple, Stupid"; in plain terms: don't strive for absolute perfection from the very beginning.

You've learned the basics of programming, which means you've learned to write programs, but that doesn't mean you're ready for PA just yet: in the real world we need systems that work, not little programs that compute factorials. NEMU is a small system with more than 3,000 lines of code (excluding blank lines). As PA progresses, the code will grow larger and larger, the interactions between modules will become more and more complex, and maintaining the project will get very difficult: a seemingly silly bug may take days to track down. At that point, a system that works is king; anything that doesn't function is worthless, and pursuing perfection in everything only makes the code harder to maintain.

The only thing that can save you from the chaos of bugs is the KISS rule: go from easy to hard, advance step by step, do one thing at a time, and do fewer irrelevant things. If you're not sure what this means, take the str buffer overflow problem mentioned above as an example. The KISS rule tells you to just write assert(0) there: even though it doesn't handle the problem "properly", it doesn't affect the correctness of the core functionality of expression evaluation. If you recall the debugging axioms, you'll see that the two are related: the second axiom tells you that untested code is always wrong. Rather than writing a lot of such "wrong" code all at once, use assert(0) to keep these "mistakes" to a minimum.

If you interpret the KISS rule in the context of software engineering, it emphasizes the importance of unit tests: write a function, test it, then write the next function and test that too... A good way to test is to use assertions, and reg_test() is an example. Learning to use assertions well will benefit both your testing and your debugging.
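For example, a tiny self-checking test in the same spirit might look like the sketch below; make_token()'s exact interface and the TK_NUM name are assumptions, so match them to your code:

static void test_tokenizer() {
  assert(make_token("1 + 2") == true);
  assert(nr_token == 3);
  assert(tokens[0].type == TK_NUM);
  assert(tokens[1].type == '+');
}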

The KISS rule is widely used not only in computing but in many other fields as well. Here is an article with many examples; we highly recommend reading it to appreciate the importance of the KISS rule.

Recursive evaluation

Once the tokens in the expression have been recognized, we can proceed with the evaluation. Note that we are now working with an array of tokens, which we will refer to as a "token expression" for convenience. For example, if the expression to be evaluated is

"4 +3*(2- 1)"

then its token expression is

+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| NUM | '+' | NUM | '*' | '(' | NUM | '-' | NUM | ')' |
| "4" |     | "3" |     |     | "2" |     | "1" |     |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+

Thanks to the inductive nature of the definition of expressions, we can easily evaluate them using recursion. First we give the inductive definition of arithmetic expressions:

<expr> ::= <number>    # a number is an expression
  | "(" <expr> ")"     # Adding parentheses to both sides of an expression is also an expression
  | <expr> "+" <expr>  # Adding two expressions is also an expression
  | <expr> "-" <expr>  # Next you get it all!
  | <expr> "*" <expr>
  | <expr> "/" <expr>

The above notation is the well-known BNF, which every formal programming-language tutorial uses to define the language's syntax.

Based on the BNF definition above, a solution suggests itself: since a long expression is composed of shorter expressions, we evaluate the short expressions first and then the long one. This natural solution is an application of divide and conquer, which is easy to understand even if you've never heard of the fancy term. And to implement this solution, recursion is the way to go.

To refer to a sub-expression within a token expression, we can use two integers p and q to indicate where the sub-expression starts and where it ends. This makes it easy to write out the framework of the evaluation function:

eval(p, q) {
  if (p > q) {
    /* Bad expression */
  }
  else if (p == q) {
    /* Single token.
     * For now this token should be a number.
     * Return the value of the number.
     */
  }
  else if (check_parentheses(p, q) == true) {
    /* The expression is surrounded by a matched pair of parentheses.
     * If that is the case, just throw away the parentheses.
     */
    return eval(p + 1, q - 1);
  }
  else {
    /* We should do more things here. */
  }
}

The check_parentheses() function serves two purposes: to determine whether an expression is surrounded by a matched pair of parentheses, and to check whether the parentheses in the expression match at all. If they don't match, the expression is ungrammatical and there is no need to continue evaluating it. Here are some examples of what check_parentheses() should return:

"(2 - 1)"             // true
"(4 + 3 * (2 - 1))"   // true
"4 + 3 * (2 - 1)"     // false, the whole expression is not surrounded by a matched
                      // pair of parentheses
"(4 + 3)) * ((2 - 1)" // false, bad expression
"(4 + 3) * (2 - 1)"   // false, the leftmost '(' and the rightmost ')' are not matched

As for how to check if the left and right parentheses match, let's leave that as a programming assignment for you to think about!

The above framework covers the first two BNF cases of arithmetic expressions; now let's consider the remaining cases (i.e., the last else in the pseudo-code above). One question arises: given a long expression that is not surrounded by a matched pair of parentheses, how do we properly split it into two sub-expressions? We define the "main operator" as the operator performed at the last step when the expression is evaluated by hand; it indicates the type of the expression (e.g., when the last step of an expression is a subtraction, it is essentially a subtraction expression). Splitting a long expression correctly means finding its main operator. We'll explore this with the example above.

"4 + 3 * ( 2 - 1 )"
/*********************/
case 1:
    "+"
   /   \
"4"     "3 * ( 2 - 1 )"


case 2:
        "*"
       /   \
"4 + 3"     "( 2 - 1 )"


case 3:
              "-"
             /   \
"4 + 3 * ( 2"     "1 )"

The three possible splits are listed above. Note that we cannot split at a token that is not an operator; otherwise the parts would not be legal expressions. Based on the definition of the main operator, it is easy to see that only the first split is correct. This matches our manual evaluation: we first compute 4 and 3 * ( 2 - 1 ), and then add their results. The second split violates operator precedence and would make the addition happen before the multiplication. The third split destroys the balance of the parentheses. So the results of the second and third splits are not legal expressions.

With the simple example above, we can summarize how to find the main operator in a token expression:

  • A token that is not an operator cannot be the main operator.
  • A token inside a pair of parentheses cannot be the main operator. (Note that at this point the whole expression cannot be enclosed in parentheses, since that case was already handled by the check_parentheses() branch above.)
  • The main operator has the lowest precedence in the expression, because it is the last operator to be performed.
  • When several operators tie for the lowest precedence, the main operator is the one applied last according to associativity. For example, in 1 + 2 + 3 the main operator is the rightmost +.

To find the main operator, just loop through the token expression and uniquely identify the main operator as described above.
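The loop might look something like the sketch below; is_op() and prec() are hypothetical helpers (prec() gives + and - a lower value than * and /), none of which is dictated by the framework:

static bool is_op(int t) { return t == '+' || t == '-' || t == '*' || t == '/'; }
static int prec(int t) { return (t == '+' || t == '-') ? 1 : 2; }

/* return the position of the main operator in tokens[p..q], or -1 if none */
static int find_main_op(int p, int q) {
  int depth = 0, op = -1, lowest = 3;
  for (int i = p; i <= q; i++) {
    int t = tokens[i].type;
    if (t == '(') { depth++; continue; }
    if (t == ')') { depth--; continue; }
    if (depth > 0 || !is_op(t)) continue;  /* skip parenthesized tokens and non-operators */
    if (prec(t) <= lowest) {               /* '<=': on a tie, the rightmost operator wins */
      lowest = prec(t);
      op = i;
    }
  }
  return op;
}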

Once you've found the correct main operator, it's a simple matter of recursively evaluating the two sub-expressions it splits off, and then applying the main operator to their values. The complete evaluation function is then as follows:

eval(p, q) {
  if (p > q) {
    /* Bad expression */
  }
  else if (p == q) {
    /* Single token.
     * For now this token should be a number.
     * Return the value of the number.
     */
  }
  else if (check_parentheses(p, q) == true) {
    /* The expression is surrounded by a matched pair of parentheses.
     * If that is the case, just throw away the parentheses.
     */
    return eval(p + 1, q - 1);
  }
  else {
    op = the position of main operator in the token expression;
    val1 = eval(p, op - 1);
    val2 = eval(op + 1, q);

    switch (op_type) {
      case '+': return val1 + val2;
      case '-': /* ... */
      case '*': /* ... */
      case '/': /* ... */
      default: assert(0);
    }
  }
}

It is important to note that the framework above does no error handling: when an expression is found to be illegal during evaluation, an error flag should be returned to the caller, telling it that "the result of this evaluation is invalid". For example, check_parentheses() returns false for both (4 + 3)) * ((2 - 1) and (4 + 3) * (2 - 1), but for different reasons: the first is an illegal expression that can never evaluate successfully, while the second is a legal expression that evaluates just fine; it is simply not of the BNF form "(" <expr> ")" and must be handled via the main operator instead. You need to find a way to distinguish these two cases. Of course, you may also terminate the program with assert(0) whenever you find an illegal expression; in that case, however, you'll need to be very careful about how you use the expression evaluation function later on.
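One common convention, shown in the sketch below under our own assumptions (the framework does not dictate this signature), is to thread a success flag through the recursion so that any sub-expression can invalidate the whole result:

uint32_t eval(int p, int q, bool *ok) {
  if (p > q) {
    *ok = false;   /* bad expression: poison the result */
    return 0;
  }
  /* ... the other cases simply pass `ok` down the recursion ... */
  switch (op_type) {
    case '/':
      if (val2 == 0) {   /* divide-by-0: the result is also invalid */
        *ok = false;
        return 0;
      }
      return val1 / val2;
    /* ... */
  }
}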

Finally, for the sake of consistency, we consider all results to be of type uint32_t.

Implementing recursive evaluation of arithmetic expressions

Since ICS is not an algorithms class, we have already laid out the idea and framework of recursive evaluation. All you need to do is understand the idea, and fill in the framework accordingly. The p command is easy to implement once you have implemented expression evaluation.

Evaluating Arithmetic Expressions with Negative Numbers (Optional)

In the above implementation, we did not consider the problem of negative numbers, such as

    "1 + -1"
    "--1"    /* We don't implement decrement op, here it should be interpreted as -(-1) = 1 */.

These will be treated as illegal expressions. To support negative numbers, you need to consider two problems:

  • The negative sign and the minus sign are both -; how do you tell them apart?
  • The negative sign is a unary operator; what do you need to be aware of when splitting the expression?

You can choose not to implement negative numbers, but you'll soon be facing similar problems.

Peeking into the compiler from expression evaluation

As you know from your programming classes, compilation is the process of converting a high-level language into machine language. But have you ever wondered how the machine reads your code? Thinking back to your implementation of expression evaluation, did you learn anything new?

In fact, lexical analysis is the first step in the compiler's process of compiling your source code. The compiler needs to recognize tokens from your source code, and this can also be done with regular expressions, but the types of tokens are more varied and complex. This also explains why you can insert any number of whitespace characters (including spaces, tabs, and line breaks) into your source code without affecting the semantics of your program; you can also write all your source code in a single line and the compilation will still pass.

An interesting application related to lexical analysis is syntax highlighting. In your programming classes, you probably never thought about writing a syntax highlighter yourself. The truth is, this seemingly magical stuff isn't that complicated, and you have the ability to do it now: treat the source code as a string fed into the highlighter, and after recognizing each token, simply output its content again in a color chosen by the token's type. If you want to output the highlighted code to the terminal, you can use ANSI color escape codes.
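For example, the standard ANSI escape sequences below print a token in red and then reset the color; the TK_NUM check is a hypothetical placeholder for your own token types:

/* numbers in red, everything else in the default color */
if (tokens[i].type == TK_NUM) printf("\033[31m%s\033[0m", tokens[i].str);
else printf("%s", tokens[i].str);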

In the recursive evaluation of an expression, the code does two things: first it analyzes the structure of the expression from its tokens (which of the BNF cases it belongs to), then it computes the value. Both have counterparts in a compiler: syntax analysis is like analyzing the structure of an expression, except that the compiler analyzes the structure of the whole program, i.e. which parts are functions, which are statements, and so on. Of course, the structure of a program is more complex than that of an expression, so compilers generally use a standard framework for analyzing program structure; understanding that framework requires more background knowledge than we can provide here. If you are interested, you can read the BNF for C syntax.

The counterpart to the final evaluation of an expression is code generation in the compiler. ICS theory classes have a special section on the relationship between C code and assembly instructions, so even if you don't know exactly how code is generated, you can still understand the relationship. This is because C code is intrinsically linked to assembly code, and highly skilled C programmers can even think in terms of switching between C and assembly code. If you look deeper into the code generation process, you can easily guess that it's done recursively: for example, to generate the code for a function, you first generate the code for each of its statements, and then link them together in some way.

The reason we peek into the compiler's makeup through the implementation of expression evaluation is to reinforce an idea: studying automobile manufacturing isn't just about learning to drive a car, it's about learning how to design an engine. We also highly recommend that you take a "Principles of Compilation" course in the future for an in-depth study of "how to design an engine".

How to test your code

You're going to use your own implementation of expression evaluation to help with subsequent debugging, which means the programming-class days of "just test the code, turn it in, and forget about it" are gone. Tests require test cases, and the more tests you pass, the more confidence you'll have in your code. But if you had to design the test cases yourself, a dozen of them would already bore you. Is there a way to generate test cases automatically?

A common method is randomized testing. First we need to think about how to randomly generate a legal expression. In fact, generating expressions is much easier than evaluating them. Using the BNF above once more, it is easy to write a framework for generating expressions:

void gen_rand_expr() {
  switch (choose(3)) {
    case 0: gen_num(); break;
    case 1: gen('('); gen_rand_expr(); gen(')'); break;
    default: gen_rand_expr(); gen_rand_op(); gen_rand_expr(); break;
  }
}

You should be able to see at a glance how the above code works. uint32_t choose(uint32_t n) is a very simple but important function: it returns a random number less than n, and almost every random decision in the generator is made through it.
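choose() itself can be a one-liner; here is a minimal sketch using the C library's rand() (the slight modulo bias does not matter for testing purposes):

#include <stdint.h>
#include <stdlib.h>

static uint32_t choose(uint32_t n) {
  return rand() % n;   /* a random number in [0, n) */
}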

With these random expressions as test input, how do we know whether the output is correct? Calculating them by hand would be far too much trouble, but if we could generate each expression together with its result, we would get OJ-style test cases! Our implementation of expression evaluation in NEMU is somewhat simplified, though, so we need a "calculator" that satisfies the following conditions:

  • Only unsigned operations are performed
  • Data width is 32 bits
  • Overflow is not handled

Hey, hey, hey! What if we stuff these expressions into the source file of the following C program:

#include <stdio.h>
int main() {
  unsigned result = ???; // replace ??? with expression
  printf("%u", result);
  return 0;
}

Then compile it with gcc and run it so that it prints the result of the expression. Isn't that exactly the "calculator" we want?

It is indeed possible! We have already prepared the framework code for this expression generator (in nemu/tools/gen-expr/gen-expr.c). You need to implement the void gen_rand_expr() function, which writes the randomly generated expression into the buffer buf. The code in main() calls your gen_rand_expr(), splices the random expression in buf into the C program shown above, then compiles and runs the resulting program, using library functions such as system() and popen(). Finally, the framework code outputs the program's printout together with the randomly generated expression, thus producing test cases.
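The compile-and-run step might take roughly the following shape; the file names here are made up for illustration, and the man pages of system() and popen() describe the exact behavior:

/* write the filled-in C source, then compile it, run it and read its output */
FILE *fp = fopen("/tmp/.code.c", "w");
assert(fp != NULL);
fputs(code_buf, fp);   /* code_buf: the C program with the expression spliced in */
fclose(fp);

if (system("gcc /tmp/.code.c -o /tmp/.expr") != 0) continue;   /* regenerate on failure */

fp = popen("/tmp/.expr", "r");   /* run the program and capture its stdout */
assert(fp != NULL);
unsigned result;
fscanf(fp, "%u", &result);
pclose(fp);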

How does an expression generator get a C program's output?

There are no comments in this part of the code; the sharp among you may immediately realize that this is an RTFM trap! Reading the manual to understand the behavior of an API is a basic skill for programmers. If you feel like you've spent the last year of programming classes for nothing, start working on this skill now.

But when it comes to implementation, you'll soon realize that you'll have to deal with a few other details:

  • How do you ensure that expressions perform unsigned operations only?
  • How to insert random spaces?
  • How to generate long expressions without overflowing the buf?
  • How to filter out expressions with divide-by-0 behavior during evaluation?

Most of these questions are related to C, so consider them as another C programming exercise.

Why use unsigned types? (Recommended to think about this in the 2nd round)

We stipulated that all operations in our expressions are unsigned. Do you know why? What could go wrong if the operations were signed?

Exact behavior of division by 0

If a generated expression contains divide-by-0 behavior, what will the behavior of your expression generator be?

Filtering Expressions for Divide-by-0 Behavior

At first glance this looks like a difficult problem, because the framework code is only responsible for generating the expression, while detecting divide-by-0 behavior requires at least evaluating it. Combine the answers to the two blue-box questions above (provided you understand them well enough) and you'll find a solution, and it's not the only one!

Implementing an expression generator

Based on the above, implement the expression generator. Once implemented, it can be used to generate test cases for expression evaluation.

./gen-expr 10000 > input

10,000 test cases will be generated into the input file, one per line, in the following format:

<Result> <Expression>

Modify NEMU's main() function a bit so that it reads the test expressions from the input file, calls expr() on each of them directly, and compares the returned value with the expected result. You will also need to enlarge the tokens array to accommodate long expressions.
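A minimal sketch of such a test loop follows; the expr() signature with a success flag is an assumption, so match it to your own implementation:

FILE *fp = fopen("input", "r");
assert(fp != NULL);
char e[65536];
unsigned expected;
while (fscanf(fp, "%u %65535[^\n]", &expected, e) == 2) {
  bool ok = true;
  uint32_t got = expr(e, &ok);
  if (!ok || got != expected) {
    printf("FAIL: %s\n  expected %u, got %u\n", e, expected, got);
  }
}
fclose(fp);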

As your program passes more and more tests, you'll feel more and more confident in your code.

Tips

This is the end of PA1 Stage 2.