Infrastructure (2)

A tool for bug diagnostics - tracer

We already know that a program is a state machine, and as a program grows complex, so do the state machine's transitions. Setting aside the details of each individual state, it is already hard to fully understand how the state machine transitions between states.

It would be inefficient to use GDB to understand this process. To improve efficiency, we can use printf() to output some of the information we care about. The fact that we care about the details of how this state machine transitions between its states also means that we care about the execution of the program. In the field of software engineering, the information that records the execution of a program is called a trace. With the trace information, we can determine whether the program's execution is as expected, and thus diagnose bugs.

Traces of instruction execution - itrace

NEMU has implemented a simple trace function -- itrace (instruction trace), which can record every instruction executed by the client program. The implementation of itrace is very simple: the code records each instruction fetched by instr_fetch(), and then calls the disassembly function provided by the llvm project (implemented in nemu/src/utils/). itrace outputs the PC, the binary representation of the instruction, and the disassembly result. The framework code has this feature turned on by default, and instructions executed by the client program are logged to build/nemu-log.txt. By looking at this file, you can see how the client program is running.

NEMU can restrict when traces are output: you can manually specify when to output them, and even customize the conditions for outputting them. For how to do this, please RTFSC. Since the current behavior of the program is deterministic, multiple runs will yield the same result, which is helpful once the program has entered an error state.

For neat traces, we can also filter them with text processing tools such as grep, awk, sed, etc. So if you understand how to use shell commands for text processing, you can further improve your debugging efficiency.

Instruction ring buffer - iringbuf

Generally speaking, we only care about the trace shortly before the error occurs; when running large programs, the trace from much earlier in the run does not need to be inspected, or even output. A natural idea is to output only the most recent instructions executed right before the client program's error occurs (such as an out-of-bounds memory access).

It is not difficult to implement this feature; all we need to do is maintain a very simple data structure, a ring buffer. Specifically, each time an instruction is executed, the information about that instruction is written to the ring buffer; if the buffer is full, the oldest contents are overwritten. When the client program encounters an error, the instructions in the ring buffer are printed out for debugging. An example of the output is shown below, where --> marks the instruction that caused the error.

      0x80002b00: srli  a2, a2, 1                00 16 56 13
      0x80002b04: slli  a1, a1, 1                00 15 95 93
      0x80002b08: or    a0, s1, a5               00 f4 e5 33
      0x80002b0c: sub   s0, s0, a5               40 f4 04 33
      0x80002b10: jal   -112                     f9 1f f0 ef
      0x80002aa0: lui   a5, 524295               80 00 77 b7
      0x80002aa4: lw    a5, 2028(a5)             7e c7 a7 83
      0x80002aa8: addi  sp, sp, -32              fe 01 01 13
  --> 0x80002aac: sw    s2, 16(sp)               01 21 28 23
      0x80002b3c: ret                            00 00 80 67
      0x80002b14: add   s2, s2, a0               00 a9 09 33
      0x80002b18: bnez  s0, -40                  fc 04 1c e3
      0x80002af0: neg   a5, s0                   40 80 07 b3
      0x80002af4: and   a5, a5, s0               00 87 f7 b3
      0x80002af8: or    a2, s3, a5               00 f9 e6 33
      0x80002afc: or    a1, s4, a5               00 fa 65 b3
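The ring buffer behind this kind of output can be sketched as follows. This is only a sketch: the buffer size, entry length, and function names here are arbitrary choices for illustration, not part of NEMU's API.

```c
#include <stdio.h>
#include <string.h>
#include <assert.h>

// A minimal iringbuf sketch. Size and entry format are arbitrary.
#define IRINGBUF_SIZE 16
#define ENTRY_LEN     128

static char iringbuf[IRINGBUF_SIZE][ENTRY_LEN];
static int iringbuf_idx = 0;   // next slot to overwrite
static int iringbuf_full = 0;  // set once we have wrapped around

// Call once per executed instruction, e.g. with its disassembly text.
void iringbuf_write(const char *logtxt) {
  strncpy(iringbuf[iringbuf_idx], logtxt, ENTRY_LEN - 1);
  iringbuf[iringbuf_idx][ENTRY_LEN - 1] = '\0';
  iringbuf_idx = (iringbuf_idx + 1) % IRINGBUF_SIZE;
  if (iringbuf_idx == 0) iringbuf_full = 1;
}

// Call when the client program hits an error; the most recently
// written entry is the faulting instruction.
void iringbuf_dump(void) {
  int n = iringbuf_full ? IRINGBUF_SIZE : iringbuf_idx;
  int start = iringbuf_full ? iringbuf_idx : 0;
  for (int i = 0; i < n; i ++) {
    printf("%s %s\n", (i == n - 1) ? "-->" : "   ",
           iringbuf[(start + i) % IRINGBUF_SIZE]);
  }
}
```

In NEMU, iringbuf_write() would naturally be called from the place where each instruction's trace line is produced, and iringbuf_dump() from the error-handling path.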

Implement iringbuf

Based on the above, implement iringbuf in NEMU. You can design the output format according to your own preference. If you want to output the disassembly of an instruction, you can refer to the code for itrace. If you don't know what code to add where, you need to RTFSC.

Trace of memory accesses - mtrace

Memory accesses account for a large portion of program execution. If you have encountered memory-access-related errors (e.g. out-of-bounds accesses to physical memory), you will want to know the exact memory access behavior of the program, and then identify the incorrect accesses to help you diagnose the bug. In fact, we can easily track memory accesses and their results to obtain a memory trace.

Implement mtrace

This feature is so simple that you have probably already figured out how to implement it: just add logging in paddr_read() and paddr_write(). You can define the format of the mtrace output yourself.

However, unlike iringbuf, which only outputs once at the end, programs generally execute a large number of memory access instructions, which means that turning on mtrace generates a lot of output; it is best to turn mtrace off when you don't need it. Recall the itrace implementation: try adding the appropriate code to Kconfig and the associated files so that mtrace can be turned on and off via menuconfig. It is also possible to put conditions on the mtrace output; for example, you may only care about accesses to a certain region of memory. With such conditional controls, mtrace becomes much more flexible to use.
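The conditional-output idea can be sketched like this. Everything here is made up for illustration: the function name, the window bounds, and the output format are yours to choose; in NEMU the call would sit inside paddr_read()/paddr_write(), guarded by your Kconfig switch.

```c
#include <stdio.h>
#include <stdint.h>
#include <assert.h>

typedef uint32_t paddr_t;
typedef uint32_t word_t;

// Hypothetical trace condition: only log accesses inside a window of
// interest. Returns 1 if the access was logged, 0 if filtered out.
#define MTRACE_START 0x80000000u
#define MTRACE_END   0x80001000u

int mtrace_log(const char *type, paddr_t addr, int len, word_t data) {
  if (addr < MTRACE_START || addr >= MTRACE_END) return 0;
  printf("mtrace: %-5s addr = 0x%08x len = %d data = 0x%08x\n",
         type, addr, len, data);
  return 1;
}
```

paddr_read() would then call something like mtrace_log("read", addr, len, ret) just before returning, and paddr_write() the analogous "write" line.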

Trace of function calls - ftrace

Itrace and mtrace both trace the program from the underlying state machine perspective, but if we want to understand the semantic behavior of the program, itrace and mtrace can't help us. Therefore, we need a tool that traces the semantic behavior of a program.

The question is, which semantic unit should we choose? In programming class we learned that a program consists of a number of functions, a function consists of a number of statements, and a statement is compiled into a number of instructions; among these, the unit that most clearly carries the semantics of a program is the function. Imagine if we could design a tool, ftrace, to trace the function calls and returns during program execution. Wouldn't we be able to tell how the program works?

This is not difficult, because itrace already traces all the instructions executed by the program. To implement ftrace, all we need to care about are the function call and return instructions. We can record the destination address of a function call instruction, indicating that a function is about to be called; and we can record the current PC at a function return instruction, indicating that we are returning from the function containing that PC. This is easy to accomplish. But the destination address and the PC value still lack program-level semantics; they would be much easier to understand if we translated them into function names!

Given an address in a code segment, how do you know which function it is in? This is where the symbol table in the ELF file comes in. The symbol table is a section of the executable file that records compile-time information about the program, including variables and functions. In order to implement ftrace, we first need to know what information is recorded in the symbol table.

Taking the user program add in cpu-tests as an example, use the readelf command to view information about its ELF executable:

riscv64-linux-gnu-readelf -a add-riscv32-nemu.elf

You will see that the readelf command outputs a lot of information that is useful for understanding the structure of ELF, and we recommend that you read it carefully after class. For now, we only need to be concerned with the symbol table information listed below, which is found in the output.

Symbol table '.symtab' contains 28 entries:
   Num:    Value  Size Type    Bind   Vis      Ndx Name
     0: 00000000     0 NOTYPE  LOCAL  DEFAULT  UND
     1: 80000000     0 SECTION LOCAL  DEFAULT    1
     2: 80000108     0 SECTION LOCAL  DEFAULT    2
     3: 8000010c     0 SECTION LOCAL  DEFAULT    3
     4: 8000020c     0 SECTION LOCAL  DEFAULT    4
     5: 00000000     0 SECTION LOCAL  DEFAULT    5
     6: 00000000     0 FILE    LOCAL  DEFAULT  ABS add.c
     7: 00000000     0 FILE    LOCAL  DEFAULT  ABS trm.c
     8: 80000108     1 OBJECT  LOCAL  DEFAULT    2 mainargs
     9: 800000e8    32 FUNC    GLOBAL DEFAULT    1 _trm_init
    10: 80009000     0 NOTYPE  GLOBAL DEFAULT    4 _stack_pointer
    11: 80000108     0 NOTYPE  GLOBAL DEFAULT    1 _etext
    12: 80000000     0 NOTYPE  GLOBAL DEFAULT  ABS _pmem_start
    13: 8000022c     0 NOTYPE  GLOBAL DEFAULT    4 _bss_start
    14: 80000109     0 NOTYPE  GLOBAL DEFAULT    2 edata
    15: 80009000     0 NOTYPE  GLOBAL DEFAULT    4 _heap_start
    16: 80001000     0 NOTYPE  GLOBAL DEFAULT    4 _stack_top
    17: 80009000     0 NOTYPE  GLOBAL DEFAULT    4 end
    18: 80000010    24 FUNC    GLOBAL DEFAULT    1 check
    19: 80000108     0 NOTYPE  GLOBAL DEFAULT    1 etext
    20: 80000000     0 FUNC    GLOBAL DEFAULT    1 _start
    21: 00000000     0 NOTYPE  GLOBAL DEFAULT  ABS _entry_offset
    22: 80000028   180 FUNC    GLOBAL DEFAULT    1 main
    23: 80000109     0 NOTYPE  GLOBAL DEFAULT    2 _data
    24: 8000010c   256 OBJECT  GLOBAL DEFAULT    3 ans
    25: 80009000     0 NOTYPE  GLOBAL DEFAULT    4 _end
    26: 800000dc    12 FUNC    GLOBAL DEFAULT    1 halt
    27: 8000020c    32 OBJECT  GLOBAL DEFAULT    4 test_data

Each row represents a table entry, and each column lists some attributes of the table entry. Now we only need to care about the table entries whose Type attribute is FUNC. If you look at the Name attribute, you'll see that these table entries correspond to the functions defined in the program. The corresponding Value attribute is their starting address (you can compare this with the disassembly result), and the corresponding Size attribute gives the size of the function.
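In C, each of these rows corresponds to an Elf32_Sym structure declared in the system header elf.h. The column-to-field mapping below is standard ELF; the concrete numbers in this sketch are only stand-ins imitating the `main` line above.

```c
#include <elf.h>     // Elf32_Sym, STT_FUNC, ELF32_ST_TYPE, ... (glibc)
#include <assert.h>

// readelf column -> Elf32_Sym field
//   Value        -> st_value  (start address)
//   Size         -> st_size   (size in bytes)
//   Type / Bind  -> ELF32_ST_TYPE / ELF32_ST_BIND of st_info
//   Name         -> st_name   (an offset, not a string)
static const Elf32_Sym sym_main = {
  .st_name  = 42,                                 // stand-in strtab offset
  .st_value = 0x80000028,                         // Value column
  .st_size  = 180,                                // Size column
  .st_info  = ELF32_ST_INFO(STB_GLOBAL, STT_FUNC) // Bind + Type columns
};

// The check ftrace needs: is this entry a function symbol?
int is_function_symbol(const Elf32_Sym *sym) {
  return ELF32_ST_TYPE(sym->st_info) == STT_FUNC;
}
```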

Disappearing symbols

We defined the macro NR_DATA in am-kernels/tests/cpu-tests/tests/add.c, and the add() function defines the local variable c and the parameters a and b. But you won't find them in the symbol table. Why is that? Think about it: what exactly is a symbol?

Oh, with the symbol table we can establish the mapping between function names and their addresses! But the readelf output is already parsed for us; in the file itself, the Name attribute of a symbol table entry holds the offset of the name string within the string table. To find the string table, let's look at the Section Headers in the readelf output.

Section Headers:
  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            00000000 000000 000000 00      0   0  0
  [ 1] .text             PROGBITS        80000000 001000 000108 00  AX  0   0  4
  [ 2] .sdata2.mainargs  PROGBITS        80000108 001108 000001 00   A  0   0  4
  [ 3] .data.ans         PROGBITS        8000010c 00110c 000100 00  WA  0   0  4
  [ 4] .data.test_data   PROGBITS        8000020c 00120c 000020 00  WA  0   0  4
  [ 5] .comment          PROGBITS        00000000 00122c 00001c 01  MS  0   0  1
  [ 6] .symtab           SYMTAB          00000000 001248 0001c0 10      7   9  4
  [ 7] .strtab           STRTAB          00000000 001408 00009b 00      0   0  1
  [ 8] .shstrtab         STRTAB          00000000 0014a3 000055 00      0   0  1

As you can see from the information in Section Headers, the string table is stored at the ELF file location with offset 0x1408 from the beginning. We can output the hexadecimal form of the ELF file directly with the following command:

hd add-riscv32-nemu.elf

Looking at the output of this command around offset 0x1408, we can see that the string table is nothing more than a string of identifiers stitched together. Now we can clarify the relationship between the symbol table and the string table:

Section Headers:
  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
  [ 7] .strtab           STRTAB          00000000 001408 00009b 00      0   0  1
                                                  (the string table starts at
                                                   file offset 0x1408)

00001400                           00 61 64 64 2e 63 00 74  |........add.c.t|
00001410  72 6d 2e 63 00 6d 61 69  6e 61 72 67 73 00 5f 74  |rm.c.mainargs._t|
00001420  72 6d 5f 69 6e 69 74 00                           |rm_init.|

Symbol table '.symtab' contains 10 entries:
   Num:    Value  Size Type    Bind   Vis      Ndx Name
     7: 00000000     0 FILE    LOCAL  DEFAULT  ABS 7   -> 0x1408 +  7: "trm.c"
     8: 80000108     1 OBJECT  LOCAL  DEFAULT    2 13  -> 0x1408 + 13: "mainargs"
     9: 800000e8    32 FUNC    GLOBAL DEFAULT    1 22  -> 0x1408 + 22: "_trm_init"

Find "Hello World!"

Write a Hello World program under Linux, compile it and find the string table of the ELF file by the above method. Where do you find the "Hello World!" string in the string table? Why is it there?

Now we can translate a given address into a function name. Since the address ranges of functions are disjoint, we can check whether the given address falls in some interval [Value, Value + Size) of a symbol table entry whose Type attribute is FUNC. If it does, find the corresponding string in the string table according to the Name attribute of that entry and return it as the function name; otherwise return the string "???". If you do see "???" in the output, it is probably due to a bug in your implementation, and you should check your implementation again.
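A sketch of this lookup is shown below. The tables here are tiny hand-built stand-ins for the real .symtab and .strtab contents you would load during ftrace initialization.

```c
#include <elf.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <assert.h>

// Hand-built stand-ins for the loaded symbol and string tables.
static const char strtab[] = "\0check\0main\0";
static const Elf32_Sym symtab[] = {
  { .st_name = 1, .st_value = 0x80000010, .st_size = 24,
    .st_info = ELF32_ST_INFO(STB_GLOBAL, STT_FUNC) },   // "check"
  { .st_name = 7, .st_value = 0x80000028, .st_size = 180,
    .st_info = ELF32_ST_INFO(STB_GLOBAL, STT_FUNC) },   // "main"
};
#define NR_SYM (sizeof(symtab) / sizeof(symtab[0]))

// Translate an address to the name of the function containing it.
const char *addr2name(uint32_t addr) {
  for (size_t i = 0; i < NR_SYM; i ++) {
    if (ELF32_ST_TYPE(symtab[i].st_info) != STT_FUNC) continue;
    // does addr fall in [Value, Value + Size)?
    if (addr >= symtab[i].st_value &&
        addr <  symtab[i].st_value + symtab[i].st_size) {
      return strtab + symtab[i].st_name;  // Name is a strtab offset
    }
  }
  return "???";
}
```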

Taking recursion from cpu-tests as an example, some sample output from ftrace is shown below (for reference only; different compiler versions may produce different output, and you can try to understand it using the disassembly results).

0x8000000c: call [_trm_init@0x80000260]
0x80000270:   call [main@0x800001d4]
0x800001f8:     call [f0@0x80000010]
0x8000016c:       call [f2@0x800000a4]
0x800000e8:         call [f1@0x8000005c]
0x8000016c:           call [f2@0x800000a4]
0x800000e8:             call [f1@0x8000005c]
0x8000016c:               call [f2@0x800000a4]
0x800000e8:                 call [f1@0x8000005c]
0x8000016c:                   call [f2@0x800000a4]
0x800000e8:                     call [f1@0x8000005c]
0x8000016c:                       call [f2@0x800000a4]
0x800000e8:                         call [f1@0x8000005c]
0x80000058:                         ret  [f0]              # note(2)
0x800000fc:                       ret  [f2]                # note(1)
0x80000180:                       call [f2@0x800000a4]
0x800000e8:                         call [f1@0x8000005c]
0x80000058:                         ret  [f0]
0x800000fc:                       ret  [f2]
0x800001b0:                     ret  [f3]                  # note (3)
0x800000fc:                   ret  [f2]
0x80000180:                   call [f2@0x800000a4]
0x800000e8:                     call [f1@0x8000005c]
0x8000016c:                       call [f2@0x800000a4]
0x800000e8:                         call [f1@0x8000005c]
0x80000058:                         ret  [f0]

Implement ftrace

Based on the above, implement ftrace in NEMU. You can decide the format of the output. You need to pay attention to the following:

  • You need to pass an ELF file to NEMU; you can do this by adding the relevant code to parse_args()
  • You may need to read the symbol and string tables from the ELF file when you initialize ftrace, for your subsequent use.
  • For information on how to parse ELF files, see man 5 elf.
  • If you choose riscv32, you also need to consider how to correctly identify function call instructions and function return instructions among the jal and jalr instructions
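For reference, one way to classify the two instructions is sketched below. This covers only the uncompressed encodings (RVC would need extra cases), and follows the RISC-V calling convention: a jal/jalr that writes ra counts as a call, and jalr x0, 0(ra) (the ret pseudo-instruction) counts as a return; funct3 and the immediate are ignored here for brevity.

```c
#include <stdint.h>
#include <assert.h>

#define OPCODE(i) ((i) & 0x7f)
#define RD(i)     (((i) >> 7)  & 0x1f)
#define RS1(i)    (((i) >> 15) & 0x1f)

// jal  rd, imm      -> opcode 1101111 (0x6f)
// jalr rd, imm(rs1) -> opcode 1100111 (0x67)
// A call writes the return address to ra (x1); `ret` is jalr x0, 0(ra).
int is_call(uint32_t instr) {
  return (OPCODE(instr) == 0x6f || OPCODE(instr) == 0x67) && RD(instr) == 1;
}

int is_ret(uint32_t instr) {
  return OPCODE(instr) == 0x67 && RD(instr) == 0 && RS1(instr) == 1;
}
```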

Mismatched function calls and returns

If you look closely at the sample output of recursion above, you'll notice something interesting. The call and ret at note (1) are matched: the call calls f2, and the corresponding ret returns from f2. But the pair of call and ret marked by note (2) is different, in the sense that the call calls f1, but the ret returns from f0. A similar phenomenon occurs with the pair marked by note (3), where the call calls f1, but the ret returns from f3.

Try to analyze why this phenomenon occurs together with the disassembly results.

Redundant symbol tables

Write a Hello World program under Linux, and then use the strip command to strip the symbol table in the executable:

gcc -o hello hello.c
strip -s hello

If you look at the hello program with readelf, you'll see that the symbol table has been stripped. So, can the hello program still run successfully?

There is also a symbol table in the object file, which we can strip as well.

gcc -c hello.c
strip -s hello.o

Look at hello.o with readelf, and you'll see that its symbol table has also been stripped. Now try linking hello.o:

gcc -o hello hello.o

What problems did you find? Try to compare the two situations and analyze the reasons behind it.

Trace and performance optimization

We asked you to implement these trace tools in NEMU as infrastructure to help you debug. In fact, in addition to helping you understand how a program works, a tracer can also guide developers in optimizing programs and systems. For example:

  • Based on ftrace, we can further analyze the arguments passed to memcpy(), such as whether dest and src are aligned and whether the copied length is long or short, and then optimize the implementation of memcpy() for the most frequent combinations.
  • We can count the number of calls to each function based on ftrace, and optimize the most frequently called functions, which can significantly improve the performance of the program.
  • We can filter out the executed branch instructions based on itrace and use them as input to a branch predictor (a performance-enhancing component of modern processors) to tune the branch predictor's implementation, thus improving processor performance.
  • The sequence of memory accesses of the program can be obtained from mtrace and used as input to a cache (another performance-enhancing component of modern processors) model, which guides the optimization of prefetching and replacement algorithms (as you will see in Lab4).

Traces are so important for performance optimization because they reflect the real behavior of the program: if you blindly optimize a function that is only called once, it is easy to see that such optimization will not improve the program's overall performance. A trace shows us the events that occur during program execution (such as function calls); if we perform statistical analysis on these events, we can tell which events occur frequently, and optimizing these frequent events improves the overall performance of the system. This is the scientific method of performance optimization.

AM as infrastructure

One approach is to write klib and then run the string program on NEMU to see if it passes the test. On the surface this doesn't seem like a bad idea, but if the test fails, you will be left wondering while debugging: did you write klib incorrectly, or is there a bug in NEMU? If this question cannot be answered, debugging becomes much harder: you might debug in NEMU for a week, only to find that the bug is in your klib implementation.

The reason this is a problem is that both the software (klib) and the hardware (NEMU) are written by you, so neither's correctness is guaranteed. We all learned the control-variable method in high school: if you can replace one side with an implementation that you believe is correct, you can test the correctness of the other side alone! For example, we can test klib on a real machine: if the test fails, the problem is in klib, because we can trust that the real machine's hardware always behaves correctly; conversely, if the test passes on the real machine but fails on NEMU, then klib is not the problem, and NEMU must be buggy.

A new question is: can we really port the software to other hardware for testing so easily? If you're smart, you'll recall the core idea of AM: decoupling the program from the architecture through a set of abstract APIs. The idea of AM ensures that the code running on top of AM (including klib) is architecture-independent, which increases its portability. Imagine if the code in string.c contained a nemu_trap instruction that can only be executed in NEMU; then it could not run on a real machine.

The abstract-machine has a special architecture called native, which implements the AM API using the GNU/Linux default runtime environment. For example, when we compile a program with gcc hello.c, it compiles to the runtime environment provided by GNU/Linux. Super Mario, the program you tried in PA1, is also compiled to native and runs. Compared to $ISA-nemu, native has the following advantages:

  • Running directly on the real machine, you can trust that the real machine will always behave correctly.
  • Even if there are bugs in the software, it is easier to debug them on native (e.g. you can use GDB, which is much more convenient than NEMU's monitor).

Therefore, instead of debugging the software directly in $ISA-nemu, it is better to get the software working on native first, and then switch to $ISA-nemu to test NEMU. In abstract-machine, we can easily compile a program for another architecture. For example, in the am-kernels/tests/cpu-tests/ directory, by running

make ALL=string ARCH=native run

you can compile the string program to native and run it. Since we will be compiling programs for different architectures, pay attention to the ARCH parameter in the make command. If the string program does not pass the test, the terminal will output

make[1]: *** [run] Error 1

Of course, it may also output other information, such as a segmentation fault.

How to generate native executables

Read the relevant Makefile, and try to understand how abstract-machine generates native executables.

Strange error code

Why is the error code 1? Do you know how the make program gets this error code?

Don't get too excited: when compiling to native, glibc is linked by default, and we need to switch to the klib we wrote in order to test it. We can do this by defining the macro __NATIVE_USE_KLIB__ in abstract-machine/klib/include/klib.h. If this macro is not defined, the library functions will be linked against glibc, which can serve as a correct reference implementation for comparison.

How does this work?

How are these library functions on native linked to klib after the macro __NATIVE_USE_KLIB__ is defined? How exactly does this happen? Try to explain this phenomenon based on what you learned about linking in class.

Ok, now you can test and debug your klib implementation on native, use putch() to output characters to help you debug, or even attach GDB to it. Once it is implemented correctly, compile the program to $ISA-nemu (remembering to remove the putch() calls inserted during debugging), and use it to test NEMU.

Writing portable programs

In order not to damage the portability of your programs, you can no longer write programs with architectural assumptions. For example, "the length of a pointer is 4 bytes" no longer holds, because a pointer on native is 8 bytes long, and a program written under this assumption will most likely trigger a segmentation fault when run on native.

Of course, there are ways to solve the problem, and how to do it is, as always, STFW.
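One direction, sketched only (the hint above still applies): never hard-code the pointer width, and use the pointer-sized integer types from stdint.h whenever you must treat an address as an integer.

```c
#include <stdint.h>
#include <assert.h>

// uintptr_t is wide enough to hold a pointer on any target, so this
// round trip is safe on both 32-bit $ISA-nemu and 64-bit native;
// casting through uint32_t instead would truncate 64-bit pointers.
void *roundtrip(void *p) {
  uintptr_t bits = (uintptr_t)p;
  return (void *)bits;
}
```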

Test your klib

The string program simply calls some functions in klib; it mainly serves as a client program to test the NEMU implementation, and it does not adequately test the klib implementation. For this reason, you need to write some adequate test cases specifically for klib. A test case consists of a test input and an expected output, and if we want to construct test cases efficiently, we need a way to obtain the expected output independently of the object under test.

 +----> Test Objects ----> Actual Output ----+
 |                                           |
Input                                        +----> Same?
 |                                           |
 +----> Some Method ----> Expected Output ---+

In klib, there are three main categories of functions that you need to implement.

  1. Write functions for memory and strings, e.g. memset(), strcpy(), etc.
  2. Read-only functions for memory and strings, e.g. memcmp(), strlen(), etc.
  3. Formatted output functions, such as sprintf(), etc.

For the first category of functions, how should we construct the tests so that the expected output is easy to obtain? Note that these functions all write to an area of memory; consider the following array.

#define N 32
uint8_t data[N];

void reset() {
  int i;
  for (i = 0; i < N; i ++) {
    data[i] = i + 1;
  }
}

Each element of such an array is 1 byte, and every element has a different value. If we test on this array, any single incorrect byte in the actual output can be detected with high probability. To obtain the expected output, we also have to think about the expected behavior of the function under test: the functions above all write to a contiguous interval of the array, so we can examine the expected output in three segments.

  • The first segment is the left-hand side of the interval that the function writes to, which is not written to, so there should be assert(data[i] == i + 1)
  • The second segment is the interval itself that the function writes to, and the expected result of this segment is related to the specific behavior of the function.
  • The third segment is the right-hand side of the interval that the function writes to, and this segment of the interval is not written to, so there should be assert(data[i] == i + 1)

So we can write two helper functions for checking this:

// Check that the values in the interval [l,r) are val, val + 1, val + 2, ...
void check_seq(int l, int r, int val) {
  int i;
  for (i = l; i < r; i ++) {
    assert(data[i] == val + i - l);
  }
}

// Check that the values in the interval [l,r) are all equal to val
void check_eq(int l, int r, int val) {
  int i;
  for (i = l; i < r; i ++) {
    assert(data[i] == val);
  }
}

With these two functions, we can iterate over various inputs and easily write the expected output of the function we are testing. For example, for memset(), we can write the following test code.

void test_memset() {
  int l, r;
  for (l = 0; l < N; l ++) {
    for (r = l + 1; r <= N; r ++) {
      reset();
      uint8_t val = (l + r) / 2;
      memset(data + l, val, r - l);
      check_seq(0, l, 1);
      check_eq(l, r, val);
      check_seq(r, N, r + 1);
    }
  }
}
Writing more tests

Try to understand how the above test code performs the tests, then add a new test set klib-tests for klib under the am-kernels/tests/ directory. For the file structure of the test set, you can refer to am-kernels/tests/am-tests or am-kernels/kernels/hello.

Then write the test code for the first type of write function described above. There are a few things to keep in mind when writing the tests.

  • The behavior of memcpy() is undefined (UB) when the intervals overlap. You can check whether the intervals overlap while iterating, and skip the check if they do; or use another identical array as src so that overlap never occurs.
  • String handler functions need extra attention for \0 and buffer overflows
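Following the second suggestion above, a test_memcpy() might look like the sketch below. It repeats the data/reset/check_seq helpers from the text so as to be self-contained, and the src pattern value 101 is an arbitrary choice.

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

#define N 32
static uint8_t data[N];
static uint8_t src[N];   // a second array, so the intervals never overlap

static void reset() {
  for (int i = 0; i < N; i ++) data[i] = i + 1;
  for (int i = 0; i < N; i ++) src[i]  = i + 101;  // a distinct pattern
}

// Check data[l..r) == val, val + 1, ... (same as check_seq in the text)
static void check_seq(int l, int r, int val) {
  for (int i = l; i < r; i ++) assert(data[i] == (uint8_t)(val + i - l));
}

void test_memcpy() {
  int l, r;
  for (l = 0; l < N; l ++) {
    for (r = l + 1; r <= N; r ++) {
      reset();
      memcpy(data + l, src, r - l);
      check_seq(0, l, 1);      // left of the interval: untouched
      check_seq(l, r, 101);    // copied bytes: 101, 102, ...
      check_seq(r, N, r + 1);  // right of the interval: untouched
    }
  }
}
```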

Once written, you can test your test code with the glibc libraries on native, then test your klib implementation with the test code on native, and finally run the test code on NEMU to test your NEMU implementation.

These libraries are so simple, can I not test them?

Yes, you can. But untested code is always wrong, and you'll be using these libraries to write more complex programs (e.g., OS) later, so if you don't want to get screwed over later, take the time to test them now. Also, if you want to optimize these libraries later (e.g., memcpy() and memset() are two functions that will be used a lot later on), you may write a slightly more complex algorithm, and at that point, you'll see how important these tests are.

Writing more tests (2)

Try to add tests to klib-tests for the second category of read-only functions, e.g. memcmp(), strlen(), etc. Think about it: how do you obtain the expected output of such a function?

Finally, let's look at the formatted output functions. Taking %d as an example, we need to construct some test inputs. But the range of integers is too large to enumerate exhaustively, so we need to pick some representative integers. The C standard header limits.h contains the definitions of the maximum and minimum values, which you can read by opening /usr/include/limits.h. Some representative integers could be:

int data[] = {0, INT_MAX / 17, INT_MAX, INT_MIN, INT_MIN + 1,
              UINT_MAX / 17, INT_MAX / 17, UINT_MAX};

To obtain the expected output, we can write a native program that prints these values with printf, and then organize the output into the test code. The expected output in cpu-tests is generated in the same way.
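A sketch of such a generator is shown below. It runs on native, where printf/snprintf come from glibc and can be trusted; the helper name fmt_expected and the emitted format are illustrative only.

```c
#include <stdio.h>
#include <limits.h>
#include <string.h>
#include <assert.h>

static const int data[] = {0, INT_MAX / 17, INT_MAX, INT_MIN, INT_MIN + 1,
                           UINT_MAX / 17, INT_MAX / 17, UINT_MAX};

// Render x with glibc's snprintf -- on native this is the trusted
// reference for what your klib sprintf("%d", x) should produce.
int fmt_expected(char *buf, size_t size, int x) {
  return snprintf(buf, size, "%d", x);
}

// Emit the expected strings as C literals, ready to paste into a test.
void emit_expected(void) {
  char buf[32];
  for (size_t i = 0; i < sizeof(data) / sizeof(data[0]); i ++) {
    fmt_expected(buf, sizeof(buf), data[i]);
    printf("  \"%s\",\n", buf);
  }
}
```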

Writing more tests (3)

Try adding tests for formatted output functions to klib-tests. You can print the actual output to a buffer with sprintf(), and then compare it to the expected output with strcmp().

You could also consider implementing width, precision, length modifiers, etc., and generating test cases to test them.

Differential Testing

After understanding how instructions are executed, adding various instructions is mostly an engineering exercise. Engineering implementations inevitably contain bugs, and how to debug quickly when the implementation is incorrect also falls within the scope of infrastructure. Think about it: there are so many instructions (x86 alone has a great many), and some instructions have complex behavior (most x86 instructions are very complex); if one of them is implemented incorrectly, how do we find it?

Intuitively this doesn't seem easy, so let's discuss why. Suppose we accidentally fill in the wrong decode type for an instruction. When NEMU executes that instruction, it decodes it with the wrong type, and either reads the wrong source operand or writes the correct result to the wrong destination operand. As a result, NEMU executes the instruction in a way that violates its original semantics, which in turn causes other instructions that depend on it to execute incorrectly. From the program's perspective, the result is undefined behavior. Eventually, we see the client program access memory out of bounds, get stuck in an infinite loop, HIT BAD TRAP, or even trigger a segmentation fault in NEMU itself.

We have already discussed debugging methods in PA1, but for bugs in the implementation of instructions, we find that these debugging methods are not very effective: it is difficult to express the correct behavior of instructions through assert(), and printf() and GDB don't actually reduce the distance between error and failure.

If there were a way to express the correct behavior of an instruction, we could build assert()-like checks on top of it. So what exactly expresses the correct behavior of an instruction? The most direct answer is, of course, the ISA manual, but we already implement instructions in NEMU based on the behavior described there, and the same source cannot be used for both implementation and checking. It would be nice to have another reference implementation of the ISA manual. Hey! Isn't the real machine we are using implemented according to the ISA manual? We could let every instruction executed in NEMU also be executed on the real machine, and then compare the states of NEMU and the real machine; if the two states do not match, we have caught the error!

This is actually a very effective testing methodology, known in the software testing field as differential testing (subsequently referred to as DiffTest). Typically, DiffTest is performed by providing a REF (Reference) that has the same functionality as the DUT (Design Under Test) but a different implementation, then feeding both the same well-defined inputs and checking whether they behave the same way.

We have just mentioned "state", but what exactly does it mean? As we learned in PA1, a program and a computer can be viewed as a state machine, and the state can be represented as a pair S = <R, M>, where R is the value of the registers and M is the value of memory. To check whether the implementation of an instruction is correct, we just check whether the states of the DUT and the REF are the same after executing it! DiffTest catches errors in a very timely manner: the first time you notice that NEMU's state differs from the real machine's, it is due to an error in the implementation of the instruction currently being executed. This is very close to the error itself, preventing the error from propagating further and making it much easier to backtrack to the fault.

What a wonderful feature to illustrate the theory behind the computer! Unfortunately, don't forget that the real machine is running the GNU/Linux operating system: we can't run AM programs compiled for x86-nemu on native, and mips32 and riscv32 programs can't be run directly on an x86 machine at all. So what we need is not just a correct implementation of the ISA manual, but one on which AM programs compiled for $ISA-nemu can also run correctly.

As introduced in PA1, NEMU is a full-system emulator. We can use other full-system emulators as REFs: they simulate a complete computer system, while NEMU aims to simulate only a subset of it, so programs that run on NEMU can also run on them. Therefore, to test the correctness of NEMU's implementation with the DiffTest method, we let NEMU and another emulator execute the same client program instruction by instruction, and after each instruction compare their register and memory state. If any inconsistency is found, an error is reported and execution of the client program is stopped.

Infrastructure - The Secret to Winning the LoongArch competition

The idea of DiffTest is very simple: find a correct implementation and compare the results against it. In fact, the expression generator you implemented in PA1 already embodies the idea of DiffTest: the C program is the REF. The cpu-tests provided in am-kernels also use a REF: these tests are first run on native to obtain the correct results, so you are actually using native as a REF to check the results of program execution (HIT GOOD/BAD TRAP).

Of course, the DiffTest introduced here is much finer-grained: we compare not just the final result of program execution, but the behavior of every instruction. This helps us quickly find and locate bugs in the implementation of instructions. We inherited this idea in the LoongArch competition: using the CPU written in verilog/chisel as the DUT and the implemented NEMU as the REF, we can quickly find and fix bugs in the verilog/chisel code. With the help of DiffTest, we achieved the epic feat of:

Implementing a full out-of-order processor correctly, running our own time-sharing multitasking operating system, Nanos, on it, and then running a complex application, Chinese Paladin, on top.

Another example of the power of DiffTest made headlines in July 2020, when five undergraduates from the Chinese Academy of Sciences graduated with their own processor chip designs. Under the guidance of a team of faculty members, the students used DiffTest to perform online comparisons of their processor designs with NEMU, successfully booting Linux and running Busybox within 5 days, then booting Debian and running complex applications such as GCC and QEMU within another 4 days. The students also shared their processor design experience at CRVS2020 (video, slides) and the RISC-V Global Forum (video, slides), where DiffTest was presented as a key technique.

The above examples repeatedly illustrate the importance of infrastructure: well-developed infrastructure makes CPU design efficient and simple, and even makes possible things that no one has done before. With infrastructure, daunting processor hardware designs can be transformed and revitalized: you no longer need to stare at dizzying waveforms to debug your hardware code. Recently there has also been a surge in agile hardware design, where the role of infrastructure is obvious. If you're interested, please contact us and let's explore directions for infrastructure improvement together.

To facilitate the implementation of DiffTest, we define the following set of APIs between DUT and REF.

// Copy `n` bytes between `buf` of DUT host memory and `addr` of REF guest memory.
// `direction` specifies the direction of copying, `DIFFTEST_TO_DUT` means copy to DUT, `DIFFTEST_TO_REF` means copy to REF.
void difftest_memcpy(paddr_t addr, void *buf, size_t n, bool direction);
// When `direction` is `DIFFTEST_TO_DUT`, get register state of the REF and write to `dut`.
// When `direction` is `DIFFTEST_TO_REF`, set the register state of REF to `dut`.
void difftest_regcpy(void *dut, bool direction);
// Have REF execute `n` instructions
void difftest_exec(uint64_t n);
// Initialize the DiffTest function of the REF
void difftest_init();

The register state dut requires the registers to be arranged in a certain order; if the order is not respected, the behavior of difftest_regcpy() is undefined (which conveniently shifts the blame onto you ^_^). The REF needs to implement these APIs, and the DUT uses them to perform DiffTest. Here, the DUT is NEMU and the REF is another emulator.
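To make the direction convention concrete, here is a toy, self-contained model of the REF side of this API. The internals (ref_mem, ref_regs, the memory size, the "32 GPRs + pc" layout) are invented for illustration; only the signatures and the DIFFTEST_TO_DUT/DIFFTEST_TO_REF semantics follow the listing above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t paddr_t;
enum { DIFFTEST_TO_DUT, DIFFTEST_TO_REF };

static uint8_t  ref_mem[4096];   // pretend guest memory of the REF
static uint32_t ref_regs[33];    // e.g. 32 GPRs + pc, in the agreed order

void difftest_memcpy(paddr_t addr, void *buf, size_t n, bool direction) {
  if (direction == DIFFTEST_TO_REF) memcpy(ref_mem + addr, buf, n);
  else                              memcpy(buf, ref_mem + addr, n);
}

void difftest_regcpy(void *dut, bool direction) {
  if (direction == DIFFTEST_TO_REF) memcpy(ref_regs, dut, sizeof(ref_regs));
  else                              memcpy(dut, ref_regs, sizeof(ref_regs));
}
```

The point of the single `direction` flag is that one pair of functions covers both initialization (DUT state pushed into the REF) and checking (REF state pulled out for comparison).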

The NEMU framework code is ready for DiffTest functionality, turn on the appropriate option in menuconfig.

Testing and Debugging
  [*] Enable differential testing

Then just recompile NEMU and run it. The NEMU configuration system selects the appropriate emulator as the REF based on the ISA.

  • x86: KVM. KVM uses hardware virtualization technology to create a virtual computer system directly on the hardware, just like the real machine. It provides a set of ioctl()-based APIs to Linux user programs, which allow us to put the hardware into virtualization mode, run client programs on it, and retrieve their state.
  • mips32: QEMU. QEMU is a full-system emulator supporting multiple ISAs. However, we can only communicate with QEMU via the socket-based GDB protocol to get its state, which has a high communication overhead. In order to run QEMU, you need to install it:
apt-get install qemu-system
  • riscv32: Spike. Spike is a full-system emulator from the RISC-V community that works very similarly to NEMU. We have added a small number of interfaces to Spike to implement the DiffTest API, thanks to Chenlu (class of 2018). Since Spike contains a lot of source files, the compilation process may take a few minutes. In order to run Spike, you need to install another tool:
apt-get install device-tree-compiler

Fix Spike compilation errors

We updated Spike's code at 01:15 on April 8, 2023. If you got NEMU's code before that time but Spike's code after it, you may run into compilation errors in tools/spike-diff/. Please update the relevant files according to this page.

If you got your Spike code before the above time, or your NEMU code after it, you don't need to worry about this.

Ignore the following message during the compilation of Spike

The following messages do not affect Spike's compilation result; you can ignore them.

Makefile:349: warning: overriding recipe for target 'disasm.o'
Makefile:349: warning: ignoring old recipe for target 'disasm.o'
make[2]: Circular <- dependency dropped.
make[2]: Circular <- dependency dropped.

The appropriate rules and parameters have been set in nemu/tools/, which will automatically go into the corresponding subdirectory under nemu/tools/ (kvm-diff, qemu-diff or spike-diff) to compile a dynamic library, and pass it to NEMU via the --diff option. After enabling DiffTest, init_difftest() in nemu/src/cpu/difftest/dut.c performs the following additional initialization:

  • Open the incoming dynamic library file ref_so_file.
  • Resolve and relocate the API symbols in the dynamic libraries through dynamic loading, and return their addresses.
  • Initialize the DiffTest function of the REF; the specific behavior varies from REF to REF.
  • Copy the guest memory of the DUT into the REF.
  • Copy the register state of the DUT into the REF.

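The symbol-resolution step above uses the standard dlopen()/dlsym() mechanism. The sketch below demonstrates that mechanism in isolation; the helper name load_ref_symbol is made up for illustration, and libm is used in place of a real REF library since the REF's actual .so path depends on your build.

```c
#include <dlfcn.h>
#include <stddef.h>

// Open a dynamic library and return the address of one symbol in it,
// or NULL on failure. init_difftest() does essentially this for each
// difftest_* API exported by the REF's library.
void *load_ref_symbol(const char *so_path, const char *name) {
  void *handle = dlopen(so_path, RTLD_LAZY);  // open the library
  if (handle == NULL) return NULL;
  return dlsym(handle, name);                 // resolve the symbol
}
```

For example, load_ref_symbol("libm.so.6", "cos") on a glibc Linux system returns a pointer that can be called as a double (*)(double).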
After the above initialization, the DUT and REF are in the same state. The next step is the instruction-by-instruction comparison, done by the difftest_step() function (defined in nemu/src/cpu/difftest/dut.c). It is called in the main loop of cpu_exec(): after NEMU executes an instruction, difftest_step() lets the REF execute the same instruction, then reads out the REF's registers and compares them. Since registers differ between ISAs, the framework code abstracts the register comparison into an ISA-related API, the isa_difftest_checkregs() function (defined in nemu/src/isa/$ISA/difftest/dut.c). You need to implement isa_difftest_checkregs(), which compares NEMU's general-purpose registers and PC with the register values read from the REF. If they are the same, the function returns true; if any value differs, it returns false and the framework code automatically stops the client program. In particular, when isa_difftest_checkregs() finds an inconsistency, its second parameter pc should point to the instruction that caused it, so that a debug message can be printed.
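A minimal sketch of what an isa_difftest_checkregs() implementation might look like, using a simplified riscv32-style register file. The CPU_state layout and the global cpu here are stand-ins for the framework's actual definitions, so treat this as a shape to follow rather than code to paste.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t vaddr_t;
typedef struct { uint32_t gpr[32]; uint32_t pc; } CPU_state;

CPU_state cpu;  // the DUT's (NEMU's) register state

// Return true if the DUT and REF states agree. On a mismatch the caller
// can use `pc` (the instruction just executed) in its error report.
bool isa_difftest_checkregs(CPU_state *ref_r, vaddr_t pc) {
  (void)pc;  // only needed for the error report, omitted in this sketch
  for (int i = 0; i < 32; i++) {
    if (cpu.gpr[i] != ref_r->gpr[i]) return false;
  }
  return cpu.pc == ref_r->pc;
}
```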

Implementing DiffTest

In the introduction to the API conventions above, it was mentioned that the register state dut needs the registers to be arranged in a certain order.

You first need to RTFSC to find this ordering, and check that your NEMU implementation already meets the constraint.

Then add the appropriate code to isa_difftest_checkregs() to implement the core functionality of DiffTest. Once implemented correctly, you'll have an incredibly powerful testing tool.

After realizing the power of DiffTest, think about this: as infrastructure, how much debugging time can DiffTest save you?

Huh? Don't we need to compare the state of memory? In fact, it is not easy to find out which memory locations an instruction modified in the REF, and comparing the whole memory would incur huge overhead, so we do not compare the memory state. NEMU's simplified implementation can also leave some registers inconsistent with the REF. For example, for x86 EFLAGS, NEMU implements only a few of the flags and also simplifies how some instructions update EFLAGS; in addition, some special system registers are not fully implemented. Therefore, our DiffTest does not compare the complete state of the REF and NEMU. However, as long as memory or a special register has been modified by a client instruction, its value will be used again in the near future and eventually flow into a compared register, so the difference can still be detected. We thus sacrifice some comparison accuracy for performance. Even so, since DiffTest needs to communicate with the REF and let the REF execute instructions, it still slows NEMU down, so it is not recommended to run NEMU with DiffTest turned on unless you are debugging.

NEMU's simplifications cause the behavior of some instructions to differ from the REF, making them impossible to compare. To solve this problem, the framework provides two functions, difftest_skip_ref() and difftest_skip_dut():

  • Some instructions cannot be executed directly by the REF, or their behavior in the REF necessarily differs from NEMU. For example, the nemu_trap instruction throws a debugging exception in the REF. This can be calibrated with difftest_skip_ref(): after calling it, difftest_step() will make the REF skip the execution of the current instruction, and NEMU's current register state will be synchronized into the REF directly. The effect is equivalent to "the execution result of this instruction is taken from NEMU's state".

  • Due to implementation peculiarities, QEMU may on rare occasions pack several instructions together for execution. When this happens, one call to difftest_step() makes QEMU execute several instructions. However, since NEMU's fetch_decode_exec_updatepc() executes one instruction at a time, this introduces a skew. It can be calibrated with difftest_skip_dut(int nr_ref, int nr_dut): immediately after calling it, the REF is told to single-step nr_ref times, and NEMU is then expected to catch up with the REF within nr_dut instructions, skipping checks for all instructions in the process.
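The skip-ref mechanism can be modeled in a few lines. Everything below (ref_pc as a stand-in for the REF's whole state, the fixed +4 step) is an invented toy, but it shows the contract: difftest_skip_ref() arms a flag, and the next difftest_step() copies the DUT's state into the REF instead of executing and comparing.

```c
#include <stdbool.h>
#include <stdint.h>

static bool is_skip_ref = false;
static uint32_t ref_pc;  // stand-in for the REF's whole state

void difftest_skip_ref(void) { is_skip_ref = true; }

// One comparison step; next_dut_pc is NEMU's pc after executing an
// instruction. Register comparison is elided to keep the toy small.
void difftest_step(uint32_t next_dut_pc) {
  if (is_skip_ref) {
    ref_pc = next_dut_pc;  // sync DUT state into the REF, no comparison
    is_skip_ref = false;
    return;
  }
  ref_pc += 4;             // let the REF execute one instruction
  // ...then read the REF's registers and compare with the DUT's...
}
```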

Instructions that require calibration

Instructions in PA2 that require calibration:

  • x86: N/A
  • riscv32: N/A
  • mips32: various jump instructions
    • This is because mips32-NEMU does not implement the branch delay slot, which can be calibrated with difftest_skip_dut(2, 1)

Unbelievable QEMU behavior (a bit of a challenge)

In some older versions of mips32-QEMU, instruction packing happens only if the last 12 bits of the PC of the previous instruction are 0xffc. This packing condition seems very strange; do you know what the reason might be?

Spike does not support unaligned memory access.

RISC-V, as a RISC architecture, does not normally support unaligned accesses: executing an unaligned access instruction in Spike will throw an exception, causing the PC to jump to 0. NEMU does not do this, for simplicity, so if such an instruction is executed in both NEMU and Spike, DiffTest will report an error. This is most likely a problem in your software implementation (e.g. klib), so please check and correct your code.

When using QEMU as REF, do not run two copies of NEMU at the same time.

DiffTest connects to QEMU through a fixed port, so running two copies of NEMU with DiffTest enabled at the same time will result in the following message:

Failed to find an available port: Address already in use

If you are sure you are not running two copies of NEMU at the same time but still encounter the above message, you can kill the QEMU processes left in the background by executing:

pkill -9 qemu

One-click regression testing

While implementing the instructions, you run the test cases one by one. But once the instructions are implemented correctly, does that mean you can say goodbye to the test cases? Obviously not. You will add new functionality to NEMU in the future, and to make sure that the new functionality does not break the existing functionality, you will need to re-run the test cases. In software testing, this process is called regression testing.

Since these test cases will be re-run in the future, it would be inefficient to re-run each test manually. To improve efficiency, we provide cpu-tests with a one-click regression test command:

make ARCH=$ISA-nemu run

With the above command you can automatically run all the tests in cpu-tests in a batch and report the results for each test case.

The Nature of NEMU

As you already know, NEMU is a program that executes other programs. In computability theory there is a special term for such a program: a universal program, which, in plain language, can do whatever any other program can do. The existence of universal programs has been proved, and we won't go into the proof here, but the idea of the universal program is behind the fact that we can write NEMU, do experiments in Docker or virtual machines, and do all sorts of things on our computers: NEMU and emulators are just instantiations of the universal program. It is no exaggeration to say that a computer is a physical instance of a universal program. The existence of universal programs laid the theoretical foundation for the emergence of computers and is an extremely important conclusion in computability theory. If the existence of universal programs were not proved, we could not use computers with confidence, and we could not say "the machine is always right".

The NEMU we write will eventually be compiled into x86 machine code, using x86 instructions to simulate the execution of client instructions. In fact, in 1983, Prof. Martin Davis published a book entitled "Computability, Complexity, and Languages: Fundamentals of Theoretical Computer Science", which proposed the L programming language and showed that L is equivalent to all other programming languages. The three instructions in L are:

V = V + 1
V = V - 1
IF V != 0 GOTO L

In terms of x86 instructions, these are inc, dec and jne.
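To get a feel for why these three operations suffice, here is unsigned addition written with nothing but them (a small illustration of the idea, not an example from the book):

```c
// Unsigned addition using only the three L operations:
//   a = a + 1  (inc),  b = b - 1  (dec),  IF b != 0 GOTO loop  (jne)
unsigned l_add(unsigned a, unsigned b) {
loop:
  if (b != 0) {
    a = a + 1;
    b = b - 1;
    goto loop;
  }
  return a;
}
```

For example, l_add(3, 4) increments a four times while counting b down to zero, yielding 7.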

Even more surprisingly, Prof. Martin Davis proved that, physical limitations aside (think unlimited memory, where each memory cell can hold an arbitrarily large number), it is possible to write a universal program similar to NEMU in L! And the structure of this universal program written in L is similar to the cpu_exec() function in NEMU: fetch, decode, execute... This is not a coincidence, but an application of simulation in computer science.

Long before Prof. Martin Davis proposed the L language, scientists had been exploring which problems are computable. Back in the 1930s, in attempts to answer this question, different scientists proposed and investigated different models of computation: the recursive functions of Gödel, Herbrand and Kleene, the λ-calculus of Church, and the Turing machine of Turing. These were later shown to be equivalent in computational power, and by the 1940s computers were being built. It has even been shown that an infinite number of abacuses spliced together would be equivalent to a Turing machine! From this we can see that universal programs take different forms in different models of computation. A universal program like NEMU would have been of great significance in the 1930s; if you had designed NEMU 90 years ago, perhaps the Turing Award would be named after you. Limits of Computation is a series of popular science articles about the development of computability theory that you can read to get a sense of this piece of human civilization (some mathematics will be required). If you are interested in computability theory, you can take Prof. Fangmin Song's Introduction to Computational Theory course.

Returning our thoughts to PA, the nature of universal programs tells us that the potential of NEMU is unlimited. What do you think NEMU is still missing to create a colorful world?

Catching an infinite loop (a bit of a challenge)

In addition to being an emulator, NEMU also has simple debugging features that allow you to set breakpoints and check the state of the program.

If you were to add the following feature to NEMU:

When the client program is trapped in an infinite loop, make the client program pause and output a corresponding log message.

How do you think it should be done? If you feel confused, search for relevant information on the Internet.


This concludes phase 2 of PA2.