RTFM

We've already covered the concepts of how an instruction is executed in the previous section, some of which are even hard to expand on. But when we decided to add all kinds of efficient instructions to TRM, it also meant that we couldn't avoid the complexity of the details.

The first thing you need to know is the exact behavior of the instructions, and for that you need to read the instruction set section of the survival manual (The official ISA manual). Specifically, whatever ISA you choose, the corresponding manual will usually have the following, so try to RTFM and look for the location of these.

  • A description of the specific behavior of each instruction
  • a table of instruction opcodes's encoding

In particular, due to the complexity of the x86 instruction set, we have provided a simple tutorial for those who choose x86.

RISC - Another world parallel to the CISC

Do you think the format of the x86 instruction set is particularly complex? This is actually a characteristic of CISC, using complex instruction format, sacrificing the development cost of hardware, but also to make a single instruction can do more things, so as to increase the density of code, reduce the size of the program. As time progressed, architects realized that the complex control logic in CISC was not conducive to improving processor performance, so RISC was born. The purpose of RISC is simplicity, fewer instructions, fixed instruction length, and uniform instruction format, which is similar to the KISS rule. There is a short article comparing RISC and CISC at hereopen in new window.

Also we recommend this articleopen in new window, which tells the story of the birth of the RISC world and its integration with the CISC world, giving a sense of the RISC is a milestone in the evolution of computer architectures.

If you're lucky enough to choose riscv32, you'll find that you only need to read a very small portion of the manual at this point: in PA, riscv32's client program consists of only two types of instructions, RV32I and RV32M. This is due to the design concept of the RISC-V instruction set - modularity.

RISC-V - A well-designed instruction set

RISC-V is a very young instruction set - the first version of RISC-V was proposed by a team of researchers at UC Berkeley in May 2011, and has since taken the world by storm. Openness is RISC-V's biggest selling point, and even ARM and MIPS have been shocked by it, and even try to tearing each other apart for the competition... This articleopen in new window describes the concept of RISC-V and some of the history of its growth.

Of course, this is not really relevant to PAs in the teaching field. The point is.

  • RISC-V is really simple.
  • And while it is simple, there are a lot of thoughtful considerations about how the program should run. If you read the RISC-V manual, you'll see that there are a lot of design choices and trade-offs. In addition, Prof. David Patterson (who won the 2018 Turing Award for promoting RISC, and is considered to be one of the great masters in the field of architecture) has also written an introductory book for the promotion of RISC-V, The RISC-V Reader, which describes a large number of RISC-V design principles from a systematic point of view and compares them with the existing instruction set, which is well worth a read. The RISC-V Reader is an introductory book. Three graduate students at the Institute of Computing, Chinese Academy of Sciences, have written a Chinese translation of this book (one of them is also your direct senior), but since this book does not follow the latest contents of the official RISC-V manual, and there are a lot of clerical errors in the book (feel free to raise an issue in the corresponding github repo), we still recommend that you read the official RISC-V manual.

RTFSC(2)

After understanding how YEMU executes instructions in the previous subsection, you will have a basic understanding of the emulator's framework. NEMU has to simulate a real ISA, so the code is much more complex than YEMU, but the underlying principles are the same as YEMU. Below we describe how NEMU's framework code implements the execution of instructions.

During the course of RTFSC, you will encounter most of the APIs used to abstract ISA differences, so we recommend that you read this page to get a basic understanding of what these APIs do, so that you can refer to them when you encounter them in your code in the future.

We mentioned in PA1.

cpu_exec() in turn calls execute(), which mimics the way the CPU works: executing instructions over and over again. Specifically, the code calls the exec_once() function in a for loop, which does what we described in the previous subsection: it tells the CPU to execute an instruction pointed to by the current PC, and then update the PC.

Specifically, exec_once() accepts a pointer s to a structure of type Decode, which is used to hold information needed during the execution of an instruction, including the current PC of the instruction, the PC of the next instruction, etc. There is also some information that is ISA-related, which NEMU abstracts with a structure of type ISADecodeInfo, defined in nemu/src/isa/$ISA/include/isa-def.h. exec_once() first saves the current PC to members pc and snpc of s, where s->pc is the PC of the current instruction and s->snpc is the PC of the next instruction, where snpc means "static next PC".

The code then calls the isa_exec_once() function (defined in nemu/src/isa/$ISA/inst.c), because the exact process of executing the instruction is ISA-related, so we won't delve into the details of isa_exec_once() here. What is clear is that it modifies the value of s->snpc as it fetches the instruction, so that s->snpc will be the PC for the next instruction when it returns from isa_exec_once(). The code will then update the PC with s->dnpc, where dnpc stands for "dynamic next PC". The difference between snpc and dnpc is explained below.

Ignoring the rest of the trace-related code in exec_once(), we return to execute(). The code adds 1 to a counter used to record client instructions, then does some trace and difftest related operations (which we ignore at this point), and then checks to see if the NEMU status is NEMU_RUNNING, if it is, it continues with the next instruction, otherwise it exits the loop of executing instructions.

In fact, the exec_once() function covers all phases of the instruction cycle: fetch, decode, execute, and update the PC, so let's take a look at how the NEMU implements each phase of the instruction cycle.

instruction fetch, IF

The first thing isa_exec_once() does is fetch instructions. In NEMU, there is a function inst_fetch() (defined in nemu/include/cpu/ifetch.h) that does this. inst_fetch() ultimately calls vaddr_ifetch() (defined in nemu/src/memory/vaddr.c) with the parameter len, and vaddr_ifetch() accesses the contents of physical memory via paddr_read(). Thus, an instruction fetch operation is nothing more than a memory access.

isa_exec_once() passes in the address of s->snpc when calling inst_fetch(), so inst_fetch() ends up updating s->snpc with len, which points s->snpc to the next instruction.

instruction decode, ID

The code then goes to the decode_exec() function, which first performs decoding-related operations. The purpose of decoding is to get the operation and the oprand of the instruction, which is determined by looking at the opcode of the instruction. The opcode of different ISAs can be found in different places in the instruction, so we just need to recognize the corresponding opcode from the instruction according to the encoding format of the instruction.

Compared with YEMU, NEMU uses a higher level of abstraction: pattern matching, NEMU can specify the opcode in an instruction by a pattern string, for example, in riscv32 there is the following pattern: INSTPAT_START(), INSTPAT_START(), INSTPAT_START(), INSTPAT_START().

INSTPAT_START();
INSTPAT("??????? ????? ????? ??? ????? 00101 11", auipc, U, R(rd) = s->pc + imm);
// ...
INSTPAT_END();

Where INSTPAT (shorthand for instruction pattern) is a macro (defined in nemu/include/cpu/decode.h), which is used to define a pattern-matching rule. The format is as follows.

INSTPAT(pattern string, instruction name, instruction type, instruction execution operation);

Only four characters are allowed in the pattern string.

  • 0`` means that the corresponding bit can only match 0`
  • 1 means that the corresponding bit can only match 1
  • ? means the corresponding bit can match either 0 or 1
  • Space is a separator, only used to improve the readability of the pattern string, not involved in matching.

The instruction name' is used only as a comment in the code and does not participate in macro expansion; the instruction type' is used in the subsequent decoding process; and the `instruction execution operation' is the C code that simulates the real behavior of the instruction execution.

In addition, the macros INSTPAT_START and INSTPAT_END are defined in nemu/include/cpu/decode.h. INSTPAT uses two other macros, INSTPAT_INST and INSTPAT_MATCH, which are defined in nemu/src/isa/$ISA/inst.c. After expanding the above code with macros and simply organizing the code, you will end up with.

{ const void ** __instpat_end = &&__instpat_end_;
do {
  uint64_t key, mask, shift;
  pattern_decode("??????? ????? ????? ??? ????? 00101 11", 38, &key, &mask, &shift);
  if ((((uint64_t)s->isa.inst.val >> shift) & mask) == key) {
    {
      decode_operand(s, &rd, &src1, &src2, &imm, TYPE_U);
      R(rd) = s->pc + imm;
    }
    goto *(__instpat_end);
  }
} while (0);
// ...
__instpat_end_: ; }

The &&__instpat_end_ in the above code uses the label as valuesopen in new window extension provided by GCC, and the goto statement will jump to the last __instpat_end_ label. In addition, the pattern_decode() function is defined in nemu/include/cpu/decode.h and is used to convert pattern strings into three integer variables.

The pattern_decode() function extracts the 0 and 1 from the pattern string into the integer variable key, with mask indicating the mask of the key, and shift indicating the number of bits away from the lowest bit of the opcode, to aid in the optimization by the compiler. Specifically, in the above example.

key   = 0x17;
mask  = 0x7f;
shift = 0;

Consider the following instruction from the built-in client program introduced in PA1.

0x00000297   auipc t0,0

When NEMU fetches the instruction, it will record the instruction into s->isa.inst.val, and then the if statement in the above macro expansion evals to true, which means that it matches the opcode of the auipc instruction, and therefore further decoding operation will be performed.

We only know what the instruction does (e.g. auipc adds the current PC value to the immediate number and writes it to a register), but we still don't know what is the oprand (e.g. what the immediate number is, and what register it writes to). To solve this problem, the code needs to be decoded further, which is done by calling the decode_operand() function. This function will decode the operands according to the parameter type (instruction type), and the result will be recorded in the parameters rd, src1, src2, and imm, which represent the register number of the destination operand, the two source operands, and the immediate number, respectively.

We realize that operands such as registers and immediate numbers are very common types of operands. In order to further decouple operand decoding from instruction decoding, we have abstracted and encapsulated the decoding of these operands.

  • The framework code defines two helper macros, src1R() and src2R(), which are used to pass the results of register reads to the corresponding operand variables.
  • The framework code also defines helper macros such as immI, which are used to extract immediate numbers from instructions.

With these helper macros, we can use them to write decode_operand() conveniently, for example, the decoding process of type-I instructions in RISC-V can be realized by the following code:

case TYPE_I: src1R(); immI(); break;

A couple of additional notes.

  • decode_operand uses the macros BITS and SEXT, which are defined in nemu/include/macro.h and are used for bit extraction and symbol expansion, respectively.
  • decode_operand will first decode the destination operand through register operands decoding, i.e. call *rd = BITS(i, 11, 7), and rd can be used for different types of instructions depending on the situation.
  • At the end of the pattern matching process, there is an inv rule, which means that "if all the previous pattern matching rules fail to match successfully, then the instruction is considered illegal".

Variable-length instructions for x86

Due to the variable-length nature of CISC instructions, x86 instruction lengths and instruction forms need to be determined by fetching and decoding at the same time, unlike the RISC instruction set which handles the fetch and decode phases in a clear-cut manner, so you'll see inst_fetch() operations in the x86 decoding process.

The Story Behind Immediate Numbers

Framework code through the inst_fetch() function to fetch the instructions, even this is only 1 line of code, in fact, behind the scenes, it hides the careful consideration for endiannessopen in new window. Most of the students' hosts are x86 little-endian machines, when you use high-level language or assembly language to write a 32-bit constant 0x1234, in the generated binary code, the constant corresponds to the byte sequence as follows (assuming that the constant's starting address in memory is x).

x   x+1  x+2  x+3
+----+----+----+----+
| 34 | 12 | 00 | 00 |
+----+----+----+----+

Whereas most PCs are little-endian architectures (we're sure no students will use an IBM mainframe for PA), when NEMU is running

imm = inst_fetch(pc, 4);

This line of code reads the byte sequence 34 12 00 00 from memory into the imm variable as is, and the host CPU interprets this byte sequence in the little-endian fashion, resulting in 0x1234, which is what we expect.

The processors in the Motorola 68k series are all big-endian architecture. Now the question arises, consider the following two scenarios.

  • Suppose we need to run NEMU on a Motorola 68k machine (compiling the NEMU source code into Motorola 68k machine code).
  • Suppose we need to add Motorola 68k as a new ISA to NEMU.

In those cases, what problems do you need to be aware of? Why do these problems occur? How to solve them?

The fact is that it is not only the immediate number accesses, but also any memory accesses longer than 1 byte need to be considered similarly. We're going to throw the problems out here, and not discuss them separately in the future.

The Story Behind Immediate Numbers (2)

The instruction lengths of mips32 and riscv32 are only 32 bits, so they can't encode 32-bit constants from C code directly into a single instruction like x86 can. Think about it, how should mips32 and riscv32 solve this problem?

I'm getting confused by macro definitions, what should I do?

In order to understand the semantics of a macro, you may try to expand it manually, but you may encounter the following difficulties.

  • The more macros are nested, the more difficult it is to understand them.
  • Some Token-pasting (##) macros can interfere with the editor's go to definition functionality.

In fact, you don't need to do this manually in order to expand macros, because there are tools that can do it: all we need to do is to let GCC output the results of the preprocessing, and then we can see the result of the macro expansion. With the result of the macro expansion, you can quickly understand the semantics of the expansion, and in turn understand how the macro was expanded step-by-step.

Of course, the most convenient way to do this is to let GCC compile NEMU and output the preprocessing results at the same time, which is not difficult if you have some knowledge of the Makefiles.

execute, EX

After the decoding phase, the code will execute the instruction execution operation specified in the pattern matching rule, which will use the result of decoding and simulate the real behavior of the instruction execution by C code. For example, for the auipc instruction, since the U-type immediate number has already been recorded into the operand imm during the decoding phase, we only need to add the immediate number and the current PC value by R(rd) = s->pc + imm and write it to the target register, and then the execution of the instruction will be completed.

At the end of the instruction execution phase, the decode_exec() function returns 0, and goes all the way back to the exec_once() function. However, the current code does not use this return value, so it can be ignored.

Update PC

Finally, updating the PC. Updating the PC is very simple, just assign s->dnpc to cpu.pc. We mentioned snpc and dnpc earlier, so let's explain the difference now.

Static and dynamic instructions

In the field of program analysis, static instructions are the instructions in the program code, and dynamic instructions are the instructions during program execution. For example, for the following sequence of instructions

100: jmp 102
101: add
102: xor

The next static instruction for the jmp instruction is the add instruction, and the next dynamic instruction is the xor instruction.

With the concepts of static and dynamic instructions, we can explain the difference between snpc and dnpc: snpc is the next static instruction, while dnpc is the next dynamic instruction. For sequential instructions, their snpc and dnpc are the same; but for jump instructions, snpc and dnpc are different, and dnpc should point to the instruction that is the target of the jump. Obviously, we should use s->dnpc to update the PC, and maintain s->dnpc correctly during instruction execution.


The flow of an instruction execution in the NEMU has been described above, but there are a few details that have not been fully covered (e.g., the x86 instruction set decoder table), so it is up to you to try to understand these details.

Drive the project, not be driven by the project.

Your relationship with a program will go through 4 stages:

  1. Being driven: you don't know anything about it.
  2. Understood: you have a basic understanding of the main modules and functions.
  3. Fluent: you know most of the details of the entire project.
  4. Mastered: You can add any functionality you want to the project.

In PA, the main way to reach the second stage is to read the handouts and the code, and the main way to reach the third stage is to complete the experiments and debug them independently. To reach the fourth stage, it is up to you: what is not good enough in the code? What is good enough? What should be done to achieve this goal?

When you graduate and go into industry or academia, you'll find that real projects are all like this.

  1. Just on board with a new project, don't know how to start.
  2. RTFM, RTFSC, roughly understand the project organizational structure and basic workflow
  3. When you run the project, you find that there is unintended behavior (maybe a configuration error or environment error, may be an error in interfacing with an existing project, or may be the project's own bugs), and then debug it. In the debugging process, the understanding of these modules will gradually become clear. 4.
  4. One day you need to add a new feature in the project, you will realize that you can actually handle it.

What this shows is that if you go to the experts to help you debug a bug everytime you encounter it, you're missing out on a lot more opportunities and capabilities than you might think.

Structured program design

We have just introduced some utility functions and macros for the decoding process, which were introduced for the purpose of decoupling and improving maintainability. If the instruction set is more complex, there are more common features between instructions, for example, in the case of x86:

  • For different forms of the same instruction, the execution phase is the same. For example, add_I2E and add_E2G both add two operands and store the result in the destination operand.
  • For the same form of different instructions, the decoding phase is the same. For example, add_I2E and sub_I2E both decode an immediate number and an E operand in their decoding phase.
  • For different operand widths of the same form of the same instruction, the decoding and execution phases are very similar. For example, add_I2E_b, add_I2E_w, and add_I2E_l, all recognize an immediate number and an E operand, and then store the result of the addition into the E operand.

This means that if the decoding and execution process is implemented independently for each instruction with different forms and operand widths, a lot of duplicated code will be introduced. When modifications are needed, all the related code has to be modified separately, and missing a certain place will cause bugs, and the difficulty of project maintenance will increase rapidly. This is essentially another form of ODR (one definition rule) rule. Where you reuse code as much as possible so whenever you need to change something you only need to change it in a single place.

Let's experience this.

In the past, a student implemented the isa_reg_str2val() function with the following code.

if (strcmp(s, "$0") == 0)
  return cpu.gpr[0]._64;
else if (strcmp(s, "ra") == 0)
  return cpu.gpr[1]._64;
else if (strcmp(s, "sp") == 0)
  return cpu.gpr[2]._64;
else if (strcmp(s, "gp") == 0)
  return cpu.gpr[3]._64;
else if (strcmp(s, "tp") == 0)
  return cpu.gpr[4]._64;
else if (strcmp(s, "t0") == 0)
  return cpu.gpr[5]._64;
else if (strcmp(s, "t1") == 0)
  return cpu.gpr[6]._64;
else if (strcmp(s, "s2") == 0)
  return cpu.gpr[7]._64;
else if (strcmp(s, "s0") == 0)
  return cpu.gpr[8]._64;
else if (strcmp(s, "s1") == 0)
  return cpu.gpr[9]._64;
else if (strcmp(s, "a0") == 0)
  return cpu.gpr[10]._64;
else if (strcmp(s, "a1") == 0)
  return cpu.gpr[11]._64;
else if (strcmp(s, "a2") == 0)
  return cpu.gpr[12]._64;
else if (strcmp(s, "a3") == 0)
  return cpu.gpr[13]._64;
else if (strcmp(s, "a4") == 0)
  return cpu.gpr[14]._64;
else if (strcmp(s, "a5") == 0)
  return cpu.gpr[15]._64;
else if (strcmp(s, "a6") == 0)
  return cpu.gpr[16]._64;
else if (strcmp(s, "a7") == 0)
  return cpu.gpr[17]._64;
else if (strcmp(s, "s2") == 0)
  return cpu.gpr[18]._64;
else if (strcmp(s, "s3") == 0)
  return cpu.gpr[19]._64;
else if (strcmp(s, "s4") == 0)
  return cpu.gpr[20]._64;
else if (strcmp(s, "s5") == 0)
  return cpu.gpr[21]._64;
else if (strcmp(s, "s6") == 0)
  return cpu.gpr[22]._64;
else if (strcmp(s, "s7") == 0)
  return cpu.gpr[23]._64;
else if (strcmp(s, "s8") == 0)
  return cpu.gpr[24]._64;
else if (strcmp(s, "s8") == 0)
  return cpu.gpr[25]._64;
else if (strcmp(s, "s10") == 0)
  return cpu.gpr[26]._64;
else if (strcmp(s, "t2") == 0)
  return cpu.gpr[27]._64;
else if (strcmp(s, "t3") == 0)
  return cpu.gpr[28]._64;
else if (strcmp(s, "t4") == 0)
  return cpu.gpr[29]._64;
else if (strcmp(s, "t5") == 0)
  return cpu.gpr[30]._64;
else if (strcmp(s, "t5") == 0)
  return cpu.gpr[31]._64;

You should be able to imagine how this student wrote the above code. Now the question is, can you quickly check if the above code is correct?

And moreover, if you have a lot of code like this in your project, would you be willing to read it carefully?

Copy-Paste - A bad programming habit

In fact, when the first version of PA was released, the framework code was precisely designed to guide you through the process of decoding and executing each instruction independently. Most student implemented the decoding and execution by making several copies of the existing code, and then making minor changes (e.g., changing << to >>). The nightmare begins when you realize there is a bug in the code. It may take days just for you to uncover another bug and then remember that you've fixed it somewhere before. You know there are similar bugs in the code, but you can't tell which code was copied from which place at which time. Because the framework code did not pay enough attention to programming style, it traps students in the debugging rabbit hole, which is also a piece of PA's black history.

This bad programming habit is called Copy-Paste, and after analyzing it above, I'm sure you've already realized how terrible it is. In fact, Prof. Yuan-Yuan Zhou's teamopen in new window designed a tool, CP-Miner, in 2004 to automatically detect bugs caused by Copy-Paste in OS code. This tool also allowed Prof. Zhou to publish a paperopen in new window on the top OSDI conference. Which was also the first OSDI conference paper in the history of UIUC, where she worked at the time.

Later, Prof. Yuan-Yuan Zhou discovered that copy-paste is more common in application source code than in operating systems. So her team applied CP-Miner technology to application source code and founded PatternInsight. Many IT companies purchased PatternInsight's products and asked for customized services, and PatternInsight was eventually acquired by VMWare.

This story shows that programmers in big companies may be no better than you in their programming style, and that they too can write copy-paste code which is difficult to maintain. But on the other hand, emphasizing coding style is something that companies look for, and it's something you can start developing now.

A good practice is to decouple the code related to decoding, execution and operand widths, which is referred to as structured programming in programming classes. In the framework code, decoupling between decoding and execution is achieved by using pattern matching rules defined by INSTPAT, so that we can write decoding and execution logic separately, and then combine them: this design makes it easy to implement multiple instructions with the same execution behavior but different decoding methods. For x86, the decoupling between operand width and decode/execute is achieved by the width members of the ISADecodeInfo structure, which record the operand widths. During decode and execute, we will perform different operations according to the width, so that we can handle different operand widths with the same code for decode and execute.

RTFSC Understanding the process of instruction execution

This section is full of details, and you may need to read the handouts and code several times to understand each detail. According to the feedback from previous students, an effective way to understand the details is to organize them by taking notes. In fact, it's even more effective when combined with GDB.

Just in case you haven't made any progress in understanding the code for a long time, let's add a mandatory question.

Please write down how an instruction is executed in NEMU.

With the exception of nemu/src/device and nemu/src/isa/$ISA/system, the rest of the NEMU code you are already capable of understanding. So don't feel like you don't need to look at the files that aren't mentioned in the handout, try to understand every detail as much as you can! These details will be the clues that will help you debug when you run into a bug.

Run the first C program

Having said that, it's time to get hands-on. Start by cloning a new sub-project am-kernels (which you may have already cloned in PA1), which contains some test programs:

cd ics2023
bash init.sh am-kernels

Your first task in PA2, is to implement a number of instructions to make the first simple C program run in NEMU. This simple C program is am-kernels/tests/cpu-tests/tests/dummy.c, and it returns without doing anything.

Preparing the cross-compilation environment

If you choose an ISA other than x86, you will also need to have the appropriate gcc and binutils to compile correctly.

  • mips32
    • apt-get install g++-mips-linux-gnu binutils-mips-linux-gnu
  • riscv32(64)
    • apt-get install g++-riscv64-linux-gnu binutils-riscv64-linux-gnu

In the am-kernels/tests/cpu-tests/ directory type

make ARCH=$ISA-nemu ALL=dummy run

to compile the dummy program, and start NEMU to run it.

Fix riscv32 compilation errors

If you choose riscv32, and report the following error when compiling the dummy program.

/usr/riscv64-linux-gnu/include/bits/wordsize.h:28:3: error: #error "rv32i-based targets are not supported"

Then you need to modify the following files with sudo privileges.

--- /usr/riscv64-linux-gnu/include/bits/wordsize.h
+++ /usr/riscv64-linux-gnu/include/bits/wordsize.h
@@ -25,5 +25,5 @@
 #if __riscv_xlen == 64
 # define __WORDSIZE_TIME64_COMPAT32 1
 #else
-# error "rv32i-based targets are not supported"
+# define __WORDSIZE_TIME64_COMPAT32 0
 #endif

If the following error is reported.

/usr/riscv64-linux-gnu/include/gnu/stubs.h:8:11: fatal error: gnu/stubs-ilp32.h: No such file or directory

Then you need to modify the following files with sudo privileges.

--- /usr/riscv64-linux-gnu/include/gnu/stubs.h
+++ /usr/riscv64-linux-gnu/include/gnu/stubs.h
@@ -5,5 +5,5 @@
 #include <bits/wordsize.h>

 #if __WORDSIZE == 32 && defined __riscv_float_abi_soft
-# include <gnu/stubs-ilp32.h>
+//# include <gnu/stubs-ilp32.h>
 #endif

What's the right way to fix errors?

Note here modifying files directly in a tool chain is usually a quick and dirty fix which might lead to hard to find problems in the long run. Please think what is the correct way to fix the problem. You can starting by STFW for the error message and looking through the RISCV cross-compile toolchain github page.

In fact, not every program can run in NEMU, and the abstract-machine sub-project is specifically designed to compile programs to run in NEMU, which we will return to in the next sub-section.

Run the dummy program in NEMU, and you will find that NEMU outputs the following message (using riscv32 as an example).

invalid opcode(PC = 0x80000000):
        13 04 00 00 17 91 00 00 ...
        00000413 00009117...
There are two cases which will trigger this unexpected exception:
1. The instruction at PC = 0x80000000 is not implemented.
2. Something is implemented incorrectly.
Find this PC(0x80000000) in the disassembling result to distinguish which case it is.

If it is the first case, see
       _                         __  __                         _
      (_)                       |  \/  |                       | |
  _ __ _ ___  ___ ________   __ | \  / | __ _ _ __  _   _  __ _| |
 | '__| / __|/ __|______\ \ / / | |\/| |/ _` | '_ \| | | |/ _` | |
 | |  | \__ \ (__        \ V /  | |  | | (_| | | | | |_| | (_| | |
 |_|  |_|___/\___|        \_/   |_|  |_|\__,_|_| |_|\__,_|\__,_|_|

for more details.

If it is the second case, remember:
* The machine is always right!
* Every line of untested code is always wrong!

This is because you haven't implemented the 0x00000413 instruction yet, so you need to start adding instructions to NEMU.

Why does the above error message appear after executing an unimplemented instruction?

RTFSC, Understand what NEMU does when executing unimplemented instructions.

What instructions should be implemented to get dummy up and running in NEMU? The answer is in the disassembly result (am-kernels/tests/cpu-tests/build/dummy-$ISA-nemu.txt): you only need to implement the instructions that are not currently implemented. The pattern matching rules introduced by the framework code greatly simplifies the implementation of client instructions in NEMU. In order to implement a new instruction, all you need to do is to add the correct pattern matching rules to nemu/src/isa/$ISA/inst.c.

Cross-compilation toolchain

If your ISA is not x86, you need to use the appropriate cross-compiled version of tools (e.g. objdump, readelf, etc.) to view the binary information of the client program, such as mips-linux-gnu-objdump, riscv64-linux-gnu-readelf, etc., In particular, if your ISA of choice is riscv32, you can also use the riscv64-prefixed cross-compiled toolchain.

Again, you must RTFM to check the function of the instruction, and not take it for granted. The manual gives a complete description of the function of the instruction (including what it does, how it does it, and what it does), so be sure to read every word carefully, as misunderstandings and omissions can cause huge problems in debugging later on.

Some more x86 tips please!

  • call: call instruction has many forms, but in PA will only use a few of them, just need to implement the CALL rel32 form for now. As for the jump address, the framework code already has a lot of hints, treat it as an exercise for RTFSC.
  • push: Now we just need to implement PUSH r32 and PUSH imm32.
  • sub: Before implementing the sub instruction, you first need to implement the EFLAGS register. You just need to add the EFLAGS register to the register structure. EFLAGS is a 32-bit register, but in NEMU, we will only use the following five bits in EFLAGS: CF, ZF, SF, IF, OF, so the rest of the bits can be left unimplemented. Please refer to the i386 manual for the meaning of each bit in the EFLAGS register. Once you have implemented the EFLAGS register, you can implement the sub instructions
  • xor, ret: RTFM it.

Run the first client program

Implement the above mentioned instructions in NEMU, please be sure to refer to the manual for details. After successful implementation, run the client program dummy in NEMU, you will see the message HIT GOOD TRAP. If you don't see this message, your implementation is incorrect, and you can use the simple debugger implemented in PA1 to help you debug it.

Run more programs

Untested code is always wrong, you need more test cases to test your NEMU. We have prepared some simple test cases in the am-kernels/tests/cpu-tests/ directory. In that directory, run

make ARCH=$ISA-nemu ALL=xxx run

where xxx is the name of the test case (without the .c suffix).

The above make run command will eventually start NEMU and run the corresponding client program. If you need to use GDB to debug NEMU running client programs, you can execute the following command.

make ARCH=$ISA-nemu ALL=xxx gdb

Implement more instructions

You need to implement more instructions in order to pass the above test cases.

You are free to choose the order in which you want to implement the instructions. After training in PA1, you probably won't implement all the instructions before testing them. It's good to get into the habit of testing as early as possible, and the general rule is "implement as few instructions as possible for the next test". You don't need to implement all the instructions in all their forms, you just need to pass these tests. If you still encounter unimplemented directives in the future, implement them then.

The framework code already implements some of the instructions, but may not have written the corresponding pattern-matching rules. Also, some of the functions are not fully implemented (the framework code has inserted TODO() as a hint), and you will need to write the corresponding functions.

Since string and hello-str need additional implementation to work (described in the following subsections), you can use other test cases for now.

Don't think you only need to write code at the TODO

In the past, students often thought, "I only need to write code wherever a TODO appears, and if a feature doesn't have a corresponding TODO in the framework code, it's beyond the scope of must-do content, and I don't need to implement it."

In PA, this kind of thinking is wrong. If you RTFSC, you'll see that TODO() is just a macro that expands and calls panic(). So the TODO in the framework code is more about giving better readable results (e.g. xxx is not implemented) when NEMU runs, rather than making NEMU trigger segmentation errors that make you cringe.

When you graduate and join a company/project team, there will be no more handouts telling you exactly what to do, and there will come a time when you will have to do things independently. We want you to give up the illusion that the handouts and framework code will tell me everything I need to do in detail, and take responsibility for "understanding how the system works", rather than becoming a slave to the framework code. Therefore, when you are wondering whether a feature needs to be implemented, you should not judge it by the presence of a TODO in the framework code, but rather make a decision based on your understanding of the code and the needs of the moment.

Notes related to x86 instructions

  • The push imm8 instruction requires a sign extension for immediate numbers, which is not explicitly stated in the i386 manual. In IA-32 manualopen in new window it describes push as follow.

If the source operand is an immediate and its size is less than the operand size, a sign-extended value is pushed on the stack.

  • String manipulation instructions. Such as movsb, these instructions require the segment registers DS, ES and the DF flag in the EFLAGS register. There is no need to implement these registers in PA, and when RTFM you can assume that the value of these registers is always 0 to understand the semantics of the instruction.
  • The endbr32 instruction. See here for details

Branch delay slot for mips32

In order to improve processor performance, mips uses a technique called branch delay slotopen in new window. With this technique, the execution order of the program changes a bit: we call the static instructions immediately after the jump instructions (both conditional and unconditional) the delay slots, then program will first execute the instructions in the delay slots after the jump instructions, then execute the instructions in the jump target. For example

100: beq 200
101: add
102: xor
...
200: sub
201: j   102
202: slt

If the execution of the beq instruction results in a jump, the corresponding dynamic instruction flow is 100 -> 101 -> 200; if the execution of the beq instruction results in no jump, the corresponding dynamic instruction flow is 100 -> 101 -> 102; and for the j instruction, the corresponding dynamic instruction flow is 201 -> 202 -> 102.

You must be wondering how this counter-intuitive technique can improve processor performance. It requires some knowledge of microarchitecture, such as processor pipeliningopen in new window, but that's beyond the scope of the ICS course, so we won't explain it in detail, and you can STFW if you're interested.

But we know that delay slots require hardware and software collaboration to work correctly: the mips manual describes the convention, the processor designer designs the processor according to this convention, and the compiler developer makes the compiler responsible for placing a meaningful instruction in a delay slot, so that the sequence of execution according to this convention results in the correct execution, regardless of whether or not there is a jump.

If you were a compiler developer, how would you find the right instruction to put in the delay slot?

Branch delay slot for mips32-NEMU

Since mips has such a convention, and the compiler has followed it, we should also follow this convention in interpreting the semantics of programs generated by the mips32 compiler. This means that mips32-NEMU, as an emulated mips32 CPU, also needs to implement branch delay slot technology in order to properly support mips32 programs.

In fact, gcc provides a -fno-delayed-branch compilation option for mips32 program, so that nop instructions are placed in all delay slots in the mips32 program. This way, after executing the jump instruction, the next instruction can be executed directly at the jump target. Since the delay slots are filled with nop instructions, the correctness of the program will not be affected even if you don't execute them.

We have added this compilation option to the command that compiles the mips32 program, so we can simplify the implementation of mips32-NEMU by eliminating the need to implement branch delay slots.

There are other benefits to PAs of removing delay slots, which we will discuss in a later section.

instruction name cross-reference

There are a few instructions in the AT&T format disassembly results that don't match the names of the instructions listed in the manual, e.g., cltd for x86, and a number of pseudo instructions for mips32 and riscv32. Do you have a way to find the corresponding instructions in the manual other than STFW? If so, why does this work?

Hint

This concludes phase 1 of PA2.