C2 RV32E Single Cycle NPC

You have implemented the NPC with the minirv instruction set. However, minirv has only 8 core instructions, so a program compiled for minirv expands into more instructions and runs more slowly. One way to improve program performance is to implement more of the common instructions in NPC, so that programs can be compiled for the full RV32E and contain fewer instructions. We therefore need to "upgrade" the ISA used by NPC from minirv to RV32E. In phase A, we will further "upgrade" NPC to RV32IMAC.

Note that NEMU uses RV32IM, which differs from the RV32E used by NPC. However, RV32IM provides everything that RV32E requires, so RV32E programs can run directly on RV32IM processors. Therefore, as long as the program has been compiled for RV32E, DiffTest works normally even with the RV32IM NEMU as the REF.

Building Infrastructure for NPC

By completing the PA, you should have realized the importance of infrastructure. There are four major infrastructure components in PA: sdb, trace, native, and DiffTest. Except for native, which belongs to AM, the other three infrastructure components can all be built in NPC.

I can already view waveforms, so why do I still need these infrastructure components?

This is to prevent everyone from becoming mere tool operators and thus wasting their lives.

Waveforms do contain information about all signals in the circuit at every cycle, but this information is too low-level. They cannot carry any high-level semantics, meaning users have to sift through massive amounts of information to find errors themselves.

In fact, errors caused by bugs manifest at different levels of abstraction. For example, a misplaced signal in the RTL implementation might, when reflected in program execution, result in fetching the wrong instruction, accessing illegal memory, or returning to an incorrect position from a function... While you could eventually analyze these errors from waveforms of 0s and 1s changing, wouldn't it be better if you could directly spot the problem from the output of itrace/mtrace/ftrace? Why waste so much time doing things that tools are already good at? Moreover, if the bug is at the software level, debugging by looking at waveforms is just asking for trouble.

A scientific debugging process first requires understanding how programs run on a computer. Additionally, it involves understanding the advantages and disadvantages of various tools, and selecting the right tool for analyzing problems based on different scenarios. From the perspective of abstraction layers in a computer system, we can observe program execution behavior at different levels:

Program -> Module -> Function -> Instruction -> Memory access -> Bus -> Signal

The higher the layer, the easier it is to understand the behavior, but the more vague the details become; the lower the layer, the more precise the details, but the harder it is to understand the behavior. Therefore, a scientific debugging method should be:

First, use appropriate software tools to help you quickly locate the approximate location of the bug.
Then, combine waveforms to conduct more fine-grained diagnosis within a small range to find the exact location of the bug.

Build sdb for NPC

You need to implement functions for NPC such as single-step execution, printing registers, and scanning memory. Expression evaluation and watchpoints are both based on printing registers and scanning memory. Single-step execution and memory scanning are relatively easy to implement.

To print registers, you need to access the general-purpose registers in the RTL. There are two ways to do this, and you can choose either:

Access via DPI-C.
Access general-purpose registers through C++ files compiled by Verilator, such as top->rootp->NPC__DOT__isu__DOT__R_ext__DOT__Memory. The specific C++ variable name is related to the module names and variable names in Verilog, and can be found by reading the compiled C++ header files. However, the C++ variable name may change when modifying RTL code or changing Verilator versions, requiring manual synchronization.
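
Whichever method you choose, the sdb side ends up reading an array of register values. The following is a minimal sketch in C, assuming the 16 registers of RV32E. The names npc_set_gpr_ptr(), npc_reg_display() and npc_reg_str2val() are hypothetical, and how the pointer is obtained (through DPI-C or a Verilator-generated member) depends on your own design.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NR_GPR 16 /* RV32E has x0..x15 */

/* Standard RISC-V ABI names of x0..x15. */
static const char *gpr_names[NR_GPR] = {
  "zero", "ra", "sp", "gp", "tp", "t0", "t1", "t2",
  "s0", "s1", "a0", "a1", "a2", "a3", "a4", "a5",
};

/* Points at the register values of the RTL (hypothetical hook). */
static const uint32_t *gpr = NULL;

/* Called once by the simulation environment to register the array. */
void npc_set_gpr_ptr(const uint32_t *regs) { gpr = regs; }

/* Print every general-purpose register, one per line. */
void npc_reg_display(void) {
  if (gpr == NULL) return;
  for (int i = 0; i < NR_GPR; i++) {
    printf("%-4s 0x%08x\n", gpr_names[i], (unsigned)gpr[i]);
  }
}

/* Look up a register by ABI name; used by expression evaluation. */
uint32_t npc_reg_str2val(const char *name, bool *success) {
  for (int i = 0; gpr != NULL && i < NR_GPR; i++) {
    if (strcmp(name, gpr_names[i]) == 0) {
      *success = true;
      return gpr[i];
    }
  }
  *success = false;
  return 0;
}
```

With such a lookup in place, expression evaluation and watchpoints can reuse npc_reg_str2val() instead of touching the RTL directly.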

Add trace support to NPC

You have already implemented itrace, mtrace, and ftrace in NEMU; try to implement them in NPC. For implementing itrace, note the following two points:

You need to obtain the currently executing instruction via DPI-C.
You need to link the capstone library; for details, you can refer to nemu/src/utils/filelist.mk.

After obtaining the currently executing instruction and memory access information in the simulation environment, implementing mtrace and ftrace will not be difficult.
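
For ftrace, one possible heuristic is to classify each executed instruction by its encoding: jal or jalr writing to ra is treated as a call, and jalr x0, 0(ra) as a return. The sketch below is only an assumption about how you might do it; ftrace_classify() and ftrace_kind_t are hypothetical names, and tail calls (jumps that do not write ra) are not detected.

```c
#include <stdint.h>

typedef enum { FT_NONE, FT_CALL, FT_RET } ftrace_kind_t;

/* Classify a 32-bit RISC-V instruction as call, return, or neither. */
ftrace_kind_t ftrace_classify(uint32_t inst) {
  uint32_t opcode = inst & 0x7f;
  uint32_t rd     = (inst >> 7) & 0x1f;
  uint32_t funct3 = (inst >> 12) & 0x7;
  uint32_t rs1    = (inst >> 15) & 0x1f;
  uint32_t imm    = inst >> 20; /* I-type immediate, only used for jalr */

  int is_jal  = (opcode == 0x6f);
  int is_jalr = (opcode == 0x67) && (funct3 == 0);

  /* ret is the pseudo-instruction for jalr x0, 0(ra). */
  if (is_jalr && rd == 0 && rs1 == 1 && imm == 0) return FT_RET;
  /* A jump that saves the return address in ra (x1) is a call. */
  if ((is_jal || is_jalr) && rd == 1) return FT_CALL;
  return FT_NONE;
}
```

The trace code can then map the jump target to a function name using the symbol table of the ELF file, as you did in NEMU.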

Add DiffTest Support to NPC

DiffTest is a powerful tool for processor debugging. It is wise to set up DiffTest for NPC before implementing more instructions in it. Here, the DUT is NPC, and the REF is NEMU. To do this, you need to:

Implement the DiffTest API in nemu/src/cpu/difftest/ref.c, including difftest_memcpy(), difftest_regcpy() and difftest_exec(). In addition, difftest_raise_intr() is prepared for interrupts and is not used currently.
Select the shared library as the compilation target in NEMU's menuconfig:

Build target (Executable on Linux Native)  --->
  (X) Shared object (used as REF for differential testing)

Recompile NEMU. After successful compilation, the dynamic library file nemu/build/riscv32-nemu-interpreter-so will be generated.
Link the above dynamic library file in the simulation environment of NPC through dynamic linking to implement the DiffTest function via its API. You can refer to the relevant code of NEMU for details.
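
The comparison step on the DUT side can be sketched as follows. This is a sketch under stated assumptions: cpu_state_t is a hypothetical layout that must match whatever your difftest_regcpy() in ref.c copies, and the function-pointer types mirror the API signatures used in the NEMU framework (check them against your own ref.c). In a real environment the pointers would be filled in with dlopen() and dlsym() on the shared library.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define NR_GPR 16 /* RV32E */

/* Hypothetical register layout shared by DUT and REF. */
typedef struct {
  uint32_t gpr[NR_GPR];
  uint32_t pc;
} cpu_state_t;

/* Types of the REF API symbols, to be resolved with dlsym() (assumed). */
typedef void (*ref_memcpy_t)(uint32_t addr, void *buf, size_t n, bool direction);
typedef void (*ref_regcpy_t)(void *dut, bool direction);
typedef void (*ref_exec_t)(uint64_t n);

/* Compare the REF state with the DUT state after one instruction.
 * Returns true when they agree, and reports every difference otherwise. */
bool difftest_checkregs(const cpu_state_t *ref, const cpu_state_t *dut) {
  bool same = true;
  if (ref->pc != dut->pc) {
    printf("pc differs: ref = 0x%08x, dut = 0x%08x\n",
           (unsigned)ref->pc, (unsigned)dut->pc);
    same = false;
  }
  for (int i = 0; i < NR_GPR; i++) {
    if (ref->gpr[i] != dut->gpr[i]) {
      printf("x%d differs: ref = 0x%08x, dut = 0x%08x\n",
             i, (unsigned)ref->gpr[i], (unsigned)dut->gpr[i]);
      same = false;
    }
  }
  return same;
}
```

After each instruction executed by NPC, the environment would let the REF execute one instruction, copy the REF registers out, and stop the simulation when difftest_checkregs() returns false.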

Try to run the dummy program correctly in NPC with the DiffTest mechanism enabled. To check if the DiffTest mechanism works, you can inject an error into the implementation of the addi instruction in NPC and observe whether DiffTest can report this error as expected.

Note that if you later want to run NEMU itself again, you need to switch the build target back in NEMU's menuconfig and recompile NEMU into an ELF executable.

Can I choose Spike as the REF?

Considering that NEMU is simpler to implement than Spike and that everyone is more familiar with it, we still recommend using your own NEMU as the REF. Someday you will need to add some personalized features to the REF to help you debug, and we don't want you to feel that the REF code has nothing to do with you. That said, if you are comfortable reading open-source code, you can use Spike as the REF.

Implementing the RV32E Instruction Set

To compile programs for RV32E, we need to set up a corresponding runtime environment in AM. The AM project already provides a basic framework for riscv32e-npc, and you need to make some improvements based on it. Since you have already set up AM for minirv-npc, this should not be difficult for you.

Set up AM for riscv32e-npc

You need to complete the following:

Provide a run target for riscv32e-npc to support one-click program compilation and simulation.
Implement the halt() function in riscv32e-npc to notify the NPC simulation environment to end the simulation, and let it know whether the program's running result is correct.
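
On the simulation side, ending the run can be sketched as below. This assumes one possible convention, similar to NEMU's trap: halt(code) places the code in a0 and executes ebreak, and the RTL then calls a host function (for example through DPI-C). The names npc_trap(), npc_is_running() and npc_exit_status() are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

typedef enum { NPC_RUNNING, NPC_END } npc_state_t;

static npc_state_t npc_state = NPC_RUNNING;
static uint32_t npc_halt_ret = 0;

/* Called when the RTL executes the trap instruction (assumed: ebreak).
 * pc is the address of that instruction, code is the value of a0. */
void npc_trap(uint32_t pc, uint32_t code) {
  npc_state = NPC_END;
  npc_halt_ret = code;
  printf("NPC: %s at pc = 0x%08x\n",
         code == 0 ? "HIT GOOD TRAP" : "HIT BAD TRAP", (unsigned)pc);
}

/* The main simulation loop keeps stepping while this returns nonzero. */
int npc_is_running(void) { return npc_state == NPC_RUNNING; }

/* Exit status of the simulator process: 0 only for a good trap. */
int npc_exit_status(void) { return npc_halt_ret == 0 ? 0 : 1; }
```

Returning a nonzero exit status for a bad trap lets the run target in the Makefile report failing tests automatically.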

After preparing these infrastructure components, you can conveniently implement more RV32E instructions in NPC. You have already implemented these instructions in NEMU, but there are some details to note when implementing them in RTL:

Arithmetic and logic instructions: The execution of these instructions is mainly completed by the ALU unit, which you have encountered in digital circuit experiments. Specifically:
- Addition and subtraction operations - You have already implemented two's complement addition when implementing the addi instruction earlier. In circuits, two's complement subtraction can be implemented with two's complement addition, since a - b equals a + ~b + 1. In RISC-V, addition and subtraction instructions do not need to detect carry or overflow.
- Logical operations - These are straightforward.
- Shift operations - These are also not difficult and can be implemented directly using operators.
- Comparison operations - These can be reduced to subtraction: the comparison result is derived from the result of the subtraction.
Branch instructions: Whether a branch is taken can be determined by a subtraction in the ALU.
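
The idea of sharing one adder between subtraction, comparison and branches can be checked with a small behavioral model in C before writing RTL. This is a sketch, not a required structure; alu_sub() and sub_flags_t are hypothetical names.

```c
#include <stdint.h>

typedef struct {
  uint32_t result; /* a - b, computed as a + ~b + 1 */
  int eq;          /* a == b                        */
  int lt;          /* a <  b, signed   (slt, blt)   */
  int ltu;         /* a <  b, unsigned (sltu, bltu) */
} sub_flags_t;

/* Model of a 32-bit adder that is reused for subtraction and comparison. */
sub_flags_t alu_sub(uint32_t a, uint32_t b) {
  /* 33-bit addition: bit 32 is the carry-out of the adder. */
  uint64_t wide = (uint64_t)a + (uint64_t)(uint32_t)~b + 1u;
  uint32_t res  = (uint32_t)wide;
  int carry     = (int)((wide >> 32) & 1);
  /* Signed overflow: operands have different signs and the result's
   * sign differs from the sign of a. */
  int overflow  = (int)(((a ^ b) & (a ^ res)) >> 31);
  int sign      = (int)(res >> 31);

  sub_flags_t f;
  f.result = res;
  f.eq     = (res == 0);
  f.lt     = sign ^ overflow; /* signed less-than         */
  f.ltu    = !carry;          /* no carry-out means borrow */
  return f;
}
```

In this model beq, blt and bltu use eq, lt and ltu directly, while bne, bge and bgeu use their negations, so all six branches and both set-less-than instructions need only one subtraction.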

How does hardware distinguish between signed and unsigned numbers?

Try writing the following program:

#include <stdint.h>
int32_t fun1(int32_t a, int32_t b) { return a + b; }
uint32_t fun2(uint32_t a, uint32_t b) { return a + b; }

Then compile and view the disassembly:

riscv64-linux-gnu-gcc -c -march=rv32g -mabi=ilp32 -O2 test.c
riscv64-linux-gnu-objdump -d test.o

What's the difference between these two functions? Think about why this is the case.

If you're a beginner, try drawing the architecture diagram yourself

If you're new to processor design, try drawing a complete single-cycle processor architecture diagram.

Observe the synthesis results of the ALU

Try synthesizing the ALU using the yosys-sta project, observe the synthesis results, and answer the following questions:

We know that two's complement subtraction can be implemented with an adder, and comparison instructions and branch instructions also essentially need to use two's complement subtraction. If we directly write various operators like - or < in RTL code, can yosys automatically merge their subtraction functions into the same adder?
What circuits do the shift operators << and >> get synthesized into by yosys?
Is there room for improvement when yosys synthesizes circuits directly from operators?

Hint: If you find the synthesis results for 32-bit data difficult to read, you can consider first observing and analyzing the synthesis results for 16-bit, 8-bit, or even 4-bit data.

Run all previous tests correctly on NPC

With the strong support of the infrastructure, you should be able to easily and correctly implement NPC supporting RV32E. After implementation, try recompiling all previously run tests for riscv32e-npc and then run them on NPC.

RV32E does not include multiply and divide instructions. How can NPC correctly run C programs containing multiplication and division operations?

This is because the RISC-V instruction set is modular, and gcc decides how to compile multiplication and division operations based on whether the target instruction set includes the M extension. If it does not, gcc compiles multiplication and division operations into calls to functions such as __mulsi3(). These functions provide software-emulated versions of integer arithmetic operations, that is, they calculate the results of multiplication and division using addition and subtraction operations. The declarations of these functions are documented on this page, and their function bodies are in the library libgcc. Usually, libgcc is linked into the ELF executable file during the linking process.

We have ported some common software-emulated functions corresponding to integer multiplication and division operations in libgcc to riscv32e-npc. Therefore, we can compile ELF executable files that can correctly perform multiplication and division operations without including multiply and divide instructions.
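
To see how multiplication can work without a multiply instruction, consider the classic shift-and-add algorithm below. It is only an illustration of the idea, not the actual source of __mulsi3() in libgcc; soft_mul32() is a hypothetical name. Every operation it uses (add, shift, and, branch) exists in RV32E.

```c
#include <stdint.h>

/* Multiply two 32-bit values using only shifts and additions.
 * The low 32 bits of the product are the same for signed and
 * unsigned operands, so one routine serves both cases. */
uint32_t soft_mul32(uint32_t a, uint32_t b) {
  uint32_t product = 0;
  while (b != 0) {
    if (b & 1) {
      product += a; /* add the current partial product */
    }
    a <<= 1;        /* next bit of b weighs twice as much */
    b >>= 1;
  }
  return product;
}
```

The loop runs at most 32 iterations, which also shows why software-emulated multiplication is much slower than a hardware multiplier.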

Emancipate your mind and use the right tools for the job

Some students have asked: Why bother with tools like Verilator and Makefile when you can just click a button in ModelSim? This is because relying solely on waveforms for debugging is not a scientific approach. For small-scale programs like cpu-tests, you can get by even if you insist on debugging with waveforms; but as the scale of programs grows, debugging efficiency will drop drastically: If an error occurs after 100 million cycles of simulation, how will you find the error in the waveform?

However, most students have not previously thought about how to improve debugging efficiency. In fact, this is not because of a lack of ability (for example, a trace is essentially just a printf() statement), but because they are constrained by various unprofessional mindsets:

I'm not from the computer science department, so software has nothing to do with me.
I'm here to work on hardware; the software part can be just phoned in.
Companies now use Quartus/Vivado; using Verilator in "One Student, One Chip" is outdated.

These mindsets make people instinctively resist engaging with ideas from the software field. For example, in the Loongson Cup competition, successfully booting Linux is the pinnacle achievement in the system demonstration part. Yet, judging from the results, not every participating team reaches this peak. But we believe that as long as you learn to use the right tools, anyone can successfully boot Linux on a self-designed CPU within a reasonable time. For instance, in the third phase of "One Student, One Chip" (which lasted 3 months), an electronics student who had never designed a CPU before booted Debian Linux on his own CPU by himself. In fact, even writing a small script can sometimes significantly improve your work efficiency. Rather than clinging to traditional methods, understanding, learning from, and absorbing advanced methods from other fields can make you stronger.

If you're a beginner, you can now look at the architecture diagrams in textbooks

If you're new to processor design, try comparing the single-cycle processor architecture diagram you drew with the ones in textbooks. Think about it: which architecture is better, and why?
