E4 From C Code to Binary Program

The online site Compiler Explorer is great for developing programs with a single source file, but it becomes inconvenient for programs with multiple source files, or when multiple programs need to be coordinated to complete more complex tasks. After installing the Linux system, we will migrate our development environment to Linux and use some of its handy tools to continue the advanced study of C, laying a foundation for setting up the development environment for subsequent tasks.

Advanced learning of C Language in Linux:

In Linux, complete exercises 0-22, 24-25, and 27-33 from Learn C the Hard Way. You need to copy the sample code into Linux to compile and run it, RTFSC along with the text to understand the sample code, and complete the corresponding extra questions.

Note that the exercises mentioned above are part of the Entrance Interview. If you get stuck on certain questions, you can STFW or ask in the Pre-Study group chat.

Although the C language topics in this tutorial overlap with those in the previous “One-stop Learning for Linux C Programming” tutorial, it also introduces various tools such as Makefile and Valgrind, as well as some programming concepts worth learning. These will be very helpful in future project development and are well worth practicing on Linux. In fact, if your C fundamentals are solid enough, you should not need to spend much time reading the code; conversely, if you run into many difficulties while reading it, you should appreciate this opportunity to identify gaps in your understanding.

After becoming familiar with C language development on Linux, you already know that programs can only run after the C code is compiled into a binary executable file. So how does the compiler transform C code into an executable file? To answer this question, we will use some tools available in Linux to better understand the steps involved in this process.

Preprocessing

Preprocessing is a step that occurs before the actual compilation, and its essence is text processing. Preprocessing mainly includes the following tasks:

  • Including header files
  • Macro substitution
  • Removing comments
  • Joining the strings split by line continuation characters (\ at the end of a line)
  • Handling conditional compilation #ifdef/#else/#endif
  • Handling the stringification operator #
  • Handling the identifier concatenation operator ##

For example, consider the following C code:

// a.c
#include <stdio.h>
#define MSG "Hello \
World!\n"
#define _str(x) #x
#define _concat(a, b) a##b
int main() {
  printf(MSG /* "hi!\n" */);
#ifdef __riscv
  printf("Hello RISC-V!\n");
#endif
  _concat(pr, intf)(_str(RISC-V));
  return 0;
}

You can preprocess the above C code by using this command:

gcc -E a.c

Observe the preprocessing results

Try running the above gcc command, then compare the preprocessing output with the source file and observe the differences.

A question worth discussing is: how does gcc find header files? To answer this question, we can refer to the tool’s logs and relevant manuals.

How to find header files

  1. Try executing gcc -E a.c --verbose > /dev/null, and look for header file-related content in the output results.
  2. Search for the description related to the -I option in man gcc and read it.

After you understand this, try creating your own stdio.h file, then use the -I option to make gcc include your stdio.h instead of the one from the standard library. Use the -E option to check whether the preprocessing results meet your expectations.
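If you are not sure where to start, here is a minimal sketch of such a replacement header; the directory name my_include and the include-guard name are arbitrary assumptions:

// my_include/stdio.h -- a deliberately tiny stand-in for the real header
#ifndef MY_FAKE_STDIO_H
#define MY_FAKE_STDIO_H
int printf(const char *fmt, ...);   // just enough of a declaration for a.c to keep compiling
#endif

Then run something like gcc -E -I./my_include a.c and look at the linemarkers (the lines starting with #) in the output to confirm which stdio.h was actually included.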

Observe the preprocessing results (2)

Try installing gcc targeted for the RISC-V architecture and use it for preprocessing:

apt-get install g++-riscv64-linux-gnu
riscv64-linux-gnu-gcc -E a.c

Check the preprocessing results at this point. What new changes do you notice?

The macro __riscv in the above C code is quite special, as it is predefined by riscv64-linux-gnu-gcc. Therefore, even if it is not directly defined in the C code, riscv64-linux-gnu-gcc still processes it as defined.

You can view all predefined macros with the following command:

echo | gcc -dM -E - | sort

This command has gcc preprocess empty input from stdin (-), print all macros defined during preprocessing (-dM), and then sort the result.

Compare the predefined macros of gcc and riscv64-linux-gnu-gcc

Try to compare the predefined macros of gcc and riscv64-linux-gnu-gcc to understand the differences between the two during preprocessing. You only need to have a basic understanding of these differences; there’s no need to delve deeply into the specific meanings of each macro.

Hint:

  • Using diff or related commands can help you quickly identify the differences between the two files.
  • If you want to understand the meaning of some macros, you can refer to the gcc manual.

Compilation

In a broad sense, compilation is the process of converting one language into another. For a C compiler, compilation is the process of converting C into a target language; the target language is related to the ISA and typically refers to the assembly language of the target ISA. For example, on a computer with an x86 architecture, gcc converts C into x86 assembly language, while riscv64-linux-gnu-gcc converts C into riscv64 assembly language.

This process involves many details, and we will not delve into how each step is implemented. However, we will use appropriate tools to understand what each step does, thereby establishing a basic understanding of the compilation process.

To do this, you need to install clang, which is functionally equivalent to gcc and is also a C language compiler.

apt-get install clang

At the same time, we will observe the steps in the compilation process through the following program:

// a.c
#include <stdio.h>
int main() { // compute 10 + 20
  int x = 10, y = 20;
  int z = x + y;
  printf("z = %d\n", z);
  return 0;
}

Understand the compilation process

Try to refer to man clang and read the section that introduces the compilation stages to gain a general understanding of the compilation process.

Lexical Analysis

The purpose of lexical analysis is to recognize and record each token in the source file, including identifiers, keywords, constants, strings, operators, braces, semicolons, etc. If an illegal token (e.g., @) is encountered, an error is reported.

You can view the results of lexical analysis using the following command:

clang -fsyntax-only -Xclang -dump-tokens a.c

You will see that the results of lexical analysis also record the position of each token in the format filename:line number:column number.

In fact, a C language source file is essentially a text file, so it can also be viewed as a string, and the lexical analysis tool can be considered as a string matching program.

Lexical Analysis and Syntax Highlighting

You may have used the syntax highlighting feature in your editor. In fact, this feature is not difficult to implement: according to the definitions in the ANSI C standard, you can write a simple lexical analysis tool that recognizes certain keywords in C code and outputs them in different colors. If you want to output to the terminal, you can use the ANSI escape code color feature.
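As a rough illustration of this idea, the sketch below scans one line of C code and prints a handful of keywords in color using ANSI escape codes. The keyword list and the escape sequences used here are minimal assumptions; a real highlighter would also have to handle strings, comments, and many more kinds of tokens:

// highlight.c -- a toy keyword highlighter (not a full lexical analyzer)
#include <stdio.h>
#include <string.h>
#include <ctype.h>

static const char *keywords[] = { "int", "return", "if", "else", "for", "while" };
#define NR_KW (sizeof(keywords) / sizeof(keywords[0]))

static int is_keyword(const char *s) {
  for (size_t i = 0; i < NR_KW; i++) {
    if (strcmp(s, keywords[i]) == 0) return 1;
  }
  return 0;
}

int main() {
  const char *src = "int main() { if (1) return 0; else return 1; }";
  char token[64];
  int len = 0;
  for (const char *p = src; ; p++) {
    if (isalnum((unsigned char)*p) || *p == '_') {
      if (len < (int)sizeof(token) - 1) token[len++] = *p;   // accumulate an identifier-like token
    } else {
      token[len] = '\0';
      if (len > 0) {
        if (is_keyword(token)) printf("\033[1;34m%s\033[0m", token);   // print keywords in bold blue
        else printf("%s", token);
      }
      len = 0;
      if (*p == '\0') break;
      putchar(*p);                                           // pass other characters through unchanged
    }
  }
  putchar('\n');
  return 0;
}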

Syntax Analysis

The task of syntax analysis is to organize the recognized tokens into a tree structure according to the syntax of the C language, thereby clarifying the hierarchical structure of the source program, from files and functions to statements, expressions, and variables. If a syntax error is encountered (such as a missing semicolon), an error is reported. The result of syntax analysis is usually presented in the form of an Abstract Syntax Tree (AST).

You can view the results of syntax analysis using the following command:

clang -fsyntax-only -Xclang -ast-dump a.c

Semantic Analysis

The task of semantic analysis is to determine the type of each expression in the AST according to the semantics of the C language. In this process, compatible types undergo type conversion according to the C language standard (such as arithmetic type promotion), and cases that do not conform to the semantics are reported as errors. Some situations that are syntactically correct but semantically incorrect include undefined references, mismatched operand types for an operator (e.g., struct mytype a; int b = a + 1;), and mismatches in the type and number of arguments in a function call.

For clang, the types of expressions are already provided when outputting the AST. In fact, most compilers do not strictly separate syntax analysis and semantic analysis.

An important application of semantic analysis is static program analysis, which refers to analyzing source code without executing the program. It essentially analyzes the semantic information in the AST. The perspectives of analysis include code style and conventions, potential software defects, security vulnerabilities, performance issues, etc.

For example, consider the following program:

// a.c
#include <stdlib.h>
int main() {
  int *p = malloc(sizeof(*p) * 10);
  free(p);
  *p = 0;
  return 0;
}

This program conforms to the syntax of the C language, and each statement, when viewed in isolation, also conforms to the semantics of the C language. Compiling it directly with gcc a.c does not produce any errors, and the compiled program even runs successfully. However, if you add the compilation option -Wall, gcc will perform more checks on the code and will issue a warning indicating that there is a use-after-free issue, i.e., memory is accessed after it has been freed. Tools that report potential issues in code through static program analysis are called lint tools.

Similarly, you can use clang’s lint tool to analyze the above program:

clang a.c --analyze -Xanalyzer -analyzer-output=text

Emphasize the Importance of Lint Tools

Some beginner programmers may feel that having the compiler report more warnings adds extra workload to programming. In fact, using lint tools is almost cost-free for developers and can help them discover many potential issues. Once these issues enter the runtime phase of a program, developers will pay a higher price to debug them. Especially during the operation of large projects, problems caused by such code can be very subtle, and software may crash suddenly after running for a long time, making debugging very difficult. Therefore, large projects typically make full use of lint tools to improve project quality as much as possible.

Intermediate Code Generation

Intermediate code is a kind of ISA defined by the compiler itself for use during compilation, also known as the Intermediate Representation (IR) or intermediate language.

You can view the intermediate code generated by clang with the following command:

clang -S -emit-llvm a.c
cat a.ll

Reviewing the state machine model, the main task of compilation is to translate the state machine of a C program into the state machine of the ISA, meaning translating the variables of the C program into registers or memory, and translating the statements of the C program into sequences of instructions. Since intermediate code can also be seen as a type of ISA, we can understand the work of intermediate code generation from a state machine perspective:

  • translating the variables of the C program into variables of the intermediate code, such as %1, %2, %3, etc.;
  • translating the statements of the C program into instructions of the intermediate code, including alloca, store, load, add, call, ret, etc.

Of course, the translation process needs to rely on the semantics analyzed by the AST to ensure that the behavior of the resulting intermediate code is equivalent to that of the input C program.

So why not translate directly to the target language, which is the processor-related ISA?

On one hand, there are many processor-related ISAs. If we directly translate to a processor-related ISA and need to optimize the program, we would have to implement an optimization technique separately for different ISAs, increasing the maintenance costs of the compiler. If we first translate to intermediate code and then to the processor-related ISA, we only need to implement the optimization technique on the intermediate code.

On the other hand, intermediate code can serve as a bridge between various source languages (such as C, Fortran, Haskell, etc.) and various target languages (such as x86, ARM, RISC-V, etc.). If there are M source languages and N target languages, translating each source language directly into each target language would require M × N translation modules. By introducing intermediate code and dividing the compilation process into a frontend and a backend, with the intermediate code as the boundary, we only need to implement M + N translation modules: the M frontend modules translate the source languages into intermediate code, and the N backend modules translate the intermediate code into the target languages.

             frontend                              backend
           +----------+                        +------------+
      C -> |  Clang   | -+                 +-> |  llvm-x86  | -> x86
           +----------+  |                 |   +------------+
           +----------+  +-> +----------+ -+   +------------+
Fortran -> | llvm-gcc | ---> | llvm-opt | ---> |  llvm-arm  | -> ARM
           +----------+  +-> +----------+ -+   +------------+
           +----------+  |                 |   +------------+
Haskell -> |    GHC   | -+                 +-> | llvm-riscv | -> RISC-V
           +----------+  LLVM IR      LLVM IR  +------------+

Different compilers may use different intermediate codes. For example, the intermediate code used by clang is called LLVM IR, while the intermediate code used by gcc is called GIMPLE. However, we do not need to understand the specific details of the intermediate code; we only need to have a general understanding of the intermediate code generation process from the perspective of the state machine model.

Compilation Optimization

Compilation optimization is an important step in the modern software build process. Through compilation optimization, developers can focus their efforts on the development of the program’s business logic without having to overly consider the program’s performance during the development phase. The compiler can usually provide a decent performance baseline.

Definition of the Correctness of Compilation Optimization

We can understand compilation optimization from the perspective of program behavior: if two programs are “equivalent” in some sense, then a “simple” program can replace a “complex” one. The behavior of following the C language standard and executing statements one by one is referred to as “strict execution”. Using “strict execution” as a baseline, the C language standard gives a rigorous definition of what “equivalent” means. Specifically, the optimized program must remain consistent in terms of “observable behavior” (C99 Standard Manual, Section 5.1.2.3, Point 6), which includes the following:

  1. Accesses to variables qualified with the volatile keyword must be strictly executed.
  2. At program termination, the data written to files must be identical to that produced by strict execution.
  3. The input and output of interactive devices (stdio.h) must be consistent with those of strict execution.

“Observable behavior” describes the effect of a C program on the outside world from an external perspective. For example, the second point requires that external operations without real-time requirements appear “consistent” at the end, while the third point requires that real-time external operations appear “consistent” during execution. The first point constrains the internal behavior of the C program, which is related to the “memory-mapped I/O” mentioned in the F phase; however, we do not intend to elaborate on this here, as we will continue to discuss related content in the D phase.

Therefore, as long as the optimization continues to meet the consistency of observable program behavior, such optimization is considered “correct”. Under this condition, if the optimized program has fewer variables or statements, one can expect better performance.
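To get a feel for the first requirement, compare the two functions in the sketch below. It assumes gcc or clang with optimization enabled (e.g., -O2); the exact generated code may of course differ:

// observable.c -- a sketch of how volatile constrains optimization
#include <stdio.h>

int count_plain() {
  int n = 0;
  for (int i = 0; i < 1000; i++) n++;   // no observable behavior depends on this loop,
  return n;                             // so an optimizer is free to reduce it to "return 1000"
}

int count_volatile() {
  volatile int n = 0;                   // accesses to n are observable behavior,
  for (int i = 0; i < 1000; i++) n++;   // so the loads and stores of n must be kept
  return n;
}

int main() {
  printf("%d %d\n", count_plain(), count_volatile());
  return 0;
}

Comparing the -O2 assembly of the two functions (generated with -S) should make the difference visible.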

Examples of Compilation Optimization Techniques

Real-world projects commonly use compilation optimization techniques, and in the future, you will run various compiled and optimized programs on processors you design yourself. Therefore, understanding some common optimization techniques and knowing why compilers generate corresponding instruction sequences will be beneficial for tasks such as debugging and architecture optimization in the future.

Below are some common compilation optimization techniques. Note that optimization techniques are not directly applied to C code in the compilation process, but for the sake of understanding, we present the semantics before and after optimization using C code.

  • Constant Propagation - If the value of a variable is a constant, that value can be substituted into its references. If the substitution results in a constant expression, the value of that expression can be computed directly. In the following example, since the value of a is constant, the expression a + 2 is also constant, making b’s value constant as well; further, the expression b * 3 is also constant. The compiler can compute these constant expressions directly and replace them with their computed results, thus eliminating the need to generate corresponding instructions (like addition and multiplication instructions) to calculate these expressions at runtime.
//          Before             |            After
  int a = 1;                   |    int a = 1;
  int b = a + 2;               |    int b = 3;
  printf("%d\n", b * 3);       |    printf("%d\n", 9);
  • Dead Code Elimination - Code that is unreachable or variables that are no longer used can be removed. In the following example, since the definition of the DEBUG macro is 0, the code inside the if block will never execute and can be removed. After removing the code in the if block, the variable a is unused and can be further removed.
//          Before             |            After
  #define DEBUG 0              |    #define DEBUG 0
  int fun(int x) {             |    int fun(int x) {
    int a = x + 3;             |      return x / 2;
    if (DEBUG) {               |    }
      printf("a = %d\n", a);   |
    }                          |
    return x / 2;              |
  }                            |
  • Elimination of Redundant Operations - Assignments that are overwritten without being read can be removed. In the following example, the assignment a = 3 will be overwritten by a = f(), so the former can be removed; similarly, the assignments a = f() and a = 7 can also be removed.
//          Before             |            After
  int a;                       |    int a;
  a = 3;                       |    f();
  a = f();                     |    a = 10;
  a = 7;                       |
  a = 10;                      |

Can it be further optimized?

The optimization result of the previous example retains the function call f(). Is it possible to further remove f()? Why?

  • Code Strength Reduction - Replace complex operations with simpler arithmetic operations. In the following example, the behavior of i * 4 and i << 2 is semantically equivalent when defined. However, on most computers, the execution cost of multiplication instructions is higher than that of shift instructions. By replacing the former with the latter, we can improve the program’s performance.
//          Before             |            After
  int x = a[i * 4];            |    int x = a[i << 2];
  • Extract Common Subexpressions - For subexpressions that are computed multiple times, an intermediate variable can be used to store the result, allowing that result to be referenced directly in subsequent code without repeating the computation. In the following example, a * b appears twice, and by introducing an intermediate variable temp to hold the result of a * b, we can reduce one multiplication operation, thereby enhancing the program’s performance.
//          Before             |            After
  int x = a * b - 1;           |    int temp = a * b;
  int y = a * b * 2;           |    int x = temp - 1;
                               |    int y = temp * 2;
  • Loop-Invariant Code Motion - Code that produces the same result in each iteration of a loop can be moved outside the loop to be computed once. In the following example, the expression a + 2 yields the same result in each iteration of the loop, so it can be moved outside the loop for a single computation, avoiding the need to recompute it during each iteration.
//          Before             |            After
  int a = f1();                |    int x = f1() + 2;
  for (i = 0; i < 10; i ++) {  |    for (i = 0; i < 10; i ++) {
    int x = a + 2;             |      int y = f2(x);
    int y = f2(x);             |      sum += y + i;
    sum += y + i;              |    }
  }                            |

Can it be further optimized? (2)

In the optimization result of the previous example, f2(x) is still in the loop. Can f2(x) be moved outside the loop for computation? Why?

  • Inline Function - For smaller functions, they can be expanded at the call site to save the overhead of a function call. In the following example, f2()‘s call to f1(x, 3) can be directly expanded to compute x + 3, eliminating the need for the function call and return process, thus improving the program’s performance.
//          Before             |            After
  int f1(int x, int y) {       |    int f1(int x, int y) {
    return x + y;              |      return x + y;
  }                            |    }
  int f2(int x) {              |    int f2(int x) {
    return f1(x, 3);           |      return x + 3;
  }                            |    }

Can it be further optimized? (3)

In the previous example, assuming f1(x, 3) is the only call to f1() within this source file, can we apply dead code elimination techniques to remove f1() in the optimization result? Why?

In addition to the techniques mentioned above, there are many other compilation optimization techniques, such as induction variable analysis, loop unrolling, software pipelining, automatic parallelization, and alias and pointer analysis, which we will not discuss here. Interested readers can refer to relevant materials.

Enable Compilation Optimization

You can enable more compilation optimizations by providing the -O1 option in the clang command line:

clang -S -emit-llvm -O1 a.c
cat a.ll

Compare the results of compilation optimization

Try to compare the intermediate code generated before and after adding -O1. What differences do you notice in the intermediate code after adding -O1?

Compare the results of compilation optimization (2)

Try adding the keyword volatile before int z = x + y; and regenerate the -O1 intermediate code. How does the intermediate code compare to the one generated previously?

Compilers usually provide different optimization levels, allowing developers to choose among various metrics such as program performance, code size, and compilation time. For example, in gcc, the optimization levels for program performance are as follows:

-Ofast > -O3 > -O2 > -O1 > -Og > -O0 (default).

The higher the optimization level, the better the performance of the generated program, but the longer the compilation time.

Most software projects that use gcc or clang typically use the -O2 optimization level, which allows the software to run with good performance. For -O3, gcc will also try to achieve higher program performance by generating more code. -Ofast is more aggressive: it may even employ optimization strategies that violate language standards to achieve better performance. -Og only adopts optimization strategies that are friendly for debugging and, compared to -O0, can enhance program performance while allowing the optimized program to maintain its original structure (like loops, function calls, etc.), making the generated instruction sequence correspond well to the C code, which facilitates debugging for developers.

In addition to program performance, gcc also provides optimization levels aimed at reducing code size:

-Oz > -Os > -O1 > -O0 (default).

For clang, the aforementioned optimization levels are also applicable.

An optimization level typically includes many optimization techniques. For gcc, you can view the optimization techniques enabled by a specific optimization level using -Q --help=optimizers.

For example:

gcc -Q --help=optimizers -O1

This will show which optimization techniques are enabled by -O1.

For clang, you can see the sub-steps (passes) of the compilation process using -ftime-report:

clang -S -emit-llvm -O1 a.c -ftime-report

Of course, you are not required to understand the specific details of every optimization technique. If you are interested, you can refer to the relevant manuals.

Target Code Generation

The task of target code generation is to translate the optimized intermediate code into target code, which corresponds to the processor-specific ISA.

Similarly, we can understand the work of target code generation from a state machine perspective:

  • Translate the variables in the intermediate code into storage locations of the processor ISA, i.e., map %1, %2, %3, etc., onto registers or memory addresses.
  • Translate the instructions in the intermediate code into instructions of the processor ISA, i.e., translate instructions such as alloca, store, load, add, call, ret, etc., into the corresponding processor ISA instructions.

You can view the target code generated by clang using the following commands:

clang -S a.c
cat a.s

The above clang command will generate assembly code that matches the local environment’s ISA by default, such as x86. You can also provide the compilation option --target=xxx to clang to generate the corresponding assembly code.

For example, using --target=riscv64-linux-gnu will instruct clang to generate riscv64 assembly code:

clang -S a.c --target=riscv64-linux-gnu

This process of generating assembly code for an ISA that differs from the local environment is known as "cross-compilation".

Using riscv64 as an example

Due to the lack of a runtime environment for riscv32 in the system, it is not possible to generate riscv32 code with the system’s default configuration, and subsequent steps cannot generate an executable file for riscv32.

For the sake of example, we will use riscv64 to demonstrate the RISC-V related compilation process. Compared to riscv32, riscv64 simply adds several additional instructions. However, to understand the general correspondence between C code and assembly code, you do not need to delve deeply into the specific semantics of each instruction.

Understand the association between C code and riscv instruction sequences

Read the riscv64 assembly code obtained from cross-compiling with clang, and try to identify which segment of assembly code was compiled from which segment of C code.

Understand the association between C code and riscv instruction sequences (2)

Add -O1 and recompile to obtain riscv64 assembly code, what differences do you notice in the generated assembly code? How does it relate to the C code?

Target code generation, as the final step in the compilation process, can also yield corresponding results using gcc:

gcc -S a.c  # compile to local ISA
riscv64-linux-gnu-gcc -S a.c  # cross-compile to riscv64

During the process of generating target code, the compiler will also perform optimizations related to the target ISA. For example, when translating variables in the intermediate code into ISA registers or memory, the compiler will try to place frequently used variables into registers and less frequently used variables into memory.

In modern computers, the efficiency of a processor accessing registers is significantly higher than that of accessing memory, so placing commonly used variables in registers can enhance overall program performance.

For instruction translation, the compiler will also attempt to generate sequences of instructions with fewer instructions, which can similarly improve program performance. These strategies are supported by certain theoretical foundations, which we will not delve into here. Interested readers may consider referring to materials related to compiler principles.

Generation and Execution of Binary Files

Assembly

The result of compilation is assembly code. We already know that assembly language is a symbolic representation of instructions; thus, assembly code is essentially readable text. However, the processor’s circuitry cannot understand text, so the assembly code needs to be converted into binary encoding of instructions. This is the job of the assembling step, and the tool that performs this task is called an assembler.

The assembler operates straightforwardly; overall, it consults the ISA manual and translates each text instruction in the assembly code into the corresponding binary encoding to produce an object file. You can use the following command to have clang generate the object file after assembly:

clang -c a.c
ls a.o

However, the contents included in the object file are no longer text, so opening it with a text editor will yield unreadable content. To view the content of the object file, we need some binary file parsing tools to help us decode the binary content into readable text.

For example, we can use the objdump tool from the binutils (Binary Utilities) package to parse the object file:

objdump -d a.o

The process of re-parsing assembly code from the object file is called “disassembly”. The above command will yield the x86 disassembly code, which is similar to the assembly file generated by the compiler.
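The binary encoding that objdump decodes back into text is exactly what the assembler computed by consulting the ISA manual. As a small illustration (not how a real assembler is organized), the sketch below hand-encodes the single RISC-V instruction addi a0, a0, 1 according to the I-type format in the RISC-V manual:

// encode_addi.c -- encode "addi a0, a0, 1" by hand, following the RISC-V I-type format
#include <stdio.h>
#include <stdint.h>

// I-type layout: imm[31:20] | rs1[19:15] | funct3[14:12] | rd[11:7] | opcode[6:0]
static uint32_t encode_itype(uint32_t imm, uint32_t rs1, uint32_t funct3,
                             uint32_t rd, uint32_t opcode) {
  return (imm << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode;
}

int main() {
  uint32_t inst = encode_itype(1, 10, 0, 10, 0x13);         // a0 is x10; addi uses opcode 0x13 and funct3 0
  printf("addi a0, a0, 1 -> 0x%08x\n", (unsigned)inst);     // should print 0x00150513
  return 0;
}

If a riscv64 object file contains this instruction, its disassembly should show the same 32-bit value.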

To obtain the riscv64 object file, we need to perform cross-compilation:

clang -c a.c --target=riscv64-linux-gnu
riscv64-linux-gnu-objdump -d a.o

Similarly, we can also use gcc to generate the object file:

gcc -c a.c
riscv64-linux-gnu-gcc -c a.c

or use the LLVM toolchain for disassembly, which can automatically identify the ISA architecture corresponding to the object file, making it more convenient:

llvm-objdump -d a.o  # works for both x86 and RISC-V object files

View the disassembly results of the riscv64 object file

Based on the above commands, check the disassembly results of the riscv64 object file, and compare them with the assembly file generated by the compiler.

Linking

The linking process merges multiple object files into the final executable file. You can use the following command to have clang generate the linked executable file:

clang a.c
ls a.out

The executable file can also be disassembled using objdump.

However, you will find that the linked executable file contains significantly more content compared to the object file (a.o). To further understand where this content comes from, we can look at the log of the clang command execution:

clang a.c --verbose

At the end of the log, you can see the command related to linking, which also includes several object files with names like crt*.o. Here, crt stands for “C runtime”, the runtime environment for C programs. In other words, the linking process combines the object file obtained from compiling and assembling a.c with the existing object files of the C runtime environment to generate the final executable file. Evidently, the executable file needs to include these runtime-environment object files so that they can provide the necessary support for its execution.
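Besides combining your code with the C runtime, the linker is also what resolves references between your own object files. Here is a minimal two-file sketch; the file names main.c and add.c and the function add() are arbitrary choices for illustration:

// main.c
int add(int a, int b);              // declared here, but defined in another object file
int main() { return add(1, 2); }

// add.c
int add(int a, int b) { return a + b; }

If you compile each file separately with -c, the resulting main.o contains an unresolved reference to add; it is the linking step (e.g., gcc main.o add.o) that locates the definition in add.o and merges the two object files into one executable file.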

Similarly, we can also use gcc to generate the executable file:

gcc a.c

We can also generate the riscv64 executable file through cross-compilation:

clang a.c --target=riscv64-linux-gnu
riscv64-linux-gnu-gcc a.c

View the disassembly results of the riscv64 executable file

Try to generate the riscv64 executable file, view its disassembly results, and compare them with the object file before linking.

The linking process involves many details, which need not be delved into at this time; we will elaborate on this in the D phase.

Execution

For x86, after compiling the executable file, execute it using the following command:

./a.out

You are already familiar with the process of the processor executing instructions: fetching, decoding, executing, and updating the PC. As long as the program’s instruction sequence is placed in memory and the PC points to the first instruction, the processor will automatically execute this program.

Compare the performance differences before and after compilation optimization

We previously introduced various compilation optimization options, and now you can experience the power of these options. Using the previously discussed program for summing a series, you can measure the runtime of the program under different compilation optimization levels to understand the impact of different optimization levels on program performance.

However, since the previous series summation program ends at 10, to conveniently measure performance differences, you need to adjust the program’s endpoint to increase its execution time. If the endpoint is large, you can also change the type of the sum variable to long long. You can measure the execution time of a command using the time command; for example, time ls will report the execution time of the ls command.
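For reference, here is a minimal sketch of such an adjusted program; the endpoint 100000000 is an arbitrary choice, and your series program may look different:

// sum.c -- a sketch of the adjusted summation program
#include <stdio.h>
int main() {
  long long sum = 0;                              // long long, so a large endpoint does not overflow
  for (long long i = 1; i <= 100000000LL; i++) {  // endpoint enlarged so the runtime is measurable
    sum += i;
  }
  printf("sum = %lld\n", sum);
  return 0;
}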

After adjusting the endpoint of the series, compile and measure the program’s runtime under -O0, -O1, and -O2.

If you are interested, you can also disassemble to view the corresponding assembly code and try to understand based on the assembly code: why you achieve the corresponding performance improvement? What compilation optimization techniques might the compiler have applied? However, to answer these questions, you might need to RTFM or STFW to learn about the functions of some assembly instructions.

However, the generated executable file resides on external storage (disk or SSD); how do we place it in memory? Recalling the processor we implemented using Logisim, we directly connected the instruction sequence as constants to the circuit or used the GUI’s Load Image operation to load the program’s instruction sequence into the ROM component. Both methods essentially involve us manually completing the task of “placing the program’s instruction sequence in memory”, but in the command ./a.out, we did not manually perform this task.

When we execute ./a.out, who exactly completes the task of “placing the program’s instruction sequence in memory” so that the processor can fetch the program’s instruction sequence from memory for execution? In fact, there is a special program in modern operating systems called a “loader”, whose job is to read (load) other executable files from external storage into memory and jump to the corresponding program entry point to execute instructions.

However, the processor can only execute instructions that comply with its ISA specifications. For example, most students use computers equipped with x86 processors, so they can only execute x86 executable files and cannot execute RISC-V executable files; if an x86 processor is forced to execute a sequence of RISC-V instructions, it may misinterpret RISC-V instructions as other x86 instructions during the decoding process, or even recognize them as illegal instructions, causing the processor to fail to execute according to the original semantics of RISC-V instructions, resulting in unexpected program execution results. Therefore, the loader checks which ISA the program’s instruction sequence belongs to during the loading process, and if it does not match the current processor’s ISA, the loader will stop loading and report an error.

Loading the program is a necessary step before the program runs; therefore, the loader is part of the runtime environment. The aforementioned crt*.o object files contain some of the loader’s functionality. The runtime environment also provides other functions; for instance, during the execution of ./a.out, its role is also reflected in the following:

  • Before the program execution begins, it carries out various initialization tasks. When learning the C language previously, you might have thought that a program starts executing from main(). However, if that were the case, how would the program arguments we type on the command line be passed to main()’s argc and argv? In fact, this task is also completed by the runtime environment: it calls main() only after finishing a series of preparatory tasks such as loading the program and preparing the arguments.
  • During the program execution, it provides support for library functions like printf(). In the example code above, printf() was called directly without any code for printf() being written, yet when executing ./a.out, information was successfully output through printf(). Therefore, we have reason to suspect that the crt*.o object files provide a way to execute printf(), either directly or indirectly. In fact, library functions are indeed part of the runtime environment.
  • After the program execution ends, it provides the functionality for program exit. When learning the C language previously, you might have thought that the program exits directly after returning from main(). However, once you understand that the runtime environment did a lot of preparatory work before main() executes, you can easily guess that after main() returns, control goes back to the runtime environment, which then carries out some cleanup work before the program exits.
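A small experiment makes these points tangible. The sketch below assumes gcc or clang, since __attribute__((constructor)) is a GCC/Clang extension rather than standard C; atexit() is a standard library function:

// runtime_env.c -- observing code that runs before and after main()
#include <stdio.h>
#include <stdlib.h>

__attribute__((constructor))            // GCC/Clang extension: run by the startup code before main()
static void before_main(void) { printf("before main()\n"); }

static void after_main(void) { printf("after main() returned\n"); }

int main(int argc, char *argv[]) {
  printf("argc = %d, argv[0] = %s\n", argc, argv[0]);   // argc/argv were prepared by the runtime environment
  atexit(after_main);                   // called by the exit handling in the runtime environment after main() returns
  return 0;
}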

In general, just having the program itself is not sufficient to run it. From a broader perspective, all functionalities that support program execution fall under the category of the runtime environment.

Manual Behavior and Coding Standards

Just as the ISA manual defines the semantics of instructions, the semantics of the C language is also defined by its corresponding manual. After its introduction in the 1970s, the C language gained widespread use due to its efficiency, flexibility, and portability. However, the various variants and extensions of C across different compilers led to compatibility issues. To address this, ANSI established a committee in 1983 to standardize the C language. Over the years, the C language standard has evolved through several versions, including C90, C99, C11, C17, and C23; these versions are typically named after the year of their release, for example, the C11 standard was published in 2011. In OSOC, we will take C99 as an example to understand some definitions and concepts in the C language standard, thus establishing a more comprehensive understanding of C program behavior.

Semantics of Program Execution

Reviewing the state machine model of computer systems, the process of executing a C program is the process of changing the program’s state through the statements of the C program. This understanding is based on the state machine model. Now we can look at how the C language standard defines “program execution”, which will help us further understand the details of “program execution”.

Section 5.1.2.3 of the C99 manual defines “program execution”, and we will explain these definitions one by one:

1 The semantic descriptions in this International Standard describe the behavior of
an abstract machine in which issues of optimization are irrelevant.

In the manual, the semantic description of program execution is related to an abstract machine, and topics of optimization are not involved. Here, the concept of an abstract machine is similar to the model machine discussed earlier when introducing ISA; it discusses only the functions and behaviors it possesses without addressing its specific implementation.

2 Accessing a volatile object, modifying an object, modifying a file, or calling a
function that does any of those operations are all side effects, which are changes
in the state of the execution environment. Evaluation of an expression in general
includes both value computations and initiation of side effects. Value computation
for an lvalue expression includes determining the identity of the designated object.

Accessing a volatile object, modifying an object, modifying a file, or calling a function that performs any of these operations are all referred to as “side effects”, meaning changes in the state of the execution environment. The evaluation of an expression generally includes both value computation and the initiation of side effects. The value computation of an lvalue expression also includes determining the identity of the designated object.

Regarding “access” and “modify”, their concepts are defined in Chapter 3 of the manual, Terms, definitions, and symbols. Specifically, “access” refers to “reading” or “modifying” the value of an object, where “modify” includes situations where the new value to be stored is the same as the old value.

3 Sequenced before is an asymmetric, transitive, pair-wise relation between
evaluations executed by a single thread, which induces a partial order among those
evaluations. Given any two evaluations A and B, if A is sequenced before B, then the
execution of A shall precede the execution of B. (Conversely, if A is sequenced
before B, then B is sequenced after A.) If A is not sequenced before or after B,
then A and B are unsequenced. Evaluations A and B are indeterminately sequenced when
A is sequenced either before or after B, but it is unspecified which. The presence
of a sequence point between the evaluation of expressions A and B implies that every
value computation and side effect associated with A is sequenced before every value
computation and side effect associated with B. (A summary of the sequence points is
given in annex C.)
  • “Sequenced before” defines an asymmetric and transitive binary relation for evaluations executed by a single thread, inducing a partial order among those evaluations.
  • Given any two evaluations A and B, if A is sequenced before B, then the execution of A occurs before that of B.
  • Conversely, if A is sequenced before B, then B is sequenced after A.
  • If A is neither sequenced before nor after B, then A and B are unsequenced.
  • If A is sequenced before B, or A is sequenced after B, but it is not specified which, then A and B are indeterminately sequenced.
  • If there is a sequence point between the evaluations of expressions A and B, then all value computations and side effects associated with A must occur before those associated with B.
  • Annex C provides a summary of sequence points.

This entry strictly defines the order relations between different evaluation operations through the concept of sequence points, thereby rigorously defining the semantics of the entire program execution. For example, consider the following program:

a = 1;
b = a + 2;

According to the summary of sequence points in Annex C, there exists a sequence point between the expression evaluations of statement expressions. According to the definition of sequence points, this means that all value computations and side effects related to the expression a = 1 must occur and take effect before evaluating b = a + 2. Therefore, when evaluating b = a + 2, the value of a has already been modified to 1, allowing the program to read 1 from variable a at that moment.

You might feel this is of little use. Let’s look at the following example and see what the output of this program is:

#include <stdio.h>
int f() { printf("in f()\n"); return 1; }
int g() { printf("in g()\n"); return 2; }
int h() { printf("in h()\n"); return 3; }
int main () {
  int result = f() + g() * h();
  return 0;
}

In fact, this program may print the messages from f(), g(), and h() in any order: within the same expression, these function calls are indeterminately sequenced, so they may be invoked in any order. If the behavior of the program depends on a specific calling order, the execution result may not meet expectations.

4 In the abstract machine, all expressions are evaluated as specified by the
semantics. An actual implementation need not evaluate part of an expression if it
can deduce that its value is not used and that no needed side effects are produced
(including any caused by calling a function or accessing a volatile object).

In the abstract machine, all expressions are evaluated according to the corresponding semantics. In a specific implementation, if the value of a part of an expression is not used and no side effects are produced (including those caused by function calls and accessing volatile objects), that part of the expression may not be evaluated.

This entry actually indicates the optimization space in the expression evaluation process.

5 When the processing of the abstract machine is interrupted by receipt of a signal,
the values of objects that are neither lock-free atomic objects nor of type volatile
sig_atomic_t are unspecified, as is the state of the floating-point environment. The
value of any object modified by the handler that is neither a lock-free atomic
object nor of type volatile sig_atomic_t becomes indeterminate when the handler
exits, as does the state of the floating-point environment if it is modified by the
handler and not restored to its original state.

This entry is related to signal mechanisms and is beyond the current learning scope, so it will not be elaborated here.

6 The least requirements on a conforming implementation are:
— Accesses to volatile objects are evaluated strictly according to the rules of the
  abstract machine.
— At program termination, all data written into files shall be identical to the
  result that execution of the program according to the abstract semantics would
  have produced.
— The input and output dynamics of interactive devices shall take place as specified
  in 7.21.3. The intent of these requirements is that unbuffered or line-buffered
  output appear as soon as possible, to ensure that prompting messages actually
  appear prior to a program waiting for input.
This is the observable behavior of the program.

A conforming implementation must meet at least three requirements, which aligns with the “consistency of observable behavior of programs” discussed earlier regarding compiler optimization.

7 What constitutes an interactive device is implementation-defined.

What exactly constitutes an interactive device is defined by the implementation.

8 More stringent correspondences between abstract and actual semantics may be
defined by each implementation.

Each implementation may further define a stricter correspondence between abstract semantics and actual semantics.

Implementation of Standard Specifications

The essence of standard specifications is a set of definitions and conventions, usually presented through manuals. To implement the standard specifications, they must be realized in some form, referred to as the "implementation" of standard specifications. For example, the ISA is essentially a standard specification, and processors that implement this specification through digital circuits are one form of ISA implementation. Clearly, the behavior of the processor must conform to the ISA standard specification.

Similarly, the C language standard also has corresponding implementations. Section 3.12 of the C99 manual defines “what is an implementation”:

particular set of software, running in a particular translation environment
under particular control options, that performs translation of programs for,
and supports execution of functions in a particular execution environment

In other words, the implementation of the C language standard is a series of specific software that runs in a particular translation environment, used for translating programs and supporting the execution of corresponding functions in a specific execution environment. If we use concepts we are familiar with to understand, the implementation of the C language standard is the compiler (used for program translation) and the runtime environment (used to support program execution). Similarly, both the compiler and runtime environment must conform to the C language standard.

Standard specifications define many details, including various “should” or “should not” behaviors, which are clearly defined behaviors, and specific implementations must follow these standard specifications. However, not all behaviors can be clearly defined by the standard specifications. Understanding how specific implementations handle these behaviors will help us understand how computer systems work and guide us in writing more standardized code to avoid these behaviors.

Taking the C99 standard as an example, the types of behavior that cannot be clearly defined include the following categories, all of which are defined in Chapter 3 of the manual, Terms, definitions, and symbols.

Unspecified Behavior

The C99 manual defines Unspecified Behavior as follows:

use of an unspecified value, or other behavior where this International
Standard provides two or more possibilities and imposes no further requirements
on which is chosen in any instance

In other words, for this type of behavior, the C language standard provides multiple choices for the result but does not specify which one to choose; the specific implementation can choose one from them.

The C99 manual provides the following example:

An example of unspecified behavior is the order in which the arguments to a
function are evaluated

That is, the order of evaluation of function call parameters is unspecified. We can verify this with the following program:

// a.c
#include <stdio.h>
void f(int x, int y) {
  printf("x = %d, y = %d\n", x, y);
}
int main() {
  int i = 1;
  f(i ++, i ++);
  return 0;
}

In yzh’s system, the program was compiled and run using gcc and clang, yielding the following results:

$ gcc a.c && ./a.out
x = 2, y = 1
$ clang a.c && ./a.out
x = 1, y = 2

As we can see, even with the same program code, compiling with different compilers yields different results.

Experience Unspecified Behavior

Try compiling and running the above program on your system and observe the results.

If you try to compile and run the above program on your system, you may get results different from yzh. If this is the case, it indicates that even with the same compiler, using different versions to compile the program can yield different results, but this still conforms to the C language standard. In an extreme case, a compiler could randomly decide the order of evaluation for function call parameters, which would still conform to the C language standard!

if (rand() & 1) { evaluate from left to right; }
else            { evaluate from right to left; }

Since neither the C language standard nor its implementation is wrong, the problem can only lie within the program. In fact, the behavior of the aforementioned program relies on the unspecified behavior in the C language standard, resulting in different performances across different compilers. To avoid such outcomes, we should write code that is not affected by unspecified behavior. For example, rewriting the above code as follows:

  int i = 1;
  int x = i ++;
  int y = i ++;
  f(x, y);

This way, regardless of the evaluation order of f(x, y), the program’s behavior will always output x = 1, y = 2, thus avoiding the effects of unspecified behavior.

Try understanding parameter evaluation order from the perspective of sequence points

Consult the C99 manual regarding parameter evaluation, and see how the manual describes the order of parameter evaluation through the perspective of sequence points.

Implementation-defined Behavior

The C99 manual defines Implementation-defined Behavior as follows:

unspecified behavior where each implementation documents how the choice is made

This type of behavior is a special kind of unspecified behavior, but specific implementations must document how the choice is made. Unlike general unspecified behavior, once a specific implementation documents its choices for implementation-defined behavior, it cannot be modified. Specific implementations must not only conform to the C language standard but also to their own documentation. Therefore, programs containing such behavior will yield the same results when compiled and executed multiple times in a specific environment (including the compiler and runtime environment).

A common example is the length of integer types. In fact, the C language standard has never defined how long each integer type is. Section 5.2.4.2 of the C99 manual states the following regarding integer type lengths:

An implementation is required to document all the limits specified in this
subclause, which are specified in the headers <limits.h> and <float.h>.
Additional limits are specified in <stdint.h>.

Regarding the lengths of integer types, the C99 manual’s subsection 5.2.4.2.1 states:

The values given below shall be replaced by constant expressions suitable for
use in #if preprocessing directives. Moreover, except for CHAR_BIT and
MB_LEN_MAX, the following shall be replaced by expressions that have the same
type as would an expression that is an object of the corresponding type
converted according to the integer promotions. Their implementation-defined
values shall be equal or greater in magnitude (absolute value) to those shown,
with the same sign.

Here, we list some of the values mentioned above:

Value         Description
SCHAR_MIN     Minimum value of signed char
SCHAR_MAX     Maximum value of signed char
UCHAR_MAX     Maximum value of unsigned char
SHRT_MIN      Minimum value of short int
SHRT_MAX      Maximum value of short int
USHRT_MAX     Maximum value of unsigned short int
INT_MIN       Minimum value of int
INT_MAX       Maximum value of int
UINT_MAX      Maximum value of unsigned int
LONG_MIN      Minimum value of long int
LONG_MAX      Maximum value of long int
ULONG_MAX     Maximum value of unsigned long int
LLONG_MIN     Minimum value of long long int
LLONG_MAX     Maximum value of long long int
ULLONG_MAX    Maximum value of unsigned long long int

As we can see, the C language standard only defines the minimum range of values for each integer type, and specific implementations can define a larger range for each integer type but cannot define a range smaller than the minimum defined by the C language standard.
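You can check the choices made by your own implementation by printing a few of these limits. The following small program relies only on the standard headers stdio.h and limits.h:

// limits_demo.c -- print a few implementation-defined integer limits
#include <stdio.h>
#include <limits.h>

int main() {
  printf("CHAR_BIT    = %d\n", CHAR_BIT);   // bits per byte, also implementation-defined
  printf("INT_MIN     = %d\n", INT_MIN);
  printf("INT_MAX     = %d\n", INT_MAX);
  printf("LONG_MAX    = %ld\n", LONG_MAX);
  printf("sizeof(int) = %zu, sizeof(long) = %zu\n", sizeof(int), sizeof(long));
  return 0;
}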

Why does the C language standard specify this? We can understand the objectives of the C language standard from the summary section of the C99 manual:

This International Standard specifies the form and establishes the
interpretation of programs expressed in the programming language C. Its purpose
is to promote portability, reliability, maintainability, and efficient
execution of C language programs on a variety of computing systems.

To maximize compatibility across various computer systems, the C language standard must consider:

  • Supporting older computer systems, so many provisions cannot be overly rigid. You may have noticed that in the C language standard, the maximum allowed value for SCHAR_MIN (the minimum value of signed char) is -127, whereas the minimum number representable by an 8-bit two’s complement encoding is -128. This is because the C language standard takes into account that some older computers used sign-magnitude or one’s complement representation, where the minimum number representable by an 8-bit signed number is -127. If the C language standard had required the minimum value of signed char to be -128, it would not be compatible with those computers.

How long is one byte?

You might answer without thinking, “8 bits”. In fact, however, section 3.6 of the C language standard states that the number of bits in 1 byte is “implementation-defined”:

A byte is composed of a contiguous sequence of bits, the number of which is implementation-defined

This is clearly to be compatible with older computer systems: Historically, the length of 1 byte has been defined as different numbers of bits in different computers, ranging from 1 bit to 48 bits. Therefore, the C language standard cannot directly define 1 byte as 8 bits, leaving it to specific implementations to define.

You may have also noticed that the content quoted from the C language standard uses ranges of values to indirectly describe the length of integer types instead of directly describing them as 4 bytes, which is also to avoid using the ambiguous concept of byte that varies across different implementations.

  • Supporting future computer systems, so many provisions must be compatible with future developments. The C language standard only defines the minimum range of values for each integer type, which allows for future flexibility. For example, regarding the int type, in some environments from the 1990s (like Turbo C, used in some old textbooks), int was 16 bits; in 32-bit systems (like Windows VC 6.0 or Linux GCC), int is 32 bits; in modern 64-bit systems, int is still 32 bits.

Check the range of integer types in Linux

The Linux runtime environment is also part of the specific implementation of the C language standard. You can check /usr/include/limits.h to understand the range of integer types in this specific implementation.
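
For example, the following small program prints several of these limits, together with CHAR_BIT, the number of bits in one byte under this implementation; all of these macros come from limits.h:

#include <stdio.h>
#include <limits.h>

int main() {
  printf("CHAR_BIT   = %d\n", CHAR_BIT);      // bits per byte in this implementation
  printf("SCHAR_MIN  = %d\n", SCHAR_MIN);
  printf("SCHAR_MAX  = %d\n", SCHAR_MAX);
  printf("INT_MIN    = %d\n", INT_MIN);
  printf("INT_MAX    = %d\n", INT_MAX);
  printf("LONG_MIN   = %ld\n", LONG_MIN);
  printf("LONG_MAX   = %ld\n", LONG_MAX);
  printf("ULLONG_MAX = %llu\n", ULLONG_MAX);
  return 0;
}

Comparing its output with /usr/include/limits.h and with the minimum ranges quoted above shows which choices this particular implementation has made.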

In summary, programs containing implementation-defined behaviors can yield the same results when compiled and executed multiple times in a specific environment. However, when porting such a program to another environment, one must consider the differences in implementation-defined behavior between the new and old environments. For instance, if one were to port a 32-bit program to a 16-bit environment (though such needs are rare today), one consideration would be that int type data might overflow; one solution could be to change it to the long type, since the C language standard specifies that the range of long is at least [-2147483647, +2147483647], which can accommodate the range of int in the 32-bit environment.

Locale-specific Behavior

The C99 manual defines Locale-specific Behavior as follows:

behavior that depends on local conventions of nationality, culture, and
language that each implementation documents

This means that the results of this type of behavior depend on local customs regarding nationality, culture, and language, and specific implementations must document the results of these behaviors. This is a special kind of implementation-defined behavior.

An example is the extended character set. What characters are included in the extended character set is a locale-specific behavior. Consider the following program:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <assert.h>

#define 主函数 main
#define 返回 return

char* string_concat(char *str1, char *str2) {
  char *new_str = malloc(strlen(str1) + strlen(str2) + 1);
  assert(new_str);
  strcpy(new_str, str1);
  strcat(new_str, str2);
  返回 new_str;
}

int 主函数() {
  char *info = string_concat("一生一芯", "很简单");
  printf("%s\n", info);
  free(info);
  返回 0;
}

The above program uses strings and identifiers that include Chinese characters. In specific implementations that support Chinese characters, this program can compile and run successfully; however, in implementations that only support the basic character set (i.e., ASCII), the above program will fail to compile.

Typically, when developing internationalized software (i18n), one must consider locale-specific behaviors. Besides character sets, locale-specific behaviors also include decimal point characters, currency symbols, formats for time and date, etc. However, since OSOC does not involve such software, you do not need to explore this deeply.
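
If you are curious, the standard header locale.h already exposes a few of these conventions. The sketch below prints the decimal point and the currency symbol of the currently configured locale; its output depends on the locales installed on your system and on your environment settings:

#include <stdio.h>
#include <locale.h>

int main() {
  setlocale(LC_ALL, "");             // switch from the default "C" locale to the environment's locale
  struct lconv *lc = localeconv();   // query numeric and monetary conventions
  printf("decimal point  : %s\n", lc->decimal_point);
  printf("currency symbol: %s\n", lc->currency_symbol);
  return 0;
}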

Undefined Behavior

The C99 manual defines Undefined Behavior as follows:

behavior, upon use of a nonportable or erroneous program construct or of
erroneous data, for which this International Standard imposes no requirements

This is a type of erroneous behavior where the program or data does not conform to the standard, but the C language standard imposes no constraints on the result of such behavior. In other words, regardless of the outcome, it conforms to the C language standard. In simple terms, “anything can happen”.

The C99 manual lists some possible outcomes:

Possible undefined behavior ranges from ignoring the situation completely with
unpredictable results, to behaving during translation or program execution in a
documented manner characteristic of the environment (with or without the
issuance of a diagnostic message), to terminating a translation or execution
(with the issuance of a diagnostic message).

These outcomes include:

  • Errors and exits during compilation or execution
  • Handling according to the specific implementation’s documentation, possibly without warnings
  • Completely unpredictable results

An example of undefined behavior is a buffer overflow. Consider the following program:

// a.c
#include <stdio.h>
int main() {
  int a[10] = {0};
  printf("a[10] = %d\n", a[10]);
  return 0;
}

Experience Undefined Behavior

Try compiling and running the above program on your system multiple times and observe the results.

In the above program, when calling printf(), a[10] is referenced, accessing an element that has exceeded the bounds of the array. According to the C language standard, the value of a[10] is undefined. Therefore, whether the compiler throws an error, the program runs with an error, or the program outputs 0 or garbage values, all of these behaviors conform to the C language standard.

Programs containing undefined behavior cannot be relied on to yield correct results, and may not even yield the same results when compiled and run multiple times. Conversely, if your program produces different results across multiple runs, after excluding external random factors, you may suspect that your program contains undefined behavior.

Why RTFM

The reason we quoted the original text from the C language standard above is to let everyone see the precise definitions of some concepts in the C language standard. You are unlikely to find similar concepts and their definitions in other materials. For instance, even if you have attended a C language course or learned C from other materials, you might be hearing about concepts like “undefined behavior” and “sequence points” for the first time.

This indicates that the C language textbooks and learning materials you have encountered are not the entirety of the C language. In fact, the C language standard precisely defines every detail of the C language, and we have reason to believe that, regarding understanding the C language, the creators of the C language standard have deeper insights than authors of books and blogs.

Of course, different individuals have different goals in learning the C language. If your goal is simply to write some basic programs in C or to pass exams, most textbooks and learning materials will suffice. On that basis, if you wish to delve deeper into learning C, or understand the fundamental principles of how computer systems work, or even build your own computer system, reading the C language standard manual will be your best choice.

The goal of OSOC is the latter, because processor chips differ from other chips in that their design goal is to run software programs. Therefore, we view processor chip design from the perspective of computer systems: not only do you need to learn to design a processor chip, but you also need to learn how to run programs on this processor chip to check if the programs run correctly and efficiently.

As your learning deepens, you will gradually encounter some questions that books and blogs cannot clarify. At that point, you need to realize that you have begun to enter the realm of professionals. Learning to read manuals is a passport to becoming a professional.

However, the content in manuals is extensive, and our suggestion for you to read the manuals is not to expect you to grasp all of it in a short time, but more importantly, to cultivate the awareness of reading manuals: when you want to thoroughly understand a problem, you should think of reading related content in the manual; even if you can find relevant answers online, you should consider how the manual describes it, unless those answers explicitly quote the manual (such answers are excellent!). If you can establish this awareness and put it into practice, your understanding of C language will already be in the top 1% of the industry.

Reference list of the aforementioned behaviors

Annex J of the C99 manual lists all the contents of the aforementioned four types of behaviors in the C language standard, which can be consulted if necessary.

Application Binary Interface (ABI)

We have learned that the specific implementation of the C language standard consists of a series of software, mainly compilers and runtime environments, where the compiler is responsible for generating the binary executable files of programs, and the runtime environment is responsible for supporting program execution. On one hand, the binary executable files of programs relate to the ISA, as programs are executed on the processors corresponding to the ISA; on the other hand, the runtime environment contains library functions and some functionalities of the operating system, which the program needs to interact with before, during, and after execution.

Thus, concepts like programs, compilers, operating systems, library functions, and ISA are interconnected as a whole within a computer system. This connection is generally expressed through agreements and specifications, which together form the Application Binary Interface (ABI); in other words, the ABI is the interface specification between a program at the binary level and the concepts mentioned above.

Due to its need to be compatible with various computer systems, the C language standard cannot precisely define the results of many behaviors. However, for a specific computer system, many conditions are fixed. As a set of agreements at the binary level for this computer system, ABI can be viewed as a document of a concrete implementation of the C language standard. Therefore, many choices regarding implementation-defined behavior in the C language standard will be documented within the ABI.

For example, for a specific ISA, the bit width of bytes and general registers is determined. Based on this, ABI can determine the range of values for integer types as defined in the C language standard, and the understanding of integer type ranges by the entire computer system will follow the ABI, thereby collectively supporting the execution of programs based on a consistent understanding.

As a specification, ABI includes:

  • Instruction sets of processors, register structures, stack organization, memory access types, etc.
  • Sizes, layouts, and alignment of basic data types that can be directly accessed by the processor
  • Calling conventions that specify how function parameters are passed and how return values are obtained
  • How applications initiate system calls to the operating system
  • Formats of object files, supported runtime libraries, etc.

We see again that the execution results of programs on a specific computer system are related to source code, compilers, runtime environments, ISA, hardware, etc.; and ABI is also an important reflection of the collaboration between software and hardware in a computer system, jointly supporting program execution. Currently, you only need to have a basic understanding of ABI. We will expand on the specific contents of ABI with RISC-V as an example in Phase D.
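
As a small illustration, the sizes printed by the sketch below are fixed by your platform's ABI rather than by the C language standard: on a 64-bit Linux system following the common LP64 convention it should print 4, 8 and 8, while a 32-bit ABI would give different answers:

#include <stdio.h>

int main() {
  printf("sizeof(int)    = %zu\n", sizeof(int));     // determined by the ABI's data model
  printf("sizeof(long)   = %zu\n", sizeof(long));
  printf("sizeof(void *) = %zu\n", sizeof(void *));
  return 0;
}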

RTFM

Of course, the C language standard is not the only manual worth reading: we also recommend that everyone read other manuals in their leisure time, including the RISC-V manual, Verilog manual, and RISC-V ABI manual, as they can provide the most comprehensive and authoritative help for understanding related details. If you find reading them very challenging, don't worry; as you accumulate more foundational knowledge with deeper learning, going back to read them will become smoother.

Instruction Set Simulator - A Program That Executes Programs

We have introduced the steps from C code to binary programs. To further understand the essence of program execution, we attempt to implement the process of program execution using the C language. Such a program used to execute other programs is called a "simulator". In the industrial design process of processors, simulators play a very important role. In the future, you will also use simulators to help you complete some important tasks in the processor design process, but for now, let’s focus on how to implement a simple simulator.

You have already tried to design a CPU in Logisim, where the process involves using a digital circuit state machine to implement the ISA state machine. To implement a simulator in C, we need to consider how to use the state machine of the C program to realize the ISA state machine. Therefore, we will first review the C program and ISA from the perspective of state machines:

                        C Program              ISA
State                   Variables              PC, GPR, Memory
Event                   Execute Statement      Execute Instruction
State Transition Rule   Statement Semantics    Instruction Semantics

To use the C program’s state machine to implement the ISA state machine, we need to develop a C program that includes the following functionalities:

  • Use the C program’s state to represent the ISA’s state, which means using the C program’s variables to represent the ISA’s PC, GPR, and memory.
  • Use the C program’s state transition rules to implement the ISA’s state transition rules, which means using C language statements to implement the semantics of instructions.

A simulator that only implements instruction behavior from the ISA perspective is called an instruction set simulator. Similarly, there are architecture simulators and circuit simulators, but we will not touch on those for now.

Basic Implementation of sEMU

Let’s illustrate how to implement an instruction set simulator using sISA as an example, referring to the corresponding instruction set simulator as sEMU (simple EMUlator). For details about sISA, please refer to previous lectures.

sEMU needs to use C program variables to represent the ISA’s PC, GPR, and memory, which is not difficult:

#include <stdint.h>
uint8_t PC = 0;
uint8_t R[4];
uint8_t M[16];

Here, although the PC at the ISA level is 4 bits wide, there is no 4-bit basic data type in C, so we use uint8_t. Moreover, a 4-bit PC means that an sISA program can contain at most 16 instructions, so the size of the array M used to implement memory only needs to be 16.

sEMU needs to use C language statements to implement the semantics of instructions, which is also not difficult. From the perspective of the instruction cycle, we need to write a function inst_cycle() to achieve the following functionalities:

  • Fetch - Directly index memory M based on PC to fetch an instruction.
  • Decode - Use C language bitwise operations to extract the opcode field of the instruction and check which instruction it belongs to; then extract the operand fields based on the instruction format and obtain the corresponding operands.
  • Execute - If the executed instruction is not bner0, write the result back to the destination register; otherwise, decide whether to jump based on the condition.
  • Update PC - If not jumping, increment PC by 1.

After implementing inst_cycle(), we just need to keep calling it in sEMU:

while (1) { inst_cycle(); }
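
For reference, one possible skeleton of inst_cycle() is sketched below. It only shows the structure of the four steps above and assumes the PC, R and M defined earlier; the field layout and the opcode names are placeholders, which you need to fill in according to the sISA encoding from the previous lectures:

// Skeleton only: the field layout and opcode names are placeholders.
void inst_cycle() {
  uint8_t inst = M[PC];            // fetch: index memory M with PC
  uint8_t op   = inst >> 6;        // decode: extract the opcode field (placeholder layout)
  int jump = 0;
  switch (op) {
    // execute: one case per instruction, for example
    //   case OP_XXX:   R[rd] = ...; break;                                    // write back the result
    //   case OP_BNER0: if (/* condition */) { PC = /* target */; jump = 1; } break;
    default: break;
  }
  if (!jump) PC = (PC + 1) & 0xf;  // update PC: the PC is only 4 bits wide
}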

The above has considered all the details of sISA, but to run a program in sEMU, we need to implement not only the ISA-level functionalities but also consider how to provide the corresponding runtime environment for the program. However, since the functionality of sISA is limited, it can only support the execution of a few programs. Therefore, we will discuss what support the runtime environment should provide, assuming it only runs the summation program. We will consider it from three aspects:

  • Before program execution begins: One question to consider is how to load the program. However, since sEMU only intends to run the summation program, we can directly initialize M:
    uint8_t M[16] = { ... };
    
    Since the summation program does not currently involve parameters during its execution, we can temporarily ignore the transmission of program parameters in the runtime environment.
  • During program execution: Since the summation program does not currently involve calling library functions, we can temporarily ignore the support for library functions in the runtime environment.
  • After program execution ends: Since the summation program indicates program completion through an infinite loop, which has already been implemented through the bner0 instruction at the instruction level, there is no need to provide additional functionality for program exit.

Implement sEMU

Based on the above ideas, implement sEMU in C code and run the previous summation program. Since the summation program itself will not end, you can modify the loop condition in the while statement to exit after a certain number of iterations, and then check if the summation result meets expectations.

Reinforcing the Runtime Environment

It can be seen that this runtime environment is much simpler than the one discussed in ./a.out. In fact, the C language standard defines two types of runtime environments:

one is called a “freestanding environment”, which typically has no operating system support, and programs usually run directly on bare-metal; the runtime environment discussed earlier for sEMU belongs to the freestanding environment.

The other is called a “hosted environment”, which refers to the situation where an operating system is running, and programs typically run on an operating system, supported by relevant components of the operating system; the runtime environment of ./a.out falls under the hosted environment.

It is easy to understand that if the program's ISA and the processor's ISA do not match, the program cannot run. However, even if the ISAs match, the program may still fail to execute; the runtime environment must also be taken into account. Suppose you implemented an x86 instruction set simulator similar to sEMU: you still could not run ./a.out on it, because ./a.out assumes it runs in the Linux hosted environment and needs the runtime environment to provide a series of supporting functionalities, whereas the runtime environment provided by your instruction set simulator is a freestanding environment, which does not meet the needs of ./a.out. Another example: if you copy a Windows program to Linux, it will also fail to run; although both Windows and Linux provide hosted environments, the specific support they provide differs.

From another perspective, in order to make the program’s functionalities richer, we can not only add more instructions but also provide a more powerful runtime environment for the program. Below we attempt to add more functionalities to sEMU.

First, we will add output functionality to sEMU, allowing programs running on it to output an integer. Specifically, you need to add a new instruction out rs to sISA, which will output R[rs] to the terminal upon execution. You can decide the encoding of this instruction and then implement it in sEMU.

One issue to consider is how to implement the “output to terminal” functionality, as the concept of “terminal” does not belong to the scope of sISA; precisely, it belongs to the scope of the Linux hosted environment. Note that while sISA programs run in the runtime environment provided by sEMU, sEMU itself also runs in the runtime environment provided by Linux. Therefore, we can utilize the functionalities of the Linux runtime environment to achieve the effect of “output to terminal”. Naturally, this functionality is provided by printf(), which means sEMU provides runtime environment support for programs running on it by calling printf(), thus implementing the agreed functionality of the out instruction. From the perspective of the runtime environment, this functionality belongs to the support during the program execution process.
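
In code, the support that sEMU provides for out can be as small as the sketch below (it assumes the register array R defined earlier; how it is invoked from inst_cycle() depends on the encoding you choose for the instruction):

#include <stdio.h>
#include <stdint.h>

// Sketch: runtime-environment support for the out instruction.
// sEMU calls printf() from the Linux hosted environment to print R[rs] to the terminal.
void do_out(uint8_t rs) {
  printf("%d\n", R[rs]);
}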

Implement output functionality

Based on the above, implement the out instruction in sEMU and modify the summation program so that after calculating the result, it outputs the calculation result using the out instruction. If your implementation is correct, you should see the terminal display 55.

Next, we will consider implementing parameter passing for the program. The previous summation program could only calculate 1+2+...+10. We want to take the last term of the sequence as a parameter, i.e., we want the user to input the parameter n from the command line, allowing the summation program to calculate 1+2+...+n. To pass n to the program, we need to let the runtime environment place n in a designated location before the program begins execution, so that the program can read n from that location to perform the calculation. A simple way to do this is to place this parameter in a register, for example, r0.
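
On the sEMU side, the Linux runtime environment already hands command-line arguments to main() via argc/argv, so placing the parameter into r0 can be sketched as follows (R and inst_cycle() refer to the definitions above):

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
  if (argc < 2) {
    fprintf(stderr, "usage: %s n\n", argv[0]);
    return 1;
  }
  R[0] = (uint8_t)atoi(argv[1]);   // the runtime environment places n in r0 before execution starts
  while (1) { inst_cycle(); }
  return 0;
}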

Implement parameterized summation

Modify the summation program to obtain the last term of the sequence from r0. Then modify sEMU to place the user input parameter in r0. After implementation, conduct simple tests of your implementation, for example, typing ./semu 10, the program should output 55; typing ./semu 15, the program should output 120. We assume that the user’s input will not cause the calculation process to overflow, so you do not need to consider how to handle result overflow.

After the above enhancements, we can summarize the runtime environment that sEMU provides to the program.

We consider it from three aspects:

  • Before the program execution begins, the program’s instruction sequence will be stored in memory. Additionally, the program parameters will be stored in the r0 register.
  • During program execution, the value of a specified register can be output as an integer via the out instruction.
  • After program execution ends, the program will enter an infinite loop.

Implementation of minirvEMU

You already know how to implement sEMU; similarly, you can implement an instruction set simulator for the previous minirv, which we will call minirvEMU. For details about minirv, please refer to previous lectures.

Implement minirvEMU for two instructions

Design a minirvEMU that supports addi and jalr instructions and allows it to run the previously executed two-instruction test program in Logisim.

Adding the remaining 6 instructions to minirvEMU is not difficult, but running the sum and mem programs requires additional considerations. Both of these programs have many instructions, making it inconvenient to manually initialize M.

To automatically initialize M, we can have minirvEMU read the program to be executed into M from a file. For example, for the sum program, we can use the following command to convert the previous .hex file into a binary .bin file:

tail -n +2 sum.hex | sed -e 's/.*: //' -e 's/ /\n/g' | sed -e 's/\(..\)\(..\)\(..\)\(..\)/\4 \3 \2 \1/' | xxd -r -p > sum.bin

After obtaining the .bin file, you can use file operations in C to read it into M and execute it.
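
The loading routine only needs the standard file API; one possible sketch is shown below, where the file name and the size of memory are up to your own implementation:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

// Sketch: load a .bin image into the memory array M before simulation starts.
void load_image(const char *path, uint8_t *M, size_t msize) {
  FILE *fp = fopen(path, "rb");
  if (fp == NULL) { perror(path); exit(1); }
  size_t n = fread(M, 1, msize, fp);        // read at most msize bytes into memory
  printf("loaded %zu bytes from %s\n", n, path);
  fclose(fp);
}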

Implement complete minirvEMU

Add the remaining 6 instructions to minirvEMU and update the program loading method, then run the previously executed sum and mem programs in Logisim. To determine if the program has successfully ended, you can refer to the previous judgment method used in Logisim.

Previously, we manually judged whether the programs running in Logisim ended successfully. One reason is that Logisim, as a GUI program, is not very convenient for adding personalized functionalities. In contrast, as a C program, we can attempt to have minirvEMU automatically determine whether the program has successfully ended.

Specifically, we can add an ebreak instruction and define its functionality as program termination. To implement automatic judgment of the program termination status, we let minirvEMU stop simulation after executing ebreak.

To ensure the correctness of the program termination status, we expand the functionalities of the runtime environment, agreeing to write an integer representing the termination status into the a0 register before the program ends, and then execute the ebreak instruction. If this integer is 0, it indicates that the program ended correctly; if it is not 0, it indicates that an error occurred.

On the one hand, we need to modify the instruction sequence at program termination so that it complies with this agreement; on the other hand, after minirvEMU executes ebreak, it can check the value of the a0 register to see whether the program ended successfully. For example, if a0 is 0, output the message HIT GOOD TRAP, indicating the program ended successfully; if a0 is not 0, output the message HIT BAD TRAP, indicating an error occurred.
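
In minirvEMU this check is only a few lines; the sketch below assumes the general-purpose registers are kept in an array R indexed by register number, so that a0 corresponds to R[10] under the RISC-V register numbering:

#include <stdio.h>

// Sketch: report the termination status after ebreak has been executed.
// Assumes the GPRs are stored in an array R indexed by register number (a0 is x10).
void report_halt(void) {
  if (R[10] == 0) printf("HIT GOOD TRAP\n");
  else            printf("HIT BAD TRAP\n");
}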

To modify the program to comply with the above runtime environment agreement, we can have minirvEMU write the encoding of the ebreak instruction into the correct position after reading the program into M. For the specific encoding of the ebreak instruction, please refer to the relevant documentation; the "correct position" refers to a location near the halt function in the program.

For example, you can write the ebreak instruction as follows:

M[? + 0] = ?;
M[? + 1] = ?;
// ......

You need to find the correct memory position and the encoding of the ebreak instruction, then replace the above ?.

Implement automatic judgment of program termination

Based on the above runtime environment agreement, add and implement the ebreak instruction in minirvEMU, then modify the instruction sequence of the program to execute the ebreak instruction upon termination. If your implementation is correct, you will see the program automatically terminate and output termination information via minirvEMU.

Of course, this process still involves quite a few manual operations, but currently, the number of programs we need to run is not many, and the overhead of these manual operations is still acceptable. As the number of programs that need to be run increases, we will still need to find a way to achieve fully automated judgment methods. We will continue to discuss this issue in the C phase.

Programs Supporting GUI Input and Output

You have previously run a program that displays images using the minirv processor in Logisim. Now, can you run this program in minirvEMU as well? However, the C programs you developed before can only perform input and output through the terminal. So, how are the GUI programs we commonly use developed? To establish a simple understanding, let’s first let you experience a classic game: Super Mario.

Run Super Mario

Read the introductory part of the lecture in PA1 titled “Before starting the joyful PA journey -> What is NEMU?” and try to run Super Mario according to the instructions given in the lecture. Complete checks for the screen, keys, and sound.

You will surely be curious about how Super Mario runs specifically. In fact, Super Mario itself is a sequence of instructions, but these instructions belong to an instruction set called 6502. This instruction set dates back to the 1970s and was popular in the 1980s; today it and its corresponding processors are rarely used. However, you did not run Super Mario on a real 6502 processor; so how does Super Mario actually run?

What actually happened is that you first ran a simulator program for the 6502 instruction set, called fceux. It simulates the execution of 6502 instructions in software, and thereby simulates the execution process of the entire Super Mario game!

Experience GUI Input and Output

But first, let’s consider the content related to GUI programs, and we’ll return later to discuss simulators. We introduce a runtime environment called abstract machine (AM for short), which provides some APIs (Application Programming Interface) in the form of library functions to implement some commonly used input and output functions in GUIs.

The fceux program achieves the operations and displays of Super Mario by calling these APIs. However, the behavior of fceux is still too complicated for everyone, so let’s start with some simple example programs to understand the input and output functions used in fceux.

Experience clock functionality

Run the clock test program using the following command:

cd am-kernels/tests/am-tests  
make ARCH=native mainargs=t run  

You will find that the program outputs a line of text every second on the terminal. Additionally, the program will pop up a new window that is completely black, but currently, it has no functionality; you don’t need to worry about it for now.

After experiencing the above functionality, try reading am-kernels/tests/am-tests/src/tests/rtc.c to understand how this functionality is implemented. In particular, the code io_read(AM_TIMER_UPTIME).us retrieves the time elapsed since the program started, in microseconds (us).

Experience key functionality

Run the key test program using the following command:

cd am-kernels/tests/am-tests  
make ARCH=native mainargs=k run  

You will find that the program pops up a new window that is completely black. When you press a key in the new window, you will see the corresponding key information output in the terminal, including the key name, keyboard code, and key status. After experiencing the above functionality, try reading am-kernels/tests/am-tests/src/tests/keyboard.c to understand how this functionality is implemented (you can currently ignore the uart-related functionality in the code). In particular, the code io_read(AM_INPUT_KEYBRD) obtains a key event ev, where ev.keycode indicates the key code and ev.keydown indicates whether the key is pressed or released. You can check the key code values in abstract-machine/am/include/amdev.h, which all start with AM_KEY_, such as AM_KEY_A for the A key. Specifically, AM_KEY_NONE indicates no key event.
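
To get a feel for this API outside of keyboard.c, the following sketch polls key events, prints them, and leaves the loop when ESC is pressed; it assumes the klib printf is available, as in the test programs, and can be built in the same way as the screensaver framework introduced below:

#include <am.h>
#include <klib.h>

int main() {
  ioe_init();
  while (1) {
    AM_INPUT_KEYBRD_T ev = io_read(AM_INPUT_KEYBRD);       // poll for one key event
    if (ev.keycode == AM_KEY_NONE) continue;               // no key event at the moment
    printf("keycode = %d (%s)\n", ev.keycode, ev.keydown ? "down" : "up");
    if (ev.keycode == AM_KEY_ESCAPE && ev.keydown) break;  // leave the loop on ESC
  }
  return 0;
}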

Experience display functionality

Run the display test program using the following command:

cd am-kernels/tests/am-tests  
make ARCH=native mainargs=v run  

You will find that the program pops up a new window and plays an animation.

After experiencing the above functionality, try reading am-kernels/tests/am-tests/src/tests/video.c to understand how this functionality is implemented. In particular, the code io_write(AM_GPU_FBDRAW, x, y, pixels, w, h, true) draws a w*h rectangular image at the screen coordinates (x, y). The image pixels are stored in pixels in row-major order, with each pixel described by a 32-bit integer in the 00RRGGBB format.
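
As a tiny illustration of this call, the sketch below draws a 2x2 red square at coordinates (10, 10) and then loops so that the window stays open; it can be built in the same way as the screensaver framework introduced below:

#include <am.h>
#include <klib-macros.h>

int main() {
  ioe_init();
  // four pixels in 00RRGGBB format, stored in row-major order
  uint32_t pixels[4] = { 0x00ff0000, 0x00ff0000, 0x00ff0000, 0x00ff0000 };
  io_write(AM_GPU_FBDRAW, 10, 10, pixels, 2, 2, true);   // draw a 2x2 block at (10, 10)
  while (1);                                             // keep the window open
  return 0;
}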

Design a Simple Screensaver

After understanding how to use GUI input and output in code, we will next implement a simple screensaver to learn how to develop a simple GUI program on AM.

First, let’s try to make the popped-up window display a certain color. We provide the following framework code:

#include <am.h>
#include <klib-macros.h>

void draw(uint32_t color) {
  // add code here
}

int main() {
  ioe_init(); // initialization for GUI
  while (1) {
    draw(0x000000ff);
  }
  return 0;
}

To use the features provided by AM, we need to write a Makefile that conforms to the AM specification, for example:

NAME = screensaver
SRCS = screensaver.c
include $(AM_HOME)/Makefile

If the filename of your saved C code is not screensaver.c, you need to modify the above Makefile according to the actual filename. Then compile and run using the following command:

make ARCH=native run

You should see a completely black new window popped up by the program, because you have not correctly implemented the void draw(uint32_t color) function yet.

We will introduce detailed information about the Makefile and AM in Phase D; for now, you only need to know how to use AM’s features to develop, compile, and run the program.

Implement display of a single color

Implement the draw() function described above, which will fill the popped-up window with the color indicated by the parameter color, with a resolution of 400x300.

After implementation, recompile and run. If your implementation is correct, you should see the popped-up window filled with blue.

The desired effect of the screensaver we need to implement is to create a gradient between multiple colors on the screen. The gradient algorithm needs to linearly interpolate colors between the current color and the target color, and display them sequentially to achieve the gradient effect. Assume the current color is C_0 and that after n steps it becomes the target color C_n, where each color component value is in [0, 255]; then the color to be displayed at step k (1 ≤ k ≤ n) should be:

C_k = C_0 + (C_n - C_0) * k / n

where the interpolation is applied to each of the R, G and B components independently.
After one round of gradient ends, choose the next target color for the next round of gradient. The target colors can be randomly selected from the following colors:

0x000000, 0xff0000, 0x00ff00, 0x0000ff, 0xffff00, 0xff00ff, 0x00ffff, 0xffffff

We then only need to use clock-related code to control the interval at which the screen color is updated, so as to achieve the gradient effect.
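
The interpolation itself only needs integer arithmetic; one possible helper is sketched below, applying the formula above to each of the R, G and B components of a 00RRGGBB color:

#include <stdint.h>

// Sketch: color to display at step k of an n-step gradient from c0 to cn.
uint32_t interpolate(uint32_t c0, uint32_t cn, int k, int n) {
  uint32_t result = 0;
  for (int shift = 0; shift <= 16; shift += 8) {   // B, G and R components in turn
    int a = (c0 >> shift) & 0xff;
    int b = (cn >> shift) & 0xff;
    int v = a + (b - a) * k / n;                   // C_k = C_0 + (C_n - C_0) * k / n
    result |= (uint32_t)v << shift;
  }
  return result;
}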

Implement color gradient effect

Implement the color gradient effect as described above. You can decide how long each gradient round lasts and how many steps there are in one round of gradient.

Finally, let’s make use of the key functionality. Specifically, we hope to achieve the following key functionalities:

  • If the ESC key (key code AM_KEY_ESCAPE) is pressed, the program will exit.
  • If any other key is held down, the gradient speeds up (one round of gradient takes less time); after the key is released, the gradient speed returns to normal.

Add key effects

Add key effects to the screensaver program as described above.

How screensavers protect screens

In the past, people used CRT displays, which work by an electron gun firing high-speed electrons at the phosphor layer of the screen to make it emit light and display images. However, if the screen displays the same image for a long time, the electron beam bombards the same spot on the phosphor layer for too long, causing that spot to age prematurely, or even burning through the phosphor layer and damaging the screen. Screensavers prevent this damage by constantly changing the displayed image, thereby changing the position and intensity of the electron beam's bombardment.

However, nowadays, people mostly use LCD displays (liquid crystal displays), which operate differently from CRT displays, and past screensavers can no longer serve their protective purpose. The above mandatory tasks are merely a simple exercise and do not consider their protective functionality. In fact, a static screen usually indicates that the user has not operated the computer for a long time; in this case, the best practice is to turn off the display.

After implementing the screensaver, you should have a basic understanding of the GUI program development process. However, we have not yet explained how io_read() and io_write() actually implement input and output: why does calling io_read(AM_INPUT_KEYBRD) obtain key information? Why does calling io_write(AM_GPU_FBDRAW, x, y, pixels, w, h, true) draw pixels on the screen? In a real computer system, this requires coordination among multiple layers of abstraction, including runtime environments, instruction sets, processors, buses, device controllers, and I/O devices. As your learning deepens, you will learn about each detail of this process in Phase B.

Develop Your Own Mini-Game

If you are interested, you can also use the above APIs to develop your own mini-game! A mini-game is essentially just a loop:

while (1) {
  wait_for_a_new_frame();
  handle_user_keys();
  update_game_logic();
  draw_a_new_screen();
}

The screensaver you just developed is fundamentally also such a loop, but its game logic is very simple, consisting only of handling a gradient algorithm. Moreover, am-kernels also provides some examples of mini-games:

  • am-kernels/kernels/typing-game - Typing game
  • am-kernels/kernels/snake - Snake game

Optional: Develop your own Mini-Game

Try to conceive and design a mini-game. On one hand, this mini-game can run on the processor you design; on the other hand, we are also collecting fun and interesting mini-games for long-term use. If your designed game is excellent enough, it may be included in our maintained game collection for other students to enjoy, and even showcased on real chips.

Some useful resources:

  • Displaying characters: am-kernels/kernels/typing-game/font.c provides bitmap for characters; please refer to the relevant code to understand how to use it.
  • Displaying images: You can use the command convert abc.jpg abc.txt to convert an image into a text file of enumerated pixels, where each pixel’s color code is listed in order. You can further process this text file with scripts to convert it into a C array.

If you plan to submit your designed game to us, please contact the teaching assistant. However, to ensure your game can run in other students’ environments, please additionally ensure the following content:

  • Portability:
    • Do not use floating-point numbers in the code.
      • When needing to perform calculations related to decimals, scale them to integers first, for example, to handle three decimal places, scale it by 1000 before processing.
      • When calculating elementary functions like sin, approximate them using power series expansion.
    • When needing a 64-bit integer type, use uint64_t or int64_t.
    • When using integer types to store pointers, use uintptr_t or intptr_t.
  • Playability: After entering the game, if the player does not operate for some time, the game will fail.
  • Exit Capability: Pressing the ESC key calls halt(0) to exit the program.

Add Graphic Display Functionality to minirvEMU

Since you have designed a screensaver, adding graphic display functionality to minirvEMU should not be difficult. Specifically, you need to define an additional array to store the pixel information the program writes, and determine in the implementation of the sw instruction whether the address falls within the framebuffer's address range, which starts at 0x20000000 (its upper bound is 0x20000000 plus the size of the framebuffer). If so, redirect the write to the above array. After the program finishes running, use the APIs provided by AM to display these pixels on the screen.

Add graphic display functionality

Following the above ideas, add graphic display functionality to minirvEMU, and then run the previously executed vga program in Logisim.

Some hints:

  • To use the features provided by AM, you can refer to the relevant code of the screensaver described above.
  • To prevent minirvEMU from exiting immediately after displaying the image, let it enter a loop.

Although the current minirvEMU only runs a program that displays an image, it is not hard to imagine that if it supports more instructions and provides more complete input and output functionality such as clock and keys, you will have a more powerful simulator, similar to fceux, that could run Super Mario or even games you develop yourself! By Phase D, you will achieve this goal!

Moreover, as a runtime environment, AM allows us to compile programs that can end via the ebreak instruction without having to manually write the ebreak instruction at the correct position in minirvEMU. We will elaborate on the details of AM in Phase D.

Finally, you should feel that creating a simulator in C to execute minirv instructions is much simpler than developing a processor in Logisim: we only need to consider how to implement the behavior of instructions in C, without having to consider circuit implementations or wiring details.

In fact, implementing a real processor requires consideration of more factors, including timing, area, and power consumption; you will gain a deeper understanding of this in Phase B. However, from a learning perspective, it is precisely because implementing a simulator does not require consideration of these factors that it becomes a very good way to understand instruction sets and program behavior.

In the processor design process in the industry, simulators also play an important role in functional validation, program analysis, and debugging, and some more detailed architecture simulators are important components in exploring processor design space. As you delve deeper into your studies, you will gradually appreciate the significance of simulators.