Programs and memory location

We have implemented a multi-tasking system for multi-programming. However, we encountered an issue when attempting to run the second user process. If you recall, when compiling programs in Navy-apps, we linked them to memory addresses 0x3000000 or 0x83000000. This is the reason for the problem: when loading the second user process, we overwrite the content of the first process!

The fundamental reason for the previous issue is that the condition of "multiple processes coexisting in memory simultaneously" is being violated. Clearly, we need to manage memory instead of allowing multiple processes to use it freely. As we mentioned in PA3, an operating system has the obligation to manage system resources. In a multitasking operating system, memory, as a resource, naturally needs to be managed. The operating system must keep track of memory allocation. When it's time to run a new process, it should allocate an unused memory area and load the process into that memory location.

However, the free memory location is specified by the operating system's loader at the time of loading. But can the process code really run correctly at this memory location?

Absolute code

Typically, the memory location of a program is determined at link timeopen in new window (as is the case with programs in Navy-apps). Some earlier programmers even used absolute addresses for memory access within their programs. These two types of code are referred to as absolute code. Absolute code assumes that objects (functions and data) in the program reside at fixed locations. For instance, after link-time relocation, the program might assume that a variable a resides at memory location 0x83001234. If we load the program into a different location, the actual location of variable a changes accordingly. However, the absolute code within the program still assumes that variable a resides at memory location 0x83001234, thus attempting to access 0x83001234 would not access the actual variable a. Clearly, absolute code can only run correctly at fixed memory locations.

However, the vacant memory location allocated by the operating system at load time does not always guarantee correct execution for such programs. Therefore, one solution to this problem is for the operating system to keep track of the program's load location. When a program attempts to load into a memory location that is already in use, the loading will fail, and the operating system will return an error. To avoid loading failures, one approach is to maintain multiple versions of the program with different load addresses, hoping that one version can be successfully loaded.

Relocatable code

But this is too cumbersome. Why do we have to determine a program's load position in advance? If we delay the relocation phase until later, during linking, wouldn't we break the limitations of absolute code?

So, some programmers developed a category of special programs called "self-relocatingopen in new window" programs. These programs, at the beginning of execution, relocate themselves to another memory location before starting actual execution. This type of relocation is called "run-timeopen in new window relocation."

In PA1, we mentioned the concept of BIOS. In fact, programs in BIOS are precisely this type of self-relocating programs. This is because BIOS is typically embedded in ROM, and data in ROM cannot be written to. When the computer starts up, it executes programs from the BIOS. However, the BIOS program immediately copies itself to another memory location, performs some relocation work, and then truly begins execution. At this point, the data resides in writable RAM, allowing the BIOS program to perform write operations on it.

But for a multitasking operating system, this doesn't truly solve the problem because the program at runtime doesn't know if the target memory location for relocation is free.

Since only the operating system knows whether memory is free, let's just have the loader perform relocation. This gave rise to the concept of "load-timeopen in new window relocation." Specifically, the loader will allocate a free memory location, load the program into that location, relocate the program to that memory location, and then execute the program. This is how GNU/Linux inserts kernel modulesopen in new window today.

Position-independent code

To some extent, load-time relocation introduces additional overhead: if a program needs to be executed multiple times, it will require multiple load-time relocations. In the early days of computing, when computers were slower, the overhead of load-time relocation was considered significant and could not be ignored. Additionally, load-time relocation requires processing the entire program. If the program is large, the overhead becomes even more pronounced.

Is there a way to save the overhead of relocation or even avoid relocation altogether? However, link-time relocation may result in absolute code, which is not desirable. If all addressing within the program is done relative to the program's location, the program can be loaded and executed at any position without encountering issues with absolute code.

This is the basic idea of Position-Independent Code (PICopen in new window). Nowadays, dynamic libraries are all PIC, enabling them to be loaded into any memory location. Additionally, if an executable file consists entirely of PIC, it has a new name called Position-Independent Executable (PIEopen in new window). Compilers can generate PIE by specific options. Unlike regular programs, PIE also provides some interference against malicious attack programs to some extent: malicious programs cannot assume the address where PIE will run in advance.

Also, due to this security-related feature, many recent GNU/Linux distributions come with gcc configured to default to generating Position-Independent Executables (PIE). However, using relative addressing can increase the code size of the program and may have some performance implications. But for early computers, where memory was a precious resource, any performance reduction was undesirable. Therefore, when considering Position-Independent Code (PIC) and Position-Independent Executables (PIE), people tend to weigh their options carefully.

Implementing a PIE-based loader (suggested second round consideration)

There's no such thing as a free lunch. The reason PIE achieves position independence is actually relying on a data structure within the program called the Global Offset Table (GOT)open in new window. To run PIE correctly, the loader needs to fill the GOT with the correct content when loading the program.

Students who are interested can make the Nanos-lite loader support PIE. Of course, this requires understanding some details related to ELF, for which specific details can be found in the ABI manual.

If all programs in the world were relocatable or Position-Independent Code (PIC), the problem of memory overlay would be solved. However, there are always some programs that contain absolute code. Considering compatibility issues, is there a better, more permanent solution?

The Magic of Intertwining Reality and Fantasy

We just considered the program itself, so we couldn't avoid the issue of absolute code. To solve this problem, we need to think from another perspective - memory - . We know that a program goes through four stages: compilation, linking, loading, and running. After the absolute code is compiled and linked, the memory addresses seen by the program are determined. During loading and running, the program uses these memory addresses to ensure correct operation. One attempt is to decouple the memory seen by the program from the memory it actually uses at runtime. This is the idea behind virtual memory.

Virtual memory is an abstraction layer above real memory (also known as physical memory) that is dedicated to processes. With virtual memory, a process only needs to think it is running on virtual addresses. When actually running, the virtual addresses are mapped to physical addresses. By linking a program to a fixed virtual address and loading it into different physical addresses, and maintaining the mapping between virtual addresses and physical addresses, we can permanently solve the above problems!

So, during the execution of a process, who maps the virtual address to the physical address? We have already learned about the lifecycle of an instruction in PA1:

while (1) {
  Fetch the instruction from the memory location indicated by PC;
  Execute the instruction;
  Update PC;
}

If a virtual memory mechanism is introduced, then PC becomes a virtual address. We need to complete the mapping from virtual address to physical address before accessing the memory. Although the operating system manages all the resources in the computer, from the computer's perspective, it is just another program. As a program running on the computer, the operating system cannot interfere with the specific process of instruction execution. Therefore, it is impossible for the operating system to map virtual addresses to physical addresses.

Therefore, performing this mapping in hardware is the only choice: We add a new hardware module MMU (Memory Management Unit) between the processor and memory. It is the core of the virtual memory mechanism and carries the most important address mapping function of this mechanism. It should be noted that the statement "MMU is located between the processor and memory" is only a conceptual description. In fact, the virtual memory mechanism is so important in modern computers that the MMU is physically implemented inside the processor chip.

However, only the operating system knows the specific physical addresses to which virtual addresses should be mapped. Therefore, the virtual memory mechanism is a mechanism that can only work through collaboration between software and hardware: The operating system is responsible for managing physical memory and deciding which physical addresses to map the virtual addresses of a process to when loading the process. Before the process actually runs, the MMU must be configured to implement the previously determined mapping in hardware. When the process runs, the MMU will perform address translation, mapping the virtual addresses of the process to the physical addresses desired by the operating system. It is worth noting that this mapping is process-specific: different processes have different mappings, which means that for different processes, the same virtual address may be mapped to different physical addresses. This neatly and permanently solves the problem of memory overlap. Most multitasking operating systems work in this way.

Segmentation

Regarding how the MMU specifically performs address mapping, there are currently two main mainstream methods. The simplest method is that the physical address equals the virtual address plus an offset. This most basic method is known as the segment-based virtual memory management mechanism, or simply segmentation. Intuitively, this means dividing physical memory into several segments. Different processes are placed in different segments to run. The processes do not need to worry about which segment they are in. As long as the operating system uses different offsets for different processes, there will be no interference between them.

The implementation of the segmentation mechanism in hardware can be very simple, requiring only the implementation of a segment base register in the MMU. When the operating system runs different processes, it sets different values in the segment base register. The MMU will add the segment base to the virtual address used by the process to generate the actual physical address used to access memory. This achieves the goal of "using different segments for different processes." The teaching operating system Minix works in this way, and some simple embedded systems and real-time systems also manage virtual memory through the segmentation mechanism.

In reality, the segmentation mechanism in processors can be much more complex. For example, the i386, due to historical reasons and for compatibility with its predecessor, the 8086, was forced to introduce concepts such as segment descriptors, segment selectors, Global Descriptor Table (GDT), Global Descriptor Table Register (GDTR), etc. Besides the segment base, the segment descriptor also describes attributes such as the segment's length, type, granularity, access permissions, and so on. To compensate for the performance issues of segment descriptors, concepts like descriptor cache were added... We can witness the elegance of the i386 segmentation mechanism:

           15              0    31                                   0
  LOGICAL +----------------+   +-------------------------------------+
  ADDRESS |    SELECTOR    |   |                OFFSET               |
          +---+---------+--+   +-------------------+-----------------+
       +------+         V                          |
       | DESCRIPTOR TABLE                          |
       |  +------------+                           |
       |  |            |                           |
       |  |            |                           |
       |  |            |                           |
       |  |            |                           |
       |  |------------|                           |
       |  |  SEGMENT   | BASE          +---+       |
       +->| DESCRIPTOR |-------------->| + |<------+
          |------------| ADDRESS       +-+-+
          |            |                 |
          +------------+                 |
                                         V
              LINEAR  +------------+-----------+--------------+
              ADDRESS |    DIR     |   PAGE    |    OFFSET    |
                      +------------+-----------+--------------+

It is quite confusing at first glimpse.

In NEMU, what do we need to understand? Nothing. Most modern operating systems no longer use the segmentation mechanism, and even the i386 manual mentions ways to "bypass" it to improve performance: Set the segment base address to 0 and the length to 4GB, so it seems as if there is no concept of segments. This is what the i386 manual refers to as "flat mode." Of course, "bypassing" here does not mean simply turning off the segmentation mechanism (which is actually not possible). The concept of privilege levels mentioned in PA3 regarding the i386 protection mechanism is actually provided by the i386 segmentation mechanism, and it would be unwise to discard it. However, we don't plan to implement the protection mechanism in NEMU, so we won't be incorporating various concepts of the segmentation mechanism into NEMU either.