E5 From RTL Code to Tapeout-Ready Layout

You have already learned how to write Verilog through the online learning platform HDLBits. To design your own processor, however, you will need to use Linux as the development environment.

With a Linux development environment in place, you can carry out more steps of the chip design process. Writing RTL code in Verilog is only one part of the overall flow - the logic design stage. Beyond that, the flow involves further tasks such as verification, evaluation, and physical design:

  • Functional Verification - Simulating the RTL to check whether the described circuit behaves as expected
  • Circuit Evaluation - Converting the RTL code into a gate-level netlist using standard logic cells
  • Physical Design - Transforming the netlist into a tapeout-ready layout for fabrication

RTL Simulation - Functional Verification

It is not hard to develop RTL code on Linux, as long as we have a text editor. However, once the RTL code is written, we also need a simulator to check whether the circuit it describes behaves as we expect.

Basic Principles of RTL Simulations

The essence of RTL simulation is to use a software program to mimic the behavior of a hardware circuit. Therefore, to implement RTL simulation, we need to consider how to use the state machine of a C program to implement the state machine of the digital circuit. To do this, let's first review how a C program and a digital circuit each look from the state machine perspective:

                          C Program                  Digital Circuit
  State                   Variables                  Sequential Logic Circuit
  Stimulus Event          Executing Statements       Processing Combinational Logic
  State Transition Rule   Semantics of Statements    Combinational Logic

To implement a digital circuit state machine using a C program state machine, we need to develop a C program with the following functionalities:

  • Implement the states of the digital circuit using the states of the C program, that is, implement the sequential logic circuit using variables in the C program.
  • Implement the state transition rules of the digital circuit using the state transition rules of the C program, that is, implement the logic of the combinational logic circuit using C language statements.

Let's examine an example of a running light. A running light is a group of lights that turn on and off in sequence. The following is the Verilog code for a running light module:

module light(
  input clk,
  input rst,
  output reg [15:0] led
);
  reg [31:0] count;
  always @(posedge clk) begin
    if (rst) begin led <= 1; count <= 0; end
    else begin
      if (count == 0) led <= {led[14:0], led[15]};
      count <= (count >= 5000000 ? 32'b0 : count + 1);
    end
  end
endmodule

Following the Verilog code above, we can write a C program that simulates this circuit at the RTL level.

#include <stdio.h>
#include <stdint.h>

#define _CONCAT(x, y) x ## y
#define CONCAT(x, y)  _CONCAT(x, y)
#define BITMASK(bits) ((1ull << (bits)) - 1)
// similar to x[hi:lo] in verilog
#define BITS(x, hi, lo) (((x) >> (lo)) & BITMASK((hi) - (lo) + 1))
#define DEF_WIRE(name, w) uint64_t name : w
#define DEF_REG(name, w)  uint64_t name : w; \
                          uint64_t CONCAT(name, _next) : w; \
                          uint64_t CONCAT(name, _update) : 1
#define EVAL(c, name, val) do { \
                             c->CONCAT(name, _next) = (val); \
                             c->CONCAT(name, _update) = 1; \
                           } while (0)
#define UPDATE(c, name)    do { \
                             if (c->CONCAT(name, _update)) { \
                               c->name = c->CONCAT(name, _next); \
                             } \
                           } while (0)

typedef struct {
  DEF_WIRE(clk, 1);
  DEF_WIRE(rst, 1);
  DEF_REG (led, 16);
  DEF_REG (count, 32);
} Circuit;
static Circuit circuit;

static void cycle(Circuit *c) {
  c->led_update = 0;
  c->count_update = 0;
  if (c->rst) {
    EVAL(c, led, 1);
    EVAL(c, count, 0);
  } else {
    if (c->count == 0) {
      EVAL(c, led, (BITS(c->led, 14, 0) << 1) | BITS(c->led, 15, 15));
    }
    EVAL(c, count, c->count >= 5000000 ? 0 : c->count + 1);
  }
  UPDATE(c, led);
  UPDATE(c, count);
}

static void reset(Circuit *c) {
  c->rst = 1;
  cycle(c);
  c->rst = 0;
}

static void display(Circuit *c) {
  static uint16_t last_led = 0;
  if (last_led != c->led) { // only update display when c->led changes
    for (int i = 0; i < 16; i ++) {
      putchar(BITS(c->led, i, i) ? 'o' : '.');
    }
    putchar('\r');
    fflush(stdout);
    last_led = c->led;
  }
}

int main() {
  reset(&circuit);
  while (1) {
    cycle(&circuit);
    display(&circuit);
  }
  return 0;
}

The program implements the sequential logic elements through the structure variable circuit, which includes led and count. Although clk and rst do not belong to the sequential logic, they are inputs to the circuit and therefore also part of the circuit state, so they are included in the structure as well. In addition, the program implements the combinational logic through C language statements. As you can see, the body of the cycle() function is essentially a direct translation of the corresponding Verilog code, except for the introduction of intermediate variables with the suffix _next and update flags with the suffix _update. These are introduced to implement the semantics of Verilog non-blocking assignments: the update of the corresponding signals must wait until the end of the cycle. Therefore, during simulation, the results computed by the combinational logic are first stored in these intermediate variables and the update flags are set; only at the end of a cycle, according to the update flags, are the results actually written into the variables that model the sequential logic elements.
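
To see why this two-phase update matters, consider a pair of registers that exchange their values with a <= b; b <= a;. The short C++ sketch below (with hypothetical variable names) contrasts a naive immediate update, which behaves like blocking assignments and loses one value, with the evaluate-then-update scheme used by the _next/_update variables above.

#include <cstdio>

int main() {
  // Naive immediate update: the second assignment already sees the new value of a,
  // which matches blocking-assignment semantics and destroys the swap.
  int a = 1, b = 2;
  a = b;            // a becomes 2
  b = a;            // b also becomes 2
  printf("immediate update: a = %d, b = %d\n", a, b);

  // Two-phase update: evaluate both right-hand sides first, then update at the
  // end of the cycle, which is what the _next/_update scheme implements.
  a = 1; b = 2;
  int a_next = b;   // evaluation phase
  int b_next = a;
  a = a_next;       // update phase at the end of the cycle
  b = b_next;
  printf("two-phase update:  a = %d, b = %d\n", a, b);
  return 0;
}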

The while loop in the main() function reveals the main process of RTL simulation: By continuously executing the cycle() function, it realizes the function of "calculating and updating the new state based on inputs and the current state". This function is actually the essence of how digital circuits work. However, before entering the while loop, it is necessary to reset the circuit through the reset() function. In addition, although the display() function does not belong to the circuit itself, it is used to demonstrate the circuit's functionality. The display() function outputs corresponding characters according to the state of each bit of the led signal, thereby showing the running light effect on the terminal.

The above simulation program is written by hand. However, if developers had to manually write a corresponding simulation program for every circuit design in order to verify it, that would be a significant burden. For this reason, developers generally use an RTL simulator to automatically convert RTL code into a C program that simulates the circuit's behavior. This C program is the circuit simulation program corresponding to the RTL code.

STFW + RTFM to Build The Verilator Simulation Environment

Verilator is an open-source Verilog simulator that you will use for RTL functional simulation. The framework code provides an npc directory by default, where "npc" stands for "New Processor Core". You will design your own processor in this directory in the future. The processors designed by everyone are collectively referred to as NPC, but you can certainly give your own processor a more personal name. Regardless of the name, you need to run the following command to set the environment variable NPC_HOME:

cd ysyx-workbench
bash init.sh npc

This environment variable will be used in the future. There are some simple files in the npc directory:

ysyx-workbench/npc
├── csrc
│   └── main.cpp
├── Makefile
└── vsrc
    └── example.v

Currently, these three files are almost empty. We will guide you through setting up the Verilator simulation environment and writing two simple digital circuit modules for simulation.

There isn't even a simulation framework? Lame!

The reason we include this part of the experiment is to let everyone understand that every detail in the project is relevant to you. In previous course experiments, everyone more or less assumed that the framework should naturally be provided by the teaching assistants; doing the experiment just meant writing the corresponding code in the designated places, and all other code and files were irrelevant and required no attention. In fact, such an approach to experiments is very dangerous. It not only fails to train you into a truly professional engineer, but also makes it impossible for you to survive in real projects:

  1. When encountering systematic bugs, you will be unable to fix them, because even the modules that call your code are considered irrelevant to you, let alone a clear understanding of the entire project's architecture and every detail within it.
  2. Without lecture notes, you can't do anything, because you are always waiting for others to tell you exactly what to do next and how to do it, just as these lecture notes do, instead of analyzing from the project's perspective what needs to be done.

A very realistic scenario is that when you join a company or a research group in the future, there will no longer be lecture notes or framework codes to assist you. If your boss says "Come and try Verilator", you have to get Verilator up and running by yourself, write a usage report, and present your work to the boss at the group meeting next week. Therefore, we hope to provide you with more realistic training: set a goal, and let you learn to break down the goal and achieve it step by step with your own skills. Building a Verilator simulation framework is actually a goal that can be easily achieved, so it is also very suitable as a small training to test your abilities.

If you want to use Chisel

Chisel can generate functionally equivalent Verilog code, which can then be simulated using Verilator. For now, we will focus on the usage of Verilator. If you wish to use Chisel, we recommend that you first set up the Verilog workflow as described in the lecture notes, and then switch to Chisel.

Let's begin.

Familiarize with Verilator

This is probably the first time you have heard of Verilator, and that's quite normal. It is also normal to want to learn more about it. However, it is inappropriate if your first reaction is to ask others. In fact, Verilator is so well-known in the simulation field that you can easily find relevant information about it on the Internet. You need to find its official website through STFW and then read the relevant introduction.

After finding and reading the relevant information, it is time to try running it. But before that, we need to install it.

Installing Verilator

Find the installation steps on the official website and install Verilator via git accordingly. The reason we don't install it with apt-get is that the version it provides is relatively old. In addition, to keep everyone on the same version, you need to install version 5.008 via git. For this you may need to perform some simple git operations; if you are not familiar with them, look for a git tutorial and learn. Moreover, it is better to carry out this operation in a directory outside ysyx-workbench/, otherwise git will track the Verilator source code and occupy unnecessary disk space. After the installation, run the following command to check whether it succeeded and whether the version is correct.

verilator --version

Verilator compiles Verilog code into C++ files, which are then compiled into an executable file. Simulation is carried out by running this executable.

Running An Example

The Verilator manual contains a C++ example. You need to find this example in the manual and follow its steps. You have already learned the C language; to use Verilator, you don't need to understand complex C++ syntax, only some basic usage of classes. From this point of view, many materials on the Internet can meet your needs.

Example: Two-way Switch (Combinational Logic Circuit)

The example in the manual is very simple and doesn't even qualify as a real circuit module. Next, we'll write a real circuit module, a two-way switch, for testing. Write the following Verilog code:

module top(
  input a,
  input b,
  output f
);
  assign f = a ^ b;
endmodule

One application of a two-way switch is to jointly control the on/off state (f) of the same lamp through two switches (a and b). Unlike the example in the manual, this module has input and output ports. To drive the input ports and obtain results from the output ports, we need to modify the while loop in the C++ file:

// The following is pseudocode

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

while (???) {
  int a = rand() & 1;
  int b = rand() & 1;
  top->a = a;
  top->b = b;
  top->eval();
  printf("a = %d, b = %d, f = %d\n", a, b, top->f);
  assert(top->f == (a ^ b));
}

In one loop iteration, the code will randomly generate two 1-bit signals to drive the two input ports. Then, it will update the circuit state using the eval() function, allowing us to read and print the values from the output port. To automatically verify the correctness of the results, we will check the output using assert() statements.
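
For reference, a complete harness for this module might look roughly like the sketch below. It assumes the top-level module is named top, so Verilator generates a Vtop class in Vtop.h; the exact command-line options for generating and building the C++ model are described in the manual.

#include "Vtop.h"       // model generated by Verilator from top.v (module `top` assumed)
#include "verilated.h"
#include <cassert>
#include <cstdio>
#include <cstdlib>

int main(int argc, char **argv) {
  Verilated::commandArgs(argc, argv);
  Vtop *top = new Vtop;
  for (int i = 0; i < 100; i++) {   // run a fixed number of iterations instead of looping forever
    int a = rand() & 1;
    int b = rand() & 1;
    top->a = a;
    top->b = b;
    top->eval();                    // evaluate the combinational logic
    printf("a = %d, b = %d, f = %d\n", a, b, (int)top->f);
    assert(top->f == (a ^ b));
  }
  delete top;
  return 0;
}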

Simulate the two-way switch module

Try to simulate the two-way switch module in Verilator. Since the top-level module name is different from the example in the manual, you need to make some corresponding modifications to the C++ file. In addition, this project has no statement to indicate the end of the simulation. To exit the simulation, you need to type Ctrl+C.

What does the above code mean?

If you don't know how to modify it, it means you are not very familiar with writing C programs. You should go back to the previous section to review C language.

Printing and Viewing Waveforms

Viewing waveform files is one of the common methods for RTL debugging. Verilator supports waveform generation, and you can view waveforms using the open-source waveform viewer GTKWave.

Generate and View Waveforms

The Verilator manual has already introduced the method for generating waveforms. You need to read the manual to find the relevant content, then follow the steps in the manual to generate the waveform file, and install GTKWave using

apt-get install gtkwave

to view the waveforms.
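
As a rough illustration of what the tracing setup in the manual looks like, the sketch below assumes the two-way switch module from earlier, Verilator's VCD tracing API (verilated_vcd_c.h), and that the model was generated with tracing enabled (the --trace option); check the manual for the exact steps and for the FST variant (--trace-fst with VerilatedFstC).

#include "Vtop.h"
#include "verilated.h"
#include "verilated_vcd_c.h"   // VCD tracing support
#include <cstdint>
#include <cstdlib>

int main(int argc, char **argv) {
  Verilated::commandArgs(argc, argv);
  Verilated::traceEverOn(true);          // enable tracing in the runtime
  Vtop *top = new Vtop;
  VerilatedVcdC *tfp = new VerilatedVcdC;
  top->trace(tfp, 99);                   // trace up to 99 levels of hierarchy
  tfp->open("wave.vcd");                 // hypothetical output file name
  for (uint64_t t = 0; t < 32; t++) {
    top->a = rand() & 1;
    top->b = rand() & 1;
    top->eval();
    tfp->dump(t);                        // record the signal values at "time" t
  }
  tfp->close();
  delete top;
  return 0;
}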

With so much content in the manual, how to find it?

Try pressing Ctrl+F.

Do not generate waveforms for a long time

Waveform files generally occupy a lot of disk space. Generating waveforms for a long time may lead to disk space exhaustion, which can cause the system to crash.

Generate FST format waveforms

The size of an FST waveform file is roughly 1/50 of that of a VCD file, but the format is only supported by GTKWave. Nevertheless, we still recommend using it. Refer to the Verilator manual to learn how to generate FST format waveforms.

Writing Makefile

One-click Simulation

Repeatedly typing compile and run commands is inconvenient. Try to write a sim rule for npc/Makefile to implement one-click simulation, such that typing make sim will execute the above simulation.

Note to Preserve Git Tracking Commands

The framework code has already provided a default sim rule in npc/Makefile, which includes the command for git tracking: $(call git_commit, "sim RTL"). When writing the Makefile, be careful not to modify this command, as it will affect the development tracking function, which is an important basis for recording the originality of the "One Student One Chip" results. Therefore, after writing the Makefile and running it, you also need to confirm whether git has correctly tracked the simulation records.

Integrating NVBoard

NVBoard (NJU Virtual Board) is a virtual FPGA board project developed by Nanjing University for teaching purposes. It can provide a virtual board interface in an RTL simulation environment, supporting functions such as DIP switches, LED lights, VGA displays, etc. In scenarios where speed requirements are not high, it can completely replace a real FPGA board (after all, not everyone has an FPGA at hand). Obtain the NVBoard code using the following command:

cd ysyx-workbench
bash init.sh nvboard

Running The NVBoard Example

Read the README.md of NVBoard, and try to run the provided example.

Not sure how NVBoard works?

Try starting with the make command to see how everything happens. With the knowledge you've gained from previous studies, you already have a sufficient background to understand how NVBoard operates: the use of Makefiles, the C language, and the basic usage of classes in C++. Now, try reading the code (Makefiles are also code) to see how the Verilog top-level ports, the constraint file, and NVBoard are connected.

Implement the two-way switch on NVBoard

Read the instructions of the NVBoard project, then try to mimic the C++ files and Makefile in the example to modify your C++ file, assign pins to the input and output of the two-way switch, and modify npc/Makefile to connect it to the switches and LED lights on NVBoard.

The story of NVBoard

Although NVBoard is a teaching project of Nanjing University, it has a special connection with all those participating in "One Student One Chip":

Among the list of students who fabricated chips in the third phase of "One Student One Chip", there were two special ones. They were only freshmen when they signed up. And one of them, sjr, is the first author of NVBoard.

In fact, it was the ability to solve problems independently and the confidence that sjr developed while participating in "One Student One Chip" that helped him successfully develop the NVBoard project. Now, the NVBoard project in turn helps "One Student One Chip" improve the learning effect. Beyond its function as a virtual FPGA board, NVBoard also carries the concept of independently solving problems that "One Student One Chip" upholds. All of this is not far from you. When you are willing to learn independently instead of waiting for others to give you answers, your future will also be full of infinite possibilities.

Example: Running Lights (Sequential Logic Circuit)

Recall the running lights module mentioned earlier. Each bit of its output signal led corresponds to an LED light on the virtual board. Since the code contains sequential logic components that require reset, we need to modify the simulation code of Verilator:

// Below is the pseudocode

void single_cycle() {
  top->clk = 0; top->eval();
  top->clk = 1; top->eval();
}

void reset(int n) {
  top->rst = 1;
  while (n -- > 0) single_cycle();
  top->rst = 0;
}

...
reset(10);  // reset for 10 cycles
while(???) {
  ...
  single_cycle();
  ...
}
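
Putting the pieces together, a plain Verilator harness (without NVBoard yet) might look like the sketch below; it assumes the top-level module is named light, so Verilator generates a Vlight class.

#include "Vlight.h"     // model generated by Verilator from light.v (module `light` assumed)
#include "verilated.h"

static Vlight *top = NULL;

static void single_cycle() {
  top->clk = 0; top->eval();
  top->clk = 1; top->eval();
}

static void reset(int n) {
  top->rst = 1;
  while (n -- > 0) single_cycle();
  top->rst = 0;
}

int main(int argc, char **argv) {
  Verilated::commandArgs(argc, argv);
  top = new Vlight;
  reset(10);            // reset for 10 cycles
  while (1) {           // press Ctrl+C to stop the simulation
    single_cycle();
    // read top->led here, e.g. to drive NVBoard or print the light pattern
  }
  delete top;
  return 0;
}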

Connect the running lights to NVBoard

Write the running lights module, then connect it to NVBoard and assign pins. If your implementation is correct, you will see the lights light up and turn off sequentially from the right end to the left end.

Static Code Checking

Verilator can also function as a lint tool for static code checking. By passing the --lint-only parameter to verilator in the command line, Verilator will only perform code checking, and point out potentially problematic code in the form of warning messages without generating C++ files. In particular, you can also add the -Wall option to enable all types of checks in Verilator, allowing it to help you find more potential issues.

The Errors and Warnings chapter in the Verilator manual lists explanations for all warnings. By reading them, you will understand how these warnings are generated and thus know how to fix them. For warnings related to code logic, you should remove them by modifying the code; but for some warnings related to code style, if you are sure they do not affect the code logic, you can turn off the xxx warning with an additional -Wno-xxx option, such as -Wno-DECLFILENAME.

Perform Static Code Checking with Verilator

Try using Verilator to check your code and fix all warnings as much as possible. We recommend that you always enable Verilator's static code checking function in the future. On one hand, this helps you develop good coding habits, thereby writing higher-quality code. On the other hand, finding potential problems in the code as early as possible is also beneficial for saving unnecessary debugging work: As the code scale increases, you may well spend several days debugging due to a bit-width error of a certain signal in the future, but Verilator's warnings can make you notice this problem immediately, thus easily eliminating the corresponding error.

Advanced Study in Verilator

please click here

The Simulation Behavior and Coding Style of Verilog

Similar to the C language standard, the semantics of the Verilog language during simulation are defined by the Verilog Standard Manual. Chapter 11 of the manual reveals the essence of the Verilog language. This chapter is not lengthy, merely 5 pages, yet it contains a wealth of information (yzh believes this part is so crucial that it should be moved forward to Chapter 3 of the manual). However, these essential aspects are not covered in traditional textbooks and most Verilog materials. As a consequence, the vast majority of Verilog developers, even those with extensive Verilog experience, are unaware of the existence of these semantics, and are therefore unable to accurately understand the differences between Verilog in its two application scenarios, simulation and synthesis.

I can write Verilog, isn't that enough? Why do I need to know these things?

In fact, many Verilog developers really don't understand the essence of Verilog, but they can still design circuits whose behavior is mostly as expected by following certain Verilog coding guidelines. However, they don't know the underlying principles behind these coding guidelines, nor can they judge whether a coding guideline is correct. When encountering unexpected situations during simulation, they lack the ability to analyze the root causes of problems. They can only make random modifications blindly, and if they can't get it right, they might even complain that there's a bug in the simulator... As a small test, here are several Verilog coding suggestions or descriptions, but some of them are incorrect. Please try to identify them:

  1. Using #0 can force an assignment to be delayed until the end of the current simulation time step.
  2. In the same begin-end block, performing multiple non-blocking assignments to the same variable results in undefined behavior.
  3. When describing combinational logic elements with always blocks, non-blocking assignments must not be used.
  4. A variable must not be assigned in multiple always blocks.
  5. It's not recommended to use the $display system task because sometimes it can't correctly output variable values.
  6. $display cannot output the results of non-blocking assignment statements.

If you plan to use Verilog in future development and cannot judge the correctness of the above statements, we strongly recommend that you carefully understand this part of the content.

Execution of Verilog Code

We know that the execution of C language refers to modifying the state of objects through a sequence of evaluation processes that conform to standard specifications. Then, in Verilog, what is the meaning of "execution"? To get a clear answer, we need to refer to the Verilog standard manual. Section 11.1 of the manual defines what "executing Verilog code" means:

The elements that make up the Verilog HDL can be used to describe the behavior, at
varying levels of abstraction, of electronic hardware. An HDL has to be a parallel
programming language. The execution of certain language constructs is defined by
parallel execution of blocks or processes. It is important to understand what
execution order is guaranteed to the user and what execution order is indeterminate.

Although the Verilog HDL is used for more than simulation, the semantics of the
language are defined for simulation, and everything else is abstracted from this
base definition.

That is to say, the Verilog language is used to describe hardware behavior at various abstraction levels. HDL is a parallel programming language, and the execution of some language components is defined as the parallel execution of code blocks or processes. It is crucial for Verilog users to understand which execution sequences are guaranteed by the standard manual and which are uncertain. Although the application scenarios of Verilog are not limited to simulation, the semantics of Verilog are defined for simulation purposes, and the semantics for other scenarios are abstracted based on this fundamental definition.

Have you felt that your understanding of Verilog has been subverted?

The essence of Verilog is actually a parallel programming language! Moreover, the language standard of Verilog is defined for simulation, not for RTL design! Furthermore, the semantics of Verilog are indeterminate in some scenarios! This is somewhat similar to unspecified or undefined behavior in C: understand which coding styles introduce the indeterminacy, and avoid them when writing Verilog, so that the behavior of your Verilog code is deterministic.

This means that the semantics of some Verilog code may not match your intuition, or that simulation behavior may differ from synthesis behavior. If you have used Verilog before and had experiences like "simulation passed but it didn't work correctly on the FPGA", and after ruling out FPGA-specific issues you still don't understand the reason, it is very likely because you didn't understand the essence of the Verilog language described above. Now it's time to deepen your understanding of it!

Event-Based Simulation

Section 11.2 of the Verilog standard manual describes the simulation process:

The Verilog HDL is defined in terms of a discrete event execution model.

The definition of the Verilog language is based on a discrete event execution model. We have selected some key points to elaborate further:

Processes are objects that can be evaluated, that may have state, and that can
respond to changes on their inputs to produce outputs. Processes include primitives,
modules, initial and always procedural blocks, continuous assignments, asynchronous
tasks, and procedural assignment statements.

In Verilog, "processes" are objects that can be evaluated. They have their own states, can respond to changes when inputs vary, and produce outputs. Processes include primitives, modules, initial and always procedural blocks, continuous assignments, asynchronous tasks, and procedural assignment statements.

Every change in value of a net or variable in the circuit being simulated, as well
as the named event, is considered an update event.

Processes are sensitive to update events. When an update event is executed, all the
processes that are sensitive to that event are evaluated in an arbitrary order. The
evaluation of a process is also an event, known as an evaluation event.

In a circuit being simulated, changes in the values of nets or variables, as well as named events, are all regarded as "update events". Processes are sensitive to update events. After an update event is executed, all processes sensitive to that event will be evaluated, and the order of evaluation is arbitrary. The evaluation of a process is also an event, referred to as an "evaluation event".

Events can occur at different times. In order to keep track of the events and to
make sure they are processed in the correct order, the events are kept on an event
queue, ordered by simulation time. Putting an event on the queue is called
scheduling an event.

Events occur at different simulation times. To ensure that events are processed in the correct order, they need to be stored in an event queue sorted by simulation time. Placing an event into the queue is called "scheduling of the event". It can be seen that the semantics of Verilog language components are all associated with events. The simulation process is the process of handling these events in a certain correct order. During the processing of events, the state of objects in the circuit will change, and we expect such changes to conform to the expected behavior of the circuit, thereby realizing the simulation of the circuit's behavior.

Hierarchical Event Queue

According to the Verilog standard manual, the event queue logically contains the following 5 regions, each used to handle corresponding types of events:

  1. Active event region, which stores events that occur at the current simulation time and can be processed immediately.
  2. Inactive event region, which stores events that occur at the current simulation time but cannot be processed yet. These events can only be processed when the active region is empty.
  3. Nonblocking assign update event region, which stores events that have completed evaluation earlier but whose assignments are to be performed at the current simulation time. These events can only be processed when both the active and inactive regions are empty.
  4. Monitor event region, which stores events related to monitoring operations. These events can only be processed when the active, inactive, and nonblocking assign update regions are all empty.
  5. Future event region, which stores events to be processed at future simulation times.

An event is added to one of these regions according to its type, and is moved to the active region according to certain rules. Once processed, it is removed from the event queue. Some rules for event generation are as follows:

  • An explicit zero delay (#0) suspends the corresponding process and generates an inactive event.
  • A nonblocking assignment generates a nonblocking assign update event.
  • The system tasks $monitor and $strobe generate a monitor event at each simulation time.
  • Evaluation of PLI processes also generates events.

Based on the above processing order of different events, the Verilog standard manual provides a reference implementation of the event processing engine, which is the core loop of the Verilog simulator:

while (there are events) {
  if (no active events) {
    if (there are inactive events) {
      activate all inactive events;
    } else if (there are nonblocking assign update events) {
      activate all nonblocking assign update events;
    } else if (there are monitor events) {
      activate all monitor events;
    } else {
      advance T to the next event time;
      activate all inactive events for time T;
    }
  }
  E = any active event;
  if (E is an update event) {
    update the modified object;
    add evaluation events for sensitive processes to event queue;
  } else { /* shall be an evaluation event */
    evaluate the process;
    add update events to the event queue;
  }
}

The event processing engine will repeatedly perform the following operations:

  • If there are events in the active region, take out an event E:
    • If E is an update event:
      • Update the corresponding object.
      • Add the evaluation of the processes sensitive to this event to the event queue as evaluation events.
    • Otherwise, E is an evaluation event:
      • Evaluate the process.
      • Add the resulting assignment behavior to the event queue as update events.
  • If the active region is empty:
    • If the inactive region is not empty, move all its events to the active region.
    • Else, if the nonblocking assign update region is not empty, move all its events to the active region.
    • Else, if the monitor region is not empty, move all its events to the active region.
    • Else:
      • Advance the simulation time to the next event time.
      • Move all events in the future region that belong to the new current simulation time to the active or inactive region according to their types.
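
To make the region-draining order above concrete, here is a toy C++ sketch (purely illustrative; it is not how Verilator or any real simulator is implemented) that keeps three of the five regions and processes a single non-blocking assignment a <= 1 together with an active event that reads a:

#include <cstdio>
#include <deque>
#include <functional>

// A toy event queue holding three of the regions defined by the standard.
static std::deque<std::function<void()>> active, inactive, nba;

static void run() {
  while (!active.empty() || !inactive.empty() || !nba.empty()) {
    if (active.empty()) {
      if (!inactive.empty()) std::swap(active, inactive); // activate inactive events first
      else                   std::swap(active, nba);      // then nonblocking assign updates
    }
    auto e = active.front();
    active.pop_front();
    e();  // processing an event may schedule new events into any region
  }
}

static int a = 0;

int main() {
  // Model `a <= 1` at the current time step: the evaluation happens now,
  // the update is deferred to the nonblocking assign update region.
  active.push_back([] {
    int rhs = 1;                        // evaluation event: compute the right-hand side
    nba.push_back([rhs] { a = rhs; });  // update event, processed only after active/inactive drain
  });

  // Another active event at the same time (like $display) still sees the old value of a.
  active.push_back([] { std::printf("a = %d (seen by an active event)\n", a); });

  run();
  std::printf("a = %d (after nonblocking assign updates)\n", a);
  return 0;
}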

Verilog Code != C Code

For teachers of digital circuit courses, one of the most frustrating things is that when students learn Verilog, they easily write Verilog code using C language programming thinking. Even though teachers have repeatedly emphasized that "you can't treat Verilog as C language", most students still can't deeply understand the meaning of this sentence: if it can't be written like C language, then what exactly is Verilog?

The reason why we introduce the content of the Verilog standard manual here is to present the answer to this question. The above event processing loop has directly shown the difference between Verilog code and C code: Take i = i + 1 as an example. In a C program, under the action of the compiler, this line of code is finally compiled into an instruction similar to addi a0, a0, 1, which is directly executed on the processor; In Verilog, however, this line of code will be converted into an evaluation event and an update event. Under the processing of the event processing engine, the addition operation and assignment operation are completed, and new events sensitive to it are generated.

In fact, the Verilog code you write will eventually be transformed into events in accordance with the conventions of the standard manual. The simulator processes these events in a certain order that conforms to the conventions of the standard manual, and presents the overall behavior of the hardware circuit through the behavior of these events, thereby realizing the modeling of the hardware circuit.

Event Scheduling for Assignment Operations

According to Section 11.6 of the Verilog standard manual, assignment operations are converted into behaviorally equivalent processes, which generate corresponding events to be processed by the simulator. We have selected some common assignment operations for explanation. For simplicity, we first consider cases where no delay information (#) is specified.

  • Continuous assignment (i.e., assign statement) - Corresponds to a process sensitive to all source operands of the expression. When the value of the expression changes, an update event is generated and added to the active region. In addition, the continuous assignment process is also evaluated once at simulation time 0 so that constant values are propagated.
  • Blocking assignment within a process - First, the right-hand side of the assignment is evaluated using the current values of the objects. The left-hand side target is then computed and updated immediately, and the events resulting from this update are generated. Execution can then continue with the next statement in sequence or with other active events.
  • Non-blocking assignment within a process - The right-hand side of the assignment and the left-hand side target are evaluated using the current values of the objects, and a nonblocking assign update event for the current simulation time is generated.

A New Understanding of Blocking and Non-blocking Assignments

For teachers of digital circuit courses, the second most frustrating thing is that students struggle to understand the difference between blocking and non-blocking assignments. C language has only one type of assignment, but Verilog has several types of assignments. To deeply understand the differences between different assignment methods, we must return to the essence of Verilog, which is the event model.

Contrary to intuition, in Verilog's event model, the specific operation of an assignment expression needs to be considered in terms of two sub-operations: First is the evaluation operation, which completes the evaluation process of the right-hand side of the assignment expression; Then comes the update operation, which writes the evaluation result to the object indicated by the left-hand side of the assignment expression.

According to the event scheduling behavior described above, the biggest difference between blocking and non-blocking assignments lies in how they handle the update operation. Specifically, the update operation of a blocking assignment is performed immediately after the evaluation operation, without generating a new update event. For non-blocking assignments, however, the evaluation and update operations are separate: after the evaluation operation completes, an update event belonging to the nonblocking assign update region is generated, and this update event can only be processed once both the active and inactive regions are empty. It is this difference that causes the timing of the actual assignment to differ between the two, which in turn determines which events can see the assignment results, thereby affecting the behavior of those events and ultimately the overall behavior of the circuit.

Using Event Model to Analyze the Behavior of Verilog Code

Suppose that at time t, a = 1, b = 2, c = 3, d = 4, e = 5, and a rising edge of clk arrives. Try to analyze the value of each of these variables at time t+1.

always @(posedge clk) begin
  b  = a;
  c <= b;
  d  = c;
  e <= d;
  a  = e;
end

Besides the assignment statements above, the manual treats some other constructs in a similar way:

  • Port connections - An input port connection .a(expr) is treated as the continuous assignment statement assign a = expr;. An output port connection .b(net) is treated as the continuous assignment statement assign net = b;.
  • Functions and tasks - Parameters are passed by value when called. When returning, the behavior of "replacing the call site with the return value" is handled as a blocking assignment.

The Verilog manual also defines how to convert more scenarios into events for processing, including assignments with delay information, procedural continuous assignment statements, transistor-level behaviors, and so on. When you need to understand these, refer to the relevant content in the manual.

Event Processing Order

In fact, the order of event processing is not 100% deterministic. According to the definition in the Verilog standard manual, there are two main sources of uncertainty:

  • When there are multiple active events in the event queue, the processing order is arbitrary.
  • In behavioral modules, statements without time controls (i.e., # expressions and @ expressions) do not have to be processed as a single event. When evaluating a statement in a behavioral module, the simulator may suspend its execution at any time and place the remaining operations back in the event queue as an active event. This allows different processes to interleave their execution, but the order of interleaving is indeterminate and not under the user's control.

Why does Verilog introduce such indeterminacy? To answer this question, we need to review how hardware circuits work. The behavior of a hardware circuit is inherently parallel: its components naturally operate at the same time.

  • From the perspective of consistency between the circuit behavior model and the real circuit, there is no reason to specify the order in which these components work. Forcibly specifying such an order would make the modeling results unable to fully reflect the behavior of the real circuit. In particular, if a problem exists in the real circuit but cannot be exposed through modeling and fixed in time, the modeling loses its meaning.
  • From the perspective of the simulator as a piece of software, the simulator can only process events serially. The indeterminacy defined by the Verilog standard is in fact a "relaxation" of the event processing order: if two events have no dependency between them, there is no need to require a particular processing order. Furthermore, the simulator can even apply parallel optimization techniques to events without dependencies, thereby better reflecting the parallelism between circuit components.

To fully understand the behavior of Verilog code, we also need to consider what is deterministic about the event processing order. For convenience, we introduce an ordering relation written A → B, meaning that event A is processed before event B. The event processing engine described above already implies some ordering requirements:

  • Order Rule 1 - If event B is generated during the processing of event A, then A → B. This is because B cannot be processed before the processing of A has completed.
  • Order Rule 2 - If, at some moment during the simulation, event A is in the active region and event B is in one of the other regions, then A → B. This is because the event processing engine processes all events in the active region before processing events in the other regions.

In addition to these implicit orders, the Verilog standard manual explicitly defines the following two order rules:

  • Order Rule 3 - Statements within a begin-end block are executed in program order. That is, for two statements s1 and s2 in the same begin-end block, if s1 appears before s2, then s1 is executed before s2.
  • Order Rule 4 - Non-blocking assignment updates are performed in the order in which the statements were executed. That is, for two non-blocking assignments n1 and n2 whose evaluation events are eval(n1) and eval(n2), if eval(n1) → eval(n2), then update(n1) → update(n2).

The following example is in the Verilog standard manual:

initial begin
  a <= 0; // (1)
  a <= 1; // (2)
end

We write eval(i) for the evaluation event of the statement marked (i), and update(i) for the corresponding update event of the object a. The statement marked (1) in the above example can thus be decomposed into two events, eval(1) and update(1); similarly, the statement marked (2) can be decomposed into eval(2) and update(2). Applying the order rules above, we can draw the following conclusions:

  • Considering Order Rule 1, it should be that eval(1) → update(1) and eval(2) → update(2).
  • Considering Order Rule 2, it should be that eval(2) → update(1), since update(1) is already waiting in the nonblocking assign update region while eval(2) is still an active event.
  • Considering Order Rule 3, it should be that eval(1) → eval(2).
  • Considering Order Rule 4, it should be that update(1) → update(2).

Combining these conclusions, there is only one possible order: eval(1) → eval(2) → update(1) → update(2). That is, in this example, the simulator can only process the events in this order. Therefore, during the simulation, the object a will first be assigned 0 and then 1.

Simulators and Simulation Programs

Looking back at the example of the running light, let's analyze the event processing order within it.

module light(
  input clk,
  input rst,
  output reg [15:0] led
);
  reg [31:0] count;
  always @(posedge clk) begin
    if (rst) begin led <= 1; count <= 0; end
    else begin
      if (count == 0) led <= {led[14:0], led[15]};
      count <= (count >= 5000000 ? 32'b0 : count + 1);
    end
  end
endmodule

When a rising edge of clk arrives, the following events may occur at this simulation time (writing eval(x) for the evaluation event of the non-blocking assignment to x, and update(x) for its update event):

  • When rst = 1, it should be that eval(led) → eval(count) → update(led) → update(count).

  • When rst = 0 and count = 0, it should be that eval(led) → eval(count) → update(led) → update(count).

  • When rst = 0 and count != 0, it should be that eval(count) → update(count).

After obtaining the above event processing order, we can directly implement this order using C code. Here, two macros EVAL() and UPDATE() are defined, which respectively implement the semantics of the evaluation event eval(expr) and the update event update(obj):

#define EVAL(c, name, val) do { \
                             c->CONCAT(name, _next) = (val); \
                             c->CONCAT(name, _update) = 1; \
                           } while (0)
#define UPDATE(c, name)    do { \
                             if (c->CONCAT(name, _update)) { \
                               c->name = c->CONCAT(name, _next); \
                             } \
                           } while (0)

static void cycle(Circuit *c) {
  c->led_update = 0;
  c->count_update = 0;
  if (c->rst) {
    EVAL(c, led, 1);
    EVAL(c, count, 0);
  } else {
    if (c->count == 0) {
      EVAL(c, led, (BITS(c->led, 14, 0) << 1) | BITS(c->led, 15, 15));
    }
    EVAL(c, count, c->count >= 5000000 ? 0 : c->count + 1);
  }
  UPDATE(c, led);
  UPDATE(c, count);
}

Although this piece of C code implements the function of simulating a running light circuit, it does not include the concept of an "event queue": The events defined in the Verilog standard manual are not fetched from a queue in this C code. Instead, they are directly "flattened" in the C code in a specific order, and this order conforms to the conventions of the Verilog standard manual. In fact, the definition of the event queue in the Verilog standard manual is logical:

The Verilog event queue is logically segmented into five different regions.

Therefore, a simulation program does not necessarily need to explicitly maintain the order of events through a queue data structure. As long as the event processing order complies with the conventions of the Verilog standard manual, the behavior of the simulation program is in line with the manual's specifications.

The above simulation method, which "flattens" events in a specific order, is called "cycle simulation". This simulation method operates at the granularity of cycles, evaluating all components in the circuit within each simulation cycle. In the cycle-based approach, the evaluation order of the circuit is determined before the simulation starts, which belongs to the static scheduling of events. Verilator adopts this simulation method. In contrast, the reference implementation of the event processing engine provided in the Verilog standard manual mentioned earlier is called "event simulation". In this approach, the evaluation order of the circuit is determined during the simulation process, which belongs to the dynamic scheduling of events. The commercial simulator VCS adopts this simulation method.

Understand the behavior of the simulation program generated by Verilator

Compile the running light circuit with Verilator and try to understand the behavior of the generated C++ code.

Data Races

A valid real-world circuit should produce consistent outputs even when its various components operate in parallel. Therefore, this also requires that the circuit model, under the aforementioned uncertainties, should yield consistent results regardless of the order in which these events are processed. Conversely, if there exist two different event processing orders that lead to inconsistent simulation results, this is referred to as a data race. The Verilog standard manual terms this a "race condition," which has the same meaning as a data race. In general, if a data race exists during simulation, the circuit model described is not a valid representation of a real-world circuit.

Consider the following example:

always @(posedge clk or negedge rstn) begin
  if (!rstn) a = 1'b0;
  else a = b; // (1)
end

always @(posedge clk or negedge rstn) begin
  if (!rstn) b = 1'b1;
  else b = a; // (2)
end

There are two always blocks in the code, meaning there are two processes. Assume that at time t, a = 0, b = 1, rstn = 1, and a rising edge of clk arrives. Since the evaluation and update operations of a blocking assignment are completed together, we represent each of them with a single combined operation: assign(1) for the statement marked (1) and assign(2) for the statement marked (2). These are the two events to be processed at time t.

According to the definition in the Verilog standard manual, the processing order of multiple active events is arbitrary, and none of the order rules apply here. We can therefore list all possible event processing orders:

  • assign(1) → assign(2), resulting in a = 1, b = 1
  • assign(2) → assign(1), resulting in a = 0, b = 0

It can be seen that there is a data race in the above code. When the simulator chooses different event processing orders, it will lead to different simulation results. This is somewhat similar to the unspecified behavior in the C language. Different simulation results may appear in different simulators, different versions of the same simulator, multiple runs of the same version of the same simulator, and even different simulation moments in a single run of the same version of the same simulator. All these situations are in line with the conventions of the Verilog standard manual. It can be seen that if there is a data race in the Verilog code, the simulation results may be unpredictable.

Analyze the behavior of Verilog code using the event model (2)

Change the blocking assignments in the above code to non-blocking assignments, and try to re-analyze the possible event processing sequences and their results. Does the modified code still have data races? Why?

always @(posedge clk or negedge rstn) begin
  if (!rstn) a <= 1'b0;
  else a <= b;
end

always @(posedge clk or negedge rstn) begin
  if (!rstn) b <= 1'b1;
  else b <= a;
end

Consider the following example:

always @(posedge clk or negedge rstn) begin
  if (!rstn) a = 1'b0;
  else a = 1;
end

always @(posedge clk) begin
  $display("a = %d", a);
end

Assume that at time t, a = 0, rstn = 1, and a rising edge of clk arrives. Through a similar analysis, we obtain 2 events during the simulation at time t: assign(a), the combined evaluate-and-update operation of the blocking assignment to a, and display(a), the evaluation of the $display process.

It should be noted that the processing event of the $display system task belongs to the active region, so its order relative to assign(a) is arbitrary. We can list all possible event processing orders:

  • assign(a) → display(a), which outputs a = 1
  • display(a) → assign(a), which outputs a = 0

It can be seen that there is a data race in the above code. Although it has nothing to do with the behavior of the circuit itself, different simulation results may still occur when the simulator chooses different event processing orders. This may cause confusion for developers during debugging.

Analyze the behavior of Verilog code using the event model (3)

Change $display in the above code to $strobe, and try to re-analyze the possible event processing sequences and their results. Does the modified code still have data races? Why?

always @(posedge clk or negedge rstn) begin
  if (!rstn) a = 1'b0;
  else a = 1;
end

always @(posedge clk) begin
  $strobe("a = %d", a);
end

In fact, we can summarize from the above examples a sufficient and necessary condition for the existence of a data race. A data race exists in Verilog code if and only if there are two events A and B related to the same object x which simultaneously satisfy:

  • The processing order between A and B is uncertain.
  • At least one of A and B updates x.

Good Verilog Coding Styles

To eliminate data races, it is necessary to remove events that meet the above conditions. However, as the project scale becomes complex, manually judging whether there are data races in the code is very difficult. To address this challenge, many Verilog books and related materials recommend some good coding standards. If developers follow these coding standards, they can eliminate most data races in the code, making it more likely to design circuits with behaviors that meet expectations.

For example, the article "Nonblocking Assignments in Verilog Synthesis, Coding Styles That Kill!"open in new window puts forward the following Verilog coding suggestions, and mentions that adopting these suggestions can eliminate more than 90% of data races:

  1. Use non-blocking assignments when modeling sequential circuits.
  2. Use non-blocking assignments when modeling latch circuits.
  3. Use blocking assignments when building combinational logic models with always blocks.
  4. Use non-blocking assignments when building both sequential and combinational logic circuits in the same always block.
  5. Do not use both non-blocking and blocking assignments in the same always block.
  6. Do not assign values to the same variable in more than one always block.
  7. Use the $strobe system task to display the values of variables assigned with non-blocking assignments.
  8. Do not use #0 delays in assignments.

After understanding the event model, we can analyze the principles behind these recommendations:

  1. The reason for using non-blocking assignments to describe sequential logic elements is that the update events of non-blocking assignments are processed only after the events in the active and inactive regions have been handled. This characteristic matches the property of "sequential logic elements performing writes only when the next clock edge arrives."
  2. Latches are not commonly used in synchronous circuits, so we will not elaborate on this here.
  3. The reason for using blocking assignments to describe combinational elements is that the update operation of a blocking assignment is performed immediately, while the corresponding event is still being processed in the active region. Other events can immediately see the updated results of blocking assignments, allowing subsequent evaluations to use the updated values. This characteristic matches the property of "combinational logic elements' outputs changing immediately when inputs change."
  4. This is related to Verilog's synthesizable semantics, which will be discussed further below.
  5. Like the previous suggestion, this suggestion is also related to Verilog's synthesizable semantics, which will be further discussed in the following text.
  6. Different always blocks belong to different processes, and the evaluation order between different processes is uncertain. Additionally, assigning values to variables generates update events. This exactly satisfies the sufficient and necessary conditions for data races, so data races are inevitable.
  7. The update events of non-blocking assignments belong to the nonblocking assign update region, while the events of the $strobe system task belong to the monitor region. Therefore, the $strobe system task can output the values of variables after they have been updated by non-blocking assignments.
  8. Events generated by #0 belong to the inactive region, and they are processed after the active events and before the nonblocking assign update events. Without understanding this processing order, one may write code whose behavior does not match expectations.

I can write Verilog, so why do I need to know this?

At the very beginning of this section, several Verilog coding suggestions or descriptions were mentioned, but some of them are incorrect. Please try to identify them and analyze why they are incorrect:

  1. Using #0 can force the assignment operation to be delayed until the end of the current simulation time.
  2. In the same begin-end statement block, performing multiple non-blocking assignments to the same variable will result in an undefined outcome.
  3. When describing combinational logic elements with an always block, non-blocking assignments cannot be used.
  4. It is not allowed to assign values to the same variable in multiple always blocks.
  5. It is not recommended to use the $display system task because it sometimes fails to output the value of a variable correctly.
  6. $display cannot output the result of a non-blocking assignment statement.

More Examples and Analyses

"Nonblocking Assignments in Verilog Synthesis, Coding Styles That Kill!"open in new window This article also lists many examples and analyses. If you plan to use Verilog, we strongly recommend that you read it.

I can write Verilog, so why do I need to know this? (2)

Of course, no matter how many suggestions you follow, there will always be a 10% chance of accidentally writing code with data races. When you understand the details of the event model, you will have the ability to independently analyze and resolve data races in your code.

Moreover, it only takes about 1 hour to understand this content. Compared with the time you will spend on debugging in the future, this 1 hour is negligible. Just from the perspective that "it may help you avoid days of debugging without knowing why", this 1 hour of investment is definitely worthwhile.

Logic Synthesis - From RTL Code to Netlist

RTL code is just a description of a circuit, and simulation is only a program to simulate the behavior of the circuit. If you need to manufacture this circuit, the foundry also needs more detailed information. Specifically, what the foundry needs is a layout file in GDSII (Graphic Design System II) format, which describes the physical position of each element in the circuit. For example, there is an AND gate at coordinate (3, 4), and a wire between coordinates (4, 2) and (0, 2). To convert RTL code into a similar GDS layout, a series of EDA tools are required for processing in multiple stages. For instance, the "placement" stage is needed to determine the coordinates of each gate circuit, and the "routing" stage is needed to determine how to connect the gate circuits through nets.

Generally, the foundry will provide a Process Design Kit (PDK), which contains a series of resources such as device models, design rules, process constraints, verification files, and standard cell libraries under a specific process node. Physical design engineers usually use the PDK to design circuits that comply with the foundry's manufacturing specifications. The standard cell library in the PDK contains the logic cells supported by the process, called standard cells. Standard cells are part of the objects described in the GDS layout file mentioned above, and they are also the smallest units processed by EDA tools.

Logic synthesis (also simply referred to as "synthesis") refers to the process of converting RTL descriptions into standard cells. In addition, the GDS layout file also records the connection relationships (i.e., topological structures) between standard cells through wiring. These connection relationships are first described in the RTL code (i.e., the connection relationships between circuits and modules). Therefore, the synthesizer also needs to include them in the synthesis results for use in subsequent stages, and ultimately pass them to the GDS layout file. To sum up, the output of the synthesizer is a netlist of standard cells, which not only records the converted standard cells but also their interconnection relationships. To help you further understand this process, we provide a synthesis and evaluation project based on open-source EDA. You can clone the project using the following command:

git clone git@github.com:OSCPU/yosys-sta.git

This project synthesizes RTL code using the open-source RTL synthesizer yosys and maps the synthesis results to an open-source 45nm PDK, nangate45.

Try using the synthesizer

After cloning the above project, try to synthesize the running light project mentioned earlier as an example. You can perform synthesis using the make syn command in the above project, but you need to modify some configurations or parameters. Please refer to the README in the project for specific operation methods.

After experiencing the synthesis process through an example, we can take a look at the synthesized netlist file. Find the synthesis results in the result directory and open the .netlist.v file. You can see that many signals are defined in bit units in the netlist file, and many submodules with names similar to NOR2_X1 and OAI22_X1 are instantiated. These submodules are the standard cells provided by the nangate45 PDK.

So, how does the synthesizer convert RTL code into a netlist? We will introduce the synthesis process of the synthesizer yosys to you using the following counter as an example. The yosys official manual is provided here for your reference when needed.

// counter.v
module counter(
  input clk,
  input rst,
  input en,
  output reg [1:0] count
);
  always @(posedge clk) begin
    if (rst) count <= 2'd0;
    else if (en) count <= count + 2'd1;
  end
endmodule

Parsing

You can use the following command to make yosys read and parse the source files:

$ yosys counter.v

-- Parsing `counter.v' using frontend ` -vlog2k' --

1. Executing Verilog-2005 frontend: counter.v
Parsing Verilog input from `counter.v' to AST representation.
Storing AST representation for module `$abstract\counter'.
Successfully finished Verilog frontend.

yosys>

As you can see, yosys parses counter.v and converts it into an Abstract Syntax Tree (AST), then outputs the command prompt yosys>. If you wish to exit yosys, you can type exit after the command prompt.

The process of parsing source files is very similar to compiling C language, involving lexical analysis and syntax analysis. If you remove one of the semicolons ; from the source file and rerun the command, you'll find that yosys reports the following error:

counter.v:10: ERROR: syntax error, unexpected TOK_ELSE

Elaboration

The work in the elaboration phase includes resolving the instantiation relationships between modules, computing the parameters of module instances, and binding instance names and ports for module instantiations. You can use the following command to make yosys perform elaboration:

yosys> hierarchy -check -top counter

2. Executing HIERARCHY pass (managing design hierarchy).

3. Executing AST frontend in derive mode using pre-parsed AST for module `\counter'.
Generating RTLIL representation for module `\counter'.

3.1. Analyzing design hierarchy..
Top module:  \counter

3.2. Analyzing design hierarchy..
Top module:  \counter
Removing unused module `$abstract\counter'.
Removed 1 unused modules.

As you can see, the hierarchy command also needs to specify a top-level module. Yosys will start from this top-level module and sequentially expand all instantiated sub-modules, thereby determining the boundaries of the entire design. Modules that are not instantiated will be removed. At the same time, the elaboration phase will also convert the AST of the entire design into RTLIL, an intermediate language of yosys, which is very similar to the intermediate code generation phase in C language compilation.

Semantic Analysis

We can assume that during the elaboration phase of yosys, work similar to semantic analysis in C language compilation is also performed. For example, if we change posedge clk in counter.v to posedge count, yosys will report the following error when executing the above hierarchy command:

counter.v:8: ERROR: Found posedge/negedge event on a signal that is not 1 bit wide!

If you add a module instantiation statement mymodule abc(clk, rst); to counter.v, yosys will report the following error when executing the above hierarchy command:

ERROR: Module `\mymodule' referenced in module `\counter' in cell `\abc' is not part of the design.

It can be seen that both kinds of code are syntactically valid Verilog, so these errors cannot be detected during the file parsing phase.

Intermediate Code Generation

After the hierarchy command is executed successfully, we can view the RTLIL of the entire design. By executing the dump command, yosys will output the RTLIL of the current design to the terminal in text form:

yosys> dump

Or output the RTLIL to a file using the write_rtlil command:

yosys> write_rtlil counter.rtlil

Taking the counter.rtlil file as an example, its content is as follows:

autoidx 3
attribute \hdlname "counter"
attribute \top 1
attribute \src "counter.v:2.1-12.10"
module \counter
  attribute \src "counter.v:8.3-11.6"
  wire width 2 $0\count[1:0]
  attribute \src "counter.v:10.27-10.39"
  wire width 2 $add$counter.v:10$2_Y
  attribute \src "counter.v:3.9-3.12"
  wire input 1 \clk
  attribute \src "counter.v:6.20-6.25"
  wire width 2 output 4 \count
  attribute \src "counter.v:5.9-5.11"
  wire input 3 \en
  attribute \src "counter.v:4.9-4.12"
  wire input 2 \rst
  attribute \src "counter.v:10.27-10.39"
  cell $add $add$counter.v:10$2
    parameter \A_SIGNED 0
    parameter \A_WIDTH 2
    parameter \B_SIGNED 0
    parameter \B_WIDTH 2
    parameter \Y_WIDTH 2
    connect \A \count
    connect \B 2'01
    connect \Y $add$counter.v:10$2_Y
  end
  attribute \src "counter.v:8.3-11.6"
  process $proc$counter.v:8$1
    assign $0\count[1:0] \count
    attribute \src "counter.v:9.5-10.40"
    switch \rst
      attribute \src "counter.v:9.9-9.12"
      case 1'1
        assign $0\count[1:0] 2'00
      attribute \src "counter.v:10.5-10.9"
      case
        attribute \src "counter.v:10.10-10.40"
        switch \en
          attribute \src "counter.v:10.14-10.16"
          case 1'1
            assign $0\count[1:0] $add$counter.v:10$2_Y
          case
        end
    end
    sync posedge \clk
      update \count $0\count[1:0]
  end
end

We will provide some explanations for the output RTLIL. For more information about RTLIL, you can refer to the relevant yosys manual:

  • attribute is used to identify some attributes. For example, attribute \src "counter.v:10.27-10.39" is used to indicate the position of the corresponding element in the source file, that is, from column 27 to column 39 of line 10 in counter.v.
  • The wire width 2 $0\count[1:0] indicates defining a signal with a bit width of 2, and its name is $0\count[1:0] (note that characters such as $, \, [, :, and ] are all part of the name). The cell $add $add$counter.v:10$2 indicates instantiating a cell of type $add, with the name $add$counter.v:10$2. The specific parameters of the cell are represented by parameter. For example, parameter \A_WIDTH 2 means that the bit width of port A of the cell is 2, and parameter \A_SIGNED 0 means that port A of the cell is unsigned. The connection relationship of the ports is represented by connect. For example, connect \Y $add$counter.v:10$2_Y means that port Y of the cell is connected to the signal $add$counter.v:10$2_Y.
  • The process represents a behavioral description process, where assign indicates the assignment of signals, switch-case indicates conditional assignment based on the value of a signal, and sync indicates updating the signal when the condition is met.

As can be seen, although the syntax of RTLIL is different from that of Verilog, we can still feel that RTLIL is also describing hardware, and even sense that process corresponds to always in Verilog code. However, some operators have been replaced by cells. For example, + is replaced by $add. Therefore, compared with the original Verilog code, the current RTLIL is closer to the netlist. Such cells as $add belong to yosys's internal cell library.

We can also visualize the topological relationships in RTLIL through structure diagrams. However, before doing this, you may need to install xdot, a viewer for Graphviz dot files:

apt-get install xdot

Then, you can execute the show command in yosys, and it will automatically call tools like xdot to open the structure diagram:

yosys> show

The structure diagram file generated by the show command is saved by default in ~/.yosys_show.dot. Executing the show command multiple times will overwrite it. You can manually copy it to another directory and open it with the xdot tool.
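
For example, assuming xdot has been installed as above, you could keep a snapshot of the current diagram and reopen it later like this (the file name counter_rtlil.dot is an arbitrary choice):

cp ~/.yosys_show.dot counter_rtlil.dot
xdot counter_rtlil.dot &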

View Structure Diagram

Use the show command to view the current structure diagram and thus understand RTLIL.

Coarse-grain Synthesis

The coarse-grain synthesis stage is responsible for processing based on the "coarse-grain representation" of the design. Here, the coarse-grain representation refers to describing the design using operator-level units. Yosys's internal cell library includes a category of word-level cells, which describe functions at a relatively high level of abstraction and support multi-bit widths and parameters. In terms of naming conventions, word-level cells are usually prefixed with $. The $add cell mentioned above is a word-level cell; other word-level cells include $shift (shift operations), $mux (selection operations), etc.

Converting Procedural Descriptions to Coarse-grain Representations

However, the current RTLIL still contains procedural descriptions such as process (represented by PROC nodes in the structure diagram), which are not part of the coarse-grain representation and therefore cannot be handled by the passes that work on it. Yosys thus needs to first convert all procedural descriptions into the coarse-grain representation, which can be achieved using the proc command:

yosys> proc

The proc command is actually a macro command, which sequentially calls a series of subcommands to complete the conversion of procedural descriptions:

Step | Subcommand | Description
1 | proc_clean | Remove empty branches and empty procedural descriptions
2 | proc_rmdead | Remove unreachable case branches
3 | proc_prune | Remove redundant assignments (those overwritten by subsequent assignments)
4 | proc_init | Convert init operations in procedural descriptions into init attributes on the corresponding signals
5 | proc_arst | Identify asynchronous resets
6 | proc_rom | Convert switch operations in procedural descriptions into ROMs when appropriate
7 | proc_mux | Convert switch operations in procedural descriptions into $mux cells (multiplexers)
8 | proc_dlatch | Convert latches in procedural descriptions into D-latch type cells
9 | proc_dff | Convert flip-flops in procedural descriptions into D flip-flop type cells
10 | proc_memwr | Convert memory write operations in procedural descriptions into $memwr cells
11 | proc_clean | Remove empty branches and empty procedural descriptions
12 | opt_expr -keepdc | Perform expression-related optimizations

Some subcommands are quite similar to C language compilation optimization techniques, so they should be easy for you to understand. Overall, the proc command mainly converts the switch-case parts of procedural descriptions in RTLIL into $mux cells, and converts sync descriptions into D-latch type or D-flip-flop type cells, thereby obtaining a complete coarse-grain representation.
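
To build some intuition for what proc produces for the counter example, the resulting coarse-grain structure roughly corresponds to the following hand-written Verilog fragment (a sketch only: Yosys keeps the result in RTLIL rather than Verilog, and the names count_plus1 and count_next are made up):

  // two $mux cells select the next value, which feeds one D flip-flop cell
  wire [1:0] count_plus1 = count + 2'd1;              // the existing $add cell
  wire [1:0] count_next  = rst ? 2'd0                 // outer $mux: reset wins
                               : (en ? count_plus1    // inner $mux: count when enabled
                                     : count);        // otherwise hold the current value
  always @(posedge clk) count <= count_next;          // the D flip-flop cell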

View Structure Diagram (2)

Use the show command to view the current structure diagram and compare the differences before and after executing the proc command.

Optimization

Similar to compilation optimization, synthesizers generally also provide optimization functions, allowing developers to focus on architecture design and logic design without having to overly consider the performance of the circuit during the design phase. Synthesizers can usually provide a fairly good minimum performance level.

After obtaining a complete coarse-grain representation, a series of optimizations can be applied to generate a better design. This can be achieved through the opt command:

yosys> opt

Similar to the proc command introduced earlier, opt is also a macro command, which sequentially calls a series of subcommands to perform various optimizations:

Step | Subcommand | Description
1 | opt_expr | Constant folding and simple expression rewriting
2 | opt_merge -nomux | Merge identical cells, but not selector-type cells
do | | Start of the loop
3 | opt_muxtree | Remove unreachable branches in nested selectors
4 | opt_reduce | Simplify multi-input selectors, AND gates, and OR gates
5 | opt_merge | Merge identical cells
6 | opt_share | Merge cells of the same type with the same inputs that are never active at the same time
7 | opt_dff | Constant optimization of D flip-flops and merging of clock/reset signals
8 | opt_clean | Remove useless cells and nets
9 | opt_expr | Constant folding and simple expression rewriting
while (changed) | | If the design has changed, jump back to step 3 and continue the loop

Here are some common optimization techniques. For easier understanding, we'll use Verilog code to illustrate the semantics before and after optimization.

  • Constant folding and simple expression rewriting (opt_expr) - In some expressions, if an input is a specific constant or the expression follows a special pattern, it can be simplified. In the following example, the result of the expression a != a must be 0, so it can be optimized to assign x = 1'b0; further, after applying constant propagation optimization, the result of the expression b | x must be b, so it can be optimized to assign y = b;; in addition, since the bit width of c is 1, the expression c == 0 is equivalent to ~c, so it can be optimized to assign z = ~c;. The synthesizer can replace these cells with calculation results, thereby simplifying the corresponding circuit.
// before optimization         |   after optimization
  wire a, b, c, x, y, z;       |    wire a, b, c, x, y, z;
  // ......                    |    // ......
  assign x = a != a;           |    assign x = 1'b0;
  assign y = b | x;            |    assign y = b;
  assign z = c == x;           |    assign z = ~c;
  • Merging identical cells (opt_merge) - For multiple cells with identical functions and inputs, they can be merged into a single cell, with its output driving all the original output signals, thereby reducing the number of cells. In the following example, the cells for a + b and b + a have identical functions and inputs, so x can directly drive y, eliminating one addition cell.
// before optimization         |   after optimization
  wire a, b, x, y;             |    wire a, b, x, y;
  // ......                    |    // ......
  assign x = a + b;            |    assign x = a + b;
  assign y = b + a;            |    assign y = x;

Trade-off Between Area and Performance

It should be noted that although this optimization reduces the number of cells, it increases the fan-out of the cell (i.e., the number of downstream cells connected to the cell's output). All other things being equal, an increase in fan-out will lead to an increase in circuit delay.

To use a real-life analogy, the output of a cell is like a faucet that supplies water to downstream pools. Only when a pool is full can the gate be opened, similar to the flipping of downstream transistors. Before optimization, two faucets supply water to their respective downstream pools; after optimization, the number of faucets is reduced, but it needs to supply water to two downstream pools at the same time. With the water flow rate remaining unchanged, it will take longer to fill both pools, and this time is similar to the circuit delay.

In situations where delay optimization is required, the method of "cell duplication" is used instead to increase the number of faucets. Therefore, choosing between merging cells and duplicating cells is actually a trade-off between area and performance.

  • Removing unreachable branches in nested selectors (opt_muxtree) - In nested selectors, some branches are unreachable due to conflicting conditions and can be removed. In the following example, the result of the inner selector a ? b : c cannot be c because this would require the outer selection signal a = 1 and the inner selection signal a = 0 at the same time, which is a contradiction. Therefore, the inner selector cell can be replaced with b, thereby simplifying the corresponding circuit.
// before optimization            |   after optimization
  wire a, b, c, d, x;             |    wire a, b, c, d, x;
  // ......                       |    // ......
  assign x = a ? (a ? b : c) : d; |    assign x = a ? b : d;
  • Simplifying multi-input selectors, AND gates, and OR gates (opt_reduce) - For multi-input selectors, AND gates, and OR gates, some of their inputs may be the same, allowing these inputs to be eliminated or merged. In the following example, before optimization, a selection between two 32-bit signals is needed to derive imm. However, for these two signals, bits 12 to 31 are in each case the same as bit 11 (one is 0, the other is inst[31]). Therefore, the inputs of the selector can be optimized by first selecting only bits 0 to 11 as imm[11:0], then replicating the selected bit 11 to form imm[31:12]. After optimization, the bit width of the selector's data port is reduced from 32 to 12. Similarly, when performing a reduction AND operation via &imm, since bits 12 to 31 of the input imm are the same as bit 11, bits 12 to 31 of the input can be directly removed, and the result is the same as &imm[11:0]. After optimization, the bit width of the AND gate's input is reduced from 32 to 12.
// before optimization            |   after optimization
  wire [31:0] inst, imm;          |    wire [31:0] inst, imm;
  wire sel, x;                    |    wire sel, x;
  // ......                       |    // ......
  assign imm = !sel ? 32'b0 :     |    assign imm[11:0] = !sel ? 12'b0 : inst[31:20];
    {{20{inst[31]}}, inst[31:20]};|    assign imm[31:12] = {20{imm[11]}};
  assign x = &imm;                |    assign x = &imm[11:0];
  • Constant optimization of D flip-flops (opt_dff) - If the data input of a D flip-flop is a constant, the flip-flop can be replaced with that constant, thereby removing the corresponding D flip-flop cell. In the following example, the data input of the D flip-flop r is a constant, so it can be directly optimized into a constant signal.
// before optimization            |   after optimization
  reg [31:0] r;                   |    wire [31:0] r;
  // ......                       |    // ......
  always @ (posedge clk)          |    assign r = 32'hdeadbeef;
    r <= 32'hdeadbeef;            |
  • Removing useless cells and nets (opt_clean) - If certain cells and nets do not affect the output of the module, they can be removed. In the following example, the net t does not affect the output port x of the module, so it can be removed together with the cell a & b.
// before optimization            |   after optimization
  module m(                       |    module m(
    input a, b,                   |      input a, b,
    output x                      |      output x
  );                              |    );
    wire t;                       |      assign x = a + b;
    assign x = a + b;             |    endmodule
    assign t = a & b;             |
  endmodule                       |

In addition to the techniques introduced above, there are many other optimization techniques used in the synthesis process, such as bit-width reduction and peephole optimization. We will not elaborate on them here. For those who are interested, you can refer to the yosys documentation or related materials.

View Structure Diagram (3)

Use the show command to view the current structure diagram and compare the differences before and after executing the opt command.

Identification and Processing of Finite State Machines

We have previously introduced the state machine model of computer systems, which explained that digital circuits can also be regarded as a state machine. The Finite State Machine (FSM) here mainly refers to the part in the design that is implemented with digital logic and has the characteristics of a state machine.

For example, you should have completed similar problems like "identifying three consecutive 1s" on HDLBits. For this problem, we can list the following state transition table (where each entry means "next state/output"):

State | Meaning | Input 0 | Input 1
S0 | Initial state | S0/0 | S1/0
S1 | Identified one 1 | S0/0 | S2/0
S2 | Identified 2 consecutive 1s | S0/0 | S2/1
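
To connect this table with RTL code, the following is a minimal Verilog sketch of the detector written directly from the table (a Mealy machine; the module and signal names are made up for illustration):

module detect_111(
  input  clk,
  input  rst,
  input  in,
  output out
);
  localparam S0 = 2'd0, S1 = 2'd1, S2 = 2'd2;
  reg [1:0] state;
  always @(posedge clk) begin
    if (rst) state <= S0;
    else case (state)
      S0:      state <= in ? S1 : S0;   // on 1: one consecutive 1 identified
      S1:      state <= in ? S2 : S0;   // on 1: two consecutive 1s identified
      S2:      state <= in ? S2 : S0;   // on 1: stay in S2
      default: state <= S0;
    endcase
  end
  // the output is 1 exactly when a third (or later) consecutive 1 arrives
  assign out = (state == S2) & in;
endmodule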

For some more complex FSMs, there may be redundant states or mergeable states, but such situations are difficult to detect at the level of circuit semantics. Therefore, synthesizers generally first identify FSMs at the circuit level, then analyze and optimize them at the FSM level, and finally map them back to the circuit level. In yosys, this can be achieved through the fsm command:

yosys> fsm

fsm is also a macro command that sequentially invokes a series of subcommands to handle FSMs. The main processing steps include:

  1. FSM detection - Identify FSMs in RTLIL according to certain rules and mark related cells with special attributes
  2. FSM extraction - Replace the marked related cells with $fsm cells and parse out the state transition table
  3. FSM optimization - Optimize the FSM based on the state transition table, including removing useless output signals, merging identical upstream input signals, merging states with the same output, simplifying states according to constant inputs, etc.
  4. FSM recoding - After optimization, the number of FSM states may be reduced, so recoding can be used to reduce the bit width of state signals and state registers
  5. Cell mapping - Map the processed $fsm cells back to circuit-level cells

However, the aforementioned counter.v does not contain an FSM, so executing the above fsm command will have no effect. Theoretically, according to the state machine model of digital circuits, we could try to treat any digital circuit design as one big FSM and apply the above processing to it. But the state transition table obtained in this way would be extremely large (a single 32-bit register already has 2^32 states), which causes two problems: on the one hand, the amount of computation required to process such a state space is far beyond the computing power of current computers; on the other hand, these states have very low similarity in terms of input, output, and next state, making it almost impossible to find states and signals that can be optimized. Therefore, the state machine model of digital circuits is only used to help us understand the basic principles, and in practice it is not used for FSM optimization.

Identification and Processing of Memories

Another type of unit that requires special handling is memory. In Yosys, memory-related processing on RTLIL can be performed using the memory command:

yosys> memory

memory is also a macro command that sequentially invokes a series of subcommands. Its processing includes merging the flip-flops upstream and downstream of a memory into the memory's read/write ports, merging multiple read/write ports into a multi-port memory cell, and so on.

In the FPGA flow, since FPGAs provide only a few types of memory devices (such as LUT RAM, Block RAM, and FF), FPGA synthesizers can automatically identify memories in the RTL code through the above-mentioned methods and map them to physical memory devices according to the identified memory attributes. Utilizing the programmability of FPGAs, synthesizers can realize on-demand allocation of memory devices. That is, the synthesizer can compute the required memory capacity from the RTL code and map the identified memories onto the physical memory devices accordingly.

However, this is not the case for the ASIC flow. To improve memory performance, memory units in the ASIC flow do not have programmable functions. Instead, the standard cell library provides several types of memories with fixed specifications for RTL developers to choose from. These memory units also have different performance, area, and power consumption attributes. For example, to implement a 64x64 memory function at the RTL level, RTL developers can choose one 64x64 memory unit or splice two 32x64 memory units. The former has a smaller total area but may have higher read latency; the latter has a larger total area but better read latency. In addition, different memory specifications have different shapes, which will also affect the subsequent placement and routing. These mutually restrictive factors make it difficult for synthesizers to automatically identify and map memories. Therefore, RTL developers need to select memory specifications according to design goals and manually instantiate memory units as submodules in the RTL code.
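
As an illustration of such manual instantiation, the following sketch builds a 64-entry x 64-bit memory by splicing two 32-entry x 64-bit macros. The macro name sram_32x64 and its port list are hypothetical rather than taken from a real standard cell library, and the read path is simplified to a combinational selection:

module mem_64x64(
  input         clk,
  input         wen,
  input  [5:0]  addr,
  input  [63:0] wdata,
  output [63:0] rdata
);
  wire        sel = addr[5];   // the high address bit chooses which macro is accessed
  wire [63:0] rdata0, rdata1;
  // hypothetical 32x64 memory macros; only the selected one is written
  sram_32x64 u0(.clk(clk), .wen(wen & ~sel), .addr(addr[4:0]), .wdata(wdata), .rdata(rdata0));
  sram_32x64 u1(.clk(clk), .wen(wen &  sel), .addr(addr[4:0]), .wdata(wdata), .rdata(rdata1));
  // for macros with a synchronous read, sel would need to be delayed by one
  // cycle before this selection; the sketch keeps it combinational for brevity
  assign rdata = sel ? rdata1 : rdata0;
endmodule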

It can be said that the difference in the way memories are used between the ASIC flow and the FPGA flow reflects the trade-off between performance and flexibility: ASIC pursues higher performance but has weaker flexibility, making it difficult to achieve on-demand allocation, requiring developers to choose specific specifications by themselves; FPGA has better flexibility, but the programmable functions of its memory devices make its performance not as good as that of ASIC.

However, the above-mentioned counter.v does not contain memory, so executing the above memory command will have no effect. For quite a while, we will not encounter circuit designs containing such memories; we will discuss the issue of memory in A Stage.

Fine-grain Synthesis

The fine-grain synthesis stage is responsible for processing based on the "fine-grain representation" of the design. As mentioned above, the so-called fine-grain representation refers to describing the design using gate-level cells. There is a category of gate-level cells in Yosys's internal cell library. Compared with the cells used in the coarse-grain representation, these gate-level cells all have a data bit width of 1 and do not provide parameter functions. In terms of naming style, gate-level cells are usually named in the form of $_XXX_, where XXX is generally in uppercase to distinguish it from word-level cells.

Fine-grain synthesis first needs to convert the coarse-grain representation of the design into a fine-grain representation, which can be achieved through the techmap command:

yosys> techmap

The techmap command is used to replace the cells of the current design with cell implementations from a specified cell library. If no cell library is specified, the command will use Yosys's internal gate-level cell library.

After replacing with gate-level cells, it is also necessary to split some multi-bit nets and ports. Otherwise, RTLIL will contain unnecessary bit extraction and bit concatenation operations. This can be achieved through the splitnets command:

yosys> splitnets -ports

Next, you can execute the opt -full command to allow Yosys to perform some optimization work, which will clear out unused cells and nets, after which you can view the structure diagram.

It can be seen that the D flip-flop cell with a bit width of 2 in the coarse-grain representation has been split into 2 D flip-flop cells each with a bit width of 1, and the adder cell $add has also been split into several gate-level cells. Therefore, compared with the coarse-grain representation, the current fine-grain representation is closer to the netlist.
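
As a rough illustration of this splitting, the 2-bit increment count + 2'd1 can be expressed with 1-bit gates as in the following hand-written sketch (the structure Yosys actually produces with techmap may differ, and the intermediate signal names are made up):

  wire sum0       = ~count[0];          // bit 0: adding 1 simply inverts the lowest bit
  wire carry0     = count[0];           // the carry out of bit 0 is count[0] itself
  wire sum1       = count[1] ^ carry0;  // bit 1: XOR with the carry; the final carry is discarded
  wire [1:0] incr = {sum1, sum0};       // the two 1-bit results form the incremented value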

View Structure Diagram (4)

Use the show command to view the current structure diagram and compare the differences before and after executing the techmap command.

Technology Mapping

Technology mapping refers to the process of mapping a technology-independent circuit representation to a specific technology implementation. In this context, the technology mapping stage is responsible for mapping the fine-grain representation of the design to the standard cells of the target technology. To demonstrate the effect of technology mapping, we use a simple example standard cell library, which is derived from the Yosys manual:

library(demo) {
  cell(BUF) {
    area: 6;
    pin(A) { direction: input; }
    pin(Y) { direction: output;
              function: "A"; }
  }
  cell(NOT) {
    area: 3;
    pin(A) { direction: input; }
    pin(Y) { direction: output;
              function: "A'"; }
  }
  cell(NAND) {
    area: 4;
    pin(A) { direction: input; }
    pin(B) { direction: input; }
    pin(Y) { direction: output;
             function: "(A*B)'"; }
  }
  cell(NOR) {
    area: 4;
    pin(A) { direction: input; }
    pin(B) { direction: input; }
    pin(Y) { direction: output;
             function: "(A+B)'"; }
  }
  cell(DFF) {
    area: 18;
    ff(IQ, IQN) { clocked_on: C;
                  next_state: D; }
    pin(C) { direction: input;
                 clock: true; }
    pin(D) { direction: input; }
    pin(Q) { direction: output;
              function: "IQ"; }
  }
}

Save the above content to the file cell.lib, which describes the properties of standard cells in text format. The above file contains the following properties:

  • The area of the cell, typically given in square micrometers (μm²)
  • The pins of the cell and their directions. In particular, each output pin also carries a function attribute given by a logical expression, and the clock input pin of a flip-flop additionally carries a clock attribute marking it as a clock signal.

In Yosys, the technology mapping process is divided into two steps. First, perform technology mapping for sequential logic cells using the following command:

yosys> dfflibmap -liberty cell.lib

After executing the above command and viewing the structure diagram, you will notice that in the fine-grain representation, the gate-level cell $_SDFFE_PP0P_ has been replaced with DFF and several $_MUX_ cells, where DFF is the standard cell from the standard cell library cell.lib. The reason why this step generates additional $_MUX_ cells is that there is no standard cell in cell.lib that is fully functionally equivalent to $_SDFFE_PP0P_. According to the Yosys manual, the function of $_SDFFE_PP0P_ is "a positive-edge D flip-flop with an active-high synchronous reset signal and an active-high enable signal". However, the only sequential logic cell DFF in cell.lib is just a simple D flip-flop. Therefore, some additional combinational logic cells need to be introduced to implement the functions of "active-high synchronous reset signal" and "active-high enable signal".

However, you will find that the output pin Q of the DFF cell appears on the left side (which represents the input side) in the structure diagram. This is because cell.lib is an external cell library for Yosys, and the show command does not have information about the DFF standard cell by default. To fix this issue, we can let Yosys read the cell.lib standard cell library first:

yosys> read_liberty -lib cell.lib

After a successful read-in, executing the show command again will resolve this issue.

View Structure Diagram (5)

Use the show command to view the current structure diagram and compare the differences before and after executing the dfflibmap command.

Next, perform technology mapping for combinational logic cells using the following command:

yosys> abc -liberty cell.lib

The abc command will invoke an external tool ABC to perform technology mapping for combinational logic cells. All gate-level cells will be replaced with standard cells from the standard cell library cell.lib. Finally, use the clean command to remove unused cells and connections, and the final netlist will be obtained, thus completing the conversion from RTL code to netlist.

View Structure Diagram (6)

Use the show command to view the current structure diagram and compare the differences before and after executing the abc command.

Technology Mapping and Yosys's techmap Command

In the previous fine-grain synthesis stage, we introduced Yosys's techmap command, which is actually an abbreviation of "Technology Mapping". The Yosys official manual describes Technology Mapping in two steps: The first step is to map word-level cells to gate-level cells, and the second step is to map gate-level cells to standard cells of the target technology.

This is somewhat different from the concept of technology mapping introduced in these lecture notes: here, technology mapping emphasizes mapping onto a specific target technology, as defined at the beginning of this section. To stay aligned with industry usage, we do not adopt the Yosys official manual's broader understanding of technology mapping. Therefore, what the Yosys official manual describes as "the first step of technology mapping" corresponds to "fine-grain synthesis" in the lecture notes, and what it describes as "the second step of technology mapping" corresponds to "technology mapping" in the lecture notes.

Netlist and Report Generation

Finally, write the netlist to a file using the write_verilog command, and output information about the standard cells used via the stat command:

yosys> write_verilog netlist.v
yosys> stat -liberty cell.lib
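
Putting the steps of this walkthrough together, a minimal command sequence for counter.v looks roughly as follows (a sketch using the demo cell.lib; the actual flow driven by make syn, see scripts/yosys.tcl, targets the nangate45 libraries and uses additional options):

yosys> read_verilog counter.v
yosys> hierarchy -check -top counter
yosys> proc
yosys> opt
yosys> fsm
yosys> memory
yosys> techmap
yosys> splitnets -ports
yosys> opt -full
yosys> dfflibmap -liberty cell.lib
yosys> abc -liberty cell.lib
yosys> clean
yosys> write_verilog netlist.v
yosys> stat -liberty cell.lib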

Understand the Synthesis Process through Yosys Log Files

You have already synthesized the running light project. Try to view the Yosys log files in the yosys-sta/result directory and understand the synthesis process in combination with the above text.

To avoid repeatedly typing Yosys commands, make syn drives Yosys to perform synthesis through a script. For details, you can refer to yosys-sta/scripts/yosys.tcl. If you want to further learn about Yosys commands, you can consult the Yosys official manual, or enter help xxx in the Yosys command line to view information about the xxx command.

RTL Synthesis Semantics of Verilog

When converting a certain type of RTL to a specific type of standard cells, the semantics of RTL synthesis must be considered. However, the Verilog standard manual defines the simulation semantics of Verilog, which is not applicable to the scenario of RTL synthesis. For this reason, the Verilog RTL Synthesis Standard Manual specifically describes the semantics of the Verilog language in the synthesis scenario. The synthesizer reads the Verilog code and then converts the Verilog code into semantically equivalent standard cells in accordance with the semantics described in this standard manual.

Section 1.1 of the Verilog RTL Synthesis Standard Manual introduces the background of RTL synthesis:

This standard defines a set of modeling rules for writing Verilog® HDL descriptions
for synthesis. Adherence to these rules guarantees the interoperability of Verilog
HDL descriptions between register-transfer level synthesis tools that comply to this
standard. The standard defines how the semantics of Verilog HDL are used, for
example, to describe level- and edge-sensitive logic. It also describes the syntax
of the language with reference to what shall be supported and what shall not be
supported for interoperability.

Use of this standard will enhance the portability of Verilog-HDL-based designs
across synthesis tools conforming to this standard. In addition, it will minimize
the potential for functional mismatch that may occur between the RTL model and the
synthesized netlist.

Some key pieces of information include:

  • This standard specifies which parts of the Verilog syntax need to be supported by synthesizers and which do not. This indicates that the Verilog syntax supported by synthesizers is only a subset of the overall Verilog syntax.
  • Adopting this standard can improve the portability of Verilog designs across different synthesizers, enabling compliant synthesizers to parse compliant Verilog code with consistent semantics.
  • Using this standard can also help minimize the potential risk of functional inconsistencies between the RTL model and the synthesized netlist.

RTFM

Read Chapter 5 Modeling hardware elements of the Verilog RTL Synthesis Standard Manual to understand what kind of Verilog code is synthesized into what kind of circuit.

This section is only about 10 pages long, but it is more authoritative than all other Verilog learning materials. It also provides a large number of code examples and explanations, even detailing the scenarios where x and z are synthesizable.

In fact, some of the coding suggestions put forward by Professor Xia Yuwen in his book mentioned above are based on the specifications of the Verilog RTL Synthesis Standard Manual.

Since the semantics of simulation and synthesis are not completely consistent, there may be cases where the same Verilog code behaves inconsistently in simulation and synthesis scenarios. In the chip design flow, both simulation and synthesis are essential steps. Moreover, considering physical design and manufacturing, the behavior of the Verilog code we design should be based on synthesis. This requires us to avoid writing code that behaves inconsistently in simulation and synthesis scenarios.

Appendix B Functional mismatches of the Verilog RTL Synthesis Standard Manual describes some scenarios where such problems occur. Section B.1 mentions uncertain behaviors, with examples as follows:

always @(posedge clock) begin
  a = 0;
  a = 1;
end

always @(posedge clock)
  b = a;

Analyze the Behavior of Verilog Code Using Event Model (4)

Try to analyze why there is a data race in the above code using the event model.

In this example, the synthesizer is free to choose either 0 or 1 as the input to the flip-flop b. However, regardless of the choice, due to the existence of data races, the behavior of the synthesized netlist may be inconsistent with the simulation results.

RTFM (2)

Read Appendix B of the Verilog RTL Synthesis Standard Manual to understand other situations that may cause functional mismatches between the synthesized netlist and simulation behavior.

Chisel Benefits

If you plan to use the Chisel language for circuit design, you don't need to consider the data races and functional mismatches mentioned above, because the semantics of the Chisel language ensure that the generated Verilog code is free from these issues.

Evaluating Circuits with Open-Source EDA Tools

Since the standard cells in the netlist are manufacturable and have various attributes, we can conduct a preliminary evaluation of the circuit's quality after obtaining the netlist. There are multiple dimensions to measure the quality of a circuit, with three commonly used ones being performance, power consumption, and area, collectively referred to as PPA (Performance, Power, Area).

Area Evaluation

The simplest is area evaluation. The .lib file already provides the area attributes of standard cells. The synthesizer only needs to count the number of instantiations of each standard cell in the netlist to calculate the total area of the current design.
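
For instance, with the demo cell.lib shown earlier, a hypothetical netlist containing 3 NAND cells, 1 NOT cell, and 2 DFF cells would have a total area of 3 × 4 + 1 × 3 + 2 × 18 = 51, in whatever area unit the library uses.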

Performance Evaluation

The performance of a circuit is mainly measured by frequency, that is, "how many times the circuit can work at most per second". This is determined by "the minimum time required to complete one work cycle". We define "one work cycle" as "the sequential logic elements updating their states under the drive of the clock signal". Since these indicators are related to time, the process of analyzing them is also called timing analysis.

Recalling the state machine model of digital circuits, sequential logic elements update their states when the clock signal arrives. The "state" here is essentially data, which needs to be calculated through combinational logic. However, the calculation of combinational logic requires a certain delay. Therefore, we need to control the clock frequency so that the interval between two clocks (i.e., the period) can sufficiently accommodate the delay of the combinational logic; otherwise, the data signal used to update the state will not be the stable result calculated by the combinational logic, and thus the behavior of the circuit during operation will not match our design expectations.

    +------------------+
+-->| Sequential Logic |----+
|   +------------------+    |
| next state                | current state
|                           |
|  +---------------------+  |
+--| Combinational Logic |<-+
   +---------------------+

Therefore, evaluating the performance of a circuit involves extrapolating the maximum operating frequency the circuit can achieve by assessing the delay of combinational logic within it. If the actual operating frequency of the circuit exceeds this maximum frequency, the results updated by some sequential logic in the circuit will not match our design expectations, making it impossible for the circuit to function as intended.
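
As a rough sketch of the relationship for a register-to-register path (ignoring clock skew), the minimum clock period must cover the clock-to-Q delay of the launching flip-flop, the longest combinational delay between the two flip-flops, and the setup time of the capturing flip-flop:

T_min = t_clk-to-Q + t_comb(max) + t_setup,    f_max = 1 / T_min

For example, with a hypothetical 0.1 ns clock-to-Q delay, 0.8 ns worst-case combinational delay, and 0.1 ns setup time, T_min = 1 ns, so the circuit can run at no more than about 1 GHz.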

The Cost of Faster Speeds

Some electronics enthusiasts attempt to overclock their processors, making them run at a higher frequency than the maximum frequency claimed by the manufacturer, in an attempt to get a better computer user experience. However, if overclocking fails, the computer will enter an unstable state and may freeze after running for a period of time.

The essential reason for the freeze is actually the same as what was introduced above: because the processor is working too fast, some of its sequential logic units are not updated with the expected data, eventually causing the processor to enter an incorrect state.

There are numerous combinational logic elements in a circuit, and the operating frequency of the circuit is limited by the combinational logic path with the longest delay in the circuit. This path is called the "critical path" of the circuit. To find the critical path in a circuit, EDA tools need to read the delay information of standard cells from the standard cell library, then analyze the synthesized netlist, and calculate the total delay of all standard cells on the combinational logic paths. The path with the longest total delay is the critical path of the circuit. This evaluation process can be carried out based on the netlist and the delay information of the standard cell library, and it does not involve the actual operation process of the circuit. Therefore, it is called Static Timing Analysis (STA). It should be noted that the aforementioned cell.lib is only a simple example and does not contain delay information, so it cannot be used for static timing analysis.

However, the standard cell library in nangate45 provides complete standard cell delay information. In the yosys-sta project, executing the make sta command allows you to evaluate the performance of the RTL design on nangate45. Specifically, this command first calls the synthesizer Yosys to synthesize the RTL design to obtain a netlist of nangate45 standard cells. Then, it invokes the netlist optimization tool iNO to insert buffers into the netlist to optimize the performance of the netlist. Finally, the optimized netlist file and the standard cell information file in the PDK are input into the open-source static timing analysis tool iSTA. iSTA will quickly evaluate the path delays in the RTL design and report several paths that have the largest gap from the target frequency for users' reference.

Evaluate Circuit Performance

Try to evaluate the circuit's performance through the yosys-sta project, and read the static timing analysis report to understand the maximum frequency at which the target circuit can operate.

Currently, we do not require you to understand all the details in the report; we will cover more STA content in B Stage. If you are interested in the details of the report now, you can refer to this tutorial or search online for tutorials on reading timing reports. Other tools can also generate timing reports, and although the formats may vary, most of the concepts in them are common.

Issues Encountered During Evaluation

If you encounter bugs during runtime, you can report the problem in the Issues section of the yosys-sta repository and provide the following information:

  • The corresponding RTL design
  • SDC file
  • Netlist file generated by Yosys
  • Version number of iEDA, which can be obtained via the command echo exit | ./bin/iEDA -v

In fact, the timing report obtained through the above method cannot fully reflect the frequency of the chip during tape-out. This is because the netlist only contains information about standard cells and their topology, but not the physical location information of the standard cells. It is conceivable that if two standard cells are far apart in physical location, the signal transmitted between them will also take a certain delay to arrive, which is called net delay. The delay attribute of standard cells in the standard cell library can only reflect the delay of signals passing through the standard cells themselves, which is called logic delay. Complete delay information should consist of both logic delay and net delay. That is to say, the frequency obtained by the above evaluation method is only based on logic delay. Only after the EDA tool completes the placement and routing work can accurate net delay information be obtained, thereby evaluating the frequency information that is closer to the tape-out scenario.

Then, is the frequency information obtained currently completely meaningless? Not at all. On one hand, conducting physical design work also takes a certain amount of time. For complex high-performance processors, it may even take several days to complete one round of physical design. Obviously, in order to obtain more accurate delay and frequency information, the design team needs to invest more time, which will affect the efficiency of project iterations. On the other hand, although logic delay cannot represent the final delay information, it has already given the upper limit of the frequency. At the same time, it can also reflect some problems in the RTL logic design stage, such as overly complex circuit logic. This information is sufficient to help RTL designers conduct a preliminary evaluation of the RTL design, thereby enabling rapid iterative optimization. We will further introduce optimization methods in B Stage.

In particular, for a processor, frequency is not the sole factor in measuring its performance. Another way to understand frequency is the number of cycles it operates per second, but not every cycle necessarily involves "substantial work." The fundamental task of a processor is to execute programs; therefore, for a processor, performance should be interpreted as "the efficiency of program execution." More specifically, a processor executes instructions within a program. If a processor has a high frequency but takes a long time to execute a single instruction, it cannot be considered an excellent processor overall. Thus, another metric is needed to measure the efficiency with which a processor executes instructions. A commonly used metric for this is IPC (Instructions Per Cycle), which quantifies the average number of instructions a processor executes per cycle. We will further discuss methods for measuring and optimizing IPC in B Stage.
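
As a rough comparison (the numbers below are made up for illustration), the time to run a program can be written as:

time = instruction count / (IPC × frequency)

For a program of 10^9 instructions, a processor running at 2 GHz with an IPC of 0.5 needs 1 s, while a processor running at only 1 GHz with an IPC of 2 needs 0.5 s; despite its lower frequency, the second processor finishes the program faster.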

Power Consumption Evaluation

To evaluate the power consumption of a circuit, we need to assess the sum of the power consumption of all standard cells in the circuit. EDA tools need to read the power consumption information of standard cells from the standard cell library, calculate the power consumption of each standard cell, and thus compute the total power consumption of the circuit. Similar to delay evaluation, the aforementioned cell.lib is only a simple example and does not contain power consumption information, so it cannot be used for power consumption analysis.

Likewise, the standard cell library in nangate45 provides complete power consumption information for standard cells. Executing the make sta command in the yosys-sta project can evaluate not only the performance of the RTL design but also its power consumption: iSTA will quickly evaluate and report the power consumption of each standard cell and the total power consumption in the RTL design.

Evaluate Circuit Power Consumption

Try to evaluate the circuit's power consumption through the yosys-sta project and read the power consumption analysis report to understand the power consumption information of the target circuit.

The power consumption report includes three types of power consumption:

  • Internal Power refers to internal power consumption. When transistors switch states, nMOS and pMOS do not transition instantaneously. Therefore, for a brief moment, both nMOS and pMOS are conducting simultaneously, creating a short circuit path from the power supply to the ground. The power consumed by this current is known as internal power consumption, also called short-circuit power. Internal power is part of dynamic power.
  • Switch Power is the power consumed during state transitions. When a CMOS circuit switches between 0 and 1, it needs to charge or discharge the equivalent capacitance. The power consumed in this process is called switching power. Switching power is also part of dynamic power, meaning dynamic power consists of both internal power and switching power.
  • Leakage Power is leakage power consumption. Ideally, no current flows between the source and drain of a transistor when it is in the off state. In reality, however, various factors cause a small amount of current, known as leakage current, to flow between the source and drain. The power consumed by this leakage current is called leakage power. Since leakage power exists even when transistors are not switching, it is also referred to as static power.

In particular, in the current report, the switching power is always 0. This is because evaluating switching power requires first calculating the corresponding equivalent capacitance, and the equivalent capacitance is related not only to the standard cells themselves but also to the topology and length of the interconnects. Furthermore, to obtain the topology and length of the interconnects, a series of backend physical design steps must be completed first. Since yosys-sta does not perform these steps, it cannot calculate the equivalent capacitance and thus cannot evaluate the switching power. We will cover more content related to power consumption in B Stage.
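
For reference, switching power is commonly estimated with the classic dynamic power approximation (α is the switching activity, C the equivalent load capacitance, V_DD the supply voltage, and f the clock frequency; this is a textbook formula rather than something reported by iSTA here):

P_switch ≈ α × C × V_DD² × f

This also makes clear why the equivalent capacitance C, and therefore the interconnect information obtained from placement and routing, is needed before switching power can be evaluated.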

Limitations of Open-Source EDA Tools

Of course, the yosys-sta evaluation project is not perfect. At least for now, it has the following drawbacks:

  • The synthesis quality of Yosys is not high, and there is still a certain gap compared with commercial synthesizers.
  • Nangate45 is a PDK oriented to academic research, and the quantity and quality of standard cells in it also have a certain gap with commercial PDKs.
  • Nangate45 cannot be used for tape-out, and no factory uses it in production lines.

However, in the scenario of timing evaluation after synthesis, the above defects will not cause obvious impact: even if the synthesis quality of Yosys is not high, we can guide the direction of RTL optimization through the relative improvement of synthesis results.

So is an FPGA still needed for learning "OSOC"?

Basically, it's not necessary anymore:

  • In terms of accuracy, Yosys' synthesis flow is oriented towards ASIC design. Compared with the FPGA flow, its principles and the accuracy of reports are more suitable for "OSOC".
  • In terms of time, the main role of FPGA is simulation acceleration. That is to say, if the simulation task does not take a long time to complete, the advantage of using FPGA is not obvious. In fact, from the perspective of the complete flows of both, the advantage of FPGA can only be reflected when the following inequality holds:

    T(FPGA synthesis and place-and-route) + T(running on the FPGA) < T(software simulation)

    Among them, T(FPGA synthesis and place-and-route) usually reaches the order of hours, while running the design on the FPGA can usually be completed within a few minutes. Therefore, the above inequality can only hold when T(software simulation) reaches the order of hours. However, in the study of "OSOC", it is difficult for you to encounter simulation tasks that take hours to complete. And when you do encounter such tasks, we will also put forward higher requirements for the FPGA evaluation process. We will continue to discuss this issue in B Stage.
  • In terms of debugging difficulty, FPGA's debugging methods are very limited, and you can only capture underlying waveform information under the constraints of both time and space; on the contrary, software simulation is much more flexible, and we can use many software methods to improve debugging efficiency in various aspects.

PDK and Standard Cell Library

The cell.lib we mentioned earlier is just a simple example of a standard cell library. Now let's introduce the nangate45 PDK. You have already come into contact with nangate45 when synthesizing the running light project. Specifically, you can find the synthesized netlist file in the yosys-sta/result directory, and the instantiated cells in the netlist file are all standard cells from nangate45.

Content of PDK

As mentioned above, PDK contains a series of resources under a specific process node, such as device models, design rules, process constraints, verification files, and standard cell libraries. The standard cell library is a collection of standard cells and their attributes, which include information such as logical functions, transistor structures, timing, power consumption, and physical size. Usually, this information is distributed in various file formats within the PDK, and the .lib file mentioned above is just one of them. Taking nangate45 as an example, its files include (some files are not listed):

nangate45
├── cdl
│   └── NangateOpenCellLibrary.cdl       # Transistor-level information of standard cells
├── drc
│   └── FreePDK45.lydrc                  # Design rules that must be satisfied for manufacturable chips
├── gds
│   └── NangateOpenCellLibrary.gds       # Physical layout information of standard cells
├── lef
│   ├── fakeram45_1024x32.lef
│   ├── NangateOpenCellLibrary.macro.lef # Physical geometry information of standard cells
│   └── NangateOpenCellLibrary.tech.lef  # Process-related design specifications
├── lib
│   ├── fakeram45_1024x32.lib
│   ├── Nangate45_fast.lib
│   ├── Nangate45_slow.lib
│   └── Nangate45_typ.lib                # Information such as the logical functions, area, timing, and power consumption of standard cells
├── sim
│   └── cells.v                          # Verilog behavioral simulation models of standard cells
└── verilog
    ├── blackbox.v
    ├── cells_clkgate.v
    └── cells_latch.v

In particular, except for the .gds files which are binary files, all other files are text files that can be directly opened and read using a text editor.

In the full flow of processor design, different design stages use different files. For example, the technology mapping in synthesis reads .lib files and maps logically equivalent subcircuits to corresponding standard cells based on the logical functions of the standard cells; when performing netlist simulation, .v files are read to allow the RTL simulator to conduct standard cell-level simulation, thereby verifying that the function of the simulated netlist meets expectations; during placement, .lef files need to be read, and the position of each standard cell is determined according to information such as the size of the standard cells.

Chip Structure from a Process Perspective

To facilitate understanding of the information in the PDK and the subsequent physical design stages, we first need to understand the chip structure from a process perspective. In semiconductor manufacturing, the physical structure of a chip is hierarchical. For example, the cross-sectional view of a chip using a certain process is as shown in the following figure:

---------------------   M7    --+
  | | | | | | | | |             +--- Clock, Power
---------------------   M6    --+
  | | | | | | | | |
---------------------   M5    --+
  | | | | | | | | |             +--- Wiring Between Standard Cells
---------------------   M4    --+
  | | | | | | | | |
---------------------   M3    --+
  | | | | | | | | |             |
---------------------   M2      +--- Wiring between Transistors
  | | | | | | | | |             |
---------------------   M1    --+
  | | | | | | | | | <------- Via
=====================  Poly-silicon        --+
+++++++++++++++++++++  dielectric            +--- Transistors
ooooooooooooooooooooo  Silicon Substrate   --+

Among them, the bottom layer is the silicon substrate, which contains the source and drain of the transistors; on top of it is the dielectric layer, also known as the gate oxide layer, which usually uses silicon dioxide as the material; above that is the poly-silicon layer, which serves as the gate of the transistor. These three layers are used to realize the physical structure of the transistor.

There are multiple metal layers above the poly-silicon layer, which utilize their conductive properties to realize signal transmission, thereby connecting different transistors and achieving the functions of different gate circuits or standard cells. There are two types of connection methods: intra-layer connection and inter-layer connection. The former involves routing within the same metal layer, while the latter is achieved through vias between different metal layers. The connection relationships between various components in the RTL logic design are ultimately physically realized through the connection function provided by the metal layers.

To distinguish different metal layers, they are usually numbered. The larger the number, the higher the layer. Different metal layers have different requirements for the width and distance of the traces within them, thus serving different roles, as shown in the following table. It should be added that, according to middle school physics knowledge, the resistance of a trace is inversely proportional to its cross-sectional area.

Metal Layer | Wire Width | Routing Space | Routing Characteristics                                                 | Role
Lower       | Narrow     | Narrow        | High resistance, short transmission distance, high wiring density      | Connect transistors to form gate circuits and standard cells
Middle      | Medium     | Medium        | Medium resistance, medium transmission distance, medium wiring density | Connect standard cells to realize the main logic of the chip
Upper       | Wide       | Wide          | Low resistance, long transmission distance, low wiring density         | Clock and power

The number of metal layers may vary for different manufacturing processes. For example, the process structure in the above figure is abbreviated as 1P7M, where P stands for Poly (polysilicon layer) and M stands for Metal (metal layer). Thus, 1P7M indicates 1 polysilicon layer and 7 metal layers. In 1P7M, M1-M3 are lower metal layers, M4-M5 are middle metal layers, and M6-M7 are upper metal layers. It can be seen that in 1P7M, only two middle metal layers are specifically used to realize connections between standard cells. Generally, advanced processes provide more metal layers, such as 1P9M, 1P11M, etc. These can offer more abundant space for connections between standard cells, but more metal layer masks are required during the manufacturing process, resulting in higher manufacturing costs.

The properties of metal layers are recorded in the process LEF file of the PDK. The LEF file, with the suffix .lef, adopts the Library Exchange Format. It describes the physical layer information of the corresponding process, such as metal layers, vias, and layout rules, in a text format.

vim yosys-sta/pdk/nangate45/lef/NangateOpenCellLibrary.tech.lef

For example, you can see the definition of LAYER metal1, where the relevant fields describe the properties of the first metal layer.

LAYER metal1
  TYPE ROUTING ;
  SPACING 0.065 ;
  WIDTH 0.07 ;
  PITCH 0.14 ;
  DIRECTION HORIZONTAL ;
  OFFSET 0.095 0.07 ;
  RESISTANCE RPERSQ 0.38 ;
  THICKNESS 0.13 ;
  HEIGHT 0.37 ;
  CAPACITANCE CPERSQDIST 7.7161e-05 ;
  EDGECAPACITANCE 2.7365e-05 ;
END metal1

Among them, the TYPE field is ROUTING, indicating that this layer is used for routing. The WIDTH field specifies the minimum wire width of this layer, and the PITCH field specifies its routing pitch (the center-to-center distance between adjacent routing tracks), both in micrometers, as shown in the following figure. For more information about LEF files, you can refer to the relevant manual.

 WIDTH
   |
 <-+->
|     |                          |     |
|     |                          |     |
|     |                          |     |
|     |                          |     |
|wire |                          |     |
|     |                          |     |
|     |                          |     |
|     |                          |     |
|     |                          |     |
|     |                          |     |
|     |           PITCH          |     |
|     |             |            |     |
|     |             |            |     |
|     |             |            |     |
   |<---------------+-------------->|

Before the definition of metal layers, you should also be able to see the definition of the polysilicon layer LAYER poly, which contains only a TYPE field without any other fields. This is because the polysilicon layer, together with the dielectric and silicon substrate, is used to form transistors. The parameters of transistors are determined by the manufacturing process and are relatively fixed during the backend physical design process, unlike metal layers that allow EDA tools to perform dynamic routing according to design requirements. Therefore, in the LEF file, it is only necessary to declare the existence of the polysilicon layer, while more process details are recorded in other files (such as GDS). Similarly, the dielectric and silicon substrate are functionally closely bound to the polysilicon layer and do not even need to appear in the LEF file.

Learn About the Metal Layers of nangate45

Read the process LEF file of nangate45. How many metal layers does it contain? And try to infer the function of each metal layer based on the wire width and routing space.

If a processor design is large and complex (such as an out-of-order superscalar high-performance processor), there will be many interconnections between standard cells, which puts great pressure on the routing stage and forces detours in routing. This not only increases the area of the chip, but also increases the wire delay and reduces the frequency of the chip. It may even lead to routing failure due to excessive congestion, making the chip unable to enter the manufacturing stage. Therefore, the design of high-performance processors usually chooses advanced processes with more metal layers, whose richer routing resources can alleviate the pressure in the routing stage. For example, the Xiangshan processor team tried to switch the process from 1P9M to 1P11M during the design process: without modifying the RTL code, the wire delay was reduced and the clock frequency of the processor was increased.

Processes with Multiple Polysilicon Layers

Not all manufacturing processes have only one polysilicon layer. Depending on the application scenario, processes with more polysilicon layers may be adopted. For example, the memory cells of flash memory use floating-gate transistors, which are special transistors containing two gates. One of them is called the floating gate, which has two states, "storing charge" (charged) and "not storing charge" (discharged), representing 0 and 1 respectively; the other is called the control gate, which is used to control the reading and writing of the memory cell. Flash memory adopts the 2P8M manufacturing process to realize such special transistors, where the two polysilicon layers are used to implement the floating gate and the control gate respectively.

Attributes of Standard Cells

The standard cell library provided by PDK usually contains a large number of standard cells. In terms of their names, if we ignore suffixes like X1 and X4, the functions of some standard cells are easy to understand. For example, NAND2_X1 represents a 2-input NAND gate, and OR3_X4 represents a 3-input OR gate. Next, we will take NAND2_X1 as an example and refer to the relevant files in the standard cell library to further understand the various attributes of standard cells.

LIB Files - Functionality and Timing

LIB files, with the .lib extension, adopt the Liberty Timing File format. They describe the functionality of standard cells in text form, as well as attributes such as timing and power consumption under certain conditions.

vim yosys-sta/pdk/nangate45/lib/Nangate45_typ.lib

A brief review reveals that a LIB file consists of some header fields and descriptions of several standard cells. For example, we can directly search for NAND2_X1 to check the relevant attributes of this standard cell.

  cell (NAND2_X1) {
	drive_strength     	: 1;
	area               	: 0.798000;

	pg_pin(VDD) {
		voltage_name : VDD;
		pg_type      : primary_power;
	}
	pg_pin(VSS) {
		voltage_name : VSS;
		pg_type      : primary_ground;
	}

	cell_leakage_power 	: 17.393360;

	leakage_power () {
		when           : "!A1 & !A2";
		value          : 3.482556;
	}
	leakage_power () {
		when           : "!A1 & A2";
		value          : 24.799456;
	}
	leakage_power () {
		when           : "A1 & !A2";
		value          : 4.085038;
	}
	leakage_power () {
		when           : "A1 & A2";
		value          : 37.206389;
	}

	pin (A1) {
		direction		: input;
		related_power_pin		: "VDD";
		related_ground_pin		: "VSS";
		capacitance		: 1.599032;
		fall_capacitance	: 1.529196;
		rise_capacitance	: 1.599032;
	}

	pin (A2) {
		direction		: input;
		related_power_pin		: "VDD";
		related_ground_pin		: "VSS";
		capacitance		: 1.664199;
		fall_capacitance	: 1.502278;
		rise_capacitance	: 1.664199;
	}

	pin (ZN) {
		direction		: output;
		related_power_pin	: "VDD";
		related_ground_pin	: "VSS";
		max_capacitance		: 59.356700;
		function		: "!(A1 & A2)";

		timing () { ...... }
		internal_power () { ...... }
	}
}

So far, the attributes we can understand include (the units of some attributes are defined in the header fields):

  • Area: Usually measured in square micrometers (μm²).
  • Leakage power: This attribute includes the leakage power consumption of the standard cell under various conditions.
  • Pin: Including direction, capacitance, etc. In particular, for output pins, it also contains the following information:
    • Function: Given by a logical expression, from which we can understand the function of the standard cell.
    • Timing: Including the delay of the standard cell under various conditions.
    • Internal power: This attribute includes the internal power consumption of the standard cell under various conditions.

Let us elaborate on the meaning of the area attribute. A chip is a three-dimensional object, and the standard cells within the chip also exist in three-dimensional space. For convenience of description, we assume the chip is placed horizontally and establish a three-dimensional coordinate system. The area of a standard cell then refers to the area of its projection onto the horizontal (xy) plane, that is, its area in the top view. Considering the process structure of the chip, the area of a standard cell is also the area it occupies in the polysilicon layer and the lower metal layers.

From the above attributes, it can be seen that LIB files are mainly used in synthesis, timing analysis, and power consumption analysis. For example, Yosys reads the LIB file during technology mapping and, based on the function field of standard cells, determines which subcircuits to map to which standard cells, thereby ensuring that the circuit logic described by the netlist is equivalent to the input RTL code; the iSTA tool calculates the logic delay of each standard cell under various conditions according to the timing field of the standard cells, and finally reports several longest paths of the netlist.
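To make the effect of technology mapping concrete, here is a minimal hedged sketch: a trivial RTL module and one possible gate-level result after mapping to Nangate45 cells. The cell choice (NAND2_X1 followed by INV_X1) and the assumption that INV_X1's pins are named A and ZN are illustrative only; the actual result depends on the mapping script and the optimizer.

// RTL before synthesis: a 2-input AND
module and2_rtl (input a, input b, output y);
  assign y = a & b;
endmodule

// One possible netlist after technology mapping to Nangate45 cells
// (hypothetical cell choice, for illustration)
module and2_mapped (input a, input b, output y);
  wire n1;
  NAND2_X1 u0 (.A1(a), .A2(b), .ZN(n1));  // pin names as in the LIB excerpt above
  INV_X1   u1 (.A(n1), .ZN(y));           // assumed pin names A/ZN
endmodule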

RTFM

The aforementioned LIB file contains 130,000 lines, making it inconvenient to refer to directly. We recommend that you refer to the web version of the nangate45 data book, whose data is derived from the aforementioned LIB file but features better visualization. It also allows you to view the transistor schematic diagrams of the corresponding standard cells.

If you wish to understand the specific meanings of each field in the LIB file, you can consult the file format manual for Liberty Timing Files.

Verilog Files - Behavioral Models

To verify the equivalence between the synthesized netlist and the pre-synthesis RTL design, one approach is to perform netlist simulation, which involves simulating the netlist in conjunction with the behavior of standard cells. Although the function field of standard cells in the LIB file also describes their behavior, RTL simulators typically cannot recognize LIB files. Therefore, standard cell libraries usually also provide Verilog behavioral models of standard cells.

In nangate45, the Verilog behavioral models of standard cells are located in the following file:

vim yosys-sta/pdk/nangate45/sim/cells.v

For example, the behavioral model of NAND2_X1 is as follows:

module NAND2_X1 (A1, A2, ZN);
   input A1;
   input A2;
   output ZN;
   assign ZN = ~(A1 & A2);
endmodule

It can be seen that this is simply an implementation of the standard cell functions using the Verilog language. When the netlist file is input into the RTL simulator together with this behavioral model file, the RTL simulator will instantiate the standard cells in the netlist file according to the module definitions in the model file, thereby enabling simulation work at the netlist level.
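As a hedged sketch of netlist simulation, the self-checking testbench below drives the hypothetical and2_rtl and and2_mapped modules from the earlier sketch and compares their outputs; compiling it together with the netlist and cells.v lets the simulator resolve the standard cell instances.

module tb_netlist;
  reg a, b;
  wire y_rtl, y_gate;
  integer i;

  and2_rtl    dut_rtl  (.a(a), .b(b), .y(y_rtl));   // pre-synthesis RTL
  and2_mapped dut_gate (.a(a), .b(b), .y(y_gate));  // post-synthesis netlist

  initial begin
    for (i = 0; i < 4; i = i + 1) begin
      {a, b} = i[1:0];   // enumerate all input combinations
      #1;
      if (y_rtl !== y_gate)
        $display("Mismatch at a=%b b=%b: rtl=%b gate=%b", a, b, y_rtl, y_gate);
    end
    $finish;
  end
endmodule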

The behavioral models provided by nangate45 are relatively simple and can only be used for functional simulation. Some PDKs provide behavioral models that also contain rich timing information to support users in carrying out timing simulation work at the netlist level.
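As a rough illustration of what such a timing-aware model might look like, the sketch below uses a standard Verilog specify block; the module name and the delay values are made up for illustration and are not taken from nangate45.

module NAND2_X1_TIMED (A1, A2, ZN);   // hypothetical timing-aware variant
  input  A1, A2;
  output ZN;
  assign ZN = ~(A1 & A2);
  specify
    // (rise delay, fall delay) from each input to the output; values are illustrative
    (A1 => ZN) = (0.012, 0.010);
    (A2 => ZN) = (0.013, 0.011);
  endspecify
endmodule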

LEF Files - Physical Geometry Information

As mentioned earlier, there are process-related LEF files. In fact, there is another type of LEF file associated with standard cells, which is used to describe the physical geometry information of standard cells.

vim yosys-sta/pdk/nangate45/lef/NangateOpenCellLibrary.macro.lef

Taking NAND2_X1 as an example, the description of it in the LEF file is as follows:

MACRO NAND2_X1
  CLASS core ;
  FOREIGN NAND2_X1 0.0 0.0 ;
  ORIGIN 0 0 ;
  SYMMETRY X Y ;
  SITE FreePDK45_38x28_10R_NP_162NW_34O ;
  SIZE 0.57 BY 1.4 ;
  PIN A1
    DIRECTION INPUT ;
    ANTENNAPARTIALMETALAREA 0.021875 LAYER metal1 ;
    ANTENNAPARTIALMETALSIDEAREA 0.078 LAYER metal1 ;
    ANTENNAGATEAREA 0.05225 ;
    PORT
      LAYER metal1 ;
      POLYGON 0.385 0.525 0.51 0.525 0.51 0.7 0.385 0.7  ;
    END
  END A1
  ......
END NAND2_X1

Among them, SYMMETRY X Y indicates that the standard cell can be placed symmetrically (mirrored) about the X axis or the Y axis, thereby optimizing the layout effect (such as the wire delay to a certain pin). The SITE field specifies the alignment rules that the standard cell must follow during placement. Here, the value FreePDK45_38x28_10R_NP_162NW_34O refers to alignment rules defined elsewhere, specifically in the process LEF file:

SITE FreePDK45_38x28_10R_NP_162NW_34O
  SYMMETRY y ;
  CLASS core ;
  SIZE 0.19 BY 1.4 ;
END FreePDK45_38x28_10R_NP_162NW_34O

Here, SYMMETRY y means that the standard cell can be placed symmetrically about the y-axis. SIZE 0.19 BY 1.4 specifies the alignment rule as 0.19 x 1.4, which means that when placing the standard cell, its x-coordinate must be an integer multiple of 0.19 and its y-coordinate must be an integer multiple of 1.4. Looking back at the SIZE field of NAND2_X1, it gives the dimensions of the standard cell: its length along the x-axis (0.57) and its length along the y-axis (1.4) are integer multiples of 0.19 and 1.4 in the alignment rules, respectively. More generally, the SITE can be considered as defining a grid unit, and every standard cell, in terms of size, is a rectangle composed of one or more grid units. The PIN field describes attributes of the specified pin, including its direction (DIRECTION), parameters related to the antenna effect (attributes starting with ANTENNA), and the geometry of the port (PORT). The PORT further describes that the port occupies a polygon on the metal1 layer, whose shape is given by the POLYGON field.

It can be seen that the LEF file describes the geometric shape information of the standard cell in detail. This information will help EDA tools correctly place the standard cells in the chip.

CDL Files - Transistor Netlists

CDL files, with the .cdl extension, are based on a Circuit Description Language. They describe the transistor structure of standard cells in text form.

vim yosys-sta/pdk/nangate45/cdl/NangateOpenCellLibrary.cdl

The format for describing transistor structures in CDL files is as follows:

.SUBCKT <subcircuit name> <port1> <port2> ...
<transistor instance name> <drain> <gate> <source> <substrate> <transistor type> <channel width> <channel length>
...
.ENDS

Taking NAND2_X1 as an example, the description of it in the CDL file is as follows:

.SUBCKT NAND2_X1 A1 A2 ZN VDD VSS
*.PININFO A1:I A2:I ZN:O VDD:P VSS:G
*.EQN ZN=!(A1 * A2)
M_i_1 net_0 A2 VSS VSS NMOS_VTL W=0.415000U L=0.050000U
M_i_0 ZN A1 net_0 VSS NMOS_VTL W=0.415000U L=0.050000U
M_i_3 ZN A2 VDD VDD PMOS_VTL W=0.630000U L=0.050000U
M_i_2 VDD A1 ZN VDD PMOS_VTL W=0.630000U L=0.050000U
.ENDS

Lines starting with * are comments. The above description uses .SUBCKT to define a subcircuit (i.e., a standard cell) named NAND2_X1, which has 5 ports: A1, A2, ZN, VDD, and VSS, in that order. The line starting with M_i_1 instantiates an nMOS transistor named M_i_1. Its drain is connected to the net net_0, its gate is connected to the port A2, its source is connected to the port VSS, and its substrate is connected to the port VSS. Its channel width and channel length are 0.415 μm and 0.05 μm, respectively (the U suffix denotes micrometers). The remaining lines describe the other transistors and their connection relationships in a similar manner.

Draw the transistor structure based on the CDL file

Try to draw the transistor structure of the standard cell NAND2_X1 according to the above CDL description, and check whether its function is consistent with that of a NAND gate.

The transistor structure information described in the CDL is mainly used for transistor-level SPICE simulation, and for checking the consistency between the GDS layout and the netlist logic. The latter task is called LVS (Layout Versus Schematic).

GDS File - Physical Layout

GDS files, with the suffix .gds, contain all the physical and process information required for manufacturing a standard cell. GDS files are not text files and require specialized tools for parsing and reading. The "OSOC" program does not impose requirements on the specific content within GDS files, so no further elaboration will be provided here.

Classification of Standard Cells

There is a wide variety of standard cells in a standard cell library, which can be classified according to their functions, including but not limited to the following categories. Generally speaking, the first 5 categories of cells and clock buffers are essential to ensure the correct implementation of the basic functions of various designs. By providing other types of cells, users can design better circuits for specific scenarios or achieve more convenient chip debugging functions.

Logic Gate Cells

Logic gate cells include basic logic gates (AND gates, OR gates, NOT gates, etc.) and complex logic gates.

Understand the Function of Complex Logic Gate Cells

There are standard cells named like OAI22_X1 in the Nangate45 LIB file, whose functions are not intuitive. Try to look up the relevant attributes of the standard cell OAI22_X1 to understand its function.

After checking the Nangate45 LIB file, it can be found that the function of the standard cell OAI22_X1 is more complex than a single logic gate. Its function includes two 2-input OR gates, one 2-input AND gate, and one NOT gate. This type of logic gate is called a complex gate. If we check the area of the relevant standard cells, we can get the following data:

area(OAI22_X1) = 1.33
area(OR2_X1)*2 + area(AND2_X1) + area(INV_X1) = 1.064*2 + 1.064 + 0.532 = 3.724

It can be seen that the area of the standard cell OAI22_X1 is much smaller than the area occupied by using multiple functionally equivalent logic gate cells. This is because realizing the logical functions of "AND" and "OR" through the series and parallel connection of transistors is much less costly than realizing the corresponding logical functions through "AND gates" and "OR gates". The cost here is not only reflected in the area but also in the delay and power consumption. Therefore, the standard cell library does not only contain simple logic gate cells. Complex logic gates like OAI22_X1 that are implemented at the transistor level are also provided as standard cells.

Understand the Transistor Structure of OAI22

Check the transistor structure of OAI22_X1 in the CDL file. How many transistors does it have? Try to understand how its transistor structure implements the logical expression of OAI22_X1.

Naming of OAI22

In fact, the naming of OAI22 has its meaning, where OAI stands for Or-And-Invert, and 22 indicates two groups of input signals, with two in each group. Let's assume the input signals are A1, A2, B1, B2 respectively. Then, OAI means first performing the Or operation within each group of signals to get A1 | A2 and B1 | B2; then performing the And operation on the results to get (A1 | A2) & (B1 | B2); and finally performing the Invert operation on the result to get !((A1 | A2) & (B1 | B2)).
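Expressed as a behavioral Verilog sketch (pin names assumed to follow the A1/A2/B1/B2/ZN convention; check the LIB file for the actual interface), the function implied by the name is:

module OAI22_sketch (input A1, input A2, input B1, input B2, output ZN);
  assign ZN = ~((A1 | A2) & (B1 | B2));  // Or within each group, And across groups, then Invert
endmodule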

Understand the Function of Complex Logic Gate Cells (2)

Similarly, there is a standard cell named AOI221. Try to list its logical expression based on the naming, and check the function in the standard cell library to compare whether your understanding is correct.

The suffixes like X1, X4, etc., included in standard cells indicate the drive strength of the standard cell. Drive strength refers to the current that a standard cell can source or sink while maintaining a specified voltage range, and it affects the time required for the downstream standard cells to switch. Therefore, NAND2_X1, NAND2_X2, and NAND2_X4 are completely equivalent in terms of logical function. However, NAND2_X4 can provide greater drive strength, enabling its downstream logic to switch faster. But this requires larger or more transistors to achieve, so NAND2_X4 has a larger area and higher power consumption.

Understand Drive Strength

Try to look up the relevant attributes of NAND2_X1, NAND2_X2, and NAND2_X4, and compare their area and power consumption.

Understand Drive Strength (2)

Try to check the transistor structure of NAND2_X2 and see how it differs from that of NAND2_X1.

Considering the trade-offs between different drive strengths in terms of performance, area, power consumption, and other indicators, standard cells with higher drive strengths are usually used in critical paths that affect the frequency. This helps reduce the delay of critical paths and improve the chip's frequency. On non-critical paths that do not affect the frequency, standard cells with lower drive strengths are used. This way, the overall area and power consumption of the chip can be reduced without lowering the chip's frequency.

Sequential Cells

Sequential cells include flip-flops, latches, etc., among which there are various types with or without clear/set terminals, such as the flip-flop DFF_X1.
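For reference, a behavioral model of a plain D flip-flop cell might look roughly like the sketch below (port names assumed to follow the D/CK/Q/QN convention; cells with set/clear terminals add the corresponding inputs):

module DFF_sketch (D, CK, Q, QN);
  input  D, CK;
  output reg Q;
  output QN;
  assign QN = ~Q;          // complementary output
  always @(posedge CK)
    Q <= D;                // capture D on the rising clock edge
endmodule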

Draw the Transistor Structure Based on the CDL File (2)

Check the transistor structure of DFF_X1 in the CDL file. How many transistors does this standard cell consist of? Draw the transistor structure of DFF_X1 according to the description in the CDL file, and try to understand how the flip-flop function is realized through this transistor structure.

I/O Cells

A chip needs to communicate with the outside world through I/O cells (input/output cells). I/O cells are used to connect the internal I/O signals of the chip to the metal bonding pads of the I/O cells. After the chip is manufactured, the packaging process will lead out metal pins from the metal bonding pads of the I/O cells, allowing external signals of the chip to interact with the internal I/O signals of the chip through the pins. For the I/O cells of nangate45, reference can be made to the relevant LIB files:

vim yosys-sta/pdk/nangate45/lib/dummy_pads.lib

Among them, I/O cells can be further classified as follows:

  1. Data I/O cells (also known as GPIO), which are used to provide input and output of data signals, such as PADCELL_SIG_H.
  2. Core power supply cells, which are used to provide power for the transistors inside the chip, including the ground supply (VSS) and the power supply (VDD), such as PADCELL_VSS_H and PADCELL_VDD_H.
  3. I/O power supply cells, which are used to provide power for data I/O cells (i.e., the first type of I/O cells). For example, PADCELL_VSSIO_H and PADCELL_VDDIO_H. Data I/O cells are usually more complex than general standard cells, so their power supply requirements are different from those of general standard cells. Therefore, core power supply cells (i.e., the second type of I/O cells) cannot be used to power data I/O cells.

Each type of I/O cell is further divided into horizontal and vertical directions: Horizontal I/O cells are suffixed with _H, such as the aforementioned PADCELL_SIG_H, PADCELL_VDD_H, etc. Vertical I/O cells are suffixed with _V, such as PADCELL_SIG_V, PADCELL_VDD_V, etc.

Although I/O cells are also part of the standard cell library, their area is several orders of magnitude larger than that of general standard cells. This is because communicating with the outside of the chip imposes more requirements on the functions of I/O cells. For example, they need to have strong driving capability to transmit signals to the outside of the chip, integrate protection circuits to prevent external static electricity from damaging the inside of the chip, and meet the physical size and welding requirements of chip pins (such as spacing, metal layer thickness, and retaining blank areas to avoid short circuits). Therefore, the circuit-level implementation of I/O cells is much more complex than that of general standard cells.

Learn About the Size of I/O Cells

Try to refer to relevant files in nangate45, find out the size of a 2-input NAND gate and the size of an I/O cell, and compare them.

Driver Cells

Driver cells are used to enhance the driving capability of signals, ensure signal integrity, and optimize timing and load. When a signal is transmitted over a long distance or there are too many downstream circuits (high fan-out), the signal may suffer from excessive transmission delay due to insufficient driving capability, or even signal distortion leading to errors. Inserting driver cells helps alleviate the above problems. They are specifically divided into:

  • Logical non-inverting driver cells, also known as buffers, whose output is logically identical to the input. In nangate45, such driver cells include BUF_X1, BUF_X2, BUF_X4, etc. The larger the number in the suffix, the stronger the driving capability of the cell.
  • Logical inverting driver cells, also known as inverters, which have the same function as NOT gates and also come with various driving capabilities.
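As an illustration only, the hypothetical fragment below shows what inserting a driver cell means at the netlist level: a BUF_X4 is added so that a single inverter no longer drives all downstream loads directly (the buffer pin names A/Z are assumed; check the library for the actual interface):

module buffered_net (input ctrl, output o0, output o1, output o2, output o3);
  wire n_sel, n_sel_buf;
  INV_X1 u_drv (.A(ctrl),  .ZN(n_sel));     // original driver
  BUF_X4 u_buf (.A(n_sel), .Z(n_sel_buf));  // inserted driver cell with higher drive strength
  assign {o0, o1, o2, o3} = {4{n_sel_buf}}; // downstream loads now take the buffered net
endmodule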

Physical Cells

Physical cells have no logical functions and are mainly used to solve specific problems in backend physical design that are unrelated to the circuit's logical functions. Some common physical cells include:

  • Pull-up/pull-down cells. These cells have no inputs, only outputs, providing logic 0 (low level) and logic 1 (high level) respectively. In nangate45, the pull-up cell and pull-down cell are LOGIC1_X1 and LOGIC0_X1 respectively.
  • Filler cells. They are used to fill blank areas in the chip, ensuring the continuity of certain layers (such as the power layer) and avoiding defects caused during the manufacturing process. In nangate45, filler cells include FILLCELL_X1, FILLCELL_X2, etc.
  • Decoupling cells (decap). They are used to avoid the impact of dynamic voltage drop caused by the simultaneous switching of a large number of cells in the circuit. Nangate45 does not currently provide decoupling cells.
  • Antenna effect repair cells. During the ion etching step of the chip manufacturing process, under certain conditions, the antenna effect on the circuit may be triggered, which can break down the transistor and make it ineffective. Adding such standard cells in appropriate positions can eliminate the antenna effect, thereby ensuring the correctness of the chip. In nangate45, the antenna effect repair cell is ANTENNA_X1.

Macro Cells

Macro cells are standard cells with specific functions, pre-designed for their physical implementation by manufacturers (foundries or IP vendors), and have a relatively large area. Examples include SRAM memories, DDR phy modules, etc. SRAM memories are a common type of macro cell and are often used in processor design. Nangate45 itself does not come with SRAM macro cells but integrates SRAM macro cells generated by the SRAM generator.

Compare storage density

Select an SRAM memory of a certain specification in nangate45, check its area, and calculate its storage density (that is, the amount of information that can be stored per unit area). Compare it with latches and flip-flops: how many times higher is the storage density of the selected SRAM?

Compare storage density (2)

Latches, flip-flops, and SRAM can all be used to store information. Why can the storage density of SRAM be higher than that of latches and flip-flops?

Compare storage density (3)

For the two specifications of SRAM, 32x64 and 64x32, their storage capacities are the same, but their areas are different. Which specification has a larger area? Why?

Complex Function Units

Complex function units include multiplexers, half-adders, full-adders, comparators, etc. Compared to circuits built with logic gates to achieve equivalent functions, providing these functions as standard cells can achieve better delay and area performance. In nangate45, complex function units include MUX2_X1, MUX2_X2, HA_X1, and FA_X1.
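For instance, the behavior of the half adder can be sketched as follows (pin names A/B/S/CO are assumed for illustration; the exact interface of HA_X1 should be checked in the library):

module HA_sketch (input A, input B, output S, output CO);
  assign S  = A ^ B;   // sum bit
  assign CO = A & B;   // carry-out bit
endmodule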

Full-Custom Circuits of Complex Units

Taking HA_X1 as an example, try to find the area and transistor structure of this standard cell from the relevant files of the standard cell library. Assume that a certain standard cell library does not provide a standard cell for the half-adder, and it is necessary to build a half-adder using several standard cells of basic logic gates. Please calculate the required area and number of transistors in this case.

Clock-Specific Cells

Clock-specific cells are dedicated to processing clock signals, including clock buffers, clock gating cells, and logic gate cells for handling clock signals, etc. The reason why general standard cells (such as AND gates, buffers, etc.) cannot be used to process clock signals is due to the particularity of clock signals:

  • Minor changes in clock signals may cause flip-flops to work incorrectly. For example, glitches in clock signals caused by jitter may be mistaken by flip-flops as the arrival of clock edges.
  • The delay of clock signals will also affect the timing of flip-flops, thereby affecting the operating frequency of the entire circuit.
  • All flip-flops in the circuit need to be connected to the clock, so the fan-out of the clock signal is very large and the transmission distance is very long, requiring strong driving capability. According to the introduction of the internal structure of the chip above, clock signals are usually transmitted in the upper metal layers.

Therefore, compared with general standard cells, the designers of the standard cell library need to carry out targeted designs for clock-specific cells, making them have characteristics such as low jitter, low delay, and high driving capability.

In nangate45, clock-specific cells include CLKBUF_X1, CLKGATE_X1, etc.

Power Management Cells

Power management cells are used to implement low-power designs, including power gating cells, isolation cells, etc. Nangate45 does not currently provide such cells.

Test and Debug Cells

Test and debug cells are used to support chip testing and debugging, including scan chain cells, Built-In Self Test (BIST) control cells, etc.

Scan chain cells are usually used in Design for Testability (DFT). On the basis of general flip-flops, they add a scan enable terminal SE (scan enable) and a scan input terminal SI (scan input). When SE is active, SI is used to update the flip-flop. Therefore, developers can inject specific states into such flip-flops through external control, which helps them debug the chip after production. However, compared with general flip-flops, scan chain cells have a larger area and higher power consumption.
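The scan behavior described above can be sketched as follows (port names are illustrative, in the spirit of the SDFF cells mentioned below; the real cells also come in variants with set/reset):

module SDFF_sketch (D, SI, SE, CK, Q, QN);
  input  D, SI, SE, CK;
  output reg Q;
  output QN;
  assign QN = ~Q;
  always @(posedge CK)
    Q <= SE ? SI : D;   // when scan enable is active, load the scan input instead of D
endmodule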

In nangate45, test and debug cells include SDFF_X1, SDFFS_X1, SDFFR_X1, SDFFRS_X1, etc.

Learn About All Standard Cells

Try to further explore the standard cells provided by nangate45 in combination with the relevant files of the PDK. You can check the corresponding comments and functional attributes to understand the role of the related standard cells. After doing so, you will have a basic understanding of how your RTL code is processed by the synthesizer.

PVT Corners

The delay of a circuit is mainly affected by three factors: Process, Voltage, and Temperature, collectively referred to as PVT parameters. Electronics engineers typically select multiple combinations of PVT parameters as a series of environments and strive to ensure that the chip will work in these environments during the design phase. These environments are called PVT corners.

Process variations refer to uncontrollable disturbance factors during chip manufacturing. For example, the environment of chips at the center of a wafer is different from that at the edge of the wafer; the thickness of the transistor's metal layer is not completely uniform; the doping concentration of the transistor substrate is uneven, and so on. These factors will affect the resistance and capacitance (collectively referred to as RC parameters) of the transistor, and ultimately affect the delay performance of the transistor: it may become faster or slower.

To test that the circuit can work correctly under various transistor delays, several scenarios are generally defined according to the operating speed of the transistor, which are called process corners. Process corners are usually named with two letters. The first letter represents the operating speed of nMOS, and the second letter represents the operating speed of pMOS. The operating speed is divided into three cases: typical (denoted by the letter t), fast (denoted by the letter f), and slow (denoted by the letter s). Among them, fast and slow are relative to the typical case. Therefore, according to the polarity and operating speed of the transistor, five process corners can be combined: ss, tt, ff, sf, and fs. For example, fs indicates a delay situation where nMOS works faster than the typical case, but pMOS works slower than the typical case.

Is "tt" considered a "corner"?

In a coordinate system where the X-axis represents the operating speed of nMOS and the Y-axis represents the operating speed of pMOS, ss, ff, sf, and fs will each fall at the four corners of a rectangle, which is the origin of the term "process corner". tt, however, actually lies at the center of the rectangle. Strictly speaking, it is not a "corner". But as a typical scenario, electronics engineers still include it in the concept of "process corners".

Among the five process corners mentioned above, for ss, tt, and ff, the operating speeds of nMOS and pMOS are basically consistent. Therefore, they do not have a significant impact on the overall function of the transistor, but only affect its delay. However, for sf and fs, since one of the nMOS and pMOS operates faster while the other operates slower, the delay when the transistor switches from 0 to 1 differs from that when it switches from 1 to 0. To ensure that various circuit components can work correctly, the determination of component delay parameters needs to be more cautious. Nevertheless, in actual manufacturing processes, due to the randomness of process variations, the probability that the operating speeds of nMOS and pMOS in a chip change in exactly opposite directions is extremely low. Therefore, electronics engineers usually do not consider the sf and fs process corners.

Process Corners and Intel Processor Models

Due to process variations, chips from the same batch may exhibit different performance. Intel has taken advantage of this by classifying chips from the same batch that belong to different process corners into different model tiers for sale. For example, in the Intel Core series, most chips with performance corresponding to the tt process corner are sold under the i5 model; a small number of chips belonging to the ff process corner, which can run at higher frequencies, are sold under the more expensive i7 model to generate higher profits; and the remaining chips that fall into the ss process corner, which can only run at lower frequencies than tt, are sold under the cheaper i3 model. This avoids discarding them as defective chips.

In the operating environment of a chip, the voltage is not constant either. For example, the current flowing through the power supply network forms a voltage drop across its resistance, so the input voltages of standard cells at different positions are not exactly the same: standard cells close to the power I/O cells receive a higher input voltage and their transistors work faster, while standard cells far from the power I/O cells receive a relatively lower input voltage due to the voltage drop, and their transistors work relatively more slowly. In addition, the power supply may also carry white noise, so even for standard cells at the same position, the working speed of the transistors fluctuates over time. To cope with voltage fluctuations, electronics engineers generally need to ensure that the circuit works correctly within a certain range around the nominal operating voltage V, typically ±10% (i.e., from 0.9V to 1.1V).

Temperature also affects the working speed of transistors. On the one hand, it is affected by the external ambient temperature. Some chips work in the high-temperature environment of factory workshops, and some work in the low-temperature environment of the North and South Poles. On the other hand, even if the external environment is the same, transistors at different positions in the chip will be affected by different temperatures: areas with high transistor density or high transistor switching frequency generate more heat. Compared with normal temperature, an increase in temperature will slow down the working speed of transistors. This is because, according to thermodynamic effects, particles have higher energy at high temperatures, so atoms in semiconductor materials will vibrate more intensely in the crystal lattice. Affected by this vibration, the direction of electron movement in the transistor channel will be changed, resulting in a decrease in current, which further reduces the working speed of the transistor. Therefore, to improve the robustness of the chip, the working conditions at different temperatures need to be considered.

The naming of LIB files usually includes information about PVT corners. For example, ss_100C_1v60 indicates a process corner of ss, a temperature of 100 degrees Celsius, and a voltage of 1.60V; ff_n40C_1v95 indicates a process corner of ff, a temperature of -40 degrees Celsius, and a voltage of 1.95V. In the integrated circuit field, the naming convention that uses a letter to replace the decimal point ., such as 1v60, is very common. On one hand, some early file systems or EDA tools did not support decimal points in file names. On the other hand, in densely written technical documents or on boards with small fonts, 1v60 has a better visual distinguishability than 1.60v, especially since the decimal point is easily overlooked. The use of the letter n (representing negative) to replace the minus sign - is for similar reasons. In addition, different manufacturers may adopt different naming conventions. For instance, some use the letter p (representing point) to replace the decimal point (e.g., v1p60 for 1.60V), and some use the letter m (representing minus) to replace the minus sign (e.g., m40C for -40 degrees Celsius).

Try Evaluation Results of Different PVT Corners

Nangate45 provides LIB files for different PVT corners (located in the yosys-sta/pdk/nangate45/lib/ directory). Try to replace the LIB files for different PVT corners in the yosys-sta project (specified in yosys-sta/pdk/nangate45.tcl), then re-evaluate the circuit performance and compare the evaluation results under different PVT corners.

Since PVT corners describe the working conditions of standard cells in different environments, the internal structure and geometry of the same standard cell are exactly the same under different PVT corners. Therefore, it is only necessary to include timing information and power consumption information for different PVT corners in different LIB files. Designing a chip with EDA tools under a certain PVT corner is actually answering the question: "How will the chip perform if it works in the environment described by this PVT corner in the future?" The actual measured performance of the chip is related to its real working environment. If the actual working environment is inconsistent with the PVT corner used when designing the chip, the information reported by EDA tools cannot fully represent the actual measured performance of the chip.

PVT Corners and Overclocking

Some enthusiast groups use water cooling or even liquid nitrogen to overclock processors, successfully enabling computers to run stably at higher operating frequencies. Try to analyze why these technologies allow successful overclocking from the perspective of PVT corners.

Manufacturers' Marketing Strategies

Some manufacturers may leverage the differences between PVT corners and actual operating environments for marketing purposes. For example, a manufacturer uses the PVT corner ff_n40C_1v95 during the chip design phase. EDA tools report that the chip can run at a maximum frequency of 2GHz, so the manufacturer claims that its chip reaches a 2GHz frequency. However, after users purchase the manufacturer's chips or related products equipped with these chips, they find that the chips can only run at a maximum of 1.5GHz.

On one hand, users use chip products at room temperature (about 25 degrees Celsius) rather than -40 degrees Celsius. On the other hand, among a batch of chips, those with typical process variations account for the majority, and most users purchase these chips, while those with process variations in the ff range are in the minority. If the manufacturer uses the PVT corner tt_25C_1v95 during the design phase, the data reported by EDA tools will be closer to the actual usage of users.

Therefore, if the frequency in a manufacturer's advertisement is not measured data, it is necessary to pay attention to the PVT corner under which the relevant data is evaluated, so as to make a more objective estimate of the future working conditions of the chip.

Threshold Voltage

Recalling the working principle of a transistor, the difference between the gate voltage and the source voltage must reach a certain threshold for the transistor to conduct; otherwise, the transistor remains off. Transistors with different thresholds have different electrical properties, so standard cells built with transistors of different thresholds also have distinct characteristics.

Standard cells with different threshold voltages are mainly used to strike a balance between delay and static power consumption. Specifically, for standard cells with a higher threshold voltage, it takes more time for the transistor to switch from the off state to the on state, resulting in higher delay. As for static power consumption, as mentioned earlier, it is mainly caused by leakage current. In current CMOS technology, the largest component of leakage current is the sub-threshold current. The sub-threshold current exists because when a transistor switches from the on state to the off state, it does not instantly enter a perfect off state but instead enters a "sub-threshold" state. In this state, there is still a weak electric field near the gate, which can still attract a small number of electrons. Although these electrons are not enough to form a channel connecting the source and the drain, they still generate a tiny current from the drain to the source, which is the sub-threshold current. Assuming other factors remain unchanged, the relationship between the sub-threshold current I_sub and the threshold voltage V_T is as follows:

I_sub = k1 * e^(-V_T / k2)

where k1 and k2 are two factors independent of V_T. Since the sub-threshold current accounts for a large proportion of the leakage current, the static power consumption can be approximately expressed as:

P_static ≈ V_DD * I_sub = k1 * V_DD * e^(-V_T / k2)

where V_DD is the supply voltage. It can be seen that the higher the threshold voltage, the lower the leakage current and the lower the static power consumption. Conversely, when the threshold voltage decreases, the static power consumption increases exponentially.

However, standard cells with the same logical function but different threshold voltages usually have the same area. This is because different threshold voltages are achieved by adjusting the parameters of the transistors themselves, such as the doping concentration of the substrate and the thickness of the gate dielectric. These parameter adjustments do not affect the size and arrangement of the transistors in the standard cell, so they will not affect the area of the standard cell.

Standard cells are usually classified by threshold voltage into the following categories: HVT (High Threshold Voltage), SVT (Standard Threshold Voltage), LVT (Low Threshold Voltage), and ULVT (Ultra-Low Threshold Voltage). Among them, HVT has the highest threshold voltage, the lowest static power consumption, but the highest delay; ULVT is the opposite, with the lowest threshold voltage, the lowest delay, but the highest static power consumption. Some manufacturers refer to SVT as RVT (Regular Threshold Voltage), and some process nodes also provide UHVT (Ultra-High Threshold Voltage) standard cells. Nangate45 does not currently provide standard cells with different threshold voltages, so it can be considered to provide only SVT standard cells.

Electronics engineers need to select standard cells with appropriate threshold voltages according to the application scenario of the chip. For example, in low-power application scenarios, HVT is preferred; in high-performance application scenarios, LVT or even ULVT is preferred. When both goals are pursued, a hybrid design approach can be chosen: using LVT or ULVT on the critical paths that affect the frequency, thereby reducing their delay and increasing the frequency of the chip; and using SVT or HVT on non-critical paths that do not affect the frequency, so as to reduce the overall static power consumption of the chip without lowering its frequency. For example, according to the "Xiangshan" paper published at MICRO, a top international conference in the field of computer architecture, the proportions of standard cells with different threshold voltages in the first-generation "Xiangshan" processor chip are: ULVT 1.04%, LVT 19.32%, SVT 25.19%, HVT 53.67%.

Number of Tracks

Another way to classify standard cell libraries is by the number of tracks, which describes the height of the standard cells: the cell height divided by the routing pitch of the lower metal layers (e.g., metal1) gives the number of horizontal routing tracks that a cell spans. For example, a library whose cells are 9 tracks tall is called a 9T library.

Determine the number of tracks in nangate45

Attempt to find the required parameters in relevant files and calculate the number of tracks in nangate45 standard cells.

Standard cells with fewer tracks (such as 6T, 7T) have smaller areas and lower power consumption, but their driving capability is weaker, resulting in longer transistor switching times and thus lower performance. On the contrary, standard cells with more tracks (such as 12T, 13T) have higher performance but larger areas and higher power consumption. There are also standard cells with a track count between the two (such as 9T, 10T), which achieve a relatively balanced performance in terms of indicators like performance, area, and power consumption. Some PDKs provide multiple standard cell libraries with different numbers of tracks. Electronics engineers need to select standard cells with an appropriate number of tracks for chip design based on the application scenarios of the chip. However, after selecting a standard cell library, it is impossible to mix standard cells with different track counts in the circuit. This is different from threshold voltages, because a standard cell library can also include standard cells with multiple threshold voltages, which can be mixed and used. For example, the PDK of skywater130 provides the following standard cell libraries:

Standard Cell Library | Characteristics           | Track Number | Grid Cell (μm)
sky130_fd_sc_hd       | high density              | 9T           | 0.46 x 2.72
sky130_fd_sc_hdll     | high density, low leakage | 9T           | 0.46 x 2.72
sky130_fd_sc_hs       | high speed                | 11T          | 0.48 x 3.33
sky130_fd_sc_ms       | medium speed              | 11T          | 0.48 x 3.33
sky130_fd_sc_ls       | low speed                 | 11T          | 0.48 x 3.33
sky130_fd_sc_lp       | low power                 | 11T          | 0.48 x 3.33
sky130_fd_sc_hvl      | high voltage              | 14T          | 0.48 x 4.07

Learn about skywater130

As a producible open-source PDK, skywater130 offers a rich set of standard cell libraries, including different PVT corners, different threshold voltages, and different track counts. The entire PDK is as large as 20GB. If you are interested, you can download skywater130 and further explore its various details.

Physical Design - From Netlist to Tapeout-Ready Layout

Physical design refers to the process of mapping the standard cells and their connection relationships recorded in the netlist to the three-dimensional space of a real chip. Specifically, EDA tools responsible for physical design need to determine the coordinates of each standard cell in the chip, as well as the routing direction of the wires. This ensures that the wires can connect standard cells located at different coordinates according to the connection relationships in the netlist, thereby achieving functions consistent with the netlist logic. The file that records the coordinates of standard cells and the routing directions is the GDS layout file mentioned above. Finally, EDA tools also need to evaluate whether the resulting chip can be manufactured correctly and whether the chip's indicators meet expectations.

After understanding the process structure of the chip, you can grasp the essence of physical design. The process of physical design is to determine the content in each layer:

  • Where to place which standard cells in the lower layers (floorplanning, placement)
  • How to connect these standard cells in the middle layers (routing)
  • How to plan power (power planning) and clocks (clock tree synthesis) in the upper layers
  • What size the area of each layer should be (floorplanning)

Let's introduce the work to be carried out in each of these stages one by one.

---------------------   M7    <----- Power Planning
  | | | | | | | | |
---------------------   M6    <----- Clock Tree Synthesis
  | | | | | | | | |
---------------------   M5    <-+
  | | | | | | | | |             +--- Routing
---------------------   M4    <-+
  | | | | | | | | |
---------------------   M3    <-------+
  | | | | | | | | |                   |
---------------------   M2    <-------+--- Floorplanning, Placement
  | | | | | | | | |                   |
---------------------   M1    <-------+
  | | | | | | | | |                   |
=====================  Poly-silicon <-+
+++++++++++++++++++++  dielectric
ooooooooooooooooooooo  Silicon Substrate

Since physical design involves actual circuits, physical design engineers need to understand relevant knowledge in the field of electronics to be competent in their work. However, in "OSOC", you only need to roughly understand the steps involved in converting a netlist into a tapeout-ready layout. This will help you understand the mutual influences between logical design (i.e., RTL design) and physical design in the future. You do not need to delve into or even memorize all the details of the physical design process.

Floorplan

The main task of floorplan is to determine the size of the chip and place some cells whose positions will not be adjusted in subsequent processes. The work of floorplan mainly includes the following contents.

Determining the Chip Size

Similar to the area of standard cells, the die size refers to the projected area of the chip on the horizontal (xy) plane, which is the area of the rectangle obtained from the top view of the chip. It is also the area of each metal layer in the chip, hence it is also called the chip area. The thickness of the chip (i.e., its length in the z-axis direction) is determined by the selected process, such as the 1P7M process mentioned above. Generally, once a process is selected, the thickness of the chip is not an adjustable parameter, so the physical design stage usually does not concern itself with the chip's thickness.

Based on the chip's process structure, the chip area is mainly composed of the area occupied by transistors and the area occupied by routing. The area of transistors mainly includes the area occupied by the sources and drains in the silicon substrate, plus the area occupied by the gates in the polysilicon layer. However, the standard cell library already provides the area occupied by each standard cell, so users and EDA tools do not need to consider dimensions at the transistor level. Routing is divided into the vertical direction (i.e., the z-axis direction) and the horizontal direction. The former passes through vias between metal layers, whose projection on the xy plane is a point; the latter extends within a metal layer, whose projection on the xy plane is a line. On the surface, neither of these occupies any area, but manufacturing rules require minimum distances between vias and between routing wires; otherwise, signal interference or even short circuits will occur. Therefore, in practice, routing also occupies a certain amount of area.

Since routing has not yet been carried out at this point, the exact area it will occupy cannot be determined. In the floorplanning stage, the chip size is therefore estimated from the area report obtained from synthesis (i.e., the total area occupied by standard cells). For this estimation, engineers need to consider the expected proportion of the total standard cell area to the total chip area; this proportion is called the utilization. Based on experience, the utilization is generally around 60% to 80%. For example, suppose the total area of standard cells after synthesis is A and an engineer expects a utilization of 70%; then, in the floorplanning stage, the estimated chip size is A / 0.7, i.e., about 1.43 times the total standard cell area.

The selection of utilization rate requires a trade-off between cost and design difficulty. A higher utilization rate means less space left for the routing stage, which makes it easier to encounter congestion during routing and requires long-distance wiring. This increases wire delay, reduces the chip frequency, and may even lead to failure due to excessive congestion, making it impossible to complete the physical design. On the other hand, a lower utilization rate results in a larger amount of redundant area in the chip, causing waste. Since the manufacturing cost of a chip is generally proportional to its area, this introduces unnecessary cost expenses.

Engineers need to select the target utilization rate based on the chip's characteristics and their own experience: For small chips, their topology is relatively simple, and placement and routing are easier to achieve successfully, so a higher utilization rate can be set. For complex large chips, however, a too high utilization rate is not appropriate, as sufficient space needs to be reserved for the routing stage. Experienced engineers can set a higher utilization rate, while novices can start with a lower utilization rate in the early stages to accumulate experience. In large-scale projects, engineers typically conduct multiple rounds of physical design. They adjust the parameters of subsequent rounds based on the results of the previous round (such as excessive congestion or excessive remaining area), thereby continuously optimizing the physical design outcomes and achieving the expected performance goals without incurring excessive costs.

Determining the Chip Side Lengths

After determining the approximate area, it is also necessary to consider the chip's side lengths, i.e., the lengths of the chip along the x-axis and y-axis directions. Factors affecting the side lengths include not only the synthesized area of standard cells but also the number of chip pins. The impact of the number of pins is, in turn, related to the packaging scheme. A common packaging scheme is QFP (Quad Flat Package), where pins are distributed around the four sides of the chip, so the side lengths of the chip are proportional to the number of pins.

         |    |    |
    +----+----+----+----+
    |                   |
 ---+                   +---
    |                   |
 ---+                   +---
    |                   |
 ---+                   +---
    |                   |
    +----+----+----+----+
         |    |    |

Each pin of a chip corresponds to an I/O cell, so the influence of the number of chip pins on the chip's side lengths is actually reflected through the size of the I/O cells. On one hand, as mentioned earlier, the size of an I/O cell is several orders of magnitude larger than that of a general standard cell. On the other hand, since some I/O cells need to undertake the task of power supply and cannot be used for communication, the total number of pins that actually need to be planned is more than the number of pins used for communication. The proportion and distribution density of power supply pins are related to the chip's size, process, power consumption and other attributes. Process-related manuals will provide recommended layouts for power supply pins. But generally speaking, the larger the size and the higher the power consumption of the chip, the greater the power supply required, and the more the number of power supply pins.

For example, if a chip requires 90 pins for communication and 1/3 of all pins are assumed to be needed for power supply, then the total number of pins required is 90 × 3/2 = 135, so a common packaging scheme with 144 pins should be chosen. If the pins are evenly distributed around the four sides, each side needs to hold approximately 144 / 4 = 36 pins. Suppose the chip is designed using a 130nm process whose I/O cells can be placed immediately next to each other, and let the width of one I/O cell be w. Then the length of one side of the chip is at least 36w, and the minimum area of the chip is (36w)². That is to say, if this packaging scheme is adopted, then even when the chip area estimated from utilization is smaller than this value, the chip still has to be planned at least this large once pins and packaging are taken into account. Some processes require a certain gap to be reserved between adjacent I/O cells, in which case the side length of the chip will be even longer and the minimum area even larger.
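
The same back-of-the-envelope calculation can be scripted. The C sketch below redoes the 144-pin QFP estimate; the I/O cell width is a made-up placeholder, since the real value comes from the I/O cell library of the chosen process:

#include <stdio.h>

// Pin-driven lower bound on die size for a QFP package.
// io_cell_width_um is a hypothetical placeholder, not a real 130nm I/O library value.
int main() {
  int    comm_pins        = 90;                  // pins needed for communication
  int    total_pins       = comm_pins * 3 / 2;   // 1/3 of all pins are power pins -> 135
  int    package_pins     = 144;                 // next common QFP pin count above 135
  int    pins_per_side    = package_pins / 4;    // 36 pins per side
  double io_cell_width_um = 75.0;                // placeholder I/O cell width
  double side_um          = pins_per_side * io_cell_width_um;  // I/O cells placed edge to edge
  printf("total pins: %d, pins per side: %d\n", total_pins, pins_per_side);
  printf("minimum side: %.0f um, minimum area: %.2f mm^2\n",
         side_um, side_um * side_um / 1e6);
  return 0;
}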

Must a chip be square in shape?

The symmetry of a square can ease the burden of certain subsequent tasks, such as power planning and clock tree synthesis, which require that the distances from the source point to all target points do not differ too much. However, in reality, the side lengths of a chip can be different. A chip can be taped out and produced as long as it can accommodate the required I/O cells, successfully complete physical design (such as successful routing), meet the manufacturing constraints specified by the manufacturer, and have a feasible packaging scheme.

Therefore, the number of chip pins, as a kind of resource, needs to have its requirements clearly defined during the specification stage at the beginning of the project. At the same time, the pin count also gives a quick estimate of the minimum chip area. For example, if a chip only needs 28 pins for communication, then by the same power-pin proportion a 44-pin packaging scheme can be adopted; under the above-mentioned 130nm process, each side only needs to hold 44 / 4 = 11 I/O cells, so the minimum side length is 11w and the minimum chip area is (11w)², far smaller than in the 144-pin case.

Another factor that may affect the chip size is special macro cells. Since macro cells are pre-designed, their shapes are fixed, and in order to place certain special macro cells, the side lengths of the chip need to meet certain requirements. For example, some DDR PHY macros need to be placed on the I/O boundary of the chip and have an L-shaped footprint; this requires the long side of the chip to be longer than the long side of the L-shape, otherwise the DDR PHY macro cannot be placed. By combining the above conditions, the size of the chip can be initially determined, and its top view is shown in the following figure.

+-----------------------------------------+
|                                         |
|                                         |
|                                         |
|                                         |
|                                         |
|                                         |
|                                         |
|                                         |
|                                         |
|                                         |
|                                         |
|                                         |
|                                         |
|                                         |
|                                         |
|                                         |
|                                         |
+-----------------------------------------+

Placing I/O Cells

After determining the chip's side lengths, the next step is to place I/O cells around the chip. Generally, the placement of I/O cells follows these practices:

  • Place data I/O cells corresponding to functionally similar top-level ports in physically adjacent positions. Data I/O cells logically correspond to the top-level ports of the entire design and are connected to the standard cells inside the chip through wiring, so this placement helps reduce wire delays during the routing stage (such as ports A and B in the figure below). Otherwise, if functionally similar ports are placed on opposite sides of the chip or even diagonally across it, one end will inevitably require long wires to reach the target standard cells.
         |    |    |                         |    |    |
    +----+----+----+----+               +----+----+----+----+
    |                   |               |                   |
A---+--+                +---        A---+--+                +---
    |  o                |               |  o                |
B---+--+                +---         ---+  +                +---
    |                   |               |  |                |
 ---+                   +---         ---+  +----------------+---B
    |                   |               |                   |
    +----+----+----+----+               +----+----+----+----+
         |    |    |                         |    |    |
  • Place corresponding core power supply units and I/O power supply units in accordance with the requirements for power pin density specified in the process manual. For example, a certain process may require that a power supply unit be placed every two data I/O units.
+-----------------------------------------+
|  I  I  P  p  I  I  P  p  I  I  P  p  I  |
|                                         |
|p                                       I|
|                                         |
|P                                       P|
|                                         |          I Data I/O Cell
|I                                       p|          p Core Power Cell
|                                         |          P I/O Power Cell
|I                                       I|
|                                         |
|I                                       I|
|                                         |
|p                                       p|
|                                         |
|P                                       P|
|                                         |
|  I  p  P  I  I  p  P  I  I  p  P  I  I  |
+-----------------------------------------+

Placing Macro Cells

Another task is to place macro cells. Macro cells occupy a much larger area than ordinary standard cells. For example, a 64x64 SRAM, counting only the memory cells, requires 64 × 64 × 6 = 24,576 transistors (assuming a typical 6-transistor SRAM cell), while the standard cell of a two-input NAND gate only needs 4 transistors. Therefore, macro cells need to be placed in advance; otherwise, after standard cells have been placed, it will be difficult to free up a continuous large area for the macro cells.

+-----------------------------------------+
|  I  I  P  p  I  I  P  p  I  I  P  p  I  |
|                                         |
|p                                       I|
|    +-------+                            |
|P   | MMMMM |                           P|
|    | MMMMM |                            |          I Data I/O Cell
|I   | MMMMM |                           p|          p Core Power Cell
|    +-------+                            |          P I/O Power Cell
|I                                       I|<-- clk   M Macro Cell
|                            +-------+    |
|I                           | MMMMM |   I|
|                            +-------+    |
|p                                       p|
|                                         |
|P                                       P|
|                                         |
|  I  p  P  I  I  p  P  I  I  p  P  I  I  |
+-----------------------------------------+

Similar to the placement of I/O cells, the placement of macro cells also needs to refer to their functions. Macro cells with similar functions should be placed in physically adjacent positions. This avoids long wiring during the routing stage, which would otherwise affect the chip's frequency.

Powerplan

The goal of powerplan is to plan the distribution of power supply wiring at the chip level to ensure the reliability of the chip's power supply. The work of powerplan mainly includes:

  1. Planning the power ring for I/O cells. From a physical distribution perspective, the power ring of I/O cells surrounds the I/O cells around the chip and is connected to the power ports of the I/O cells, with power supplied by the I/O power units within the I/O cells. As mentioned earlier, I/O cells require strong driving capability to drive circuits external to the chip, so the power supply for I/O cells is different from that for general standard cells and requires separate planning and design.
  2. Planning the internal power ring of the chip. In terms of physical distribution, it is similar to the power ring of I/O cells but is located inside the I/O cells and surrounds the chip, forming the main trunk of the power network. It is powered by the core power units in the I/O cells and provides a uniform power input to the standard cells inside the chip.
  3. Planning the internal power stripes of the chip. Physically, power stripes are distributed in a crisscross pattern inside the chip, used to uniformly deliver power to various macro cells and standard cells within the chip. During the placement stage, these power stripes will be connected to the source and drain of the gate circuits in the standard cells.

For the sake of simplicity, the following figure only shows one power stripe as an illustration; in reality, multiple crisscrossing power stripes should be planned.

+-----------------------------------------+
|  I  I  P  p  I  I  P  p  I  I  P  p  I  |
| ####################################### |
|p#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%#I|
| #% +-------+                         %# |
|P#% | MMMMM |                         %#P|
| #% | MMMMM |                         %# |          I Data I/O Cell
|I#% | MMMMM |                         %#p|          p Core Power Cell
| #% +-------+                         %# |          P I/O Power Cell
|I#%                                   %#I|<-- clk   M Macro Cell
| #%                         +-------+ %# |          # I/O Power Ring
|I#%                         | MMMMM | %#I|          % Core Power Ring
| #%                         +-------+ %# |          = Power Stripe
|p#%===================================%#p|
| #%                                   %# |
|P#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%#P|
| ####################################### |
|  I  p  P  I  I  p  P  I  I  p  P  I  I  |
+-----------------------------------------+

In the future, when the chip is operating, power will be input into the chip through the power supply pins, propagated to the surrounding areas of the chip via the power rings, and then transmitted to the standard cells in various regions of the chip through the power stripes. Corresponding voltages will be applied to the source and drain of the transistors, enabling them to operate in accordance with their electrical characteristics.

Some complex chips also need to support power management-related functions, such as multi-voltage domains and power gating. Planning work related to these functions also needs to be carried out at this stage.

Placement

The goal of placement is to position standard cells within the chip and determine the physical location of each standard cell. However, the placement of standard cells is not arbitrary and must follow certain rules:

  1. Standard cells must not overlap with each other. Although a chip is a three-dimensional object, according to the process structure, standard cells are implemented with transistors in the lowest layers and connections in the lower metal layers, and the transistors of different standard cells must occupy different positions in these layers. In other words, all standard cells share the same z-axis coordinate range, so in the projection onto the xy-plane (i.e., the top view of the chip), standard cells must not overlap.
  2. The placement of standard cells must meet certain alignment conditions. As mentioned earlier, attributes such as the SITE and track count of standard cells are essentially used to constrain the positions of standard cells during placement. Aligning by SITE allows the power stripes set in the power planning stage to be easily connected to the standard cells (as shown in the figure below), while the concept of track count enables the subsequent routing stage to more easily meet the minimum spacing requirements for metal layer routing (i.e., the PITCH attribute). If the placement of standard cells does not meet the alignment requirements, EDA tools will need to spend significant effort to generate a design that meets process requirements.
        --- +------+  +------------+----------+    +---+
         ^  |======|==|============|==========|====|===| <- VSS power stripe
         |  |      |  |            |          |    |   |
height --+  |      |  |            |          |    |   |
         |  |      |  |            |          |    |   |
         v  |======|==|============|==========|====|===| <- VDD power stripe
        --- +------+  +------------+----------+    +---+
               OR2        AOI221       AND4        NAND2
+-----------------------------------------+
|  I  I  P  p  I  I  P  p  I  I  P  p  I  |
| ####################################### |
|p#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%#I|
| #% +-------+  @@  @@  @           @ @%# |
|P#% | MMMMM |   @      @@@      @@  @@%#P|
| #% | MMMMM |  @    @                @%# |          I Data I/O Cell
|I#% | MMMMM |                @        %#p|          p Core Power Cell
| #% +-------+   @   @@           @    %# |          P I/O Power Cell
|I#%                                   %#I|<-- clk   M Macro Cell
| #% @   @   @   @@  @       +-------+ %# |          # I/O Power Ring
|I#% @    @  @   @   @   @   | MMMMM | %#I|          % Core Power Ring
| #% @   @   @   @   @    @  +-------+@%# |          = Power Stripe
|p#%===================================%#p|          @ Standard Cell
| #% @   @       @   @    @     @     @%# |
|P#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%#P|
| ####################################### |
|  I  p  P  I  I  p  P  I  I  p  P  I  I  |
+-----------------------------------------+

In addition to ensuring that the placement of standard cells meets the requirements of process manufacturing, EDA tools also consider ways to improve the quality of the circuit. Some of the measures include, but are not limited to:

  • Placing logically similar standard cells close to each other. Similar to the placement of data I/O cells mentioned above, if two logically similar standard cells are far apart, higher wire delays will be introduced during the routing stage, thereby reducing the chip's frequency (the sketch after this list shows the half-perimeter wirelength estimate that placement tools commonly use to quantify wiring cost).
  • Applying mirror symmetry to standard cells. As mentioned earlier, the SYMMETRY attribute in the LEF file indicates that a standard cell may be placed mirrored about the x-axis or y-axis to improve the placement result (such as the wire delay to a certain port). For example, if port p on the left side of a standard cell A needs to be connected to another standard cell B located to the right of A, then A can be mirrored about the y-axis so that port p ends up on A's right side, which shortens the distance between p and B and reduces wire delay.
  • Congestion mitigation. Excessive concentration of standard cells in a certain area may make subsequent routing work difficult. It will not only cause routing detours, introducing high line delays, but may even lead to routing failure due to excessive congestion. To alleviate congestion, EDA tools may disperse the overly concentrated standard cells, thereby reserving more space for the routing stage.
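
As a concrete illustration of why distance matters to the placer, many placement tools estimate each net's wiring cost with its half-perimeter wirelength (HPWL): the width plus the height of the bounding box of the net's pins. The minimal C sketch below compares a tightly placed net against a spread-out one; the pin coordinates are invented purely for illustration:

#include <stdio.h>

// Half-perimeter wirelength (HPWL): a common estimate of how much wire a net
// will need, equal to the width plus the height of the bounding box of all
// pins on the net. Placers try to keep this small for every net.
typedef struct { double x, y; } Pin;

double hpwl(const Pin *pins, int n) {
  double min_x = pins[0].x, max_x = pins[0].x;
  double min_y = pins[0].y, max_y = pins[0].y;
  for (int i = 1; i < n; i++) {
    if (pins[i].x < min_x) min_x = pins[i].x;
    if (pins[i].x > max_x) max_x = pins[i].x;
    if (pins[i].y < min_y) min_y = pins[i].y;
    if (pins[i].y > max_y) max_y = pins[i].y;
  }
  return (max_x - min_x) + (max_y - min_y);
}

int main() {
  Pin close_together[] = { {10, 10}, {12, 11}, {11, 14} };    // hypothetical coordinates (um)
  Pin far_apart[]      = { {10, 10}, {300, 11}, {11, 290} };
  printf("HPWL close: %.1f um\n", hpwl(close_together, 3));   // small -> short wires
  printf("HPWL far:   %.1f um\n", hpwl(far_apart, 3));        // large -> long wires, more delay
  return 0;
}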

Size of Filler Cells

The standard cell library usually provides filler cells of different sizes, which are used to fill the blank positions in the chip where no standard cells are placed. Taking nangate45 as an example, try to find the size of the minimum size filler cell in the relevant files. What is the relationship between this size and the SITE property of the standard cell? Why?

Clock Tree Synthesis (CTS)

The goal of Clock Tree Synthesis is to construct a clock network that delivers clock signals to the clock pins of all sequential cells. This clock network typically has only one or a few sources (clock pins or outputs of phase-locked loops). We can regard these sources as root nodes and the sequential cells as leaf nodes. Thus, this clock network is like one or several trees growing from the root nodes to the leaf nodes, hence the name "clock tree".

In the RTL design stage, we consider clock signals to be ideal. However, this is not the case in reality. In the physical design stage, we need to take into account the actual issues that clock signals have to deal with. The special properties of clock signals were briefly discussed when introducing "clock-dedicated cells" earlier. Therefore, the constructed clock tree should also satisfy these properties, which specifically include:

  • Low latency. There are many factors that introduce delays into the propagation of clock signals. Some can be optimized by EDA tools, such as the routing of clock signals. EDA tools should minimize the distance from the clock source to the clock ports of flip-flops. Others cannot be optimized by EDA tools, such as the delay of the clock source itself. EDA tools should consider these factors when modeling clock signal delays.
  • Low skew. During RTL design, we assume that ideal clock signals reach all flip-flops simultaneously. In reality, however, different flip-flops are placed in different positions during the placement stage, and the time required for the same clock signal to reach different flip-flops is not exactly the same, which gives rise to the concept of clock skew. To minimize clock skew as much as possible, EDA tools need to carefully plan the routing of clock signals, ensuring that the wire delays from the clock source to each flip-flop are as uniform as possible.
  • Low jitter. Jitter is an inherent characteristic of electrical signals in the physical world, related to specific process parameters, and cannot be optimized or eliminated by EDA tools. Therefore, EDA tools should consider the impact of jitter when modeling clock signal delays. Otherwise, the EDA tool's estimation of clock signal delays may be overly optimistic, and when the chip operates in real scenarios in the future, the actual jitter may cause the circuit to violate the overly optimistic timing conditions, ultimately making the chip unable to work correctly (the sketch after this list shows how skew and jitter eat directly into the usable clock period).
  • High drive capability. To achieve high drive capability of clock signals, dedicated clock buffers are generally inserted into the clock tree. However, there are certain mutual constraints among these properties. For example, some routing topologies have the property of low skew but come with high line delays; inserting clock buffers can enhance the driving capability of clock signals, but it will also change the delay on the corresponding paths, which may lead to severe skew. Therefore, EDA tools need to comprehensively consider the impact of these technologies on the clock tree and construct a clock tree that meets the requirements as a whole.
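
To see why low skew and low jitter matter, consider a deliberately simplified and pessimistic model of a register-to-register path: the clock period must cover the launch flip-flop's clock-to-Q delay, the combinational logic delay, and the capture flip-flop's setup time, plus margin for skew and jitter. The C sketch below uses hypothetical delay values in nanoseconds; real timing analysis, covered in a later phase, is considerably more detailed:

#include <stdio.h>

// A simplified, pessimistic estimate of the minimum clock period of a
// register-to-register path. All delay values are made-up placeholders (ns).
double min_clock_period(double t_clk2q, double t_comb, double t_setup,
                        double t_skew, double t_jitter) {
  // Worst case: the capture clock edge arrives t_skew earlier than the launch
  // edge, and jitter further shortens the effective cycle.
  return t_clk2q + t_comb + t_setup + t_skew + t_jitter;
}

int main() {
  double t = min_clock_period(0.10, 1.20, 0.05, 0.08, 0.03);
  printf("min period = %.2f ns, max frequency = %.0f MHz\n", t, 1000.0 / t);
  return 0;
}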

For the sake of simplicity, the following figure only shows a part of the clock tree as an illustration; in reality, the clock tree should be connected to all sequential cells.

+-----------------------------------------+
|  I  I  P  p  I  I  P  p  I  I  P  p  I  |
| ####################################### |
|p#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%#I|
| #% +-------+  @@o @@  @           @ @%# |
|P#% | MMMMM |   @o     @@@      @@ o@@%#P|
| #% | MMMMM |  @ o  @    o         o @%# |          I Data I/O Cell
|I#% | MMMMM |    o       o   @     o  %#p|          p Core Power Cell
| #% +-------+   @o  @@   o       @ o  %# |          P I/O Power Cell
|I#% oooooooooooooooooooooooooooooooooo%#I|<-- clk   M Macro Cell
| #% @   @   @   @@o @    o  +-------+ %# |          # I/O Power Ring
|I#% @    @  @   @ o @   @o  | MMMMM | %#I|          % Core Power Ring
| #% @   @   @   @ o @    @  +-------+@%# |          = Power Stripe
|p#%===================================%#p|          @ Standard Cell
| #% @   @       @ o @    @     @   oo@%# |          o Clock Tree
|P#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%#P|
| ####################################### |
|  I  p  P  I  I  p  P  I  I  p  P  I  I  |
+-----------------------------------------+

Routing

The goal of routing is to connect the standard cells placed during the placement stage through wires according to the topological relationship of the netlist. As large-scale and very-large-scale integrated circuits contain a huge number of standard cells, there are also numerous wires between them. Routing will fail if even one wire cannot be connected. To increase the probability of successful routing, the routing task is generally divided into two stages: Global Routing and Detailed Routing.

Using urban road planning as an analogy, global routing is similar to planning a city's main roads. On one hand, it must ensure connectivity between different locations in the city; on the other hand, it should prevent main roads from being excessively circuitous and achieve connectivity with the shortest possible distance. Finally, it also needs to avoid excessive congestion in certain areas. Detailed routing, by contrast, is equivalent to further dividing actual lanes on the main roads, allowing vehicles to travel on these lanes to reach destinations connected by the main roads.

Returning to the context of routing, the goal of global routing is to plan coarse-grained wiring schemes and allocate wiring resources for these schemes, including the number of tracks, wiring directions, and vias between metal layers. Specifically, during the global routing stage, routing tools treat multiple tracks as a grid and then attempt to connect standard cells using the grid as a coarse-grained unit, resulting in "grid paths" (similar to main roads in road planning). In the process of global routing, while ensuring connectivity, routing tools seek a "grid path connectivity scheme" that is relatively short in distance and avoids excessive congestion. The goal of detailed routing is, based on the global routing, to determine specific wiring tracks within the "grid paths" and use these tracks to physically connect the standard cells.

+-----------------------------------------+
|  I  I  P  p  I  I  P  p  I  I  P  p  I  |
| ####################################### |
|p#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%#I|
| #% +-------+..@@o @@..@...   .....@.@%# |
|P#% | MMMMM |  .@o.....@@@..... @@ o@@%#P|
| #% | MMMMM |..@ o  @    o   .   . o @%# |          I Data I/O Cell
|I#% | MMMMM |  . o       o   @   . o .%#p|          p Core Power Cell
| #% +-------+  .@o  @@...o   ....@ o .%# |          P I/O Power Cell
|I#% oooooooooooooooooooooooooooooooooo%#I|<-- clk   M Macro Cell
| #% @ ..@...@ ..@@o @... o  +-------+.%# |          # I/O Power Ring
|I#% @....@ .@.. @ o @. .@o .| MMMMM |.%#I|          % Core Power Ring
| #% @...@...@  .@ o @  ..@..+-------+@%# |          = Power Stripe
|p#%===================================%#p|          @ Standard Cell
| #% @...@      .@ o @... @.....@.. oo@%# |          o Clock Tree
|P#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%#P|          . Wiring
| ####################################### |
|  I  p  P  I  I  p  P  I  I  p  P  I  I  |
+-----------------------------------------+

Up to this point, the work of physical design has been fully completed, and the content of each layer in the chip has been determined. The specific layout of all standard cells and physical wirings in each layer of the chip can be described through the GDS layout file.

Sign-off Analysis

The goal of sign-off analysis is to ensure that the layout obtained through the physical design process is manufacturable. On one hand, it is necessary to ensure that the layout meets the indicators of the front-end design, including PPA and others; on the other hand, it is also required to ensure that the layout complies with the manufacturing requirements of the wafer fab. Strictly speaking, sign-off analysis does not fall within the scope of physical design, but it is indispensable to guarantee that the layout is tape-out ready. Otherwise, the chips produced may fail to function properly.

The specific tasks of sign-off analysis include but are not limited to:

  • Static Timing Analysis (STA). After the completion of physical design, the positions of all standard cells in three-dimensional space are determined, and details such as the direction, length, and corners of all wirings are clear. Therefore, it is possible to accurately model the logic delays and wire delays of each path in the circuit, including the propagation delay, jitter, and skew of clock signals. By integrating these factors, more accurate timing evaluation results can be obtained to determine whether the user-specified frequency indicators are met.
  • Power Analysis. Similar to static timing analysis, after the physical design is completed, more accurate power consumption evaluation results can be obtained to determine whether the user-specified power consumption indicators are satisfied.
  • Signal Integrity Analysis. This involves analyzing the crosstalk phenomenon between adjacent wirings to ensure that signals can still be transmitted correctly without distortion in noisy environments and under crosstalk conditions.
  • Physical Verification (PV for short). It conducts checks on the physical structure of the chip. If any violations are found, it is necessary to modify the placement and routing results and recheck. The specific inspection tasks of physical verification include:
    • Design Rule Check (DRC). Ensure that the GDS layout complies with the foundry's design rules, such as minimum line width requirements and minimum spacing requirements. These rules are documented in text files with the .lydrc extension, provided as part of the PDK:
    vim yosys-sta/pdk/nangate45/drc/FreePDK45.lydrc
    
    As can be seen, it contains check rules for the poly-silicon layer and various metal layers. EDA tools can read these rules from the file and check them one by one. Some examples of the rules are as follows:
    # The minimum width of the metal1 layer is 65nm
    metal1.width(65.nm, euclidian).output("METAL1.1", "METAL1.1 : Minimum width of metal1 : 65nm")
    
    # The minimum spacing of the via6 layer is 160nm
    via6.space(160.nm, euclidian).output("VIA6.2", "VIA6.2 : Minimum spacing of via6 : 160nm")
    
    • Electrical Rule Check (ERC). Check for electrical issues in the circuit, such as dangling wires and short circuits.
    • Layout Versus Schematic (LVS). Verify the consistency between the GDS layout (physical circuit) and the netlist (logical circuit).

From Tape-out Ready Layout to Operational Chip

The journey from a tape-out ready layout to an operational chip involves the following steps:

  1. After the backend design team submits the chip's GDS layout to the wafer fab, the wafer fab will fabricate masks according to the layout.
  2. The wafer fab uses the masks to mass-produce wafers. A single wafer contains multiple dies. The wafer fab then cuts the wafer to obtain a batch of individual dies.
  3. The wafer fab then sends these dies to the packaging house, which packages the dies according to the planned packaging scheme to produce finished chips.
  4. The packaging house delivers the finished chips to the development board team, who designs the development board and solders the finished chips onto it.
  5. The development board is then provided to the user, who deploys software on the chip and operates it.

Several Coding Styles and Standards

In previous phases of OSOC, we found that certain non-standard coding styles can introduce additional problems during the SoC integration phase. To avoid impacting the progress of future SoC integrations, we recommend that everyone adhere to the following coding standards.

1. If you have not deeply understood the event model of Verilog, do not use behavioral modeling.

In fact, the Digital Circuit Lab notes from Nanjing University also mention that “behavioral modeling is detrimental to beginners in establishing circuit thinking.” We quote the relevant description here:

it is strongly recommended that beginners do NOT design circuits using behavioral modeling.

Verilog was not originally intended for designing synthesizable circuits; in essence it is a circuit modeling language built on an event queue model. Behavioral modeling therefore easily leads beginners away from the original intent of describing circuits: the developer has to look at the circuit diagram, mentally visualize the circuit behavior, translate that behavior into the event queue model, and finally describe it with behavioral modeling, from which the synthesizer derives the corresponding circuit. This detour is not only unnecessary but also makes it very easy to introduce errors:

  • If the developer already has the circuit diagram in mind, describing it directly is the most convenient.
  • If the developer already has the circuit diagram in mind, but their understanding of behavioral modeling is flawed, they may adopt an incorrect description method, resulting in an unexpected circuit design.
  • If the developer does not have the circuit diagram, but expects the synthesizer to generate a circuit of a certain behavior through behavioral modeling, this has already deviated from the essence of “describing circuits.” Many students easily make this mistake, treating behavioral modeling as procedural C code and attempting to map any complex behavior to a circuit, ultimately leading the synthesizer to generate low-quality circuits with high delay, area, and power consumption, or even result in a circuit that behaves unexpectedly due to data races in the code.

Therefore, until everyone masters the “description of circuits” thinking without being misled by behavioral modeling, we strongly recommend that beginners stay away from behavioral modeling and directly describe circuits using data flow modeling and structured modeling. The following questions can help test whether you have grasped the essence of Verilog:

  • In hardware description languages, what is the precise meaning of “execution”?
  • Who executes Verilog statements? Is it the circuit, the synthesizer, or something else?
  • If the condition of an if statement is met, the statements following else are not executed; what does “not executed” mean here? What is its relation to describing circuits?
  • There are “concurrent executions”, “sequential executions”, and “executions triggered by any variable change”, as well as “executions under any circumstances”; how are they reflected in the designed circuit?

If you cannot answer these questions clearly, we strongly recommend that you refrain from using behavioral modeling. If you truly want to understand them, you need to read the Verilog Standard Manual.

the true description of a circuit = instantiation + wiring.

Forgetting behavioral modeling allows for a straightforward return to the simple essence of circuit description. Imagine you have a circuit diagram; how would you describe its contents to others? You would likely say something like, “There is an A component/module, and its x pin is connected to the y pin of another B component/module,” as this is the most natural way to describe a circuit. Designing circuits with HDL is about using HDL to describe the circuit diagram—what’s on the diagram is directly what you describe. Thus, using HDL to describe a circuit essentially involves two tasks:

  • Instantiation: Placing a component/module on the circuit board, which can be a gate circuit or a module composed of gate circuits.
  • Wiring: Correctly connecting the pins of components/modules with wires.

You can appreciate how data flow modeling and structured modeling embody these two tasks, while behavioral modeling complicates these straightforward tasks.

Thus, we do not recommend beginners write any "always" statements in Verilog code. To facilitate the use of flip-flops and multiplexers, we provide the following Verilog templates for you to utilize:

// Flip-Flop Template
module Reg #(WIDTH = 1, RESET_VAL = 0) (
  input clk,
  input rst,
  input [WIDTH-1:0] din,
  output reg [WIDTH-1:0] dout,
  input wen
);
  always @(posedge clk) begin
    if (rst) dout <= RESET_VAL;
    else if (wen) dout <= din;
  end
endmodule

// Example of using the Flip-Flop Template
module example(
  input clk,
  input rst,
  input [3:0] in,
  output [3:0] out
);
  // width of 1 bit, reset value of 1'b1, write enable always active
  Reg #(1, 1'b1) i0 (clk, rst, in[0], out[0], 1'b1);
  // width of 3 bits, reset value of 3'b0, write enable is out[0]
  Reg #(3, 3'b0) i1 (clk, rst, in[3:1], out[3:1], out[0]);
endmodule
// Internal Implementation of the Multiplexer Template
module MuxKeyInternal #(NR_KEY = 2, KEY_LEN = 1, DATA_LEN = 1, HAS_DEFAULT = 0) (
  output reg [DATA_LEN-1:0] out,
  input [KEY_LEN-1:0] key,
  input [DATA_LEN-1:0] default_out,
  input [NR_KEY*(KEY_LEN + DATA_LEN)-1:0] lut
);

  localparam PAIR_LEN = KEY_LEN + DATA_LEN;
  wire [PAIR_LEN-1:0] pair_list [NR_KEY-1:0];
  wire [KEY_LEN-1:0] key_list [NR_KEY-1:0];
  wire [DATA_LEN-1:0] data_list [NR_KEY-1:0];

  genvar n;
  generate
    for (n = 0; n < NR_KEY; n = n + 1) begin
      assign pair_list[n] = lut[PAIR_LEN*(n+1)-1 : PAIR_LEN*n];
      assign data_list[n] = pair_list[n][DATA_LEN-1:0];
      assign key_list[n]  = pair_list[n][PAIR_LEN-1:DATA_LEN];
    end
  endgenerate

  reg [DATA_LEN-1 : 0] lut_out;
  reg hit;
  integer i;
  always @(*) begin
    lut_out = 0;
    hit = 0;
    for (i = 0; i < NR_KEY; i = i + 1) begin
      lut_out = lut_out | ({DATA_LEN{key == key_list[i]}} & data_list[i]);
      hit = hit | (key == key_list[i]);
    end
    if (!HAS_DEFAULT) out = lut_out;
    else out = (hit ? lut_out : default_out);
  end
endmodule

// Multiplexer Template without Default Value
module MuxKey #(NR_KEY = 2, KEY_LEN = 1, DATA_LEN = 1) (
  output [DATA_LEN-1:0] out,
  input [KEY_LEN-1:0] key,
  input [NR_KEY*(KEY_LEN + DATA_LEN)-1:0] lut
);
  MuxKeyInternal #(NR_KEY, KEY_LEN, DATA_LEN, 0) i0 (out, key, {DATA_LEN{1'b0}}, lut);
endmodule

// Multiplexer Template with Default Value
module MuxKeyWithDefault #(NR_KEY = 2, KEY_LEN = 1, DATA_LEN = 1) (
  output [DATA_LEN-1:0] out,
  input [KEY_LEN-1:0] key,
  input [DATA_LEN-1:0] default_out,
  input [NR_KEY*(KEY_LEN + DATA_LEN)-1:0] lut
);
  MuxKeyInternal #(NR_KEY, KEY_LEN, DATA_LEN, 1) i0 (out, key, default_out, lut);
endmodule

Here, the MuxKey module implements a "key-value selection" function: given a key key and a list lut of (key, data) pairs, it sets the output out to the data whose key matches. If no entry in the list matches key, out will be 0. The MuxKeyWithDefault module additionally accepts a default value default_out, and when no key-value pair matches key, out becomes default_out.

When instantiating these two modules, please note the following:

  • Users need to provide the number of key-value pairs NR_KEY, the bit width of the key KEY_LEN, and the data width DATA_LEN, ensuring that the port signal widths match the parameters provided, or else incorrect results will be produced.
  • If there are multiple data entries with the same key value in the list, the value of out is undefined, and it is the user’s responsibility to ensure that the key values in the list are unique.

The implementation of the MuxKeyInternal module utilizes various advanced features such as generate and for loops, and behavioral modeling has been used for convenience. Here, we do not elaborate on this; through the abstraction of structured modeling, users can ignore these details.

The following code uses the multiplexer templates to implement both a 2-to-1 multiplexer and a 4-to-1 multiplexer:

module mux21(a,b,s,y);
  input   a,b,s;
  output  y;

  // Implement the following always code through MuxKey
  // always @(*) begin
  //  case (s)
  //    1'b0: y = a;
  //    1'b1: y = b;
  //  endcase
  // end
  MuxKey #(2, 1, 1) i0 (y, s, {
    1'b0, a,
    1'b1, b
  });
endmodule

module mux41(a,s,y);
  input  [3:0] a;
  input  [1:0] s;
  output y;

  // Implement the following always code through MuxKeyWithDefault
  // always @(*) begin
  //  case (s)
  //    2'b00: y = a[0];
  //    2'b01: y = a[1];
  //    2'b10: y = a[2];
  //    2'b11: y = a[3];
  //    default: y = 1'b0;
  //  endcase
  // end
  MuxKeyWithDefault #(4, 2, 1) i0 (y, s, 1'b0, {
    2'b00, a[0],
    2'b01, a[1],
    2'b10, a[2],
    2'b11, a[3]
  });
endmodule

if you use Chisel, it is also advised that you do not use "when" and "switch".

In Chisel, the semantics of when and switch are very similar to Verilog’s behavioral modeling, so it is also not recommended for beginners to use them. Instead, you can use library functions like Mux1H to implement multiplexer functionality. For specifics, you can refer to related materials on Chisel.

2. if you insist on using Verilog’s behavioral modeling, do not use negedge.

Mixing posedge and negedge can make timing convergence more difficult and increases the difficulty of back-end physical implementation. If you are not sure how to maintain good timing while mixing the two, we recommend using only posedge. Otherwise, if your processor severely degrades the overall timing of the SoC, the OSOC project team may have to remove it from the tape-out batch under the tight tape-out schedule.

If you use the Verilog templates we provided above or use Chisel, you do not need to worry about this issue.

try synthesizing with negedge.

Using the above modules as an example, attempt to evaluate the timing of the following module:

module test(input clk, input rst, input in, output out);
  wire t0, t1;
  Reg r1(clk, rst, in, t0, 1'b1);
  Reg r2(clk, rst, t0, t1, 1'b1);
  Reg r3(clk, rst, t1, out, 1'b1);
endmodule

Then modify r2 to trigger on the falling edge of the clock and reevaluate the timing. Compare the timing reports before and after the modification; what differences do you notice? If you find yourself unable to understand the details, do not worry; we will further introduce timing analysis details in Phase B, where we will revisit this issue.

3. if you insist on using Verilog’s behavioral modeling, do not use latches.

The changes in latches are not driven by the clock, making them difficult for timing analysis tools to analyze. If you are unsure how to avoid latches, we recommend you not use behavioral modeling.

If you use the Verilog templates we provided above or use Chisel, you do not need to worry about this issue.

4. you need to add a student ID prefix before the module name.

For example, module IFU needs to be modified to module ysyx_22040000_IFU. This is because when everyone integrates their processor into the SoC, modules with the same name will lead to duplicate definition errors reported by the tools.

If you use Chisel, you do not need to add a student ID prefix to your module names while writing code for now.

5. if you use Verilog, you need to add a student ID prefix before the macro definition identifiers.

For example, `define SIZE 5 needs to be modified to `define ysyx_22040000_SIZE 5. This is because when everyone integrates their processor into the SoC, macros with the same name will lead to duplicate definition errors reported by the tools.

If you use Chisel, you do not need to worry about this issue.

Complete Digital Circuit Experiments

You have previously completed several digital circuit designs through the online learning platform HDLBits. After setting up the simulation environment and introducing EDA tools, we can now support a relatively complete digital circuit design process:

New Requirements -> Architecture Design -> Logic Design -> Functional Verification -> Circuit Evaluation

Here, Architecture Design refers to “thinking about how to implement new requirements through circuit functionality,” Logic Design refers to “implementing the design plan using RTL code,” Functional Verification is currently achieved through Verilator simulation to check whether the functionality implemented by the RTL code meets expectations, and Circuit Evaluation uses open-source EDA tools to assess circuit performance, area, power consumption, and other metrics.

Next, you will attempt to complete some digital circuit experiments according to the above process, thereby gaining a deeper understanding of it.

if you want to use Chisel

Please run the following command:

cd ysyx-workbench
bash init.sh npc-chisel

This command will replace the files in the npc directory with a Chisel development environment; specific details can be found in the README.md within it.

For Verilog code generated by Chisel, warnings from Verilator’s static code analysis may be difficult to fix, but you can ignore them as long as you are sure these warnings do not affect code correctness. However, we still recommend that you always enable Verilator’s static code check feature, as you may discover some code logic-related issues while reviewing these warnings.

VSCode Auto-Jump Plugin

  • If you choose Chisel programming, we recommend the metals plugin.
  • If you choose Verilog programming, we recommend the digital ide plugin.

complete digital circuit experiments with NVBoard

We first recommend Nanjing University's Digital Circuit and Computer Composition Experiment.

Nanjing University has carried out a teaching reform that merges "Digital Circuits" and "Computer Organization Principles" into a single course, with experiment content spanning from digital circuit basics to simple processor design. You can treat NVBoard as an FPGA to complete the experiments that require FPGA support.

You need to complete the following mandatory content:

  • Experiment 2: Decoders and Encoders
  • Experiment 3: Adders and ALUs
  • Experiment 6: Shift Registers and Barrel Shifters
  • Experiment 7: State Machines and Keyboard Input
  • Experiment 8: VGA Interface Controller Implementation

If you plan to use Chisel to complete the above digital circuit experiments, you just need to connect the compiled Verilog code to Verilator and NVBoard.

Implementing a Simple Processor with RTL

The moment has finally come to design your first processor using RTL! You have already implemented sCPU in Logisim; to implement it in RTL, you will describe the circuit structure of each module with RTL code based on the Logisim schematic. With your experience from the digital circuit experiments, this should not be difficult.

Implement sCPU with RTL

Try to make sCPU the design target of NPC. Based on the sCPU you designed with Logisim, redesign it using RTL for calculating 1+2+...+10. To see the output, you can add the out rs instruction in sCPU to display the sum of the series on the seven-segment display of NVBoard.

Unlike sEMU, we do not currently require you to implement a more powerful runtime environment; we only need this processor to support the calculation of 1+2+...+10. We will cover more about the runtime environment in the D phase.

Compare sEMU and sCPU

For sISA, you have implemented both sEMU and sCPU. Try comparing them and see what differences there are.

Evaluate the Performance of sCPU

Try evaluating your designed sCPU using yosys-sta.

Congratulations, you have essentially completed a simple CPU’s full design process!

New Requirements -> Architecture Design -> Logic Design -> Functional Verification -> Circuit Evaluation
  • The requirement for the sCPU design is to implement an sISA processor with RTL.
  • The lecture content from phase F has already helped everyone to understand how to implement sCPU through digital circuit functionality, which essentially completes the architecture design, the output of which is the circuit schematic in Logisim.
  • The process of designing sCPU with RTL is the logic design process.
  • Verifying whether your RTL code can successfully run 1+2+...+10 using Verilator and NVBoard is the functional verification.
  • You have conducted a circuit evaluation of sCPU using yosys-sta.

use open-source EDA tools for physical design

Currently, the OSOC lecture only introduces the physical design process and does not yet include hands-on practical content related to physical design. If you are interested, you can read the iEDA introductory tutorial and try to use iEDA for the physical design of sCPU to generate the corresponding layout.

Of course, the functionality of sCPU is still far from what we want to achieve in a CPU, so we do not currently require everyone to analyze the evaluation results for quality. From the project process perspective, we need to first design a relatively complete CPU before considering how to optimize it, which also adheres to the principle of “complete first, perfect later.” Subsequent lecture content will follow this process, guiding everyone on how to design more powerful processors to run more complex programs.