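Adding an icache improved the processor's ability to supply instructions
The processor can also be optimized in terms of data supply and computational efficiency
A basic skill in architecture design - learning to estimate the expected benefit of a technique
Steps: assembly -> stickers -> bagging -> boxing -> external inspection
Label these steps 1~5, and label the different products A, B, C, ...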
----> time
| product
| +---+---+---+---+---+
V |A.1|A.2|A.3|A.4|A.5|
+---+---+---+---+---+
+---+---+---+---+---+
|B.1|B.2|B.3|B.4|B.5|
+---+---+---+---+---+
+---+---+---+---+---+
|C.1|C.2|C.3|C.4|C.5|
+---+---+---+---+---+
================================================================
----> time
| employee
| +---+ +---+ +---+
V |A.1| |B.1| |C.1|
+---+ +---+ +---+
+---+ +---+ +---+
|A.2| |B.2| |C.2|
+---+ +---+ +---+
+---+ +---+ +---+
|A.3| |B.3| |C.3|
+---+ +---+ +---+
+---+ +---+ +---+
|A.4| |B.4| |C.4|
+---+ +---+ +---+
+---+ +---+ +---+
|A.5| |B.5| |C.5|
+---+ +---+ +---+
----> time * the production time of each product is not reduced
| product
| +---+---+---+---+---+
V |A.1|A.2|A.3|A.4|A.5|
+---+---+---+---+---+
+---+---+---+---+---+
|B.1|B.2|B.3|B.4|B.5|
+---+---+---+---+---+
+---+---+---+---+---+
|C.1|C.2|C.3|C.4|C.5|
+---+---+---+---+---+
+---+---+---+---+---+
|D.1|D.2|D.3|D.4|D.5|
+---+---+---+---+---+
+---+---+---+---+---+
|E.1|E.2|E.3|E.4|E.5|
+---+---+---+---+---+
================================================================
----> time * but every employee stays busy the whole time
| employee
| +---+---+---+---+---+
V |A.1|B.1|C.1|D.1|E.1|
+---+---+---+---+---+
+---+---+---+---+---+
|A.2|B.2|C.2|D.2|E.2|
+---+---+---+---+---+
+---+---+---+---+---+
|A.3|B.3|C.3|D.3|E.3|
+---+---+---+---+---+
+---+---+---+---+---+
|A.4|B.4|C.4|D.4|E.4|
+---+---+---+---+---+
+---+---+---+---+---+
|A.5|B.5|C.5|D.5|E.5| * one product is completed at every time step, which raises the throughput of the production line
+---+---+---+---+---+
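Assume each of the 5 stages has a latency of 1ns (ignoring real memory access latency for now)
Design | Stage registers | Critical path | Frequency | Instruction latency | IPC |
---|---|---|---|---|---|
Single-cycle | No | 5ns | 200MHz | 5ns | 1 |
Multi-cycle | Yes | 1ns | 1000MHz | 5ns | 0.2 |
Pipelined | Yes | 1ns | 1000MHz | 5ns | 1 |
Although the instruction latency is still 5ns, the pipeline achieves both a high frequency and a high IPC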
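The comparison in the table can be reproduced with a back-of-the-envelope calculation: throughput is roughly frequency times IPC. The sketch below is only an illustration in plain Scala; the names `Design`, `frequencyMHz`, and `mips` are made up here and do not come from the lecture.

```scala
// Rough throughput model: instructions per second = frequency * IPC.
// All names are illustrative assumptions, not part of the course code.
final case class Design(name: String, criticalPathNs: Double, ipc: Double) {
  // A critical path of t ns allows a clock of 1000 / t MHz.
  def frequencyMHz: Double = 1000.0 / criticalPathNs
  // Millions of instructions completed per second.
  def mips: Double = frequencyMHz * ipc
}

object ThroughputEstimate {
  // The three rows of the table, assuming 5 stages of 1ns each.
  val designs: Seq[Design] = Seq(
    Design("single-cycle", 5.0, 1.0),
    Design("multi-cycle", 1.0, 0.2),
    Design("pipelined", 1.0, 1.0)
  )

  def main(args: Array[String]): Unit =
    designs.foreach { d =>
      println(f"${d.name}%-12s ${d.frequencyMHz}%.0f MHz, ${d.mips}%.0f MIPS")
    }
}
```

Under these assumptions the single-cycle and multi-cycle designs end up with the same throughput, and only the pipeline gets both factors at once.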
stage reg -> +----+ +----+
+-----+ -> |....| -> +-----+ -> |....| -> +-----+
| | +----+ | | +----+ | |
| IDU | valid ---> | EXU | valid ---> | LSU |
| | | | | |
+-----+ <--- ready +-----+ <--- ready +-----+
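For each stage's input in and output out, the following signals must be handled correctly (bits denotes the payload transferred between stages):
out.bits: generated by the current stage
out.valid: generated by the current stage, and usually also depends on in.valid
in.ready: generated by the current stage; deasserted while busy, asserted once the current instruction has been processed
out.ready: the same as the next stage's in.ready
in.bits: updated to the previous stage's out.bits when the current stage's in.ready and the previous stage's out.valid are both asserted
in.valid: left as an exercise
def pipelineConnect[T <: Data, T2 <: Data](prevOut: DecoupledIO[T],
    thisIn: DecoupledIO[T], thisOut: DecoupledIO[T2]) = {
  // Back-pressure: the previous stage may advance only when this stage can accept
  prevOut.ready := thisIn.ready
  // Stage register: capture the payload when the handshake fires
  thisIn.bits := RegEnable(prevOut.bits, prevOut.valid && thisIn.ready)
  thisIn.valid := ???  // left as an exercise
}
pipelineConnect(ifu.io.out, idu.io.in, idu.io.out)
pipelineConnect(idu.io.out, exu.io.in, exu.io.out)
pipelineConnect(exu.io.out, lsu.io.in, lsu.io.out)
// ...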
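RegEnable = the "pipeline stage register" of traditional textbooks
From the bus point of view: the buffer in which the downstream module receives messages
A hazard is a situation in which the pipeline cannot execute the current instruction in the current cycle
There are three main kinds of hazards: structural hazards, data hazards, and control hazards
A pipeline design must detect hazards and handle them correctly
Add wait conditions to in.ready and out.valid
Structural hazard: different stages of the pipeline need to access the same component at the same time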
T1 T2 T3 T4 T5 T6 T7 T8
+----+----+----+----+----+
I1: lw | IF | ID | EX | LS | WB |
+----+----+----+----+----+
+----+----+----+----+----+
I2: add | IF | ID | EX | LS | WB |
+----+----+----+----+----+
+----+----+----+----+----+
I3: sub | IF | ID | EX | LS | WB |
+----+----+----+----+----+
+----+----+----+----+----+
I4: xor | IF | ID | EX | LS | WB |
+----+----+----+----+----+
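Some structural hazards can be avoided entirely by design, so that they never occur while the CPU is running
... one of the READ/WRITE commands
Other structural hazards cannot be avoided entirely
How to handle them: wait
Good news: the bus inherently supports waiting
Data hazard: instructions in different stages depend on the data of the same register, and at least one of them writes that register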
T1 T2 T3 T4 T5 T6 T7 T8 T9
+----+----+----+----+----+
I1: add a0,t0,s0 | IF | ID | EX | LS | WB |
+----+----+----+----+----+
+----+----+----+----+----+
I2: sub a1,a0,t0 | IF | ID | EX | LS | WB |
+----+----+----+----+----+
+----+----+----+----+----+
I3: and a2,a0,s0 | IF | ID | EX | LS | WB |
+----+----+----+----+----+
+----+----+----+----+----+
I4: xor a3,a0,t1 | IF | ID | EX | LS | WB |
+----+----+----+----+----+
+----+----+----+----+----+
I5: sll a4,a0,1 | IF | ID | EX | LS | WB |
+----+----+----+----+----+
Read After Write (RAW) hazard: an older instruction writes, a younger instruction reads
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12
+----+----+----+----+----+
I1: add a0,t0,s0 | IF | ID | EX | LS | WB |
+----+----+----+----+----+
+----+----+----+----+----+
nop | IF | ID | EX | LS | WB |
+----+----+----+----+----+
+----+----+----+----+----+
nop | IF | ID | EX | LS | WB |
+----+----+----+----+----+
+----+----+----+----+----+
nop | IF | ID | EX | LS | WB |
+----+----+----+----+----+
+----+----+----+----+----+
I2: sub a1,a0,t0 | IF | ID | EX | LS | WB |
+----+----+----+----+----+
+----+----+----+----+----+
I3: and a2,a0,s0 | IF | ID | EX | LS | WB |
+----+----+----+----+----+
+----+----+----+----+----+
I4: xor a3,a0,t1 | IF | ID | EX | LS | WB |
+----+----+----+----+----+
+----+----+----+----+----+
I5: sll a4,a0,1 | IF | ID | EX | LS | WB |
+----+----+----+----+----+
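Idea: rather than waiting, execute some useful instructions instead
The compiler tries to find instructions with no data dependences and reorders them without changing program behavior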
I1: add a0,t0,s0 I1: add a0,t0,s0
I2: sub a1,a0,t0 I6: add t5,t4,t3 *
I3: and a2,a0,s0 I7: add s5,s4,s3 *
I4: xor a3,a0,t1 ---> I8: sub s6,t4,t2 *
I5: sll a4,a0,1 I2: sub a1,a0,t0
I6: add t5,t4,t3 I3: and a2,a0,s0
I7: add s5,s4,s3 I4: xor a3,a0,t1
I8: sub s6,t4,t2 I5: sll a4,a0,1
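The compiler can only do its best; if it cannot find any such instructions, it has to insert nop
Aside: instruction scheduling implemented by a hardware module = an out-of-order processor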
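The nop fallback can be sketched as a small pass over an instruction list. This is a plain-Scala illustration, not a real compiler pass: `Inst`, `MinDistance`, and `insertNops` are hypothetical names, and the distance of 4 is an assumption read off the earlier diagram (a consumer's ID must come after the producer's WB, which takes three nops when the two are adjacent). The zero register is not special-cased here.

```scala
object NopInsertion {
  // Illustrative instruction record: destination register (if any) and source registers.
  final case class Inst(text: String, rd: Option[String], srcs: Set[String])

  val Nop: Inst = Inst("nop", None, Set.empty)

  // Assumption: without forwarding, a consumer must be at least 4 slots after its producer.
  val MinDistance = 4

  // Pad the program with nops so that every RAW dependence is far enough apart.
  def insertNops(prog: Seq[Inst]): Seq[Inst] =
    prog.foldLeft(Vector.empty[Inst]) { (out, inst) =>
      // Only the last MinDistance - 1 emitted instructions can be too close.
      val window = out.takeRight(MinDistance - 1)
      val needed = window.zipWithIndex.collect {
        // window.length - i is the distance from that producer to the new instruction.
        case (p, i) if p.rd.exists(inst.srcs.contains) =>
          MinDistance - (window.length - i)
      }.foldLeft(0)(_ max _)
      (out ++ Vector.fill(needed)(Nop)) :+ inst
    }
}
```

If the compiler manages to place three independent instructions between the producer and the consumer, `insertNops` adds nothing, which matches the rescheduled sequence shown above.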
Consider the load-use hazard (a special RAW hazard in which the instruction depended upon is a load)
T1 T2 T3 .... T? T? T? T? T? T? T?
+----+----+----+--------------+----+
I1: lw a0,0(t0) | IF | ID | EX | LS | WB |
+----+----+----+--------------+----+
+----+----+----+----+----+
nop X ? | IF | ID | EX | LS | WB |
+----+----+----+----+----+
+----+----+----+----+----+
I2: sub a1,a0,t0 | IF | ID | EX | LS | WB |
+----+----+----+----+----+
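In a real SoC, software can hardly predict the latency a memory access instruction will have when it executes 😂
Observation: register writes happen in the WBU, so the number of the register to be written also propagates down the pipeline to the WBU
Detection: if a register to be read by the instruction in the IDU is the same as a register to be written by an instruction in a later stage, a RAW hazard occurs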
def conflict(rs: UInt, rd: UInt) = (rs === rd)
def conflictWithStage[T <: Stage](rs1: UInt, rs2: UInt, stage: T) = {
conflict(rs1, stage.rd) || conflict(rs2, stage.rd)
}
val isRAW = conflictWithStage(IDU.rs1, IDU.rs2, EXU) ||
conflictWithStage(IDU.rs1, IDU.rs2, LSU) ||
conflictWithStage(IDU.rs1, IDU.rs2, WBU)
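Details that still need handling: some instructions do not write a register; some do not read rs2; some stages hold no valid instruction; the zero register; ...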
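One way to fold these details into the check is shown below as a plain-Scala software model (not Chisel). The field names `valid`, `writesRd`, `usesRs1`, and `usesRs2` are assumptions made for this sketch; a real design would derive them from its own decode signals.

```scala
object RawDetect {
  // Hypothetical summary of the instruction held by a later stage (EXU, LSU, WBU).
  final case class StageInfo(valid: Boolean, writesRd: Boolean, rd: Int)

  // Hypothetical summary of the instruction in the IDU.
  final case class Reader(rs1: Int, usesRs1: Boolean, rs2: Int, usesRs2: Boolean)

  // A conflict needs: the source is really read, the stage holds a valid
  // instruction that really writes a register, and that register is not x0.
  def conflict(used: Boolean, rs: Int, s: StageInfo): Boolean =
    used && s.valid && s.writesRd && s.rd != 0 && rs == s.rd

  // RAW hazard if either source conflicts with any later stage.
  def isRAW(idu: Reader, later: Seq[StageInfo]): Boolean =
    later.exists { s =>
      conflict(idu.usesRs1, idu.rs1, s) || conflict(idu.usesRs2, idu.rs2, s)
    }
}
```

Each extra condition only removes false positives; dropping any of them would still be functionally safe but would stall the pipeline more often than necessary.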
A simple way to handle a detected RAW hazard: wait
Just set in.ready and out.valid to 0
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12
+----+----+----+----+----+
I1: add a0,t0,s0 | IF | ID | EX | LS | WB |
+----+----+----+----+----+
+----+-------------------+----+----+----+
I2: sub a1,a0,t0 | IF | ID | EX | LS | WB |
+----+-------------------+----+----+----+
+-------------------+----+----+----+----+
I3: and a2,a0,s0 | IF | ID | EX | LS | WB |
+-------------------+----+----+----+----+
+----+----+----+----+----+
I4: xor a3,a0,t1 | IF | ID | EX | LS | WB |
+----+----+----+----+----+
+----+----+----+----+----+
I5: sll a4,a0,1 | IF | ID | EX | LS | WB |
+----+----+----+----+----+
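The hardware stalling scheme is more widely applicable than the software scheme
Control hazard: jump instructions change the order of instruction execution, so the IFU may fetch instructions that should not be executed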
T1 T2 T3 T4 T5 T6 T7 T8
+----+----+----+----+----+
I1: 100 add | IF | ID | EX | LS | WB |
+----+----+----+----+----+
+----+----+----+----+----+
I2: 104 lw | IF | ID | EX | LS | WB |
+----+----+----+----+----+
+----+----+----+----+----+
I3: 108 beq 200 | IF | ID | EX | LS | WB |
+----+----+----+----+----+
+----+----+----+----+----+
I4: ??? ??? | IF | ID | EX | LS | WB |
+----+----+----+----+----+
jal and jalr cause a similar problem
T1 T2 T3 T4 T5 T6 T7 T8
+----+----+----+----+----+
I1: 100 add | IF | ID | EX | LS | WB |
+----+----+----+----+----+
+----+----+----+----+----+
I2: 104 lw | IF | ID | EX | LS | WB |
+----+----+----+----+----+
+----+----+----+----+----+
I3: 108 ecall | IF | ID | EX | LS | WB |
+----+----+----+----+----+
+----+----+----+----+----+
I4: ??? ??? | IF | ID | EX | LS | WB |
+----+----+----+----+----+
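I4 should be fetched from the memory location that mtvec points to, but this usually cannot be known at T4
The earliest module that can catch each RISC-V exception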
0 - Instruction address misaligned - IFU
1 - Instruction access fault - IFU
2 - Illegal Instruction - IDU
3 - Breakpoint - IDU
4 - Load address misaligned - LSU
5 - Load access fault - LSU
6 - Store/AMO address misaligned - LSU
7 - Store/AMO access fault - LSU
8 - Environment call from U-mode - IDU
9 - Environment call from S-mode - IDU
11 - Environment call from M-mode - IDU
12 - Instruction page fault - IFU
13 - Load page fault - LSU
15 - Store/AMO page fault - LSU
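Some instructions cannot tell whether they raise an exception until they have almost finished executing (e.g. load instructions)
Speculative execution: while waiting, tentatively guess one of the choices; if the guess is right, the correct choice has effectively been made early, which saves the cost of waiting
Always speculate that the next static instruction executes next
... simply fetch the instruction at PC + 4
... PC + 4 ...
... PC + 4 ...
The performance gain depends on the accuracy of the speculation
... the probability of PC + 4 is very low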
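How much accuracy matters can be estimated before building anything. The sketch below uses a common first-order model, which is an assumption of this example rather than a formula from the lecture: every misspeculated control-flow instruction costs a fixed number of flushed cycles, and everything else runs at one instruction per cycle. `ctrlFraction`, `accuracy`, and `flushPenalty` are illustrative parameter names.

```scala
object SpeculationCost {
  // Expected cycles per instruction under the first-order model:
  //   CPI = 1 + (fraction of control-flow instructions)
  //           * (1 - speculation accuracy)
  //           * (cycles lost per misspeculation)
  def cpi(ctrlFraction: Double, accuracy: Double, flushPenalty: Int): Double =
    1.0 + ctrlFraction * (1.0 - accuracy) * flushPenalty

  // Relative slowdown compared with a pipeline that never misspeculates.
  def slowdown(ctrlFraction: Double, accuracy: Double, flushPenalty: Int): Double =
    cpi(ctrlFraction, accuracy, flushPenalty) / cpi(ctrlFraction, 1.0, flushPenalty)
}
```

Plugging in numbers measured by profiling a real workload, rather than guessed ones, is what makes such an estimate useful.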
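The analysis above suggests some ideas for optimization
... set valid to 0
... set mepc to the PC of the instruction that raised the exception
mcause ...
With this many pipeline stages, different behaviors can arise if each stage holds a different type of instruction
jal, jalr
ecall, mret
fence.i
You would not want to design that many test cases 😂
Let a formal verification tool find counterexamples for us automatically
A simpler way to compare: check that the state transitions agree, rather than the states themselves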
class PipelineTest extends Module {
val io = IO(new Bundle {
val inst = Input(UInt(32.W))
val rdata = Input(UInt(XLEN.W))
})
val dut = Module(new PipelineNPC)
val ref = Module(new SingleCycleNPC)
dut.io.imem.inst := io.inst
dut.io.imem.valid := ...
dut.io.dmem.rdata := io.rdata
dut.io.dmem.valid := ...
// ...
ref.io.imem.inst := dut.io.wb.inst
// ...
when (dut.io.wb.valid) {
assert(dut.io.wb.rd === ref.io.wb.rd)
assert(dut.io.wb.res === ref.io.wb.res)
// ...
}
}
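Many more details need to be considered (such as using assume(!isIllegal)); see the lecture notes for details
With a simple pipelined processor implemented, we now discuss how to improve the efficiency of the pipeline
The main factors that keep pipeline throughput from improving:
Given the current design, which factor do you think accounts for the largest share?
You may not be able to figure it out right away, which is why profiling is so important
We need to consider how well the icache supplies instructions on a hit: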
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13
+--------------+----+----+----+----+
I1 | I$ | ID | EX | LS | WB |
+--------------+----+----+----+----+
+--------------+----+----+----+----+
I2 | I$ | ID | EX | LS | WB |
+--------------+----+----+----+----+
+--------------+----+----+----+----+
I3 | I$ | ID | EX | LS | WB |
+--------------+----+----+----+----+
We want to raise the icache throughput - on consecutive hits, an instruction should be read out every cycle
Solution - pipeline the icache
T1 T2 T3 T4 T5 T6 T7 T8 T9
+----+----+----+----+----+----+----+
I1 | I$1| I$2| I$3| ID | EX | LS | WB |
+----+----+----+----+----+----+----+
+----+----+----+----+----+----+----+
I2 | I$1| I$2| I$3| ID | EX | LS | WB |
+----+----+----+----+----+----+----+
+----+----+----+----+----+----+----+
I3 | I$1| I$2| I$3| ID | EX | LS | WB |
+----+----+----+----+----+----+----+
As with the processor pipeline, split the icache access into several stages and overlap them in time
pipelineConnect() can be reused to pipeline the icache
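The benefit can be estimated from the two diagrams above. The sketch below counts the cycles needed to deliver n consecutive hits, assuming each hit takes `latency` cycles and nothing else stalls; the function names are illustrative.

```scala
object IcacheThroughput {
  // Unpipelined: the icache is busy for the whole access, so hits are served back to back.
  def unpipelined(n: Int, latency: Int): Int = n * latency

  // Pipelined: the first hit takes `latency` cycles, then one more hit completes per cycle.
  def pipelined(n: Int, latency: Int): Int =
    if (n == 0) 0 else latency + n - 1
}
```

For the three instructions in the diagrams with a 3-cycle access, this gives 9 cycles without pipelining and 5 cycles with it, which matches I3 leaving the icache at T9 and T5 respectively.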
An idea: the new value of a register is not produced only in the WB stage; can we get it earlier?
T1 T2 T3 T4 T5 T6 T7 T8
+----+----+----+----+----+
I1: add a0,t0,s0 | IF | ID | EX | LS | WB |
+----+----+----+----+----+
| | |
V | |
+----+----+----+----+----+
I2: sub a1,a0,t0 | IF | ID | EX | LS | WB |
+----+----+----+----+----+
| |
V |
+----+----+----+----+----+
I3: and a2,a0,s0 | IF | ID | EX | LS | WB |
+----+----+----+----+----+
|
V
+----+----+----+----+----+
I4: xor a3,a0,t1 | IF | ID | EX | LS | WB |
+----+----+----+----+----+
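The new value of a0 produced by I1 can already be read from the EX stage's result at T3
There are three forwarding sources: the EX stage, the LS stage, and the WB stage
But forwarding cannot happen unconditionally; the forwarding source must satisfy the following conditions:
The first two conditions are the same as the RAW hazard detection conditions, so the RAW detection logic can be reused
With forwarding in place, the conditions for stalling the pipeline change:
If several instructions satisfy the forwarding conditions at the same time, the choice needs careful thought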
T1 T2 T3 T4 T5 T6 T7 T8
+----+----+----+----+----+
I1: add a0, a0, a1 | IF | ID | EX | LS | WB |
+----+----+----+----+----+
+----+----+----+----+----+
I2: add a0, a0, a1 | IF | ID | EX | LS | WB |
+----+----+----+----+----+
+----+----+----+----+----+
I3: add a0, a0, a1 | IF | ID | EX | LS | WB |
+----+----+----+----+----+
+----+----+----+----+----+
I4: add a0, a0, a1 | IF | ID | EX | LS | WB |
+----+----+----+----+----+
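At T4, for I3, both I2 in the EX stage and I1 in the LS stage satisfy the forwarding conditions
The value of a0 should be the result of the most recent write to a0, so I2 in the EX stage should be chosen to forward
Conclusion: when several instructions satisfy the forwarding conditions at the same time, the youngest one should be chosen to forward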
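The "youngest wins" rule amounts to a priority selection over the sources, ordered from EX to WB. The plain-Scala model below is a sketch under assumptions: `Source`, `dataReady`, and the three `Decision` cases are names invented here, and `dataReady` stands for whatever condition the design uses to say the result is already available (for example, a load in EX would not have it yet).

```scala
object ForwardSelect {
  // Hypothetical view of one forwarding source (the instruction in EX, LS, or WB).
  final case class Source(stage: String, valid: Boolean, writesRd: Boolean,
                          rd: Int, dataReady: Boolean, value: Long)

  sealed trait Decision
  case object ReadRegFile extends Decision                 // no in-flight writer
  case object Stall extends Decision                       // writer found, data not ready
  final case class Forward(from: String, value: Long) extends Decision

  // `youngestFirst` must be ordered EX, LS, WB so that find() returns the youngest match.
  def decide(rs: Int, youngestFirst: Seq[Source]): Decision =
    if (rs == 0) ReadRegFile // x0 always reads as zero, never forwarded
    else youngestFirst.find(s => s.valid && s.writesRd && s.rd == rs) match {
      case None                   => ReadRegFile
      case Some(s) if s.dataReady => Forward(s.stage, s.value)
      case Some(_)                => Stall // never fall back to an older, stale value
    }
}
```

Note the last case: if the youngest matching source is not ready, the model stalls instead of taking the value from an older source, because the older value is no longer the most recent write.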
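With speculative execution already in place, we further need to reduce the negative impact of flushing the pipeline:
Use "branch prediction" to improve the speculation accuracy for branch instructions
Predict based only on the instruction itself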
Software should also assume that backward branches will be predicted taken and
forward branches as not taken, at least the first time they are encountered.
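A key metric of a branch prediction algorithm: prediction accuracy
There is no need to run the full program every time; a simple functional simulator, branchsim, is enough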
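The core of such a simulator can be very small. The sketch below is not the course's branchsim; it is a minimal plain-Scala illustration that replays a recorded branch trace through a predictor and reports the accuracy. `BranchRecord`, `alwaysNotTaken`, and `btfn` are names chosen for this example; `btfn` implements the static "backward taken, forward not taken" rule from the quotation above.

```scala
object BranchSim {
  // One executed branch from a trace: its PC, its target, and the actual outcome.
  final case class BranchRecord(pc: Long, target: Long, taken: Boolean)

  // A static predictor decides from the instruction alone.
  type Predictor = BranchRecord => Boolean

  // Equivalent to always fetching from PC + 4.
  val alwaysNotTaken: Predictor = _ => false

  // Backward branches predicted taken, forward branches predicted not taken.
  val btfn: Predictor = b => b.target < b.pc

  // Fraction of branches in the trace whose outcome was predicted correctly.
  def accuracy(trace: Seq[BranchRecord], p: Predictor): Double =
    if (trace.isEmpty) 0.0
    else trace.count(b => p(b) == b.taken).toDouble / trace.size
}
```

For a loop whose backward branch is taken nine times and falls through once, `btfn` is right nine times out of ten while `alwaysNotTaken` is right only once, which is why the trace source and workload matter as much as the algorithm.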
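The branch predictor's result needs to be made available to the IFU
... fetch from PC + 4
But whether an instruction is a branch, and what its target is, is known only in the ID stage
Solution: maintain the mapping between PCs and branch targets in a table called the BTB (Branch Target Buffer)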
tag target
+-------+----------+ Branch Target Buffer
+----+ +-------+----------+
| PC |---> +-------+----------+
+-+--+ +-------+----------+
| +-------+----------+
| | | branch predicted
| v | target +-----------+ next PC +-----+
| +----+ +------->| branch |---------->| IFU |
+--------->| == |-------------->| predictor | +-----+
+----+ is branch +-----------+
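The lookup in the diagram can be modeled in software as follows. This is a plain-Scala sketch with several assumptions that the diagram does not specify: the table is direct-mapped, it is indexed by the low bits of the word address, the whole word address is stored as the tag, and the predictor's decision is passed in as a Boolean. `Btb`, `lookup`, `update`, and `nextPc` are illustrative names.

```scala
object BtbModel {
  final case class Entry(tag: Long, target: Long)

  // Direct-mapped BTB; `entries` must be a power of two so the index is a bit mask.
  final class Btb(entries: Int) {
    require(entries > 0 && (entries & (entries - 1)) == 0, "entries must be a power of two")
    private val table = Array.fill[Option[Entry]](entries)(None)

    private def index(pc: Long): Int = ((pc >>> 2) & (entries - 1)).toInt
    // Simplification: the full word address serves as the tag.
    private def tag(pc: Long): Long = pc >>> 2

    // A hit means "this PC is a known branch"; the result is its recorded target.
    def lookup(pc: Long): Option[Long] =
      table(index(pc)).filter(_.tag == tag(pc)).map(_.target)

    // Called once a branch and its target are known later in the pipeline.
    def update(pc: Long, target: Long): Unit = {
      table(index(pc)) = Some(Entry(tag(pc), target))
    }
  }

  // Next fetch address: the BTB target on a hit that is predicted taken, else PC + 4.
  def nextPc(pc: Long, btb: Btb, predictTaken: Boolean): Long =
    btb.lookup(pc) match {
      case Some(target) if predictTaken => target
      case _                            => pc + 4
    }
}
```

A miss simply falls back to PC + 4, so the BTB can start empty and fill up as branches are resolved.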
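... fetch from PC + 4
fence.i - makes subsequent instruction fetches see the results of earlier stores
Before fence.i finishes executing, the pipeline may already have fetched several younger instructions
... fetched before fence.i takes effect
T1 T2 T3 T4 T5 T6 T7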
+----+----+----+----+----+
I1: add | IF | ID | EX | LS | WB |
+----+----+----+----+----+
+----+----+----+----+----+
I2: fence.i | IF | ID | EX | LS | WB |
+----+----+----+----+----+
+----+----+
I3: ??? may be stale | IF | ID |
+----+----+
+----+
I4: ??? may be stale | IF |
+----+
+----+----+----+----+----+
I5: sub | IF | ID | EX | LS | WB |
+----+----+----+----+----+
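Solution: flush them; the flush logic used for misspeculation can be reused
As the technical pinnacle of computer organization textbooks, the pipeline can give you an illusion:
The architecture design skills you really need:
These have to be built by writing code on your own, which is exactly the training that "One Student One Chip" (一生一芯) offers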