A nice worked example of taking a simple pipeline and adding incremental improvements, with a comparison against a superscalar alternative. It pays attention to implementation and verification costs, and to effects on the critical path.
Article by Jagadeesh Mummana:
Optimising a Pipelined RISC-V Core: From Naive Pipeline to Near-Superscalar Performance
Quote:
Starting from a CPI of 1.3158 on a plain five-stage pipeline, the following techniques reduced CPI to 1.0675, a 23.26% improvement in CoreMark/MHz:
Moving branch resolution to the ID stage with forwarding support for the branch comparator: +8.557%
Adding a 2-bit saturating counter direction predictor: +6.547%
Adding a branch target buffer for target prediction at the IF stage: +3.458%
Generalising the forwarding network to cover MEM-to-EX paths for load results: +1.996%
Replacing the local predictor with a tournament predictor: +0.455%
Adding a JAL fast path and return address stack: +0.080%
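As a sanity check on the quoted numbers: the end-to-end gain follows directly from the two CPI figures, and the per-technique percentages compound multiplicatively rather than summing (each was measured against a moving baseline, so the compounded figure lands close to, but not exactly on, the overall one). A minimal sketch:

```python
# Recompute the overall CoreMark/MHz gain from the quoted CPI figures,
# and compound the per-technique improvements from the list above.
cpi_before, cpi_after = 1.3158, 1.0675

# At fixed frequency and instruction count, CoreMark/MHz is inversely
# proportional to CPI, so the overall improvement is the CPI ratio.
overall = cpi_before / cpi_after - 1.0   # ~0.2326, i.e. ~23.26%

# Per-technique gains from the quote, in percent.
steps = [8.557, 6.547, 3.458, 1.996, 0.455, 0.080]
compounded = 1.0
for pct in steps:
    compounded *= 1.0 + pct / 100.0      # compounds to roughly +22.7%

print(f"overall: {overall:.4f}, compounded: {compounded - 1.0:.4f}")
```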
Examples from the article:
Quote:
Code:
Critical path comparison (conceptual gate depth):
Single-issue EX stage forwarding mux:
2 sources (EX/MEM, MEM/WB) -> 2:1 mux -> ALU input
Gate depth: ~2-3 gates for mux
Dual-issue EX stage forwarding mux (slot 0):
Sources: EX/MEM slot0, EX/MEM slot1, MEM/WB slot0, MEM/WB slot1
-> 4:1 mux -> ALU input
Gate depth: ~4-5 gates for mux
Four-wide EX stage forwarding mux (one slot):
Sources: 4 EX/MEM slots + 4 MEM/WB slots = 8 sources
-> 8:1 mux -> ALU input
Gate depth: ~6-7 gates for mux
Each additional layer of mux depth can push the stage's critical path
past the target clock period, forcing either pipeline lengthening
or clock frequency reduction.
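The gate-depth scaling above follows from building an n:1 select mux as a tree of 2:1 muxes: doubling the number of forwarding sources adds one mux level. A small sketch of that relationship (the assumption here, matching the quote, is one forwarding source per issue slot from each of EX/MEM and MEM/WB):

```python
import math

def forwarding_mux_depth(issue_width: int) -> tuple[int, int]:
    """Return (num_sources, mux_levels) for one ALU operand.

    Assumes EX/MEM and MEM/WB each contribute one forwarding source
    per issue slot, and the n:1 mux is built as a tree of 2:1 muxes,
    giving ceil(log2(n)) levels on the operand path.
    """
    sources = 2 * issue_width                 # EX/MEM + MEM/WB, per slot
    levels = math.ceil(math.log2(sources))    # depth of the 2:1 mux tree
    return sources, levels

for width in (1, 2, 4):
    sources, levels = forwarding_mux_depth(width)
    print(f"{width}-wide: {sources} sources -> {levels} mux levels")
```

Each extra level is a couple of gate delays, which is why the quote's gate-depth estimates grow from ~2-3 to ~6-7 as issue width goes from one to four.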
Quote:
Code:
Forwarding matrix complexity scaling:
Single-issue (N=1 slot):
Check: prev_rd vs curr_rs1, curr_rs2
Paths: 2 (one per operand)
Dual-issue (N=2 slots):
Check: slot0_rd vs slot1_rs1, slot1_rs2 (intra-cycle)
slot0_rd, slot1_rd vs next_slot0_rs1/rs2, next_slot1_rs1/rs2
Paths: ~8 forwarding cases to reason about
Four-wide (N=4 slots):
Intra-cycle dependencies alone: 4*3 = 12 pairs to check
Cross-cycle dependencies: 4 producers * 4 consumers * 2 operands = 32 paths
Total: ~44 forwarding cases before simplification
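The four-wide tally can be reproduced by enumerating producer/consumer pairs using the quote's own counting convention: ordered slot pairs for intra-cycle dependencies, and producers x consumers x operands for cross-cycle paths. (Applied to dual-issue, this convention gives 10 rather than the quote's "~8"; the quote's dual-issue figure appears to count slightly differently.)

```python
def forwarding_cases(n_slots: int) -> tuple[int, int, int]:
    """Count forwarding cases for an n-wide machine, per the quote's
    convention: intra-cycle dependencies are ordered slot pairs
    (n * (n - 1)); cross-cycle paths are n producers * n consumers
    * 2 operands."""
    intra = n_slots * (n_slots - 1)
    cross = n_slots * n_slots * 2
    return intra, cross, intra + cross

# Four-wide: 12 intra-cycle pairs + 32 cross-cycle paths = 44 cases
print(forwarding_cases(4))
```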
Quote:
The superscalar has more headroom for CPI reduction through branch prediction improvements alone. But the complexity of implementing and validating the combined design is substantially higher than what was done here for the single-issue case. The hazard logic in particular, when done correctly for dual-issue, requires careful enumeration of all inter-slot and cross-cycle dependency cases. The implementation effort is not merely twice as hard; it is harder than that, because the number of interaction cases between slots grows combinatorially.
As an addendum, the article also works through incremental improvements to a superscalar implementation:
Quote:
Total: CPI reduced from 1.0435 to 0.7271, CoreMark/MHz up from 3.243 to 4.655, a ~1.435x speedup over the unoptimised superscalar baseline.
The most striking difference from the pipeline journey is that load-use forwarding (luopt + fulu) was by far the dominant win at +17%, compared to only ~2% in the single-issue case. In a dual-issue design, a load-RAW hazard that squashes slot 1 wastes two issue opportunities per stall cycle rather than one. Fixing the forwarding paths had proportionally larger impact.
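The addendum's headline figures are internally consistent: at fixed frequency, the speedup expressed as a CPI ratio should equal the CoreMark/MHz ratio, and both come out to about 1.435. A quick check:

```python
# Consistency check on the addendum's figures: speedup measured as a
# CPI ratio should match the CoreMark/MHz ratio at fixed frequency.
cpi_ratio = 1.0435 / 0.7271          # ~1.435
coremark_ratio = 4.655 / 3.243       # ~1.435
print(f"CPI ratio: {cpi_ratio:.3f}, CoreMark/MHz ratio: {coremark_ratio:.3f}")
```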