


 [ 2 posts ] 
 "Optimising a Pipelined RISC-V Core" - technical blog post 
Author Message

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1883
A nice worked example of taking a simple pipeline and adding improvements one at a time, with a comparison against a superscalar alternative. It pays attention to implementation costs, verification costs, and effects on the critical path.

Article by Jagadeesh Mummana:
Optimising a Pipelined RISC-V Core: From Naive Pipeline to Near-Superscalar Performance

Quote:
Starting from a CPI of 1.3158 on a plain five-stage pipeline, the following techniques reduced CPI to 1.0675, a 23.26% improvement in CoreMark/MHz:

Moving branch resolution to the ID stage with forwarding support for the branch comparator: +8.557%
Adding a 2-bit saturating counter direction predictor: +6.547%
Adding a branch target buffer for target prediction at the IF stage: +3.458%
Generalising the forwarding network to cover MEM-to-EX paths for load results: +1.996%
Replacing the local predictor with a tournament predictor: +0.455%
Adding a JAL fast path and return address stack: +0.080%
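For anyone unfamiliar with the 2-bit saturating counter predictor named in the list above, here is a minimal Python sketch of the idea. It is not taken from the article: the table size, the PC indexing (which assumes 4-byte-aligned instructions) and all names are assumptions made for illustration.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters (hypothetical sizing and indexing).

    Counter values 0 and 1 predict not taken; 2 and 3 predict taken.
    """

    def __init__(self, entries=256):
        self.entries = entries
        self.table = [1] * entries  # start every entry at "weakly not taken"

    def _index(self, pc):
        # Assumption: 4-byte-aligned instructions, so the low 2 PC bits are dropped.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)  # saturate at 3
        else:
            self.table[i] = max(0, self.table[i] - 1)  # saturate at 0


def mispredictions(predictor, pc, outcomes):
    """Count mispredictions for one branch over a sequence of actual outcomes."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


if __name__ == "__main__":
    # A loop branch taken 9 times and then falling through, run twice.
    loop = ([True] * 9 + [False]) * 2
    print(mispredictions(TwoBitPredictor(), 0x100, loop))
```

The point of the second bit is hysteresis: a single loop exit moves the counter from "strongly taken" to "weakly taken", so the first iteration of the next run of the loop is still predicted taken.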


Examples from the article:
Quote:
Code:
Critical path comparison (conceptual gate depth):

Single-issue EX stage forwarding mux:
  2 sources (EX/MEM, MEM/WB) -> 2:1 mux -> ALU input
  Gate depth: ~2-3 gates for mux

Dual-issue EX stage forwarding mux (slot 0):
  Sources: EX/MEM slot0, EX/MEM slot1, MEM/WB slot0, MEM/WB slot1
  -> 4:1 mux -> ALU input
  Gate depth: ~4-5 gates for mux

Four-wide EX stage forwarding mux (one slot):
  Sources: 4 EX/MEM slots + 4 MEM/WB slots = 8 sources
  -> 8:1 mux -> ALU input
  Gate depth: ~6-7 gates for mux

Each additional layer of mux depth can push the stage's critical path
past the target clock period, forcing either pipeline lengthening
or clock frequency reduction.
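The gate-depth figures quoted above follow the pattern of a balanced tree of 2:1 muxes. A rough Python sketch of that relationship is below. The assumption of two to three gate delays per 2:1 mux level is mine, chosen because it lines up with the quoted ranges; a real synthesis tool or an FPGA LUT mapping would give different numbers.

```python
def forwarding_sources(issue_width, bypass_stages=2):
    """Forwarding sources per ALU operand: one per slot per bypass stage.

    bypass_stages=2 models the EX/MEM and MEM/WB sources from the quote.
    """
    return issue_width * bypass_stages


def mux_tree_levels(sources):
    """Levels of 2:1 muxes in a balanced tree selecting one of `sources` inputs."""
    return (sources - 1).bit_length()  # integer ceil(log2(sources))


def gate_depth_range(levels):
    """Assumed 2 to 3 gate delays for the first level, 2 more per extra level."""
    return (2 * levels, 2 * levels + 1)


if __name__ == "__main__":
    for width in (1, 2, 4):
        n = forwarding_sources(width)
        levels = mux_tree_levels(n)
        print(width, n, levels, gate_depth_range(levels))
```

Under these assumptions each doubling of issue width adds one mux level to every operand path, which is the "additional layer of mux depth" the quote refers to.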



Quote:
Code:
Forwarding matrix complexity scaling:

Single-issue (N=1 slot):
  Check: prev_rd vs curr_rs1, curr_rs2
  Paths: 2 (one per operand)

Dual-issue (N=2 slots):
  Check: slot0_rd vs slot1_rs1, slot1_rs2  (intra-cycle)
         slot0_rd, slot1_rd vs next_slot0_rs1/rs2, next_slot1_rs1/rs2
  Paths: ~8 forwarding cases to reason about

Four-wide (N=4 slots):
  Intra-cycle dependencies alone: 4*3 = 12 pairs to check
  Cross-cycle dependencies: 4 producers * 4 consumers * 2 operands = 32 paths
  Total: ~44 forwarding cases before simplification
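The four-wide counts in the quote above can be reproduced by enumerating the cases directly. The sketch below is my reading of the counting rule, not code from the article: an intra-cycle case is an older slot feeding either operand of a younger slot in the same cycle, and a cross-cycle case is any producer slot feeding either operand of any consumer slot in the next cycle.

```python
from itertools import product

OPERANDS = ("rs1", "rs2")


def forwarding_cases(width):
    """Return (intra_cycle, cross_cycle) case lists for an in-order machine."""
    # Intra-cycle: producer slot i is older than consumer slot j in the same cycle.
    intra = [(i, j, op)
             for i in range(width)
             for j in range(i + 1, width)
             for op in OPERANDS]
    # Cross-cycle: any producer slot to any consumer slot in the following cycle.
    cross = list(product(range(width), range(width), OPERANDS))
    return intra, cross


if __name__ == "__main__":
    for width in (1, 2, 4):
        intra, cross = forwarding_cases(width)
        print(width, len(intra), len(cross), len(intra) + len(cross))
```

With this rule the single-issue and four-wide totals match the quote (2 and 44). For dual-issue the cross-cycle term alone gives the quoted ~8; adding the two intra-cycle checks would make it 10, so the article may be counting that case slightly differently.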


Quote:
The superscalar has more room to fall in CPI through branch prediction improvement alone. But the complexity of implementing and validating the combined design is substantially higher than what was done here for the single-issue case. The hazard logic in particular, when done correctly for dual-issue, requires careful enumeration of all inter-slot and cross-cycle dependency cases. The implementation effort is not twice as hard – it is harder than that, because the number of interaction cases between slots grows combinatorially.


But also, as an addendum, a worked example of incremental improvements to a superscalar implementation:
Quote:
Total: CPI reduced from 1.0435 to 0.7271, CoreMark/MHz up from 3.243 to 4.655, a ~1.435x speedup over the unoptimised superscalar baseline.
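As a sanity check on the quoted figures: with the instruction count fixed, CoreMark/MHz scales as 1/CPI, so the speedup is simply the ratio of the two CPI values. A small Python check, with function names of my own choosing:

```python
def speedup_from_cpi(cpi_before, cpi_after):
    """Per-clock speedup when the instruction count is unchanged."""
    return cpi_before / cpi_after


if __name__ == "__main__":
    # Superscalar addendum: CPI 1.0435 -> 0.7271, CoreMark/MHz 3.243 -> 4.655
    print(round(speedup_from_cpi(1.0435, 0.7271), 3))
    print(round(4.655 / 3.243, 3))
    # Single-issue pipeline: CPI 1.3158 -> 1.0675
    print(round((speedup_from_cpi(1.3158, 1.0675) - 1) * 100, 2))
```

Both routes give about 1.435x for the superscalar, and the single-issue CPI change works out to the quoted 23.26%.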

The most striking difference from the pipeline journey is that load-use forwarding (luopt + fulu) was by far the dominant win at +17%, compared to only ~2% in the single-issue case. In a dual-issue design, a load-RAW hazard that squashes slot 1 wastes two issue opportunities per stall cycle rather than one. Fixing the forwarding paths had proportionally larger impact.
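To make the "two issue opportunities per stall cycle" point concrete, here is a generic load-use hazard check and a trivial cost model in Python. This is a textbook-style sketch, not the article's luopt or fulu logic; the Instr fields and function names are hypothetical, and it ignores details such as instructions that do not actually read rs2.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    is_load: bool = False
    rd: int = 0   # destination register number (x0 is never a real dependency)
    rs1: int = 0
    rs2: int = 0


def load_use_hazard(producer, consumer):
    """True if `consumer` reads the register an immediately preceding load writes."""
    return (producer.is_load
            and producer.rd != 0
            and producer.rd in (consumer.rs1, consumer.rs2))


def lost_issue_slots(stall_cycles, issue_width):
    """Issue opportunities wasted when the whole issue group stalls."""
    return stall_cycles * issue_width


if __name__ == "__main__":
    lw = Instr(is_load=True, rd=5, rs1=2)
    add = Instr(rd=6, rs1=5, rs2=7)
    if load_use_hazard(lw, add):
        print(lost_issue_slots(1, 1), lost_issue_slots(1, 2))
```

The same one-cycle bubble costs one slot on a single-issue core and two on a dual-issue core, which is consistent with load-use fixes mattering more in the superscalar results.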


Thu Apr 30, 2026 5:35 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2505
Location: Canada
Yes, a very good write-up. It is great to see all the comparisons, which give a heads-up on what to expect. It makes one think twice before proceeding.

It shows that simply going superscalar is probably not worth the effort unless you go big. A two-way in-order superscalar does not perform much better than a well-pipelined five-stage CPU, and the five-stager is likely to be smaller, have a higher fmax and be more reliable. I found the same thing empirically, which led me to try implementing a four-or-more-way out-of-order machine.

An out-of-order machine without going superscalar may also improve performance.

The examples are all machines with full forwarding; it would be nice to have some comparisons without forwarding. The routing and muxing for forwarding may reduce fmax significantly in an FPGA. Results are given as CPI and CoreMark/MHz. I think some designs without forwarding might show surprising results once CoreMark/MHz is multiplied by the achievable MHz, as the MHz may be significantly higher.

Something like the PicoRV32, which is just a sequential machine taking six or seven clocks per instruction if I recall correctly, runs at something like 200 MHz, whereas if it were pipelined it would likely run at only 75 MHz.
One of the machines I like is Nyuzi, which takes a barrel-processor-like approach, eliminating forwarding muxes to gain MHz.
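The fmax trade-off raised above comes down to instructions per second = fmax / CPI. A small Python sketch for plugging in numbers follows. The 200 MHz, 75 MHz and "six or seven clocks" figures are the recalled values from this post (6.5 is just their midpoint), and applying the article's final single-issue CPI of 1.0675 to the hypothetical pipelined version is an assumption, so treat the result as a rough illustration only.

```python
def minstr_per_second(fmax_mhz, cpi):
    """Millions of instructions retired per second at a given clock and CPI."""
    return fmax_mhz / cpi


def break_even_fmax(cpi_slow, cpi_fast, fmax_fast_mhz):
    """Clock the higher-CPI design needs to match the lower-CPI design's throughput."""
    return fmax_fast_mhz * cpi_slow / cpi_fast


if __name__ == "__main__":
    sequential = minstr_per_second(200, 6.5)   # recalled figures, unverified
    pipelined = minstr_per_second(75, 1.0675)  # assumed CPI from the article
    print(round(sequential, 1), round(pipelined, 1))
    print(round(break_even_fmax(6.5, 1.0675, 75)))
```

The more interesting comparison would be a pipelined core without forwarding, where CPI rises because of stalls but fmax may rise too; neither number is given in the article, so they would have to be measured.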

_________________
Robert Finch http://www.finitron.ca


Fri May 01, 2026 1:42 am