A nice worked example of taking a simple pipeline and adding incremental improvements, with a comparison against a superscalar alternative. It pays attention to implementation and verification costs, and to effects on the critical path.
Article by Jagadeesh Mummana:
Optimising a Pipelined RISC-V Core: From Naive Pipeline to Near-Superscalar Performance
Quote:
Starting from a CPI of 1.3158 on a plain five-stage pipeline, the following techniques reduced CPI to 1.0675, a 23.26% improvement in CoreMark/MHz:
Moving branch resolution to the ID stage with forwarding support for the branch comparator: +8.557%
Adding a 2-bit saturating counter direction predictor: +6.547%
Adding a branch target buffer for target prediction at the IF stage: +3.458%
Generalising the forwarding network to cover MEM-to-EX paths for load results: +1.996%
Replacing the local predictor with a tournament predictor: +0.455%
Adding a JAL fast path and return address stack: +0.080%
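As a sanity check on the quoted numbers: the end-to-end gain follows directly from the two CPI figures, and the per-technique percentages compound multiplicatively rather than summing (each was measured against a moving baseline, so the compounded figure lands close to, but not exactly on, the overall one). A minimal sketch:

```python
# Recompute the overall CoreMark/MHz gain from the quoted CPI figures,
# and compound the per-technique improvements from the list above.
cpi_before, cpi_after = 1.3158, 1.0675

# At fixed frequency and instruction count, CoreMark/MHz is inversely
# proportional to CPI, so the overall improvement is the CPI ratio.
overall = cpi_before / cpi_after - 1.0   # ~0.2326, i.e. ~23.26%

# Per-technique gains from the quote, in percent.
steps = [8.557, 6.547, 3.458, 1.996, 0.455, 0.080]
compounded = 1.0
for pct in steps:
    compounded *= 1.0 + pct / 100.0      # compounds to roughly +22.7%

print(f"overall: {overall:.4f}, compounded: {compounded - 1.0:.4f}")
```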
Examples from the article:
Quote:
Code:
Critical path comparison (conceptual gate depth):
Single-issue EX stage forwarding mux:
2 sources (EX/MEM, MEM/WB) -> 2:1 mux -> ALU input
Gate depth: ~2-3 gates for mux
Dual-issue EX stage forwarding mux (slot 0):
Sources: EX/MEM slot0, EX/MEM slot1, MEM/WB slot0, MEM/WB slot1
-> 4:1 mux -> ALU input
Gate depth: ~4-5 gates for mux
Four-wide EX stage forwarding mux (one slot):
Sources: 4 EX/MEM slots + 4 MEM/WB slots = 8 sources
-> 8:1 mux -> ALU input
Gate depth: ~6-7 gates for mux
Each additional layer of mux depth can push the stage's critical path
past the target clock period, forcing either pipeline lengthening
or clock frequency reduction.
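The gate-depth scaling above follows from building an n:1 select mux as a tree of 2:1 muxes: doubling the number of forwarding sources adds one mux level. A small sketch of that relationship (the assumption here, matching the quote, is one forwarding source per issue slot from each of EX/MEM and MEM/WB):

```python
import math

def forwarding_mux_depth(issue_width: int) -> tuple[int, int]:
    """Return (num_sources, mux_levels) for one ALU operand.

    Assumes EX/MEM and MEM/WB each contribute one forwarding source
    per issue slot, and the n:1 mux is built as a tree of 2:1 muxes,
    giving ceil(log2(n)) levels on the operand path.
    """
    sources = 2 * issue_width                 # EX/MEM + MEM/WB, per slot
    levels = math.ceil(math.log2(sources))    # depth of the 2:1 mux tree
    return sources, levels

for width in (1, 2, 4):
    sources, levels = forwarding_mux_depth(width)
    print(f"{width}-wide: {sources} sources -> {levels} mux levels")
```

Each extra level is a couple of gate delays, which is why the quote's gate-depth estimates grow from ~2-3 to ~6-7 as issue width goes from one to four.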
Quote:
Code:
Forwarding matrix complexity scaling:
Single-issue (N=1 slot):
Check: prev_rd vs curr_rs1, curr_rs2
Paths: 2 (one per operand)
Dual-issue (N=2 slots):
Check: slot0_rd vs slot1_rs1, slot1_rs2 (intra-cycle)
slot0_rd, slot1_rd vs next_slot0_rs1/rs2, next_slot1_rs1/rs2
Paths: ~8 forwarding cases to reason about
Four-wide (N=4 slots):
Intra-cycle dependencies alone: 4*3 = 12 pairs to check
Cross-cycle dependencies: 4 producers * 4 consumers * 2 operands = 32 paths
Total: ~44 forwarding cases before simplification
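The four-wide tally can be reproduced by enumerating producer/consumer pairs using the quote's own counting convention: ordered slot pairs for intra-cycle dependencies, and producers x consumers x operands for cross-cycle paths. (Applied to dual-issue, this convention gives 10 rather than the quote's "~8"; the quote's dual-issue figure appears to count slightly differently.)

```python
def forwarding_cases(n_slots: int) -> tuple[int, int, int]:
    """Count forwarding cases for an n-wide machine, per the quote's
    convention: intra-cycle dependencies are ordered slot pairs
    (n * (n - 1)); cross-cycle paths are n producers * n consumers
    * 2 operands."""
    intra = n_slots * (n_slots - 1)
    cross = n_slots * n_slots * 2
    return intra, cross, intra + cross

# Four-wide: 12 intra-cycle pairs + 32 cross-cycle paths = 44 cases
print(forwarding_cases(4))
```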
Quote:
The superscalar has more headroom for CPI reduction through branch prediction improvements alone. But the complexity of implementing and validating the combined design is substantially higher than what was done here for the single-issue case. The hazard logic in particular, when done correctly for dual-issue, requires careful enumeration of all inter-slot and cross-cycle dependency cases. The implementation effort is not merely twice as hard; it is harder than that, because the number of interaction cases between slots grows combinatorially.
As an addendum, the article also works through incremental improvements to a superscalar implementation:
Quote:
Total: CPI reduced from 1.0435 to 0.7271, CoreMark/MHz up from 3.243 to 4.655, a ~1.435x speedup over the unoptimised superscalar baseline.
The most striking difference from the pipeline journey is that load-use forwarding (luopt + fulu) was by far the dominant win at +17%, compared to only ~2% in the single-issue case. In a dual-issue design, a load-RAW hazard that squashes slot 1 wastes two issue opportunities per stall cycle rather than one. Fixing the forwarding paths had proportionally larger impact.
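The addendum's headline figures are internally consistent: at fixed frequency, the speedup expressed as a CPI ratio should equal the CoreMark/MHz ratio, and both come out to about 1.435. A quick check:

```python
# Consistency check on the addendum's figures: speedup measured as a
# CPI ratio should match the CoreMark/MHz ratio at fixed frequency.
cpi_ratio = 1.0435 / 0.7271          # ~1.435
coremark_ratio = 4.655 / 3.243       # ~1.435
print(f"CPI ratio: {cpi_ratio:.3f}, CoreMark/MHz ratio: {coremark_ratio:.3f}")
```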