 RISC-V - Compressed Instructions and Macro-Operation Fusion 

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1783
An interesting article by Erik Engheim addresses the question of whether RISC-V overdoes the minimalism, and answers in the negative, because of two features which should work better for RISC-V: Compressed Instructions and Macro-Operation Fusion.

The article is here but I'm having trouble reading it. Here's an archive - I could read it if I force-stop the loading. Oh, but here's a better archive.

And here's some of the content:

Quote:
...unlike the ARM, MIPS and x86 designers, RISC-V designers knew about instruction compression and macro-ops fusion when they began designing their ISA.

Quote:
The Genius of RISC-V Microprocessors
How the instruction set for RISC-V processors has been designed cleverly for both simplicity and high performance.
However there are in particular two innovations in CPU design which in many ways renders this strategy of adding more complex instructions redundant:

    Compressed instructions — Instructions are compressed in memory and decompressed at first stage of CPU.
    Macro-operation Fusion — Two or more simple instructions read by the CPU are fused into one more complex instruction.
ARM actually employs both of these strategies already and x86 CPUs utilize the latter, so this isn’t a new trick RISC-V is pulling.

However here is the kicker: RISC-V gets far more mileage out of these strategies for two important reasons:

    Compressed instructions got added in from the start. For other architectures such as ARM it was an afterthought and had to get bolted on in a kludgy way.
    The RISC obsession of keeping number of unique instruction low pays off. There is simply more room to fit compressed instructions.


Quote:
instead of fitting one instruction inside 32-bits we can fit two instructions which are 16-bit wide each. Naturally not all RISC-V instructions can be expressed in 16-bit format. Thus a subset of the 32-bit instructions are picked based on their utility and frequency of use.
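As a rough illustration of how a fetch unit can tell the two sizes apart: in the RISC-V base encoding the two lowest bits of the first 16-bit parcel are 11 for a 32-bit instruction and anything else for a compressed one. The sketch below assumes only 16-bit and 32-bit encodings are present; the function names are made up for this example.

```python
def instruction_length(first_halfword: int) -> int:
    """Return the instruction length in bytes, judged from its first 16-bit parcel."""
    if first_halfword & 0b11 != 0b11:
        return 2  # compressed (RVC) encoding
    if (first_halfword >> 2) & 0b111 != 0b111:
        return 4  # standard 32-bit encoding
    # Longer encodings exist in the spec but are outside this sketch.
    raise ValueError("encodings longer than 32 bits are not handled here")


def split_stream(code: bytes) -> list[int]:
    """Walk a little-endian instruction stream and list each instruction's length."""
    lengths = []
    offset = 0
    while offset + 2 <= len(code):
        halfword = int.from_bytes(code[offset:offset + 2], "little")
        size = instruction_length(halfword)
        lengths.append(size)
        offset += size
    return lengths


# c.addi a0,1 (0x0505), addi a0,a0,1 (0x00150513), c.addi a0,-1 (0x157d)
stream = bytes.fromhex("0505") + (0x00150513).to_bytes(4, "little") + bytes.fromhex("7d15")
print(split_stream(stream))
```

Mixing the two sizes freely in one stream is what lets a compiler pick the short form wherever one exists.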

Quote:
However it is when we combine instruction compression with Macro-operation fusion where we see the real payoff. You see, if the CPU gets a 32-bit word containing two compressed 16-bit instructions, it can fuse these into a single more complex instruction.

That sounds like nonsense, aren’t we just back to the start then?

Nope, because we avoid filling up the ISA specification with lots of complex instructions, the ARM strategy. Instead we are basically expressing a whole host of complex instructions indirectly through various combinations of simple instructions.
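To make the idea concrete, here is a toy sketch of a fusion pass over already-decoded instructions. It recognises one idiom only, a shift, an add and a load that together behave like a scaled-indexed load, which is the kind of example discussed later in this thread. The `Insn` tuple, the `lw.indexed` internal op and the function names are all hypothetical; a real decoder works on bits, not strings.

```python
from typing import NamedTuple, Optional


class Insn(NamedTuple):
    op: str
    rd: str
    rs1: str
    rs2: Optional[str] = None
    imm: int = 0


def try_fuse_indexed_load(a: Insn, b: Insn, c: Insn) -> Optional[Insn]:
    """Fuse `slli rd,idx,k ; add rd,rd,base ; lw rd,0(rd)` into one internal op."""
    is_idiom = (
        a.op == "slli" and b.op == "add" and c.op == "lw"
        and a.rd == b.rd == c.rd            # intermediates are overwritten, so never visible
        and b.rs1 == a.rd
        and b.rs2 is not None and b.rs2 != a.rd  # base must still hold its old value
        and c.rs1 == b.rd and c.imm == 0
    )
    if not is_idiom:
        return None
    # Hypothetical internal op: rd = mem[base + (index << shift)]
    return Insn("lw.indexed", rd=c.rd, rs1=b.rs2, rs2=a.rs1, imm=a.imm)


def fuse(program: list[Insn]) -> list[Insn]:
    """Scan with a three-instruction window, replacing matches with the fused op."""
    out = []
    i = 0
    while i < len(program):
        fused = try_fuse_indexed_load(*program[i:i + 3]) if i + 3 <= len(program) else None
        if fused is not None:
            out.append(fused)
            i += 3
        else:
            out.append(program[i])
            i += 1
    return out


program = [
    Insn("slli", "a5", "a1", imm=2),
    Insn("add", "a5", "a5", "a0"),
    Insn("lw", "a5", "a5", imm=0),
    Insn("addi", "a1", "a1", imm=1),
]
print(fuse(program))
```

The check that all three instructions write the same destination is what makes the fusion invisible to software: the intermediate values are dead as soon as the load completes.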




Sun Dec 27, 2020 9:40 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
It sounds good; I wonder, though, how it will pan out in practice. Macro-fusion is an interesting sounding technical term. Many small instructions are already fused into a larger one. For instance, compare-and-branch in a single instruction. Or regular register-indirect-with-displacement addressing. If one really wants to go all out fusing instructions then these instructions should be split apart so they can be implemented as fused instructions.

Do they give any examples of macro-fused instructions? I think there are not that many instructions that would be macro-fused, besides the obvious one of supplying the missing indexed addressing mode. For every example I could think of where one would want to use macro fusion, a CISC-style instruction works just as well.

My impression of macro-fusion is that the instructions fused must be locked together. And if there are intermediate results maybe a virtual intermediate result register file could be used. Intermediate registers do not need to persist outside the execution of the instruction. I think they can just be pipeline registers. But then one needs a way to distinguish intermediate registers from real registers.
Suppose one wants a * b + c * d using macro-fused instructions.
a * b = e would be one instruction
c * d = f would be one instruction
e + f = g would be one instruction
Note that e, f do not need to be real registers since they are discarded anyway once the result is formed. But then why not just use the fused-dot-product instruction?

With macro fused instructions one has to worry about how much precision to carry in a register. Is that multiply going to need a double precision result or not? With a CISC style instruction that can be hidden.

_________________
Robert Finch http://www.finitron.ca


Mon Dec 28, 2020 1:01 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1783
Yes, there is a worked example in the article.

I don't think precision is a concern because the fused macro ops are a way to keep the machinery busy, not intended to change the semantics of the instructions which get fused.

I think one way to look at this is a late binding. The external instruction encoding should be dense, and regular, so there's not too much cache pressure, memory bandwidth, or decoder complexity. The internal instruction encoding should be wide and trivial to decode, and of course suited to the specific capability of the microarchitecture. By binding late to the encoding which drives the execution units, it's possible and highly desirable to fit the internal encoding to the microarchitecture. What's needed is a stable external encoding, for portability, but a highly specific internal encoding, to allow for low-cost or high-performance implementations.
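A small sketch of that late binding, for one instruction only: the 16-bit C.ADDI form carries the same meaning as a 32-bit ADDI with rd used as both source and destination, so a front end can expand one into the other before anything downstream sees it. The bit positions follow the RISC-V compressed and base formats as I understand them; the function name is made up, and rd = x0 (C.NOP) and hint cases are not treated specially.

```python
def expand_c_addi(parcel: int) -> int:
    """Expand a 16-bit C.ADDI encoding into the equivalent 32-bit ADDI encoding."""
    if parcel & 0b11 != 0b01 or (parcel >> 13) & 0b111 != 0b000:
        raise ValueError("not a C.ADDI encoding")
    rd = (parcel >> 7) & 0x1F                       # rd is also the source register
    imm = ((parcel >> 2) & 0x1F) | (((parcel >> 12) & 1) << 5)
    if imm & 0x20:
        imm -= 0x40                                 # sign-extend the 6-bit immediate
    # ADDI layout: imm[11:0] | rs1 | funct3=000 | rd | opcode=0010011
    return ((imm & 0xFFF) << 20) | (rd << 15) | (0b000 << 12) | (rd << 7) | 0b0010011


print(hex(expand_c_addi(0x0505)))  # c.addi a0, 1
print(hex(expand_c_addi(0x157D)))  # c.addi a0, -1
```

The point is only that the dense external form and the wide internal form describe the same operation, so the choice of internal form is free to follow the microarchitecture.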

A central idea is that RISC-V was invented for, and optimised for, execution on high performance CPUs. Earlier RISC ISAs had in mind simple pipelining, and I think I'm right in saying that both MIPS and ARM had to be adjusted to suit later higher performance implementations.

Because RISC-V is still RISC, and because it's defined as an obligatory core set with various optional extensions, it should also work well for low power, low complexity, low cost implementations. And so it's also good for teaching:
13-Year-Old, Nicholas Sharkey, Creates a RISC-V Core


Mon Dec 28, 2020 12:45 pm

Joined: Mon Oct 07, 2019 2:41 am
Posts: 593
I think RISC may be better defined as "Easy to Pipeline with serial memory". Registers are a side effect rather than a feature of RISC. With all the complexity of today's RISCs, could a CISC like a CRAY give similar performance?


Mon Dec 28, 2020 10:55 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
I don't think precision is a concern because the fused macro ops are a way to keep the machinery busy, not intended to change the semantics of the instructions which get fused.

Without some assumed semantics more complex instructions are needed anyway. If one wants to get minimal error from a fused dot-product then a fused-dot-product instruction is required in the instruction set, because fusing simpler instructions which do not change semantics will not supply the lower error. To control the semantics of sequences of instructions that are fused would require additional instructions anyway.
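A quick numerical sketch of the difference, under the assumption that "fused" means the whole a * b + c * d is computed exactly and rounded once, while the unfused sequence rounds after every operation as ordinary double-precision instructions would. Exact rational arithmetic stands in for the wide internal datapath; the function names are made up for this example.

```python
from fractions import Fraction


def dot2_separate(a: float, b: float, c: float, d: float) -> float:
    """Two multiplies and an add, each rounded to double precision."""
    return a * b + c * d


def dot2_fused(a: float, b: float, c: float, d: float) -> float:
    """Exact intermediate result, rounded to double precision only once at the end."""
    exact = Fraction(a) * Fraction(b) + Fraction(c) * Fraction(d)
    return float(exact)


a = b = 1 + 2**-30
c = -1.0
d = 1 + 2**-29

print(dot2_separate(a, b, c, d))
print(dot2_fused(a, b, c, d))
```

With these inputs the exact product a * b carries a 2^-60 term that the first rounding discards, so the separately rounded sequence cancels to zero while the single-rounding version keeps the 2^-60. A fusion that preserves the semantics of the individual instructions has to produce the first answer.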

The example of scaled-indexed addressing given in the article was about the only example I could think of, of fusing instructions. That is why I would like to see another example besides an address calc. The result of the example was that it was just as good as the ARM. In other words, it was not a whole lot better.

I added scaled-indexed addressing as custom instructions to a RISC-V core as a 32-bit instruction and they are just as code dense as the example and use one less register. I do not see how fusing the instructions is better, it looks to be about equivalent to me. Except that it complicates the decode with additional pattern matching and fusion to create new instructions. It probably adds a pipeline stage or more to the decode.

To get high performance (in a simple manner) one probably wants to go with a fixed size instruction. I think the ARM 64 and PowerPC are better in that regard.

One place where macro-fusion / instruction fission is useful is in isolating an instruction set from the micro-architecture that executes it. This is maybe suitable when a complex instruction set is being implemented on simpler technology.

_________________
Robert Finch http://www.finitron.ca


Tue Dec 29, 2020 3:58 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1783
Agreed, a multiply-accumulate or dot-product instruction is (or may be) a different thing.

I found this:
https://en.wikichip.org/wiki/macro-oper ... ion#RISC-V

Quote:
The use of macro-op fusion in RISC-V was proposed in a 2016 Berkeley paper where a renewed case was made for the use of macro-operation fusion over bloating the ISA with more complex instructions. The paper compared the RISC-V isa performance in terms of instruction count on the popular SPEC CPU2006 benchmark where it is found to be slightly behind contemporary ISAs. In their paper, it's claimed that the RV64G and RV64GC effective instruction count can be reduced by 5.4% on average by leveraging macro-op fusion, thereby closing much of the deficiency gap


(I'm not greatly in favour of the pejorative term 'bloat' but it's in the title of the paper.)

> isolating an instruction set from the micro-architecture that executes it.

Absolutely! That for me is the main drive of RISC-V. It's not an ISA for a single implementation, or for a short period, it's meant to decouple one thing which should be stable from another thing which should be flexible.


Tue Dec 29, 2020 9:10 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Read through the Berkeley paper. RISC-V is definitely an excellent ISA and approach. Having quantified research is a little better than "it looks this way or that way". Using macro-fusion as a “fix” for the lack of indexed addressing is just as good as having indexed addressing, keeping in mind that the instruction set was designed for a wide variety of environments. It was noted that a couple of routines like memset() and memcpy() were suffering from a lack of wide memory loads and stores compared to other architectures such as ARM and x86.

I have been wondering how to use macro fusion to get more read ports for the fused dot-product operation, which requires four source ports. Could there be an instruction whose sole purpose is to read operands, which would then be fused to the following instructions? Hmm, I wonder if a branch-to-next instruction could be used for this. Branches read two operands and do not store a result.

_________________
Robert Finch http://www.finitron.ca


Tue Dec 29, 2020 12:21 pm