 microop 6502 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Getting the core ready to run in an FPGA. Simulation revealed a few more pipelining bugs. I have yet to see the blinky LEDs going.

The branch target buffer is on the critical path now. The issue is that it used block RAM with an inverted clock signal in order to get the target address within the same clock cycle as the fetch stage. That means it was effectively trying to operate at 100 MHz, and there just isn't enough time to meet the timing requirements. So the BTB was made slightly smaller and converted to use distributed RAM, which doesn't need the inverted clock. This should give it double the time to operate. The update inputs to the BTB are now registered as well. Delaying the BTB updates by a clock cycle is probably acceptable in exchange for much better timing; it's unlikely the predicted address update would be used in the very next clock cycle anyway. It may even be worth adding a second register stage on the inputs.
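For anyone following along, here's a rough cycle-level C model of the registered-update idea (the names and table size are mine, not the actual RTL): the lookup stays combinational so the target is available in the fetch cycle, while updates are latched for one clock before being written into the table.

Code:
#include <stdint.h>
#include <stdbool.h>

#define BTB_ENTRIES 128               /* made smaller to suit distributed RAM */

typedef struct {
    uint32_t tag;                     /* full PC used as the tag here */
    uint32_t target;
    bool     valid;
} btb_entry_t;

typedef struct {
    btb_entry_t table[BTB_ENTRIES];
    /* registered update inputs: applied one clock after they arrive */
    bool     upd_pending;
    uint32_t upd_pc, upd_target;
} btb_t;

/* Combinational lookup: result available in the same cycle as the fetch PC. */
static bool btb_lookup(const btb_t *btb, uint32_t pc, uint32_t *target)
{
    const btb_entry_t *e = &btb->table[pc % BTB_ENTRIES];
    if (e->valid && e->tag == pc) { *target = e->target; return true; }
    return false;
}

/* Called once per clock: commit the update registered last cycle,
   then capture this cycle's update request, if any. */
static void btb_clock(btb_t *btb, bool upd, uint32_t pc, uint32_t target)
{
    if (btb->upd_pending) {
        btb_entry_t *e = &btb->table[btb->upd_pc % BTB_ENTRIES];
        e->tag    = btb->upd_pc;
        e->target = btb->upd_target;
        e->valid  = true;
    }
    btb->upd_pending = upd;
    btb->upd_pc      = pc;
    btb->upd_target  = target;
}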

It looks like some of the combinational logic associated with branch misses is now on the critical path. This is fortunate: the branch-miss signal is active for at least two clock cycles, which means a pipeline register can be inserted into the combinational logic. The goal is to split the amount of work done per clock in half.
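To illustrate the principle in the same C-model style (the actual miss logic is of course more involved, and these half functions are just stand-ins): because the input is guaranteed stable for two cycles, the computation can be cut in half at a register and the correct result still appears while the signal is active.

Code:
#include <stdint.h>

typedef struct { uint32_t stage_reg; } miss_pipe_t;

/* Stand-ins for the two halves of the deep combinational logic. */
static uint32_t first_half(uint32_t in) { return in * 2654435761u; }
static uint32_t second_half(uint32_t t) { return t ^ (t >> 16); }

/* One call per clock. The result for a given input appears one cycle
   later, which is fine because the input persists for two cycles. */
static uint32_t miss_pipe_clock(miss_pipe_t *p, uint32_t in)
{
    uint32_t out = second_half(p->stage_reg); /* last cycle's half-result */
    p->stage_reg = first_half(in);            /* register this cycle's half */
    return out;
}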

_________________
Robert Finch http://www.finitron.ca


Mon Jan 06, 2020 4:05 am

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
Ok. It appears that I didn't formulate my question with the right terms, so let me try again:

My understanding of this thread and the general concept is this:

1- you start from 6502 opcodes
2- you convert native 6502 codes into secondary codes with a custom, wider encoding format, involving not just one code per 6502 instruction but potentially several, which are easier/more convenient to decode (see the sketch after this list)
3- you call these secondary codes "microop 6502"
4- the microop 6502 codes are easy to decode because they are wider (i.e. they contain all the required information for easy/quick decoding and execution)
5- the microop 6502 codes benefit from all the performance gains and optimisations that can be applied to pure RISC CPU cores
6- the same kind of thing is done in modern implementations of the x86. Wikipedia says that the width of these secondary codes for the Pentium Pro was 118 bits
7- the x86 does not seem to suffer any performance penalty from this. It is even comparatively better than some RISC processors, leaving aside complexity and required power.
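To make point 2 concrete, here is a hypothetical example in C (the micro-op fields and names are invented for illustration; this is not the actual microop-6502 encoding): a single 6502 read-modify-write instruction such as INC $1234 expands into several wide, fixed-format micro-ops that need no further decoding downstream.

Code:
#include <stdint.h>

/* Hypothetical wide micro-op: every field the execution units need
   is explicit, so nothing has to be re-decoded downstream. */
typedef struct {
    uint8_t  op;        /* internal operation: LOAD, ADD, STORE, ... */
    uint8_t  src_reg;   /* source register */
    uint8_t  dst_reg;   /* destination register */
    uint8_t  flags_upd; /* which status flags to update */
    uint16_t addr;      /* effective address, already extracted */
    uint16_t imm;       /* immediate operand */
} uop_t;

enum { UOP_LOAD, UOP_ADD, UOP_STORE };
enum { REG_NONE, REG_TMP, REG_A, REG_X, REG_Y };
enum { FL_NZ = 1 };

/* Expand one 6502 instruction into micro-ops; returns the count.
   Only INC abs (opcode $EE) is shown. */
static int expand_6502(const uint8_t *mem, uint16_t pc, uop_t *out)
{
    switch (mem[pc]) {
    case 0xEE: {                          /* INC $hhll: read-modify-write */
        uint16_t ea = mem[pc + 1] | (mem[pc + 2] << 8);
        out[0] = (uop_t){ UOP_LOAD,  REG_NONE, REG_TMP,  0,     ea, 0 };
        out[1] = (uop_t){ UOP_ADD,   REG_TMP,  REG_TMP,  FL_NZ, 0,  1 };
        out[2] = (uop_t){ UOP_STORE, REG_TMP,  REG_NONE, 0,     ea, 0 };
        return 3;
    }
    default:
        return 0;                         /* other opcodes omitted */
    }
}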

My question (again) is: can a CISC CPU, implemented as described, achieve better performance than a native RISC CPU because of the more flexible nature of the secondary opcodes, compared with the direct decoding of a RISC CPU?

(I hope my question is clearer now. English is NOT my primary language, so correcting my use of the language is DESIRABLE and helpful, also in public, honestly. But it's sometimes hard to get such cooperation from native speakers.)


Mon Jan 06, 2020 8:58 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
My feeling is that the x86's benefits from micro-ops come from the improved density of the CISC opcodes compared to a RISC, not so much from the micro-ops being more flexible. Also, the micro-ops are constructed for easy decoding, suited to the specific datapath microarchitecture: they are a private interface, not a public one. And there's a potential benefit (which I think has fallen in and out of practice) whereby there can be a cache of the micro-op version of the instruction stream, into which branches can branch, so that the decoding from CISC doesn't need to be re-done.
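A toy sketch of that last idea, again in C with invented names and geometry (real micro-op caches, such as the Pentium 4 trace cache, are far more sophisticated): index by fetch address, and on a hit skip the CISC decoder entirely.

Code:
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define UOPC_LINES    64   /* invented geometry */
#define UOPS_PER_LINE 4

typedef struct {
    bool     valid;
    uint32_t pc;                      /* fetch address this line came from */
    uint64_t uops[UOPS_PER_LINE];     /* already-decoded wide micro-ops */
    int      n;
} uopc_line_t;

static uopc_line_t uop_cache[UOPC_LINES];

/* Stand-in for the expensive variable-length CISC decoder. */
static int decode_cisc(uint32_t pc, uint64_t *uops)
{
    (void)pc; (void)uops;
    return 0;
}

/* Fetch micro-ops for pc. Branch targets that hit in the cache bypass
   the CISC decoder entirely; misses decode once and fill the line. */
static int fetch_uops(uint32_t pc, uint64_t *out)
{
    uopc_line_t *line = &uop_cache[pc % UOPC_LINES];
    if (!(line->valid && line->pc == pc)) {
        line->n = decode_cisc(pc, line->uops);  /* miss: decode and fill */
        line->pc = pc;
        line->valid = true;
    }
    memcpy(out, line->uops, sizeof line->uops); /* hit path: no re-decode */
    return line->n;
}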

The cost in the x86 has been huge complexity, big chips, and high power.


Mon Jan 06, 2020 9:41 am

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
Thanks Ed, that makes sense.


Mon Jan 06, 2020 11:59 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Having said which, it's not just x86. I was told by someone who ought to know that the front end of a high performance ARM is just as expensive as an x86 one. I found it hard to believe but there you are. It turns out high performance ARMs might also use micro-ops internally, for similar reasons:
https://www.quora.com/Why-do-ARM-proces ... operations


Mon Jan 06, 2020 1:40 pm

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
BigEd wrote:
Having said which, it's not just x86. I was told by someone who ought to know that the front end of a high performance ARM is just as expensive as an x86 one. I found it hard to believe but there you are. It turns out high performance ARMs might also use micro-ops internally, for similar reasons:
https://www.quora.com/Why-do-ARM-proces ... operations

Wow, that's really interesting. This means that once we enter the field of high-performance processors, the native instruction encoding doesn't really matter. Which makes me dream of the return of an instruction set as beautiful as the Digital VAX-11's, with fully regular, compact, variable-size encodings, allowing any arbitrary combination of addressing modes and operations in a single instruction, and both compiler and human friendly! Imagine that instruction set being processed as micro-ops by an imaginary CPU with the complexity of the x86, achieving the same or better performance... wouldn't that be delightful?
http://bitsavers.trailing-edge.com/pdf/dec/vax/archSpec/EY-3459E-DP_VAX_Architecture_Reference_Manual_1987.pdf
(How was this allowed to die in favour of the x86? Well, I know, so that's just a rhetorical question.)


Mon Jan 06, 2020 8:15 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Added a pipeline stage to the branch miss logic. This increases the penalty for missed branches by an extra clock cycle.

heads[] is an array of indexes into the issue/dispatch queue that indicates where the head of the queued items is.

Removed the heads[n] indirection in many places, substituting a plain 'n', as heads[n] ended up on the critical path. heads[n] involves de/multiplexing combinational logic in series with selecting a queue entry. It was used to identify queue slots in optimum order from the head of the queue: heads[0] being the head, heads[1] the next, and so on. The idea behind the indirection was to search the queue in order from the head down, so that the oldest instructions would tend to issue first. It's a nice ideal, but in many cases not practical in terms of performance. So now the queue is searched from entry 0 to the last entry without regard to the head. This doesn't actually matter much, because the core supports out-of-order operation, though it may lower performance very slightly. While using heads[n] should be algorithmically better, the simplicity of a plain 'n' results in a higher fmax, and the search still finds ready-to-issue entries, which tend to be towards the head of the queue anyway. A couple of places where ordering does matter, memory operation ordering and re-order buffer retirement, are left as is.
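In software terms the change looks something like this (a loose C analogy of the scan logic; in the RTL each iteration is parallel hardware, which is why the extra layer of indirection costs propagation delay):

Code:
#include <stdbool.h>

#define QENTRIES 8

typedef struct { bool valid, ready; } qentry_t;

static qentry_t queue[QENTRIES];
static int heads[QENTRIES];  /* heads[0] = oldest slot, heads[1] = next, ... */

/* Old scheme: scan in age order through the heads[] indirection.
   In hardware the extra mux layer sits in series with entry selection. */
static int find_issue_old(void)
{
    for (int n = 0; n < QENTRIES; n++) {
        int slot = heads[n];
        if (queue[slot].valid && queue[slot].ready)
            return slot;
    }
    return -1;
}

/* New scheme: scan slots 0..N-1 directly, ignoring age order.
   Out-of-order execution makes the issue order mostly a don't-care. */
static int find_issue_new(void)
{
    for (int n = 0; n < QENTRIES; n++)
        if (queue[n].valid && queue[n].ready)
            return n;
    return -1;
}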

I got lazy with the cache controllers while working in 'whip it up quickly' mode and had a select signal that was just combinational logic when it should have been registered. It showed up on the critical path, so it was modified to use registers.

I also registered the outputs of the ALU, except where it feeds back to itself. This trades a one-cycle delay in updating waiting arguments, which costs performance, for a higher fmax, which benefits it. <- Discovered with some more testing that this latest change didn't work, so it was backed out.

Running into some signals now which are difficult to pipeline. It looks like the fmax for the design may be about 32 MHz.

Just firing up the integrated logic analyzer (ILA) to try to identify why the core isn't working in the FPGA. It runs in simulation, at least for a short while, updating the LED output, but in the FPGA the LEDs are stuck at $d0.

_________________
Robert Finch http://www.finitron.ca


Tue Jan 07, 2020 3:57 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Accidentally found a place where the core hangs by simulating the system with the SIM flag set to false. When the SIM flag is false, the simulation uses all the real RTL code, as opposed to having some of it stubbed out to improve sim performance. After several system builds and some trial and error, I believe a bug has been worked out. It had to do with the register file source being erroneously set by the second instruction fetched when that instruction was not being queued, due to a branch on the first instruction. The core ran for about 13 us before hitting this.

_________________
Robert Finch http://www.finitron.ca


Wed Jan 08, 2020 3:48 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
joanlluch wrote:
... Which makes me dream on the return of an instruction set as beautiful as the Digital VAX-11, with fully regular, compact, variable size encodings, ... both compiler and human friendly! . Imagine that instruction set being processed as micro-ops ...

An interesting thought indeed - I started a new thread:


Fri Jan 10, 2020 9:19 pm