


 [ 84 posts ]  Go to page Previous  1, 2, 3, 4, 5, 6  Next
 microop 6502 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1482
Location: Canada
The PC increment pulses for macro-instruction fetch were two clocks wide. This didn’t matter most of the time. Got the micro-op engine working, at least for the tiny Fibonacci test program.
Modified the I$ to work in increments of 16-bit parcels rather than 13-bit ones. This was to eliminate a couple of multiply-by-13 operations and replace them with a power-of-two shift (by 16). Although the I$ now works with 16-bit parcels, only 13 bits of each are implemented. The change was made to improve performance.
Decided to use the three bits thus made available to add pre-decoding of the instruction length. The instruction length is now decoded when the L1 I$ is being loaded; previously it was decoded at the fetch stage, where it is now simply read. This should improve performance.
An attempt was made to add more pipelining at the fetch stage. The fetch stage does a fair amount of work in one long clock cycle and is suspected of limiting performance.

_________________
Robert Finch http://www.finitron.ca


Wed Dec 25, 2019 4:45 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1482
Location: Canada
Added more pre-decoding of instructions. Instructions that can affect program flow (branches, return, break) are now pre-decoded. These instructions are used in determining the next program counter in the fetch stage, and it helps to have the decodes available quickly. The decode of branches also feeds into the branch predictor. The pre-decode bits use more room in the instruction cache, increasing the size of the cache by about 50%.
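The pre-decode idea can be sketched in behavioral terms. This is a Python model for illustration only; the opcode values, the one-parcel/two-parcel split, and the flag names are assumptions, not the actual rtf65004 encoding:

```python
# Behavioral model of pre-decoding at cache-fill time. When the L1 I$ is
# loaded, each parcel is tagged with its decoded length and a flow-control
# flag; the fetch stage then just reads the stored tags instead of decoding.
# (Opcode values and field widths below are invented for illustration.)

def predecode(parcel):
    """Return (length_in_parcels, is_flow_control) for one parcel."""
    opcode = parcel & 0x3F                  # assume a 6-bit opcode field
    length = 2 if opcode >= 0x30 else 1     # pretend long formats start at 0x30
    is_flow = opcode in (0x05, 0x06, 0x07)  # branch / return / break (assumed)
    return length, is_flow

def fill_cache_line(parcels):
    """On a cache-line fill, store the pre-decode bits beside each parcel."""
    return [(p, *predecode(p)) for p in parcels]

# Fetch no longer decodes: it reads the stored bits to step the PC and to
# feed the next-PC / branch-predictor logic.
line = fill_cache_line([0x12, 0x35, 0x06])
```

Storing the extra bits trades cache capacity for a shorter fetch-stage critical path, which matches the ~50% cache growth noted above.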

Found an issue with the memory-issue logic where operations would only issue from one of the first eight queue slots. This would cause the core to hang, waiting for a load/store operation in the ninth or later queue slot to issue. Found another bug where a second memory instruction would issue too soon. Found these by inspection while refactoring code.

_________________
Robert Finch http://www.finitron.ca


Fri Dec 27, 2019 4:03 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1482
Location: Canada
Simulating the core today, after updating the assembler to v5. Worked on exception handling logic and oddball instruction commits.

_________________
Robert Finch http://www.finitron.ca


Sat Dec 28, 2019 4:44 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1482
Location: Canada
More simulation and bug fixes. Found several bugs in the generic portion of the core, which would also have been present in prior processors, including bugs that could cause the processor to halt. Went a little nuts with the types available in SystemVerilog and added a number of variable types. Implemented the different instruction formats using structured types and a union, which makes the code a little easier to understand.
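The struct/union approach, viewing one instruction word through several format lenses, has a rough software analogue. The field layout below is invented for illustration; the real formats live in the SystemVerilog source:

```python
# Python analogue of a SystemVerilog packed union over instruction formats:
# the same 32-bit word is decoded through different field layouts.
# Field positions are illustrative, not the rtf65004's actual layout.

def bits(word, hi, lo):
    """Extract bits hi..lo (inclusive) from a 32-bit word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

def as_r_format(word):
    """Register-register view (hypothetical layout)."""
    return {"opcode": bits(word, 5, 0), "rd": bits(word, 10, 6),
            "rs1": bits(word, 15, 11), "rs2": bits(word, 20, 16)}

def as_i_format(word):
    """Immediate view: same low fields, the rest is an immediate."""
    return {"opcode": bits(word, 5, 0), "rd": bits(word, 10, 6),
            "imm": bits(word, 31, 11)}

insn = 0x00012345
r = as_r_format(insn)   # both views decode the same underlying bits,
i = as_i_format(insn)   # as union members would
```

In the RTL version the payoff is readability: `ir.r.rd` style field accesses instead of hand-written bit slices.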

_________________
Robert Finch http://www.finitron.ca


Sun Dec 29, 2019 5:19 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1482
Location: Canada
Spent several hours trying to debug a mysterious ‘function must have at least one input’ error message. Things seemed to work in simulation but not in synthesis. It finally dawned on me that perhaps synthesis didn’t support all the nice features of simulation, and voilà, a quick check on the web revealed this to be true. The message was occurring because a class variable was used, and synthesis doesn’t support classes.

Worked on the compiler today.

_________________
Robert Finch http://www.finitron.ca


Mon Dec 30, 2019 5:47 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1482
Location: Canada
Added a buffer between the instruction fetch and dispatch queue. This is commonly called a decode buffer. Splitting the logic up greatly reduces the amount of work done in a single clock cycle, which potentially increases the max core clock frequency.

For some reason the core occasionally inserts duplicate instructions into the queue. This is a pipelining bug of some sort that I can’t seem to track down: the core works fine for several hundred cycles, then, poof, an extra instruction is inserted. So, for now, I added logic to the queuing to detect a duplicate instruction and mark the new duplicate as invalid; that seems to work.

_________________
Robert Finch http://www.finitron.ca


Tue Dec 31, 2019 10:56 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1632
I think back in the day when I did this sort of thing at work, this is where unit testing might well come in: with randomised flow control signals, your testbench checks that each pipelined unit obeys the pipeline rules. Testing for that kind of bug when you have a full system is, of course, much more difficult.
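The kind of unit test described, randomized flow control against a single pipeline stage, can be sketched as follows. This is a generic Python model of a one-deep valid/ready stage, not anyone's actual testbench; the pipeline rule checked is that data is never dropped or duplicated under arbitrary stall patterns:

```python
import random

class Stage:
    """A one-deep valid/ready pipeline stage: holds at most one item."""
    def __init__(self):
        self.slot = None

    def cycle(self, in_valid, in_data, out_ready):
        """One clock edge; returns (in_ready, out_valid, out_data)."""
        out_valid = self.slot is not None
        out_data = self.slot
        if out_valid and out_ready:      # downstream accepts: slot drains
            self.slot = None
        in_ready = self.slot is None     # can accept when (now) empty
        if in_valid and in_ready:
            self.slot = in_data
        return in_ready, out_valid, out_data

def randomized_check(cycles=10_000, seed=1):
    """Drive random valid/ready patterns; check no loss, no duplication."""
    random.seed(seed)
    stage, sent, received, n = Stage(), [], [], 0
    for _ in range(cycles):
        in_valid = random.random() < 0.5
        out_ready = random.random() < 0.5
        in_ready, out_valid, out_data = stage.cycle(in_valid, n, out_ready)
        if out_valid and out_ready:
            received.append(out_data)
        if in_valid and in_ready:
            sent.append(n)
            n += 1
    # Pipeline rule: output stream is exactly the input stream, in order.
    assert received == sent[:len(received)]
    return len(received)
```

The same shape works per-unit in an HDL testbench: randomize the handshakes, record what went in and what came out, compare.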


Wed Jan 01, 2020 8:45 am

Joined: Fri Nov 29, 2019 2:09 am
Posts: 18
robfinch wrote:
Worked on the compiler today.
This caught my attention. My understanding is that the micro-op core handles native 6502 instructions. Does the compiler generate standard 6502 code?


Wed Jan 01, 2020 11:01 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1482
Location: Canada
Quote:
I think back in the day when I did this sort of thing at work, this is where unit testing might well come in: with randomised flow control signals, your testbench checks that each pipelined unit obeys the pipeline rules. Testing for that kind of bug when you have a full system is, of course, much more difficult.
Yeah, unit testing is great; I used to do a lot of it. I used to say 90% of the job is unit testing (working as a programmer). For a hobby, though, a rigorously unit-tested approach would be too time-consuming; it would take too many man-years :) I do "unit testing" on the fly without proper documentation here. When I hit really nasty bugs that need to be worked out, I get more rigorous.

I set up a small test program to see what happens when a branch instruction spans a cache line and there's a branch miss or a branch taken.

I determined the issue after some more testing. The PC increment logic needed to be separated from the queue count logic: with the addition of the decode buffer, the PC increment now happens one cycle sooner than the queue count. The PC was incrementing by only one if there was a predicted-taken branch in the first slot, but the second slot still contained an instruction that was being queued erroneously. Since the PC incremented by only one, the instruction in the second fetch slot got fetched again, resulting in it being queued twice. (I disabled the kludge I put in earlier in order to get further; it doesn't really belong in the finished version.)
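The bug is easy to reproduce in a toy fetch model. This is a simplified two-slot Python sketch, not the actual RTL: redirect after the predicted-taken branch is ignored, and the point is only the interaction between the PC step and which slots get queued:

```python
# Model of a 2-slot fetch: each cycle fetches program[pc] and program[pc+1].
# When slot 1 holds a predicted-taken branch, only slot 1 is consumed, so
# the PC advances by 1. The bug: slot 2 was queued anyway, and the next
# cycle re-fetched (and re-queued) that same instruction.

def run_fetch(program, predicted_taken_pc, cycles, buggy):
    pc, queued = 0, []
    for _ in range(cycles):
        if pc + 1 >= len(program):
            break
        slot1, slot2 = program[pc], program[pc + 1]
        if pc == predicted_taken_pc:   # predicted-taken branch in slot 1
            queued.append(slot1)
            if buggy:
                queued.append(slot2)   # bug: slot 2 queued erroneously
            pc += 1                    # only one slot consumed
        else:
            queued.extend([slot1, slot2])
            pc += 2
    return queued

prog = list(range(10))
good = run_fetch(prog, predicted_taken_pc=2, cycles=5, buggy=False)
bad = run_fetch(prog, predicted_taken_pc=2, cycles=5, buggy=True)
# In the buggy run, the instruction after the branch appears twice.
```

The fix in both the model and the core is the same: when the first slot is a predicted-taken branch, suppress queuing of the second slot so the PC step and the queue count agree.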

Quote:
This caught my attention. My understanding is that the micro-op core handles native 6502 instructions. Does the compiler generate standard 6502 code?
Nope. I've been working on another project and posting under this topic. I should really change the topic title, as what I've been working on has mutated. It started out as the rtf65004 native instruction set.

_________________
Robert Finch http://www.finitron.ca


Wed Jan 01, 2020 4:46 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1482
Location: Canada
Ported the RTL code for an SoC to the current project to use as a testbed. Also worked on many different things today: the assembler and the compiler. Edited makefiles to enable building the system.

_________________
Robert Finch http://www.finitron.ca


Thu Jan 02, 2020 6:12 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1482
Location: Canada
Worked on sprite controller documentation.

The compiler outputs zero- and sign-extension instructions when needed. Well, these weren’t in the ISA, so they got added. Not wanting to use up a lot of opcode space for these rarely used operations, extra bits in the ‘or’ instruction were used to indicate whether to zero- or sign-extend the result. So now the ‘or’ operation is an ‘or with sign/zero extend’. Rather than muddy up the ‘or’ instruction, the alternate mnemonics ‘movsx’ and ‘movzx’ were defined.
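The extension semantics themselves are standard and can be written out compactly. A Python sketch of the semantics only; the register width and source widths here are examples, not the actual encoding:

```python
# Semantics of the zero/sign-extend result modes added to the 'or'
# operation (the movzx / movsx mnemonics). REG_BITS is illustrative.

REG_BITS = 16

def movzx(value, src_bits):
    """Zero-extend the low src_bits of value to REG_BITS."""
    return value & ((1 << src_bits) - 1)

def movsx(value, src_bits):
    """Sign-extend the low src_bits of value to REG_BITS."""
    v = value & ((1 << src_bits) - 1)
    sign = 1 << (src_bits - 1)
    # XOR-then-subtract folds the sign bit into the upper bits.
    return ((v ^ sign) - sign) & ((1 << REG_BITS) - 1)
```

Reusing spare bits of an existing ALU op this way keeps the opcode map clean while still giving the compiler the primitives it emits.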

_________________
Robert Finch http://www.finitron.ca


Fri Jan 03, 2020 4:25 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1482
Location: Canada
Ran the design through to implementation to get timing results. According to the results it could be clocked at about 23 MHz, 3 MHz faster than the clock supplied to the core. Not good enough! I want at least 40 MHz operation! So I’ve started looking at what’s slowing things down. According to the timing results there are 40 logic levels, and the slowest path involves the sequence numbers. If I recall correctly from reading newsgroups, the number of logic levels in a typical design is about 12. So I’m going to shoot for 24, cutting the logic levels in half.
Code:
Max Delay Paths
--------------------------------------------------------------------------------------
Slack (MET) :             8.969ns  (required time - arrival time)
  Source:                 ucpu1/iq_sn_reg[5][0]/C
                            (rising edge-triggered cell FDRE clocked by clk20_NexysVideoClkgen  {rise@0.000ns fall@25.000ns period=50.000ns})
  Destination:            ucpu1/iq_argA_reg[6][3]/D
                            (rising edge-triggered cell FDRE clocked by clk20_NexysVideoClkgen  {rise@0.000ns fall@25.000ns period=50.000ns})
  Path Group:             clk20_NexysVideoClkgen
  Path Type:              Setup (Max at Slow Process Corner)
  Requirement:            50.000ns  (clk20_NexysVideoClkgen rise@50.000ns - clk20_NexysVideoClkgen rise@0.000ns)
  Data Path Delay:        40.692ns  (logic 8.217ns (20.193%)  route 32.475ns (79.807%))
  Logic Levels:           40  (CARRY4=14 LUT2=1 LUT3=4 LUT4=4 LUT5=5 LUT6=10 MUXF7=2)
  Clock Path Skew:        0.038ns (DCD - SCD + CPR)
    Destination Clock Delay (DCD):    1.704ns = ( 51.704 - 50.000 )
    Source Clock Delay      (SCD):    1.669ns
    Clock Pessimism Removal (CPR):    0.003ns
  Clock Uncertainty:      0.106ns  ((TSJ^2 + DJ^2)^1/2) / 2 + PE
    Total System Jitter     (TSJ):    0.071ns
    Discrete Jitter          (DJ):    0.200ns
    Phase Error              (PE):    0.000ns

One simple change should improve things slightly: reducing the size of the sequence number. It’s currently 26 bits, but 20 bits is probably good enough. That would shave six bits off all the comparators.
I also found that use of the queue state variable (sometimes tested along with the sequence number) could be improved. It’s currently a packed (encoded) value, since the queue can only be in one state at a time. However, an encoded value leads to comparators, which end up on the critical timing path. So I changed the queue state variable back to a set of individual bits (it was a bitset in the original design). This means a state test is a single-bit check rather than a full comparison.
Another area that could be improved was the stomp logic, which operates when a branch miss occurs. By noting that the branchmiss signal is always active for at least two clock cycles, the stomp signal could be registered during the first clock cycle rather than being strictly combinational. Registering the signal means dependent logic sees the output of a flip-flop rather than more combinational logic. This should reduce the delay on the stomp path by quite a bit.
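The encoded-versus-bitset trade-off can be shown in miniature. Python stands in for the RTL here, and the state names are assumptions; the point is that a one-hot state test is a single bit, while an encoded state test is a comparison over all state bits:

```python
# Encoded state: every "is the entry in state X?" test is an equality
# comparison over the full state field. One-hot state bits: the same
# test is a single-bit AND, one LUT level shallower on the timing path.
# State names are illustrative, not the rtf65004's actual set.
from enum import IntEnum

class QState(IntEnum):          # encoded (packed) representation
    INVALID, QUEUED, OUT, DONE = range(4)

# one-hot representation: one flag bit per state
BIT_INVALID, BIT_QUEUED, BIT_OUT, BIT_DONE = 1, 2, 4, 8

def is_out_encoded(state):
    return state == QState.OUT          # full-width compare

def is_out_onehot(state_bits):
    return bool(state_bits & BIT_OUT)   # single-bit test

def encode_to_onehot(state):
    return 1 << int(state)
```

The cost is a wider state register (one flip-flop per state), which is usually cheap relative to the comparator logic it removes from the critical path.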

I made a few modifications over the day and got the timing improved somewhat. Trimmed the number of logic levels from 40 down to 35.
Code:
1st iteration timing improvement
trimmed over 2 ns off delay

Max Delay Paths
--------------------------------------------------------------------------------------
Slack (MET) :             11.081ns  (required time - arrival time)
  Source:                 ucpu1/misssn_reg[1]/C
                            (rising edge-triggered cell FDRE clocked by clk20_NexysVideoClkgen  {rise@0.000ns fall@25.000ns period=50.000ns})
  Destination:            ucpu1/ualu1/o1__6/A[0]
                            (rising edge-triggered cell DSP48E1 clocked by clk20_NexysVideoClkgen  {rise@0.000ns fall@25.000ns period=50.000ns})
  Path Group:             clk20_NexysVideoClkgen
  Path Type:              Setup (Max at Slow Process Corner)
  Requirement:            50.000ns  (clk20_NexysVideoClkgen rise@50.000ns - clk20_NexysVideoClkgen rise@0.000ns)
  Data Path Delay:        38.619ns  (logic 7.889ns (20.428%)  route 30.730ns (79.572%))
  Logic Levels:           40  (CARRY4=14 LUT2=3 LUT3=1 LUT4=3 LUT5=4 LUT6=13 MUXF7=2)
  Clock Path Skew:        0.168ns (DCD - SCD + CPR)
    Destination Clock Delay (DCD):    -2.892ns = ( 47.108 - 50.000 )
    Source Clock Delay      (SCD):    -2.649ns
    Clock Pessimism Removal (CPR):    0.411ns
  Clock Uncertainty:      0.106ns  ((TSJ^2 + DJ^2)^1/2) / 2 + PE
    Total System Jitter     (TSJ):    0.071ns
    Discrete Jitter          (DJ):    0.200ns
    Phase Error              (PE):    0.000ns

2nd iteration timing improvement   
     Logic Levels:           35  (CARRY4=10 LUT2=2 LUT3=2 LUT4=5 LUT5=3 LUT6=11 MUXF7=2)

The RTL code changes have resulted in a pipeline issue again.

_________________
Robert Finch http://www.finitron.ca


Sat Jan 04, 2020 4:43 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1482
Location: Canada
Forgot to gate the return address onto the bus for the JAL instruction, so the first return didn’t work.
Got the logic levels down to 18 using additional pipelining, quite a bit better than the 40 levels. According to the timing summary the core should be able to work at close to 50 MHz.
Code:
  Data Path Delay:        19.417ns  (logic 3.944ns (20.313%)  route 15.473ns (79.687%))
  Logic Levels:           18  (CARRY4=7 LUT2=1 LUT3=2 LUT4=1 LUT5=2 LUT6=3 RAMD32=1 RAMD64E=1)

It would be nice if the core could work at 50MHz because there’s already a 4x and 2x clock available.
With the additional pipelining come additional pipelining bugs to work out.

_________________
Robert Finch http://www.finitron.ca


Sun Jan 05, 2020 5:41 am

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
Hi Rob,

50 MHz looks impressive to me, but I'm confused about what that figure actually means. I understand this is the clock frequency that you can achieve given your design (including the degree of pipelining) and the particular type of FPGA you use for modelling. It would be interesting to compare that figure with a standard implementation of a RISC-V or MIPS processor on the same FPGA with the same tools.

Up to a point, I can imagine that a microcoded CISC processor could potentially run as fast as, or even faster than, a natively running RISC processor, because the micro-ops for the CISC processor can be highly tuned to achieve maximum performance through very wide micro-code encodings and deep pipelining. I think this is the case for x86 processors, which according to Wikipedia used micro-ops up to 118 bits wide in the Pentium Pro, with the ability to hold 32-bit immediate values.

So, how does the implementation of the microop 6502 compare with the performance that would be achieved by a natively running RISC CPU (I mean without a microcoding stage, just direct decoding)?


Sun Jan 05, 2020 9:56 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1482
Location: Canada
Quote:
50 MHz looks impressive to me, but I'm confused about what that figure actually means. I understand this is the clock frequency that you can achieve given your design (including the degree of pipelining) and the particular type of FPGA you use for modelling. It would be interesting to compare that figure with a standard implementation of a RISC-V or MIPS processor on the same FPGA with the same tools.
I seem to recall seeing a spec of 90 MHz for the RISC-V Boom processor in an FPGA. I’m not sure what FPGA / toolset was in use, but I’m guessing they used a faster part. The FPGA I’m using is a lower-cost part; there are much faster versions of it. I sometimes have a look at the Apollo website, which is a 68x080 core running in an FPGA. I think they hit something like 200 MHz, but in a fast FPGA.
Quote:
Up to a point, I can imagine that a microcoded CISC processor could potentially run as fast as, or even faster than, a natively running RISC processor, because the micro-ops for the CISC processor can be highly tuned to achieve maximum performance through very wide micro-code encodings and deep pipelining. I think this is the case for x86 processors, which according to Wikipedia used micro-ops up to 118 bits wide in the Pentium Pro, with the ability to hold 32-bit immediate values.
I’m not sure what they mean by micro-ops 118 bits wide :) Is it referring to a whole queue entry or to a distinct entity on its own? The information in the micro-op queue for the rtf65004 is wide as it is; it includes a copy of the original instruction (32 bits) and the program counter along with the “micro-op”, among other things. A queue entry is about 85 bits wide.
Quote:
So, how does the implementation of the microop 6502 compare with the performance that would be achieved by a natively running RISC CPU (I mean without a microcoding stage, just direct decoding)?
At the moment the core is running just as a direct-decoding design, without the micro-op stage.

_________________
Robert Finch http://www.finitron.ca


Mon Jan 06, 2020 4:04 am