 [ 84 posts ]  Go to page 1, 2, 3, 4, 5, 6  Next
 microop 6502 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1528
Location: Canada
I’m toying with the idea of a superscalar 6502. It would work by translating 6502 opcodes into micro-ops, in a manner similar to what’s done for the x86. So I need to work out an appropriate set of micro-ops. The micro-ops would form a load/store architecture with a fixed 13-bit instruction format. I think all instructions can be implemented with a maximum of four micro-ops. This means a table entry of 54 bits for each instruction (4 × 13 bits, plus 2 bits to indicate the number of micro-ops). The table would be indexed by the opcode byte, and a field (Ld3) in the micro-op instruction indicates when to take values from the macro-op instruction; this is necessary for constants supplied by macro-ops.
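To make the 54-bit layout concrete, here is a minimal Python sketch of such a decode-table entry. This is my own illustration, not the actual hardware encoding; the field order (count in the low bits) is an assumption.

```python
# Illustrative only: packs 1-4 micro-ops (13 bits each) plus a 2-bit count
# into one 54-bit decode-table entry. Field placement is an assumption.
MICROOP_BITS = 13
MAX_UOPS = 4

def pack_entry(uops):
    assert 1 <= len(uops) <= MAX_UOPS
    entry = len(uops) - 1                      # 2-bit count: 0..3 -> 1..4 uops
    for i, u in enumerate(uops):
        entry |= u << (2 + i * MICROOP_BITS)   # append each 13-bit micro-op
    return entry                               # fits in 2 + 4*13 = 54 bits

def unpack_entry(entry):
    count = (entry & 3) + 1
    return [(entry >> (2 + i * MICROOP_BITS)) & ((1 << MICROOP_BITS) - 1)
            for i in range(count)]
```

The 256-entry table would then hold one such packed word per opcode byte.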

Micro-op table follows:
Attachment:
microops.png

Some sample instruction breakdowns:
Code:
[b]pha[/b]
SB   acc,sp
ADD.B   sp,#-1
[b]adc (zp),y[/b]
LDW tmp,zp
ADD tmp,y
LDB tmp,[tmp]
ADC.B Acc,tmp
[b]rti[/b]
ADD.B sp,#1
LDB sr,sp
ADD.B   sp,#2
LDW pc,sp-1
[b]rts[/b]
ADD.B sp,#2
LDW pc,sp-1
ADD.w pc,#1



_________________
Robert Finch http://www.finitron.ca


Tue Nov 26, 2019 3:25 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1528
Location: Canada
Have to rethink what’s going on with this core. It looks like it would have to be able to queue eight micro-ops per clock cycle (enough for two instructions); otherwise there’s no point in going superscalar. Queuing eight is a lot. I wonder if instead the four micro-ops composing an instruction could be queued as a single unit.

Well, it’s queuing between two and eight micro-ops to a micro-op queue. The micro-op queue then feeds an issue queue at a rate of three micro-ops per clock cycle. The core will be three-way micro-op superscalar. This will allow up to three simple instructions (e.g. INX, CMP #) to execute per clock. More complicated instructions (those with an address mode) may take longer to execute. Two macro-instructions will be fetched per clock cycle.
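The fetch and issue rates above put a simple bound on throughput. A back-of-the-envelope helper (my own, purely illustrative):

```python
# Sustained 6502 instructions per clock is limited by both the 2-wide
# macro-instruction fetch and the 3-wide micro-op issue rate described above.
def sustained_ipc(avg_uops_per_insn, fetch_insns=2, issue_rate=3):
    return min(fetch_insns, issue_rate / avg_uops_per_insn)
```

With 1-micro-op instructions this is fetch-limited at 2 IPC (CPI 0.5); at an average of 3 micro-ops per instruction it becomes issue-limited at 1 IPC.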

A micro-op assembly language is being defined. The goal is that every instruction can be implemented with four or fewer micro-ops. The micro-op table has been set up for most instructions. With a little luck it will be possible to update this table at run time in order to allow new instructions to be defined. The micro-op instructions had to be expanded to 16 bits. Some thought is being given to enhancing the micro-op instruction set so it can be used to emulate processors other than the 6502. Initially only the 6502 instruction set will be supported. It should be trivial to support the 65C02 instruction set in the future. Supporting the 65816 is more difficult because of its different modes.

As a wild guess, the author is estimating the core to be between 50,000 and 100,000 LCs. It should fit easily into the FPGA.

_________________
Robert Finch http://www.finitron.ca


Thu Nov 28, 2019 3:11 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1644
Have you seen this thread by Drass over on 6502.org? There might be some cross-fertilisation of ideas possible:


Thu Nov 28, 2019 9:01 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1528
Location: Canada
That’s an interesting thread. I just have to post there. I skimmed over it a while ago. Is it more than a Python emulator? I was surprised to see such good numbers for the CPI. I would’ve thought that the branches would cause quite a loss since there are seven stages flushed. I was only able to get down to about 1.5 CPI for a similarly pipelined processor. The design on 6502.org waits until the writeback stage for branches. I’m pretty sure it’s really only necessary to wait until the execute stage. I prefer to determine average CPI by measuring running software. Hardware glitches can cause the CPI to be inaccurate.

_________________
Robert Finch http://www.finitron.ca


Thu Nov 28, 2019 9:30 am

Joined: Fri Nov 29, 2019 2:09 am
Posts: 18
Hi Rob,

I saw your post regarding this on 6502.org and I thought I would provide some answers to your questions below.
robfinch wrote:
Is it more than a python emulator?
Not at present. My plan is to use the Python script to test out various ideas and generally get the hang of the pipelined architecture. I’m not familiar with Verilog, so Python seemed to be the easiest choice for prototyping something before attempting an implementation. The objective is to eventually implement the CPU in TTL logic, but that may prove far too ambitious. If that’s the case, I’ll probably try my luck at Verilog.

Quote:
I was surprised to see such good numbers for the CPI. I would’ve thought that the branches would cause quite a loss since there’s seven stages flushed. I was able only to get down to about 1.5 CPI for a similarly pipelined processor.
So far I have tried only specialized workloads. We’ll have to see if the CPI holds up over time. The first attempt at running the ehBASIC interpreter, for example, yielded a CPI of 1.42. I have a couple of optimizations in mind that should improve that, but as you suggest, performance is likely to be limited by the branch prediction logic and associated mis-prediction penalty.

Quote:
The design on 6502.org waits until the writeback stage for branches. I’m pretty sure it’s really only necessary to wait until the execute stage.
Can you elaborate on this a bit? I am very interested to understand what you have in mind.

Quote:
I prefer to determine average CPI by measuring running software. Hardware glitches can cause the CPI to be inaccurate.
I’m also curious what you mean by this. The CPI is currently calculated simply by dividing the total number of cycles (iterations of the main loop of the Python program) by the number of 6502 instructions that are retired at the WriteBack stage. It’s not currently my intention to instrument the CPU to report CPI as part of the implementation.
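For what it’s worth, the calculation described amounts to just:

```python
# CPI as described: total cycles (main-loop iterations) divided by the
# number of 6502 instructions retired at the WriteBack stage.
def cpi(total_cycles, retired_instructions):
    return total_cycles / retired_instructions
```

The ehBASIC figure quoted above corresponds to 142 cycles per 100 retired instructions.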


Fri Nov 29, 2019 2:30 am

Joined: Fri Nov 29, 2019 2:09 am
Posts: 18
Regarding Microop, I’m not very familiar with superscalar architectures, so apologies in advance if these questions are naive ...

  • I am assuming that microops are fed into a standard pipeline as they are dispatched. Is that so? You mention that three microops are dispatched per clock. Does that mean there are three parallel pipelines?
  • You mention that 6502 instructions can be implemented in four or fewer microops, and simple 6502 instructions like INX and CMP # can be implemented in a single microop. Do you have a sense of what the average CPI might be at three microops per clock?
  • Does the CPU resolve dependencies between instructions by stalling, or by buffering and dispatching instructions only when their input operands are available?


Fri Nov 29, 2019 2:45 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 271
Will this version have the 6502/6800 clock 1 / clock 2 memory cycle, or some other memory interface for faster speeds?


Fri Nov 29, 2019 5:58 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1528
Location: Canada
Quote:
I am assuming that microops are fed into a standard pipeline as they are dispatched. Is that so? You mention that three microops are dispatched per clock. Does that mean there are three parallel pipelines?

The micro-ops are fed into a dispatch/issue queue. Multiple functional units operating in parallel sit between the issue queue and the re-order buffer (queue). They get used as they become available and instructions are ready. It isn’t exactly like three parallel pipelines, but the functional units are used in parallel. One could say there are six parallel pipelines, but they aren’t identical. There are two ALUs, two address generators, a branch unit, and a memory unit that can handle two reads at the same time. It’s an out-of-order design, slightly different from an in-order design, which might have parallel pipelines.
Quote:
You mention that 6502 instructions can be implemented in four or fewer microops, and simple 6502 instructions like INX and CMP # can be implemented in a single microop. Do you have a sense of what the average CPI might be at three microops per clock?
It may sound strange, but I was hoping for a CPI between 1 and 2, which it seems wouldn’t be any better than a pipelined design. Although in ideal cases three simple instructions could be executing at the same time, the pipeline is fed with only two 6502 instructions per clock, so it can’t run any faster than that on average (CPI = 0.5 max average). While perhaps the CPI won’t end up being any better than a pipelined design, out-of-order designs can hide some of the memory access time that would cause stalls in a regular pipeline. I’ve run a similar design and found that the CPI was between 3 and 4 even though multiple instructions were executing at the same time; it was so high because the memory access time and filling cache misses swamped the out-of-order pipeline (memory access taking, for instance, 10–20 cycles).
Quote:
Does the CPU resolve dependencies between instructions by stalling, or by buffering and dispatching instructions only when their input operands are available?
Instructions are buffered and issued only when their input operands are available. Other instructions that are ready to execute might execute first. However, with the 6502, instructions (micro-ops) are pretty simple and so likely to execute in order anyway, since for the most part they all take the same length of time to execute (there’s no multiply or divide or other multi-cycle ops). Mainly the memory instructions, which require multiple cycles, may cause the order to be disrupted.
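A minimal sketch of that issue policy (my own toy model, not the actual Verilog; register names and the not-ready-same-cycle assumption are illustrative):

```python
# Issue-when-ready: each queued micro-op is (dest, [sources]). A register is
# busy while an earlier, still-executing micro-op has it as a destination.
def pick_issuable(queue, busy_regs, width=3):
    """Select up to `width` micro-ops whose source operands are all ready."""
    issued, busy = [], set(busy_regs)
    for dest, srcs in queue:
        if len(issued) == width:
            break
        if any(s in busy for s in srcs):
            busy.add(dest)       # dependents of a stalled op must also wait
            continue
        issued.append((dest, srcs))
        busy.add(dest)           # result in flight; not ready this cycle
    return issued
```

With `x` still being loaded, an op reading `x` and an op depending on it both stall, while a later independent op issues first, which is exactly the reordering described above.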
Quote:
Will this version have the 6502/6800 clock 1 / clock 2 memory cycle, or some other memory interface for faster speeds?
Some other. This design uses caches for both instructions and data. L1 cache access is single cycle. Reading the L2 cache takes about 3 clock cycles. However, the interface to main memory uses the WISHBONE bus (in burst mode), so it typically takes numerous cycles to fill a cache line. Cache lines are 512 bits and the memory bus is only 128 bits, so a fill takes a minimum of four cycles, but there is also other overhead. Note that the I$ reads enough bytes to provide two 6502 instructions in a single cycle (6 bytes or more).
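The minimum-fill figure is just ceiling division of line width by bus width:

```python
# 512-bit cache lines over a 128-bit memory interface need at least four
# burst beats; real fills add further bus overhead on top of this minimum.
def min_fill_beats(line_bits=512, bus_bits=128):
    return -(-line_bits // bus_bits)   # ceiling division
```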

_________________
Robert Finch http://www.finitron.ca


Fri Nov 29, 2019 7:42 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1528
Location: Canada
Coding for rtf65004 (the micro-op 6502) is coming along nicely. The code is a bit simpler than something like nvio because of the limited list of operations provided by the micro-ops. For instance, there are no call/jsr or rts instructions to deal with. These are all wrapped up in multiple micro-ops which translate into single jumps.

Arrgh! The brk instruction requires six micro-ops. I got rid of using the pc as a target register for loads (which allowed indirect jumps). The author isn’t sure what to do about this. It’s the only instruction requiring six, and it’d be pretty wasteful to make the whole table six micro-ops wide just for one instruction.
Code:
ADDB   sp,#-3   ; subtract 3 from stack pointer
STB   sr,1[sp]      ; store status reg
STW   pc,2[sp]   ; store program counter
SEI         ; set interrupt mask
LDW   tmp,$FFFE   ; fetch break address
JMP   $0000[tmp]   ; jump to break routine


For now, the brk case is specially checked for and the micro-ops are set manually into the queue, rather than looking them up from the table.
There must be some special case logic for handling reset and hardware interrupts. A reset or hardware interrupt forces a brk instruction into the instruction stream. The only difference is which vector is used for the interrupt routine.
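Sketched in Python (my own illustration), that special-case logic amounts to selecting a vector while forcing the same BRK micro-op sequence; the vector addresses shown are the standard 6502 ones, where BRK and IRQ share $FFFE.

```python
# Reset and hardware interrupts inject the same forced BRK micro-op
# sequence; only the fetch vector differs.
VECTORS = {'reset': 0xFFFC, 'nmi': 0xFFFA, 'irq': 0xFFFE, 'brk': 0xFFFE}

def inject_brk(cause):
    # 'BRK_UOPS' stands in for the manually queued micro-op sequence.
    return ('BRK_UOPS', VECTORS[cause])
```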

Thinking about how to compact the micro-op table. The author thinks it could be done with a little bit of indirection. Also thinking about how the design could be extended to the 65816 and beyond.

_________________
Robert Finch http://www.finitron.ca


Sun Dec 01, 2019 3:27 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1528
Location: Canada
Worked on the status register commit. Status register commit is interesting because it uses a bitmask of the bits to update in the status register. This allows only the parts of the register that correspond to the instruction to be updated. Many 6502 instructions update only N and Z, for instance.
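A small sketch of the masked commit (my own illustration; the flag constants are the standard 6502 bit positions):

```python
# Only the bits selected by the instruction's mask are updated;
# everything else in the status register is preserved.
N, V, B, D, I, Z, C = 0x80, 0x40, 0x10, 0x08, 0x04, 0x02, 0x01

def commit_sr(old_sr, new_sr, mask):
    return (old_sr & ~mask) | (new_sr & mask)
```

An LDA-like result commits with `mask = N | Z`, leaving C (and everything else) untouched.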

Worked on the branch unit. Chopped a bunch of code off the nvio branch unit since there are only two types of flow control transfers: a jump or a branch. All the code to support calls, returns, break’s and rti could be removed.

For comparison: up to about 4,500 lines now in the main file. nvio3 is about 8,300 lines. And getting close to running through the synthesizer.

_________________
Robert Finch http://www.finitron.ca


Mon Dec 02, 2019 3:38 am

Joined: Fri Nov 29, 2019 2:09 am
Posts: 18
I noticed the BRK sequence above uses a store byte followed by a store word. What happens if the store word for the return address crosses a word boundary on the stack? Does the write happen over two cycles?


Mon Dec 02, 2019 4:16 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1528
Location: Canada
Quote:
I noticed the BRK sequence above uses a store byte followed by a store word. What happens if the store word for the return address crosses a word boundary on the stack? Does the write happen over two cycles?

It may. The memory unit takes care of unaligned accesses. The external bus currently uses 128-bit accesses (mainly for cache line loads), so it would only require two accesses if the stored word crosses a 128-bit boundary. Most of the time (15/16) it doesn’t.
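The 15/16 figure checks out: a 2-byte store crosses a 16-byte (128-bit) boundary only when it starts at the last byte of a 16-byte block. A quick check:

```python
# A store of `size` bytes crosses a 16-byte boundary iff its first and
# last bytes fall in different 16-byte blocks.
def crosses_16B(addr, size=2):
    return addr // 16 != (addr + size - 1) // 16

# Of the 16 possible start offsets, only offset 15 crosses.
crossing = sum(crosses_16B(a) for a in range(16))
```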
Another related issue is stack wrapping over/under from $100 to $1FF. The second write must be checked to see if it wraps. This isn’t being done yet. I want to get the core basically working then worry about corner cases.
I thought about using individual byte reads and writes for everything, but that would slow the core down and increase the number of micro-ops required. Also thought about reading or writing 24-bit values which is appealing to support the 65816 [] modes. For the 65816 there’s also the program bank register to take care of, so the number of micro-ops for BRK and RTI is increased (unless 24 bit values are supported). The approach I use is to get the gist of it done, then iron out the details later. Moving from high-level to lower level. It’s only possible to use this approach because ripping up the circuits and re-wiring them is cheap in the FPGA.
There is also a need to store and read PC+2. PC+2 is stored by the BRK / JSR instructions. I was planning on supporting it as a special register read, as opposed to using micro-ops to increment the PC. The BRK micro-op list should really include this.
Code:
ADDB   sp,#-3   ; subtract 3 from stack pointer
STB   sr,1[sp]      ; store status reg
STW   pc2,2[sp]   ; store program counter plus two
SEI         ; set interrupt mask
LDW   tmp,$FFFE   ; fetch break address
JMP   $0000[tmp]   ; jump to break routine

It may be possible to get rid of the SEI micro-op by integrating it with another one. For instance JMP and SEI could be combined into a new micro-op JSI.

To support the 65816 the m and x status bits would be combined with the opcode to form a 10-bit index into the micro-op list table. There would be 1024 table entries instead of 256, with a number of redundant entries.
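The index formation would be something like the following (the bit placement of m and x is my assumption; only the 10-bit width and the 1024-entry count come from the post):

```python
# Prepend the 65816 m and x mode bits to the 8-bit opcode to form a
# 10-bit index into a 1024-entry micro-op list table.
def table_index(opcode, m, x):
    return (m << 9) | (x << 8) | opcode
```

In 6502 emulation mode (m = x = 1) every opcode lands in the top quarter of the table, which is where the redundant entries come from.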
The neat thing about the micro-ops is that maybe only a handful need to be defined and the remaining ones could be loaded during the reset routine.

_________________
Robert Finch http://www.finitron.ca


Mon Dec 02, 2019 6:42 am

Joined: Fri Nov 29, 2019 2:09 am
Posts: 18
robfinch wrote:
The memory unit takes care of unaligned accesses. The external bus currently uses 128-bit accesses (mainly for cache line loads) so it would only require two accesses if the word stored crosses the 128-bit boundary. Most of the time it (15/16) it doesn’t.
So the store-word from the core gets buffered without concern for alignment. Then the memory unit completes the write to the cache, and may take two cycles to do it. Is that right? Writes to external memory are 128 bits wide, but I assume writes to the cache are one word at a time, correct? How wide is your external bus? Thank you for these informative answers.

Quote:
Another related issue is stack wrapping over/under from $100 to $1FF. The second write must be checked to see if it wraps.
Yes, an interesting side-effect of the store-word micro-op to stack writes.

Quote:
I thought about using individual byte reads and writes for everything, but that would slow the core down and increase the number of micro-ops required.
This is currently what I am doing, and have found the additional overhead for JSR to be undesirable. I am currently implementing a store-word micro-instruction to alleviate the problem. My questions about word-alignment arise out of this consideration.

I had originally hesitated to implement 16-bit writes because each byte address then has to be checked individually to identify RAW dependencies in the pipeline. I am trying to keep the number of comparators in the hazard logic to a manageable number, but the performance benefit seems worth it in this case.

Quote:
Also thought about reading or writing 24-bit values which is appealing to support the 65816 [] modes. For the 65816 there’s also the program bank register to take care of, so the number of micro-ops for BRK and RTI is increased (unless 24 bit values are supported).
My thought on this issue was that long jumps in general are not likely to be as prevalent in 65816 code, and therefore an additional micro-op (or additional microinstruction in my case) to handle 24-bit return addresses might work out fine.

Quote:
The approach I use is to get the gist of it done, then iron out the details later. Moving from high-level to lower level. It’s only possible to use this approach because ripping up the circuits and re-wiring them is cheap in the FPGA.
Perhaps I should consider a Verilog model as well.

Quote:
There is also a need to store and read the PC+2. PC+2 is stored by the BRK / JSR instructions. I was planning on supporting it as special register read as opposed to using micro-ops to increment the pc. The BRK micro-op list should really include this.
Code:
ADDB   sp,#-3   ; subtract 3 from stack pointer
STB   sr,1[sp]      ; store status reg
STW   pc2,2[sp]   ; store program counter plus two
SEI         ; set interrupt mask
LDW   tmp,$FFFE   ; fetch break address
JMP   $0000[tmp]   ; jump to break routine
Note that JSR and BRK require different PC offsets, which is a pain. RTS increments the return address before using it, whereas RTI does not.

Quote:
It may be possible to get rid of the SEI micro-op by integrating it with another one. For instance JMP and SEI could be combined into a new micro-op JSI.
I concur with the idea of combining SEI into another micro-op. The benefits probably outweigh the complication of a specialized micro-op. A specialized PUSH P micro-op which writes the status register to the stack and also sets the status flags appropriately for the mode of operation might be useful. (Note that the 65C02 clears the Decimal flag on interrupts. The 65816 might also, if I recall correctly).

Quote:
To support the 65816 the m and x status bits would be combined with the opcode to form a 10-bit index into the micro-op list table. There would be 1024 table entries instead of 256, with a number of redundant entries.
Ok, wow. That’s an interesting observation, and a tidy solution.

Quote:
The neat thing about the micro-ops is that maybe only a handful need to be defined and the remaining ones could be loaded during the reset routine.
A dynamically adjusted micro-op table. Cool. It might be nice to be able to switch between 65816 and 6502 modes programmatically without having to reset the CPU. That’s what we did in the C74-6502 TTL implementation.


Mon Dec 02, 2019 2:13 pm

Joined: Fri May 05, 2017 7:39 pm
Posts: 22
If I understand this (highly interesting project and discussion) correctly, there is a mechanism that reads a 6502 program and pushes a set of micro-ops for each instruction it encounters into a queue for execution. Then it continues fetching the next instruction, and so on. Special cases are branches, as you cannot predict whether one is taken or not, and so where the next instruction is. Less special are jumps (even indirect ones), returns, and BRK.

As BRK exceeds the four micro-op limit, what about 'flagging' this one out: when you encounter a BRK opcode the first time, you set a special flag and push the first half of the micro-ops into the queue. The special flag inhibits the read PC from advancing, causing the BRK to be read a second time. This time the flag is already set, so you select a second micro-op sequence and reset the special flag (which in turn allows the PC to be incremented). This would be just one more line in the opcode list.
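For illustration, the two-pass trick could be modeled like this (my own sketch; the micro-op rows are taken from the six-op BRK sequence earlier in the thread):

```python
# Sticky-flag decode: the first encounter of BRK queues the first half of
# the sequence and holds the PC; the second pass (flag set) queues the rest.
BRK = 0x00
TABLE = {
    (BRK, False): ['ADDB sp,#-3', 'STB sr,1[sp]', 'STW pc,2[sp]'],
    (BRK, True):  ['SEI', 'LDW tmp,$FFFE', 'JMP $0000[tmp]'],
}

def decode(opcode, pc, brk_flag):
    uops = TABLE[(opcode, brk_flag)]
    if opcode == BRK and not brk_flag:
        return uops, pc, True      # inhibit PC advance; re-read BRK
    return uops, pc + 1, False     # normal advance, flag cleared
```

Two decode passes over the same opcode yield all six micro-ops with the table staying four entries wide.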

May be nonsense.

Cheers,
Arne


Mon Dec 02, 2019 9:02 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1528
Location: Canada
Quote:
So the store-word from the core gets buffered without concern for alignment. Then the memory unit completes the write to the cache, and may take two cycles to do it. Is that right? Writes to external memory are 128-bit wide, but I assume writes to the cache are one word at a time, correct? How wide is your external bus? Thanks you for these informative answers.
Sounds right. The external bus is 128 bits for both read and write. You got me looking at the cache controller and write-buffer code. The cache line is actually 64+1 bytes (520 bits) wide; the +1 is for a data word that spans two cache lines. Updating the D$ on a write hit takes only a single cycle.* The byte or word is shifted by up to 504 bits to be placed at the proper address. Stores go through the write buffer, which stores the data to memory first, before updating the cache.
* This may have to change to using two write cycles to the cache. It only works if one consistently reads / writes the same size of data to the last word on the line. Suppose one wants to write a word, but then read the high byte of that word. The high byte is actually stored in the previous cache line as the 65th byte. The byte on the current line won’t have been updated by the previous word write. It’s a detail that should be fixed. I suspect that many programs would work just fine, since the 6502 itself does only byte writes. This implementation, however, puts subroutine addresses on the stack by writing words. But unless one tries to manipulate the return address on the stack, that should be okay too. (In short, it’s just a rare case where it wouldn’t work.)
Quote:
(Note that the 65C02 clears the Decimal flag on interrupts. The 65816 might also, if I recall correctly.)
I had forgotten the 6502 doesn’t clear the decimal flag. I wonder how many 6502 programs would break if the decimal flag were cleared? I’d rather not have different code for the 6502 and 65C02 here. I can’t imagine that it would be very many. Usually one of the first things done in the ISR is CLD. If the flag is already clear it likely doesn’t matter.
Quote:
Note that JSR and BRK require different PC offsets, which is a pain. RTS increments the return address before using it, whereas RTI does not.
This has got me thinking that a hardware interrupt needs the actual PC value, as opposed to PC+2 for a BRK or JSR. This is not nice because the BRK code must then store one of two different values depending on whether it’s a hardware interrupt or not. Fortunately, there’s already special code for BRK, so it can be handled without adding micro-ops.
Now I’m wondering if there’s any software out there that uses values pushed on the stack by an NMI?

Thanks for your interest.

_________________
Robert Finch http://www.finitron.ca


Mon Dec 02, 2019 10:28 pm