AnyCPU http://anycpu.org/forum/ |
|
microop 6502 http://anycpu.org/forum/viewtopic.php?f=23&t=655 |
Page 1 of 6 |
Author: | robfinch [ Tue Nov 26, 2019 3:25 am ] |
Post subject: | microop 6502 |
I’m toying with the idea of a superscalar 6502. It would work by translating 6502 opcodes into micro-ops, in a manner similar to what’s done for the x86. So I need to work out an appropriate set of micro-ops. The micro-ops would form a load/store architecture with a fixed 13-bit instruction format. I think all instructions can be implemented with a maximum of four micro-ops. This means a table of 54 bits for each instruction (four 13-bit micro-ops plus 2 bits to indicate the number of micro-ops). The table would be indexed by the opcode byte, and a field (Ld3) in the micro-op instruction indicates when to take values from the macro-op instruction. This is necessary for constants supplied by macro-ops. Micro-op table follows: Attachment:
File comment: 6502 Micro-op table (microops.png)

Some sample instruction breakdowns:

Code:
pha
    SB acc,sp
    ADD.B sp,#-1
adc (zp),y
    LDW tmp,zp
    ADD tmp,y
    LDB tmp,[tmp]
    ADC.B acc,tmp
rti
    ADD.B sp,#1
    LDB sr,sp
    ADD.B sp,#2
    LDW pc,sp-1
rts
    ADD.B sp,#2
    LDW pc,sp-1
    ADD.W pc,#1
|
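The 54-bit figure above is four 13-bit micro-ops plus a 2-bit count. A minimal Python sketch of packing and unpacking one such table entry follows. The field order and the choice to store the count as "number of micro-ops minus one" (so that four fits in 2 bits) are assumptions made for illustration; the post does not say how the fields are laid out, and `pack_entry` / `unpack_entry` are hypothetical names.

```python
UOP_BITS = 13     # fixed micro-op width from the post
MAX_UOPS = 4      # at most four micro-ops per 6502 instruction
COUNT_BITS = 2    # field holding the number of micro-ops
ENTRY_BITS = MAX_UOPS * UOP_BITS + COUNT_BITS   # 4 * 13 + 2
UOP_MASK = (1 << UOP_BITS) - 1


def pack_entry(uops):
    """Pack one to four 13-bit micro-ops into a single table entry.

    Assumed layout: bits 0-1 hold (count - 1), followed by the
    micro-ops in order, lowest slot first.
    """
    if not 1 <= len(uops) <= MAX_UOPS:
        raise ValueError("an entry holds one to four micro-ops")
    entry = len(uops) - 1
    for slot, uop in enumerate(uops):
        if not 0 <= uop <= UOP_MASK:
            raise ValueError("micro-op does not fit in 13 bits")
        entry |= uop << (COUNT_BITS + slot * UOP_BITS)
    return entry


def unpack_entry(entry):
    """Recover the list of micro-ops from a packed table entry."""
    count = (entry & ((1 << COUNT_BITS) - 1)) + 1
    return [
        (entry >> (COUNT_BITS + slot * UOP_BITS)) & UOP_MASK
        for slot in range(count)
    ]
```

With this layout a full 256-entry table would need 256 × 54 bits, which is why widening every entry for a single long instruction is unattractive.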
Author: | robfinch [ Thu Nov 28, 2019 3:11 am ] |
Post subject: | Re: microop 6502 |
Have to rethink what’s going on with this core. It looks like it would have to be able to queue eight micro-ops per clock cycle (enough for two instructions); otherwise there’s no point in going superscalar. Queuing eight is a lot. I wonder if instead the four micro-ops composing an instruction could be queued as a single unit. Well, it’s queuing between two and eight micro-ops to a micro-op queue. The micro-op queue then feeds an issue queue at a rate of three micro-ops per clock cycle. The core will be three-way micro-op superscalar. This will allow up to three simple instructions (e.g. inx, cmp #) to execute per clock. More complicated instructions (those with an address mode) may take longer to execute. Two macro instructions will be fetched per clock cycle. A micro-op assembly language is being defined. The goal is that every instruction can be implemented with four or fewer micro-ops. The micro-op table has been set up for most instructions. With a little luck it will be possible to update this table at run time in order to allow new instructions to be defined. The micro-op instructions had to be expanded to 16 bits. Some thought is being given to enhancing the micro-op instruction set so it can be used to emulate processors other than the 6502. Initially only the 6502 instruction set will be supported. It should be trivial to support the 65C02 instruction set in the future. Supporting the 65816 is more difficult because of the different modes. As a wild guess, the author is estimating the core to be between 50,000 and 100,000 LCs. It should fit easily into the FPGA. |
Author: | BigEd [ Thu Nov 28, 2019 9:01 am ] |
Post subject: | Re: microop 6502 |
Have you seen this thread by Drass over on 6502.org? There might be some cross-fertilisation of ideas possible: |
Author: | robfinch [ Thu Nov 28, 2019 9:30 am ] |
Post subject: | Re: microop 6502 |
That’s an interesting thread. I just have to post there. I skimmed over it a while ago. Is it more than a Python emulator? I was surprised to see such good numbers for the CPI. I would’ve thought that the branches would cause quite a loss since there are seven stages flushed. I was only able to get down to about 1.5 CPI for a similarly pipelined processor. The design on 6502.org waits until the writeback stage for branches. I’m pretty sure it’s really only necessary to wait until the execute stage. I prefer to determine average CPI by measuring running software. Hardware glitches can cause the CPI to be inaccurate. |
Author: | Drass [ Fri Nov 29, 2019 2:30 am ] |
Post subject: | Re: microop 6502 |
Hi Rob, I saw your post regarding this on 6502.org and I thought I would provide some answers to your questions below. robfinch wrote: Is it more than a python emulator? Quote: I was surprised to see such good numbers for the CPI. I would’ve thought that the branches would cause quite a loss since there’s seven stages flushed. I was able only to get down to about 1.5 CPI for a similarly pipelined processor. Quote: The design on 6502.org waits until the writeback stage for branches. I’m pretty sure it’s really only necessary to wait until the execute stage. Quote: I prefer to determine average CPI by measuring running software. Hardware glitches can cause the CPI to be inaccurate. |
Author: | Drass [ Fri Nov 29, 2019 2:45 am ] |
Post subject: | Re: microop 6502 |
Regarding Microop, I’m not very familiar with superscalar architectures, so apologies in advance if these questions are naive ...
|
Author: | oldben [ Fri Nov 29, 2019 5:58 am ] |
Post subject: | Re: microop 6502 |
Will this version have the 6502/6800 clock 1 / clock 2 memory cycle, or some other memory interface for faster speeds? |
Author: | robfinch [ Fri Nov 29, 2019 7:42 am ] |
Post subject: | Re: microop 6502 |
Quote: I am assuming that microops are fed into a standard pipeline as they are dispatched. Is that so? You mention that three microops are dispatched per clock. Does that mean there are three parallel pipelines? The micro-ops are fed into a dispatch/issue queue. Multiple functional units operating in parallel sit between the issue queue and the re-order buffer (queue). They get used as they become available and as instructions are ready. It isn’t exactly like three parallel pipelines, but the functional units are used in parallel. One could say there are six parallel pipelines, but they aren’t identical. There are two ALUs, two address generators, a branch unit, and a memory unit that can handle two reads at the same time. It’s an out-of-order design, slightly different from an in-order design, which might have parallel pipelines. Quote: You mention that 6502 instructions can be implemented in four or fewer microops, and simple 6502 instructions like INX and CMP # can be implemented in a single microop. Do you have a sense of what the average CPI might be at three microops per clock? Quote: Does the CPU resolve dependencies between instructions by stalling, or by buffering and dispatching instructions only when their input operands are available? Quote: Will this version have the 6502/6800 clock 1 / clock 2 memory cycle, or some other memory interface for faster speeds? |
Author: | robfinch [ Sun Dec 01, 2019 3:27 am ] |
Post subject: | Re: microop 6502 |
Coding for rtf65004 (micro-op 6502) is coming along nicely. The code is a bit simpler than something like nvio because of the limited list of operations provided by the micro-ops. For instance, there are no call/jsr or rts instructions to deal with. These are all wrapped up by multiple micro-ops which translate into single jumps. Arrgh! The brk instruction requires six micro-ops. I got rid of using the pc as a target register for loads (which allowed indirect jumps). The author isn’t sure what to do about this. It’s the only instruction requiring six, and it’d be pretty wasteful to make the whole table six micro-ops wide just for one instruction.

Code:
ADDB sp,#-3        ; subtract 3 from stack pointer
STB sr,1[sp]       ; store status reg
STW pc,2[sp]       ; store program counter
SEI                ; set interrupt mask
LDW tmp,$FFFE      ; fetch break address
JMP $0000[tmp]     ; jump to break routine

For now, the brk case is specially checked for and the micro-ops are set manually into the queue, rather than looking them up from the table. There must be some special case logic for handling reset and hardware interrupts. A reset or hardware interrupt forces a brk instruction into the instruction stream. The only difference is which vector is used for the interrupt routine. Thinking about how to compact the micro-op table. The author thinks it could be done with a little bit of indirection. Also thinking about how the design could be extended to the 65816 and beyond. |
Author: | robfinch [ Mon Dec 02, 2019 3:38 am ] |
Post subject: | Re: microop 6502 |
Worked on the status register commit. Status register commit is interesting because it uses a bitmask of the bits to update in the status register. This way only the bits of the register that the instruction affects get updated. Many 6502 instructions only update N and Z, for instance. Worked on the branch unit. Chopped a bunch of code off the nvio branch unit since there are only two types of flow control transfers: a jump or a branch. All the code to support calls, returns, breaks and rti could be removed. For comparison: up to about 4,500 lines now in the main file. nvio3 is about 8,300 lines. And getting close to running through the synthesizer. |
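The masked status register commit described above can be sketched in a few lines of Python. The flag bit positions are the standard 6502 ones; the function names `commit_sr` and `nz_flags` are hypothetical and are not taken from the rtf65004 source.

```python
# Standard 6502 status register bit positions.
FLAG_N = 0x80  # negative
FLAG_V = 0x40  # overflow
FLAG_D = 0x08  # decimal mode
FLAG_I = 0x04  # interrupt disable
FLAG_Z = 0x02  # zero
FLAG_C = 0x01  # carry


def commit_sr(sr, new_flags, mask):
    """Update only the status bits selected by mask; keep the rest."""
    return ((sr & ~mask) | (new_flags & mask)) & 0xFF


def nz_flags(result):
    """Compute the N and Z flags for an 8-bit result."""
    result &= 0xFF
    flags = result & FLAG_N
    if result == 0:
        flags |= FLAG_Z
    return flags


# Example: a load of zero updates N and Z but leaves C and V untouched.
sr_before = FLAG_C | FLAG_V
sr_after = commit_sr(sr_before, nz_flags(0x00), FLAG_N | FLAG_Z)
```

Carrying the mask with each micro-op means the commit logic never has to know which instruction produced the flags, only which bits it is allowed to write.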
Author: | Drass [ Mon Dec 02, 2019 4:16 am ] |
Post subject: | Re: microop 6502 |
I noticed the BRK sequence above uses a store byte followed by a store word. What happens if the store word for the return address crosses a word boundary on the stack? Does the write happen over two cycles? |
Author: | robfinch [ Mon Dec 02, 2019 6:42 am ] |
Post subject: | Re: microop 6502 |
Quote: I noticed the BRK sequence above uses a store byte followed by a store word. What happens if the store word for the return address crosses a word boundary on the stack? Does the write happen over two cycles?

It may. The memory unit takes care of unaligned accesses. The external bus currently uses 128-bit accesses (mainly for cache line loads), so it would only require two accesses if the word stored crosses a 128-bit boundary. Most of the time (15/16) it doesn’t. Another related issue is the stack wrapping over/under between $100 and $1FF. The second write must be checked to see if it wraps. This isn’t being done yet. I want to get the core basically working, then worry about corner cases. I thought about using individual byte reads and writes for everything, but that would slow the core down and increase the number of micro-ops required. Also thought about reading or writing 24-bit values, which is appealing to support the 65816 [] modes. For the 65816 there’s also the program bank register to take care of, so the number of micro-ops for BRK and RTI is increased (unless 24-bit values are supported). The approach I use is to get the gist of it done, then iron out the details later. Moving from high level to lower level. It’s only possible to use this approach because ripping up the circuits and re-wiring them is cheap in the FPGA. There is also a need to store and read the PC+2. PC+2 is stored by the BRK / JSR instructions. I was planning on supporting it as a special register read as opposed to using micro-ops to increment the pc. The BRK micro-op list should really include this.

Code:
ADDB sp,#-3        ; subtract 3 from stack pointer
STB sr,1[sp]       ; store status reg
STW pc2,2[sp]      ; store program counter plus two
SEI                ; set interrupt mask
LDW tmp,$FFFE      ; fetch break address
JMP $0000[tmp]     ; jump to break routine

It may be possible to get rid of the SEI micro-op by integrating it with another one. For instance JMP and SEI could be combined into a new micro-op JSI.
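The two corner cases mentioned earlier in this post (a word store crossing a 128-bit bus boundary, and a word store wrapping around the stack page) can be sketched in Python as follows. The helper names are hypothetical, and the 15/16 figure only holds under the assumption that store addresses are evenly spread over the sixteen byte positions.

```python
BUS_BYTES = 16  # a 128-bit external bus access covers 16 bytes


def crosses_bus_boundary(addr, size=2):
    """True if an access of `size` bytes at addr spans two bus accesses."""
    return (addr % BUS_BYTES) + size > BUS_BYTES


def stack_word_addresses(sp, offset):
    """Byte addresses touched by a word access at offset[sp].

    Each byte address is wrapped separately so it stays inside the
    6502 stack page ($100 to $1FF).
    """
    low = 0x100 | ((sp + offset) & 0xFF)
    high = 0x100 | ((sp + offset + 1) & 0xFF)
    return low, high


# Of the sixteen byte positions, only the last one makes a word
# store straddle two 128-bit accesses.
crossing_positions = [a for a in range(BUS_BYTES) if crosses_bus_boundary(a)]
```

For example, `stack_word_addresses(0xFD, 2)` gives `(0x1FF, 0x100)`: the second byte wraps to the bottom of the stack page, which is the case the second write would have to detect.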
To support the 65816 the m and x status bits would be combined with the opcode to form a 10-bit index into the micro-op list table. There would be 1024 table entries instead of 256, with a number of redundant entries. The neat thing about the micro-ops is that maybe only a handful need to be defined and the remaining ones could be loaded during the reset routine. |
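A minimal Python sketch of forming that 10-bit index follows. Placing m in bit 9 and x in bit 8 is an assumption made for illustration; the post only says the two status bits are combined with the opcode, and `table_index` is a hypothetical name.

```python
def table_index(opcode, m, x):
    """Combine the 65816 m and x status bits with the opcode byte.

    Assumed layout: bit 9 = m, bit 8 = x, bits 7-0 = opcode.
    """
    return ((m & 1) << 9) | ((x & 1) << 8) | (opcode & 0xFF)


# Every (m, x, opcode) combination maps to a distinct entry.
all_indices = {
    table_index(opcode, m, x)
    for m in (0, 1)
    for x in (0, 1)
    for opcode in range(256)
}
```

Opcodes whose behaviour does not depend on m or x would simply hold the same micro-op sequence in all four of their entries, which is where the redundancy comes from.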
Author: | Drass [ Mon Dec 02, 2019 2:13 pm ] |
Post subject: | Re: microop 6502 |
robfinch wrote: The memory unit takes care of unaligned accesses. The external bus currently uses 128-bit accesses (mainly for cache line loads) so it would only require two accesses if the word stored crosses the 128-bit boundary. Most of the time (15/16) it doesn’t. Quote: Another related issue is stack wrapping over/under from $100 to $1FF. The second write must be checked to see if it wraps. Quote: I thought about using individual byte reads and writes for everything, but that would slow the core down and increase the number of micro-ops required. I had originally hesitated to implement 16-bit writes because each byte address then has to be checked specifically to identify RAW dependencies in the pipeline. I am trying to keep the number of comparators in the hazard logic manageable, but the performance benefit seems worth it in this case. Quote: Also thought about reading or writing 24-bit values which is appealing to support the 65816 [] modes. For the 65816 there’s also the program bank register to take care of, so the number of micro-ops for BRK and RTI is increased (unless 24 bit values are supported). Quote: The approach I use is to get the gist of it done, then iron out the details later. Moving from high-level to lower level. It’s only possible to use this approach because ripping up the circuits and re-wiring them is cheap in the FPGA. Quote: There is also a need to store and read the PC+2. PC+2 is stored by the BRK / JSR instructions. I was planning on supporting it as special register read as opposed to using micro-ops to increment the pc. The BRK micro-op list should really include this.

Code:
ADDB sp,#-3        ; subtract 3 from stack pointer
STB sr,1[sp]       ; store status reg
STW pc2,2[sp]      ; store program counter plus two
SEI                ; set interrupt mask
LDW tmp,$FFFE      ; fetch break address
JMP $0000[tmp]     ; jump to break routine

Quote: It may be possible to get rid of the SEI micro-op by integrating it with another one.
For instance JMP and SEI could be combined into a new micro-op JSI. Quote: To support the 65816 the m and x status bits would be combined with the opcode to form a 10-bit index into the micro-op list table. There would be 1024 table entries instead of 256, with a number of redundant entries. Quote: The neat thing about the micro-ops is that maybe only a handful need to be defined and the remaining ones could be loaded during the reset routine. |
Author: | GaBuZoMeu [ Mon Dec 02, 2019 9:02 pm ] |
Post subject: | Re: microop 6502 |
If I understand this (highly interesting project and discussion) correctly, there is a mechanism that reads a 6502 program and pushes a set of micro-ops for each instruction it encounters into a queue for execution. Then it continues fetching the next instruction, and so on. Special cases are branches, as you cannot predict whether one is taken or not, and so where the next instruction is. Less special are jumps (even indirect ones), returns and the BRK. As the BRK exceeds the four micro-op limit, what about 'flagging' this one out: the first time you encounter a BRK opcode, you set a special flag and push the first half of the micro-ops into the queue. The special flag inhibits the fetch PC from advancing, causing the BRK to be read a second time. This time the flag is already set, so you select a second micro-op sequence and reset the special flag (which in turn allows the PC to be incremented). This would be just one more line in the opcode list. May be nonsense. Cheers, Arne |
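Arne's two-pass idea can be sketched as a small Python model of the decoder. This is only an illustration: the micro-ops are represented as plain strings, the three-and-three split of the BRK sequence is an assumption, and `Decoder`, `BRK_FIRST_HALF` and `BRK_SECOND_HALF` are hypothetical names.

```python
BRK_OPCODE = 0x00

# The six BRK micro-ops from earlier in the thread, split into two
# halves so each half fits the four-micro-op table width.
BRK_FIRST_HALF = ["ADDB sp,#-3", "STB sr,1[sp]", "STW pc2,2[sp]"]
BRK_SECOND_HALF = ["SEI", "LDW tmp,$FFFE", "JMP $0000[tmp]"]


class Decoder:
    """Translate opcodes to micro-op lists, reading BRK twice."""

    def __init__(self, table):
        self.table = table          # opcode -> list of micro-ops
        self.brk_flag = False       # set between the two BRK passes

    def step(self, opcode):
        """Return (micro_ops, advance_pc) for one fetched opcode."""
        if opcode == BRK_OPCODE:
            if not self.brk_flag:
                # First pass: queue the first half and hold the fetch PC.
                self.brk_flag = True
                return BRK_FIRST_HALF, False
            # Second pass: queue the rest and release the fetch PC.
            self.brk_flag = False
            return BRK_SECOND_HALF, True
        return self.table[opcode], True


# Usage: INX (opcode $E8) as a single, illustrative micro-op.
decoder = Decoder({0xE8: ["ADD.B x,#1"]})
first = decoder.step(BRK_OPCODE)
second = decoder.step(BRK_OPCODE)
third = decoder.step(0xE8)
```

The cost is one extra table line and one flag bit, at the price of BRK occupying the fetch slot for two cycles.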
Author: | robfinch [ Mon Dec 02, 2019 10:28 pm ] |
Post subject: | Re: microop 6502 |
Quote: So the store-word from the core gets buffered without concern for alignment. Then the memory unit completes the write to the cache, and may take two cycles to do it. Is that right? Writes to external memory are 128-bit wide, but I assume writes to the cache are one word at a time, correct? How wide is your external bus? Thank you for these informative answers. * = This may have to change to using two write cycles to the cache. It only works if data of the same size is consistently read from and written to the last word on the line. Suppose one wants to write a word, but then read the high byte of the word. The high byte of the word is actually being stored in the previous cache line as the 65th byte. The byte on the current line won’t have been updated by a previous word write. It’s a detail that should be fixed. I suspect that many programs would work just fine since there are only byte writes on the 6502. This implementation however puts subroutine addresses on the stack by writing words. However, unless one tries to manipulate the return address on the stack, that should be okay too. (In short, it’s just a rare case where it wouldn’t work.) Quote: (Note that the 65C02 clears the Decimal flag on interrupts. The 65816 might also, if I recall correctly). I had forgotten the 6502 doesn’t clear the decimal flag. I wonder how many 6502 programs would break if the decimal flag were cleared? I can’t imagine that it would be very many, and I’d rather not have different code for the 6502 and 65C02 here. Usually one of the first things done in the ISR is CLD. If it’s already clear it likely doesn’t matter. Quote: Note that JSR and BRK require different PC offsets, which is a pain. RTS increments the return address before using it, whereas RTI does not. Now I’m wondering if there’s any software out there that uses values pushed on the stack by an NMI? Thanks for your interest. |
All times are UTC |