 microop 6502 
Author Message

Joined: Fri May 05, 2017 7:39 pm
Posts: 21
Quote:
There must be some special case logic for handling reset and hardware interrupts. A reset or hardware interrupt forces a brk instruction into the instruction stream. The only difference is which vector is used for the interrupt routine.
The stacked status register has the B bit cleared when a hardware interrupt occurs. The pushed B flag is set in the case of a BRK.
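To make the B-flag rule concrete, here is a minimal sketch of how the pushed status byte could be formed. The function name and the `is_brk` parameter are my own for illustration and are not taken from the core's source; the bit positions are the standard 6502 ones.

```python
# Status-register bit positions on the 6502.
FLAG_B = 0x10       # "break" flag, meaningful only in the pushed copy
FLAG_UNUSED = 0x20  # bit 5, always pushed as 1


def stacked_status(p, is_brk):
    """Return the status byte pushed on the stack.

    is_brk=True models a BRK instruction, is_brk=False models a
    hardware interrupt (IRQ or NMI). Names are illustrative only.
    """
    value = (p | FLAG_UNUSED) & 0xFF
    if is_brk:
        return value | FLAG_B          # BRK: B set in the pushed byte
    return value & ~FLAG_B & 0xFF      # IRQ/NMI: B clear in the pushed byte
```

An interrupt handler can then tell the two cases apart by pulling the stacked byte and testing bit 4.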


Mon Dec 02, 2019 11:53 pm
Profile

Joined: Fri Nov 29, 2019 2:09 am
Posts: 6
robfinch wrote:
The cache line is actually 64+1 bytes (520 bits) wide. The +1 is for a data word that spans two cache lines. When updating the D$ due to a write hit it’s only a single cycle.* The byte or word is shifted up to 504 bits to be placed at the proper address. Stores go through the write buffer which stores the data to memory first, before updating the cache.
Thanks for this explanation. Very helpful.
Quote:
The byte on the current line won’t have been updated by a previous word write. It’s a detail that should be fixed. I suspect that many programs would work just fine since there’s only byte writes on the 6502. This implementation however puts subroutine addresses on the stack by writing words. However, unless one tries to manipulate the return address on the stack, that should be okay too. (In short it’s just a rare case where it wouldn’t work).
It is certainly not uncommon for a program to explicitly push an address on the stack and then RTS / RTI through it. 6502 programmers are a wild and crazy bunch; everything and anything is fair game if it saves a cycle.

Quote:
I had forgotten the 6502 doesn’t clear the decimal flag. I wonder how many 6502 programs would break if the decimal flag were being cleared? I’d rather not have different code for the 6502 and 65C02 here. I can’t imagine that it would be very many. Usually one of the first things done in the ISR is CLD. If it’s already clear it likely doesn’t matter.
You’re right, of course. I think you sort of have to decide how far you’re willing to go down the compatibility road. A pipelined or out-of-order design will necessarily fall short of full compatibility in that it won’t be cycle accurate. So the question is then what other compromises one is prepared to tolerate. The ISR’s implicit CLD seems like a reasonable bet. That said, if you have a special micro-op for the ISR-SEI (combined with another micro-op), why not have that also check a mode bit and perform a CLD if necessary? That’s what I did for the C74-6502 (which runs both 6502 and 65C02 code as well).

Quote:
This has got me thinking that a hardware interrupt needs the actual pc value as opposed to pc+2 for a BRK or JSR. This is not nice because the BRK code must then store one of two different values depending on whether it’s a hardware interrupt or not. Fortunately, there’s already special code for BRK so it can be handled without adding micro-ops.
Yes, in fact BRK pushes PC+2, JSR pushes PC+1 and a hardware interrupt pushes PC. You can change the semantics of course, but that certainly breaks any code that manipulates (or synthesizes) the return address.
Quote:
Now I’m wondering if there’s any software out there that uses values pushed on the stack by an NMI?
Unfortunately, my experience with this kind of thing is that the answer is always “yes”. (Remember it’s also programs that manufacture a return address and then execute an RTI).


Tue Dec 03, 2019 3:42 am
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 974
Location: Canada
Taking care of some details tonight like having word accesses to the stack wrap around at $01FF. Fixing the data cache update for two write cycles when a word write takes place at the end of a cache line. Setting the BRK flag for BRK’s, clearing decimal mode on interrupts, setting the interrupt mask.
Decimal mode isn’t supported yet.
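As an illustration of the stack wrap-around, a word access that starts at $01FF has to place its second byte at $0100 rather than $0200. A small sketch of the address calculation, with a helper name made up for the example:

```python
STACK_PAGE = 0x0100  # the 6502 stack lives in page 1


def stack_word_addresses(addr):
    """Return (low_byte_addr, high_byte_addr) for a word access in page 1.

    Only the low eight address bits are incremented, so the second
    byte wraps from $01FF to $0100. The helper name is hypothetical.
    """
    low = STACK_PAGE | (addr & 0xFF)
    high = STACK_PAGE | ((addr + 1) & 0xFF)
    return low, high
```

For example, `stack_word_addresses(0x01FF)` gives `(0x01FF, 0x0100)`, while an access at $01FC gives the two adjacent bytes $01FC and $01FD.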

After another marathon coding session, first synthesis: 58,000 LC’s. It's bound to grow by a few LC's as fixes are made.

_________________
Robert Finch http://www.finitron.ca


Tue Dec 03, 2019 4:23 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 974
Location: Canada
Just re-ran synthesis at a higher optimization level: 28,000 LC's.

_________________
Robert Finch http://www.finitron.ca


Tue Dec 03, 2019 4:28 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 974
Location: Canada
Still coding, calling it a night. Got into simulating the core. Lots of X’s to resolve yet, but the core can be seen loading the micro-op queue with the reset micro-ops. It then tries to queue to the issue queue but gets caught up in a loop.
Quote:
As the BRK exceeds the 4 micro-op limit, what about 'flagging' this one out: if you encounter a BRK opcode the first time you set a special flag and push the first half of the micro-ops into the queue. The special flag inhibits the reading PC from advancing, causing the BRK to be read a second time. This time the flag is already set, so you select a second micro-op sequence and reset the special flag (which in turn allows the PC to be incremented). This would be just one more line in the opcode list.
I was thinking along those lines as a potential solution. And maybe having a micro-op indicator that says 'start fetching from this table entry'. I think this may have to be revisited in the future to support the 65816 and beyond.

_________________
Robert Finch http://www.finitron.ca


Tue Dec 03, 2019 10:31 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 974
Location: Canada
Starting to see some progress in simulation. Micro-ops from the micro-op queue make it over to the issue queue. Some instructions execute, but it hangs on a memory load operation which just happens to be the vector load for the reset. It may be due to X’s in the memory cells of the tagram. Some more fixes to the data cache are required.
Forgot to consider what happens to the micro-op queue when a branch miss occurs. I think it’s acceptable to flush the entire micro-op queue because instructions in that queue are guaranteed to occur after the branch. Flushing the issue queue is more complex because it’s a ring, and depending on where the branch instruction is in the queue only part of the queue should be flushed.
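A rough sketch of the partial flush of a ring-shaped issue queue follows. It only models the valid bits; the slot numbering, the `tail` convention (one past the youngest entry) and the function name are assumptions for the example, not the core's actual signals.

```python
def flush_after_branch(valid, branch_idx, tail):
    """Invalidate every ring entry younger than the mispredicted branch.

    valid      -- list of booleans, one per issue-queue slot
    branch_idx -- slot holding the mispredicted branch (it is kept)
    tail       -- slot one past the youngest queued entry
    Returns the new valid list and the new tail. Names are illustrative.
    """
    size = len(valid)
    flushed = list(valid)
    slot = (branch_idx + 1) % size
    while slot != tail:
        flushed[slot] = False          # younger than the branch: discard
        slot = (slot + 1) % size
    new_tail = (branch_idx + 1) % size  # queueing resumes right after the branch
    return flushed, new_tail
```

With eight slots, a branch in slot 6 and the tail at slot 2, only slots 7, 0 and 1 are flushed; the older entries in slots 2 to 5 are left alone.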
Got past the reset jump now. The code vectors to the start of the ROM at $8000 successfully. It then processes a SEC instruction and then gets confused by a 65816 opcode. I had used the 65816 boot ROM for test instructions but what’s really needed is a 6502 boot ROM. Or I suppose the design could be extended to support some 65816 instructions.
30 micro-ops get queued in 8 clock cycles, then performance goes kaput due to stalls. The issue queue fills up with instructions then the micro-op queue gets stalled while the instructions are dispatched and executed.

Reset BRK instructions are streamed into the core until a reset jump takes place.
Code:
------------------------------------------------------------------------ Micro-op Buffer -----------------------------------------------------------------------
..  0: v- 0002 0c7b 0001 0 #
..  1: v- 0002 10a3 0002 0 #
..  2: vo 0002 0ba8 fffc 0 #
..  3: vo 0002 a405 0000 0 #
..  4: vo 0004 1758 fffd 0 #
..  5: vo 0004 0c7b 0001 0 #
..  6: vo 0004 10a3 0002 0 #
..  7: vo 0004 0ba8 fffc 0 #
..  8: vo 0004 a405 0000 0 #
..  9: vo 0006 1758 fffd 0 #
.. 10: vo 0006 0c7b 0001 0 #
.. 11: vo 0006 10a3 0002 0 #
.. 12: vo 0006 0ba8 fffc 0 #
.. 13: vo 0006 a405 0000 0 #
HT 14: vo 0000 a405 0000 0 #
.. 15: vo 0002 1758 fffd 0 #
------------------------------------------------------------------------ Dispatch Buffer -----------------------------------------------------------------------
..  0: Q-- 0 0 -- - M02 LDW  5,5 fffc 0000 14 0  8 0000 0 1 15 04 0 1 15 0000 15 #
CQ  1: A-- 0 1 a- - M03 STB  7,7 0001 0004  1 1 15 01fc 3 1  0 04 3 1 15 fffc 01 #
..  2: A-- 0 0 a- - M04 STW  4,4 0002 fffc  2 1 15 01fc 3 1  0 04 3 1 15 fffc 02 #
..  3: A-- 0 0 a- - M02 LDW  5,5 fffc 0000 14 1 15 0000 0 1 15 04 0 1 15 fffc 03 #
..  4: Q-- 0 0 -- - F29 JSI  0,0 0000 00xx  0 1 15 0000 5 0  3 04 5 1 15 fffc 04 #
..  5: Cd- 0 0 -- - A05 ADDB 3,3 fffd 01fc 13 1  0 0000 0 1 15 04 0 1 15 fffe 05 #
..  6: O-o 0 0 -- - M03 STB  7,7 0001 0004  1 1 15 01f9 3 1  5 04 3 1 15 fffe 06 #
..  7: A-- 0 0 a- - M04 STW  4,4 0002 fffe  2 1 15 01f9 3 1  5 04 3 1 15 fffe 07 #
..  8: Q-- 0 0 -- - M02 LDW  5,5 fffc 0000 14 0  3 0000 0 1 15 04 0 1 15 fffe 10 #
..  9: Q-- 0 0 -- - F29 JSI  0,0 0000 00xx  0 1 15 0000 5 0  8 04 5 1 15 fffe 11 #
.. 10: Cd- 0 0 -- - A05 ADDB 3,3 fffd 01fc 13 0  5 0000 0 1 15 04 0 1 15 0000 12 #
.. 11: Q-- 0 0 -- - M03 STB  7,7 0001 0004  1 1 15 01f6 3 1 10 04 3 1 15 0000 13 #
.. 12: Q-- 0 0 -- - M04 STW  4,4 0002 0000  2 1 15 01f6 3 1 10 04 3 1 15 0000 14 #

Q = instruction queued, A = address generated, C=instruction ready to commit, O=instruction out (in progress), M = memory operation taking place

_________________
Robert Finch http://www.finitron.ca


Wed Dec 04, 2019 5:42 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 974
Location: Canada
Worked on the memory issue logic today. I have been meaning to try and clean it up for a while.
Memory stores no longer wait if the address is the same, provided the lane select signals are different. Previously if the address was in the same 128-bit memory region, a subsequent store would stall the memory operation. Loads were also switched to ignore address overlaps with other loads.
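Roughly, the new store rule can be pictured as below. This is a sketch only: it assumes one lane-select bit per byte of the 128-bit region and accesses that do not cross a region boundary, and the function names are made up for the example.

```python
REGION_BYTES = 16  # one 128-bit memory region


def lane_mask(addr, size):
    """Byte-lane select bits for an access of `size` bytes at `addr`.

    Assumes the access stays inside a single 16-byte region.
    """
    offset = addr % REGION_BYTES
    return (((1 << size) - 1) << offset) & 0xFFFF


def store_must_wait(addr_a, size_a, addr_b, size_b):
    """True if store B has to wait for an earlier store A.

    Old rule: wait whenever both stores hit the same region.
    New rule (modelled here): wait only if their byte lanes overlap.
    """
    same_region = addr_a // REGION_BYTES == addr_b // REGION_BYTES
    lanes_overlap = (lane_mask(addr_a, size_a) & lane_mask(addr_b, size_b)) != 0
    return same_region and lanes_overlap
```

Two byte stores to $01FC and $01FD fall in the same region but use different lanes, so the second no longer stalls.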

Simulation fails because the micro-op table is screwed up. The contents of the table aren’t as defined in the code. This looks like an issue with the simulator tool or possibly bad memory. It runs Fibonacci up to the point where there’s a BNE and the BNE instruction doesn’t get translated into the proper micro-ops. So, I dump the table in sim and voila it’s not correct.
If I set up the table differently, it might work. The table is currently hand-coded in the main source file using preprocessor definitions. Changing the table to be an IP catalogue component with the contents of the memory defined by a .coe file might give different results. Doing this is a lot more work; a tool would probably need to be developed to define the table rows.

Some stats:
Code:
 instructions committed:               6 valid committed:               0 ticks:         61
micro-ops queued:         43   instr. queued:         18

Using 61 clock cycles, six instructions were completely executed and stored to the machine state. 18 instructions were queued, which means 12 are still in the process of being executed. The 18 instructions were represented by 43 micro-ops. The CPI at this point (counting only completely finished instructions) is about 10. This is a little skewed because the cache is cold and not enough instructions have been executed.
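For reference, the figures above work out as follows. The variable names are just labels for the counters in the stats dump.

```python
# Counter values copied from the stats dump above.
ticks = 61
instructions_committed = 6
instructions_queued = 18
micro_ops_queued = 43

cpi = ticks / instructions_committed                    # clocks per committed instruction
in_flight = instructions_queued - instructions_committed
uops_per_instr = micro_ops_queued / instructions_queued

print(f"CPI {cpi:.1f}, in flight {in_flight}, micro-ops per instruction {uops_per_instr:.2f}")
```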
Just glad to see the things at least partially working.
Pondering what a micro-op table generator program looks like.

_________________
Robert Finch http://www.finitron.ca


Thu Dec 05, 2019 4:24 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 974
Location: Canada
Well, the simulation has got further. The previous issue with the micro-op table turned out to be using the incorrect macro for a field in the table. Fibonacci has been simulated to the end of the loop. The CPI turned out to be about 3. The clocks per micro-op is about 1.5.
All the micro-ops were writing to the selected target register all the time when they committed. This included micro-ops that should not have been updating registers like: stores, nop, and compare. This caused the accumulator register to be zeroed out periodically. The solution was to add a register file write enable signal to the commit bus.
Missed the fact that on a branch miss the status register source must be set back to the last valid source instruction in the queue. This is similar to what is required for target register sources.
There seems to be a glitch in the update of the status register. It almost works but seems to update too late sometimes. For some reason the CLC instruction also clears the N flag.

Some questionable stats. It looks like something is screwy with the performance counters.
Code:
 micro-ops committed:         84  ticks:        240
micro-ops queued:        305   instr. queued:        265
micro-ops stomped by branches:        134
I$ load stalls cycles:         59

_________________
Robert Finch http://www.finitron.ca


Fri Dec 06, 2019 3:12 am
Profile WWW

Joined: Fri Nov 29, 2019 2:09 am
Posts: 6
That’s great progress!
The main drag on performance in the pipeline model I am working on is branch mis-predictions. Is that the case here as well? What kind of prediction scheme are you using?


Fri Dec 06, 2019 12:15 pm
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 974
Location: Canada
Quote:
The main drag on performance in the pipeline model I am working on is branch mis-predictions. Is that the case here as well? What kind of prediction scheme are you using?

It seems to be. 1/3 of the micro-ops are simply discarded due to branches. I need to get a larger test case running though to get a more accurate picture. If there’s a branch instruction the next address is predicted by the branch target buffer in the same clock cycle that the instruction is fetched. During the next clock cycle the branch predictor may override the address supplied by the BTB. The branch predictor is a (2,2) correlating predictor which uses a two-bit global branch history. There are 512 entries in the history table.
Compared to a 6502 which never misses on a branch, a lot of performance is lost; it’s the cost of using an advanced pipeline. The 6502 is almost a memory-to-memory processor. Because it does so many memory based operations, it’s hard to get better performance out of the ISA. A small 6502 core with dedicated high-speed multi-port memory might just outperform a superscalar version. The benefit of a superscalar design would come from extending the design, for instance an increased address space requires caches. Adding more registers would alleviate the need for memory operations. Adding advanced operations like divide or reciprocal would help with some software. At some point it becomes a non-6502.
The superscalar design gains performance in a few ways. 1) It fetches all the bytes for two instructions at once in a single clock cycle. It might take the 6502 as many as six clock cycles to do the same thing. 2) Because of the use of caches, data can be accessed at the same time as code. The 6502 must wait for its single memory port. 3) Sometimes instructions that are not interdependent can execute in parallel. 4) Memory latency can be hidden by out-of-order execution of instructions.
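For anyone unfamiliar with a (2,2) correlating predictor, here is a small behavioural sketch: two bits of global history select among two-bit saturating counters. How the 512 entries are indexed is an assumption on my part (low pc bits concatenated with the history); the real core may split the table differently.

```python
class CorrelatingPredictor:
    """Behavioural model of a (2,2) predictor; indexing is assumed."""

    def __init__(self, entries=512, history_bits=2):
        self.entries = entries
        self.history_bits = history_bits
        self.history = 0                      # global history of recent outcomes
        self.counters = [1] * entries         # start weakly not-taken

    def _index(self, pc):
        pc_slots = self.entries >> self.history_bits   # 128 pc slots for 512 entries
        return ((pc % pc_slots) << self.history_bits) | self.history

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2     # True means "taken"

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask
```

A loop branch that is always taken trains the counters for its history pattern within a few iterations, after which `predict` returns taken for it.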

_________________
Robert Finch http://www.finitron.ca


Sat Dec 07, 2019 4:19 am
Profile WWW
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 209
Location: Girona-Catalonia
I find this quite fascinating. I can't contribute much (or anything) because most details are beyond my area of knowledge, but I am able to understand the basic concepts exposed here and it's interesting that the two approaches being compared here (Drass's pipelined 6502, and your microop 6502) are finding similar performance constraints, and facing similar problems. To some extent, this is also a practical lesson of history for me, because I can imagine that what's discussed here influenced the evolution of 8 bit processors into the 64 bit ones that we all currently have in our 'modern' computers.


Sat Dec 07, 2019 8:36 am
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 974
Location: Canada
Worked on the branch prediction today which I found out was 100% inaccurate. 0 of 14 branches were predicted correctly.
The branch predictor was only accurate for branches at even addresses. It seems the LSB was used to record the branch state since the LSB of the pc for the last core using the predictor was always zero. Anyway, I moved the branch state to its own bit. Also corrected: the branch taken / not taken bit in the queue was being supplied from the wrong source. It was supplied with state from the fetch stage, and it should’ve been coming from the micro-op queue. This led to the prediction status being set for the wrong instructions. Fixing these items results in 11/14 or 79% accuracy. I believe the predictor works now, but the sample size is too small.
Also worked on the sequence numbering of instructions. I found a way to simplify it and, I think, reduce the hardware. A small (7-bit) sequence number is used. All the sequence number entries are checked for bit 6 being set, which indicates the number is getting too large (64 or more). If bit 6 of any sequence number is set, then 16 is subtracted from all the sequence numbers. The check for bit 6 is just a single bit test for each queue entry. Just a two-bit subtractor for each queue entry is required. Previously the sequence number was decremented by the sequence number of the latest committing instruction. This sometimes led to underflow and meant that larger subtractors were required. Note the core is parameterized so the size of the sequence number depends on the number of queue entries.
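In software terms the renumbering amounts to something like the sketch below. It assumes that by the time any number reaches 64 no live entry is still below 16, which is what keeps the subtraction from underflowing; the function name is made up for the example.

```python
SEQ_BITS = 7
OVERFLOW_BIT = 1 << (SEQ_BITS - 1)   # bit 6 of a 7-bit sequence number
STEP = 16                            # amount removed from every entry


def renormalize(seq_numbers):
    """Subtract 16 from all sequence numbers if any has bit 6 set.

    Relative order is preserved because every entry moves by the same
    amount. Results are masked to 7 bits like the hardware field.
    """
    if any(s & OVERFLOW_BIT for s in seq_numbers):
        return [(s - STEP) & ((1 << SEQ_BITS) - 1) for s in seq_numbers]
    return list(seq_numbers)
```

For example, `[50, 64, 63]` becomes `[34, 48, 47]`, and a list with no entry at 64 or above is returned unchanged.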
The core seems to work better without results forwarding enabled. This lowers performance by a good percentage because it takes an extra cycle for data to appear for a dependent instruction. There are still some issues with the results forwarding to be worked out.

I ran the memcpy routine as a test. It turns out the CPI is pretty awful. About 5.5. According to Drass’s sim the CPI of the 6502 is about 4 for the same program. So why? The 6502 does everything in one long clock cycle. The micro-op based solution is done in a pipelined fashion taking multiple clocks. Hopefully the fmax is higher. Similar to the difference between a 6502 and a Z80. The solution is also doing things that the typical 6502 wouldn’t. It’s multiplexing between caches and main memory for instance.
Code:
 micro-ops committed:        621  ticks:       1860
micro-ops queued:        723
micro-ops stomped by branches:         54
instr. queued:        352
instr. committed:        307
I$ load stalls cycles:         44
Branch override BTB:         84
Total branches:         74
Missed branches:          2

The CPI per micro-op is about 3. It should be closer to 1. I have a sneaking suspicion that it’s the memory system. It’s not branches; it missed only twice out of 74 times (97% accurate). An issue is that the micro-ops are dependent on one another, which slows things down. The indirect address mode must load the address for the next micro-op before that one can execute.
The CPI could be improved by removing some of the pipelining in the memory system, making it more like the 6502. The fmax would likely go down.

It takes 1 clock cycle to load the cache address driving latch from the queue and move to the BUSY state (clock 1). While in the BUSY state the data read hit signal is checked and, if true, a transition is made to the READY state (clock 2). By the end of the READY state the RAM output latches are loaded with aligned data from either the dcache or external memory (clock 3). Data is then loaded onto the result bus in clock 4. So, it takes a minimum of four clocks to read data.
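The four-clock minimum can be checked with a toy model of the load path. The state names IDLE and RESULT and the `miss_cycles` parameter are invented for the sketch; only BUSY and READY come from the description above.

```python
def clocks_to_load(miss_cycles=0):
    """Count clocks from issue until data is on the result bus.

    miss_cycles is a hypothetical number of extra clocks spent in
    BUSY waiting for the data read hit signal (0 = cache hit).
    """
    state = "IDLE"
    clocks = 0
    waiting = miss_cycles
    while state != "DONE":
        clocks += 1
        if state == "IDLE":        # clock 1: load address latch, go BUSY
            state = "BUSY"
        elif state == "BUSY":      # clock 2: hit signal checked
            if waiting == 0:
                state = "READY"
            else:
                waiting -= 1
        elif state == "READY":     # clock 3: output latches loaded
            state = "RESULT"
        elif state == "RESULT":    # clock 4: data on the result bus
            state = "DONE"
    return clocks
```

With a hit the model returns 4; every clock spent waiting in BUSY adds one more.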

_________________
Robert Finch http://www.finitron.ca


Sun Dec 08, 2019 3:40 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 974
Location: Canada
The code is working well enough that I decided to try extending it. There is now some support for 24-bit code address space. JML, JSL, RTL instructions are supported. I decided to modify the micro-ops to be 18 bits wide which gives two more bits to represent register fields. It is now possible to select up to 16 different registers, special registers or constants.
I also found out that stores were only issuing at the head of the queue due to some faulty memory issue logic.
Went back and reviewed an earlier project, the rtf65003. It supported the 6502, 65816 and native instruction sets. The native ISA supports register based operations with 16 general purpose registers. I’m wondering how difficult it would be to support something similar.

_________________
Robert Finch http://www.finitron.ca


Mon Dec 09, 2019 4:31 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 974
Location: Canada
Hm. Wondering what can be done to better pipeline the branch predictor. All in one cycle the branch predictor predicts the branch outcome then feeds into logic controlling the program counter and slot valid signals. It would be better if the output of the lookup were registered, then fed to the rest of the logic. But I don’t know how to get it to work as registering the lookup would add a phase delay effect, it would be too late to be used for the current instruction.
I modified the way that branches are detected to use multiple decoders in the fetch stage. This should improve performance. I’ve also got things set up now to support oddball instructions that require more than four micro-ops. If the number of micro-ops is greater than four, the first four micro-ops are queued and a flag is set, as suggested a while back by Arne. In the next clock cycle the micro-op table is referenced indirectly through a map which selects one of 16 extra bundles of micro-ops. The map is needed to compress the table. The assumption is there will be fewer than 16 instructions requiring more than four micro-ops. The BRK, MVN and MVP instructions require six each. RTI might require five.
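A simplified picture of the two-step lookup is sketched below. The micro-op names are placeholders and do not match the real table; only the structure (four micro-ops per entry, a map into at most 16 extra bundles) follows the description.

```python
MAX_PER_CYCLE = 4
MAX_EXTRA_BUNDLES = 16

# Placeholder micro-op names; the real table contents differ.
MAIN_TABLE = {
    "SEC": ["uop_sec"],
    "BRK": ["uop_a", "uop_b", "uop_c", "uop_d"],   # first four of six
}
EXTRA_MAP = {"BRK": 0}                              # opcode -> extra bundle number
EXTRA_BUNDLES = [["uop_e", "uop_f"]]                # remaining micro-ops


def queue_instruction(opcode):
    """Return the micro-op bundles queued for one instruction, one per clock.

    An opcode present in EXTRA_MAP stands for the 'flag set' case:
    a second bundle is fetched through the map on the next clock.
    """
    bundles = [MAIN_TABLE[opcode][:MAX_PER_CYCLE]]
    if opcode in EXTRA_MAP:
        bundles.append(EXTRA_BUNDLES[EXTRA_MAP[opcode]])
    return bundles
```

Here BRK takes two clocks to queue its six micro-ops, while SEC still queues in one.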
Whew! Extended the data bus to 64-bits and added 32 regs for native mode. The core is slowly mutating into a 64-bit processor.

_________________
Robert Finch http://www.finitron.ca


Tue Dec 10, 2019 6:30 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 974
Location: Canada
My first successful attempt at using a SystemVerilog structure. Changed the representation of the micro-op table to use structured elements rather than just one large bitfield. The goal is to reduce the number of mistakes in the micro-op table. Several mistakes were made when the table was manipulated as a raw bitfield. Hopefully adding a structured coding element to it will eliminate errors in the future.

Redoing the native mode instruction set. I didn’t like the address modes associated with anything but load and store operations. So, it looks more like a classic RISC processor now.

_________________
Robert Finch http://www.finitron.ca


Wed Dec 11, 2019 3:56 am
Profile WWW