View unanswered posts | View active topics It is currently Fri Mar 29, 2024 12:57 pm



Reply to topic  [ 84 posts ]  Go to page Previous  1, 2, 3, 4, 5, 6  Next
 microop 6502 
Author Message

Joined: Fri Nov 29, 2019 2:09 am
Posts: 18
That’s so great. Thank you for sharing the code.

The large number of parallel additions is not going to make it practical for a TTL build, I don’t think. But it’s very cool to see it going into your FPGA projects.

Btw, I was wondering: if the prediction takes an extra cycle, how does the fetch stage know what to do? Do you stall or just make a pre-prediction?


Sun Dec 15, 2019 11:19 pm
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
Btw, I was wondering: if the prediction takes an extra cycle, how does the fetch stage know what to do? Do you stall or just make a pre-prediction?

That's a good question. I don't know what to do about this. It feels like stalling the pipeline would defeat the purpose of branch prediction, but maybe the simplest solution is to stall. Stall for two clocks on every branch? If the prediction takes two clock cycles it's even worse. It's complicated because instructions may or may not have queued. If the pipeline was stalled waiting for instructions to queue then it's no problem to override the pc. If there are instructions queued already then they may need to be flushed from the queue. An issue is that it's not known which queue slot is the branch miss one. It means there needs to be logic tracking how many queues took place during the branch prediction, and backing up the queue pointers. There may need to be checkpoints in the queue for each branch, and instructions after the newest branch would be removed. I will have to think more about how the perceptron predictor can be used. Fortunately, it may be possible to just back-out the micro-op queue which is easier to do than the issue queue.
At the moment the perceptron predictor doesn't actually determine program flow. I just have it wired up in parallel with the gselect predictor which determines program flow. But the input and output of the predictor are logged so I can see if it works.
BTW there's a mistake in the code of the previous post.

_________________
Robert Finch http://www.finitron.ca


Mon Dec 16, 2019 2:12 am
Profile WWW

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
I would think that if the branch predictor is a multi-cycle function, the delay slots after the branch instruction would be filled with instructions that would be unconditionally executed. One or two delay slots can, in many cases, be filled effectively by instructions/operations that do not affect the flags upon which a conditional branch instruction depends. This is especially true for loops where updating the loop control variables can occur regardless of the results of the compare instruction on those loop control variables: for(i = 0; i < maxLoopCnt; i++) {...}.

I am reminded that many processors, with no branch prediction or relatively simple branch prediction, included branch delay slots in their architecture. Although the branch delay slot may have to be filled with a NOP, the inefficiency that such use of the branch delay slot introduces can be a better compromise architecturally than stalling the pipeline. In the case of the 6502, many branches occur in combinations. With a branch delay slot, the delay in evaluating a branch target could be filled with another conditional branch instruction in a cost free manner. The pipeline delay penalty for a dual branch instruction for a multi-flag conditional branch would be reduced by 50%. If a more complex combination of the flags is required, where some logic manipulation of the flags is required, I wonder if those logic operations can be placed in the branch delay slots for a relatively cost free complex, multi-flag branch operation.

For the M65C02A architecture, I opted to include a number of multi-flag branch instructions that would otherwise require evaluation of two or three of the basic ALU flags. My M65C02A-specific branch instructions perform fairly complex logic manipulations of those flags to implement the conditional branches that would otherwise require multiple branches and logic tests of the ALU flags. Thus, it's my opinion that except for some fairly simple unsigned arithmetic operations that can be tested using the 6502's single flag testing conditional branch instructions, allowing the branch predictor to require the use of branch delay slots should not be disregarded as an acceptable compromise, and in fact, may yield better overall performance when multi-flag conditional branches are required.

_________________
Michael A.


Mon Dec 16, 2019 2:57 am
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I think something like the following might work (x2 for two instruction lanes):
Attachment:
File comment: rtf65004 - branch predict
BranchPredict.png
BranchPredict.png [ 44.42 KiB | Viewed 4611 times ]

_________________
Robert Finch http://www.finitron.ca


Mon Dec 16, 2019 3:08 am
Profile WWW

Joined: Fri Nov 29, 2019 2:09 am
Posts: 18
I like it, a gselect pre-prediction. Nice.

There is a significant tradeoff here between a bigger mis-prediction penalty and the promise of higher accuracy. Looks like the gauntlet has been thrown down on perceptron! :)


Mon Dec 16, 2019 7:44 am
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Been having fun with edits yesterday. My workstation is acting a bit like there’s a virus on it and undoing edits previously done. I suppose it could be a buggy text editor. Something amiss with the undo buffer.

I decided not to implement the two level prediction scheme outlined yesterday.
I changed the branch predictor to a gshare predictor with a much larger table (1024B vs 128B) than was previously in use. There is something amiss with the branch prediction logic as the miss rate is 100%.
Continued to work on a native mode for the rtf65004. Got much of the instruction set ironed out. It’s going to be variable length and make use of micro-ops.

_________________
Robert Finch http://www.finitron.ca


Tue Dec 17, 2019 7:06 am
Profile WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
(Make sure you have a good story for backups! And best practice is to use version control, checking in your work several times a day, as it makes sense to do so - ideally also mirroring to a remote repository.)

(There are two kinds of people: those who have suffered a disk failure, and those who have not yet suffered a disk failure.)


Tue Dec 17, 2019 9:59 am
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
Make sure you have a good story for backups! And best practice is to use version control, checking in your work several times a day,

On technique I use is multiple copies of texts. I've only occasionally lost more than about 1/2 hours work. Anything considered valuable should be backed up!

Refined the micro-op queuing mechanism. There are now two micro-op program counters which point into a micro-op program. The program counters are initialized by looking in a map for the opcodes of the two instructions fetched. A maximum of four micro-ops queue per clock cycle advancing the program counters as ops are queued. The new setup removes the fixed number of micro-op slots per macro-instruction. There could be any number of micro-ops associated with a macro instruction now. One complex instruction is the compare strings macro-instruction which is only a single byte. It converts into eight micro-ops.
Code:
 // CMPS (60):
'{2'd1,   LD,         TMP1,xr,ZERO,ZERO},
'{2'd0,   LD,         tmp2,yr,ZERO,ZERO},
'{2'd0,   ADD,      xr,xr,ZERO,FOUR},
'{2'd0,   ADD,      yr,yr,ZERO,FOUR},
'{2'd0,   SUB,      acc,acc,ZERO,ONE},
'{2'd0, BEQ,      ONE,ZERO,ZERO,ZERO},
'{2'd0,   SUBu,      ZERO,TMP1,tmp2,ZERO},
'{2'd2,   BEQ,      ZERO,ZERO,ZERO,ZERO},

_________________
Robert Finch http://www.finitron.ca


Wed Dec 18, 2019 8:22 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Having a lot of fun trying to get micro-op queuing to work. The queue seems to want to place NOP instructions all the time. It’s complicated. The whole thing is pipelined for performance or there’s no point to doing it. There are two micro-program counters, one for each macro instruction fetched. Micro-ops are selected from the micro-program according to the program counters, but they are fetched in order with possibly not being able to fetch for the second program counter. It’s possible there’s too many micro-ops to fetch for them all to be fetched in a single cycle. If that’s the case the macro-pipeline stalls while more micro-ops are fetched. I could also be that case that there isn’t room in the issue queue, which will also cause a stall. Complicated and it doesn’t work quite right yet, almost, but no cigar. It queued about a dozen instructions from the Fibonacci program then skipped over a branch.

_________________
Robert Finch http://www.finitron.ca


Fri Dec 20, 2019 5:42 am
Profile WWW

Joined: Fri Nov 29, 2019 2:09 am
Posts: 18
This sounds like a very general mechanism. if I understand correctly, you have two macro-ops running at any one time, and fetch micro-ops from both — super-scalar micro-ops and macro-ops. Is that right?


Fri Dec 20, 2019 1:21 pm
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
This sounds like a very general mechanism. if I understand correctly, you have two macro-ops running at any one time, and fetch micro-ops from both — super-scalar micro-ops and macro-ops. Is that right?
Right on the money. It makes queuing instructions in order a challenge.

Still working on the micro-op queuing. Issues with branch miss. On a branch miss I’ve fudged it at the moment to insert extra NOPs into the pipeline. Because there’s a pipeline delay before real instructions can be queued again, of about two cycles something has to be placed that’s fairly innocuous in the pipeline. The branch predictor logic is busted again, causing every branch to miss. <- found out this wasn’t the case.

Sequence number decrement was occurring at the wrong spot. This caused newly queuing instructions to have the incorrect sequence number sometimes. This affected branches which use the sequence number to determine what to invalidate.

Studying the gshare branch predictor results, and realizing it’s not accurate for short sequences of branches. It missed 12 out of 14 times in my simple Fibonacci test program. The issue is that gshare uses a larger number of global history bits than gselect to determine the pht index. When the global history is building up it looks like this:
Code:
 Global History
0000000   <- no branches taken
0000001 <- first taken branch of loop recorded
0000011   <- 2nd branch of loop
0000111 <- 3rd branch of loop
...
1111111 <- seventh branch of loop
1111111 <- eigth etc. branch of loop

Note the above pattern is used as a table index. Since the pattern differs seven times in a row, a different pht entry is selected to be updated. Since the pht entry must be hit several times in order for it to flip it's decision, it ends up taking about 12-13 branches before the predictor settles into a 'it's taken' result.

_________________
Robert Finch http://www.finitron.ca


Sat Dec 21, 2019 3:40 am
Profile WWW

Joined: Fri Nov 29, 2019 2:09 am
Posts: 18
Ah, interesting. Coincidentally, I get into exactly this issue in this post]. It takes a while to train a gshare predictor. I ended up modelling a hybrid tournament predictor which includes a 2-bit Bimodal predictor to complement a gshare predictor for this reason. In the early going, the bi-modal predictor is used while gshare builds up history. It makes the predictor much more balanced.

Please feel free to comment in that thread about your experience. I’m trying to provide as much information as I can for other folks who may come along later and be puzzled by the same issues.


Sat Dec 21, 2019 4:39 am
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Cleaned up the micro-op engine code a little bit, breaking it out to its own module from the mainline. I hope to be able to cut and paste for future projects. I tried the startup masking for the gshare predictor, it didn’t seem to help.

_________________
Robert Finch http://www.finitron.ca


Sun Dec 22, 2019 4:39 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Added a bypassing network which bypasses from queued result directly to other argument entries in the queue that might be waiting. This is kind of a N x M bypass network and it increases the size of the core significantly. There is also bypassing of functional unit outputs back to the inputs, but this bypassing is valid only for a single cycle. If there are dependent instructions entering the queue two or more cycles later, then arguments don’t get loaded until commit time. The extra queue bypassing logic improves this situation. However current testing reveal the extra bypass doesn't make a difference. So it's optional.

Gave some more thought to a design more similar to the Thor core with its predicated instructions. I had avoided predication in more recent processors as it requires an extra read port on the target register. However, if implementing vector instructions, this extra read port is required anyway so predication doesn’t incur a port penalty.

Miss-predicted branches cause pipeline bubbles in the micro-op queue (see below), and I was wondering if the slots which are currently filled with NOPs could be filled with other useful instructions. I’m reminded of branch delay slots, but the presence of slots here isn’t predictable. I’m wondering what special instruction could be used. For instance, an ‘increment missed branches count’ instruction might be useful. Or some other performance tuning instruction.
Code:
 ..  0: Q-- 0 0 -- - A03 SUBu  3,R   0000000000001 000000000000c  1 0  4 0000000000000 0 1 11 10 0 1  9 fffffffffc018         83 #
..  1: Q-- 0 0 -- - F13 BNE   0,R   ffffffffffff4 0000000000000  0 1 11 0000000000000 5 1 11 10 5 0  0 fffffffffc01a         84 #
CQ  2: Dd- 0 0 -- - M0b ST    1,R   0000000000004 0000000000000  1 1  3 0000000000000 0 1  3 10 0 1  1 fffffffffc014         67 #
..  3: A-- 0 0 a- - M0b ST    2,CC  0000000000000 0000000000000  1 1  3 0000000000000 0 1  3 10 0 1  1 fffffffffc016         68 #
..  4: Cd- 0 0 -- - A03 SUBu  3,R   0000000000001 000000000000c  1 1  8 0000000000000 0 1  3 10 0 1  1 fffffffffc018         69 #
..  5: Cd- 0 0 -- - F13 BNE   0,R   ffffffffffff4 0000000000000  0 1  3 0000000000000 5 1  3 10 5 1  4 fffffffffc01a         70 #
..  6: Cd- 0 0 -- - A23 NOP   0,R   0000000000000 0000000000000  0 1 11 0000000000000 0 1 11 10 0 1  1 0000000000000         77 #
..  7: Cd- 0 0 -- - A23 NOP   0,R   0000000000000 0000000000000  0 1 11 0000000000000 0 1 11 10 0 1  1 0000000000000         78 #
..  8: A-- 0 0 a- - M07 LD    2,CC  0000000000004 0000000000000  1 1 11 0000000000000 0 1 11 10 0 1  1 fffffffffc010         79 #
..  9: Q-- 0 0 -- - A01 ADDu  1,CC  0000000000000 0000000000002  1 1  1 0000000000000 0 0  8 10 0 1  1 fffffffffc012         80 #
.. 10: Q-- 0 0 -- - M0b ST    1,CC  0000000000004 0000000000000  1 1 11 0000000000000 0 1 11 10 0 1  9 fffffffffc014         81 #
.. 11: O-o 0 0 -- - M0b ST    2,CC  0000000000000 0000000000000  1 1 11 0000000000000 0 1 11 10 0 1  9 fffffffffc016         82 #
MicroOp queuing is misbehaving after running for several hundred cycles. I've yet to figure out why.

_________________
Robert Finch http://www.finitron.ca


Mon Dec 23, 2019 4:47 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Worked on the micro-op queuing today. Re-wrote parts of the code that were long into a more compact form using loops. The loops were necessary to allow the number of micro-ops queued to be configured. In the process the core no longer queues properly. Details to work out yet.

_________________
Robert Finch http://www.finitron.ca


Tue Dec 24, 2019 5:53 am
Profile WWW
Display posts from previous:  Sort by  
Reply to topic   [ 84 posts ]  Go to page Previous  1, 2, 3, 4, 5, 6  Next

Who is online

Users browsing this forum: AhrefsBot, YandexBot and 12 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
cron
Powered by phpBB® Forum Software © phpBB Group
Designed by ST Software