Last visit was: Sat Oct 23, 2021 10:09 am
It is currently Sat Oct 23, 2021 10:09 am



 [ 84 posts ]  Go to page Previous  1, 2, 3, 4, 5, 6  Next
 microop 6502 
Author Message

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1508
Location: Canada
Shelving this project for now and switching back to nvio3. Without changing the instruction set and programming model drastically, it's probably faster to use a much smaller pipelined 6502 core.

_________________
Robert Finch http://www.finitron.ca


Thu Dec 12, 2019 3:55 am WWW

Joined: Fri Nov 29, 2019 2:09 am
Posts: 18
Some interesting findings in this project Rob. As you suggest, a pipelined 6502 seems to be a promising approach. I’m currently digging into the branch prediction scheme of the pipelined model I built. I am finding that the correlating predictor I am using needs a very big history table — gshare with a 16-bit history. I was very interested to learn how efficient your correlating predictor seems to be in this build (2-bit history, if I recall correctly). Makes me think I must be doing something wrong. If you’re interested, here is a link to the relevant post on 6502.org. I would be delighted to learn more about your approach to these predictors.


Sat Dec 14, 2019 3:58 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1508
Location: Canada
One gotcha with larger tables in an FPGA is that they can't be built out of block ram, block ram has a minimum latency of a clock cycle, and two or three cycles if one wants to get good read performance. A branch predictor needs the prediction within the current clock cycle. That means building a table using FF's or distributed rams in an FPGA. The table is pretty much limited in size then or the fmax will suffer. Using predictors that make use of smaller PHT's might help.
Quote:
the correlating predictor I am using needs a very big history table — gshare with a 16-bit history.
As gshare exclusively or's the global history with some of the program counter bits, a smaller table can be used by limiting the bits xor'd. The gshare predictor I coded uses only a 1k entry table by xor'ing five bits of global history with five bits from the program counter, then using five other bits of the program counter as a pht index.
Code:
wire [9:0] bht_wa = {pc[6:2],gbl_branch_hist[5:1]^pc[11:7]};      // write address

One trick that might make a difference for the 6502 is to trying 'zoning' the branches by not using the low bit or two of the program counter as an index. Branches are two byte instructions, if they were aligned every other byte then using the low bit of the program counter as an index would not be necessary. Since branches are about every fourth or fifth instruction there's a 'zone' that the branch falls in. Of course if two branches fall in the same zone the prediction will be completely off.

_________________
Robert Finch http://www.finitron.ca


Sat Dec 14, 2019 5:16 am WWW

Joined: Fri Nov 29, 2019 2:09 am
Posts: 18
robfinch wrote:
A branch predictor needs the prediction within the current clock cycle. That means building a table using FF's or distributed rams in an FPGA. The table is pretty much limited in size then or the fmax will suffer.
This is great information, thank you. An FF table needs to be pretty small. I’ll need to work a little harder here.

Quote:
The gshare predictor I coded uses only a 1k entry table by xor'ing five bits of global history with five bits from the program counter, then using five other bits of the program counter as a pht index.
Wait, I’m not clear on this ... I’m xor’ing low-order bits from PC with the Global History Register (GHR), and then using *that* as an index into the pht. I discard the upper bits of PC. You seem to use the remaining bits from PC for the index. What do you do with the xor’ed value then? I may be doing something wrong.

Quote:
One trick that might make a difference for the 6502 is to trying 'zoning' the branches by not using the low bit or two of the program counter as an index. Branches are two byte instructions, if they were aligned every other byte then using the low bit of the program counter as an index would not be necessary. Since branches are about every fourth or fifth instruction there's a 'zone' that the branch falls in. Of course if two branches fall in the same zone the prediction will be completely off.
Great suggestion. I’ve actually tried this and it yielded no measurable increase in accuracy when keeping the table the same size. But I did not test whether it would deliver the same accuracy with a smaller table, which is not quite the same thing. :roll: I will try again.

Many thanks for this dialogue.


Sat Dec 14, 2019 1:00 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1508
Location: Canada
Quote:

Quote:
The gshare predictor I coded uses only a 1k entry table by xor'ing five bits of global history with five bits from the program counter, then using five other bits of the program counter as a pht index.
Wait, I’m not clear on this ... I’m xor’ing low-order bits from PC with the Global History Register (GHR), and then using *that* as an index into the pht. I discard the upper bits of PC. You seem to use the remaining bits from PC for the index. What do you do with the xor’ed value then? I may be doing something wrong.
I was under the impression that a gshare predictor uses multiple pattern history tables, but I may have a hybrid method here. The xor'd value is used as a table select (really just additional bits used to index one large table). The low order bits are being used to index pht's. The high order bits are being used to select pht's. The difference between this predictor and gselect is that more tables are available through the inclusion of more global history bits in the table select and the table selected depends on both the global history and the pc. To keep the index small (10 bits) I guessed that it wasn't necessary to use the high-order pc bits beyond bit 11. Most code is written in subroutines that often stay within the same page of memory (about 4kB). It's almost organized along cache lines. In this case each pht covers a 128 byte region.
By discarding the upper bits of the pc the table can't track between subroutines. Suppose there are two loops, one in each of two subroutines. When the first sub calls the second, the history for the first sub will be wiped out because the high order pc bits are not included for indexing. It's tempting to fold more high order bits of the pc into the xor operation. It might give better results for code that jumps between subroutines. I haven't really tested this predictor.

_________________
Robert Finch http://www.finitron.ca


Sat Dec 14, 2019 3:30 pm WWW

Joined: Fri Nov 29, 2019 2:09 am
Posts: 18
Thanks for this explanation. I find the terminology a bit confusing, but I think we’re on the same page. Yes, xor PC and GHR together and use that to access the pht — the low order bits are used to index the pht, and the high order bits to select phts.

In my model, I get the highest accuracy when I use all 16-bits of PC and a 16-bit GHR in the XOR operation. The result is used to fetch the 2-bit predictor in the pht as described above (in reality, the pht is a single long table, but it is conceptually a two dimensional array).

I’m beginning to think that the difference in the performance of the predictor has more to do with the nature of the code under test (the Basic interpreter), then anything in the predictor itself.

Once again, thanks for your commentary.


Sat Dec 14, 2019 5:38 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1640
Very interesting idea that the distance (as well as direction) of a branch might be indicative and help divide branches into different classes. Long medium and short distance branches might well have different statistics. And potentially interesting to use the low bit of distance as a hint bit, although if half the time it costs a NOP, that cost would need to be overcome in the improved prediction.


Sat Dec 14, 2019 5:41 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1508
Location: Canada
I just had the thought of perhaps slaving the branch history table to subroutine access. When a subroutine is called allocate a history table for it. Keep the history table around for a while. Then if the subroutine is called again use the history table for it. On return from subroutine switch back to the caller’s history table. Many routines are small. Perhaps the size of the history table required could be specified somehow in the code. A routine like memcpy() or strlen() might only require a single small pht. I’m reminded of a paging memory system, perhaps the history tables could be stored and reloaded when required (paged in and out). There could be a pattern history table stored in memory similar to an mmu page table.
The global branch history combined with pc bits absorbs this kind of access. But perhaps making it more explicit would help.

_________________
Robert Finch http://www.finitron.ca


Sun Dec 15, 2019 1:37 am WWW

Joined: Fri Nov 29, 2019 2:09 am
Posts: 18
Very interesting indeed. There is something to various code fragments having their own pht ...

As I test more it’s becoming clear that different code fragments like different length of histories. It seems to be a tradeoff between identifying longer patterns and how long it takes to train the predictor. For small fragments with a few branches, a short history gets up and running quickly and the predictor does well in just a few iterations. More linear code needs a longer history to find the repeating patterns. Conversely, too short a history loses important info that longer fragments require, and too long a history fails to identify short repeating patterns quickly enough.

It’s probably better said that any given branch may have its own particular requirements. You want is enough resolution in PC to keep branches from stepping on each other (physical locality) and also the ability for the predictor to tune the length of the history to each branch in particular (temporal locality).

Actually putting together a predictor that accomplishes that is not for the faint of heart ...


Sun Dec 15, 2019 10:07 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1640
From Hacker News:
Quote:
Neural Methods for Dynamic Branch Prediction.
Daniel A. Jiménez and Calvin Lin. 2002.

(It seems AMD have been using perceptrons for a while in their CPUs, as one of a collection of tactics, whereas Intel have gone for larger tables.)

Edit: also Dynamic Branch Prediction with Perceptrons (2001) and other papers nearby.


Sun Dec 15, 2019 11:39 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1508
Location: Canada
Wow! Thanks for the reference to material BigEd. I figured that a neural networks predictor would be have to be too large to implement for my small projects. It turns out not to be any larger than a history table based predictor. It does take an extra clock cycle to calculate the prediction. I hope I coded it correctly, now to test.

_________________
Robert Finch http://www.finitron.ca


Sun Dec 15, 2019 3:54 pm WWW

Joined: Fri Nov 29, 2019 2:09 am
Posts: 18
Yes, wow! I guess this is about recognizing a repeating pattern, so a neural predictor seems to make sense.

Regarding implementation, it looked like a dot product was necessary. How do you manage that in one clock cycle? Very interested in how you implemented this ...


Sun Dec 15, 2019 4:13 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1508
Location: Canada
Quote:
Regarding implementation, it looked like a dot product was necessary. How do you manage that in one clock cycle? Very interested in how you implemented this ...
It turns out the dot product is just a dot product with a single bit. It ends up being converted into an add/sub +1 to the weights. It was just coded in a straightforward fashion.
Code:
// Form dot product of input and weights.
always @*
begin
   sum = 0; // bias weight
   for (n = 0; n < 22; n = n + 1)
      sum = global_history[n] ? sum + {{6{wghts[n][7]}},wghts[n]} : sum - {{6{wghts[n][7]}},wghts[n]};  // note sign extension
end

I followed the article fairly closely and came up with predictor using 256 perceptrons, 22 bits of global history, and 8-bit weights. This uses 5.5kB of memory. One trick was using a 22-byte wide (176 bit) ram so the calc can take place in parallel. All 22 add/subs are done during the same clock cycle. Just looking at the code, it may be slow depending on logic generated to form the sum. I'm assuming the tools can find a solution with minimum delay. The code may have to be broken out across multiple clocks. It's much like a multiply so it may be slow as discrete logic in an FPGA.

Just testing to see if I've got the proper jist from the article. It'll be fine-tuned later, probably to fewer perceptrons and less global history to make the core smaller. (It's about 1,500 LUTs) as is.

_________________
Robert Finch http://www.finitron.ca


Sun Dec 15, 2019 9:41 pm WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1640
(Just in case anyone missed it, I did a couple of updates to my post upthread - there are now links to two papers and to the site of one of the authors, who has more research.)


Sun Dec 15, 2019 10:04 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1508
Location: Canada
Dot product summation broken up into two level adder tree.
Code:
reg [12:0] sum;
reg [12:0] sum1 [0:4];

// Form dot product of input and weights.
always @(posedge clk)
begin
   sum1[0] = 0;
   sum1[1] = 0;
   sum1[2] = 0;
   sum1[3] = 0;
   sum1[4] = 0;
   for (n = 0; n < 5; n = n + 1)
      sum1[0] = global_history[n] ? sum + {{6{wghts[n][7]}},wghts[n]} : sum - {{6{wghts[n][7]}},wghts[n]};
   for (n = 5; n < 10; n = n + 1)
      sum1[1] = global_history[n] ? sum + {{6{wghts[n][7]}},wghts[n]} : sum - {{6{wghts[n][7]}},wghts[n]};
   for (n = 10; n < 15; n = n + 1)
      sum1[2] = global_history[n] ? sum + {{6{wghts[n][7]}},wghts[n]} : sum - {{6{wghts[n][7]}},wghts[n]};
   for (n = 15; n < 20; n = n + 1)
      sum1[3] = global_history[n] ? sum + {{6{wghts[n][7]}},wghts[n]} : sum - {{6{wghts[n][7]}},wghts[n]};
   for (n = 20; n < 22; n = n + 1)
      sum1[4] = global_history[n] ? sum + {{6{wghts[n][7]}},wghts[n]} : sum - {{6{wghts[n][7]}},wghts[n]};
end

always @*
begin
   sum = 0;   // bias weight
   for (n = 0; n < 5; n = n + 1)
      sum = sum + sum1[n];
end

// < 0 means don't take branch, >= 0 means take branch => the take branch
// bit is inverted.
always @(posedge clk)
   prediction_o <= ~sum[12];


_________________
Robert Finch http://www.finitron.ca


Sun Dec 15, 2019 10:32 pm WWW
 [ 84 posts ]  Go to page Previous  1, 2, 3, 4, 5, 6  Next

Who is online

Users browsing this forum: CCBot and 0 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Jump to:  
Powered by phpBB® Forum Software © phpBB Group
Designed by ST Software