 Thor Core / FT64 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 568
Location: Canada
Scrapped the stack operations in FT64. They turned out to be of limited value and complicated the core. Pop and link/unlink required a dual result bus to handle two targets in a single instruction. Working with the compiler, I came to realize that temporaries are spilled to and loaded from pre-defined stack locations using load / store operations rather than push and pop. Eliminating some operations from the ISA left room for others.
AMOs (atomic memory operations) were added to the core. The AMOs include: swap, add, and, or, xor, min, max, shl, shr, asr, rol. Immediate forms are also supported.
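
For readers unfamiliar with AMOs, here is a minimal behavioural sketch of the read-modify-write idea in Python. It is not FT64's definition: the 64-bit width, signed min/max, and returning the old memory value to the register are all assumptions (borrowed from how other ISAs commonly define AMOs), and the names `amo` and `AMO_OPS` are made up for illustration.

```python
MASK = (1 << 64) - 1  # assumption: 64-bit memory words and registers


def to_signed(x):
    """Interpret a 64-bit pattern as a two's complement integer."""
    return x - (1 << 64) if x & (1 << 63) else x


def rol(m, r):
    """Rotate the 64-bit value m left by r (mod 64) bits."""
    n = r & 63
    return ((m << n) | (m >> (64 - n))) & MASK if n else m


# Each entry maps (old memory value, operand) to the new memory value.
AMO_OPS = {
    "swap": lambda m, r: r,
    "add": lambda m, r: (m + r) & MASK,
    "and": lambda m, r: m & r,
    "or": lambda m, r: m | r,
    "xor": lambda m, r: m ^ r,
    "min": lambda m, r: m if to_signed(m) <= to_signed(r) else r,  # assumed signed
    "max": lambda m, r: m if to_signed(m) >= to_signed(r) else r,  # assumed signed
    "shl": lambda m, r: (m << (r & 63)) & MASK,
    "shr": lambda m, r: m >> (r & 63),
    "asr": lambda m, r: (to_signed(m) >> (r & 63)) & MASK,
    "rol": rol,
}


def amo(memory, addr, op, operand):
    """Read memory[addr], combine it with operand, write the result back.

    Returns the old memory value (an assumption about the register result).
    In hardware the read and write happen as one indivisible bus operation.
    """
    old = memory[addr]
    memory[addr] = AMO_OPS[op](old, operand & MASK)
    return old
```

An immediate form would simply pass a constant as `operand` instead of a register value.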

_________________
Robert Finch http://www.finitron.ca


Tue Jan 09, 2018 6:37 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 902
I do like it when a new machine comes to life sufficiently that practical experience can be brought into play!


Tue Jan 09, 2018 7:26 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 568
Location: Canada
Quote:
I do like it when a new machine comes to life sufficiently that practical experience can be brought into play!
I'm amazed at how much "new stuff" I've learned recently. The amount of knowledge "out there" is incredible. With the number of years of practical experience required it's not a surprise that things are being done by teams of people.
******
I’ve read a couple of documents now that suggest that a solution to register allocation problems is to provide more registers in the design. I’ve also read documentation that says 32 regs is plenty. I assume they meant with a good register allocator. I read one document that suggested using 128 registers as representative of an infinite number for testing register allocators.
So I’m toying with the idea of adding 32 more registers to the design, for 64 in total, and increasing the size of instructions to 36 bits to accommodate them. Seven 36-bit instructions (252 bits) would be packed into a 256-bit-wide cache line. The number of registers doesn’t have much impact on the performance of an FPGA design that uses block RAM for registers. And a goal is a simple compiler. I’m not after a commercial-grade compiler, which means the hardware may have to be more accommodating to the compiler. The original Thor design had 64 regs.
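
The packing arithmetic can be checked with a short Python sketch. The slot order (slot 0 in the low bits) and leaving the four spare bits at the top unused are assumptions; the post does not say how the line would be laid out.

```python
INSN_BITS = 36
LINE_BITS = 256
INSNS_PER_LINE = LINE_BITS // INSN_BITS                # 7 instructions
SPARE_BITS = LINE_BITS - INSNS_PER_LINE * INSN_BITS    # 4 bits left over


def pack_line(insns):
    """Pack seven 36-bit instructions into one 256-bit cache line value."""
    if len(insns) != INSNS_PER_LINE:
        raise ValueError("expected exactly seven instructions")
    line = 0
    for slot, insn in enumerate(insns):
        if not 0 <= insn < (1 << INSN_BITS):
            raise ValueError("instruction does not fit in 36 bits")
        line |= insn << (slot * INSN_BITS)  # assumed layout: slot 0 lowest
    return line


def unpack_line(line):
    """Recover the seven instructions from a packed cache line."""
    mask = (1 << INSN_BITS) - 1
    return [(line >> (slot * INSN_BITS)) & mask for slot in range(INSNS_PER_LINE)]
```

One consequence of the odd packing is that an instruction address is no longer a simple byte offset: fetch has to work in terms of (line, slot) pairs.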

The dump for printf() below shows a case where the compiler runs out of registers to allocate. The OD column stands for optimization desirability. Zero means it’s not a good idea to assign the value to a register (constants, for instance). Registers 11 to 17 are assigned with a simple allocator. Even if a better allocator were in use, there are only two more cases, n=7 and n=8, where a register could be allocated. So if the design had just two more registers available, the simple allocator would be sufficient.
Code:
<CSETable>For _printf
  N   OD  Uses  DUses  Void  Reg  Sym
  0: 132   132    101     0   11  _p
  1:  26    13      5     0   12
  2:  16     8      0     0   13
  3:  12     6      0     0   14
  4:  10     5      0     0   15
  5:   8     4      1     0   16
  6:   4     2      2     0   17
  7:   2     1      0     0   -1
  8:   1     1      0     0   -1
  9:   0     3      0     0   -1
 10:   0     2      0     0   -1
 11:   0     3      3     0   -1
 12:   0     6      0     0   -1
 13:   0     1      1     0   -1
 14:   0     1      0     0   -1
 15:   0    32     25     1   -1
 16:   0     1      1     0   -1
 17:   0     2      2     0   -1
 18:   0     1      0     0   -1
 19:   0     7      0     0   -1
 20:   0     1      1     0   -1
 21:   0     5      0     0   -1
 22:   0    11     11     0   -1
 23:   0     2      0     0   -1
 24:   0     1      0     0   -1
</CSETable>
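
The Reg column of the dump can be reproduced with a very small first-come allocator. This is only a sketch of what "simple allocator" might mean here, assuming the candidates are already sorted by descending OD as in the dump; the function name and parameters are made up for illustration.

```python
def allocate(od_values, first_reg=11, last_reg=17):
    """Hand out registers first_reg..last_reg to candidates in table order.

    A candidate gets a register only if its OD (optimization desirability)
    is above zero and a register is still free; otherwise it gets -1.
    """
    regs = []
    next_reg = first_reg
    for od in od_values:
        if od > 0 and next_reg <= last_reg:
            regs.append(next_reg)
            next_reg += 1
        else:
            regs.append(-1)
    return regs


# OD column from the _printf dump above (entries 0 to 24).
OD = [132, 26, 16, 12, 10, 8, 4, 2, 1] + [0] * 16

seven_regs = allocate(OD)              # registers 11..17, as in the dump
nine_regs = allocate(OD, last_reg=19)  # two more registers available
```

With seven registers, entries 7 and 8 are left without one; with nine, every candidate with a non-zero OD is covered, which is the point made in the post.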

_________________
Robert Finch http://www.finitron.ca


Thu Jan 11, 2018 7:15 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 568
Location: Canada
I delved into the world of gcc machine descriptions today. I picked the Altera NIOS machine description as a starting point for FT64.

I could not find an instruction pattern that supports a number of the four-operand instructions in FT64. The and, or, xor, nand, nor, xnor, min, max, and maj instructions all have three source operands and a single destination operand. Because there are three register read ports in the design, a design choice was that instructions which could use all three ports should use them. So, for instance, ‘or’ ORs together three registers and places the result in the target. I’m not sure whether gcc is able to support this.
FT64’s compiler searches the parsed expression tree for expressions that can be turned into triple-source-operand instructions and makes use of them. So an expression like (a || b || c) gets converted into a single ‘or’ instruction. I’m not sure how easy this would be to do with gcc.
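
A rough idea of such a tree search, sketched in Python rather than in the compiler's own code. The tuple representation, the `fold3` name, and the `or3`-style node names are invented for this example; the sketch treats the operator as a plain associative binary node and does not deal with short-circuit evaluation.

```python
# Expression trees are tuples: (op, left, right); leaves are variable names.
FOLDABLE = {"and", "or", "xor"}  # associative ops, so regrouping is safe


def fold3(node):
    """Rewrite op(op(a, b), c) or op(a, op(b, c)) into one three-source node."""
    if isinstance(node, str):
        return node
    op, *args = node
    args = [fold3(a) for a in args]  # fold the children first (bottom-up)
    if op in FOLDABLE and len(args) == 2:
        left, right = args
        if isinstance(left, tuple) and left[0] == op and len(left) == 3:
            return (op + "3", left[1], left[2], right)
        if isinstance(right, tuple) and right[0] == op and len(right) == 3:
            return (op + "3", left, right[1], right[2])
    return (op, *args)


tree = ("or", ("or", "a", "b"), "c")
folded = fold3(tree)  # a single three-source 'or'
```

A chain of four operands folds into one three-source node plus one ordinary two-source node, so the instruction count drops from three to two.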

_________________
Robert Finch http://www.finitron.ca


Fri Jan 12, 2018 6:33 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 568
Location: Canada
Some of the pseudo code for the SSA form has been implemented in the FT64 compiler. The code may not be working but it looks good.

The SSA pseudo code comes from:
http://www.cs.utexas.edu/~pingali/CS380 ... Cytron.pdf

Gleaned more information from gccint.pdf as well.

_________________
Robert Finch http://www.finitron.ca


Sat Jan 13, 2018 10:12 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 568
Location: Canada
Took a break from software today to work again on blue sky stuff. I’m curious to find out just how large FT64x4 would be. I’ve got an initial version mostly coded (18,000 LOC); it still needs a lot of work. FT64x4 is a four-way version of FT64.

Synthesized the register file for the FT64x4. It came out to 85,000 LUTs and 270 block RAMs, or about 3/4 of the xc7a200 device. There are 4096 registers counting the vector register file, with 12 read ports and three write ports. FT64x4 fetches and queues four instructions, issues up to six, and commits up to four instructions at a time. The instruction queue / ROB is 10 entries long. The size of the hardware grows geometrically with the size of the ROB in a couple of places, so it’s best kept small.

I could put a hack in to reduce the number of available register ports. Eight sounds like a nice number. Since most instructions don’t require all three ports, eight ports is likely more than enough for any group of four instructions. It’s a PITA to shuffle ports around though.
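
The port budget is simple arithmetic: four instructions with three sources each need 12 read ports in the worst case. A small Python sketch of the check follows. The uniform enumeration at the end counts every combination of source counts equally, which is an assumption made purely for illustration and says nothing about a real instruction mix.

```python
from itertools import product


def read_ports_needed(group):
    """Total read ports used by a group of instructions.

    Each entry is the number of source registers (0..3) of one instruction.
    """
    return sum(group)


def fits(group, ports=8):
    """True if the group can read all its sources with the given port count."""
    return read_ports_needed(group) <= ports


# Every possible combination of source counts for a group of four.
all_groups = list(product(range(4), repeat=4))
fitting = sum(fits(g) for g in all_groups)
```

A group that does not fit would have to be split or stalled, which is where the port shuffling logic comes in.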

_________________
Robert Finch http://www.finitron.ca


Mon Jan 15, 2018 4:34 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 568
Location: Canada
I changed the register file update to use a 4x overclock to reduce the number of write ports required.

The branch predictor changed; it hadn’t been updated in a while. When creating a four-way version of the predictor the author noted that it wasn’t very resource efficient. It operated very simply but used a memory with four write ports as the history table, which made it quite large. So it was rewritten to use an input rate converter with a single-port memory instead. The original predictor worked as if every single instruction could be a branch instruction, which is somewhat unrealistic. The new predictor capitalizes on the fact that branches are typically one in four instructions. It stores between zero and four branches into a rate conversion buffer every clock cycle. The branches are then removed from the buffer one per clock cycle to update the table, which now has only a single port. However, if branches arrive fast enough to overfill the buffer, branch predictions will likely be wrong. The new predictor is an order of magnitude smaller than the old one. It’s one of the things that has helped bring the size of the core down so that it fits in the FPGA again.
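
A behavioural sketch of the rate conversion idea in Python. The buffer depth, the push-then-drain order within a cycle, dropping updates on overflow, and the 2-bit saturating counters used as the table are all assumptions for illustration; the post does not give those details.

```python
from collections import deque


def update_counter(table, pc, taken):
    """Single-port table update: one 2-bit saturating counter per branch PC."""
    count = table.get(pc, 1)  # assumed reset state: weakly not taken
    table[pc] = min(count + 1, 3) if taken else max(count - 1, 0)


class RateConverter:
    """Accepts up to four branch outcomes per clock, drains one per clock."""

    def __init__(self, depth=8):  # depth is a hypothetical value
        self.fifo = deque()
        self.depth = depth
        self.dropped = 0  # updates lost because the buffer was full

    def clock(self, branches, table):
        """One clock cycle: queue this cycle's branches, then update the table once."""
        if len(branches) > 4:
            raise ValueError("at most four branches per cycle")
        for branch in branches:
            if len(self.fifo) < self.depth:
                self.fifo.append(branch)
            else:
                self.dropped += 1
        if self.fifo:
            pc, taken = self.fifo.popleft()
            update_counter(table, pc, taken)
```

As long as branches average no more than one per cycle the buffer drains and nothing is lost; a sustained burst of four branches per cycle fills it and updates start being dropped, which matches the caveat about wrong predictions.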

Tonight’s nonsense:
Attachment: err1a.png

wclk is supposed to be one cycle delayed from clk. But it doesn’t get delayed in the simulation. This causes a problem with the register file update.
Code:
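// wclk2 is meant to be clk delayed by one clk4x cycle.
// Note: if clk and clk4x change in the same simulation time step, the value
// of clk sampled here depends on event ordering (a possible cause of the
// missing delay).
reg wclk2;
always @(posedge clk4x)
   wclk2 <= clk;

// Time-multiplex the write ports onto the register file at 4x the core clock.
always @(posedge clk4x)
   if (clk & ~wclk2) begin        // first clk4x edge with clk high: port 0
      wr <= wr0;
      wa <= wa0;
      i <= i0;
   end
   else if (clk & wclk2) begin    // next clk4x edge, clk still high: port 1
      wr <= wr1;
      wa <= wa1;
      i <= i1;
   end
   else begin                     // clk low: no write
      wr <= 1'b0;
      wa <= 'd0;
      i <= 'd0;
   end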
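
To see why the missing delay matters, here is a small Python model of the selection logic above. It is a sketch only: `clk_samples` is the value of clk as seen at each rising edge of clk4x (assumed high for two edges, low for two), and the `delayed` flag is a made-up switch that mimics the simulation behaviour where wclk2 tracks clk with no delay.

```python
def simulate(clk_samples, port0, port1, delayed=True):
    """Return which write port is latched at each rising edge of clk4x.

    clk_samples: value of clk seen at each clk4x edge.
    delayed:     True  -> wclk2 lags clk by one clk4x edge (intended behaviour)
                 False -> wclk2 equals clk at the same edge (the reported symptom)
    """
    wclk2 = 0
    selected = []
    for clk in clk_samples:
        # Non-blocking semantics: the comparison sees wclk2 from before this edge.
        seen = wclk2 if delayed else clk
        if clk and not seen:
            selected.append(port0)
        elif clk and seen:
            selected.append(port1)
        else:
            selected.append(None)
        wclk2 = clk  # registered copy of clk for the next edge
    return selected


edges = [1, 1, 0, 0, 1, 1, 0, 0]
intended = simulate(edges, "port0", "port1")
broken = simulate(edges, "port0", "port1", delayed=False)
```

With the delay in place, port 0 and port 1 take turns on the two edges where clk is high. Without it, `clk & ~wclk2` is never true, so port 0 is never written and port 1 is written twice.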

_________________
Robert Finch http://www.finitron.ca


Tue Jan 16, 2018 5:25 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 902
Four write ports sounds pretty extreme! I wonder if any part of any commercial machine is quite so writey as that. An interesting workaround to use a rate converter. Could it use a stack rather than a FIFO so it deals with the most recent branch more urgently? More to the point, would that be any better?


Tue Jan 16, 2018 6:39 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 568
Location: Canada
Quote:
Four write ports sounds pretty extreme! I wonder if any part of any commercial machine is quite so writey as that.

Yes, I think that was obviously overkill in the name of simple design.
I've heard some machines use a two phase clock to increase the apparent number of write ports. It's interesting because it doesn't matter how many instructions the core can fetch and queue at once if it can only update two registers at a time. It can't work any faster than the updates it can make. For some of these commercial machines that can process a half dozen (or more) micro-ops at once I wonder if the number of write ports puts a limit on their performance.

I think the branches have to be handled in program order, but maybe it doesn’t matter. It probably doesn’t matter a whole lot, since slots specific to the branch address are being updated. Not handling branches in order might create problems with the global branch history though. One problem with delaying the updates by a couple of cycles is that in a tight loop the predictor might not update before the next branch occurs, so the prediction could be off by a cycle. In other words, the predictions might be less accurate.

I measured the predictor’s accuracy at one point a couple of years ago and IIRC it was about 89-90% accurate. For some reason it wasn’t quite as accurate as the text states it should be (93%), but maybe it was just my limited testing.

_________________
Robert Finch http://www.finitron.ca


Tue Jan 16, 2018 6:43 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 902
I found some notes on Wikipedia, on familiar machines if not especially recent ones:

Quote:
At some point, it may be smaller and/or faster to have multiple redundant register files, with smaller numbers of read ports, rather than a single register file with all the read ports. The MIPS R8000's integer unit, for example, had a 9 read 4 write port 32 entry 64-bit register file implemented in a 0.7 µm process, which could be seen when looking at the chip from arm's length.


Quote:
The SPARC uses "Shadow Register File Architecture" as well for its high end line, It had up to 4 copies of integer register files (future, retired, scaled, scratched, each contain 7 read 4 write port) and 2 copies of floating point register file.


Quote:
For example, POWER8 has up to 8 instruction decoders, but up to 32 register files of 32 general purpose registers each (4 read and 4 write port), to facilitate simultaneous multithreading


Quote:
[In the x86 line...] On P6... The register file itself still remains one x86 register file and one x87 stack and both serve as retirement storing. Its x86 register file increased to dual ported to increase bandwidth for result storage.


Quote:
Later P6 implementations (Pentium M, Yonah) introduced "Shadow Register File Architecture" that expanded to 2 copies of dual ported integer architectural register file


Quote:
Core 2 increased the inner ring bus to 24 bytes (allow more than 3 instructions to be decoded) and extended its register file from dual ported (one read/one write) to quad ported (two read/two write)


Quote:
In later x86 implementations, like Nehalem and later processors, both integer and floating point registers are now incorporated into a unified octa-ported (six read and two write) general-purpose register file


Tue Jan 16, 2018 7:20 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 568
Location: Canada
The register file on the SPARC must be huge with all the replication. Because it uses register windowing it already has like 512 regs.

Some integer SIMD operations were added to the core; there are more that could be added. Multiply and divide are currently not supported. Add, sub, cmp, and, or, xor, and shifts all support SIMD operation. The core currently supports lane sizes of 8, 16, 32, and 64 bits. Also added was queueing of more vector elements if two queue slots are open; previously, vector instructions queued one element at a time. Several new load instructions were added that load only the portion of the register corresponding to the operation size. Normally a load operation sign- or zero-extends to the register width; these new operations do not extend. Among other things, this allows unaligned memory loads to be performed more easily. A combination of LBO (load byte only) and a rotate instruction can be used to load a register from an unaligned memory address without using extra registers.
Vector queueing is almost working. There is an issue with when to stop queueing. It's off by one sometimes.
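
The LBO-plus-rotate trick mentioned above can be sketched in Python as follows. The semantics of `lbo` here (replace the low byte and leave the other 56 bits of the register alone) are an assumption based on the description of the non-extending loads, and little-endian byte order is assumed as well.

```python
MASK64 = (1 << 64) - 1


def lbo(reg, memory, addr):
    """Assumed LBO behaviour: overwrite only the low byte of reg with memory[addr]."""
    return (reg & MASK64 & ~0xFF) | memory[addr]


def ror(reg, n):
    """Rotate a 64-bit value right by n (mod 64) bits."""
    n &= 63
    return ((reg >> n) | (reg << (64 - n))) & MASK64 if n else reg


def load_unaligned64(memory, addr):
    """Build a 64-bit little-endian value one byte at a time in a single register."""
    reg = 0
    for i in range(8):
        reg = lbo(reg, memory, addr + i)  # drop the next byte into bits 7..0
        reg = ror(reg, 8)                 # rotate it out of the way
    return reg
```

After eight load/rotate pairs the first byte has rotated all the way around to bits 7..0 again and the last byte sits in bits 63..56, so no scratch register or masking is needed.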

_________________
Robert Finch http://www.finitron.ca


Thu Jan 18, 2018 8:10 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 902
Just an idea on the write ports explosion - and this is mentioned in that Wikipedia page - if you have two banks, for even and odd registers, then half of the time you'd be able to do two writes in one cycle even with just one write port. Even better if you split into four banks. Of course, it is extra complexity, and more steering, but as each bank is smaller it might not cost clock speed.
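
A quick Python sketch of the banking idea, assuming registers are assigned to banks by the low bits of the register number and each bank has a single write port. The function name is made up for illustration.

```python
def write_cycles(dest_regs, banks=2):
    """Cycles needed to retire a set of register writes with one port per bank.

    Writes to different banks proceed in parallel; writes that land in the
    same bank have to take turns.
    """
    per_bank = [0] * banks
    for reg in dest_regs:
        per_bank[reg % banks] += 1  # bank chosen by low bits of the register number
    return max(per_bank)
```

For example, writes to r4 and r7 finish in one cycle with even/odd banks, while r4 and r6 collide and take two; with four banks r4 and r6 no longer collide.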


Thu Jan 18, 2018 8:17 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 568
Location: Canada
Another idea is to separate out groups of registers, like the data and address registers on the 68000. That’s one reason floating point regs are usually a separate regfile: it allows a floating point op and an integer op at the same time. Thor has code address pointers separate from the general register file. Having separate result status registers like the PowerPC might help too. There are a lot of compares and branches in code.

I was thinking of using odd/even register banks, but the steering logic takes up resources. The two-way core is small enough to fit in the FPGA while retaining a lot of features. For now the write side of the register file is being overclocked by four times to allow two (or three) write ports. Overclocking the register file doesn’t take very many resources compared to other solutions. It’s possible to do this in an FPGA because other logic and routing in the core is much slower than the block RAM, so it doesn’t really cost performance. But that’s probably FPGA-centric. There is also a five-times clock in the system for the video display, and it would be nice to be able to re-use that clock. Using multiple clocks will complicate things, however, if the clock frequency is varied.
It’ll be lucky if the core runs at 50MHz (kind of an eventual target). I’m trying to get it to work first at 25MHz with a 100MHz register write clock. At 50MHz performance should be about the same as a 100MHz overlapped pipeline version.

_________________
Robert Finch http://www.finitron.ca


Fri Jan 19, 2018 4:49 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 902
Nice idea to use faster clocks for faster subsystems - of course the ratios you use may end up affecting your overall speed, but you know that!


Fri Jan 19, 2018 9:46 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 568
Location: Canada
I’m not sure if this is a nonsense idea or not; it has to do with the placement of instruction decoding in the core. For most machines instruction decoding is shown as taking place after fetch. Not the case for FT64.
Instruction decoding in the FT64 core uses a late-decode strategy. The instruction register is passed right through the core to the functional units, where decoding takes place. Decoding all signals on entry into the instruction queue would place hundreds of signals in the queue, which would then have to be multiplexed for the functional units. Instead, the 32-bit instruction register is placed in the queue for later decoding. Because the FPGA doesn’t have real tri-state busses and must use multiplexer logic to select functional unit outputs, any decoded single-bit signals would have to be built up again using logic in order to generate multiplexer controls. It’s probably just as fast and resource-frugal in an FPGA to decode the signals at a late stage.

Attachment: IDPlacement.png (FT64 instruction decode placement)
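
The late-decode arrangement described above can be illustrated with a tiny Python model. The field layout in `decode_alu` is entirely hypothetical (FT64's real encoding is not given in this thread), and the function and queue names are invented; the point is only that the queue carries the raw 32-bit word and the decode happens at the functional unit.

```python
from collections import deque

queue = deque()  # instruction queue entries hold only the raw 32-bit word


def enqueue(insn):
    """Queue an instruction without decoding it."""
    queue.append(insn & 0xFFFFFFFF)


def decode_alu(insn):
    """Decode at the functional unit. Hypothetical field layout, not FT64's."""
    return {
        "opcode": insn & 0x3F,        # bits 5..0   (assumed)
        "ra": (insn >> 6) & 0x1F,     # bits 10..6  (assumed)
        "rb": (insn >> 11) & 0x1F,    # bits 15..11 (assumed)
        "rt": (insn >> 16) & 0x1F,    # bits 20..16 (assumed)
    }


def issue_to_alu():
    """Pop the oldest instruction and decode it only now, at the ALU."""
    return decode_alu(queue.popleft())
```

Each queue entry costs 32 bits of storage and multiplexing instead of one wire per decoded control signal, at the price of repeating a small decoder in each functional unit.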

_________________
Robert Finch http://www.finitron.ca


Sat Jan 20, 2018 4:06 am