 nvio 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 918
Location: Canada
Thread for the rtfItanium core. The rtfItanium is a three-way superscalar core coded in Verilog or System Verilog. It's a 64-register machine with an 80-bit data path. The project started in earnest around May 18, 2019.

Started another mega project: the rtfItanium. This time it uses 40-bit opcodes packed into 128-bit instruction bundles with an 8-bit template field. No, it isn’t a VLIW core; it’s still a more or less ordinary superscalar core. The template field is used to expand the number of bits available to decode for each of the three instruction slots. It compresses the decode by noting that, given any three instructions, certain combinations are unlikely, for instance more than two memory operations at once. The templates are borrowed from the IA64, slightly different as there is no L or X unit.
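As a rough illustration of the bundle arithmetic (3 x 40 + 8 = 128 bits), here is a minimal Python sketch that packs and unpacks a bundle. The field positions are an assumption made for the example (slot 0 in the low bits, template in the top byte); the thread does not state the actual layout, and the function names are hypothetical.

```python
SLOT_BITS = 40
TEMPLATE_BITS = 8
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def pack_bundle(template, slot0, slot1, slot2):
    """Pack three 40-bit slots and an 8-bit template into a 128-bit integer.

    Assumed layout (not taken from the thread): slot0 in bits 0-39,
    slot1 in bits 40-79, slot2 in bits 80-119, template in bits 120-127.
    """
    for value in (slot0, slot1, slot2):
        if value & ~SLOT_MASK:
            raise ValueError("slot value does not fit in 40 bits")
    if template & ~TEMPLATE_MASK:
        raise ValueError("template does not fit in 8 bits")
    return (template << 120) | (slot2 << 80) | (slot1 << 40) | slot0


def unpack_bundle(bundle):
    """Return (template, (slot0, slot1, slot2)) under the same assumed layout."""
    template = (bundle >> 120) & TEMPLATE_MASK
    slots = tuple((bundle >> (SLOT_BITS * i)) & SLOT_MASK for i in range(3))
    return template, slots
```

Whatever the real field order is, the point is that the three slots plus the template fill the 128 bits exactly, with nothing left over.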
The data path is 80 bits wide in order to support 80-bit floating-point. It’s a 64-register machine with a unified integer and floating-point register set. Given the data path, there are instructions to load and store 8-, 16-, 32-, 40-, 64-, and 80-bit data. There is no requirement for loads and stores to be aligned.
A quick reference chart attached.
Attachment: QuickRef.png (File comment: rtfItanium Quick Ref)

_________________
Robert Finch http://www.finitron.ca


Last edited by robfinch on Fri Jun 21, 2019 3:28 am, edited 3 times in total.



Tue May 21, 2019 3:22 am
Coded up a storm today. About 10,000 lines of code in two days. Mostly block copied, but a good chunk of original code too. Got the first pass of the rtfItanium code completed. Now it needs to be refined. Documentation is severely limited, but good enough for a start.

The ISA uses displacement-plus-offset (dpo) branch addressing. The offset portion is 8 kB and the total reach of the dpo addressing is approximately ±1 MB. 8 kB corresponds to the planned size of a memory page.

The core is a three-way superscalar design. It can fetch, queue, issue, execute, and commit at least three instructions per clock cycle.

The trick will be getting it to fit in an affordable FPGA. The FT64, when built for 64-bit two-way operation, just barely fits (90% full) into the FPGA. The Itanium is probably 60-70% larger, so it may remain just a simulation toy for a while.

_________________
Robert Finch http://www.finitron.ca


Wed May 22, 2019 3:15 am
Thinking about adding register-pair load / store instructions to the ISA. With a wide enough bus to memory, two registers could be loaded or stored within a single memory cycle.

I vastly expanded on the templates by noting that the core shouldn’t care about what instructions are placed where. Now I think there’s a template for every possible combination of units. The count is 81 valid template codes. If one wants to put the instructions in an order that executes poorly, there’s nothing preventing that.

Got enough coded to run the design through the synthesizer for a rough estimate of size. The first synthesis result returned 41,000 LUTs, indicating it should fit easily in the FPGA. I was quite skeptical of this result. After making some improvements, a later synthesis returned 26,000 LUTs, but I noticed that a number of things got left out. So there’s a signal screwed up somewhere.

_________________
Robert Finch http://www.finitron.ca


Thu May 23, 2019 3:58 am
Expanded, corrected instruction chart. An extra bit of displacement was squeezed out for the branch instructions by noting that the last two bits of an instruction address are the same as the next two bits.
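One way to see why the low two address bits could repeat the next two is sketched below in Python. It assumes, and this is only a guess not confirmed in the thread, that an instruction address is the byte address of its slot, with the three 5-byte slots sitting at offsets 0, 5 and 10 inside a 16-byte bundle. Those offsets are 0000, 0101 and 1010 in binary, so the bottom two bits carry no extra information. The helper names are hypothetical.

```python
SLOT_BYTES = 5      # one 40-bit slot
BUNDLE_BYTES = 16   # one 128-bit bundle


def slot_address(bundle_index, slot):
    """Byte address of a slot, assuming slots sit at offsets 0, 5, 10."""
    return bundle_index * BUNDLE_BYTES + slot * SLOT_BYTES


def low_bits_redundant(address):
    """True when address bits 1:0 equal address bits 3:2."""
    return (address & 0b11) == ((address >> 2) & 0b11)


def compress(address):
    """Drop the two redundant low bits before storing the address."""
    return address >> 2


def expand(compressed):
    """Rebuild the full address by copying bits 3:2 back into bits 1:0."""
    return (compressed << 2) | (compressed & 0b11)
```

Under that assumption every valid instruction address survives a compress / expand round trip, which is the kind of redundancy a branch encoding can exploit for extra displacement.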
Attachment: QuickRef2a.png (File comment: QuickRef2a)
Attachment: QuickRef2b.png (File comment: QuickRef2b)

_________________
Robert Finch http://www.finitron.ca


Thu May 23, 2019 4:35 am
A bare-bones version of the core is looking to be around 75,000 LUTs. Smaller than I expected. I managed to trim about 15,000 LUTs off the design by not implementing an instruction fetch buffer; instead, the output of the cache is used directly. The cost is performance, I'd guess.
Spent considerable time getting the code to a usable point.

_________________
Robert Finch http://www.finitron.ca


Fri May 24, 2019 6:43 am
Made a first pass at converting the assembler and compiler from FT64 to the Itanium. It looks like code is about 50% larger than on the FT64, a good chunk of that due to the lack of compressed instructions. I missed some of the template combinations in my first pass and it took several tries to get them all.
I changed some of the nomenclature, mainly for load and store operations. They now follow Knuth’s suggestion of using Greek numerical prefixes. For instance, loading a 64-bit word is LDO, standing for load octa-byte. The two unusual names are load deci-byte (LDD) and load penta-byte (LDP). There’s a similar convention for stores.

_________________
Robert Finch http://www.finitron.ca


Sat May 25, 2019 2:58 am
Made the branch instructions absolute-address instructions. They have a 23-bit address range, which is probably good enough given that memory management hardware would be present. (The top address bits of the instruction pointer are unchanged by branches.) One only has to be careful that routines don’t span an 8 MB boundary.
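A small Python sketch of how such an absolute branch could form its target, and how one might check a routine against the 8 MB boundary. It assumes the 23-bit field simply replaces the low 23 bits of the instruction pointer as a byte address; the exact encoding is not given in the thread, and the function names are made up for the example.

```python
ADDR_BITS = 23
REGION_MASK = (1 << ADDR_BITS) - 1   # 0x7FFFFF, i.e. an 8 MB region


def branch_target(ip, addr_field):
    """Keep the top bits of the instruction pointer, replace the low 23 bits."""
    return (ip & ~REGION_MASK) | (addr_field & REGION_MASK)


def spans_boundary(start, length):
    """True if a routine of `length` bytes at `start` crosses an 8 MB boundary."""
    return (start >> ADDR_BITS) != ((start + length - 1) >> ADDR_BITS)
```

For example, a branch executed at 0x12345678 with an address field of 0x000100 lands at 0x12000100, and a 0x20-byte routine starting at 0x7FFFF0 would straddle the boundary.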

It turns out I missed a couple of template combinations, so I finally decided to just calculate everything. I found a really useful macro on the web that’s supposed to figure out all the combinations, but it only came up with 120. The number of ordered arrangements of three units drawn from five unit types, with repetition allowed, is 5^3 = 125. I wrote an Excel spreadsheet macro to determine all the different combinations and generate the code for tables in “C” for the assembler, Verilog for the RTL code, and a colorized table for documentation. I should have done that in the first place as it was less work.

The template mechanism can support up to six functional units (6^3 = 216 combos). Seven units would require 7^3 = 343 combos, more than the 256 an 8-bit template field can encode, so they can’t be fully supported.
All possible combos are needed for templates because it’s a superscalar core, not VLIW, and the core can queue up instructions of any kind even if it isn’t ready to execute them yet.
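The counting above can be reproduced with a few lines of Python. This is only a sketch of the idea, not the spreadsheet macro mentioned in the post; the unit type names and the numbering of the template codes are placeholders.

```python
from itertools import product

TEMPLATE_BITS = 8
SLOTS = 3


def enumerate_templates(unit_types):
    """Every ordered assignment of a unit type to each of the three slots."""
    return list(product(unit_types, repeat=SLOTS))


def fits_template_field(num_unit_types):
    """True if all combinations can be encoded in the 8-bit template field."""
    return num_unit_types ** SLOTS <= (1 << TEMPLATE_BITS)


# Hypothetical unit type names, used only for illustration.
UNIT_TYPES = ["B", "I", "F", "LD", "ST"]

templates = enumerate_templates(UNIT_TYPES)
# Assign template codes in enumeration order (an arbitrary choice).
template_code = {combo: code for code, combo in enumerate(templates)}
```

With five unit types this yields 125 distinct templates; `fits_template_field(6)` holds (216 combos) while `fits_template_field(7)` does not (343 combos).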

Well I got as far as running simulations today after about 1001 fixes.

_________________
Robert Finch http://www.finitron.ca


Sun May 26, 2019 3:06 am
Fixed a lot of bugs today. Simulation runs up to about 75 µs now (100 instructions). I found a bug that FT64 has in common and went back and updated that core as well.
Added the BYTNDX and WYDNDX instructions which search for a byte or a wyde within a register and return the index if found or -1 if not found. A goal is to support at least all the instructions of the FT64. I tried coding a strlen() function to get a feel for how well the instructions would work. It processes five characters at a time with a single load instruction to grab the characters.
Code:
/* strlen function */
#include <string.h>

size_t (strlen)(const char *s)
{   // find length of s[]
   __asm {
      push     $t1
      push     $t0
      ldi      $t0,#0            ; $t0 = memory offset / length
      ldd      $t1,30[$sp]       ; $t1 = char pointer
      ldd      $v0,[$t1]         ; fetch first word
.0002:
      wydndx   $v1,$v0,$r0       ; get index of null char
      bge      $v1,$r0,.found
      add      $t0,$t0,#10       ; increment memory offset
      ldd      $v0,[$t1+$t0]     ; fetch another strip of characters
      bra      .0002
.found:
      shr      $t0,$t0,#1        ; adjust for 16 bits per char
      add      $v0,$t0,$v1       ; add the wyde index
      ldd      $t0,[$sp]         ; restore $t0
      ldd      $t1,10[$sp]       ; restore $t1
      ret      #30               ; 2 temps + arg
   }
}
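For anyone wanting to follow the loop without the assembler, here is a rough behavioural model in Python. It assumes five 16-bit wydes per 80-bit register, little-endian loads, and an index counted from the least significant wyde; none of these details are spelled out in the post, and the Python function names are invented for the sketch.

```python
REG_BITS = 80
WYDE_BITS = 16
WYDES_PER_REG = REG_BITS // WYDE_BITS   # five characters per load


def wydndx(reg, wyde):
    """Index of the first wyde in `reg` equal to `wyde`, or -1 if absent."""
    for i in range(WYDES_PER_REG):
        if ((reg >> (i * WYDE_BITS)) & 0xFFFF) == (wyde & 0xFFFF):
            return i
    return -1


def load_deci(memory, offset):
    """Model of LDD: read 10 bytes (assumed little-endian), zero past the end."""
    chunk = memory[offset:offset + 10].ljust(10, b"\0")
    return int.from_bytes(chunk, "little")


def strlen16(memory):
    """Length in 16-bit characters, scanning one 80-bit strip per iteration."""
    offset = 0
    while True:
        index = wydndx(load_deci(memory, offset), 0)
        if index >= 0:
            return (offset >> 1) + index   # bytes -> wydes, plus index in strip
        offset += 10
```

Feeding it `"hello, world".encode("utf-16-le") + b"\0\0"` takes three loads: two full strips without a terminator, then a hit at index 2 in the third.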

_________________
Robert Finch http://www.finitron.ca


Mon May 27, 2019 3:59 am
Fixed a few more bugs. Fortunately, at this stage most of the bugs are really simple to fix. Typos and such. Got the simulation running past 100 instructions now. Updated the documentation. Wrote a little utility in C# to decompose some of the opcode fields given an opcode, to make it a bit easier to hand verify.

_________________
Robert Finch http://www.finitron.ca


Tue May 28, 2019 3:26 am
I synthesized the design with all the most recent fixes, then voila, as expected: 155,746 LUTs. The design is too large for the FPGA.

The following were using too many resources:
branch target buffer (28%)
register file (11%)
and surprisingly
bus randomizer (8.5%)
Together these components represented almost 50% of the design.

The branch target buffer uses a lot of LUTs because of its triple write-ported input. Instead of triple write ports on the input, the design was modified to use a 4x write clock and a single port.

The register file was made out of LUTs. In FT64 block RAMs were used, because otherwise the register file is too large. When implemented with block RAMs the bonus is almost free support for multiple register sets. So back to the multi-register sets of the FT64.

The bus randomizer can simply go. It was present as a means to mitigate virus threats. The likelihood of a virus on my personal version of the core is pretty remote.

Making these changes resulted in a synthesis size of 101,303 LUTs. That might fit in the FPGA with a couple of simple peripherals.


The compiler was modified to use just a single ref type rather than a separate ref type for each primitive type. The compiler now refers to the type field associated with the ref to get the size of the type. It turns out otherwise that it would have been necessary to add a whole bunch more ref types to support the additional sizes of primitives. The compiler is being simplified because of the use of a unified integer and floating-point register array.

_________________
Robert Finch http://www.finitron.ca


Wed May 29, 2019 2:30 am
Added large immediate constants to the design. The constants occupy slots one and two of the bundle allowing an 80-bit constant to be supplied to an instruction in slot zero. The constants may be used with either integer or floating-point instructions. When used, the entire bundle queues as a single unit, so it only requires one open queue slot. For integer instructions the usual immediate constant is ignored. For floating point instructions there’s an extra floating-point format to support the constant. One unusual feature is that encoding a normal FP instruction instead of one that uses a constant will cause the constant values to be treated like NOP instructions. So, the bundle format can be used to “hide” other information.
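The arithmetic behind the 80-bit constant is simply two 40-bit slots side by side. A minimal Python sketch follows, assuming slot one carries the low 40 bits and slot two the high 40 bits; the post does not say which half goes where, and the function names are hypothetical.

```python
SLOT_BITS = 40
SLOT_MASK = (1 << SLOT_BITS) - 1
CONST_MASK = (1 << (2 * SLOT_BITS)) - 1   # 80 bits


def split_constant(value):
    """Split an 80-bit constant into (slot1, slot2) payloads.

    Assumption: slot1 holds the low 40 bits, slot2 the high 40 bits.
    """
    value &= CONST_MASK
    return value & SLOT_MASK, value >> SLOT_BITS


def join_constant(slot1, slot2):
    """Rebuild the 80-bit constant from the two slot payloads."""
    return ((slot2 & SLOT_MASK) << SLOT_BITS) | (slot1 & SLOT_MASK)
```

Because the two payloads are ordinary 40-bit slot contents, the same bits can equally be read as two instructions, which is what lets a bundle "hide" other information when slot zero does not ask for the constant.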

Only the L1 instruction cache was working properly. The fact that L2 wasn’t working was hidden because loads fill both L1 and L2 at the same time. It’s interesting that the core was able to run for about 100 instructions before a load from L2. There were several fixes to the instruction cache controller.

_________________
Robert Finch http://www.finitron.ca


Thu May 30, 2019 3:55 am
Unified the memory load and memory store units. They were using two different unit types, which have now been merged into a single unit type. This reduces the number of templates required. In order to obtain more opcode bits for encoding memory operations, two bits were stolen from the constant field. This limits memory operations to a 20-bit constant.

I found out the little utility I wrote in VBA to generate templates didn’t work properly, I think due to rounding of integer values by BASIC. Some templates were missing and others were duplicated. I rewrote the utility in C# as part of the opcode utility.

The documentation for the core has been updated. Most instructions are documented now.

The DIF instruction was added which performs a subtract then takes the absolute value of the result.

_________________
Robert Finch http://www.finitron.ca


Fri May 31, 2019 3:37 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1225
A detour because of arithmetic artefacts in Basic? That's a little frustrating, but somehow very authentic!


Fri May 31, 2019 5:06 am
After my short detour I decided to work mainly on floating-point today.

Added the FMIN and FMAX instructions. How I missed those before is a mystery. They find the minimum and maximum of three values.
Added the sigmoid and reciprocal approximate instructions to the ISA. These functions are simply using pre-calculated lookup tables. I’m guessing that the preferred method of performing a divide would be to approximate the reciprocal and use Newton-Raphson iterations. The divide instruction takes about 150 clock cycles, but multiply only takes about 8 clock cycles. I'd like to add more approximation functions but I'm not enough of a numerics nut to know which ones would be the best to have.
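To make the reciprocal-plus-Newton-Raphson idea concrete, here is a small Python sketch. The table size (64 entries) and the way the table is indexed are assumptions for illustration only, and the real instruction works on the core's 80-bit format rather than Python doubles. Each iteration x = x * (2 - d * x) roughly doubles the number of correct bits.

```python
import math

TABLE_BITS = 6   # hypothetical table size: 64 entries

# Reciprocal of the midpoint of each mantissa interval in [1, 2).
RECIP_TABLE = [1.0 / (1.0 + (i + 0.5) / (1 << TABLE_BITS))
               for i in range(1 << TABLE_BITS)]


def recip_estimate(d):
    """Table-lookup approximation of 1/d, good to a little under 1%."""
    if d == 0.0:
        raise ZeroDivisionError("reciprocal of zero")
    mantissa, exponent = math.frexp(abs(d))   # abs(d) = mantissa * 2**exponent
    mantissa *= 2.0                           # rescale mantissa into [1, 2)
    exponent -= 1
    index = int((mantissa - 1.0) * (1 << TABLE_BITS))
    estimate = math.ldexp(RECIP_TABLE[index], -exponent)
    return -estimate if d < 0 else estimate


def newton_divide(a, d, iterations=3):
    """Compute a / d from the table estimate plus Newton-Raphson refinement."""
    x = recip_estimate(d)
    for _ in range(iterations):
        x = x * (2.0 - d * x)   # multiplies and a subtract only, no divide
    return a * x
```

Each refinement step costs two multiplies and a subtract, so with an 8-cycle multiplier a few iterations would plausibly still come in well under the roughly 150 cycles quoted for the divide instruction, though that depends on how the steps are scheduled.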
Changed the encoding of floating-point operations. Added a format that looks up common constants like e and pi from a small lookup table. I need more constants to fill the table with. There are 64 available spots.
I started working on an emulator for the core, this time written in C#. I’ve been wanting to improve my C# skills.

_________________
Robert Finch http://www.finitron.ca


Sat Jun 01, 2019 3:40 am
The shifting ADD, AND, and OR instructions were modified to shift by multiples of 20 bits rather than 22. I thought it would work better if the shifts corresponded to a quadrant of the register. For some SIMD operations a register will be treated as four 20-bit values. A shifting AND operation can then be used to mask off one of the 20-bit values.
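A quick Python model of the quadrant idea. It assumes the shifted AND pads the immediate with ones outside the selected quadrant so the other three lanes pass through untouched; the post does not say how the real instruction treats the remaining bits, and the function names are made up for this sketch.

```python
LANE_BITS = 20
LANES = 4
LANE_MASK = (1 << LANE_BITS) - 1
REG_MASK = (1 << (LANE_BITS * LANES)) - 1   # 80 bits


def lanes(reg):
    """Split an 80-bit register value into four 20-bit lanes, lowest first."""
    return [(reg >> (LANE_BITS * i)) & LANE_MASK for i in range(LANES)]


def and_shifted(reg, imm20, quadrant):
    """AND one quadrant with a 20-bit immediate, leaving the others alone.

    Assumption: bits outside the selected quadrant are treated as ones.
    """
    shift = LANE_BITS * quadrant
    mask = ((imm20 & LANE_MASK) << shift) | (~(LANE_MASK << shift) & REG_MASK)
    return reg & mask
```

With that reading, `and_shifted(reg, 0, 2)` clears the third lane and keeps the rest, which is one way to "mask off" a single 20-bit value.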

The data ready signal for the ALUs was only pulsing when the ALU operation was complete. That meant that newly incoming instructions being queued after the pulse couldn’t make use of the ALU output and had to wait until results appeared on the commit bus. This is a performance issue. Tried several fixes to this, but couldn’t get the core to work properly.

Changed some of the dependency checking logic into a matrix rather than discrete logic. This will allow configuring the number of elements that may queue at once using numeric constants rather than having to write more code. I hope to have the core set up to run six ways parallel. Six is about the max ILP (instruction level parallelism).

_________________
Robert Finch http://www.finitron.ca


Sun Jun 02, 2019 2:37 am