


 Thor Core / FT64 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Work has started on yet another ISA, this time called PB64, to overcome some shortcomings in the FT64 ISA. FT64 doesn't have room for compressed instructions, and a unified integer/floating-point register file was desired, giving 64 available registers and simplifying the compiler. A number of instructions also had to be shoe-horned into FT64 (AMO operations, volatile load operations). PB64 goes with a base 7-bit opcode rather than 6 bits, and 6-bit register specs rather than 5. Because of the increased field sizes, the instruction set is based on a 36-bit width. 36-bit instructions fit evenly into a 576-bit cache line, which is also a multiple of 64 bits. Most of the instruction set can be viewed at a glance in the following chart:
Attachment: PB64.png (PB64 instruction set chart)
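As a quick cross-check of the packing arithmetic (my own numbers, not taken from the chart): 576 bits = 16 x 36 = 9 x 64, so a line holds exactly sixteen 36-bit instructions and spans nine 64-bit words.
Code:
#include <assert.h>

/* Sanity check of the PB64 cache-line packing described above. */
int main(void)
{
    assert(576 % 36 == 0 && 576 / 36 == 16);   /* sixteen 36-bit instructions per line */
    assert(576 % 64 == 0 && 576 / 64 == 9);    /* nine 64-bit words per line           */
    return 0;
}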

_________________
Robert Finch http://www.finitron.ca





Fri Feb 23, 2018 5:09 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Is there a monster amount of multiplexing to align the instructions for the decoder? Or is that a minor part of the implementation? It's an interesting choice... can you branch into any of the instructions within the cache line?


Fri Feb 23, 2018 5:48 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Yes, there would be a monster shift register / multiplexer to align the instructions for the decoder. But there would be one anyway for any instruction size once instructions live in a cache line, so it isn't "extra" beyond what is normal. Branching to any instruction in the cache line is possible. I was going to implement compressed instructions, which means even more multiplexing. Instructions cannot span a cache line. I was also thinking of using a 288-bit cache line because it's smaller, but it doesn't break evenly into 64-bit words, only 32-bit ones. A 512-bit cache line is pretty standard, so 576 bits isn't far off.
The monster shift register in FT64 hasn't shown up on the critical timing path yet. FT64 is limited right now to about 20 MHz. The problem is that registers are updated on the negative clock edge, so there is only half a clock period to get things done; otherwise it could be running at 40 MHz. The timing limit is in the branch predictor at the moment.
One slight complication is the difference between instruction and data addresses. Instruction addresses address 18-bit compressed instructions, while data addresses use standard byte addressing, so a data address is 18/8 times the instruction address. Fortunately this is easy to calculate with just an adder and a couple of shifts. The calculation is needed for accessing data outside the cache line during a cache load; once the first data address of the cache line is calculated, the rest can just be increments. (Cache line addresses always work out to even multiples of 64 bits.)
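A minimal sketch of that conversion (my own code, assuming the instruction address counts 18-bit instruction slots): multiplying by 18/8 is just 2x + x/4, i.e. one add and two shifts.
Code:
#include <stdint.h>

/* Sketch, not the actual RTL: convert an instruction address (counted in
 * 18-bit instruction slots) to a byte address.  ia * 18/8 = 2*ia + ia/4,
 * so one add and two shifts suffice.  For the first slot of a cache line
 * (32 slots = 72 bytes = an even multiple of 64 bits) the shift by two is
 * exact. */
static uint64_t insn_to_byte_addr(uint64_t ia)
{
    return (ia << 1) + (ia >> 2);
}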

_________________
Robert Finch http://www.finitron.ca


Fri Feb 23, 2018 9:47 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Well, I've already modified the latest design, which I'm now calling FT64v3, a little. I changed the opcode size to eight bits and trimmed a bit off each instruction format to allow more compressed instructions. Now, if bit 7 of the opcode is set, it's a compressed instruction. The least significant four bits of the compressed opcode are always a register specification; this was necessary to reduce the instruction lookup table to only 8k entries, otherwise the lookup table would be too large. As it stands, much of the compressed instruction space is completely programmable at run-time, subject to a few rules. Want a multiply instead of a subtract? Just adjust the table. The table appears as writeable memory at $FFFF0000 in the processor's address space.
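A rough sketch of how I picture the table-driven expansion; only the sizes come from the post (an 18-bit compressed instruction, a 4-bit register field excluded from the lookup, an 8K-entry table mapped at $FFFF0000), while the field positions below are placeholders of my own.
Code:
#include <stdint.h>

/* Sketch only: field positions are placeholders, not the real FT64v3 layout.
 * The 4-bit register field and the compression flag bit are excluded from
 * the table index, leaving 13 bits, hence 8192 entries.  In hardware the
 * table would be the writeable memory at $FFFF0000. */
#define CTAB_ENTRIES 8192
static uint64_t ctab[CTAB_ENTRIES];                 /* full-width templates */

static uint64_t expand(uint32_t ci)                 /* 18-bit compressed insn */
{
    uint32_t reg   = ci & 0xF;                      /* assumed low 4 opcode bits     */
    uint32_t index = ((ci >> 8) << 3) | ((ci >> 4) & 0x7); /* drop reg + flag bits  */
    uint64_t full  = ctab[index];                   /* index < 8192 by construction  */
    return full | ((uint64_t)reg << 8);             /* hypothetical register slot    */
}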

FT64 was used as a base and modified to handle compressed instructions. A special unit, the instruction decompression unit (IDU), was added to the design. It gets issued to like the other functional units: it decompresses the instruction and fills in twenty or so decode fields before marking the instruction as ready to re-issue to the other units. I don't think decompressing instructions impacts performance very much; it may add to the latency of instruction execution, but likely not to the throughput.

I'm eager to test it and see if it works, but it's a ways off from being tested yet.

_________________
Robert Finch http://www.finitron.ca


Sat Feb 24, 2018 5:23 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
The following psychedelic table shows the compressed instruction formats.
Attachment: FT64 Compressed.png (FT64v3 compressed instruction formats)

_________________
Robert Finch http://www.finitron.ca


Sat Feb 24, 2018 5:31 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
The compressed instruction set may only be available when SMT is turned on. The issue is computing the next instruction address. With SMT on it's simple, because there's only one instruction, and one bit per thread, to look at to determine the PC increment (by one or two). I now know one reason why commercial processors go with SMT. With SMT off, the PC increment has to be determined by looking at two instructions, and search logic would be needed to find the next instruction. If compressed instructions aren't allowed when SMT is off, calculating the next instruction address is easy: it's just the current address plus four.
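A minimal sketch of the SMT-on case (assuming the PC counts 18-bit parcels and decode supplies a per-thread "compressed" bit):
Code:
#include <stdint.h>
#include <stdbool.h>

/* Sketch: with SMT on there is one instruction in flight per thread, so a
 * single bit from decode selects the PC increment, one parcel for a
 * compressed instruction, two for a full-width one (assumed PC units). */
static uint64_t next_pc(uint64_t pc, bool compressed)
{
    return pc + (compressed ? 1 : 2);
}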

I went to modify the assembler to support compressed instructions and, yikes, the pattern matching for compressed instructions isn't simple. Optimizing the compressed instruction set could be a challenge. I've gone back and rethought the idea of a table-driven lookup of instructions defined at run-time; it'd be a lot simpler just to define a fixed set of compressed instructions. I guess I'll have to see how far I can get with the pattern matching in the assembler. The assembler was going to be used to automatically determine the compressed instructions, since it's a bit unreasonable to hand-code a table of 8192 possible compressed instructions.
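For the pattern matching itself, one straightforward structure (a hypothetical sketch, not the actual assembler) is a list of mask/match pairs over the full-width encoding, each with its own re-encoder:
Code:
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch of compressed-instruction pattern matching in an
 * assembler: a candidate pattern matches when the masked full-width bits
 * equal the expected value, and its encode() callback emits the 18-bit form. */
struct cpattern {
    uint64_t mask, match;                 /* bits of the 36-bit form to test */
    uint32_t (*encode)(uint64_t insn);    /* produce the compressed form     */
};

static int try_compress(uint64_t insn36, const struct cpattern *pats,
                        size_t npats, uint32_t *out18)
{
    for (size_t i = 0; i < npats; i++) {
        if ((insn36 & pats[i].mask) == pats[i].match) {
            *out18 = pats[i].encode(insn36);
            return 1;                     /* emit compressed */
        }
    }
    return 0;                             /* keep full width */
}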

_________________
Robert Finch http://www.finitron.ca


Sun Feb 25, 2018 8:13 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
The barrel processor does the sort of thing you're ascribing to SMT, to a degree: you get more time to process each stage of an instruction if threads are interleaved. The XCore from XMOS has 4 threads, for example.

For me SMT is more opportunistic: the idea is to keep execution units busy, which would otherwise not be fed enough work from the front end.


Sun Feb 25, 2018 9:38 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I'm really liking the idea of a superscalar barrel processor, brought on by Ed's comment. The branch predictor, which is currently a timing problem, could be removed from the design, along with the return-address predictor, which doesn't work very well. The whole design could be made a little simpler: some of the dependency-checking logic could be removed, and the live-target logic could go too. There are already multiple register sets and multiple PCs in the design.

FT64v3 was turned into a superscalar barrel processor and, assuming I haven't made too many mistakes, it's less than half the size (44,000 LUTs instead of 110,000). There are 32 hardware threads, with two running concurrently at any time and each thread getting 1/32 of the time a single thread would. Since the queue is only eight entries deep, all the register bypass logic could be removed. Registers are always guaranteed to be valid to read because they are updated long before the next instruction on the same thread needs them, so register renaming could be removed too.

Biased, or unfair, barrel processing:
It may be possible to have a couple of the threads run more often than once every 32 ticks, since the queue is only eight entries deep. A thread could in theory be run once every eight ticks, which would be good for high-priority real-time tasks. A small hardware table to map thread priorities might be something to work on.
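A hypothetical sketch of such a table (not the actual RTL): a 32-entry slot table names the thread to issue on each tick, and a high-priority thread simply occupies every eighth slot.
Code:
#include <stdint.h>

/* Toy "biased" slot table: thread 0 gets every eighth slot, so it runs once
 * every 8 ticks; the remaining 28 slots are handed out round-robin, which in
 * this toy mapping leaves three of the 32 threads without a slot. */
static const uint8_t slot_table[32] = {
     0,  1,  2,  3,  4,  5,  6,  7,
     0,  8,  9, 10, 11, 12, 13, 14,
     0, 15, 16, 17, 18, 19, 20, 21,
     0, 22, 23, 24, 25, 26, 27, 28
};

static unsigned thread_for_tick(uint64_t tick)
{
    return slot_table[tick & 31];
}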

Some experimental results on compression: instructions were compressed based on static frequency of occurrence, using the operating system and boot ROM as the source of instructions (mainly compiled C code, 25,175 instructions in total). A limit was placed on the number of compressed instructions allowed, and a graph was made of the compression percentage versus the number of allowed compressed instructions. A lookup table with only 256 entries gave a compression of about 26%; a smaller 64-entry table gave about 19%. Using a 4k table and allocating 64 entries to each application would allow 64 different applications to use compressed instructions, and a 64-entry table is also small enough to be swapped on a context switch.

Attachment: Compress.png (compression vs. number of table entries)
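Roughly how such a figure can be estimated from the static counts (my own sketch, not the actual tooling; it assumes a compressed instruction is half the size of a full one, 18 vs. 36 bits):
Code:
#include <stddef.h>

/* Sketch: keep the table_entries most frequent distinct instructions and
 * assume every occurrence that hits the table shrinks from 36 to 18 bits.
 * counts[] holds static occurrence counts sorted in descending order;
 * total is the total number of instructions (25,175 in the experiment). */
static double est_compression_pct(const unsigned *counts, size_t ndistinct,
                                  size_t table_entries, unsigned long total)
{
    unsigned long hits = 0;
    size_t n = table_entries < ndistinct ? table_entries : ndistinct;
    for (size_t i = 0; i < n; i++)
        hits += counts[i];
    return 50.0 * (double)hits / (double)total;   /* each hit saves half its bits */
}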

_________________
Robert Finch http://www.finitron.ca


Mon Feb 26, 2018 1:17 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Oh, that's a rapid turnaround experiment - half the size sounds rather attractive.


Mon Feb 26, 2018 5:16 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
After some more analysis of compressed instructions I found that it may be better just to use a fixed set of instructions and a simple hardware decompressor. Using a lookup table seems to give better compression, but it's a lot larger hardware-wise (5,000 LUTs larger), and the hardware decompressor gives close to the same compression. The assembler was quickly modified to generate a fixed set of compressed instructions. The set was hand-chosen and is similar to the RISC-V set, which is what one would expect. After running the assembler the compression ratio was about 26.7%. I graphed the percentage each type of instruction contributed to the total ratio in a pie chart; it's a little hard to read.
Attachment: CompressPie.png (FT64 compression contribution by instruction type)

_________________
Robert Finch http://www.finitron.ca


Tue Feb 27, 2018 8:12 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
The assembler for FT64 was modified for FT64v3's 18/36-bit instructions. Some fixed-point math was needed to account for the fact that instructions are aligned on bit pairs. The assembly listing shows the instruction address followed by the data address in brackets; the data address has a hexadecimal fraction digit indicating the bit position of the address.

It's confusing to have two sets of addresses, one for data and one for code, so the design is being modified to use a single set of addresses with fractions.
Code:
                           ;----------------------------------------------------------------------------
                           ;----------------------------------------------------------------------------
                           _SyncCursorPos:
71C5594A(FFFC08E6.8) 028BF            sub      $sp,$sp,#24
71C5594B(FFFC08E8.C) 240C2          sw      $r2,[$sp]
71C5594C(FFFC08EB.0) 241C3            sw      $r3,8[$sp]
71C5594D(FFFC08ED.4) 242C6            sw      $r6,16[$sp]
71C5594E(FFFC08EF.8) 6002C4086          ldi      r6,#AVIC
71C55950(FFFC08F4.0) 6042F0000
71C55952(FFFC08F8.8) 3FF70   
71C55953(FFFC08FA.C) 00185C009          lhu      r2,_DBGCursorCol
71C55955(FFFC08FF.4) FF405C10B
71C55957(FFFC0903.C) F8025C04D
71C55959(FFFC0908.4) 001C5C009          lhu      r3,_DBGCursorRow
71C5595B(FFFC090C.C) FF405C10B
71C5595D(FFFC0911.4) F8035C04D
71C5595F(FFFC0915.C) 0C383          shl      r3,r3,#3
71C55960(FFFC0918.0) 01C83            add      r3,r3,#28
71C55961(FFFC091A.4) 0D083            shl      r3,r3,#16
71C55962(FFFC091C.8) 0C382            shl      r2,r2,#3
71C55963(FFFC091E.C) 010008204          add      r2,r2,#256
71C55965(FFFC0923.4) 2198922B1          or      r2,r2,r3
      sh      r2,$408[r6]         ;
71C55967(FFFC0927.C) 01020
71C55968(FFFC092A.0) 140C2            lw      $r2,[$sp]
71C55969(FFFC092C.4) 141C3            lw      $r3,8[$sp]
71C5596A(FFFC092E.8) 142C6            lw      $r6,16[$sp]
71C5596B(FFFC0930.C) 08380          ret      #24
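For what it's worth, here is how I read the fractional data addresses in the listing above (an assumption on my part, not taken from the assembler source): the hex fraction digit counts sixteenths of a byte, so each 18-bit instruction advances the data address by 0x2.4 bytes.
Code:
#include <stdio.h>
#include <stdint.h>

/* Sketch of the address arithmetic as read from the listing: keep the
 * address as a bit count and print it as a byte address plus a hex digit of
 * sixteenths of a byte (bit 2 prints as .4, bit 4 as .8, bit 6 as .C). */
static void print_frac_addr(uint64_t bit_addr)
{
    printf("%08llX.%llX\n", (unsigned long long)(bit_addr >> 3),
           (unsigned long long)((bit_addr & 7) * 2));
}

int main(void)
{
    uint64_t a = 0xFFFC08E6ULL * 8 + 4;    /* FFFC08E6.8 from the listing */
    for (int i = 0; i < 4; i++) {
        print_frac_addr(a);                /* ...E6.8, ...E8.C, ...EB.0, ...ED.4 */
        a += 18;                           /* one 18-bit instruction */
    }
    return 0;
}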

The table below shows branch displacements with two-bit fractions (the F2 fields). Branch prediction bits are no longer needed, so those bits were re-purposed. Note that code addresses must be a multiple of 18 bits, as that is the finest resolution the cache shifts by. A special code-alignment directive was added to the assembler to force proper alignment.
Attachment: Formats.png (FT64v3 instruction formats)

_________________
Robert Finch http://www.finitron.ca


Wed Feb 28, 2018 11:31 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Can't get FT64v3 to simulate past the jump at boot-up. It's supposed to jump to $FFFC01F5, but in the simulation it jumps to $FFFC41F5; there's an extra bit set in the middle of the address. I dumped all the variables in the simulator and everything looks correct, but the variables don't correspond to the trace output. I'm thinking this might be a bit error in the workstation. Anyway, I'm moving this project to the back burner; I don't like the idea of instructions not being aligned on byte boundaries.

Started working on another somewhat different core. A super-barrel version of the venerable 8088.

_________________
Robert Finch http://www.finitron.ca


Fri Mar 02, 2018 7:55 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
The $41F5 turned out to be a problem with the assembler's binary file generation: memory for the binary data wasn't being zeroed out, and it just happened to be almost all zero.

The FT64v3 core seems to be sequencing through instructions now. It does the boot-up jump and cycles through several instructions, including a branch. To make things a bit simpler, the core now allows instructions aligned on any bit pair. Instruction addresses are referenced in fixed point; only fixed-point add and subtract are needed by the core, and since it never multiplies or divides addresses the fixed-point math is really simple.
The core hangs on the instruction that sets a status LED indicator. The address isn't being calculated correctly; it has something to do with register update / readback. (It's an indexed-address-mode instruction that causes the problem.)

_________________
Robert Finch http://www.finitron.ca


Sat Mar 03, 2018 2:20 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
(Speaking of fixed point, and multiply/divide, I noticed the other day that XMOS' XCore instruction encoding includes some bit-packing by arithmetic - see https://en.wikipedia.org/wiki/XCore_Arc ... n_encoding
But of course with only 5 bit fields, some very small cloud of logic could do the decoding without any reference to arithmetic.)


Sat Mar 03, 2018 5:26 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
With only two bits for register specs, only four registers would be available. That would make implementing a compiler for the XMOS XCore a challenge, I think, but it would probably be okay for hand-coded assembly.

After fixing numerous errors in the assembler and RTL code, FT64v3 gets past the store instruction that was failing. The software now needs to be modified to provide different execution paths for each of the threads. Currently it runs as if there were only a single thread, but once it gets to stack operations the stack will get corrupted because all the threads are using the same stack pointer value.
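A common way to handle that (my sketch, not the actual FT64v3 start-up code) is to carve a per-thread slice out of one stack region so each hardware thread boots with its own stack pointer; the constants below are hypothetical.
Code:
#include <stdint.h>

/* Hypothetical per-thread stack setup: each of the 32 hardware threads gets
 * its own slice of a shared stack region so their pushes and pops no longer
 * collide.  STACK_TOP and STACK_SIZE are made-up values for illustration. */
#define NTHREADS   32
#define STACK_SIZE 0x1000u        /* bytes per thread */
#define STACK_TOP  0x00020000u    /* top of the whole stack region */

static uint32_t initial_sp(unsigned thread_id)
{
    return STACK_TOP - thread_id * STACK_SIZE;
}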

_________________
Robert Finch http://www.finitron.ca


Sun Mar 04, 2018 7:56 am