 Thor Core / FT64 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
Having espoused all the benefits of the base and bounds system, I decided to shelve it for now. The issue I ran into was: what if data needs to be passed to another thread? As it stands, the system is based around the concept of a thread, and the thread entirely encapsulates its data. It was not possible to pass just the data to another thread without also passing a bunch of additional information that is not really relevant. Effectively, a descriptor contained too much information. Greater minds than mine have dealt with this issue already, so I figure it’s best to follow an existing system. So, I’ve gravitated back towards a classic segmented system, except that the segments include both upper and lower bounds. There are eight segment registers because four seemed like it might not be enough; FS and GS do get used to store thread-local and global data in x86 systems.

I had a friend tell me I was thinking too small with my designs.

Added the PTRDIF instruction which is an ordinary subtract operation followed by a right-shift. The idea is to determine an index value from the difference between two pointers.
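
As a rough illustration of the subtract-then-shift idea (a sketch only, not the actual FT64 definition): the function name `ptrdif`, the 64-bit wraparound and the choice of a logical shift are all assumptions here.

```python
MASK64 = (1 << 64) - 1  # assumed 64-bit register width

def ptrdif(a: int, b: int, shift: int) -> int:
    """Model a PTRDIF-style op: subtract two pointers, then shift right.

    shift is log2 of the element size in bytes. A logical shift is
    assumed, so this sketch is only meaningful when a >= b.
    """
    diff = (a - b) & MASK64   # ordinary subtract, wrapped to 64 bits
    return diff >> shift      # scale the byte distance down to an index

# Two pointers into an array of 8-byte elements (shift = 3)
index = ptrdif(0x1040, 0x1000, 3)
print(index)  # 8
```

The byte distance 0x40 divided by the 8-byte element size gives index 8, which is what a compiler would otherwise need a subtract plus a separate shift to produce.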

I sketched out what would be required in order to perform a far call, a call to an address in a different code segment. Using a segment load exception, I estimate this would require about 300 clock cycles to perform. Not fast at all, but it should be possible to do.

The SoC is busted back to a not-working state. I’m sure it’s something minor.

_________________
Robert Finch http://www.finitron.ca


Tue Dec 11, 2018 4:06 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
I've been inspired to start working on version eight of FT64 after reading this thesis, which details an x86-compatible core designed for an FPGA. I figure if they can hit 200MHz in a fast FPGA for something like an x86, it should be possible to hit 75-100MHz in a slow one for something more contemporary. FT64v7 suffers from the simplicity of its design. The scheduling logic's atrocious, and the number and types of registers leave something to be desired. I've realized that, effectively, CAMs are used all over the place in FT64v7, and they don't map well to FPGA fabric.

http://www.stuffedcow.net/files/henry-thesis-phd.pdf

FT64v8 will have split register files associated with functional units. It will also have variable length instructions. FT64v8 won't be a micro-op based design, but hopefully some of the ideas from the thesis can be used.

_________________
Robert Finch http://www.finitron.ca


Fri Dec 14, 2018 6:16 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1256
Wow, that thesis is a really good find! Needs a thread of its own.


Fri Dec 14, 2018 7:57 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
Quote:
Wow, that thesis is a really good find! Needs a thread of its own.

Posted by EricP on comp.arch along with a number of other goodies under the topic "Tomasulo Algorithm and reorder buffer for parallel vector engines"

_________________
Robert Finch http://www.finitron.ca


Fri Dec 14, 2018 2:15 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1256
Thanks Rob! That's this post which is the first response (second post) in this discussion.

I'm glad comp.arch is still healthy. I used to read it at work, and at its best I think it fits my interests nicely.


Sat Dec 15, 2018 8:47 am

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 177
Location: Huntsville, AL
Agree with Ed. That discussion, at least the part starting at the link and which I had time to read, cleared up quite a few cobwebs / misunderstandings of mine regarding the scoreboard vs Tomasulo/ROB and the implementation of precise exceptions in OOO machines. It's a subject I've been intending to study for some time, and the first responder in that thread did an excellent job of describing the differences. It should make it easier to follow through on the study of the subject when time permits.

_________________
Michael A.


Sat Dec 15, 2018 6:49 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
Pseudo-coded a microcode routine to load a segment register. It takes about 40-50 instructions. I was going to try micro-coding several routines, but then I realized there wouldn’t be any performance advantage, and it’s more complex hardware-wise to support microcode in the processing core. Instead, the routines can be coded in a small ROM.

There’s a bit of a chicken and egg paradox when it comes to loading segment registers. Ideally, loading the segment registers is done as an operating system routine. But there’s no way to get to the operating system without loading a segment register. The current solution is to designate a portion of the address space as common to all segments, so that the segment in use doesn’t matter.

An issue with segmentation is that the segment registers are really 256-bit entities by the time base, bounds and access rights are included. A segment-register-to-segment-register move requires a 256-bit wide bus in the processor. But it’s undesirable to make all the result busses in the core 256 bits wide just to support moving segments around.
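
To make the 256-bit figure concrete, here is a minimal sketch assuming the four 64-bit fields that the pseudo-code further down manipulates (base, lower, upper, acr), stored at byte offsets 0, 8, 16 and 24. The `Descriptor` class and the little-endian byte order are assumptions for illustration only.

```python
import struct
from dataclasses import dataclass

@dataclass
class Descriptor:
    """Hypothetical in-memory image of one segment descriptor."""
    base: int    # segment base address
    lower: int   # lower bound
    upper: int   # upper bound
    acr: int     # access rights word

    def pack(self) -> bytes:
        # Four unsigned 64-bit words; little-endian order is an assumption.
        return struct.pack("<4Q", self.base, self.lower, self.upper, self.acr)

    @classmethod
    def unpack(cls, raw: bytes) -> "Descriptor":
        return cls(*struct.unpack("<4Q", raw))

d = Descriptor(base=0x1000, lower=0, upper=0xFFFF, acr=1 << 63)
raw = d.pack()
print(len(raw) * 8)  # 256
```

Four 64-bit fields come to 32 bytes, which matches the `*32` scaling used to index the descriptor tables.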

In the pseudo-code, 64-bit accesses are used to manipulate parts of the segment, but this could be changed to 256-bit accesses. I’m wondering if it’d be worthwhile to allow loading 256 bits at a time directly into a segment register. The load result bus would then have to support 256 bits. But I’ve defined an opcode for 256-bit loads anyway (lo for load octa-word) to support SIMD-like instructions down the road.
Code:
; MOV ES,D31

macro   mov2es
      bit      cc7,d31,#15            ; test local / global selector flag
      jsr      LoadYs
      ; Is zero?
      ; We check only the acr word
      cmp      cc7,d30,d0
      beq      cc7,.zeroSeg@
      ; Is segment present?
      bpl      cc7,segNotPresent
      ; Now check segment type
      shr      d29,d30,#48+11
      and      d29,d29,#3
      cmp      cc7,d29,#2            ; check that YS is a data descriptor
      bne      cc7,segtypeFault
      ; now check privileges
      mov      d29,cs.acr
      shr      d29,d29,#48
      shr      d30,d30,#48
      and      d29,d29,#$FF
      and      d30,d30,#$FF
      cmp      cc7,d29,d30
      bgt      cc7,priv_fault      ; DPL must be >= CPL
.zeroSeg@
      mov      es.base,ys.base
      mov      es.lower,ys.lower
      mov      es.upper,ys.upper
      mov      es.acr,ys.acr
      rts
      endm
   

LoadYs:
      and      d31,d31,#$7fff
      bne      cc7,.0001
      ; load from global descriptor table
      ; into temporary segment register ys
      ld      d30,gdt:[d31*32]
      mov      ys.base,d30
      ld      d30,gdt:8[d31*32]
      mov      ys.lower,d30
      ld      d30,gdt:16[d31*32]
      mov      ys.upper,d30
      ld      d30,gdt:24[d31*32]
      mov      ys.acr,d30
      rts
      ; load from local descriptor table
.0001:
      ld      d30,ldt:[d31*32]
      mov      ys.base,d30
      ld      d30,ldt:8[d31*32]
      mov      ys.lower,d30
      ld      d30,ldt:16[d31*32]
      mov      ys.upper,d30
      ld      d30,ldt:24[d31*32]
      mov      ys.acr,d30
      rts
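
For readers who want to trace the checks in the routine above, here is a behavioral sketch in Python. It is only one reading of the pseudo-code, not the hardware: the bit positions (present flag in bit 63, type in bits 59-60, privilege level in bits 48-55, selector bit 15 set meaning the local table) are inferred from the shifts and masks in the listing, and `SegmentFault`, `load_data_segment` and the dict-based tables are hypothetical names.

```python
class SegmentFault(Exception):
    """Hypothetical stand-in for the fault handlers in the listing."""

PRESENT_BIT = 63        # assumed: sign bit of the acr word
TYPE_SHIFT = 48 + 11    # same shift as the listing
DATA_TYPE = 2           # value the listing compares against
PL_SHIFT = 48           # privilege level byte

def load_data_segment(selector, gdt, ldt, cs_acr):
    """Return the (base, lower, upper, acr) tuple to copy into ES, or raise.

    gdt and ldt are dicts mapping a 15-bit index to a descriptor tuple.
    """
    # Bit 15 of the selector picks the table (set is read as local here).
    table = ldt if (selector >> 15) & 1 else gdt
    base, lower, upper, acr = table[selector & 0x7FFF]

    if acr != 0:  # an all-zero acr word skips every check
        if not (acr >> PRESENT_BIT) & 1:
            raise SegmentFault("segment not present")
        if (acr >> TYPE_SHIFT) & 3 != DATA_TYPE:
            raise SegmentFault("segment type fault")
        cpl = (cs_acr >> PL_SHIFT) & 0xFF
        dpl = (acr >> PL_SHIFT) & 0xFF
        if cpl > dpl:  # DPL must be >= CPL
            raise SegmentFault("privilege fault")
    return (base, lower, upper, acr)

present, data = 1 << 63, 2 << 59
gdt = {1: (0x1000, 0, 0xFFF, present | data | (3 << 48))}
print(load_data_segment(1, gdt, {}, cs_acr=present) == gdt[1])  # True
```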
      

_________________
Robert Finch http://www.finitron.ca


Mon Dec 17, 2018 6:47 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1256
For a bit of calibration, it seems the penalty for a context switch in a modern CPU and OS is at least tens of thousands of cycles and possibly hundreds of millions. Yikes.
https://stackoverflow.com/questions/218 ... ext-switch

(Probably the interesting case is not the cold-cache case with large working sets, but ping-ponging between two smallish processes.)


Mon Dec 17, 2018 7:08 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
Context switch time is quite a large number of cycles. I expect at least into the tens of thousands for FT64 by the time memory management switches are taken into account. However, thread switch time within the same context may be reasonably fast; hopefully only hundreds of cycles.

I found out that for data cache loads, byte lanes were being selected based on the instruction that caused the load. The core should have been selecting *all* the byte lanes for a cache load. I think the system worked only because the memory system ignored the byte lane selection signals for read accesses. In any case, this has now been switched to read all byte lanes, which is also slightly less logic.

Found a bug in the instruction decoding where some component variables required for the decode were being delayed by a clock cycle when they shouldn’t have been. Running the simulation with this bug led to the same kind of lockup that occurs in the FPGA, so hopefully fixing it will fix the FPGA version too. I found this bug by rewriting part of the core to merge several state variables into a single variable called 'state'.

Put some more work into version eight of the core. Rather than a compressed instruction set, version eight will simply use variable length instructions where the instruction length can be determined from the first byte of the instruction. This requires only a 3-bit wide, 256-entry lookup table to determine the length.
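
A length lookup like this can be sketched in a few lines. Everything specific here is assumed for illustration: the table contents are placeholders (the real opcode map is not given), and each 3-bit entry is taken to hold length minus one, so lengths run from 1 to 8 bytes.

```python
# Placeholder table: 256 entries of 3 bits each. The real contents depend
# on the opcode map, which is not specified; here the top three bits of
# the first byte are used purely so the example has something to decode.
LENGTH_TABLE = [(opcode >> 5) & 7 for opcode in range(256)]

def insn_length(first_byte: int) -> int:
    """Length in bytes, assuming the entry stores (length - 1)."""
    return LENGTH_TABLE[first_byte] + 1

def split_instructions(code: bytes) -> list:
    """Walk a byte stream, cutting it at instruction boundaries."""
    out, pc = [], 0
    while pc < len(code):
        n = insn_length(code[pc])
        out.append(code[pc:pc + n])
        pc += n
    return out

stream = bytes([0x00, 0x20, 0xAA, 0x40, 0x01, 0x02])
print([len(i) for i in split_instructions(stream)])  # [1, 2, 3]
```

The point of the scheme is that the fetch stage only has to look at one byte to find the next instruction boundary.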

_________________
Robert Finch http://www.finitron.ca


Wed Dec 19, 2018 8:36 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
Back in business. The core boots in the FPGA at least to the monitor.

Improved the branch target buffer. The BTB now queues all flow control transfer info in a manner similar to the branch predictor. It uses a FIFO to throttle the rate of flow control updates down to what a single write port in the buffer can accept. All flow control operations, including branches and returns, are now predicted by the BTB. The goal is to use the BTB as the first predictor for flow control transfers. Currently it is used in the fetch stage one clock after the current instruction is presented from the icache. This will be moved up to coincide with the icache fetch. The branch predictor can’t be moved up because the branch instruction has to be decoded before it is known that a branch is present.
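
The rate-matching idea can be modelled roughly as follows. This is a toy sketch, not the FT64 RTL: the class name `BtbUpdateQueue`, the queue depth, the dict standing in for the BTB array, and the drop-when-full policy are all assumptions.

```python
from collections import deque

class BtbUpdateQueue:
    """Toy model: many updates may arrive per cycle, one is written per cycle."""

    def __init__(self, depth: int = 8):
        self.depth = depth      # assumed FIFO depth
        self.fifo = deque()
        self.btb = {}           # pc -> predicted target (stand-in for the BTB)

    def push(self, pc: int, target: int) -> bool:
        """Queue a flow control result; returns False if it had to be dropped."""
        if len(self.fifo) >= self.depth:
            return False        # assumed policy: drop when full
        self.fifo.append((pc, target))
        return True

    def clock(self) -> None:
        """One cycle: the single write port retires at most one update."""
        if self.fifo:
            pc, target = self.fifo.popleft()
            self.btb[pc] = target

    def predict(self, pc: int):
        return self.btb.get(pc)  # None means no prediction

q = BtbUpdateQueue()
q.push(0x100, 0x200)   # two updates arrive in the same cycle
q.push(0x104, 0x300)
q.clock()
print(q.predict(0x100), q.predict(0x104))  # 512 None
```

After a second `clock()` the later update lands as well, which is the smoothing effect the FIFO is there to provide.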

After sketching out how to get segmentation to work with hardware in v8, I decided to go back and add it to v7. Segmentation allows up to a 101-bit address range, although a lot fewer bits are actually implemented. Jumping to far code segments is handled differently than on x86. An alternate segment register must be loaded with the target code segment first, then specified as a segment override to the jump or call instruction. This splits the operation up into multiple instructions; otherwise a single instruction would be too large.

_________________
Robert Finch http://www.finitron.ca


Fri Dec 21, 2018 10:38 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
Limited far calls to using one of four segment registers (ZS, ES, HS, or CS) since it doesn’t make sense to load a data segment (DS, SS) into the code segment, and that way only two bits are required to specify the segment register to use in the far call. Performing a far call is a two-step process of loading the target segment then using an override prefix with a call instruction.
Code:
mov2seg      hs,#$001234      ; load hs with the target segment
call         hs:some_function
<…> other
call         hs:another_func


I looked at having this performed with a single instruction, but that would make the instruction too large and an oddball size compared to the rest of the instruction set.

For a call operation, the selector for the current code segment is stored in the upper 24 bits of the link register, so that a second register does not have to be managed for calls and returns. This does limit code to a 40-bit address space for a single module.
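
The 24/40 split of the 64-bit link register works out as in this small sketch. The helper names `pack_link` and `unpack_link` are hypothetical; only the field widths come from the description above.

```python
SEL_BITS = 24                      # selector lives in the upper 24 bits
OFF_BITS = 64 - SEL_BITS           # leaves 40 bits for the return offset
OFF_MASK = (1 << OFF_BITS) - 1

def pack_link(selector: int, offset: int) -> int:
    """Build a link register value from a code selector and return offset."""
    if not 0 <= selector < (1 << SEL_BITS):
        raise ValueError("selector does not fit in 24 bits")
    if not 0 <= offset <= OFF_MASK:
        raise ValueError("offset does not fit in 40 bits")
    return (selector << OFF_BITS) | offset

def unpack_link(link: int):
    """Split a link register value back into (selector, offset)."""
    return link >> OFF_BITS, link & OFF_MASK

link = pack_link(0x001234, 0x12345678)
print(unpack_link(link) == (0x001234, 0x12345678))  # True
print(1 << OFF_BITS)  # 1099511627776
```

A 40-bit offset covers 1 TiB of code per module, which is where the stated limit comes from.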

Merry Christmas!

_________________
Robert Finch http://www.finitron.ca


Tue Dec 25, 2018 5:07 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1256
Indeed, Merry Christmas and season's greetings!


Tue Dec 25, 2018 11:52 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
Started working in earnest on FT64v8, going with a much simpler implementation. v8 is a straightforward non-overlapped pipeline scalar core. One goal is reducing the number of lines of code that must be managed. The purpose of the core is as an I/O or control processor.

_________________
Robert Finch http://www.finitron.ca


Sun Dec 30, 2018 6:10 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
Deleted a bunch of old stuff off the disk drive. I had files for PCBs from college circa 1986. Still working on the v7/v8 core.

_________________
Robert Finch http://www.finitron.ca


Tue Jan 01, 2019 11:30 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
Back to the old load/store quandary tonight. About a half dozen load double-word (64-bit) instructions were spec’d out for v8. The problem is that if there is an equal number of load instructions for each of the other potential load sizes, in both signed and unsigned versions, that’s a lot of instructions. Then the loads must be mirrored with store instructions on top of that. The base address mode in use is s-i-b (scaled-index-base), which works out to a 48-bit instruction. Other instructions are subsets of that mode in order to increase code density. There are a couple of possibilities for reducing the number of instructions. For instance, a size prefix could be used with a basic instruction. This decreases code density but would still allow rarely used operations to be performed. Another option, requiring additional bits in the instruction, is specifying a base register to use. Design choices to make.
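
To put a rough number on "a lot of instructions", here is a back-of-envelope count. The inputs are assumptions: six addressing forms (from "about a half dozen"), operand sizes of 8, 16, 32 and 64 bits, and signed/unsigned variants only for loads narrower than the register.

```python
def count_memory_ops(address_forms=6, sizes=(8, 16, 32, 64), reg_bits=64):
    """Return (loads, stores) under the assumptions stated above."""
    loads = 0
    for size in sizes:
        # Narrow loads need a signed and an unsigned flavor; a full-width
        # load needs no extension, so it counts once.
        variants = 2 if size < reg_bits else 1
        loads += address_forms * variants
    stores = address_forms * len(sizes)  # stores need no sign variants
    return loads, stores

loads, stores = count_memory_ops()
print(loads, stores, loads + stores)  # 42 24 66
```

With a size prefix on a basic instruction, the opcode count would drop back toward the six double-word forms plus their stores, at the cost of longer encodings for the narrower accesses.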

_________________
Robert Finch http://www.finitron.ca


Wed Jan 02, 2019 5:23 am