


 [ 542 posts ]  Go to page Previous  1 ... 6, 7, 8, 9, 10, 11, 12 ... 37  Next
 Thor Core / FT64 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1528
Location: Canada
I should mention there are a number of highly useful decodes in the “limited decode” area, including the decodes for register file access, which have to be done before instructions queue. Suspecting there might be some sort of problem with decoding, I went back and found two or three more signals that made sense to push back to an earlier decode point. This shaved about 5,000 LUTs off the size of the implementation. I immediately used the LUT bonus to implement better synchronization primitives.

Simulation of the core craps out at about 56,000 ns, after executing a measly 225 instructions, because the frame pointer has a zero in it: the program tries to write to a zero-based offset and there isn’t any memory at that offset in the test bench. The frame pointer is loaded with a zero from stack memory. Exactly when the stack gets written with a zero hasn’t been determined yet, but I’m assuming it’s a software problem of some sort. So I’ve spent some time working on the software emulator to make it easier to trace through the program.

It looks like the simple timer-based list box display of machine code isn’t working anymore under Windows 10. I can single-step through the software, but animated stepping no longer works; the list box was being updated by timer events. I had put together a simple display using list boxes, but perhaps it’s now time for something more complex that works a bit better. What I need is a display box that can show running machine code. If it were a DOS app it would sit in a loop checking for a keystroke or mouse press, but since it’s Windows, things have to be done a little differently. One cannot just create a loop in a program or Windows would hang.

_________________
Robert Finch http://www.finitron.ca


Sun Jan 21, 2018 9:16 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1528
Location: Canada
I was wondering why the CPI for the FT64 core was so terrible given that it’s supposed to be able to execute two instructions at a time. It turns out the instruction cache was thrashing because it was only using a single cache line. The i-cache is 64-entry fully associative and replaces the oldest entry, but I forgot to connect up the line that increments the replacement counter, so it was frozen on a single line.

The branch predictor wasn’t receiving a proper branch status, so the prediction wasn’t being updated. Every branch was being predicted as not-taken.

In the boot-up code the core is averaging about 4 to 5 clocks per instruction. This sounds really lousy, but there’s a lot of cache loading going on during boot-up and it takes about 18 cycles to load a cache line: 10 cycles to load L2 (4 memory accesses), then 8 extra cycles loading L1. L1 needs the extra time to detect a hit in L2. The CPU runs in spurts and grinds to a halt waiting for cache loads.

Stores were set up not to be done until it was known whether or not the store would raise an exception, which is only known at the end of the memory access cycle. This prevented subsequent memory operations from proceeding and tended to serialize the core. Stores have now been altered to be marked done as soon as they are issued, allowing subsequent memory operations to issue in following cycles. Instead, stores are not allowed to commit to the state of the machine until it is known whether or not they raised an exception. To do this, another state flag had to be added. Instructions may now be done, allowing following instructions to proceed, and yet not be committable to the machine state pending the detection of exceptional conditions.

_________________
Robert Finch http://www.finitron.ca


Thu Jan 25, 2018 4:41 am WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1644
An interesting and illuminating mistake! To correctly size the caches you would, I think, pretty much need to run some experiments or simulations with different sizes. Are both L2 and L1 on the FPGA?


Thu Jan 25, 2018 11:15 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1528
Location: Canada
Quote:
Are both L2 and L1 on the FPGA?

Both L1 and L2 are on the FPGA. The reason there are two is partly academic; I simply wanted to experiment with multi-level caches. Block RAM latency is greater than a single cycle unless one omits the output register and uses an inverted clock trick, and omitting the output register may decrease performance. There is also a multiplexor on the cache output, made out of LUTs, to feed the core interrupt instructions when an interrupt occurs and NOPs on a cache miss. Since there’s a multiplexor layer on the cache output anyway, I decided to try to integrate an additional cache. I should check the CLB slice diagram, but the idea is to see if the LUTs for the cache can be contained in the same slice as the multiplexor. In other words, it might come “for free”. The L1 cache, made out of distributed RAM, is readable right away (zero latency).

Bigger caches are generally better. L1 is really too small, but it’s limited by the size of a small distributed RAM (64 entries) that will fit in a single slice. A bigger RAM would have a slower clock cycle time (more slices, more routing), meaning it might be better just to use a single-level cache made out of block RAM. Anyway, I wanted to experiment. Another reason for L1 is that the cache is dual-ported in order to be able to read two instructions at once; L1 adds the extra read port. (In the x4 version there need to be four read ports.) The L2 cache is only single-ported, meaning it can be twice the size it would be if it were dual-ported, and bigger is better.

I’ve been thinking about increasing / modifying the cache in order to try to improve performance. Right now the block RAM cache makes use of an additional output register, so there’s a two-cycle read latency, meaning it takes 3-4 cycles for an access when checking for a hit is included. Unfortunately, on an L1 miss it has to check for a hit in the L2 cache both before and after L2 is loaded. The L1 cache loads much faster than 18 clocks when there’s a hit in L2.

_________________
Robert Finch http://www.finitron.ca


Thu Jan 25, 2018 5:41 pm WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1644
IIRC one of the early SPARCs could act on a partially-filled cache line - the addresses to fill a line in response to a cache miss were permuted so the missing word was fetched first. Whether that helps much, I don't know, as there's probably a fair chance that the next access is going to hit in the same line. Might be a small win though - evidently the SPARC people thought so.


Thu Jan 25, 2018 5:45 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1528
Location: Canada
Quote:
IIRC one of the early SPARCs could act on a partially-filled cache line - the addresses to fill a line in response to a cache miss were permuted so the missing word was fetched first.
Sounds like a challenge. So I tried implementing it.

I coded critical-word-first loading. The i-cache used to have only a single address port; now it has separate ports for read and write. The CAM memory also had to be altered to add an additional read port. The size of the L1 cache was also increased to 8kB from 2kB.

And... I got it working in simulation. Then I tried synthesizing the core. Well, with all the “minor” changes the size of the core exploded to 202,000 LUTs from 103,000. It seems that each 2kB of the “small” cache uses about 25,000 LUTs. Most of the LUTs were used by the CAM memory, which must be synthesizing as regs and muxes instead of RAM memories.

The CAM memory uses so many LUTs that I decided to try a plain four-way set associative cache without CAM memories. The core was also switched back to using a single port for both reading and writing the cache. That shrunk the size of the core back down to 86,000 LUTs.

It looks like the biggest thing impacting the core is memory access: loads and stores. For example, the inner loop of memset is a store, an add, and a branch. Each loop iteration takes 9 clocks to perform the store operation, and the add and branch operations are hidden. So the CPI works out to 3 even though the add and branch are being done in parallel with the store.

Code:
                           public code _memsetH:
                    
FFFC4114 00800530             beq      r20,r0,.xit
FFFC4118 8B880802             mov      r1,r0
                                    .again:
FFFC411C 50530C82             sh      r19,[r18+r1*4]
FFFC4120 00010844             add      r1,r1,#1
FFFC4124 FF84A071             bltu   r1,r20,.again
                                    .xit:
FFFC4128 8B880C82             mov      r1,r18
FFFC412C 0000EFE9             ret

_________________
Robert Finch http://www.finitron.ca


Fri Jan 26, 2018 9:01 pm WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1644
(That's a large implementation - what's the turnaround time for a synthesis?)

Things like memset and memcpy can be quite a test - they touch memory just once, so allocating cache lines can be unhelpful. I think some ISAs offer special loads and stores for these cases.

Also, I think the enormous implementations from Intel have lots of logic to spot patterns in memory accesses, to avoid wasting accesses and to prefetch more effectively. I wouldn't expect a homebrew machine to try that.


Fri Jan 26, 2018 9:27 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1528
Location: Canada
Quote:
(That's a large implementation - what's the turnaround time for a synthesis?)
It usually takes about 15 min. for the core when the size is < 90,000 LUTs. I think it took about an hour to synthesize at 200,000 LUTs. I've found that the size of a core can be large and it'll still synthesize quickly. Of course, Vivado is running on a 3.4 GHz quad-core with 12 GB of RAM. It usually uses about 3 GB of RAM.

Quote:
I think some ISAs offer special loads and stores for these cases.
It is possible to turn the data cache off (but not the instruction cache!) by writing a bit in a control register.

Thor has a set of load-volatile instructions for uncached data loads. There isn't quite enough room in six bits to include all the instructions I'd like to see at the root level, but seven bits is too wasteful. I'd like to see the setxx instructions at the root level too. The instruction set tries to encode 70 to 80 instructions in six bits.

I got the core to run in an FPGA. It clears the screen then hangs. Coincidentally, the same thing happens in the software emulator version, so I think it's a software problem.

_________________
Robert Finch http://www.finitron.ca


Fri Jan 26, 2018 10:42 pm WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1528
Location: Canada
Given that memory operations seem to be limiting core performance, I decided to put some more elbow grease into that aspect of the core. The core has now been modified to issue up to three memory requests per clock cycle. The single RAM bus in FT64 was also changed to three parallel busses; previously, data cache outputs were multiplexed sequentially onto a single bus. The changes added about 6,000 LUTs to the implementation, but they trim at least one cycle off of RAM accesses when there are multiple RAM requests at the same time. The core does the loads out of order and tends to do the newest load first when multiple loads are possible. That means the core probably walks backward through memory more often than forward.

Also added to the core were load instructions that bypass the cache. The offset field of these load instructions is shorter (12 bits) to increase the space available for the opcodes. The core already uses a 12 bit immediate field for the set instructions, so the same instruction format was used.

_________________
Robert Finch http://www.finitron.ca


Sat Jan 27, 2018 10:29 pm WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1528
Location: Canada
The core’s RSB isn’t working very well. It works for a couple of entries, then gets off track somehow. The core still operates properly, but return operations are then turned into multi-cycle operations instead of single-cycle ones. I think the problem is that if a return instruction is fetched speculatively, then ends up being stomped on in the queue, the core doesn’t adjust the return stack pointer for the stomped-on instruction.
So for a sequence like the following, the core fetches ahead to the ret instruction before it does the call, pops the RSB stack for the ret, then decides to stomp on the ret instruction, which ends up not being executed. The result is that the RSB stack is off by one.
Code:
calltest3:
   sub      sp,sp,#8
   sw      lr,[sp]
   call   calltest2
   lw      lr,[sp]
   add      sp,sp,#8
   ret


Added stomped-on-ret logic to the core, but it didn’t seem to make a difference.

Read a little about SMT. To support SMT I think it only requires additional program counters and a little bit of thread logic. So FT64 sprouted SMT wings. Coded, but yet to be tested.

_________________
Robert Finch http://www.finitron.ca


Mon Jan 29, 2018 8:04 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1528
Location: Canada
In theory the FT64 core now supports fine grained SMT by optionally running two threads simultaneously every clock cycle. Turning on SMT reduces the performance of an individual thread but may increase overall performance.

There are two program counters in FT64, one for each fetch buffer set. In ordinary operation the second program counter follows the first one, incremented by four, so that it points at the second program word. For SMT operation the program counters operate independently. Two independent register sets are also used with SMT turned on; they are an odd/even register set pair out of a number of available register sets. Turning SMT on and off requires careful manipulation of the program counters. As soon as SMT is turned on, the second program counter begins pointing to its own addresses. Some sort of ramp is required because the exact value of the program counters may not be known and the core will have fetched ahead a number of instructions. Also necessary is the ability to find out which thread is actually running, so there are a couple of bits reserved in the status register for that purpose. One might wonder how, if both threads are running at the same time, there could be a different value in the status register for each thread, but there is.
Presumably one wants to run different code rather than have the same code executing twice. That requires a branch operation based on the thread number (0 or 1).

Switching SMT off will be quite a trick because the locations of the two program counters have to be synchronized. If there is a master/slave thread arrangement, the slave thread could end in an infinite loop with a branch-to-self instruction. Then the master thread could turn off SMT knowing the state of the slave.
Maybe an instruction (wait?) that acts like a gate which doesn’t allow the program to proceed any further unless both program counters are synchronized would help.

Due to the simplicity of the implementation if there’s an instruction cache miss on either thread, both threads stall waiting for the cache to load. This situation may improve in the future by having each port of the i-cache operate independently.

_________________
Robert Finch http://www.finitron.ca


Tue Jan 30, 2018 7:49 am WWW
User avatar

Joined: Tue Jan 15, 2013 5:43 am
Posts: 186
Quote:
Turning SMT on and off requires careful manipulation of the program counters. As soon as SMT is turned on the second program counter will begin pointing to its own addresses. Some sort of ramp is required because the exact value of the program counters may not be known

Can you talk more about that, please, Rob? My notions of multithreading aren't very well informed but I would've thought the second thread starts and ends the same way the first one does. IOW a reset vector would be fetched and execution would proceed either forever or until a Halt instruction is encountered. Is there some other goal or tradeoff I've overlooked?

_________________
http://LaughtonElectronics.com


Wed Jan 31, 2018 7:12 pm WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1528
Location: Canada
Quote:
My notions of multithreading aren't very well informed but I would've thought the second thread starts and ends the same way the first one does. IOW a reset vector would be fetched and execution would proceed either forever or until a Halt instruction is encountered. Is there some other goal or tradeoff I've overlooked?
It's a tradeoff. I think the above is the way it would work if it were not possible to turn SMT on and off. I wanted the option to turn SMT on or off, which makes it a bit trickier. Turning SMT off would boost the performance of an individual thread. However, I'm thinking now it may not be such a good idea to have on/off capability because it adds logic to the core (it's another if / else test), which may actually reduce performance. With SMT on, each thread runs at about half the performance an individual thread would.

Initially I had a separate set of program counters for the second thread, and it started running from the reset vector when SMT turned on. That meant there were four program counters in the core. Except I noted there are effectively already two program counters (one for each fetch buffer), so I decided to try to reuse that hardware. The problem with reusing the existing hardware is that the second program counter always points to the second instruction to execute when SMT is off. So when SMT is turned on, it's automatically already pointing to the second instruction to execute. The tradeoff is that it would take additional hardware to reset it to the reset address, adding complexity, and I figured it could be handled with software. It's always possible to do a jump to the reset address in the second thread if that's what's desired. The way it works in theory right now is a bit like a fork operation.

I've set SMT operation aside for a bit because it's not working properly. It misses a branch instruction, seems to skip over it, and I don't know why yet. It seems to me that SMT-like results could be achieved more simply with multiple cores.

I’m wondering if the issue logic in the core can be reduced. I have found out that ILP (instruction level parallelism) is about six max. (ref: Mike Johnson, Superscalar Microprocessor Design, Prentice-Hall, 1991, ISBN 0-13-875634-1; I haven’t actually read the book but found the reference on the web). That means that with eight queue entries, the vast majority of the time only the first six instructions might be ready to execute. Following instructions typically have to wait for the results of previous ones; I’ve seen this by reviewing simulator output. The instructions just don’t issue because their inputs aren’t ready. So I’m thinking the issue logic could be reduced to checking for issue only for the first six instructions of the queue. Checking for issue of the last two instructions is somewhat pointless because it’s unlikely they’d be ready to execute anyway. If the last two instructions queued turn out to be ready to issue, there would be some loss of performance, but the core would still operate properly. This could remove a substantial amount of logic from the core, as the amount of logic required grows geometrically with the number of queue entries.

_________________
Robert Finch http://www.finitron.ca


Thu Feb 01, 2018 8:02 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1528
Location: Canada
When SMT is turned on, the core only fetches one instruction at a time per thread rather than two, but it fetches instructions for both threads at the same time. So each thread only needs a single program counter. When SMT is turned back off, the second program counter snaps back to pointing four beyond the first one. Turning SMT off while a second thread is executing would not be good, then, because that instruction stream would stop executing immediately regardless of whether the thread was actually done or not.

_________________
Robert Finch http://www.finitron.ca


Thu Feb 01, 2018 8:46 am WWW
User avatar

Joined: Tue Jan 15, 2013 5:43 am
Posts: 186
Alright, I think I understand now. Starting and stopping the second thread isn't a problem in itself. But the hardware used by the new thread (PC etc) has other responsibilities when SMT is off -- and the trickiness has to do with suspending and resuming those other responsibilities. When the second thread starts there's something else which has to stop. And vice versa.

Maybe you've already thought of this, but in addition to the two main modes (SMT on and SMT off) maybe there should be a pair of intermediate modes that you pass through while making the transition. (Ramping up SMT isn't the same as ramping it down.) Like:

  • [SMT off] -> [off-to-on transition mode] -> [SMT on]
  • [SMT on] -> [on-to-off transition mode] -> [SMT off]

Quote:
Maybe an instruction (wait?) that acts like a gate which doesn’t allow the program to proceed any further unless both program counters are synchronized would help.
OK, you did already think of it. This "gate" sounds like a transition mode. But is it really necessary to synchronize the PC's? It still seems to me the new thread could just grab a reset vector.

I'm still confused, but I'm confused on a higher level! :D

_________________
http://LaughtonElectronics.com


Thu Feb 01, 2018 2:13 pm WWW