 Thor Core / FT64 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Started porting the code updates from nvio back to FT64. There were a lot of changes in the past couple of weeks, so there’s a lot to update. While FT64 remains basically a 64-bit machine, floating-point will use double-extended (80-bit) precision, which means wider internal busses. The core will also be altered to use address generators rather than the ALUs to generate addresses.
The implementation language is being switched to SystemVerilog. Newer versions of the files are .sv files.
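A dedicated address generator is really just a small adder separate from the ALUs. A minimal sketch (the module and signal names, and the addressing mode, are illustrative rather than the actual FT64 encoding):
Code:
module agen #(parameter AW = 64)
(
  input  logic [AW-1:0] base,   // base register value
  input  logic [AW-1:0] ndx,    // index register value
  input  logic [1:0]    scale,  // shift applied to the index (0..3)
  input  logic [AW-1:0] disp,   // sign-extended displacement from the insn
  output logic [AW-1:0] ea      // effective address
);
  assign ea = base + (ndx << scale) + disp;
endmodule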

_________________
Robert Finch http://www.finitron.ca


Sun Jun 16, 2019 2:54 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Got the data cache and write buffer ported back and widened the internal busses. Then ran a simulation to confirm nothing was broken.

_________________
Robert Finch http://www.finitron.ca


Mon Jun 17, 2019 3:17 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Improved some of the documentation for FT64. While working on the NVIO doc I realized there was a better way to organize the book: the instructions are described by functional unit rather than all being lumped together.

_________________
Robert Finch http://www.finitron.ca


Tue Jun 18, 2019 4:06 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Worked on FT64 today.
FT64 is so out-of-date. It’s about a year behind on edits to the superscalar engine.
I decided to try to pipeline the instruction decompression better. That requires an additional decompression queue before the main issue queue. At the moment the decompressor writes straight into the issue queue, but that won’t work without a lot of changes: the issue queue expects the register values to be ready at queue time, which isn’t the case if the instruction hasn’t been decompressed yet. Without the additional pipelining there’s a lot of logic between the instruction cache and the issue queue. The core must look up the cache line, align it for 16-bit addressability, figure out where the second instruction is, determine the instruction lengths, decompress the instructions, and feed them to the right fetch buffer. Yikes! That’s a lot to do in one clock cycle.
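A minimal sketch of the extra stage: decompression lands in its own queue register, and the issue queue only ever sees fully decompressed instructions with their register fields in place. The widths, names and the placeholder expand function are illustrative only, not the real FT64 decompressor.
Code:
module decomp_queue
(
  input  logic        clk,
  input  logic        rst,
  input  logic        ld,        // load a freshly aligned compressed insn
  input  logic [31:0] cinsn,     // compressed instruction from fetch
  input  logic        issue_pop, // issue queue accepted the entry
  output logic        avail,     // a decompressed insn is waiting
  output logic [47:0] dinsn      // decompressed instruction
);
  // Placeholder expansion; the real decompressor would map the compressed
  // encodings onto the full-width instruction format here.
  function automatic logic [47:0] expand(input logic [31:0] ir);
    return {16'd0, ir};
  endfunction

  always_ff @(posedge clk) begin
    if (rst)
      avail <= 1'b0;
    else if (ld) begin
      dinsn <= expand(cinsn);    // decompress into the queue register
      avail <= 1'b1;
    end
    else if (issue_pop)
      avail <= 1'b0;             // entry consumed by the issue queue
  end
endmodule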

_________________
Robert Finch http://www.finitron.ca


Fri Dec 13, 2019 4:42 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Started working on a new version, spurred on by the results from the FT20200324 cpu. Rather than an OoO superscalar, three in-order parallel pipes are being used. The goal is to get the cpu running at 100MHz with the three pipes. This version looks rather like the IA64, with predicate registers and a large register array (64 regs). It also uses 41-bit instructions.

_________________
Robert Finch http://www.finitron.ca


Wed Mar 25, 2020 3:08 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Thor2020 is advancing nicely. I’ve had lots of home time to work on it. I’ve got the base cpu implemented along with an instruction cache, and the timing is still good to 100MHz. Some of the more complicated instructions will burn up a lot of clock cycles in order to keep the fmax high; for instance, divide takes about 70 clocks and multiply about 20. I’m not sure some of the instruction-break logic will work as expected. The details will have to be ironed out during simulation. Things aren’t quite ready to sim yet; there’s no software yet.

_________________
Robert Finch http://www.finitron.ca


Fri Mar 27, 2020 5:09 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
I'd be interested to see a sketch of how the multiply and the divide proceed. (You're not alone, I believe the 68k took rather a few clocks too.)


Fri Mar 27, 2020 8:59 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Multiply / Divide details
The multiplier is a component generated with the vendor’s coregen, so I don’t know how it’s arranged internally, other than having selected the ‘optimum’ number of pipeline stages, which was 18. It uses the multipliers in the DSP blocks for speed. There are six multiplier components in use: three each for signed and unsigned multiply, one of each per pipe of the processor. So it can be doing three multiplies in parallel.
The divider is a simple radix-2 divider I coded myself that follows the standard schoolbook long-division algorithm. There are three dividers, one for each pipe, since a single divider can handle both signed and unsigned division. So three divides could be taking place at the same time.
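For reference, a radix-2 restoring divider along those lines looks roughly like the sketch below (module and signal names are illustrative only; the real divider differs in detail). One quotient bit is retired per clock, so a 64-bit divide takes a little over 64 cycles, in line with the roughly 70 clocks quoted above.
Code:
module div_r2 #(parameter N = 64)
(
  input  logic         clk,
  input  logic         rst,
  input  logic         start,      // pulse to begin a divide
  input  logic         signed_op,  // 1 = signed divide, 0 = unsigned
  input  logic [N-1:0] a,          // dividend
  input  logic [N-1:0] b,          // divisor (divide-by-zero not handled)
  output logic         done,
  output logic [N-1:0] q,          // quotient
  output logic [N-1:0] r           // remainder
);
  logic [N-1:0] qi;                // quotient bits being accumulated
  logic [N-1:0] ri;                // partial remainder
  logic [N-1:0] d;                 // magnitude of the divisor
  logic [$clog2(N+1)-1:0] cnt;     // iteration counter
  logic         neg_q, neg_r;      // sign fix-ups for signed operands

  // Work on magnitudes; fix the signs up at the end.
  wire [N-1:0] ua = (signed_op & a[N-1]) ? -a : a;
  wire [N-1:0] ub = (signed_op & b[N-1]) ? -b : b;

  // Shift the next dividend bit into the partial remainder and trial-subtract.
  wire [N:0] diff = {ri, qi[N-1]} - {1'b0, d};

  always_ff @(posedge clk) begin
    if (rst) begin
      done <= 1'b0;
      cnt  <= '0;
    end
    else if (start) begin
      qi    <= ua;                 // dividend shifts out of qi bit by bit
      ri    <= '0;
      d     <= ub;
      neg_q <= signed_op & (a[N-1] ^ b[N-1]);
      neg_r <= signed_op & a[N-1];
      cnt   <= N;
      done  <= 1'b0;
    end
    else if (cnt != 0) begin
      if (diff[N]) begin           // subtraction went negative: restore
        ri <= {ri[N-2:0], qi[N-1]};
        qi <= {qi[N-2:0], 1'b0};   // quotient bit is 0
      end
      else begin
        ri <= diff[N-1:0];
        qi <= {qi[N-2:0], 1'b1};   // quotient bit is 1
      end
      cnt <= cnt - 1'b1;
      if (cnt == 1)
        done <= 1'b1;
    end
  end

  assign q = neg_q ? -qi : qi;     // quotient sign: dividend xor divisor
  assign r = neg_r ? -ri : ri;     // remainder takes the dividend's sign
endmodule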
If multiplies and divides are mixed in the same bundle, the core waits for the longest-running operation to complete before proceeding.
The core (a) doesn’t have any forwarding logic; results are written to the register file at the end of one ‘long’ clock cycle. It also (b) relies on software to set instruction ‘break’ bits in the bundle, which cause the core to execute instructions serially, waiting for dependencies on prior instructions to resolve.
There are compromises that allow the core timing to reach 100MHz. For instance, the L1 cache is only 1kB: 64 lines of 128 bits. The line size is the same as the instruction fetch size, so no multiplexing within the line is required. The short line size and the small cache size mean the performance of the cache is likely not to be great. Still, the 68030 (?) had only a 256B cache.
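For illustration, the stated geometry works out as below (the organization, tag width and signal names are my assumptions for the sketch, not taken from the design):
Code:
// 64 lines x 128 bits = 1kB; the line equals the fetch width, so the whole
// line goes straight to fetch with no muxing within the line.
logic [127:0] icache_line [0:63];
logic [53:0]  icache_tag  [0:63];      // assuming a 64-bit PC and 16B lines
logic         icache_val  [0:63];

wire [5:0]   ndx   = ip[9:4];          // 16-byte lines: index from ip[9:4]
wire [53:0]  tag   = ip[63:10];
wire         hit   = icache_val[ndx] && (icache_tag[ndx] == tag);
wire [127:0] fetch = icache_line[ndx]; // full line handed to instruction fetch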
I’ve been thinking about restricting the multipliers / dividers to fewer pipes, but then it would be necessary to schedule instructions so that they fall in the right pipes. Divide hardly gets used at all; I’ve not seen code that uses one divide after another where they would need to be packed into the same bundle.

_________________
Robert Finch http://www.finitron.ca


Sat Mar 28, 2020 2:42 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Ah, yes, thanks: with a 64 bit machine we can expect division to be pretty expensive. With fast multiplication, there are faster ways of doing division, but those pipelined multipliers are not particularly fast (compared to the architectures we see on custom microprocessors) - I'm guessing that the basic block of 18x18 is pretty fast, but the techniques of gluing together an array of those blocks end up falling short of what could be built if building a wide multiplier from the start.


Sat Mar 28, 2020 8:55 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
Do you have a short multiply for effective-address (efa) calculations?
define float foo[5,1000,1000]
foo[i,j,k] = foo[l,i+3,j-7] + ...


Sat Mar 28, 2020 7:36 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I could try experimenting with the number of pipeline stages for the multiplier. The stage count may be higher than needed; it could be that too many stages have been specified for a 100MHz target. That is, maybe they are working at 200MHz or some such thing, but how am I supposed to know? :? I selected ‘optimum’ when optimizing for speed, but maybe it doesn’t need to be optimum.
The Xilinx multiplier blocks are 25x18 IIRC; they have a few more bits on the one side. In other processors I’ve had a fast multiply instruction that does a 24x16 multiply in one clock cycle, useful for calculating array indexes on small arrays. I should maybe add that instruction.
Well, I checked the multiplier specs: up to 741MHz for a single block. Stack about four of them for a 64x64 multiply and it’s 185MHz. I’m guessing a two-stage multiplier should be sufficient. Fully pipelined they’re probably running at 740MHz, way faster than needed.
Well, I tried a 9-stage multiplier and it looks like it would still meet timing! Multiply is now twice as fast. Next to try is four stages (a binary search to find what’ll meet timing).
It looks like they will work with just one stage.
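The experiment is easy to repeat if the multiplier is written behaviourally with the stage count as a parameter, letting the synthesizer map the product onto the DSP blocks and retime the registers into them. A rough sketch (not the coregen component; the names and the combined signed/unsigned handling are illustrative):
Code:
module mul_pipe #(parameter N = 64, parameter STAGES = 4)
(
  input  logic           clk,
  input  logic           sgn,     // 1 = signed multiply, 0 = unsigned
  input  logic [N-1:0]   a,
  input  logic [N-1:0]   b,
  output logic [2*N-1:0] p
);
  logic [2*N-1:0] prod;
  logic [2*N-1:0] pipe [0:STAGES-1];

  // Full-width product; the tools build this from an array of DSP blocks.
  always_comb begin
    if (sgn)
      prod = $signed(a) * $signed(b);
    else
      prod = a * b;
  end

  // STAGES register stages for the synthesizer to retime into the DSPs.
  always_ff @(posedge clk) begin
    pipe[0] <= prod;
    for (int i = 1; i < STAGES; i++)
      pipe[i] <= pipe[i-1];
  end

  assign p = pipe[STAGES-1];
endmodule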

_________________
Robert Finch http://www.finitron.ca


Sat Mar 28, 2020 11:21 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Just trying to figure out a way to branch into the middle of a bundle. The aim is to increase code density; otherwise a lot of NOPs are spit out to align branch targets to bundle addresses. I think I found a way to accomplish the goal. There’s a small table, called the execution pattern table, that identifies which instruction slots get executed during a clock cycle. The table is indexed by the break bits in the instruction bundle. If there are no breaks, all three instructions execute in the first clock (the next two clocks are then skipped). By adding the low-order instruction pointer bits to the index into the table, the pattern of instruction execution can be altered to accommodate a branch into the middle of the bundle.
Code:
 // - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
// Execution pattern table
// Controls which instruction slots execute during a given clock cycle.
// Execution pattern output is used as a mask along with the predicate to
// determine if the instruction executes.
// - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

wire [8:0] expat [0:11] =
   {
    // Entered at slot #0, three instructions to execute
    9'b001_010_100,   // 11 <= max break, separate cycles for each insn.
    9'b011_100_000,   // 10 <= break after second instruction
    9'b001_110_000,   // 01 <= break after first instruction
    9'b111_000_000,   // 00 <= no breaks, all instructions execute in clock 1
   
    // Entered at slot #1, two instructions to execute
    9'b010_100_000,  // 11
    9'b010_100_000,  // 10
    9'b110_000_000,  // 01
    9'b110_000_000,  // 00
   
    // Entered at slot#2, only one insn to exec.
    9'b100_000_000,  // 11
    9'b100_000_000,  // 10
    9'b100_000_000,  // 01
    9'b100_000_000   // 00
    };
wire [8:0] expats = expat[{ip[3:2],insn[124:123]}];
reg [8:0] expatx;
wire [8:0] nexpatx = expatx << 2'd3;

Since the instructions are 41 bits and not perfectly aligned to byte addresses, a convention is used to identify the instruction slot. Slot 0 instructions sit at 128-bit aligned addresses, so there’s no issue with them. The next slot is tagged as having an address ending in ‘5’ and the slot after that as ending in ‘A’: ‘5’ means the instruction at bit 41, ‘A’ means the instruction at bit 82.
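Putting that together with the code above, the slots and the table index work out roughly as follows (only the field positions are taken from the code above; the wire names are just for illustration):
Code:
wire [40:0] slot0 = insn[40:0];     // at a 128-bit aligned address (...0)
wire [40:0] slot1 = insn[81:41];    // tagged as an address ending in ...5
wire [40:0] slot2 = insn[122:82];   // tagged as an address ending in ...A
wire [1:0]  brk   = insn[124:123];  // break bits for the bundle

// ip[3:0] of 4'h0 / 4'h5 / 4'hA gives ip[3:2] = 2'b00 / 2'b01 / 2'b10,
// which selects the group of four expat entries for the slot entered.
wire [3:0] expat_ndx = {ip[3:2], brk};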

_________________
Robert Finch http://www.finitron.ca


Sun Mar 29, 2020 3:21 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
robfinch wrote:
I could try experimenting with the number of pipeline stages for the multiplier...
Well I tried a 9-stage multiplier and it looks like it would still meet timing! Multiply is now twice as fast. Next to try four stages. (A binary search to find what’ll meet timing).
It looks like they will work with just one stage.

Oh, that's sounding much better!


Sun Mar 29, 2020 8:13 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
The ISA currently features updating a predicate register with the result of any operation, not just compares. However, this is restricted to predicate registers #1 and #2.
This feature is likely to be dropped. The issue is that it slows the core down too much. I’ve experimented with a couple of different implementations, and the extra multiplexors on the predicate register inputs become part of the critical path.
The offending line looks like this (outside of timing):
Code:
    if (prfwr0|prfwr0a) p[prfwr0a ? isFloat0 + 2'd1 : ir0[9:6]] <= pres0;

And without the extra logic it looks like this (within timing):
Code:
    if (prfwr0) p[ir0[9:6]] <= pres0;

The record feature allowed an instruction to record the predicate result in p1 for integer operations or p2 for floating-point operations without having to do a compare. The idea was to remove compare instructions from the instruction stream. Without the record feature more instructions get executed (but they execute faster).
Example with record:
Code:
        ldi   r4,#-23       ; set up a counter in R4
LP1:    jsr   FIB
        add.  r4,r0,#1      ; inc loop counter
p1.ne   jmp   LP1           ; another iteration if not zero
 


Example without record:
Code:
        ldi   r4,#-23       ; set up a counter in R4
LP1:    jsr   FIB
        add  r4,r0,#1      ; inc loop counter
        cmp p1,r4,r0
p1.ne   jmp   LP1           ; another iteration if not zero
 


Also worked on the MMU today. There are a few design choices to be made. One is whether to use software or hardware to manage the translation caches. Another is the organization of the translation tables: hash table or hierarchical? I prefer to use hardware, but I’m thinking it may be better left as a software task.
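Purely as an illustration of the hashed option (nothing below is taken from the actual design), the table index could be a simple fold of the virtual page number and an address-space id:
Code:
// XOR-fold the VPN, then mix in the ASID; 12 bits gives 4096 buckets.
// Assumes a 64-bit VA with 4kB pages and an 8-bit ASID.
function automatic logic [11:0] pt_hash(
  input logic [51:0] vpn,    // virtual page number
  input logic [7:0]  asid);  // address-space identifier
  pt_hash = vpn[11:0] ^ vpn[23:12] ^ vpn[35:24] ^ vpn[47:36]
          ^ {8'd0, vpn[51:48]} ^ {4'd0, asid};
endfunction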

_________________
Robert Finch http://www.finitron.ca


Mon Mar 30, 2020 2:58 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
For the MMU, that would best be done at the microcode level of the design, if you had microcode.
Computer Organization and Microprogramming by Yaohan Chu is a good book.


Mon Mar 30, 2020 8:25 pm