 Thor Core / FT64 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1164
Location: Canada
Quote:
For the MMU, that would be best at the microcode level of the design, if you had microcode.
Computer Organization and Microprogramming by Yaohan Chu is a good book.
Thanks for the book reference.

I’ve done something to the implementation that cost 4ns. I hope it wasn’t one of those ‘it has to be there for it to work’ type changes. 4ns makes about a 30MHz difference in operating frequency. I’m just backing out a bunch of changes now.
Created a #define SLOW to allow building a core that’s fast with instruction restrictions or slow with a full instruction set.
The fast core leaves out:
- sized memory operations, only dword size ops are supported
- unaligned memory access
- Boolean arithmetic on predicate results (which allowed combining multiple compares)
- the set intersect, join and disjoint test instructions
- shift left/right pair by a register (fast supports only immediate count)
With SLOW defined the core timing is about 75MHz (it’s about 3ns slower).
I tried putting the fast version in a larger system and it just missed the 100MHz timing by 197ps. I may end up running the cpu at 80MHz instead of 100.
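As a rough check on the timing figures above, the period-to-frequency arithmetic can be sketched as follows. This assumes a 100MHz (10ns) baseline, which is only the target mentioned for the larger system and not a measured figure for the core; the function names are made up for illustration.

```python
def mhz_from_period_ns(period_ns):
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / period_ns


def period_ns_from_mhz(mhz):
    """Convert a frequency in MHz to a clock period in nanoseconds."""
    return 1000.0 / mhz


# Assumption: a 100 MHz baseline, i.e. a 10 ns critical path.
base_period = period_ns_from_mhz(100.0)

# Adding 4 ns to the critical path costs close to 30 MHz.
slower_by_4ns = mhz_from_period_ns(base_period + 4.0)

# Adding 3 ns lands in the mid-70s MHz range.
slower_by_3ns = mhz_from_period_ns(base_period + 3.0)

# Missing 100 MHz timing by 197 ps still leaves roughly 98 MHz.
missed_by_197ps = mhz_from_period_ns(base_period + 0.197)

print(round(slower_by_4ns, 2), round(slower_by_3ns, 2), round(missed_by_197ps, 2))
```

Under that assumption the 4ns penalty works out to a drop of a little under 29MHz, which is in line with the "30MHz" estimate.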
I'm toying with the idea of rotating registers for software pipelining like the Itanium. Trying to achieve something like a simplified Itanium here.

_________________
Robert Finch http://www.finitron.ca


Tue Mar 31, 2020 4:21 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1164
Location: Canada
Completely blew the core timing by adding a whole bunch of features.
Added floating-point to the core. Not all floating-point operations are available in all instruction slots. Some of the less frequent operations are restricted to fewer slots.
Added code to support all the different load and store instructions for aligned and unaligned memory access.

_________________
Robert Finch http://www.finitron.ca


Wed Apr 01, 2020 3:32 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1164
Location: Canada
The fp divider had an atrocious clock cycle time, limiting performance to about 30MHz for the divider. I had used a radix-16 divider primitive because it takes few clock cycles for the divide. Other factors in previous systems limited operating frequency to about 30MHz, so the divider was set up to match that. So, to get a higher clock frequency for the fp divider, the divider primitive has been changed to radix-4. Some additional pipelining in the divider was also done for the exponent calculation, which is done long before the divide, so there’s lots of room for pipelining it.
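To illustrate the radix trade-off, here is a small software model of a restoring-style integer divider that retires a fixed number of quotient bits per step. It is only a sketch and not the divider primitive used in the core: the function name, the 64-bit width, and the integer (rather than floating-point mantissa) operands are all assumptions made for the example.

```python
def divide_radix(dividend, divisor, width=64, bits_per_step=2):
    """Divide unsigned integers, retiring bits_per_step quotient bits per step.

    bits_per_step=2 models a radix-4 divider, bits_per_step=4 a radix-16 one.
    Returns (quotient, remainder, steps).
    """
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    if width % bits_per_step != 0:
        raise ValueError("width must be a multiple of bits_per_step")

    radix = 1 << bits_per_step
    digit_mask = radix - 1
    quotient = 0
    remainder = 0
    steps = 0

    for shift in range(width - bits_per_step, -1, -bits_per_step):
        # Bring down the next digit of the dividend.
        remainder = (remainder << bits_per_step) | ((dividend >> shift) & digit_mask)

        # Pick the largest digit whose multiple of the divisor still fits.
        # In hardware this compare-and-select grows with the radix, which is
        # what lengthens the cycle time for radix-16 versus radix-4.
        digit = 0
        for candidate in range(radix - 1, 0, -1):
            if candidate * divisor <= remainder:
                digit = candidate
                break

        remainder -= digit * divisor
        quotient = (quotient << bits_per_step) | digit
        steps += 1

    return quotient, remainder, steps


print(divide_radix(1000, 7, bits_per_step=2))  # radix-4: more steps, less work per step
print(divide_radix(1000, 7, bits_per_step=4))  # radix-16: fewer steps, more work per step
```

In this model radix-4 takes twice as many steps as radix-16 for the same width, but each step only has to choose among three multiples of the divisor instead of fifteen.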

After some more improvements the fp divider is no longer on the critical path. Now the critical path is between the register file output and address generation for load / store operations. It’ll take some work to bump up the fmax further by registering signals in the load / store path. That means loads and stores will take more clock cycles.
I included code for a data cache now and the core is much larger and somewhat slower. It was a bit unrealistic to have the high-speed datapath going directly out to main memory where it can take dozens of cycles to access something. I also changed the I$ to a 512-bit width to match the data cache for simplicity. That means a mux on the I$ output to select the instruction.

_________________
Robert Finch http://www.finitron.ca


Thu Apr 02, 2020 3:19 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 130
Can an addition be split into two clocks?
1) a" = mask & (a' $ invert), b" = mask & (b' $ invert)
gen = a" & b", prop = a" $ b"
2) sum = generate carries $ prop


Thu Apr 02, 2020 3:53 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1164
Location: Canada
Quote:
Can an addition be split into two clocks?
1) a" = mask & (a' $ invert), b" = mask & (b' $ invert)
gen = a" & b", prop = a" $ b"
2) sum = generate carries $ prop

The question seems a bit cryptic to me (what does $ stand for?). An addition could be split into two or more clocks. I would think that the carry chain would be split across two or more cycles (e.g. two 32-bit adds instead of one 64-bit add). It's the propagation delay of transferring the carry from one bit to the next that slows things down. At 2) it looks like the carries still have to be accounted for.
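A minimal software model of splitting the carry chain, with a 64-bit add done as two 32-bit adds in successive cycles, might look like this. The function name and the choice of an even 32/32 split are assumptions for illustration only.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def add64_two_stage(a, b):
    """Model a 64-bit add performed as two 32-bit adds in successive cycles.

    Returns (sum, carry_out) with the sum truncated to 64 bits.
    """
    # Cycle 1: add the low halves. The carry out of bit 31 would be held
    # in a pipeline register until the next clock.
    low = (a & MASK32) + (b & MASK32)
    low_sum = low & MASK32
    carry_reg = low >> 32

    # Cycle 2: add the high halves plus the registered carry.
    high = ((a >> 32) & MASK32) + ((b >> 32) & MASK32) + carry_reg
    high_sum = high & MASK32
    carry_out = high >> 32

    return (high_sum << 32) | low_sum, carry_out


print(add64_two_stage(0xFFFFFFFF, 1))
print(add64_two_stage(MASK64, 1))
```

Each cycle then only has to ripple a carry through 32 bits, at the cost of one extra clock of latency for the add.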

Put Thor on hold for now. I went to revise the code to add more to it and it stopped synthesizing. I realized the code is really horribly written the way it is set up (it mixes control and data), so a rewrite is in order. I stopped the synthesis after an hour or so. It may be able to synthesize given enough time, but the code’s just plain not written efficiently.
Given that a rewrite is in order, I decided to go back to looking at the ISA and revising that. I pulled the nvio ISA off the shelf to see if I could incorporate some things from it. Nvio with its vector instructions was about 5x too large for the FPGA, so it got shelved. But pieces of the ISA may be useful.

I found the notion of performing Boolean algebra on the results from a compare operation intriguing. So, I’ve modified compares to be able to ‘or’ and ‘and’ to the result register instead of just plain copying the result. This allows multiple compare operations to be used to set a register value without having to branch in-between. It does also require a bit more opcode space.
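A rough software sketch of the idea: each compare either sets the result register or combines with it using ‘and’ / ‘or’. The names `cmp_op` and `in_range` and the mode strings are hypothetical and are not taken from the actual ISA encoding.

```python
def cmp_op(dest, condition, mode="set"):
    """Model a compare that sets, ANDs into, or ORs into a result register.

    dest is the current value of the result register (0 or 1),
    condition is the outcome of the compare itself.
    """
    bit = 1 if condition else 0
    if mode == "set":
        return bit
    if mode == "and":
        return dest & bit
    if mode == "or":
        return dest | bit
    raise ValueError("unknown mode: " + mode)


def in_range(x, low, high):
    """Range test low <= x < high built from two compares and no branch."""
    r = cmp_op(0, x >= low, "set")   # r = (x >= low)
    r = cmp_op(r, x < high, "and")   # r = r & (x < high)
    return r


print(in_range(15, 10, 20), in_range(25, 10, 20))
```

The second compare reuses the register written by the first, so a single branch (or predicated instruction) can test the combined result afterwards.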

_________________
Robert Finch http://www.finitron.ca


Sat Apr 04, 2020 5:09 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1440
robfinch wrote:
I found the notion of performing Boolean algebra on the results from a compare operation intriguing. So, I’ve modified compares to be able to ‘or’ and ‘and’ to the result register instead of just plain copying the result. This allows multiple compare operations to be used to set a register value without having to branch in-between. It does also require a bit more opcode space.


Mmm, yes, that could be an interesting exploration!


Sat Apr 04, 2020 7:05 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 130
Not everything is C.
$ xor
# or
& and
! not
With splitting the add into 2 cycles, one would use regular logic terms for lookahead carry, say for 4-bit groups, in the first cycle. The second cycle would generate the sum terms, using something faster than the fast ripple carry used with normal addition.
Have you tried a carry-select adder to tweak your addition?
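The two-cycle split described here can be modelled in software roughly as follows, using gen = a & b, prop = a xor b, and sum = carries xor prop. This is a sketch under assumptions: the function name, the 64-bit width, and the exact division of work between the two "cycles" are illustrative, and it models lookahead over 4-bit groups only, not a carry-select adder.

```python
def add_lookahead(a, b, width=64, group=4, carry_in=0):
    """Add two unsigned values using per-group generate/propagate terms.

    Returns (sum, carry_out) with the sum truncated to width bits.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask

    # "Cycle 1": per-bit generate and propagate, then one generate and one
    # propagate term per group of bits.
    gen = a & b
    prop = a ^ b
    groups = []
    for base in range(0, width, group):
        group_gen = 0    # group produces a carry on its own
        group_prop = 1   # group passes an incoming carry straight through
        for i in range(base, min(base + group, width)):
            g = (gen >> i) & 1
            p = (prop >> i) & 1
            group_gen = g | (p & group_gen)
            group_prop &= p
        groups.append((group_gen, group_prop))

    # "Cycle 2": carry into each group from the group terms, carries inside
    # each group, then sum = carries xor prop.
    carries = 0
    carry = carry_in & 1
    for index, (group_gen, group_prop) in enumerate(groups):
        base = index * group
        c = carry
        for i in range(base, min(base + group, width)):
            carries |= c << i
            c = ((gen >> i) & 1) | (((prop >> i) & 1) & c)
        carry = group_gen | (group_prop & carry)

    return (prop ^ carries) & mask, carry


print(add_lookahead(0xFFFF, 1))
print(add_lookahead((1 << 64) - 1, 1))
```

The carry into each group depends only on the group terms, so in hardware those can be evaluated in parallel instead of rippling bit by bit across the full width.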


Sat Apr 04, 2020 4:00 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1164
Location: Canada
I borrowed the compare idea from the Itanium.

I’ve not tried the carry-select adder, although I had heard of it. I’ve read that in an FPGA one might as well use the regular carry chain built into the FPGA circuitry, as it is just as fast for anything under 64 bits (IIRC). Also, there tends to be other logic in the FPGA slowing things down. I’m not sure the adder is on the critical path. It’s a whole lot simpler just to use Verilog’s ‘+’ sign to do an add instead of instantiating a module.

_________________
Robert Finch http://www.finitron.ca


Sun Apr 05, 2020 3:08 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1440
Indeed - choosing an adder architecture would normally be the job of the synthesis tool.


Sun Apr 05, 2020 7:40 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 130
But of course everybody has a non-standard way of selecting that hidden $$$ (or Pound or Euro) feature.
It seems that to get speed you need to floor-plan, and that defeats the language you are programming in.
Does the FPGA software still support importing of netlists?


Sun Apr 05, 2020 6:17 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1164
Location: Canada
Yes, to get the best performance floor-planning is essential. I haven’t taken to floor-planning anything and still get reasonable performance. I’m more concerned with getting the right architecture at this point. I like to fret over ISAs and the way things are implemented, so I do a lot of stuff manually, and that’s okay for a hobby. Bit of a retro approach. The goal is that once I’ve hit ‘the right’ architecture I’ll invest more into floor-planning etc. Coming up with a good ‘canned’ component is one goal.

I find the FPGA has gone the same way with hardware logic as high-level languages versus assembler in software. Used to be a lot of software was written in assembler, but high-level languages took over for ease of use even though performance may only be half. People have found performance to be ‘good enough’. Being able to express a good algorithm in a high-level language helps a lot.
Sure, maybe performance is half of what could be obtained by coding everything manually versus dealing with hardware structures using the toolset in a more abstract fashion. But it’s hard to beat the engineering productivity gain of using a toolset rather than hand-work. Using a GUI tool and canned components, a system can be built in a few days that might take months of work to do by hand.

_________________
Robert Finch http://www.finitron.ca


Mon Apr 06, 2020 3:43 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1440
I think there is another point worth making though: for both the programming case and the hardware design case, if you know something about the underlying implementation there may be times when you can express your high level design differently and get a much better implementation from the tools.

And indeed, there's another little bit to that: there are skills in reading the logs and output from the tools which can help close the loop, guiding you to a well-chosen tweak to your high level sources.

So, it's possible to treat the tools as black boxes and the implementation as impenetrable, and while doing that you can get the highest productivity, but it's also possible to look inside the box, and often get higher performance.

In both cases, there will be some small part of the design which will be the biggest win if it can be done better - it's not efficient to optimise everything to the same degree.


Mon Apr 06, 2020 7:13 am