


 Thor Core / FT64 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
For the MMU, that would be best done at the microcode level of the design, if you had microcode.
Computer Organization and Microprogramming by Yaohan Chu is a good book.
Thanks for the book reference.

I’ve done something to the implementation that cost 4ns. I hope it wasn’t one of those ‘it has to be there for it to work’ type changes. 4ns makes about a 30MHz difference in operating frequency. Just backing out a bunch of changes now.
Created a #define SLOW so the core can be built either fast with instruction restrictions, or slow with the full instruction set.
The fast core leaves out:
- sized memory operations, only dword size ops are supported
- unaligned memory access
- Boolean arithmetic on predicate results (which allowed combining multiple compares)
- the set intersect, join and disjoint test instructions
- shift left/right pair by a register (fast supports only immediate count)
With SLOW defined the core timing is about 75MHz (about 3ns slower).
I tried putting the fast version in a larger system and it just missed the 100MHz timing by 197ps. I may end up running the CPU at 80MHz instead of 100.
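A minimal sketch of how that kind of build switch might look; the names and the size-decode example are hypothetical, not the actual Thor source.
Code:
// Hypothetical illustration of a fast/slow build switch.
// `define SLOW to include the full instruction set; leave it
// undefined to drop the slower features for a higher fmax.

//`define SLOW

module mem_size_decode(
  input       [2:0] sz,       // size field from the instruction
  output reg  [7:0] byte_en   // byte lane enables for a 64-bit bus
);
always @* begin
`ifdef SLOW
  // Full core: byte, half, word and dword sized accesses.
  case (sz)
    3'd0: byte_en = 8'h01;    // byte
    3'd1: byte_en = 8'h03;    // half
    3'd2: byte_en = 8'h0F;    // word
    default: byte_en = 8'hFF; // dword
  endcase
`else
  // Fast core: only dword sized operations are supported.
  byte_en = 8'hFF;
`endif
end
endmodule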
I'm toying with the idea of rotating registers for software pipelining like the Itanium. Trying to achieve something like a simplified Itanium here.

_________________
Robert Finch http://www.finitron.ca


Tue Mar 31, 2020 4:21 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Completely blew the core timing by adding a whole bunch of features.
Added floating-point to the core. Not all floating-point operations are available in all instruction slots. Some of the less frequent operations are restricted to fewer slots.
Added code to support all the different load and store instructions for aligned and unaligned memory access.
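One common way to handle the unaligned cases (a sketch of the general technique only, not necessarily how this core does it) is to fetch two aligned beats and merge them with a shifter.
Code:
// Sketch: forming an unaligned 64-bit load from two aligned 64-bit
// beats.  lo_beat holds mem[addr & ~7], hi_beat holds the next dword;
// the low address bits select the byte rotation.
module unaligned_merge(
  input  [63:0] lo_beat,
  input  [63:0] hi_beat,
  input   [2:0] addr_lo,     // addr[2:0]
  output [63:0] load_data
);
  wire [127:0] pair = {hi_beat, lo_beat};
  assign load_data = pair >> (addr_lo * 8);
endmodule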

_________________
Robert Finch http://www.finitron.ca


Wed Apr 01, 2020 3:32 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
The fp divider had an atrocious clock cycle time, limiting performance to about a 30MHz cycle for the divider. I had used a radix-16 divider primitive because it takes fewer clock cycles for the divide. Other factors in previous systems limited the operating frequency to about 30MHz, so the divider was set up to match that. So, to get a higher clock frequency for the fp divider, the divider primitive has been changed to radix-4. Some additional pipelining was also added for the exponent calculation, which is done long before the divide, so there’s lots of room to pipeline it.
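Since the exponent path is independent of the long mantissa divide, it can be given its own pipeline registers. A rough sketch, with double-precision field widths assumed purely for illustration:
Code:
// Sketch: exponent calculation for FP divide, registered over two
// stages while the radix-4 mantissa divider grinds away in parallel.
module fdiv_exp(
  input             clk,
  input      [10:0] exp_a,
  input      [10:0] exp_b,
  output reg [12:0] exp_q    // extra bits for under/overflow detect
);
  localparam BIAS = 11'd1023;
  reg [12:0] diff_r;
  always @(posedge clk) begin
    // Stage 1: subtract the exponents.
    diff_r <= {2'b00, exp_a} - {2'b00, exp_b};
    // Stage 2: re-apply the bias (normalization adjust comes later).
    exp_q  <= diff_r + BIAS;
  end
endmodule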

After some more improvements the fp divider is no longer on the critical path. Now the critical path is between the register file output and address generation for load / store operations. It’ll take some work to bump up the fmax further by registering signals in the load / store path. That means loads and stores will take more clock cycles.
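Registering the operands ahead of the address adder would look roughly like this minimal sketch (one extra cycle of load/store latency, as noted):
Code:
// Sketch: inserting a pipeline register between the register file
// read port and the load/store address adder to shorten the
// critical path.  Costs one extra cycle of load/store latency.
module agen(
  input             clk,
  input      [63:0] rf_base,    // base register from the RF read port
  input      [63:0] imm_disp,   // sign-extended displacement
  output reg [63:0] mem_addr
);
  reg [63:0] base_r, disp_r;
  always @(posedge clk) begin
    base_r   <= rf_base;        // new register stage
    disp_r   <= imm_disp;
    mem_addr <= base_r + disp_r;
  end
endmodule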
I included code for a data cache now, and the core is much larger and somewhat slower. It was a bit unrealistic to have the high-speed datapath going directly out to main memory, where it can take dozens of cycles to access something. I also changed the I$ to a 512-bit line width to match the data cache, for simplicity. That means adding a mux on the I$ output to select the instruction.
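The instruction select is then just a mux on the low fetch-address bits, something like the sketch below (a fixed 32-bit parcel width is assumed here purely for illustration):
Code:
// Sketch: picking one instruction parcel out of a 512-bit I$ line.
module icache_sel(
  input  [511:0] ic_line,
  input    [3:0] pc_sel,          // pc[5:2] for 16 x 32-bit parcels
  output  [31:0] insn
);
  assign insn = ic_line >> (pc_sel * 32);
endmodule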

_________________
Robert Finch http://www.finitron.ca


Thu Apr 02, 2020 3:19 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
Can an addition be split into two clocks?
1) a" = mask & (a' $ invert); b" = mask & (b' $ invert)
   gen = a" & b"; prop = a" $ b"
2) sum = generated carries $ prop


Thu Apr 02, 2020 3:53 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
Can an addition be split into two clocks?
1) a" = mask & (a' $ invert); b" = mask & (b' $ invert)
   gen = a" & b"; prop = a" $ b"
2) sum = generated carries $ prop

The question seems a bit cryptic to me (what does $ stand for?). An addition could be split into two or more clocks. I would think that the carry chain would be split across two or more cycles (e.g. two 32-bit adds instead of one 64-bit). It’s the prop delay of transferring data from one bit to the next that slows things down. In step 2) it looks like the carries still have to be accounted for.
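For example, a 64-bit add spread over two clocks as two 32-bit adds, with the midpoint carry held in a register (a sketch of the idea only):
Code:
// Sketch: 64-bit addition spread over two clocks.  Cycle 1 adds the
// low halves and saves the carry; cycle 2 adds the high halves plus
// the saved carry.
module add64_2cyc(
  input             clk,
  input      [63:0] a, b,
  output reg [63:0] sum
);
  reg        carry_r;
  reg [31:0] lo_r, a_hi_r, b_hi_r;
  always @(posedge clk) begin
    // Cycle 1: low 32-bit add, carry captured in carry_r.
    {carry_r, lo_r} <= a[31:0] + b[31:0];
    a_hi_r <= a[63:32];
    b_hi_r <= b[63:32];
    // Cycle 2: high 32-bit add with the carried-in bit.
    sum <= {a_hi_r + b_hi_r + carry_r, lo_r};
  end
endmodule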

Put Thor on hold for now. I went to revise the code to add more to it and it stopped synthesizing. I realized the code is really horribly written the way it is set up (it mixes control and data), so a re-write is in order. I stopped the synthesis after an hour or so. It may be able to synthesize given enough time, but the code’s just plain not written efficiently.
With a re-write in order, I decided to go back to looking at the ISA and revising it. I pulled the nvio ISA off the shelf to see if I could incorporate some things from it. Nvio, with its vector instructions, was about 5x too large for the FPGA, so it got shelved. But pieces of the ISA may be useful.

I found the notion of performing Boolean algebra on the results of a compare operation intriguing. So, I’ve modified compares to be able to ‘or’ and ‘and’ their result into the result register instead of just plain copying it. This allows multiple compare operations to be used to set a register value without having to branch in between. It does also require a bit more opcode space.
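Roughly what the modified compare does to its target register, as a sketch only; the 2-bit combine field is made up for illustration, not the real encoding.
Code:
// Sketch: a compare unit that can copy, AND or OR its result into
// the destination, so several compares can build one predicate
// without branches.  The 2-bit 'combine' field is hypothetical.
module cmp_combine(
  input             clk, wr,
  input      [63:0] a, b,
  input       [1:0] combine,    // 0=copy, 1=and, 2=or
  input      [63:0] rd_old,     // current value of the target reg
  output reg [63:0] rd_new
);
  wire cmp = (a == b);          // one compare condition, e.g. EQ
  always @(posedge clk)
    if (wr)
      case (combine)
        2'd1:    rd_new <= rd_old & {63'd0, cmp};
        2'd2:    rd_new <= rd_old | {63'd0, cmp};
        default: rd_new <= {63'd0, cmp};
      endcase
endmodule

That lets a compound condition such as (a == b) && (c < d) be built from two compares with no branch in between.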

_________________
Robert Finch http://www.finitron.ca


Sat Apr 04, 2020 5:09 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
robfinch wrote:
I found the notion of performing Boolean algebra on the results of a compare operation intriguing. So, I’ve modified compares to be able to ‘or’ and ‘and’ their result into the result register instead of just plain copying it. This allows multiple compare operations to be used to set a register value without having to branch in between. It does also require a bit more opcode space.


Mmm, yes, that could be an interesting exploration!


Sat Apr 04, 2020 7:05 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
Not everything is C.
$ xor
# or
& and
! not
To split the add into two cycles, one would use regular logic terms for lookahead carry, say for 4-bit groups, in the first cycle.
The second cycle would generate the sum terms, using something faster
than the fast ripple carry used with normal addition.
Have you tried a carry-select adder to tweak your addition?
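A carry-select adder computes the upper half twice, once per possible carry-in, and picks the right result once the low half’s carry is known. A sketch:
Code:
// Sketch: 64-bit carry-select adder built from two 32-bit halves.
// The upper half is computed for carry-in 0 and 1 in parallel with
// the lower half, then selected by the real carry-out of the low add.
module add64_csel(
  input  [63:0] a, b,
  output [63:0] sum,
  output        cout
);
  wire        c_lo;
  wire [31:0] s_lo, s_hi0, s_hi1;
  wire        c_hi0, c_hi1;
  assign {c_lo,  s_lo } = a[31:0]  + b[31:0];
  assign {c_hi0, s_hi0} = a[63:32] + b[63:32];          // assume cin=0
  assign {c_hi1, s_hi1} = a[63:32] + b[63:32] + 32'd1;  // assume cin=1
  assign sum  = {c_lo ? s_hi1 : s_hi0, s_lo};
  assign cout =  c_lo ? c_hi1 : c_hi0;
endmodule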


Sat Apr 04, 2020 4:00 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I borrowed the compare idea from the Itanium.

I’ve not tried the carry-select adder although I had heard of it. I’ve read that in an FPGA one might as well use the regular carry chain built into the FPGA circuitry, as it is just as fast for anything under 64 bits (IIRC). Also, there tends to be other logic in the FPGA slowing things down. I’m not sure the adder is on the critical path. It’s a whole lot simpler just to use Verilog’s ‘+’ sign to do an add instead of instantiating a module.
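For contrast, a sketch of the two styles side by side (reusing the carry-select module sketched above):
Code:
// Contrast sketch: letting the synthesizer infer the adder from '+'
// versus instantiating a hand-built module.
module add_demo(
  input  [63:0] a, b,
  output [63:0] sum_infer,
  output [63:0] sum_inst,
  output        cout
);
  assign sum_infer = a + b;   // tool picks the adder implementation
  add64_csel u_add (.a(a), .b(b), .sum(sum_inst), .cout(cout));
endmodule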

_________________
Robert Finch http://www.finitron.ca


Sun Apr 05, 2020 3:08 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Indeed - choosing an adder architecture would normally be the job of the synthesis tool.


Sun Apr 05, 2020 7:40 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
But of course everybody has a non-standard way of selecting that hidden $$$ (or Pound or Euro) feature.
It seems that to get speed you need to floor-plan, and that defeats the language you are programming in.
Does the FPGA software still support importing of netlists?


Sun Apr 05, 2020 6:17 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Yes, to get the best performance floor-planning is essential. I haven’t taken to floor-planning anything and still get reasonable performance. I’m more concerned with getting the right architecture at this point. I like to fret over ISAs and the way things are implemented, so I do a lot of stuff manually, and that’s okay for a hobby. Bit of a retro approach. The goal is that once I’ve hit ‘the right’ architecture I’ll invest more into floor-planning, etc. Coming up with a good ‘canned’ component is one goal.

I find the FPGA world has gone the same way with hardware logic as high-level languages versus assembler did in software. It used to be that a lot of software was written in assembler, but high-level languages took over for ease of use even though performance may be only half. People have found the performance to be ‘good enough’. Being able to express a good algorithm in a high-level language helps a lot.
Sure, maybe performance is half of what could be obtained by coding everything manually versus dealing with hardware structures through the toolset in a more abstract fashion. But it’s hard to beat the engineering productivity gain of using a toolset rather than hand-work. Using a GUI tool and canned components, a system can be built in a few days that might take months of work to do by hand.

_________________
Robert Finch http://www.finitron.ca


Mon Apr 06, 2020 3:43 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
I think there is another point worth making though: for both the programming case and the hardware design case, if you know something about the underlying implementation there may be times when you can express your high level design differently and get a much better implementation from the tools.

And indeed, there's another little bit to that: there are skills in reading the logs and output from the tools which can help close the loop, guiding you to a well-chosen tweak to your high level sources.

So, it's possible to treat the tools as black boxes and the implementation as impenetrable, and while doing that you can get the highest productivity, but it's also possible to look inside the box, and often get higher performance.

In both cases, there will be some small part of the design which will be the biggest win if it can be done better - it's not efficient to optimise everything to the same degree.


Mon Apr 06, 2020 7:13 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Back onto this project with a new version incorporating lessons learned. Spent the last three or four days creating a spec document for Thor2021, as I am calling the project. Thor2021 is a complete re-write, but it has some of the features of the original. The number of code address registers is reduced to eight from sixteen. The register file size remains at 64 entries, with eight selector registers for segmented memory management. Most of the opcodes are completely re-arranged, but there is the odd one that remains the same. Much is inherited from the ANY1 project. Instructions vary in length from two to seven bytes.

_________________
Robert Finch http://www.finitron.ca


Tue Oct 12, 2021 7:07 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Borrowed the crypto accelerator instructions from RISC-V and incorporated them into the ISA for Thor2021, and coded a crypto accelerator instruction module. I have started coding modules for Thor2021 now; after a whirlwind of documentation (more than 300 pages), I figure the ISA is sketched out enough. Added the byte map, BMAP, instruction which maps bytes from a *pair* of registers to a target register. It is like a permute (PERM) instruction. Mapping from a pair of registers allows it to perform PACK operations. Which reminds me, I need to add a nybble map instruction for crypto support.
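A sketch of what such a byte map might look like in logic; the control-byte format here is an assumption for illustration, not the actual Thor2021 encoding.
Code:
// Sketch: BMAP-style byte permute.  Each byte of 'map' selects one of
// the sixteen source bytes from the {rb,ra} pair; drawing from a pair
// is what lets the same instruction do PACK-type operations.
module bmap(
  input  [63:0] ra, rb, map,
  output [63:0] rt
);
  wire [127:0] src = {rb, ra};
  genvar i;
  generate
    for (i = 0; i < 8; i = i + 1) begin : g
      wire [3:0] sel = map[i*8 +: 4];    // low nybble of control byte
      assign rt[i*8 +: 8] = src[sel*8 +: 8];
    end
  endgenerate
endmodule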

_________________
Robert Finch http://www.finitron.ca


Fri Oct 15, 2021 5:35 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Worked on the segmentation model today. Got the idea to support PMAs (physical memory attributes) as system descriptors in the descriptor table. There are eight PMA registers in the design. Each PMA has lower and upper bounds plus a device type field and an accessibility field. This happens to be pretty similar to a segment descriptor, so I thought why not use one? I also made the page table address (PTA) and memory key table address (KYT) registers system descriptors.
Descriptors are a whopping 256 bits in size. A structure larger than 128 bits was required and 256 is the next logical size. There are two 64-bit fields for the base address and limit, plus another 20 bits for access rights. Room was also reserved to expand the base and limit fields beyond 64 bits.
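As a rough picture, a sketch of pulling the fields out of a 256-bit descriptor; field placement and the split of the reserved space are assumptions, not the spec.
Code:
// Sketch: carving up a 256-bit descriptor.  Field placement is an
// assumption for illustration; base and limit have room reserved to
// grow past 64 bits.
module desc_fields(
  input  [255:0] desc,
  output  [63:0] base,
  output  [63:0] limit,
  output  [19:0] acr        // access rights
);
  assign base  = desc[63:0];
  assign limit = desc[127:64];
  assign acr   = desc[147:128];
  // desc[255:148] reserved, partly for extending base/limit
endmodule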

_________________
Robert Finch http://www.finitron.ca


Sun Oct 17, 2021 4:43 pm