 Thor Core / FT64 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Got the compiler peephole optimization comments fixed up. Comments no longer throw the optimization off. I moved the comments into their own special field, at the cost of more code bloat.

Added a branch target buffer. This potentially makes almost every flow control operation a single cycle operation (if predicted correctly).
Finally added char (16 bit) load / store operations.
Added some BCD math operations (add, sub and multiply).
Updated the fp portion of the core to include a fp status reg.
Added some debug registers and logic to the core (taken from Thor code).

So quite a few changes in just a few days.

_________________
Robert Finch http://www.finitron.ca


Mon Jul 17, 2017 12:19 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
BCD! I don't suppose you anticipate applications in point-of-sale or pocket calculators?


Mon Jul 17, 2017 12:43 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
BCD! I don't suppose you anticipate applications in point-of-sale or pocket calculators?
I wasn't thinking of that. But any application of the core would be good.

The compiler was modified to emit jump tables for switches that have enough cases in them. It takes about seven instructions to implement the jump table switch so there has to be about 14 or more cases before it’s more efficient on average to use a jump table. The compiler used to output only a series of compare and branches. So this is a new feature.
Added a check (chk) instruction to the core, used by switch statements to jump to the default case when the value falls outside the range of the case values. The naked keyword can now be applied to the switch statement to make the compiler omit the check. That gives faster code, but then there is no guarantee that the right address will be jumped to.
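
As a rough illustration of the trade-off described above, here is a small Python model of a range-checked jump table and of the break-even point against a chain of compares. The function names, the cost of one instruction per compare-and-branch, and the assumption that every case is equally likely are all hypothetical; none of this is taken from the FT64 compiler itself.

```python
# Hypothetical model of a switch compiled to a jump table with a range check.
def make_jump_table_switch(handlers, default, lo):
    """handlers is indexed by (value - lo); None marks a gap between case values."""
    def dispatch(value):
        index = value - lo
        # Range check (the role played by chk): out-of-range values go to default.
        if index < 0 or index >= len(handlers):
            return default()
        handler = handlers[index]
        return handler() if handler is not None else default()
    return dispatch

JUMP_TABLE_COST = 7  # "about seven instructions", as stated in the post

def average_compare_chain_cost(num_cases):
    # Assumption: one compare-and-branch per case tested, all cases equally
    # likely, so matching the i-th case costs i instructions.
    return (num_cases + 1) / 2

def break_even_cases():
    # Smallest case count where the compare chain is slower on average.
    n = 1
    while average_compare_chain_cost(n) <= JUMP_TABLE_COST:
        n += 1
    return n
```

Under these assumptions the break-even comes out at 14 cases, which lines up with the estimate in the post. Omitting the check, as the naked keyword does, would correspond to dropping the range test and indexing the table directly.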

A bus error signal and exception support was also added to the core.

_________________
Robert Finch http://www.finitron.ca


Tue Jul 18, 2017 10:32 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Re-wrote the stomp logic to remove a combinational loop. With the combinational loop removed the system now builds through bitstream generation. It’s possible to test in FPGA now. It took a while to identify the combinational loop. Strange thing is the very same combinational loop is in the original Thor code and that built without errors.
The first try in an FPGA didn't work. All it had to do was turn on some LEDs.

_________________
Robert Finch http://www.finitron.ca


Fri Jul 21, 2017 3:15 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Got the LED test to work! But the screen didn’t clear like it was supposed to. At least I know the instruction ROM is likely hooked up correctly and the core can boot.

After a quick estimate of size increase, vector instructions are being added to the core. There will be eight vector registers with a length of 16 elements, a total of 128 more registers in the core.
One thing I’m not sure about is whether or not to add bypassing for the vector mask registers. A vector mask register controls which vector elements are processed during a vector operation. The mask registers are set using one of the vector set instructions (VSLT for instance). Bypassing is not being provided for the vector length register because it’s likely to be infrequently altered. If the mask registers are not bypassed then SYNC instructions will need to be used, and that costs six or eight clock cycles. However, if the mask registers are infrequently set, that probably doesn’t matter compared to the number of clocks for the vector operation (a vector divide might take 1,800 clock cycles, for instance). I haven’t written any vector code before so this is outta my ballpark.
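
To make the role of the mask and length registers concrete, here is a minimal Python sketch of a masked vector add, plus a back-of-envelope estimate of the SYNC overhead. The function names are made up for illustration, and the choice to leave masked-off destination elements unchanged is an assumption; the post does not say what the core does with them.

```python
def masked_vadd(va, vb, vdst, mask, vlen):
    """Model of a masked vector add over the first vlen elements.

    Element i becomes va[i] + vb[i] when bit i of mask is set. Masked-off
    elements keep their old value (an assumption, not confirmed by the post).
    """
    out = list(vdst)
    for i in range(vlen):
        if (mask >> i) & 1:
            out[i] = va[i] + vb[i]
    return out

def sync_overhead_fraction(sync_cycles, vector_op_cycles):
    # Share of total time spent in the SYNC that precedes one vector operation.
    return sync_cycles / (sync_cycles + vector_op_cycles)
```

With an eight cycle SYNC in front of an 1,800 cycle vector divide, `sync_overhead_fraction(8, 1800)` is under half a percent. For a short operation, say a hypothetical 16 cycle vector add, the same SYNC would be a third of the total, so the answer likely depends on how often masks change relative to the length of the operations that use them.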

_________________
Robert Finch http://www.finitron.ca


Sat Jul 22, 2017 3:27 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
The number of vector registers has increased to thirteen from eight. Also the general purpose register set is treated as vector register #0 giving potentially fourteen vector registers. The vector load instruction may be used to load the general purpose registers.
Core size is sitting at about 120,000 LUTs with vector instructions.
The author had a quick look at CAL (Cray Assembly Language) and borrowed a couple of the Cray instructions (scan, cmprss and cidx).

The first vector test program adds a bunch of numbers to itself, saves it in another vector register, zeros out the vector register, then restores it.
Code:
FFFC0114 003FFF1A       lv      v0,tblvect      ; load up a bunch of numbers from a table
FFFC0118 01C00036
FFFC011C 10000001       vadd    v0,v0,v0,vm0    ; add to self (multiply by two)
FFFC0120 64030001       vors    v3,v0,r0,vm0    ; move to another register (or with scalar zero)
FFFC0124 28000001       vxor    v0,v0,v0,vm0    ; zero out vector
FFFC0128 640000C1       vors    v0,v3,r0,vm0    ; then restore it

The author used this program to get queuing of vector instructions working.
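
For anyone who wants to check the expected register contents, the test program above can be modelled in a few lines of Python. This is only a sketch: it assumes the mask vm0 enables every element and that r0 reads as zero (as the "or with scalar zero" comment suggests), and the function name is made up.

```python
def run_vector_test(table):
    """Step through the vector test program on plain Python lists."""
    r0 = 0                                   # scalar zero, per the listing's comment
    v0 = list(table)                         # lv   v0,tblvect
    v0 = [a + b for a, b in zip(v0, v0)]     # vadd v0,v0,v0,vm0  (doubles each element)
    v3 = [a | r0 for a in v0]                # vors v3,v0,r0,vm0  (copy via OR with zero)
    v0 = [a ^ b for a, b in zip(v0, v0)]     # vxor v0,v0,v0,vm0  (all zeros)
    zeroed = list(v0)                        # snapshot taken only for checking
    v0 = [a | r0 for a in v3]                # vors v0,v3,r0,vm0  (restore)
    return zeroed, v0
```

Calling `run_vector_test([1, 2, 3])` gives an all-zero vector after the vxor and `[2, 4, 6]` after the restore.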

_________________
Robert Finch http://www.finitron.ca


Sun Jul 23, 2017 9:26 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
2017/07/23
More work on vector instructions. I’m trying to expand the vector register file to 32 registers of 32 elements each. It would be nice if the length were 32 elements or more because then the general register file fits entirely in one vector register. It requires a couple more bits to specify a register, and some of the register-related logic is four times the size.

I tried expanding the queue ability for vector instruction elements to two per clock cycle. But it leads to code bloat. Much of the time there is no point to being able to queue more than one instruction per clock because there is only a single FP unit. In a future version of the core with multiple FP units multiple vector elements could be issued in a single clock cycle. Then it would make more sense to be able to queue more instructions per clock.

_________________
Robert Finch http://www.finitron.ca


Mon Jul 24, 2017 8:52 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
2017/07/24
Expanding the number of vector registers didn’t work. Synthesis crashed the machine after running for about three hours. There probably aren’t enough resources in the FPGA, which makes the design very difficult to synthesize.

I’ve decided to convert the core code into System Verilog. System Verilog has a couple of nice features. I like being able to define types and classes.
Started work on a new core (Monster64) with experimental branch handling in the pipeline. I got this notion that only part of the pipeline needs to be flushed on a branch miss if the branch target address is already in the pipeline.

_________________
Robert Finch http://www.finitron.ca


Tue Jul 25, 2017 10:19 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Well, the optimization to branch to targets already in the queue didn’t work as well as I thought it would. The problem is that instructions following the branch may have dependencies on other instructions that were skipped over. The simplest way to resolve the dependencies is to just flush the entire queue. Otherwise things get a bit complicated. I coded for the simple case where the instructions branched over don’t update any registers. That seemed to work. It should also be possible to account for cases where there are target registers but no dependencies. The difficult case is when there are dependents.
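
The three cases described above can be sketched as a classification over a simplified instruction queue. This Python model is purely illustrative: the QueuedInsn fields, the function name and the returned labels are invented here, only forward branches are considered, and a real core would track dependencies through its own tags rather than by register number.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class QueuedInsn:
    pc: int
    dest: Optional[int] = None        # register written, if any
    srcs: Tuple[int, ...] = ()        # registers read

def classify_branch_miss(queue: List[QueuedInsn], branch_idx: int, target_pc: int) -> str:
    """Classify a mispredicted forward branch sitting at queue[branch_idx]."""
    target_idx = next((i for i in range(branch_idx + 1, len(queue))
                       if queue[i].pc == target_pc), None)
    if target_idx is None:
        return "full_flush"           # target is not in the queue at all
    skipped = queue[branch_idx + 1:target_idx]
    # Registers whose queued values would come from skipped instructions.
    tainted = {insn.dest for insn in skipped if insn.dest is not None}
    if not tainted:
        return "skip_no_writes"       # the simple case from the post
    for insn in queue[target_idx:]:
        if any(src in tainted for src in insn.srcs):
            return "skip_has_dependents"   # the difficult case
        tainted.discard(insn.dest)    # rewritten by a kept instruction
    return "skip_no_dependents"
```

Only the `skip_no_writes` result matches what the post says was coded. The other two labels mark the cases it describes as possible and difficult respectively.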

Busily converting things into System Verilog.

_________________
Robert Finch http://www.finitron.ca


Thu Jul 27, 2017 9:55 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
robfinch wrote:
Otherwise things get a bit complicated.

One of the things I've belatedly realised is that the steady progression of ever higher performance-per-tick of modern processors comes from highly experienced teams chipping away at this complexity, generation by generation. There's a great deal you could do, theoretically, but you have to do it right, so you do what you can and in the process you learn. If you bite off too much, you get one of those missing generation moments where the chip never gets shipped, which can hurt the business a lot.


Thu Jul 27, 2017 1:43 pm

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
I am wondering how well this Theory of Technical Evolution is developed, and is there any explanation of how reverse evolution, i.e. simplification, re-enters the solution space. :)

_________________
Michael A.


Thu Jul 27, 2017 2:25 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
I think the one-page computing challenge, and Jan Gray's work, both point to resource-limited CPUs as being necessarily simple. Oh, and the 128-slice challenge too. Although our own one-page efforts have got steadily less simple as they evolved more performance. We couldn't stop ourselves!


Thu Jul 27, 2017 2:43 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Technical evolution is interesting. The comment about experienced teams chipping away is probably true. I think sometimes it’s worth it to take a big bite, provided you can leverage existing knowledge to accomplish the goal.
There’s only so much one can do before having to jump to another level.

2017/07/27
Modified the core so that processing related to register bypassing and validation only sees 64 registers (32 scalar + 32 vector). Previously a separate register code was being used for each vector element. With 16 registers of 16 elements each, this led to 256 codes for the logic to process. Vector registers don’t really need the same kind of bypassing as general purpose registers, and using 256 codes was wasteful of logic. It was quick to implement though. I wanted to expand the number of vector registers and the existing logic just didn’t cut it. So hopefully with the changes the logic will be smaller while allowing more vector registers. The only size increase should be in the register file, which uses block RAM. Hopefully it works.
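
The difference between the two numbering schemes can be shown with a little arithmetic. In this Python sketch the layout (scalar registers at codes 0 to 31, vector registers at 32 to 63) and all the function names are assumptions made for illustration; the post only gives the totals.

```python
NUM_SCALAR = 32
NUM_VECTOR = 32

def per_element_code(vreg, element, vlen):
    # Old scheme: every element of every vector register gets its own code.
    return vreg * vlen + element

def register_code(is_vector, regno):
    # New scheme (assumed layout): one code per register, with the element
    # index carried separately from the bypass and validation logic.
    limit = NUM_VECTOR if is_vector else NUM_SCALAR
    if not 0 <= regno < limit:
        raise ValueError("register number out of range")
    return (NUM_SCALAR if is_vector else 0) + regno

def code_bits(num_codes):
    # Bits needed to distinguish num_codes different codes.
    return max(1, (num_codes - 1).bit_length())
```

With 16 registers of 16 elements the old scheme tops out at code 255 and needs 8 bits per comparison, while the per-register scheme needs 6 bits for 64 codes no matter how long the vectors are.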

_________________
Robert Finch http://www.finitron.ca


Fri Jul 28, 2017 7:43 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Modifications to the core seem to work! There are now 32 vector registers of length 64 available. The compiler is being modified to support a vector type.
Still more mods to make. Currently the core sees the vector and scalar register files as separate. The core may be switched back to viewing the GPRs as vector register #0.
It all depends on what is used for the register id.

_________________
Robert Finch http://www.finitron.ca


Tue Aug 01, 2017 3:49 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
This core is now too big for the FPGA :( So it’s a back-burner project for now. It’s sitting at about 140k LUTs but there are only about 135k available in the device. It really needs a honking big FPGA to evolve into a multi-core network-on-chip system.

The instruction set was modified slightly to increase the number of available opcodes. Quadrant oriented instructions were moved to a single opcode from four separate opcodes. Immediate quadrant operations are ADD, OR, AND, and XOR to allow building 64 bit constants in a register. The quadrant operations operate on one of four quadrants of the target register. Bits 0 to 15, bits 16 to 31, bits 32 to 47 or bits 48 to 63 are the quadrants. Set immediate instructions were added to the instruction set.
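
As an illustration of how quadrant operations can assemble a 64 bit constant 16 bits at a time, here is a small Python model. It is a sketch under assumptions: the function names are invented, the AND form is assumed to leave the other three quadrants untouched, and ADD is left out because the post doesn't say how carries between quadrants are handled.

```python
MASK64 = (1 << 64) - 1

def quadrant_op(op, reg, quadrant, imm16):
    """Apply a 16-bit immediate to one quadrant (0..3) of a 64-bit register value."""
    if not 0 <= quadrant <= 3 or not 0 <= imm16 <= 0xFFFF:
        raise ValueError("quadrant must be 0..3 and imm16 must fit in 16 bits")
    shift = 16 * quadrant
    operand = imm16 << shift
    if op == "or":
        return (reg | operand) & MASK64
    if op == "xor":
        return (reg ^ operand) & MASK64
    if op == "and":
        # Assumption: bits outside the selected quadrant are preserved.
        others = MASK64 & ~(0xFFFF << shift)
        return reg & (operand | others)
    raise ValueError("unsupported op")

def build_constant(value):
    # Start from zero and OR in each 16-bit quadrant in turn.
    reg = 0
    for q in range(4):
        reg = quadrant_op("or", reg, q, (value >> (16 * q)) & 0xFFFF)
    return reg
```

In this model a full 64 bit constant takes up to four quadrant operations starting from a zeroed register, and fewer when some quadrants are already zero.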

The instruction set is fairly packed, but it’s fortunate that the instruction set includes almost everything. There are only five opcodes available at the root level, and only two opcodes available in the R1 group. I’ve been toying with the idea of developing yet another processor with a seven bit base opcode rather than six. But it looks like a base opcode of six bits is just barely enough so I’ll probably stick with it.

FT64 uses fixed size 32 bit instructions with no prefix or postfix instructions.

_________________
Robert Finch http://www.finitron.ca


Mon Nov 13, 2017 10:02 pm