


 Thor Core / FT64 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Stopped working on this project for now. I didn't like the bit-pair-addressable instructions, so the instruction set may go back to 32 or 40 bits.

Revived a project from 2009 which was an 8088 core. It's being updated.

_________________
Robert Finch http://www.finitron.ca


Thu Mar 22, 2018 1:51 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Decided to go back to the original FT64 instruction set, which has 32-bit opcodes, and revise it. All the multiplexing for bit-pair-aligned instructions was bound to slow down the core at some point.
The BB and BD formats have been revised, swapping the placement of the LSB of the displacement with the branch prediction indicator. So branches are organized into two groups of opcodes, predicted and not-predicted, rather than having the LSB of the displacement as part of the opcode field. To free up some opcode space at the root level, the multiply and divide operations have all been placed under a single opcode, which uses a 12-bit immediate rather than a 16-bit one. Vector instructions have been separated into two groups to allow an additional bit in the type/precision field. A couple of other opcodes have been rearranged.
The instruction formats now look something like:
Attachment: InsnFormat.png (FT64v2 instruction formats)

_________________
Robert Finch http://www.finitron.ca


Sun Mar 25, 2018 3:18 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Hey, I just discovered that the wrong pair of bits was being decoded as the branch predictor field in the instruction. It's been wrong all along; the core still branched to the correct address, just with a performance degradation. Bits 22,21 were decoded when it should have been bits 21,20. I'm not going back to modify the old version. The new version of the core, FT64v2, fixes this since the predictor field is in a different place anyway.
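A minimal sketch of the slip, assuming a two-bit predictor field as described above (the helper names are illustrative, not the core's actual decode logic):
Code:
#include <stdint.h>

/* The predictor pair was being taken from bits 22..21 of the instruction
   word; it should come from bits 21..20 (per the post above). */
static inline uint32_t pred_as_decoded(uint32_t insn) { return (insn >> 21) & 3u; }
static inline uint32_t pred_intended(uint32_t insn)   { return (insn >> 20) & 3u; }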

_________________
Robert Finch http://www.finitron.ca


Sun Mar 25, 2018 9:10 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
That's an interesting verification challenge!


Mon Mar 26, 2018 8:27 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I've had to find other things to do as I can't get the Vivado toolset to work. The IDE starts up, starts refreshing the file hierarchy, then quits. I've posted a message on the Xilinx board. I tried installing the latest version with the same results.

_________________
Robert Finch http://www.finitron.ca


Sun Apr 15, 2018 11:45 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Ah, I've never (yet) had to stray beyond ISE, which is increasingly limiting.


Mon Apr 16, 2018 6:57 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Today I’m wondering about the value of branch-prediction control bits in the instruction. I note that a number of contemporary instruction sets make no provision for branch-prediction control. Without these bits all branches must be predicted, even branches that are known to be always or never taken. Presumably the control bits are absent because there is no provision in the high-level language for specifying them. Leaving these bits out of the instruction probably doesn’t impact performance very much. The size and complexity of the core could be reduced by removing the branch-prediction control bits, but the difference isn’t very great.
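As an aside, C compilers do let the programmer express likelihood, though it typically only steers block layout and the compiler's static prediction rather than setting per-instruction bits. A small GCC/Clang-style example (the macro names are just a common convention):
Code:
#include <stdio.h>

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int process(int err)
{
    if (unlikely(err)) {            /* hinted as rarely taken */
        fprintf(stderr, "error %d\n", err);
        return -1;
    }
    return 0;                       /* fall-through path laid out as the hot path */
}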

An issue I haven’t worried a lot about yet is what to do when an instruction exceeds the branch range. Branches are limited to a 1K-word displacement in either direction. What should happen if an attempt is made to branch further? The standard approach is to output the longest form of the code in all cases and have the linker shorten it where possible. That seems like nonsense to me for relative branches: the longest form branches around a jump instruction to the target address, and doing this for every single branch would seem to defeat the purpose of having a branch displacement at all. The compiler can’t really decide what to output either, because it would have to count the instruction words emitted to know when to provide the alternate branching code, and counting instruction words at the compile stage probably isn’t very reliable.
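A rough sketch of that worst-case expansion follows; the ±1K-word limit is from above, while the mnemonics and helper are illustrative, not FT64 syntax:
Code:
#include <stdio.h>
#include <stdint.h>

/* Short conditional branch when the displacement fits, otherwise an
   inverted branch around a full-range jump. */
static int fits_short(int64_t disp_words)
{
    return disp_words >= -1024 && disp_words < 1024;
}

int main(void)
{
    int64_t disp = 5000;                            /* example displacement in words */
    if (fits_short(disp))
        printf("b<cc>   target          ; short form\n");
    else
        printf("b<!cc>  skip\njmp     target          ; branch around a jump\nskip:\n");
    return 0;
}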

_________________
Robert Finch http://www.finitron.ca


Mon Apr 23, 2018 10:56 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
I have a hand wavey idea for resolving branches - it's a heuristic, but I think that might be the best that can reasonably be hoped for.

The problem is not knowing how long a branch implementation will be, because the distance of the branch may or may not be "short."

Hmm, I've tried a few ways to explain what I'm vaguely thinking of, and it's not coming out cleanly! Just possibly it's worse than I thought. I believe I heard it said of the Occam compiler that it could take an indeterminate number of passes to resolve the final form of code, because of this kind of problem.

If you can order the branches in terms of how deeply they interact with the final form of other branches, then you can visit them in order of increasing doubt. As you resolve the least doubtful ones, either trivially or arbitrarily, you reduce the doubt in the branches which skip over those.

If there's a branch which jumps out of a loop, and a branch at the end which jumps to the top, then both distances are dependent on the final form of the other branch. But this only needs an arbitrary solution if the distances of both are right on the threshold of "short."

Two things might be useful: to annotate straight-line sections of code with their actual length, and to annotate branches with their max and min length, and their max and min distances. If the max distance is "short" then you know the branch will be of minimum length (see the sketch below).

But all that said, it feels like you'd need your compiler to have a suitable stopping-off point to insert that analysis - after a lot of code generation is done, but before you've committed all of it.
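A rough sketch of that min/max-distance bookkeeping, using the FT64 short range from above; the names and structure are illustrative:
Code:
#include <stdint.h>

enum { SHORT_RANGE = 1024 };        /* words, from the FT64 posts above */

typedef struct {
    int64_t min_dist;               /* best-case distance magnitude, in words */
    int64_t max_dist;               /* worst-case distance while other branches are unresolved */
    int     resolved;
    int     is_short;
} branch_info;

/* Resolve a branch whenever its bounds no longer straddle the threshold;
   each resolution tightens the bounds of the branches that span it. */
int try_resolve(branch_info *b)
{
    if (!b->resolved) {
        if (b->max_dist < SHORT_RANGE)       { b->is_short = 1; b->resolved = 1; }
        else if (b->min_dist >= SHORT_RANGE) { b->is_short = 0; b->resolved = 1; }
    }
    return b->resolved;
}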


Tue Apr 24, 2018 3:33 pm

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
Ed:

I don't know how the Occam compiler or the Transputer assembler resolved this problem, but as you probably suspected, I stepped on this landmine with the variable length branch that I implemented for the MiniCPU (and envisioned for the MiniCPU-S).

As I've struggled to resolve this issue, the only algorithm I've come up with that appears to have a reasonable chance of success is to resolve the width of the offset (4, 8, 12, or 16 bits) for backward branches during the first pass (since the target addresses are "known"), and assign a long (16-bit) offset to all forward branches. Subsequent passes can reduce the width of the offset for forward branches from 16 bits to some shorter length while also adjusting the lengths of the backward branches.
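A condensed sketch of that pass structure; the offset widths are from above, while the width boundaries and types are illustrative:
Code:
#include <stdint.h>

typedef struct { int32_t from, to; int width; } branch_ref;

/* Smallest of the 4/8/12/16-bit offsets that can hold the distance. */
int offset_width(int32_t dist)
{
    uint32_t d = dist < 0 ? 0u - (uint32_t)dist : (uint32_t)dist;
    if (d < (1u << 3))  return 4;
    if (d < (1u << 7))  return 8;
    if (d < (1u << 11)) return 12;
    return 16;
}

/* First pass: backward branches are sized exactly (targets already known),
   forward branches get the long form.  Later passes re-run offset_width()
   with updated addresses and only ever shrink forward branches. */
void first_pass(branch_ref *br, int n)
{
    for (int i = 0; i < n; i++)
        br[i].width = (br[i].to <= br[i].from) ? offset_width(br[i].to - br[i].from) : 16;
}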

There are some obvious refinements regarding the width of forward references, particularly since the address width is only 16 bits for these two processors. All in all, a combined branch target table and symbol table may be beneficial to keep the number of passes through the actual source code to a minimum.

_________________
Michael A.


Tue Apr 24, 2018 6:06 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Contemplating issues tonight with not having enough opcode bits to represent vector operations. I’m wondering how I can squeeze five or six bits into the two bits available in an instruction. The obvious way to extend the opcode bits is to put the less frequently used bits in a control register and reference the control register from the instruction. The bits needed to specify a vector operation include precision and rounding bits, which mask register to use, and the length of the vector. I suppose there could be a set of precision and rounding registers, analogous to the vector mask register, specified for use with the vector instruction.
Should the round mode for vector operations come from the scalar floating-point control register? Or should it come from the instruction itself? Or should another register be assigned to contain the round mode for vector operations? Vector instructions currently specify a type field (integer, single, double) for the operation. This field is only two bits, but there could potentially be more types. For instance there could be SIMD vector operations on 16- or 32-bit integers, or on pairs of single-precision floats.
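A hypothetical layout for such a control register, just to make the bit budget concrete; all the field names and widths here are assumptions, not the FT64 encoding:
Code:
#include <stdint.h>

/* Bits that won't fit in the instruction, gathered into one control register.
   The instruction then only needs a bit or two to pick one of a small bank
   of these registers. */
typedef struct {
    uint8_t precision;   /* e.g. 0=int, 1=single, 2=double, with room to grow */
    uint8_t round_mode;  /* round mode for vector ops, if not taken from the FP CR */
    uint8_t mask_reg;    /* which vector mask register to use */
    uint8_t vec_len;     /* active vector length */
} vec_ctrl;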

_________________
Robert Finch http://www.finitron.ca


Tue May 08, 2018 3:10 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Precision and rounding do sound like modes to me, so mode bits would be a good fit. Arguably length too, in the sense that most vectorised code would (at a guess!) do a lot of work at some given vector width, before possibly switching to a different width.


Tue May 08, 2018 9:02 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I’m gravitating back to this project now. At least the files for the project have been reviewed over the last couple of days. FT128 may be in the works. 96-bit floating point is desired, along with design simplicity, including a unified register file. I don’t really see the need for a full 128-bit processor right now, especially for something homebrew. But for simplicity’s sake, and to support at least 96 bits in the register file, a 128-bit design may be what gets made.

I would like the processor to inherit some of the better qualities from the 68k.

I note the trend in current designs seems to be widening the register file. There are 512-bit-wide registers in processors now to support SIMD operations.

_________________
Robert Finch http://www.finitron.ca


Tue Jul 10, 2018 5:02 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Are you thinking of SIMD? Or just very wide floating point?


Tue Jul 10, 2018 5:27 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I'm thinking SIMD. Going over 128 bits wide for FP seems unreasonable. I'd have thought that designs would follow a vector instruction set model and add vector registers, but they seem to just make the register set wider instead.

It looks like I'll have to put up with the current ISA. I tried manipulating the 68k instruction set to add the desired features, but it decreases the code density. There is just going to have to be some loss of things like code density and performance in this project, or it wouldn't be simple enough to do.

Double-word (128-bit) load/store operations need to be added to the ISA now in order to support the floating point. The bus is only 64 bits wide, so two bus cycles are required to transfer 128 bits. Using 128-bit accesses to load/store only 96-bit values seems a bit wasteful. I'm wondering if the data should be left-aligned and padded with zeros on the right, which might make it more compatible with a real 128-bit format.
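A small sketch of that left-aligned layout, treating "left" as the first 12 bytes of the 16-byte slot (purely illustrative):
Code:
#include <stdint.h>
#include <string.h>

/* Store a 96-bit value into a 128-bit memory slot, padding with zeros so the
   slot still resembles a (truncated) 128-bit format. */
void store_fp96(uint8_t slot[16], const uint8_t value96[12])
{
    memcpy(slot, value96, 12);   /* the significant 96 bits */
    memset(slot + 12, 0, 4);     /* zero pad on the right */
}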


Just looking at the 68881 constant ROM instruction. It seems like it might be sensible to include a small constant ROM for pi and other common constants so that they don't have to be loaded from a literal pool. Which constants to include? pi, 1.0, 10.0, log 2, etc.
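A table along those lines might look like the following; the values are standard constants, but the selection and indexing are guesses, not the 68881's actual FMOVECR assignments:
Code:
/* Illustrative constant ROM, indexed by a small immediate in the instruction. */
static const double const_rom[] = {
    3.14159265358979323846,   /* pi      */
    1.0,
    10.0,
    0.69314718055994530942,   /* ln 2    */
    0.30102999566398119521,   /* log10 2 */
};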

_________________
Robert Finch http://www.finitron.ca


Tue Jul 10, 2018 10:56 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I’ve now gone back to work on FT64v3, the version with the 18/36-bit instructions and fractional addressing.

To get better code density, it’s really tempting to vary the instruction size with a larger number of formats, and it’s very easy to change. Because it’s a barrel processor fetching from a different thread each cycle, the instruction size doesn’t need to be determined immediately for the next instruction fetch. This means determining the size can be fairly complicated without impacting timing; it only needs to be determined before the next instruction fetch for the same thread.

The following line of code at the queue stage sets the next pc for the thread.
Code:
next_pc[fetchbuf1_thrd] <= fetchbuf1_pc + (fetchbuf1_dc ? 6'b0100_10 : 6'b0010_01);   // Add 4.5 or 2.25 bytes (two fractional bits)

The mux driven by fetchbuf1_dc could just as easily read from a small table.
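For reference, a software model of that increment in quarter-byte units; the table contents follow from the two constants above, everything else is illustrative:
Code:
#include <stdint.h>

/* PCs held in units of 1/4 byte: an 18-bit (2.25-byte) instruction advances
   the PC by 9 units, a 36-bit (4.5-byte) one by 18, matching 6'b0010_01 and
   6'b0100_10 with the binary point two bits up. */
static const uint32_t insn_len_q[2] = { 9, 18 };

uint32_t next_pc_q(uint32_t pc_q, int dc)
{
    return pc_q + insn_len_q[dc & 1];
}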

Found a bug in the Int128 fixed-point integer class. A method was doing signed right shifts instead of unsigned ones, which ultimately caused the calculation of some constants to be off.
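For anyone who hasn't hit it before, the difference is easy to see in C (the behaviour shown is the usual arithmetic-shift treatment of signed types):
Code:
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t u = 0x8000000000000000ull;
    int64_t  s = (int64_t)u;

    printf("%016llx\n", (unsigned long long)(u >> 4));           /* 0800000000000000: zeros shifted in */
    printf("%016llx\n", (unsigned long long)(uint64_t)(s >> 4)); /* f800000000000000 on typical compilers: sign bit smeared */
    return 0;
}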

_________________
Robert Finch http://www.finitron.ca


Thu Jul 12, 2018 4:34 pm