


 [ 133 posts ]  Go to page Previous  1 ... 5, 6, 7, 8, 9
 nvio 
Author Message

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1443
Location: Canada
Shoehorning posit arithmetic functions into nvio v5. There is an empty row available in the instruction set where they could be placed. One issue is how to implement fused instructions. Because the fused dot product is required by the posit standard and a single instruction doesn’t have enough register read ports, a macro-fused instruction, the fused dot product F2DP, is going to be used. Generally, to use macro instruction fusion the processor must be able to look at several instructions in the instruction stream to detect a macro fusion pattern. However, I want something simpler, so I’m looking at having the instruction set itself be encoded with indicators for macro fusion. That way the processor core shouldn’t need to pattern match on incoming instructions. There are a couple of extra bits available in instruction encodings to do this. So, there would be one bit in the instruction that says ‘this is the first half’ of a macro fused instruction and another bit that marks the second half. That way the core would know to feed the instructions to a fused unit in a simple fashion.
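
To make the marker-bit idea concrete, here is a minimal sketch of how a front end might group instructions using only the two indicator bits, with no opcode pattern matching. The field names (fuse_first, fuse_second), the mnemonics, and the rule that a first half must be immediately followed by its second half are all assumptions made for illustration; none of them come from the nvio encoding.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    mnemonic: str
    fuse_first: bool = False   # hypothetical bit: first half of a fused pair
    fuse_second: bool = False  # hypothetical bit: second half of a fused pair


def group_for_issue(stream):
    """Group instructions into issue units by looking only at the marker bits."""
    groups = []
    i = 0
    while i < len(stream):
        ins = stream[i]
        if ins.fuse_first:
            # The encoding promises a partner, so the opcode is never inspected.
            if i + 1 >= len(stream) or not stream[i + 1].fuse_second:
                raise ValueError(f"{ins.mnemonic}: first half without a second half")
            groups.append((ins, stream[i + 1]))
            i += 2
        elif ins.fuse_second:
            raise ValueError(f"{ins.mnemonic}: second half without a first half")
        else:
            groups.append((ins,))
            i += 1
    return groups


# Made-up mnemonics: two halves of a fused operation followed by a plain add.
program = [
    Instr("pmul", fuse_first=True),
    Instr("pmuladd", fuse_second=True),
    Instr("add"),
]
for unit in group_for_issue(program):
    print([ins.mnemonic for ins in unit])
```

The point of the sketch is that the grouping decision depends on two bits per instruction, which is much cheaper in hardware than comparing opcode pairs.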

_________________
Robert Finch http://www.finitron.ca


Tue Apr 28, 2020 4:08 am WWW

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 203
Location: Huntsville, AL
Rob:

You've implemented posit-based arithmetic. It looks like you're either satisfied with posits or at least sufficiently intrigued by their utility that you're adding them to one of your projects. What's your take on them at this point? Good, bad, indifferent, or need more data? One claim made is that they should be faster than IEEE 754 floating point. Have you noticed an improvement in your cycle times by using posits instead of IEEE 754 floating point units?

_________________
Michael A.


Tue Apr 28, 2020 11:44 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1443
Location: Canada
I have not finished with them yet (I’ve not tried synthesizing them for size yet, for instance), so I don’t know for sure what the timing is like. I suspect that the cycle times and latencies would be very similar to regular floating-point. How well the toolset can retime the logic across registers will affect cycle time. Latency is just a matter of how many layers of registers one wants to insert. I don’t expect drastic differences. Many of the mechanics of dealing with floats or posits are the same. Some operations are simpler with posits, such as comparison and rounding. If there are fewer pipeline stages needed to round results or to handle special cases, then latencies may be lower. Posits are more accurate (at least that’s what I’ve read), and software using them may have to be coded differently than for IEEE-fp. Since there is no such thing as ‘NaN’ for posits, values that will cause issues must be trapped in software. I think the main benefit of posits is the ability to use smaller data formats in some circumstances, and a performance benefit from using posits may come from better cache usage. The fp hardware may not be any faster or smaller, but being able to move less data around will increase performance.

_________________
Robert Finch http://www.finitron.ca


Wed Apr 29, 2020 10:38 pm WWW

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 203
Location: Huntsville, AL
Good point about comparison of posits. I hadn't thought that through, but the encoding appears to be such that integer compares will work just as if the posit was an integer.
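
A quick way to check the integer-compare property is to decode every pattern of a small posit and confirm that sorting by two's complement integer value gives strictly increasing real values. The sketch below assumes 8-bit posits with two exponent bits (es = 2); the decoder and its names are written for this illustration and are not taken from Rob's core. The single NaR pattern (0x80) is left out of the check.

```python
def decode_posit(bits, n=8, es=2):
    """Decode an n-bit posit pattern to a float; returns None for NaR."""
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return None  # NaR, the one exception pattern
    negative = bits >> (n - 1) == 1
    if negative:
        bits = -bits & mask  # two's complement negate, then decode as positive
    body = bits & ((1 << (n - 1)) - 1)  # everything after the sign bit

    # Regime: a run of identical bits, ended by the opposite bit or the end of the word.
    pos = n - 2
    first = (body >> pos) & 1
    run = 0
    while pos >= 0 and ((body >> pos) & 1) == first:
        run += 1
        pos -= 1
    k = run - 1 if first else -run
    pos -= 1  # skip the terminating bit, if there was one
    remaining = max(pos + 1, 0)
    rest = body & ((1 << remaining) - 1)

    # Exponent bits come next; any that fall off the end of the word are zero.
    if remaining >= es:
        frac_bits = remaining - es
        exp = rest >> frac_bits
        frac = rest & ((1 << frac_bits) - 1)
    else:
        exp = rest << (es - remaining)
        frac_bits = 0
        frac = 0
    scale = k * (1 << es) + exp
    value = (1 + frac / (1 << frac_bits)) * 2.0 ** scale
    return -value if negative else value


def to_signed(bits, n=8):
    """Interpret an n-bit pattern as a two's complement integer."""
    return bits - (1 << n) if bits >> (n - 1) else bits


patterns = [p for p in range(256) if p != 0x80]  # skip NaR
by_int = sorted(patterns, key=to_signed)
values = [decode_posit(p) for p in by_int]
print(all(a < b for a, b in zip(values, values[1:])))
```

If the final line prints True, a signed integer comparator gives the same ordering as a posit comparator for every non-NaR pair at this width.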

_________________
Michael A.


Wed Apr 29, 2020 11:57 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1443
Location: Canada
Been contemplating how to implement fused instructions. Fused instructions combine multiple instructions to act as a single instruction. A common fused instruction is the multiply-add, (a * b) + c. First a multiply instruction is done, then it is followed by an add instruction. When the processor sees this sequence of instructions, it may fuse the operations, performing them as a single unit. A more complex instruction, the dot product, may be performed as (a * b) + (c * d): two multiplies and an add. Fused instructions may be issued as separate instructions but executed as a single one. One benefit of fusing instructions is that it makes more register read ports available for an operation. A characteristic of fused instructions is that they have intermediate results that don’t need to be stored to the register file. Only the final result need be stored. An issue I see is how to determine when an intermediate result needs to be stored to the register file and when it doesn’t. A second benefit of fusing instructions is that intermediate results can contain more bits than would fit into a register, which allows better rounding and more accurate results.
From what I’ve read so far on the net, there is no provision for intermediate results to be stored to a register. Macro fusion examples are simple, for instance a compare followed by a branch operation, or two instructions used to perform an indexed load operation.
To hold onto intermediate results, which could contain double-width products, an intermediate results register file could be used. The register file would need to be spec’d by the instruction. It could lead to instructions like fmul.ddq (multiply double * double, produce quad result). A fused dot product would then be (fmul.ddq, fmul.ddq, fadd.qqd).
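
The accuracy benefit of keeping wide intermediates can be shown with a small numeric sketch. Python has no posit type, so IEEE doubles stand in here, and exact rationals play the role of the double-width intermediate results; the function names are made up for the illustration.

```python
from fractions import Fraction


def dot2_unfused(a, b, c, d):
    """(a*b) + (c*d) where each product is rounded to a double before the add."""
    return (a * b) + (c * d)


def dot2_fused(a, b, c, d):
    """Same expression with exact intermediates and one rounding at the end."""
    exact = Fraction(a) * Fraction(b) + Fraction(c) * Fraction(d)
    return float(exact)


a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0
d = 1.0

# a*b is exactly 1 - 2**-60, which rounds to 1.0 as a double,
# so the unfused version loses the entire result.
print(dot2_unfused(a, b, c, d))  # 0.0
print(dot2_fused(a, b, c, d))    # -2**-60, about -8.67e-19
```

This is the case an intermediate results register file (or a fused unit) is meant to handle: the product keeps its low bits until the final add and rounding.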

_________________
Robert Finch http://www.finitron.ca


Mon Jun 01, 2020 2:47 am WWW

Joined: Mon Oct 07, 2019 2:41 am
Posts: 255
Do Fused operations apply for complex numbers?


Mon Jun 01, 2020 5:01 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1443
Location: Canada
Quote:
Do Fused operations apply for complex numbers?

The fused operations use the usual number sets. They do not, for instance, perform a fused dot product of two complex numbers by themselves.

The nvio instruction set is being switched from a fixed 40-bit wide format to a variable length format. The first nine bits of the instruction will be examined to determine the instruction length via a lookup table. This table will probably be decoded at I$ load time. Opcodes are 8 bits, plus one bit indicating that an additional format byte has been added to the instruction. The format byte is used for vector operations to specify the mask register, and for operations where the size of the operands is not the default size. The 40-bit instruction included information for vector operations, which is inefficient in code space since most ops aren't vector ops.
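
A rough sketch of the length lookup, to show the shape of the idea. Everything specific here is assumed for illustration: that the nine bits are the low nine bits of the first two bytes read little-endian, that bit 8 is the format-byte flag, that the base length is 4 bytes, and that opcode 0x40 is a 6-byte instruction. The real table contents are not given in the post.

```python
FORMAT_BIT = 0x100  # assumed position of the "format byte follows" flag

# Hypothetical base lengths in bytes, indexed by the 8-bit opcode.
BASE_LENGTH = [4] * 256
BASE_LENGTH[0x40] = 6  # made-up opcode standing in for a longer instruction

# 512-entry table: one entry for every value of the first nine bits.
LENGTH_TABLE = [
    BASE_LENGTH[key & 0xFF] + (1 if key & FORMAT_BIT else 0)
    for key in range(512)
]


def split_instructions(code: bytes):
    """Cut a byte stream into instructions using only the nine-bit lookup."""
    out = []
    pc = 0
    while pc < len(code):
        key = int.from_bytes(code[pc:pc + 2], "little") & 0x1FF
        length = LENGTH_TABLE[key]
        out.append(code[pc:pc + length])
        pc += length
    return out


stream = bytes([
    0x10, 0x00, 0x00, 0x00,              # opcode 0x10, no format byte: 4 bytes
    0x10, 0x01, 0x00, 0x00, 0x00,        # opcode 0x10 with format byte: 5 bytes
    0x40, 0x00, 0x00, 0x00, 0x00, 0x00,  # made-up long opcode: 6 bytes
])
print([len(ins) for ins in split_instructions(stream)])
```

Doing the lookup at cache load time would let the lengths be stored alongside the cache line so the fetch stage does not have to repeat it.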

_________________
Robert Finch http://www.finitron.ca


Tue Jul 14, 2020 8:08 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1443
Location: Canada
I've been working away on the core, reducing the size of instructions while keeping things functional. I found a potential way of eliminating the 'z' bit from vector instructions.

Currently vector operations are guided by a mask register selected in the instruction, and also a ‘z’ bit which may be set in the instruction. The ‘z’ bit instructs the core to force vector elements to zero when the mask register indicates not to process the vector element. I’m contemplating moving the ‘z’ bit of the instruction into the vector mask register, resulting in the use of a 2-bit vector mask register for vector instructions rather than the usual single-bit register. The reason is that there is typically one of three values that one wants to set the target register element to as a result of a vector operation: zero, the newly calculated value, or the current value. Using a two-bit register allows any one of these outcomes to be selected; otherwise multiple instructions may be required to get the desired results.

Mask Register Contents
00 = leave corresponding result register contents alone
01 = set corresponding result register to calculated result
10 = set corresponding result register to zero
11 = set corresponding result register to calculated result

The above leads to the idea of having different Boolean operations performed for the result. With additional bits in the mask register, the result could be the bitwise ‘or’ or bitwise ‘and’ of the current and newly calculated results.

3-Bit Mask Register Contents
000 = leave corresponding result register contents alone
001 = set corresponding result register to calculated result
010 = set corresponding result register to zero
011 = set corresponding result register to calculated result
100 = set result register to bitwise 'or' of current and new result
101 = set result register to bitwise 'and' of current and new result
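
The tables above can be read as a per-element selection function. Here is a minimal sketch of that reading, using the 3-bit codes exactly as listed; the function name, the use of Python lists for vectors, and raising an error for the two unlisted codes (110 and 111) are assumptions made for the example.

```python
def apply_mask(current, computed, codes):
    """Pick each result element according to its 3-bit mask code."""
    result = []
    for cur, new, code in zip(current, computed, codes):
        if code == 0b000:
            result.append(cur)        # leave the element alone
        elif code in (0b001, 0b011):
            result.append(new)        # take the calculated result
        elif code == 0b010:
            result.append(0)          # force to zero
        elif code == 0b100:
            result.append(cur | new)  # bitwise 'or' of current and new
        elif code == 0b101:
            result.append(cur & new)  # bitwise 'and' of current and new
        else:
            raise ValueError(f"mask code {code:03b} is not defined in the table")
    return result


current = [1, 2, 3, 0b1100, 0b1100]
computed = [10, 20, 30, 0b1010, 0b1010]
codes = [0b000, 0b001, 0b010, 0b100, 0b101]
print(apply_mask(current, computed, codes))  # [1, 20, 0, 14, 8]
```

The 2-bit scheme is the same function restricted to codes 00 through 11.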

_________________
Robert Finch http://www.finitron.ca


Wed Jul 22, 2020 3:33 am WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1627
If this seems like a good idea, maybe also a good idea to be able to put all-1's into masked elements??


Wed Jul 22, 2020 7:39 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1443
Location: Canada
Putting all ones into a register is a natural next step.

I’m not sure it’s that good of an idea to have multiple bits in the mask registers. It was a late-night idea. It complicates other mask register operations, and the same results can be achieved just by using more instructions. For instance, there is a mask register population count instruction. If there are multiple bits per element in the mask register, what does the population count represent?

Code:
; Example using extra instructions to set zero status result:
add    v1,v2,v3,m1   ; perform a vector operation
com    m2,m1         ; complement vector mask register
mov    v1,r0,m2      ; conditionally move a zero to the result under guidance of m2
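
The three-instruction sequence can be modelled in a few lines to confirm it gives the same result as a zeroing mask. This is only a sketch: it assumes single-bit masks, that a masked operation leaves unselected elements unchanged, and that r0 reads as zero; the helper names are made up.

```python
def masked_add(dst, a, b, mask):
    """add vd,va,vb,m: write a+b where the mask bit is 1, keep dst elsewhere."""
    return [x + y if m else d for d, x, y, m in zip(dst, a, b, mask)]


def com(mask):
    """com md,ms: complement a single-bit-per-element mask."""
    return [1 - m for m in mask]


def masked_mov_scalar(dst, scalar, mask):
    """mov vd,r,m: write the scalar where the mask bit is 1, keep dst elsewhere."""
    return [scalar if m else d for d, m in zip(dst, mask)]


def run_sequence(v1, v2, v3, m1):
    v1 = masked_add(v1, v2, v3, m1)      # add v1,v2,v3,m1
    m2 = com(m1)                         # com m2,m1
    return masked_mov_scalar(v1, 0, m2)  # mov v1,r0,m2 (r0 assumed to read as 0)


print(run_sequence([9, 9, 9, 9], [1, 2, 3, 4], [10, 20, 30, 40], [1, 0, 1, 0]))
# [11, 0, 33, 0]
```

The cost is two extra instructions and a second mask register, which is the trade-off against widening the mask registers.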


Loopholes
The compare instructions may set a subroutine link register to the result of a compare operation. This is a side effect of the way compare results are stored. Compares normally target either general purpose registers or vector mask registers. It’s of no real value to be able to set a link register to a compare result, but it’s not prevented in the current core.

_________________
Robert Finch http://www.finitron.ca


Wed Jul 22, 2020 9:19 pm WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1443
Location: Canada
Here are the formats of the float and vector mask instructions for NVIO5.
It hasn’t been polished off yet, but there’s enough laid out to have something to go by.
Instructions are either 32-bit or 40-bit, usually determined by the ninth bit of the opcode.

Instructions follow a default operand format (64-bit double precision) using scalar registers. The default instruction format is 32 bits. If something other than the default operand format is desired, then 40-bit instructions must be used. The 40-bit format adds a mask register selection and a register format selection (e.g. 8x32 vector regs or 4x64, etc.).

There’s an immediate field in the vector mask operations (MOP) format. I can’t for the life of me remember what its purpose was. I’m sure it was something cool way back when I first set up the table, but I didn’t otherwise document it.

Attachment:
FltFmts.png



_________________
Robert Finch http://www.finitron.ca


Thu Jul 23, 2020 2:40 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1443
Location: Canada
Whittling away at the nvio5 instruction set again today.
Shown below are the branch unit instruction formats. Most of the instructions are 32-bit. Branch compare to r0 are 32-bit instructions; branch compare to r2 are 40-bit instructions to accommodate additional space for the r2 field. Jump and jump to subroutine instructions have a short (32-bit) and a long (48-bit) form. The return from subroutine also has a short form, and a longer form which allows a stack pointer update. NOP is just a single byte, useful for aligning instructions at particular byte addresses. Several miscellaneous instructions each have their own unique 32-bit format.
Attachment:
BranchFmts.png



_________________
Robert Finch http://www.finitron.ca


Fri Jul 24, 2020 3:00 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1443
Location: Canada
Taking another approach to a 'new' 64-bit core. Using g-core (52 bits) as a guideline. Most instructions are 24-bit. I have g-core almost working in an FPGA, so I figure it's a good base to start with. I need the core to fit into a smaller FPGA, so it's likely going to be an overlapped pipelined core, not a superscalar one.

_________________
Robert Finch http://www.finitron.ca


Sat Aug 15, 2020 3:18 am WWW