


Reply to topic  [ 133 posts ]  Go to page Previous  1, 2, 3, 4, 5, 6, 7 ... 9  Next
 nvio 
Author Message

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Found a bug in the shift operations. The shift modules were copied unchanged from FT64, so their instruction decode needed to be altered to match nvio's. This could explain why the status LEDs weren't cycling on the FPGA.

_________________
Robert Finch http://www.finitron.ca


Tue Jul 02, 2019 4:30 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Just about got rid of the poll-for-interrupt (pfi) instruction since it seems redundant. Almost the same thing can be accomplished by a two-instruction sequence that enables and then disables interrupts. An important difference, though, is that pfi controls exactly where the interrupt can occur, which matters for some OS code. Unmasking and then masking interrupts doesn't have the same effect.

The load and store multiple instructions were axed. The value for the amount of hardware required just wasn't there. Load and store multiple had to be handled at the decode buffer stage, before instructions are queued, which led to additional decoding prior to the decode buffer. There were several issues with the instructions:

a) Multiple versions of the instruction had to be present to load the floating-point or integer register sets. Each register set also required two instructions, as only thirty-two registers could be specified in a single instruction. This led to eight separate load / store instructions for an infrequently used operation.
b) A bitmask of registers had to be converted into register codes. The register code from the load / store multiple had to override the normal register decode, and additional logic in the register decode path would hurt performance.
c) Queuing of instructions had to be modified to support queuing a load / store multiple until the bitmask was exhausted. This was made more complex by the possibility of a load / store multiple in any instruction slot, or of multiple load / store multiples in the same bundle.

In short, the amount of hardware required started to grow.
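
To make the bitmask-to-register-code step in (b) concrete, here is a minimal software sketch in Python (not HDL, and not taken from the nvio source). The function name, the `base` offset used to pick the lower or upper half of a register set, and the 32-bit mask width are assumptions for illustration only.

```python
def expand_register_mask(mask: int, base: int = 0, width: int = 32) -> list[int]:
    """Return the register numbers selected by `mask`, lowest register first.

    `base` is a hypothetical offset selecting the lower (0) or upper (32)
    half of a 64-entry register set; `width` is the assumed mask width.
    """
    if mask < 0 or mask >> width:
        raise ValueError("mask does not fit in the register field")
    regs = []
    while mask:
        lowest = mask & -mask                        # isolate the lowest set bit
        regs.append(base + lowest.bit_length() - 1)  # bit position -> register code
        mask ^= lowest                               # that register is now handled
    return regs
```

Each pass through the loop corresponds to one queued load or store, which is roughly why the queue logic in (c) has to hold the instruction until the mask is used up.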

Tried the latest incarnation of the core in the FPGA, with the same result as before: the LEDs don't light up the way they're supposed to. I had thought fixing the shift bug would fix this, but I guess not.

_________________
Robert Finch http://www.finitron.ca


Wed Jul 03, 2019 3:13 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
It's taking six or more hours to build the system. Not ideal for testing turn-around time. Shelved nvio for at least a few days while working on CS01.

_________________
Robert Finch http://www.finitron.ca


Sat Jul 06, 2019 5:32 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Been updating the floating point. Changed a parameter name from 'WID' which was really too generic to 'FPWID' to help avoid conflicts with other modules. Also fixed some minor FP bugs.

_________________
Robert Finch http://www.finitron.ca


Sun Jul 07, 2019 5:30 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
The ISA had a field for memory acquire and release bits in all memory instructions. These bits are now present only in AMO (atomic memory operation) instructions, which freed up two more bits that were put to use for the displacement.

_________________
Robert Finch http://www.finitron.ca


Thu Jul 18, 2019 3:16 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Started working on the second version of NVIO. The second version will be a 64-bit CPU rather than 80 bits; however, it will support quad-precision (128-bit) floating-point. The instruction set will remain basically the same. It will be a challenge to get NVIO2 running in an FPGA, since NVIO at 80 bits just barely fits. A simple scalar version may be created to begin with to help validate the assembler / compiler.

_________________
Robert Finch http://www.finitron.ca


Fri Jul 19, 2019 3:15 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I think this version may go with a unified register file for integer and floating-point ops. This was the original intent for nvio. Part of the reason to use separate register files is the lack of bits to represent more registers in instructions. With the wider 40-bit opcodes there are enough bits for a unified register file. With 128-bit internal busses and register file, it's tempting to make nvio a full 128-bit machine.

There's also an idea to deviate from the standard 128-bit floating-point format. The standard is 1-15-112 for sign, exponent and mantissa. The deviation would be 1-19-108. The reason is that multipliers in an FPGA are 18x18, and it takes six of them side by side to cover 108 bits, so a multiply needs a 6 x 6 matrix (36) of multipliers. With just a few more bits, a 7 x 7 matrix (49 multipliers) is required. It takes almost 40% more resources just to calculate four more bits! Hence the desire to reduce the mantissa by a few bits. (The multipliers in the FPGA are really 25x18, I think, so it may be possible to organize things to use fewer multipliers, but I was going by a simple hand-coded approach where things are symmetrical.) There's also less propagation delay when multiplying fewer bits. My thinking is: try to code a multiplier and get it right, and a non-symmetrical setup is going to be a headache. I would use the multiplier generator tool supplied, but it only goes up to 64x64 multiplies.
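
The multiplier counts above can be checked with a few lines of Python. This is only the arithmetic for a symmetric grid of square multipliers; the function name is made up for the sketch and the 18-bit tile width is the figure quoted in the post.

```python
import math


def multiplier_tiles(mantissa_bits: int, tile_bits: int = 18) -> int:
    """Count the tiles in a symmetric n x n grid of tile_bits-wide multipliers."""
    per_side = math.ceil(mantissa_bits / tile_bits)  # tiles needed along one operand
    return per_side * per_side


narrow = multiplier_tiles(108)   # 6 x 6 grid
wide = multiplier_tiles(112)     # 7 x 7 grid
extra = wide / narrow - 1        # fractional increase in multipliers
```

Here `narrow` is 36, `wide` is 49, and `extra` works out to about 0.36, which is the "almost 40%" mentioned above.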

_________________
Robert Finch http://www.finitron.ca


Sat Jul 20, 2019 4:05 am
Profile WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Any idea if you, or your toolchain, could do better than n x n multipliers, for example by using Karatsuba's method?
https://en.wikipedia.org/wiki/Karatsuba_algorithm

For example, instead of 6x6, which is (3x2)x(3x2) or (2x2)x(3x3), you can rearrange to use just 3x(3x3) which is just 27 multiplications instead of 36.

Or, to make good use of your toolchain, the 112x112 can be considered as (2x2)x(56x56) which you can reduce to 3x(56x56) - your toolchain gives you the 56x56 components. It might be that it can do better than using (4x18)x(4x18) because (2+3x18)*(2+3x18) requires some two-bit multiplies, which can be done by adders.
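
As a rough check on those counts, the sketch below (Python, names invented for illustration) counts base multiplications for a schoolbook product versus a product with one or more levels of two-way Karatsuba splitting on top. It deliberately ignores the extra carry bit that the middle Karatsuba product can need, so the numbers are a lower bound rather than a resource estimate.

```python
def schoolbook_products(limbs: int) -> int:
    """Base multiplications for a plain limbs x limbs schoolbook product."""
    return limbs * limbs


def karatsuba_products(limbs: int, levels: int = 1) -> int:
    """Base multiplications with `levels` of two-way Karatsuba splitting.

    Falls back to schoolbook when no levels remain or the limb count is odd.
    """
    if levels == 0 or limbs % 2:
        return schoolbook_products(limbs)
    return 3 * karatsuba_products(limbs // 2, levels - 1)
```

With six 18-bit limbs, `schoolbook_products(6)` gives 36 and `karatsuba_products(6)` gives 27, matching the 3x(3x3) arrangement; `karatsuba_products(2)` gives 3, which is the (2x2) to 3 reduction applied to two 56-bit halves.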

It has to be said, Dave (hoglet) and I tried to do this kind of thing for Dave's port of the MandelMachine, and got ourselves perhaps slightly beyond our understanding of how to put together the pieces of large and small products. (We wanted to square 35 bit fractional numbers into a 38 bit result)
https://github.com/hoglet67/DSPFract

We found Xilinx' UG389 helpful. ("Spartan-6 FPGA DSP48A1 User Guide")


Sat Jul 20, 2019 6:51 am
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
This post really almost belongs under a computer arithmetic topic.

Thanks for the references, BigEd. I had not seen Karatsuba before, at least not that I could remember. I was able to put together a 114x114 multiplier for the floating-point unit that uses only about 30 multipliers. I didn't notice originally, but a 113-bit multiplier is needed, not 112, so one a bit larger is used. The 114 bits are broken up Karatsuba-style using three 57x57-bit multipliers generated by the FPGA vendor's toolset. I used a subtraction trick mentioned in Wikipedia to avoid multiplies with one more bit.
The 128-bit version of the floating-point cores is bound to be much larger than the 32-bit version, which was about 3,500 LUTs (so about 14k LUTs if size scales with width).
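
For anyone following along, here is a software sketch of a one-level Karatsuba split with the subtraction form of the middle term. It is Python, not the actual FPGA code; `mul57` merely stands in for the vendor-generated 57x57 multiplier and asserts that its operands really fit in 57 bits. All names are assumptions for the sketch.

```python
HALF = 57                      # width of each half of a 114-bit operand
MASK = (1 << HALF) - 1


def mul57(a: int, b: int) -> int:
    """Stand-in for a 57x57 unsigned hardware multiplier."""
    assert 0 <= a <= MASK and 0 <= b <= MASK, "operand wider than 57 bits"
    return a * b


def karatsuba114(x: int, y: int) -> int:
    """Multiply two unsigned 114-bit values using three 57x57 products."""
    if x < 0 or y < 0 or x >> (2 * HALF) or y >> (2 * HALF):
        raise ValueError("operands must be unsigned 114-bit values")
    x1, x0 = x >> HALF, x & MASK
    y1, y0 = y >> HALF, y & MASK
    z2 = mul57(x1, y1)                      # high halves
    z0 = mul57(x0, y0)                      # low halves
    dx, dy = x1 - x0, y1 - y0               # differences fit in 57 bits plus a sign
    mid = mul57(abs(dx), abs(dy))
    if (dx < 0) != (dy < 0):                # restore the sign of dx * dy
        mid = -mid
    z1 = z2 + z0 - mid                      # equals x1*y0 + x0*y1
    return (z2 << (2 * HALF)) + (z1 << HALF) + z0
```

The point of the subtraction form is that `abs(dx)` and `abs(dy)` never exceed 57 bits, whereas the sums `x1 + x0` and `y1 + y0` can need 58, which is the "one more bit" problem.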

_________________
Robert Finch http://www.finitron.ca


Sun Jul 21, 2019 4:40 am
Profile WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Hey that's great, not only is 30-some multipliers fewer than 45, it's (probably) fewer than the 36 you were considering for your reduced precision compromise!


Sun Jul 21, 2019 12:44 pm
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Working on a 128-bit version of nvio. The version in the works is a simple non-overlapped pipeline design, which is about the only way it has a chance of fitting in the FPGA.

Mulling over ways to load a 128-bit constant into a register, and recognizing that a simple load-quad instruction isn't that bad a way of doing things, at least compared to a chain of six or seven separate instructions to build up an immediate. I've been thinking of the following: a load-and-branch instruction, that is, an instruction that loads relative to the program counter and branches past the loaded values at the same time. This would allow placing constants in the instruction stream and branching around them without losing any performance. The branch would always be taken, and it branches just a few bytes ahead, so it doesn't need a large displacement. The PC-relative displacement for the load address doesn't need to be very large either. The address would be where the next instruction would usually be located, so a displacement of zero would be common. It should be relatively easy to encode such an instruction in the opcode space.
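
A small behavioural model of the idea, in Python. Everything specific here is an assumption for the sketch: the 5-byte instruction length (from the 40-bit opcodes mentioned earlier), little-endian byte order, and the names `load_and_branch`, `disp` and `skip`. It only shows the two effects of the instruction: a PC-relative quad load and a short forward branch over the data.

```python
INSN_BYTES = 5     # assumed: one 40-bit instruction
QUAD_BYTES = 16    # one 128-bit constant


def load_and_branch(mem: bytes, pc: int, disp: int = 0, skip: int = QUAD_BYTES):
    """Return (loaded_value, new_pc) for a hypothetical load-and-branch at `pc`.

    The load address is relative to the address following the instruction,
    so disp == 0 means the constant sits right after the instruction.
    """
    next_pc = pc + INSN_BYTES
    addr = next_pc + disp
    value = int.from_bytes(mem[addr:addr + QUAD_BYTES], "little")
    return value, next_pc + skip            # the branch is always taken


CONSTANT = 0x0123456789ABCDEF_FEDCBA9876543210
# Layout: [load-and-branch][16-byte constant][next instruction]
image = bytes(INSN_BYTES) + CONSTANT.to_bytes(QUAD_BYTES, "little") + bytes(INSN_BYTES)
value, new_pc = load_and_branch(image, pc=0)
```

With this layout `value` equals `CONSTANT` and `new_pc` is 21, the first byte after the embedded data.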

_________________
Robert Finch http://www.finitron.ca


Mon Jul 22, 2019 2:51 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Wrote a test vector generator for the 128-bit FMA. The generator uses a 128-bit float emulator class, so it’s not expected that the results will match exactly between the test vectors and the FMA output. Results should be close.
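
One way to express "close" when comparing 128-bit results is a distance in units in the last place (ULPs). The sketch below is Python and is only a suggestion; the post does not say how the comparison is actually done. It assumes a sign-magnitude 128-bit encoding with a 15-bit exponent, and the tolerance of 2 ULPs is arbitrary.

```python
SIGN = 1 << 127
EXP_MASK = 0x7FFF << 112
FRAC_MASK = (1 << 112) - 1


def is_nan(bits: int) -> bool:
    """True when the pattern has an all-ones exponent and a nonzero fraction."""
    return (bits & EXP_MASK) == EXP_MASK and (bits & FRAC_MASK) != 0


def ordered(bits: int) -> int:
    """Map a sign-magnitude pattern onto a signed integer number line."""
    magnitude = bits & (SIGN - 1)
    return -magnitude if bits & SIGN else magnitude


def ulp_distance(a: int, b: int) -> int:
    """Number of representable values between two non-NaN patterns."""
    return abs(ordered(a) - ordered(b))


def within_ulps(expected: int, actual: int, max_ulps: int = 2) -> bool:
    """Compare a test vector against FMA output with a small tolerance."""
    if is_nan(expected) or is_nan(actual):
        return is_nan(expected) and is_nan(actual)
    return ulp_distance(expected, actual) <= max_ulps
```

Treating +0 and -0 as zero distance apart falls out of the mapping, which may or may not be what a particular test wants.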

_________________
Robert Finch http://www.finitron.ca


Tue Jul 23, 2019 4:05 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Put some more work into the floating-point cores. Mainly running simulations to see if things are working.

_________________
Robert Finch http://www.finitron.ca


Wed Jul 24, 2019 4:29 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I think I've mentioned this approach before, but it's been a while. Working on instructions to load large immediates into a register. Note that there need be only one or two registers that large immediates can be loaded into, statically allocated to the assembler. This allows register spec bits to be reassigned as immediate value bits. For a superscalar core the registers used are renamed anyway, so multiple constant loads can be ongoing at the same time. Nvio2 allows a whopping big 32-bit immediate to be specified in a single instruction. There are several shifting instructions that allow a 128-bit constant to be built using only four instructions.
For nvio2 the LI instruction format looks like:
Attachment: LIFormat.png (File comment: LI format)


The R1 field specifies either register r53 or r54.
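
As an illustration of the four-instruction build, here is a Python sketch that splits a 128-bit constant into four 32-bit immediates and reassembles it. The actual nvio2 shifting instructions are not described in the post, so the "shift left 32 and OR in the next immediate" step, zero extension, and all names are assumptions.

```python
MASK32 = (1 << 32) - 1
MASK128 = (1 << 128) - 1


def split_imm32(value: int) -> list[int]:
    """Split a 128-bit constant into four 32-bit chunks, most significant first."""
    if value < 0 or value > MASK128:
        raise ValueError("constant must fit in 128 bits")
    return [(value >> shift) & MASK32 for shift in (96, 64, 32, 0)]


def build_constant(chunks: list[int]) -> int:
    """Model one LI followed by three hypothetical shift-and-or instructions."""
    reg = chunks[0] & MASK32                         # LI: load the top 32 bits
    for imm in chunks[1:]:
        reg = ((reg << 32) | (imm & MASK32)) & MASK128
    return reg
```

Round-tripping any 128-bit value through `split_imm32` and `build_constant` should return the original value, using exactly four immediates.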
Working on loads and stores, which are not so simple when dealing with floating-point. The core is going to perform all floating-point in quad precision. However, it will allow loading and storing single and double precision values. This means that there is an implicit conversion between single and quad or double and quad taking place on a load or store operation.
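
To show what the implicit conversion on a single-precision load involves, here is a bit-level sketch in Python. It assumes the standard IEEE layouts (1-8-23 for single, 1-15-112 for quad) rather than any modified format, and the function names are made up for the example. Widening is exact, so no rounding is needed in this direction.

```python
import struct


def float_to_f32_bits(x: float) -> int:
    """Bit pattern of x rounded to IEEE single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def single_to_quad_bits(f32_bits: int) -> int:
    """Widen a single-precision bit pattern to a quad-precision bit pattern."""
    sign = (f32_bits >> 31) & 1
    exp = (f32_bits >> 23) & 0xFF
    frac = f32_bits & 0x7FFFFF
    if exp == 0xFF:                          # infinity or NaN
        qexp, qfrac = 0x7FFF, frac << 89
    elif exp == 0 and frac == 0:             # signed zero
        qexp, qfrac = 0, 0
    elif exp == 0:                           # subnormal: becomes normal in quad
        shift = 24 - frac.bit_length()       # moves the leading 1 up to bit 23
        qexp = (1 - 127 - shift) + 16383
        qfrac = ((frac << shift) & 0x7FFFFF) << 89
    else:                                    # normal number: re-bias the exponent
        qexp = exp - 127 + 16383
        qfrac = frac << 89                   # 112 - 23 = 89 new low fraction bits
    return (sign << 127) | (qexp << 112) | qfrac
```

A store would go the other way and does need rounding and overflow handling, which is where most of the extra work lies.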

Instruction Formats nvio2:
Attachment: QuickFormatRef.png (File comment: Quick format reference)

Attachment: QuickRef2b.png (File comment: Quickref2b)

_________________
Robert Finch http://www.finitron.ca


Thu Jul 25, 2019 4:32 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Just posted the instruction formats and they're already outdated. Added another bit to the CSR op field and shifted the OL field to the right. This was done to allow immediates to be used as a source in place of register Rs1 for CSR updates. So, three more ops involving immediate constants were added to the existing four.
The non-overlapped pipeline version of nvio2 is coming along nicely. A trial synthesis came back at 2,800 LUTs, which is far too small considering the 128-bit nature of the core and indicates that bugs in the code are causing logic to be trimmed. A second synthesis came back at 17,300 LUTs, which is likely a lot closer to the final size of the core.
The core is implemented with five basic pipeline stages: ifetch, decode, regfetch, execute and writeback, with writeback being overlapped with ifetch for the next cycle. There are lots of additional stages for memory operations, complex integer operations and floating-point.
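
A toy timeline of the five basic stages, as a Python sketch. It models only simple instructions (none of the additional memory, complex integer or floating-point stages) and assumes one stage per clock; the function name and the output shape are invented for illustration.

```python
STAGES = ("ifetch", "decode", "regfetch", "execute")


def schedule(n_instructions: int) -> list[list[tuple[int, str]]]:
    """Return, per clock cycle, the (instruction index, stage) pairs that are active.

    Instructions do not overlap, except that the writeback of one instruction
    shares a cycle with the ifetch of the next.
    """
    timeline = []
    for i in range(n_instructions):
        for stage in STAGES:
            active = [(i, stage)]
            if stage == "ifetch" and i > 0:
                active.append((i - 1, "writeback"))   # overlapped writeback
            timeline.append(active)
    if n_instructions:
        timeline.append([(n_instructions - 1, "writeback")])  # last one drains alone
    return timeline
```

Under these assumptions each simple instruction costs four cycles, plus one extra cycle at the end for the final writeback.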

_________________
Robert Finch http://www.finitron.ca


Sat Jul 27, 2019 3:08 am
Profile WWW