


 [ 133 posts ]  Go to page Previous  1 ... 5, 6, 7, 8, 9  Next
 nvio 
Author Message

Joined: Fri Nov 22, 2019 5:31 pm
Posts: 4
Regarding hardware loops, see Section 9 of this document:

http://ww1.microchip.com/downloads/en/devicedoc/70005158b.pdf

The REPEAT instruction repeats the following instruction n+1 times.
The DO instruction repeats a block of instructions n+1 times.
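A minimal sketch of the "n+1 times" counting convention described above. The function name `repeat` is only illustrative and is not taken from the Microchip document; the point is that a count field of 0 still executes the target once.

```python
def repeat(n, op):
    """Run op n+1 times, modelling a repeat-count field that holds n."""
    for _ in range(n + 1):
        op()


trace = []
repeat(3, lambda: trace.append("x"))   # count field of 3 -> four executions
once = []
repeat(0, lambda: once.append("x"))    # count field of 0 -> one execution
```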


Sun Nov 24, 2019 7:51 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1442
Location: Canada
Shut synthesis down after 26 hours, then switched the toolset strategy to run-time optimized. With that strategy, synthesis finished in about four hours at a size of 883,421 LC’s. I’m still not finished adding code for all the features, but there is now a better idea of how large the core will be.

Added CR logic functions to the branch unit alu.

Got rid of some redundant set instructions. For example, set less than and set greater than can actually use the same instruction if the registers are swapped. This reduces the number of set instructions from 10 to 6 without losing any code density.
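A small sketch of the operand-swap idea. The post does not list which set instructions were kept, so the function names here (`slt`, `sle`, `sgt`, `sge`) are assumptions used only to show that the "greater than" forms need no operation of their own.

```python
def slt(a, b):
    """Set-less-than: 1 if a < b, else 0."""
    return 1 if a < b else 0


def sle(a, b):
    """Set-less-or-equal: 1 if a <= b, else 0."""
    return 1 if a <= b else 0


def sgt(a, b):
    """Set-greater-than, expressed as slt with the operands swapped."""
    return slt(b, a)


def sge(a, b):
    """Set-greater-or-equal, expressed as sle with the operands swapped."""
    return sle(b, a)
```

An assembler could accept the "greater than" mnemonics and emit the "less than" encoding with the source registers exchanged, so the programmer sees no difference.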

_________________
Robert Finch http://www.finitron.ca


Mon Nov 25, 2019 4:08 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1442
Location: Canada
Quote:
The DO instruction repeats a block of instructions n+1 times.

There are some interesting ideas there. Rather than storing the start and end address in memory, they could be stored in CSR’s. That way the processor doesn’t have to fetch from memory. For a given end address, the processor could branch back to the corresponding start address. It’s extra hardware to implement the DO operation, when there are already branch instructions.
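A rough model of the CSR idea, assuming three hypothetical CSRs named `start`, `end`, and `count`. It is a sketch of the fetch behaviour only, not how any real core implements DO: when the instruction at the end address has executed and the count is non-zero, the program counter goes back to the start address.

```python
def run(program, loop_start, loop_end, loop_count):
    """Execute a list of instruction labels with one hardware loop.

    loop_start/loop_end are indices into program; loop_count holds n,
    so the loop body executes n+1 times in total.
    """
    csr = {"start": loop_start, "end": loop_end, "count": loop_count}
    pc = 0
    trace = []
    while pc < len(program):
        trace.append(program[pc])
        if pc == csr["end"] and csr["count"] > 0:
            csr["count"] -= 1
            pc = csr["start"]      # branch back without a branch instruction
        else:
            pc += 1
    return trace


trace = run(["a", "b", "c", "d"], loop_start=1, loop_end=2, loop_count=2)
```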

The REPEAT instruction reminds me of the x86 REP prefix. In the case of nvio3 the instruction being repeated needs to be modified though. So maybe adding an amount to the instruction every repeat would work.

_________________
Robert Finch http://www.finitron.ca


Mon Nov 25, 2019 4:30 am WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1627
Do you track some idea of clock speed as you explore the CPU design space Rob? My reason for asking is that it's looking like a very large design, and with large variation in size. So, if you added some feature which doubled the size, and as a consequence knocked the clock rate down a fair bit, it might not be a net win compared to the cycle count gain that you get from the feature.
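The trade-off can be put into numbers with a short sketch. All figures below are made up for illustration; nothing here is measured from the nvio core.

```python
def net_speedup(cycles_before, cycles_after, mhz_before, mhz_after):
    """Ratio of run times (before / after); a value below 1.0 is a net loss."""
    time_before = cycles_before / mhz_before
    time_after = cycles_after / mhz_after
    return time_before / time_after


# Hypothetical feature: saves 10% of the cycles but drops the clock
# from 50 MHz to 40 MHz.
result = net_speedup(1000, 900, 50, 40)
print(f"net speedup: {result:.3f}")
```

With these example numbers the cycle-count gain does not make up for the slower clock.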


Mon Nov 25, 2019 10:30 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 255
My guess is that partitioning and routing between modules is the problem.
I use the OTHER brand of FPGA and am never sure if I even meet setup and hold times.
Sadly better tools for floor planning and routing don't seem to be out there, even for big $$$.


Mon Nov 25, 2019 6:57 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1442
Location: Canada
There aren’t enough bits in the bitfield instructions to specify a full width of 128 bits for the field. Instead the field width is limited to 32 bits when specified as a constant (when specified in a register fields up to 128 bits are allowed). Not sure how big a limitation this is as most bitfields are small.
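A sketch of the width limits as described, using a hypothetical `ext_field` helper rather than the actual instruction encoding. Only the two limits (32 bits for a constant width, 128 bits for a width held in a register) come from the post.

```python
MAX_CONST_WIDTH = 32    # field width given as a constant in the instruction
MAX_REG_WIDTH = 128     # field width given in a register


def ext_field(value, offset, width, width_from_register=False):
    """Extract an unsigned bitfield of the given width starting at offset."""
    limit = MAX_REG_WIDTH if width_from_register else MAX_CONST_WIDTH
    if not 1 <= width <= limit:
        raise ValueError(f"width {width} exceeds limit of {limit} bits")
    return (value >> offset) & ((1 << width) - 1)


small = ext_field(0xABCD, offset=4, width=8)                       # 0xBC
wide = ext_field((1 << 100) - 1, 0, 64, width_from_register=True)  # 64 one bits
```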
Quote:
Do you track some idea of clock speed as you explore the CPU design space Rob? My reason for asking is that it's looking like a very large design, and with large variation in size. So, if you added some feature which doubled the size, and as a consequence knocked the clock rate down a fair bit, it might not be a net win compared to the cycle count gain that you get from the feature.

I haven’t been tracking clock speed. The toolset doesn’t give time estimates. I don’t expect the variation in size to persist. The large variation has come about not really from features being added, but from the coding becoming more complete. For instance, I missed converting a number of 80-bit busses to 256-bit in the first pass. That’s bound to have added a lot to the size, but it’s not really a feature that can be omitted. Part of the goal of this project was to be large and interesting.

There are several easily identifiable spots where the code could be improved. For instance, there’s a results combiner / multiplexer on the fpu outputs that could be pipelined. The fpu could use better internal pipelining. There is also a lot of multiplexing and logic in the commit path, but that’s what provides 4-instruction commit per clock instead of two.

Much of the size of the design comes from the vector part, where the functional units are 256 bits wide and support different sizes of vector elements. Making things wide does not necessarily slow things down, but it uses a lot of resources. Multiplexing 256-bit busses onto a single bus is bound to be slow in the FPGA. There is a maximum amount of signal width in the FPGA before it crosses into different areas. The FPGA is organized into rows and columns, and if too many resources are used things need to cross over to the next module, slowing things down. For instance, I think there’s a maximum width of 4096 read bits from the rams.

I thought having separate register files for everything would work out to be a great idea, but the multiplexing in the commit path and figuring out how many results can actually commit turned out to be somewhat involved. There’s a bunch of different combinations. It might be better just to go with a single register file to simplify things. Such an approach changes the ISA. With a single register file compare-and-branch might be better than condition codes.

_________________
Robert Finch http://www.finitron.ca


Tue Nov 26, 2019 3:12 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1442
Location: Canada
Got rid of the second results bus and simplified the commit logic a little bit. Removed push/pop/link/unlink and auto-increment, auto-decrement address modes, which were the items using a second results bus. Removing these features didn’t make the difference to the size of the core that I was expecting. Only about 3,000 LC’s were shaved off the size.

Added four base registers (program, data, io, and extra) and used two bits in memory instructions to select the base register. Fixed up the address generators with the changes. Selecting the program base register for data access is somewhat tricky as the program base is in terms of instructions, so the value must be multiplied by five. There is no dependency checking on the base registers as they are assumed to be primarily static values. Base registers also don’t apply in machine mode. Looking at the address generators, they could be pipelined a bit better.
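A sketch of the address generation described above. The mapping of the two selector bits to the four base registers is an assumption, as are the names `BASES` and `effective_address`; the multiply-by-five reflects the program base being counted in instructions.

```python
BASES = ("program", "data", "io", "extra")   # assumed order of the 2-bit selector
INSTRUCTION_BYTES = 5                        # program base is scaled by five


def effective_address(base_regs, sel, offset, machine_mode=False):
    """Add the selected base register to offset; bases are ignored in machine mode."""
    if machine_mode:
        return offset
    name = BASES[sel]
    base = base_regs[name]
    if name == "program":
        base *= INSTRUCTION_BYTES            # convert instruction units to bytes
    return base + offset


regs = {"program": 0x100, "data": 0x2000, "io": 0x8000, "extra": 0}
```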

Well, decided to get rid of the difference between instruction and data addresses. I should’ve known better than to allow a difference in the first place. So, the instruction cache was adapted to handle byte addresses. It now makes use of the dual-port output capability of distributed rams in the L1 cache to output instruction data that spans cache lines. The current line and the next line are read at the same time. The cache miss address had to be output from the cache because the miss could be on the next line, not necessarily the current line. Previously just the current address had been used. These changes also required dual porting the tag ram. Reading two lines at once was done to avoid an extra clock cycle that would otherwise be required when an instruction bundle spanned cache lines. In the FPGA the dual-port output comes almost “for free”. The cache line width was reduced from 640 bits to 512, as it’s no longer necessary to worry about fitting a whole bundle on a cache line.
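A behavioural sketch of the two-line read, assuming 512-bit (64-byte) lines and a cache modelled as a dictionary keyed by line index. The function name `fetch` and the fetch size are illustrative only. It also shows why the miss address has to come out of the cache: the missing line may be the next one rather than the one the address points into.

```python
LINE_BYTES = 64   # 512-bit cache line


def fetch(cache, addr, nbytes):
    """Return (data, miss_address); exactly one of the two is None.

    cache maps a line index to LINE_BYTES bytes. nbytes must not exceed
    LINE_BYTES, so a fetch touches at most two adjacent lines.
    """
    idx, off = divmod(addr, LINE_BYTES)
    spans = off + nbytes > LINE_BYTES
    if idx not in cache:
        return None, idx * LINE_BYTES            # miss on the current line
    if spans and idx + 1 not in cache:
        return None, (idx + 1) * LINE_BYTES      # miss on the next line
    # Both lines are read together, as with a dual-ported ram.
    window = cache[idx] + cache.get(idx + 1, bytes(LINE_BYTES))
    return window[off:off + nbytes], None


both = {0: bytes(range(64)), 1: bytes(range(64, 128))}
only_first = {0: bytes(range(64))}
```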

I forgot to supply the target argument for the floating-point units. The target argument is needed for vector operations where the operation is masked off.

Current size is: 880,925 LC's

_________________
Robert Finch http://www.finitron.ca


Wed Nov 27, 2019 3:21 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1442
Location: Canada
Memory Instruction Formats:
Attachment:
IFormats4a.png



_________________
Robert Finch http://www.finitron.ca


Wed Nov 27, 2019 3:45 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1442
Location: Canada
After the latest round of fixes: 902,934 LC’s.

I got side-tracked today, working mainly on the micro-op 6502 instead.

_________________
Robert Finch http://www.finitron.ca


Thu Nov 28, 2019 3:08 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1442
Location: Canada
Switched nvio3’s branches from absolute addressing to relative addressing. Rearranged the branch opcodes so they require fewer (1) opcodes at the root level. The switch was made after realizing that it would be easier to deal with relative branches in micro-op code. At the micro-op level all that’s readily available for reference is the program counter. In order for the micro-op to branch it would have to use relative branches.

Added a task for argument bypassing similar to what’s done for the micro-op 6502. Also added an exception trigger for floating-point operations if there is no floating-point unit. One gotcha with the bypassing task is that tasks in Verilog are not parameterizable, and I needed the width to vary. So I created several tasks called bypass, bypass128, and bypass32 with different widths hard-coded.
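A behavioural sketch in Python (not Verilog) of what a width-specific bypass might do. The post does not show the task bodies, so the argument list and the matching rule are assumptions: an argument that is not yet valid picks up its value from any valid result bus whose source id matches. Only the names `bypass32` and `bypass128` come from the post.

```python
from functools import partial


def bypass(arg_valid, arg_source, arg_value, buses, width):
    """Return (valid, value) for one argument after checking the result buses.

    buses is a list of (bus_valid, bus_source, bus_value) tuples.
    """
    mask = (1 << width) - 1
    if arg_valid:
        return True, arg_value & mask
    for bus_valid, bus_source, bus_value in buses:
        if bus_valid and bus_source == arg_source:
            return True, bus_value & mask        # forward the result
    return False, arg_value & mask


# One generic function with the width fixed per variant, standing in for
# the separately hard-coded tasks.
bypass32 = partial(bypass, width=32)
bypass128 = partial(bypass, width=128)

buses = [(True, 7, 0x1_FFFF_FFFF)]
```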

_________________
Robert Finch http://www.finitron.ca


Thu Dec 12, 2019 3:59 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1442
Location: Canada
Thought about ways to compress nvio’s instruction set down to 32 bits and realized it was possible by adding a prefix instruction for the cases where bits from the 40-bit instruction were removed. It then dawned on me that I could just as well use the FT64 instruction set and add a prefix for it. FT64 already has 32-bit instructions and even 16-bit compressed ones. The addition of a prefix would allow specification of the data format (e.g. 8, 16, 32 bit ops) and which vector mask register to use. The prefix can be made to fit into a 16-bit compressed instruction, so it doesn’t really take up much more room than would be required to widen the instruction set. There would be two forms of the prefix: one that applies only to the next instruction, and a second, referred to as a sticky prefix, which applies to all following instructions. So, I’m back to working on FT64.
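A sketch of how the two prefix forms could interact in a decoder. The tuple layout, the function name `apply_prefixes`, and the default format of 64 bits are assumptions, and the vector mask register selection is left out to keep it short.

```python
DEFAULT_FORMAT = 64   # assumed format when no prefix is in effect


def apply_prefixes(stream):
    """Resolve the data format for each operation in an instruction stream.

    stream items are ("prefix", fmt, sticky) or ("op", name).
    Returns a list of (name, fmt) pairs.
    """
    sticky_fmt = DEFAULT_FORMAT
    one_shot = None
    resolved = []
    for item in stream:
        if item[0] == "prefix":
            _, fmt, sticky = item
            if sticky:
                sticky_fmt = fmt      # applies to all following instructions
            else:
                one_shot = fmt        # applies to the next instruction only
        else:
            fmt = one_shot if one_shot is not None else sticky_fmt
            resolved.append((item[1], fmt))
            one_shot = None           # a one-shot prefix is consumed
    return resolved


stream = [
    ("op", "add"),
    ("prefix", 8, False), ("op", "add"), ("op", "sub"),
    ("prefix", 16, True), ("op", "mul"), ("op", "div"),
]
resolved = apply_prefixes(stream)
```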

_________________
Robert Finch http://www.finitron.ca


Fri Dec 13, 2019 4:37 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1442
Location: Canada
Worked on the nvio3 assembler/compiler today, starting with work done already for nvio and FT64. Modified the nvio rtl testbench for nvio3. Tried to get sim running for nvio3 but there’s a file permissions error. It could be because synthesis is running at the same time. Synthesis was started a while ago and allowed to run. It looks like the design is too complex now.
The register files were changed to allow quadruple update paths, rather than muxing the updates with clock signals. So the connection to the commit bus is much simpler now, there’s no need to re-route portions of the bus, but I’m afraid it takes more hardware.

_________________
Robert Finch http://www.finitron.ca


Sat Dec 14, 2019 4:03 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1442
Location: Canada
Added the ‘BIT’ instruction to nvio3. It works the same way as an ‘and’ instruction except that it stores the status results to a condition register rather than storing the result to a general-purpose register. This is similar to the BIT instruction on the 6502. BIT is to AND as CMP is to SUB.
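A sketch of the difference between the two instructions. The 64-bit width and the flag names (`zero`, `negative`) are assumptions made for the example; the post does not describe the condition register layout.

```python
MASK = (1 << 64) - 1   # assumed operand width for this sketch


def status(result):
    """Derive status flags from a result (flag names are illustrative)."""
    return {"zero": result == 0, "negative": bool(result >> 63)}


def and_op(regs, rd, ra, rb):
    """AND: the result goes to a general-purpose register."""
    regs[rd] = regs[ra] & regs[rb] & MASK


def bit_op(regs, cregs, cr, ra, rb):
    """BIT: same AND, but only the status lands in a condition register."""
    cregs[cr] = status(regs[ra] & regs[rb] & MASK)


regs = {1: 0b1100, 2: 0b0011, 3: 0}
cregs = {}
bit_op(regs, cregs, 0, 1, 2)   # no register in regs changes
```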
The PUSH and PUSHC instructions made it back into the instruction set. I had used PUSH and PUSHC all over the place in the boot code and other programs for nvio, and when porting them over to nvio3 I wasn’t looking forward to changing it all. The PUSH and PUSHC instructions take only a small amount of additional logic to support.
Not sure if the extra effects possible with a PUSH operation should be documented. PUSH can push either one or two registers at the same time. PUSH actually has a count field which may take on values from 0 to 3, even though only 1 or 2 should be used. The stack pointer decrement is multiplied by the count value, which results in decrementing by 0, 16, 32, or 48.
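The decrement arithmetic as a short sketch. The 16-byte slot size is inferred from the decrements listed above, and the name `push_decrement` is illustrative.

```python
SLOT_BYTES = 16   # one pushed register occupies 16 bytes


def push_decrement(count):
    """Stack pointer decrement for a PUSH with the given 2-bit count field."""
    if not 0 <= count <= 3:
        raise ValueError("count is a 2-bit field")
    return SLOT_BYTES * count


decrements = [push_decrement(c) for c in range(4)]   # [0, 16, 32, 48]
```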
Worked mainly on the assembler and compiler, which is just a port of the same for nvio.
Also downloaded the latest version of the FPGA toolset.
Sim keeps crashing. I may have to shelve this project.

_________________
Robert Finch http://www.finitron.ca


Sun Dec 15, 2019 1:42 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1442
Location: Canada
Altered the nvio3 ISA (calling it nvio5 now) to eliminate load and store operations using register indirect with displacement addressing. There wasn’t enough difference between that mode and the scaled indexed addressing mode. Saving a mode bit in the instruction and compressing the scaling indicator brought the displacement associated with scaled indexed mode to 16 bits, only four bits shy of the 20-bit displacement with register indirect addressing. 16 bits was deemed useful enough to make the register indirect mode redundant, so it got eliminated. For comparison, RISC-V has only a 12-bit displacement for register indirect mode.
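For a sense of scale, the reach of each displacement width can be computed directly. This assumes the displacements are signed two's complement values, which holds for RISC-V but is not stated in the post for nvio5.

```python
def signed_range(bits):
    """Smallest and largest values of a signed two's complement field."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


for bits in (12, 16, 20):
    low, high = signed_range(bits)
    print(f"{bits}-bit displacement: {low} .. {high}")
```

Under that assumption the 16-bit form reaches 32 KiB either side of the base, against 512 KiB for the 20-bit form and 2 KiB for a 12-bit form.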

_________________
Robert Finch http://www.finitron.ca


Sun Apr 05, 2020 3:11 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1442
Location: Canada
Worked on the nvio5 ISA. Added basic bitwise operations for floating-point. These will allow loading a floating-point register with an immediate constant.
Also set up an immediate instruction format that allows an 11-bit float immediate to be used in FMUL, FDIV, FADD, and FSUB.

_________________
Robert Finch http://www.finitron.ca


Mon Apr 06, 2020 3:18 am WWW