 nvio 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Added byte-nops ($E7) to the instruction set. A byte-nop is a single-byte nop instruction that may be used to align the instruction stream to a specific byte address. The processor will recognize it and increment the instruction pointer by one instead of by five.
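A minimal sketch of the instruction-pointer update described above. It assumes the opcode is the first byte of an instruction and that no ordinary instruction starts with $E7; the helper names are hypothetical and not taken from the nvio sources.

```python
BYTE_NOP = 0xE7          # byte-nop opcode, as given in the post
INSTRUCTION_BYTES = 5    # ordinary instruction length implied by "instead of by five"


def next_ip(memory: bytes, ip: int) -> int:
    """Return the address of the instruction that follows the one at ip."""
    if memory[ip] == BYTE_NOP:
        return ip + 1                # a byte-nop advances by a single byte
    return ip + INSTRUCTION_BYTES    # everything else advances by five bytes


def align_with_nops(code: bytearray, alignment: int) -> bytearray:
    """Pad code with byte-nops until its length is a multiple of alignment."""
    padded = bytearray(code)
    while len(padded) % alignment:
        padded.append(BYTE_NOP)
    return padded
```

For example, a five-byte instruction followed by three byte-nops brings the next instruction to an eight-byte boundary.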

The author has been researching vector instruction formats, most recently the RISC-V vector extension. It's interesting because CSRs are used to hold what amounts to instruction bits that can't fit into the 32-bit format.

Vector instructions are being added to the ISA. There’s just enough room in the instruction to encode everything except for instructions requiring three source operands. In that case the vector mask register is assumed to be vm0. The vector instruction formats look like the floating-point formats with the addition of a vector mask register field. To support integer vector operations some of the bits of the precision and rounding mode fields are being repurposed to indicate integer operations. For instance, the round mode field is three bits but has only six useful values out of eight possible ones. So, the other two values are going to be used to indicate integer operations.
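A small sketch of how a three-bit round mode field could double as an integer-operation flag. The post does not say which two codes are spare, so the assignment below (codes 0 to 5 for rounding, 6 and 7 for integer operations) and the placeholder mode names are assumptions made purely for illustration.

```python
from typing import Tuple

# Assumption: codes 0-5 are the six useful rounding modes (names are placeholders).
ROUND_MODE_NAMES = ("rm0", "rm1", "rm2", "rm3", "rm4", "rm5")
# Assumption: the two otherwise unused codes flag integer vector operations.
INTEGER_CODES = (6, 7)


def decode_round_field(rm: int) -> Tuple[str, int]:
    """Classify a 3-bit round mode field as a float rounding mode or an integer op."""
    if not 0 <= rm <= 7:
        raise ValueError("round mode field is three bits wide")
    if rm in INTEGER_CODES:
        # Return which of the two integer variants was selected (0 or 1).
        return ("integer", rm - INTEGER_CODES[0])
    return ("float", rm)
```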

_________________
Robert Finch http://www.finitron.ca


Mon Jul 29, 2019 4:19 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
nvio2 is not going to be the fastest processor around. It takes five clock cycles to read the register file (the register file has only a single port due to the number of registers), probably at least five to fetch the instruction. That’s 10 cycles right there. Another cycle minimum to execute the instruction. I’m guessing it’d average about 15 clocks per instruction. So, lots of room for eventual improvements.
The number of load and store operations in the processor has doubled with the inclusion of vector load / store operations. Mnemonically a vector load / store is prefixed with a ‘V’ to indicate a vector operation. LDB is load byte, VLDB is vector load byte.

_________________
Robert Finch http://www.finitron.ca


Wed Jul 31, 2019 2:49 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
There seems to be a hint there at an architectural tradeoff, if you know that fewer registers would allow for more ports and therefore faster access.

From a teaching perspective, even if you only ever implement a single branch of the tree, recording the various decision points, their estimated tradeoffs and the reason for picking the one you did pick could be rather valuable.


Wed Jul 31, 2019 7:51 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
There seems to be a hint there at an architectural tradeoff, if you know that fewer registers would allow for more ports and therefore faster access.
There are over 8,000 registers, which require 32 block RAMs just for a single port. Using 96 block RAMs (about 1/3 of the resources available) just to triple-port the register file seemed a bit excessive. It is a tradeoff. I’m more interested in getting a decent instruction set (ISA) than a speedy implementation in this case. There are implementation choices which could improve performance considerably, for instance adding an instruction cache or overlapping the pipeline stages. The vector instructions could be implemented as instruction traps in a much smaller implementation.

Got many of the vector instructions coded, excepting loads and stores.

Just looking at one proposal for vector registers: it allowed the number of vector registers and the number of elements in each to be varied dynamically from a fixed pool of registers. So, with 1024 available registers the design could have 32 vectors of 32 elements, or 16 vectors of 64 elements, depending on settings in a config register. My gut tells me that maybe this isn’t the best idea, so I’ve gone with a fixed number of vector registers with a fixed number of elements (64 registers of 64 elements).
I’ve also seen config registers being used to set the size of vector elements. This avoids having to specify the operation size in the instruction; I think the idea is to conserve opcode bits. But it results in config overhead which has to be managed at run-time.
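The pool-splitting idea in that proposal can be sketched as simple arithmetic. This is only an illustration of the tradeoff, assuming a power-of-two pool and power-of-two splits; the function names are hypothetical.

```python
from typing import List, Tuple


def elements_per_vector(pool_size: int, num_vectors: int) -> int:
    """Elements available to each vector when the pool is split evenly."""
    if num_vectors < 1 or pool_size % num_vectors:
        raise ValueError("pool must divide evenly among the vectors")
    return pool_size // num_vectors


def power_of_two_splits(pool_size: int) -> List[Tuple[int, int]]:
    """List (vector count, elements per vector) pairs for a power-of-two pool."""
    splits = []
    count = 1
    while count <= pool_size:
        splits.append((count, pool_size // count))
        count *= 2
    return splits


# The fixed layout chosen in the post: 64 registers of 64 elements each.
FIXED_VECTOR_REGISTERS = 64
FIXED_ELEMENTS = 64
FIXED_POOL = FIXED_VECTOR_REGISTERS * FIXED_ELEMENTS
```

With a 1024-entry pool this reproduces the two configurations mentioned above, and the fixed layout works out to a 4,096-entry pool.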

_________________
Robert Finch http://www.finitron.ca


Thu Aug 01, 2019 6:44 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Goodness, that's a lot more registers than I imagined!


Thu Aug 01, 2019 7:13 am

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
Rob:

I realize that a large pool of vector registers would speed certain classes of problems. However, isn't there a large class of problems where two, three, or four sets of registers would provide significant throughput improvements? The resource savings would allow you to explore other areas while clearly demonstrating the utility of the vector instructions.

For example, the application of an FFT kernel to a block of data 64 words in length will significantly improve that algorithm's memory utilization ratios. But the manipulation of the data to support the decimation-in-time and decimation-in-frequency algorithms may invalidate the use of such large vector register sizes. Such large vector register sizes would be good for any in-order algorithms such as FIR filters and linear convolution algorithms.

Fast algorithms like the FFT generally operate on discontinuous blocks of memory. I can still see ways to utilize your large vector registers efficiently, but significant effort would be needed to set up the data in memory so that passing over it with a vector instruction would be efficient.

Perhaps it's my focus on the FFT that drives me to recommend smaller numbers and sizes of the vector registers, but that is as far as I've taken it.

Does your vector instruction set support the FFT butterfly, which has three/four data sources, or does it support a simpler two/three data source multiply-accumulate structure?

_________________
Michael A.


Fri Aug 02, 2019 10:34 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
I realize that a large pool of vector registers would speed certain classes of problems. However, isn't there a large class of problems where two, three, or four sets of registers would provide significant throughput improvements? The resource savings would allow you to explore other areas while clearly demonstrating the utility of the vector instructions.
The 8,000+ registers are split between multiple register sets (64) for the general-purpose register file and the vector registers. The vector registers use 4,096 registers. When deciding to support vector registers, it was decided to just double the number of registers and support multiple register sets for the general register file too. Not sure if you’re referring to supporting multiple vector register sets or not. That could be done too, but it’s a lot of registers.
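The register totals and block RAM counts quoted in this thread can be checked with a little arithmetic. The sketch below assumes 64 general-purpose registers per set, 128-bit registers, and 32 Kbit of usable data per block RAM; none of those three figures is stated outright for this design, but together they reproduce the 32 and 96 block RAM numbers given earlier.

```python
REGISTER_SETS = 64           # from the post
GPRS_PER_SET = 64            # assumption: 64 general-purpose registers per set
VECTOR_REGISTERS = 64        # from the post
ELEMENTS_PER_VECTOR = 64     # from the post
REGISTER_BITS = 128          # assumption: 128-bit registers
BRAM_DATA_BITS = 32 * 1024   # assumption: 32 Kbit of data per block RAM

GPR_TOTAL = REGISTER_SETS * GPRS_PER_SET
VECTOR_TOTAL = VECTOR_REGISTERS * ELEMENTS_PER_VECTOR
TOTAL_REGISTERS = GPR_TOTAL + VECTOR_TOTAL


def block_rams_per_port(registers: int, register_bits: int, bram_bits: int) -> int:
    """Block RAMs needed to hold one copy of the register file (rounded up)."""
    total_bits = registers * register_bits
    return -(-total_bits // bram_bits)   # ceiling division


ONE_PORT = block_rams_per_port(TOTAL_REGISTERS, REGISTER_BITS, BRAM_DATA_BITS)
THREE_PORTS = 3 * ONE_PORT   # one full copy of the file per port
```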
Quote:
For example, the application of an FFT kernel to a block of data 64 words in length will significantly improve that algorithm's memory utilization ratios. But the manipulation of the data to support the decimation-in-time and decimation-in-frequency algorithms may invalidate the use of such large vector register sizes. Such large vector register sizes would be good for any in-order algorithms such as FIR filters and linear convolution algorithms.

Fast algorithms like the FFT generally operate on discontinuous blocks of memory. I can still see ways to utilize your large vector registers efficiently, but significant effort would be needed to set up the data in memory so that passing over it with a vector instruction would be efficient.
It sounds like the scatter-gather capability of vector indexed addressing might be able to help here, but I don’t know. Perhaps I could study the FFT data access patterns to see if a custom instruction calculating such would help. Note that the vector length register may be used to specify shorter vectors. Perhaps if the length were shortened to <=32 elements then somehow multiple vector register sets could be implemented.
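As a rough illustration of how indexed (gather) addressing might fit an FFT, the sketch below builds the bit-reversed index vector used by radix-2 FFTs and uses it to gather elements. It is a software model only; `vector_gather` and its parameters are hypothetical stand-ins for a vector indexed load limited by a vector length register.

```python
from typing import List, Sequence


def bit_reverse(value: int, bits: int) -> int:
    """Reverse the low `bits` bits of value."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def bit_reversed_indices(n: int) -> List[int]:
    """Index vector for the bit-reversal permutation of n points (n a power of two)."""
    bits = n.bit_length() - 1
    if n < 1 or (1 << bits) != n:
        raise ValueError("n must be a power of two")
    return [bit_reverse(i, bits) for i in range(n)]


def vector_gather(memory: Sequence[float], base: int,
                  indices: Sequence[int], vector_length: int) -> List[float]:
    """Model of a vector indexed load: element i comes from memory[base + indices[i]]."""
    return [memory[base + indices[i]] for i in range(vector_length)]
```

For eight points the index vector is 0, 4, 2, 6, 1, 5, 3, 7, so a single gather would reorder the input for the first butterfly stage.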
Quote:
Does your vector instruction set support the FFT butterfly, which has three/four data sources, or does it support a simpler two/three data source multiply-accumulate structure?
The instruction set supports only two / three data sources. Three sources are supported only for multiply-add instructions.
I have not thought much about the specific application of vector instructions and basically copied from existing vector instruction sets (DLXV, Cray, RISC-V). I was hoping to be able to use the vector instructions with neural networks. It’s been a while since I looked seriously at FFTs (back in college in the ‘80s).

_________________
Robert Finch http://www.finitron.ca


Sun Aug 04, 2019 4:57 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Pipelined floating-point operations are being reviewed as the following diagram shows. This would allow a new floating-point operation to be issued every clock cycle. A tag identifying the result would need to be passed along the pipeline with the result, together with an operation-valid flag. Pipeline advance signals would have to be generated to stall the input pipelines when more than one intermediate result is available at the same time.

Ways of decoupling the decode of fp instructions from the fp unit are being considered. The fp functions have been in use for a couple of processors and it’s a pita to redo the instruction decode for each processor. It would be nice to be able to come up with an fp unit that could be plugged into any processor with a minimum of fuss.

Attachment: fpPipeline.png (Floating-point pipeline)
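The tag and valid-flag idea from the pipeline description can be modeled in software. This is a behavioural sketch only, assuming a fixed-depth adder pipeline; it computes the sum on issue and merely carries it through the stages, and it does not model the stall logic. The class and method names are hypothetical.

```python
from collections import deque
from typing import Deque, Optional, Tuple

Operation = Tuple[int, float, float]   # (tag, operand a, operand b)


class FpAddPipeline:
    """Fixed-latency pipeline that carries a tag alongside each result."""

    def __init__(self, depth: int) -> None:
        if depth < 1:
            raise ValueError("pipeline needs at least one stage")
        # Each stage holds (tag, result), or None for an empty (invalid) slot.
        self._stages: Deque[Optional[Tuple[int, float]]] = deque(
            [None] * depth, maxlen=depth)

    def clock(self, op: Optional[Operation] = None):
        """Advance one cycle; returns (valid, tag, result) for the last stage."""
        leaving = self._stages[-1]
        entering = None if op is None else (op[0], op[1] + op[2])
        self._stages.appendleft(entering)   # maxlen drops the old last stage
        if leaving is None:
            return (False, None, None)
        return (True, leaving[0], leaving[1])
```

A new operation can be issued on every call to `clock`, and the result of an operation issued on one call emerges `depth` calls later with its tag attached.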

_________________
Robert Finch http://www.finitron.ca


Fri Sep 06, 2019 3:02 am

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
Rob:

Glad to see you back and posting about your projects. Looking forward to seeing how this component of your project materializes.

Have you considered using the Goldschmidt Square Root Algorithm? In a recent project I chose to implement that algorithm instead of implementing a fractional divider and a square root function. The Goldschmidt Square Root algorithm provides as outputs both the square root and the reciprocal of the square root. Thus, by squaring the reciprocal of the square root, I was able to determine the reciprocal. I used the reciprocal to perform division by multiplication. Since the Goldschmidt Square Root algorithm, like the Goldschmidt Reciprocal algorithm, uses repeated multiplication, my project could focus on building a fast multiplier.
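For readers unfamiliar with it, here is a floating-point sketch of one common formulation of the Goldschmidt square root iteration. It is not taken from the project described above; the initial estimate `y0` would normally come from a small lookup table and is simply passed in here.

```python
import math
from typing import Tuple


def goldschmidt_sqrt(s: float, y0: float, iterations: int = 5) -> Tuple[float, float]:
    """Return (sqrt(s), 1/sqrt(s)) given a rough estimate y0 of 1/sqrt(s)."""
    x = s * y0       # converges to sqrt(s)
    h = y0 / 2.0     # converges to 1 / (2 * sqrt(s))
    for _ in range(iterations):
        r = 0.5 - x * h      # residual; shrinks quadratically
        x = x + x * r
        h = h + h * r
    return x, 2.0 * h


root, rsqrt = goldschmidt_sqrt(2.0, 0.7)
reciprocal = rsqrt * rsqrt        # squaring 1/sqrt(s) gives 1/s
quotient = 3.0 * reciprocal       # division by 2.0 done as a multiplication
```

Each step uses only multiplies plus an add or subtract, which is why the approach pairs naturally with a fast multiplier.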

Since both the division and the square root functions were infrequently used, relative to the multiplication function, I chose to build my ALU with only addition, subtraction, AND, OR, XOR, and multiplication operations. This approach reduced the resources needed to implement my project.

_________________
Michael A.


Fri Sep 06, 2019 9:40 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Spent a lot of time the last couple of months taking a break from the daily grind and playing games instead.
Quote:
Have you considered using the Goldschmidt Square Root Algorithm?

I sketched out a Goldschmidt division in Verilog for an FPGA and realized it didn't map that well to FPGA logic. The issue was with the multipliers required. The multipliers ended up having latencies that affected the performance of the Goldschmidt algorithm. With a six or eight cycle latency for each multiply it was just about as fast to use the standard division algorithm. I've not looked closely at the square root but I suspect the issue of getting a fast multiply is still present.
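To make the latency concern concrete, here is a floating-point sketch of Goldschmidt division with a very rough cycle model. The cycle model assumes the two multiplies in each step overlap in a pipelined multiplier, so each step costs one multiplier latency; that is an assumption for illustration, not a measurement of the Verilog mentioned above.

```python
def goldschmidt_divide(n: float, d: float, iterations: int = 6) -> float:
    """Approximate n / d by driving d toward 1 with repeated multiplication."""
    if not 0.0 < d < 2.0:
        raise ValueError("d must be scaled into (0, 2) for the iteration to converge")
    for _ in range(iterations):
        f = 2.0 - d    # correction factor
        n *= f         # numerator and denominator are scaled by the same factor,
        d *= f         # so n / d is unchanged while d approaches 1
    return n


def multiply_latency_cycles(iterations: int, multiplier_latency: int) -> int:
    """Rough cycle count: one multiplier latency per step (multiplies overlapped)."""
    return iterations * multiplier_latency
```

Under this model four steps with an eight-cycle multiplier already cost 32 cycles, before counting the initial estimate and operand scaling.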


Migrated to nvio version 3, basically starting again and revamping the ISA, since it’s been a while since I looked at it. Widened the instructions to 41 bits to make better use of the 128 bits available.
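One way to read the 41-bit figure is that three instructions fit in 128 bits with five bits left over. The sketch below packs and unpacks such a bundle; the three-slot layout, the slot order, and the placement of the spare bits at the top are all assumptions for illustration.

```python
from typing import List, Sequence, Tuple

SLOT_BITS = 41
SLOTS = 3                                     # assumption: three slots per bundle
BUNDLE_BITS = 128
SPARE_BITS = BUNDLE_BITS - SLOTS * SLOT_BITS  # bits left after the slots
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(slots: Sequence[int], spare: int = 0) -> int:
    """Pack three 41-bit instructions, slot 0 lowest, spare bits on top."""
    if len(slots) != SLOTS:
        raise ValueError("a bundle holds exactly three instructions")
    if not 0 <= spare < (1 << SPARE_BITS):
        raise ValueError("spare field out of range")
    bundle = spare << (SLOTS * SLOT_BITS)
    for i, instruction in enumerate(slots):
        if not 0 <= instruction <= SLOT_MASK:
            raise ValueError("instruction does not fit in 41 bits")
        bundle |= instruction << (i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle: int) -> Tuple[List[int], int]:
    """Split a 128-bit bundle back into its three slots and the spare bits."""
    slots = [(bundle >> (i * SLOT_BITS)) & SLOT_MASK for i in range(SLOTS)]
    return slots, bundle >> (SLOTS * SLOT_BITS)
```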

Modified the JAL instruction so that only registers 0 to 3 may be the target. Typically, a single register is designated as the return address register, so only one is really needed. Being able to specify more bits in the register spec isn’t that helpful, and the bits are better used as part of the address.
With the JAL instruction able to specify a 26-bit address, the CALL instruction is being left out of the ISA for now. Are code blocks larger than 64MB really needed? Also removed is the explicit return instruction. JAL can perform this operation, although the stack pointer will need to be adjusted with a separate add instruction.

_________________
Robert Finch http://www.finitron.ca


Wed Oct 23, 2019 4:20 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
The author is contemplating the use of condition registers in the ISA similar to the PowerPC’s condition registers. A separate set of CR’s would make the ISA more complex and require dependency logic for eight more registers. The author is trying to deviate from creating the “ideal” ISA design into something more interesting.

DPO (displacement plus offset) addressing is being used for branch targets. The offset portion is set to a 256-byte page size to allow code to be relocated at 256-byte address boundaries. Not sure if this actually makes a difference, but the idea behind this is that code is relocated with some of the cache line layout preserved. This should help avoid sudden changes in performance just because code is relocated. Cache lines are currently 32 bytes but 64 or more may be used in the future.
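One possible reading of DPO addressing is sketched below: the displacement counts 256-byte pages relative to the page of the current program counter, and the offset selects a byte within the target page. This interpretation and the function name are assumptions; the post does not spell out the exact encoding.

```python
PAGE_BITS = 8                 # 256-byte pages, as in the post
PAGE_SIZE = 1 << PAGE_BITS
CACHE_LINE = 32               # current cache line size from the post


def dpo_target(pc: int, page_displacement: int, offset: int) -> int:
    """Branch target: page of pc plus a page displacement, then a byte offset."""
    if not 0 <= offset < PAGE_SIZE:
        raise ValueError("offset must fall within one 256-byte page")
    target_page = (pc >> PAGE_BITS) + page_displacement
    return (target_page << PAGE_BITS) | offset
```

Under this reading, moving the code by a multiple of 256 bytes leaves both the encoded branch and the target's position within its cache line unchanged. For instance, `dpo_target(0x1234, 2, 0x10)` and `dpo_target(0x1534, 2, 0x10)` differ by exactly 0x300.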

Dropped the branch-on-bit instructions from the ISA. They are nice to have but not frequently used. They have some overlap in functionality with branch-on-equal immediate.

_________________
Robert Finch http://www.finitron.ca


Sat Oct 26, 2019 4:38 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Tonight, thoughts on non-linear program memory. It would work by executing pages of memory, with a branch to the next page at the end of each memory page.

Still screwing around with the ISA. Working on the branches. It may be better if code is relocatable only in units of the memory page size, meaning the base memory page size needs to be known. However, it is desirable to decouple memory management from the ISA. With DPO addressing a page of code could be relocated to a non-contiguous section of memory merely by altering the displacement portion of branch target addresses, as long as the code is within ±64MB of the current program counter. This might be a handy option for program loading, allowing programs to be loaded into non-contiguous memory.

The base memory management page size needs to be fixed. With a 128-bit machine a page size of 4kB is probably too small; it would hold only 256 128-bit entries. This would result in many levels of page tables for larger apps. The next nybble up in size is 64kB. Such a large page size may result in unused memory areas.
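The page size tradeoff can be put into numbers. The sketch below counts 128-bit entries per page and the page table levels needed to cover a virtual address; the 64-bit virtual address width used in the example is an assumption chosen only for illustration.

```python
ENTRY_BYTES = 16   # one 128-bit page table entry


def entries_per_page(page_bytes: int) -> int:
    """Number of 128-bit entries that fit in one page."""
    return page_bytes // ENTRY_BYTES


def page_table_levels(virtual_address_bits: int, page_bytes: int) -> int:
    """Levels needed when each level is one page of entries (power-of-two sizes)."""
    page_offset_bits = page_bytes.bit_length() - 1
    index_bits = entries_per_page(page_bytes).bit_length() - 1
    remaining = virtual_address_bits - page_offset_bits
    return -(-remaining // index_bits)   # ceiling division
```

With a hypothetical 64-bit virtual address, 4kB pages would need seven levels of tables while 64kB pages would need four.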

_________________
Robert Finch http://www.finitron.ca


Mon Oct 28, 2019 4:51 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
Is virtual memory a good idea anymore for all memory access? The current memory model is a large random access memory for both code and data and stack/heap randomly placed in memory.
A segmented model might be a better option, keeping critical code & data totally in memory, with virtual memory for user land stuff and video displays. A segment of up to about 20 to 22 bits feels like a good size. A video coprocessor and I/O processor may also share memory, so a CPU may be too fast, leaving no free external memory cycles for other things. A 16KB virtual memory block seems a good size, so your disc I/O can read and write blocks of that size.


Mon Oct 28, 2019 7:21 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
Is virtual memory a good idea anymore for all memory access?
That’s the kind of thing I was thinking. I don’t see virtual memory relocations being necessary for code. It’s probably not too bad to have the program loader fix up all the target addresses and place jumps at the end of pages. This is similar to the way a loader works on an eight-bit micro. It would save some space in the address translation tables if only data addresses needed to be translated. The page attributes still need to be maintained for code though. One issue that arises then is that program code cannot move easily once it’s loaded. I think there still needs to be a table in order to calculate shared code addresses though. But it might be possible to avoid looking up code addresses via an mmu on an I$ miss.

The number of general-purpose registers in the machine is being reduced to 56 from 64. Eight register codes will be used to represent condition code registers. This is to keep the register tags required in the processor down to six bits.
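A small sketch of how 56 general-purpose registers and eight condition registers could share one six-bit tag space. Which codes go to the condition registers is not stated; placing them in the top eight codes is an assumption for illustration.

```python
from typing import Tuple

TAG_BITS = 6
NUM_GPRS = 56   # from the post
NUM_CRS = 8     # from the post


def decode_register_tag(tag: int) -> Tuple[str, int]:
    """Map a 6-bit register tag to ('gpr', n) or ('cr', n)."""
    if not 0 <= tag < (1 << TAG_BITS):
        raise ValueError("register tag is six bits wide")
    if tag < NUM_GPRS:
        return ("gpr", tag)
    return ("cr", tag - NUM_GPRS)   # assumption: CRs occupy the top eight codes
```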

_________________
Robert Finch http://www.finitron.ca


Tue Oct 29, 2019 2:56 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
2019/10/31
For nvio3 the author has decided to use a simple eight-bit base opcode from which the targeted functional unit is easily decoded. The high order four bits determine the functional unit. This replaces using a template to select the functional unit. Only about half the opcodes are assigned; some sparseness is used to simplify decoding.
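Selecting the functional unit from the high four bits amounts to a table lookup on the upper nibble. The unit names and nibble assignments in this sketch are hypothetical; only the use of the high-order four bits comes from the post.

```python
# Hypothetical assignment of high nibbles to functional units.
UNIT_BY_HIGH_NIBBLE = {
    0x0: "integer",
    0x4: "floating-point",
    0x8: "memory",
    0xC: "branch",
}


def functional_unit(opcode: int) -> str:
    """Pick the functional unit from the high four bits of an 8-bit opcode."""
    if not 0 <= opcode <= 0xFF:
        raise ValueError("base opcode is eight bits wide")
    return UNIT_BY_HIGH_NIBBLE.get(opcode >> 4, "unassigned")
```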
2019/11/01
Worked on the branch architecture today. Added loop count registers and separate link registers. The great thing about loop count registers is that they can be decremented and tested in the same branch instruction as other conditions are tested in. The author is making the design look somewhat similar to the PowerPC in hopes of leveraging some of the software available for that architecture.
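The combined decrement-and-test branch can be modeled as below. The semantics follow the PowerPC style of decrementing the count, then branching if the count is non-zero and the other condition holds; whether nvio3 uses exactly these semantics is an assumption.

```python
from typing import Callable, Tuple


def branch_decrement_and(loop_count: int, condition: bool) -> Tuple[int, bool]:
    """Decrement the loop count; branch is taken if it is non-zero and condition holds."""
    new_count = loop_count - 1
    return new_count, (new_count != 0 and condition)


def run_loop(count: int, keep_going: Callable[[int], bool]) -> int:
    """Run a loop closed by the combined branch; return the iterations executed."""
    iterations = 0
    while True:
        iterations += 1   # loop body would go here
        count, taken = branch_decrement_and(count, keep_going(iterations))
        if not taken:
            return iterations
```

A loop with a count of four and an always-true condition runs four times, while an early-exit condition ends it sooner without a second branch instruction.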

_________________
Robert Finch http://www.finitron.ca


Sat Nov 02, 2019 11:38 am