


 [ 133 posts ]  Go to page Previous  1 ... 4, 5, 6, 7, 8, 9  Next
 nvio 
Author Message

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1531
Location: Canada
Looking at the instruction formats, the author is considering switching to a 32-bit format from 41 bits. Nine bits would have to be trimmed off every instruction, but it looks doable for the most part. The author originally decided to go with a 40-bit instruction because register spec fields took up six bits each. But with only a five-bit register spec there are more free bits in the instruction.
The author is also looking at making the design a 4-way superscalar capable of queuing four instructions in a single clock cycle. That doesn’t work well with instructions held in 128-bit bundles. A six-way superscalar would work better with bundles. The author is wondering just how big to make the design.

Added instructions to load / store link registers from/to memory.

Well, the author decided to switch to 40-bit instructions which are not bundled. This should allow easier implementation of varying superscalar widths. The current goal is a four-wide machine. The instruction cache line will hold eight forty-bit instructions for a line width of 320-bits.

_________________
Robert Finch http://www.finitron.ca


Fri Nov 15, 2019 4:28 am WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1647
> In the morning I think I can conquer any level of complexity. At the end of the day I can’t make things simple enough.
Hee hee!

> Mask registers support logical operations between them, and a few other operations...
Hmm seems like they could be the same things as the usual integer registers. How do Intel manage their super-wide SIMD extensions - does that give any clue? I believe they have now reached 512 bits wide... ah yes, they have 8 "opmask" registers:
> Most AVX-512 instructions may indicate one of 8 opmask registers (k0–k7). For instructions which use a mask register as an opmask, register `k0` is special: a hardcoded constant used to indicate unmasked operations. For other operations, such as those that write to an opmask register or perform arithmetic or logical operations, `k0` is a functioning, valid register. In most instructions, the opmask is used to control which values are written to the destination. A flag controls the opmask behavior, which can either be "zero", which zeros everything not selected by the mask, or "merge", which leaves everything not selected untouched.
- https://en.wikipedia.org/wiki/AVX-512#E ... d_features


Fri Nov 15, 2019 10:21 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 273
As a reminder, 40/48 bits (external size) have a better standard floating point number format than 32 bits. Double precision is always a slower option in both memory bandwidth and computing speed.
Some good ideas on floating point/decimal computing can be found here. http://www.quadibloc.com/main.htm


Fri Nov 15, 2019 6:19 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1531
Location: Canada
Quote:
A flag controls the opmask behavior, which can either be "zero", which zeros everything not selected by the mask, or "merge", which leaves everything not selected untouched.
Having a zero or merge is such a good idea that the nvio author decided to use one of the unused bits in the instruction to indicate for a zeroing or merge.
The author is amazed at how the x86 architecture has adapted. What started out as an eight-bit, accumulator-oriented design is now a register-oriented vector processor.

Looked at the AVX512 features. Each of 32 registers is 512 bits wide. A subset of the register is then considered to be a vector element. nvio in those terms is actually an array processor. nvio features the “poor man’s” SIMD, allowing portions of a register to be updated in isolation. A 128-bit wide register may be treated as four 32-bit values, for instance. It’s tempting to make things 512 bits wide having seen the AVX512.

For AVX512 I wonder how an immediate constant is loaded into a register. 512 bits wide instruction operand?

Quote:
As a reminder, 40/48 bits (external size) have a better standard floating point number format than 32 bits. Double precision is always a slower option in both memory bandwidth and computing speed.
The author prefers an 80-bit double format himself, with 40-bit single precision and designed a couple of 80-bit cpu’s as a result.

Back to nvio3. Switching to regular fixed size 40-bit instructions means that instructions can be accessed as an array of 40-bit values in program code. The author is tempted to add an instruction type to the compiler to allow manipulation of instructions.

Switching from 128-bit bundles has made the I$ more challenging. So, the paradigm of distinguishing between instruction and data addresses is used.

The processor reads from the I$ in 160-bit bundles (4 x 40 bits). Four bundles fit onto a cache line, making a cache line 640 bits or 80 bytes wide. This is only 25% wider than the typical 64-byte cache line. 640-bit lines present a challenge when accessing the cache, as a cache line is not a power of two bytes in length; it is a multiple of 40 bits. The line can instead be viewed in terms of instructions, of which there are 16, a nice power of two. So, instruction addresses are in terms of the instruction count, not a byte address. Converting from an instruction address to a memory byte address is simple: multiply by five, which amounts to a shift and add. The I$ controller needs to convert a line number to a byte address, so it multiplies by 80 (again shifts and an add). Accessing the correct memory bytes to load a cache line is handled by the I$ controller. The I$ controller loads 128-bit values which must then be shifted to the correct position for the cache line. Fortunately, making lines 80 bytes wide means they are a multiple of 128 bits in width.
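The address arithmetic described above can be checked with a minimal Python sketch. This is only a software model, not the author's HDL; the function names and the helper that lists the fill addresses are made up for illustration.

```python
INSTR_BITS = 40                 # fixed instruction size
INSTRS_PER_LINE = 16            # four 160-bit bundles of four instructions
LINE_BYTES = INSTR_BITS * INSTRS_PER_LINE // 8   # 640 bits = 80 bytes
BUS_BYTES = 16                  # the I$ controller loads 128-bit values

def instr_to_byte_addr(instr_addr):
    """Multiply by five with a shift and an add: x*5 = (x << 2) + x."""
    return (instr_addr << 2) + instr_addr

def line_to_byte_addr(line_no):
    """Multiply by 80 with shifts and an add: x*80 = (x << 6) + (x << 4)."""
    return (line_no << 6) + (line_no << 4)

def line_of_instr(instr_addr):
    """Sixteen instructions per line, so the line number is a plain shift."""
    return instr_addr >> 4

def slot_in_line(instr_addr):
    """Position of the instruction within its cache line (0..15)."""
    return instr_addr & 15

def fill_addresses(line_no):
    """Byte addresses of the 128-bit loads needed to fill one line
    (hypothetical helper; 80 bytes is exactly five 16-byte loads)."""
    base = line_to_byte_addr(line_no)
    return [base + i * BUS_BYTES for i in range(LINE_BYTES // BUS_BYTES)]
```

Because 80 is a multiple of 16, every line starts on a 128-bit boundary, so `fill_addresses` never has to split a bus transfer across two lines.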

The author has seen the light of using more, smaller register files rather than a unified file. It makes it easier to update more registers in a single clock cycle. Having a separate link register file, for instance, means that the result of a branch instruction can update the link register file while two more updates are occurring for the general-purpose register file. With a unified register file only two updates per clock were possible.
***********
One complication in the current design is that result routing logic is now required to route each result to the appropriate register file. The re-order buffer contains a generic result buffer; it could contain data for any register file. There is some shuffling logic involved. Since updates are occurring in the same clock cycle, the order from the re-order buffer doesn’t have to be strictly followed when different register files are targeted. (Register updates within the same register file do have to follow order.) For instance, consider updating r1, Lk2, and r3 on commit lanes #0, #1, and #2. The value for r3 on commit lane #2 has to be moved to commit lane #1 of the general-purpose register file. The value of the link register on commit lane #1 has to be moved to commit lane #0 of the link register file.

At the same time shuffling is occurring a check must be made that there are sufficient resources to update the processor. For instance, trying to update three general purpose registers in the same clock causes a stall of following instructions because only two ports are available.
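A rough Python model of the shuffle and the resource check, for illustration only. The two general-purpose write ports come from the description above; the single link-register write port, the tuple layout, and the function name are assumptions made for this sketch.

```python
from collections import defaultdict

# Write ports per register file. Two gpr ports is from the text above;
# one link port is an assumption for this sketch.
WRITE_PORTS = {"gpr": 2, "link": 1}

def route_commits(results, ports=WRITE_PORTS):
    """Route re-order buffer results to per-file commit lanes.

    results: list of (file, reg, value) in re-order buffer order.
    Returns (routed, committed): routed[file] lists (reg, value) in lane
    order for that file, committed counts the results retired this cycle.
    """
    routed = defaultdict(list)
    committed = 0
    for file, reg, value in results:
        if len(routed[file]) >= ports[file]:
            break               # out of ports: this and later results stall
        routed[file].append((reg, value))   # order kept within each file
        committed += 1
    return dict(routed), committed
```

With the r1, Lk2, r3 example, `route_commits([("gpr", "r1", 10), ("link", "lk2", 20), ("gpr", "r3", 30)])` places r3 on the second general-purpose lane and lk2 on the first link lane, and all three results retire. Three general-purpose results in a row retire only two.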

The author feels that all this logic in the commit path may create timing issues.

_________________
Robert Finch http://www.finitron.ca


Sat Nov 16, 2019 3:42 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1531
Location: Canada
Worked mainly on the condition code register side of things today.

Changed the case statement for the queue head increment amount to an if/else statement. The if/else is a more compact representation, otherwise there would have been 256 cases to code for (many redundant).

Not sure whether or not to include logic to process the condition registers as an entire group. It would be nice to be able to save all the condition registers in a single word in memory using a single load / store instruction, rather than having to save each register individually. The issue with treating all the registers as one unit is that more dependency checking logic is required in the core. The dependency check must be against a range of registers rather than a single register, and I’m not sure it’s worth the extra cost. The PowerPC must be doing this as it allows moving to/from all the condition registers to a general-purpose register, and there is no note about needing synchronizing instructions.
Designated one of the condition register bits as a user defined flag. This makes it possible to jump based on user defined conditions. Not sure about the merits of this. It might end up being used to return single bit values from functions.

Quite a few changes to the instruction formats have occurred so a newer copy of the formats is shown below.
Attachment:
IFormats1a.png

Attachment:
IFormats2a.png



_________________
Robert Finch http://www.finitron.ca


Sun Nov 17, 2019 3:51 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1531
Location: Canada
Have run into a number of undocumented instructions which are a consequence of the implementation. Since there’s a five-bit field for source and target registers, and only three bits are significant to represent mask registers, the extra two bits, if set properly, may access the link and condition registers. This makes it possible to perform masking operations on a link register, for instance.

A side-effect of the way the processor is implemented allows storing a compare result to one of the link registers, mask registers or condition registers. Being able to store the result to a link or mask register was unintentional and is likely to have little use.

The vector length register isn’t going to have dependency checking on it. The length register acts like a sixth operand to a vector instruction, otherwise doesn’t have much use. The issue is making the core about 20% larger just to support the length register doesn’t make a lot of sense. Lack of checks on the vector length register probably isn’t much of an issue. It’s likely to be infrequently updated and can be followed by a sync instruction when it’s updated.

_________________
Robert Finch http://www.finitron.ca


Mon Nov 18, 2019 4:11 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1531
Location: Canada
Seriously considering a major revision to the architecture. Thank heavens I haven’t got into major league coding yet. What’s present now is really an array processor. The author hadn’t really intended to develop an array processor, and there are issues cropping up that are outside of scope. The author may revise things to use 256-bit wide registers, which are really vector registers more in line with something like the AVX512. See the following for one reason for 256-bit registers.

Increased the size of the bit-matrix-multiply operation. It now supports a 10x10 bit matrix instead of 8x8. The size could be increased because registers are 128-bits. A 10x10 bit matrix uses the low order 100 bits of a register. This has got me thinking.
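To make the bit counts concrete, here is a small Python sketch of a bit-matrix multiply on matrices packed into a single register value. The packing (row-major, row 0 in the low bits) and the use of XOR to combine products (arithmetic over GF(2)) are assumptions for illustration; the post does not say which convention nvio3 uses, and some machines use OR instead.

```python
def bmm(a, b, n):
    """Multiply two n x n bit matrices packed row-major into integers.

    Row i occupies bits [i*n, i*n + n). Products are combined with XOR,
    which is an assumed convention, not taken from the post.
    """
    row_mask = (1 << n) - 1
    b_rows = [(b >> (k * n)) & row_mask for k in range(n)]
    out = 0
    for i in range(n):
        a_row = (a >> (i * n)) & row_mask
        acc = 0
        for k in range(n):
            if (a_row >> k) & 1:     # A[i][k] set: fold in row k of B
                acc ^= b_rows[k]
        out |= acc << (i * n)
    return out

def identity(n):
    """n x n identity matrix in the same packing."""
    return sum(1 << (i * n + i) for i in range(n))
```

A 10x10 operand needs 100 bits and fits a 128-bit register; a 16x16 operand needs all 256 bits, which is where the register-pair question comes from.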

Toying with the idea of a 16x16 bmm using pairs of registers to hold the 256 bits. Three options for implementing it are:
1) The register pair would be an integer register and the corresponding floating-point register, eg. r3 and fp3. Using integer and fp registers is needed in order to read four registers at once; otherwise the design only supports three register read ports. Also, there’s only room in the instruction to specify three ports. So, the two-port version of the instruction would be used, and the index into the integer and float register files must be the same.
2) Another possibility is to link two instructions together, so they are issued as a single entity. They would queue separately but issue together. That would give three more read ports and another write port. It’s appealing to do because it may be useful for other instructions as well (bitfield inserts).
3) Just use 256-bit registers.

Studying vector mask registers again with the notion that nvio3 is really an array processor. The current issue with the mask registers is that they work vertically, controlling processing for each element (row) of the vector. But the vector registers can contain multiple elements horizontally. So potentially a mask for a vector operation should be a two-dimensional mask. There are 16 bytes that could be processed in a vector register times 128 elements per register. To mask everything would require mask registers with 2048 bits in them. This is a bit ridiculous. So, the author is working with the idea of independent row and column masks. The row mask would restrict which rows of a vector register are processed, and the column mask would restrict which columns are processed. It’s not as powerful as having a bit for every element. So, a single mask register would be ‘L’ shaped containing both a row and column mask. Having the register ‘L’ shaped means only a single mask register must be specified in the instruction. How to implement an array set operation?
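A quick Python sketch of the row/column mask idea, to show the saving in mask bits. The bit layout of the combined ‘L’ shaped register (row mask in the low bits, column mask above it) and the function names are assumptions for illustration only.

```python
ROWS = 128   # elements (rows) per vector register
COLS = 16    # bytes (columns) per element

def element_enabled(row_mask, col_mask, row, col):
    """A byte is processed only if both its row and its column are enabled."""
    return bool((row_mask >> row) & 1) and bool((col_mask >> col) & 1)

def pack_l_mask(row_mask, col_mask):
    """Combine both masks into one register value (assumed layout)."""
    return (col_mask << ROWS) | row_mask

def unpack_l_mask(l_mask):
    """Split the combined register back into (row_mask, col_mask)."""
    return l_mask & ((1 << ROWS) - 1), l_mask >> ROWS

def enabled_count(row_mask, col_mask):
    """Number of bytes selected: the masks pick out a rectangle."""
    return bin(row_mask).count("1") * bin(col_mask).count("1")
```

The combined mask needs 128 + 16 = 144 bits instead of 2048. The price is that only rectangular selections can be expressed; a diagonal, for example, cannot.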

_________________
Robert Finch http://www.finitron.ca


Tue Nov 19, 2019 3:14 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1531
Location: Canada
Widened the vector registers to 256 bits and got rid of the rows associated with a vector register. Revamped the vector instructions (got rid of them). Instead the regular instruction set is used and a bit in the instruction specifies whether or not it’s a vector operation so there’s no longer a separate set of encodings. There were already bits reserved to specify the operation size, so these bits were combined with the vector indicator to form a format field. The vector mask field had to be squeezed into the instruction format. The instruction format charts are seriously out-of-date again. It’s been a lot of keying to code some of the alu operations. The alu is effectively 256 bits wide to support vector operations, but only the low order 128-bits are used for non-vector ops. Vector masking for merge or zero was added to the alu.
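The merge versus zero behaviour is easy to show in a few lines of Python. This is a sketch of the semantics as described in the AVX-512 quote earlier in the thread, using lists in place of vector lanes; the function name and argument order are made up.

```python
def masked_write(dest, result, mask, zeroing):
    """Apply a vector mask to an ALU result.

    dest:    old destination lanes
    result:  freshly computed lanes
    mask:    bit i set means lane i is selected
    zeroing: True zeros unselected lanes, False leaves them untouched (merge)
    """
    out = []
    for i, (old, new) in enumerate(zip(dest, result)):
        if (mask >> i) & 1:
            out.append(new)
        else:
            out.append(0 if zeroing else old)
    return out
```

For example, with `dest = [1, 2, 3, 4]`, `result = [10, 20, 30, 40]` and `mask = 0b0101`, zeroing produces `[10, 0, 30, 0]` while merging produces `[10, 2, 30, 4]`.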

Got rid of a number of operations involving three registers. Looking at how much hardware was consumed versus their utility, they probably aren’t worth the extra hardware. They can always be added back in at a later date. There’s lots of room in the encoding space.

_________________
Robert Finch http://www.finitron.ca


Wed Nov 20, 2019 4:32 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1531
Location: Canada
Had to replicate the ‘B’ operand in the alu across the width of a vector register for vector-scalar operations. A bit in the instruction indicates that the second operand to the instruction is a scalar, not a vector, for a vector instruction. This takes care of most of the cases where it’s desired to use a scalar and a vector. Using these instructions, a scalar can be loaded across a vector register for use with one of the vector instructions that don’t support scalars. (The vector R3 instructions like FMA don’t accept a scalar operand).

Debating whether or not to use separate vector register files for integer and floating-point vectors. There are, after all, separate files for scalar values. Should this dichotomy be maintained?
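Replicating the ‘B’ operand can be modelled in Python as below. The 256-bit register is held in one integer; the function names and the lane-wise add used to exercise the broadcast are assumptions for illustration, not the actual alu code.

```python
VECTOR_BITS = 256

def broadcast(scalar, elem_bits):
    """Replicate a scalar into every lane of a 256-bit vector value."""
    scalar &= (1 << elem_bits) - 1
    value = 0
    for lane in range(VECTOR_BITS // elem_bits):
        value |= scalar << (lane * elem_bits)
    return value

def vector_add(a, b, elem_bits, b_is_scalar=False):
    """Lane-wise add with wraparound; b is replicated first when the
    (hypothetical) scalar bit of the instruction is set."""
    if b_is_scalar:
        b = broadcast(b, elem_bits)
    mask = (1 << elem_bits) - 1
    out = 0
    for lane in range(VECTOR_BITS // elem_bits):
        shift = lane * elem_bits
        lane_sum = ((a >> shift) & mask) + ((b >> shift) & mask)
        out |= (lane_sum & mask) << shift
    return out
```

Adding a scalar of zero to a zeroed vector with the scalar bit set is one way to think of “loading a scalar across a vector register”: `vector_add(0, s, 32, b_is_scalar=True)` equals `broadcast(s, 32)`.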

Had to split the memory used for the vector register file into two separate eight port memories. The toolset refused to use distributed memory because there were too many ports for a single memory. The vector register file needs 16 read ports. Four for each of four queue slots.

Got the instruction pointer module updated. A little simpler than nvio 1’s because of the lack of relative branches. So no need to add a displacement to the ip.

The author is estimating the design to be about 800k to 1.5M LC’s. This is based on nvio1, which was 3-way and about 400k LC’s when maxed out. The internal bus for nvio3 is three times the size, plus it’s 4-way.

_________________
Robert Finch http://www.finitron.ca


Thu Nov 21, 2019 3:06 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1531
Location: Canada
The latest instruction formats. Mainly rearranged to support a mask register spec. on "normal" instructions.
Attachment:
IFormats3a.png

Attachment:
IFormats3b.png



_________________
Robert Finch http://www.finitron.ca


Thu Nov 21, 2019 10:15 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1531
Location: Canada
Exploring ways to implement string loads and stores along with loading and storing multiple registers using a single instruction. To do this, variations on the same instruction must be queued. That means updating a copy of the instruction in the decode buffer. A string would be loaded into consecutive vector registers as required up to the string length. This requires incrementing the target register field in the instruction. But the question is: how does the core know when to stop queuing string instructions? -> Right now, the core is streaming string instructions to the queue until it receives a stop signal. The stop signal activates at the queue of the last valid string instruction and stays active until the next instruction is queued. While the stop signal is active, string instructions which are streaming to the queue are turned into nop operations.

Just what kind of string instructions are good enough? Thor had a set of string instructions like the x86. But with nvio3’s ability to load 16 bytes at a time, it doesn’t make a lot of sense to have memory-based string instructions which work a byte at a time. It’s probably faster to load bytes of a string into a register, then use other instructions with the register contents. So, string instructions are limited to loading and storing to/from vector registers.

To try and get a feel for string operations a strlen() routine was coded.
Code:
; Parameters:
;    r4 = address of string
; Returns:
;    r1 = string length
;
_bstrlen:
      mov      r2,r0          ; r2 = 0
.0001:
      ldh      r1,[r4+r2]     ; load 16 bytes at once
      add      r2,r2,#16      ; increment pointer
      bytndx   r3,r1,r0       ; get index of zero byte
      slt      cr0,r3,r0      ; not found if index < 0
      jeq      .0001
      sub      r2,r2,#16      ; pointer was to next
      add      r1,r2,r3       ; compute length in r1
      rts
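
For anyone wanting to trace the loop by hand, the same algorithm can be sketched in Python. The `bytndx` model here (index of the first matching byte in a 16-byte chunk, or -1 if there is none) is an assumption inferred from the comments in the routine, not a documented definition of the instruction.

```python
def bytndx(chunk, value=0):
    """Assumed model of bytndx: index of the first byte equal to value,
    or -1 when the chunk does not contain it."""
    return chunk.find(bytes([value]))

def bstrlen(memory, addr):
    """Length of the zero-terminated string at addr, scanning 16 bytes
    per iteration like the assembly loop."""
    offset = 0
    while True:
        chunk = memory[addr + offset: addr + offset + 16]  # ldh: 16 bytes
        index = bytndx(chunk)
        if index >= 0:
            return offset + index      # bytes before this chunk + index
        if len(chunk) < 16:
            raise ValueError("unterminated string")
        offset += 16
```

The Python version adds the index to the offset of the current chunk directly, so it does not need the `sub r2,r2,#16` correction the assembly uses after incrementing the pointer early.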

Arithmetic for string instructions is only supported in one address generator. An offset needs to be added to the address which is updated as the string operation proceeds.

Modified the data cache controller to work with data up to 256-bits in size. nvio1 only dealt with data items up to 80 bits in size. The write buffer logic still needs to be updated. Since the external memory interface is only 128-bits wide, multiple bus cycles are required to update and read memory.

Added the ability to push two general-purpose registers to the stack at the same time. Since the gpr’s are only 128-bits wide and the internal memory system can handle 256-bit data, two register values could be concatenated together to form a single 256-bit value. When this goes against the data cache it gives twice the performance. The author is very tempted to do the same for other load and store instructions, turning them into load/store pair. The cost is six bits of opcode space which must come from the displacement constant. This may not be much of an issue since there is a 21-bit displacement. It’s probably worth it to reduce the displacement to 15 bits to get double the performance out of load and store operations. It’s more opcodes though and opcode space is getting tight.
Loading / storing pairs of registers in this way is not the author’s idea. He’s seen it implemented elsewhere. (eg. Itanium)

_________________
Robert Finch http://www.finitron.ca


Fri Nov 22, 2019 3:34 am WWW

Joined: Fri Nov 22, 2019 5:31 pm
Posts: 4
I wonder if a looping mechanism similar to what DSPs offer may be more beneficial than single instruction string moves (a la x86). dsPICs offer such a hardware loop feature, for example.


Fri Nov 22, 2019 5:37 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1647
(welcome, plainsteve!)


Fri Nov 22, 2019 8:02 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1531
Location: Canada
Quote:
I wonder if a looping mechanism similar to what DSPs offer may be more beneficial than single instruction string moves (a la x86). dsPICs offer such a hardware loop feature, for example.
Welcome! I'm not sure I'm familiar with DSP looping. Can you elucidate the workings? Is that like a dedicated decrement/increment and branch instruction like the PowerPC?

Put logic in the commit path to handle two results per commit slot. This is to support instructions like load register pair and pop. There’s now quite a bit of routing in the commit path. It’s late and I’m thinking this is maybe a bad idea.

Stole two bits off the displacement constant to add an address mode field to load / store instructions. Then got rid of the opcodes for indexed addressing, as it is represented by the address mode bits. Four basic modes are supported: indirect with displacement, auto increment, auto decrement, and indexed.

The author isn’t sure he has the re-order buffer increment proper, or the commit shuffling logic. It’s complicated by the fact that a single rob entry can commit two results, but it doesn’t have to. The issue is what if the first rob entry result can commit, but the second one can’t? At the moment the re-order buffer increment is being managed as a fixed-point number with one binary point, so the increment can increment by ½ if needed. The increment is then rounded down before being applied to increment the pointer. So, if 2.5 out of four commit paths can commit, the pointer will increment by just two and the commit of the 0.5 will be done again in the next clock cycle. It’s uncertain if committing results twice will be an issue. An additional complication is the re-order buffer increment must skip over entries that are no longer valid, but only the entries after the valid ones have committed up to the tail pointer.
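The half-step bookkeeping can be sketched in Python by counting in units of one half. This is only a model of the arithmetic described above; the tuple layout and the function name are invented for the sketch, and skipping over invalid entries is not modelled.

```python
def rob_increment(entries):
    """Compute the re-order buffer pointer increment for one clock.

    entries: (results_needed, results_committable) for each head entry,
             in order, where results_needed is 1 or 2.
    Returns (increment, halves): halves is the fixed-point count with one
    fractional bit, increment is that count rounded down.
    """
    halves = 0
    for needed, committable in entries:
        if committable >= needed:
            halves += 2          # whole entry commits: +1.0
        elif needed == 2 and committable == 1:
            halves += 1          # only the first result commits: +0.5
            break
        else:
            break                # nothing commits, later entries must wait
    return halves >> 1, halves
```

For example, `rob_increment([(1, 1), (2, 2), (2, 1), (1, 1)])` returns `(2, 5)`: 2.5 entries worth of results commit, the pointer advances by two, and the half-committed entry is presented again on the next clock.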

Well, after a monster code editing session, it’s time to take a stab at synthesizing the code. The author hopes to get an approximate size.

_________________
Robert Finch http://www.finitron.ca


Sat Nov 23, 2019 4:42 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1531
Location: Canada
First synthesis result: almost exactly 300,000 LC’s, about twice the size of the minimum configuration of nvio1. However, this is a minimal configuration, only a 1-way setup. Resynthesizing to a larger configuration gives 611,478 LC’s; still a ways to go yet.

Wired up multiple FPUs for vector and scalar operations. Connected the instruction bus, inputs, and outputs, but not the status bus yet. Not quite sure what to do about the status. It’s not impossible to maintain a separate status for each lane of the fpu as long as the fpu lane is at least 32 bits wide. But it’s probably not all that useful to have the entire status available to software. (There isn’t enough room in a single register to hold 16 status results for a 16-bit half precision fpu). Only 32, 64 and 128-bit floating point sizes are currently supported. There’s room in the instruction set for more fp sizes.

Forgot to connect up the target value and vector mask ports of the ALU’s. And missed widening some of the internal busses to 256-bit from 80-bit. The code was a port of the 80-bit nvio1, so lots of it remained the same.

The core only allows a single update port for condition reg logic functions.

Resynthesized with numerous fixes. Well, synthesis has been running all day (13 hours so far), so it must be big.

_________________
Robert Finch http://www.finitron.ca


Sun Nov 24, 2019 3:48 am WWW