 ANY-1 

Joined: Wed Feb 03, 2021 7:09 am
Posts: 4
robfinch wrote:
Quote:
I don't think the MRISC32 has any mask registers.
I believe he dedicated one of the scalar registers as the mask register. The issue with doing that is that the vector length is then restricted to the size of the scalar register. The RISC-V vector extension uses one of the vector registers for the mask. The issue with that is that another read port is required on the vector register file. The x86 AVX-512 extension has eight mask registers.


Hi!

It is correct that using a scalar register as a mask register (as the Cray 1 did) would constrain the size of the vector registers. Thus, to support implementation-defined vector register sizes, a vector register must be used as the mask.

Unlike the old "pure" vector machines, MRISC32 is designed to allow for parallelized execution units (e.g. process four or eight vector elements in parallel). This means that using masks for "skipping" calculations on masked out elements would have little value. Even worse, it would require extra vector register file read ports since the result needs to be a combination of the old and the new value, and operand forwarding needs to work without going through the register file. It's probably possible to solve this with clever hardware solutions, but to simplify the design the primary way to deal with conditionals for vector registers is through bitwise operations (bitwise select, bitwise and, etc) to combine results based on comparisons (e.g. SLT and friends).
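
To make the compare-and-select approach concrete, here is a minimal C sketch of a per-element conditional done with a comparison mask and a bitwise select instead of a mask register. The helper names (`slt_mask`, `bit_select`, `vec_min`) are made up for illustration and are not MRISC32 mnemonics; the only thing taken from the description above is the idea that a comparison yields a mask that is then combined with bitwise operations.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative helpers; the names are hypothetical, not MRISC32 mnemonics. */

/* Comparison result as a lane mask: all ones if a < b, else all zeros. */
uint32_t slt_mask(int32_t a, int32_t b)
{
    return (a < b) ? UINT32_MAX : 0u;
}

/* Bitwise select: bits of x where mask is 1, bits of y where mask is 0. */
uint32_t bit_select(uint32_t mask, uint32_t x, uint32_t y)
{
    return (x & mask) | (y & ~mask);
}

/* Element-wise minimum expressed as compare + select:
   dst[i] = (a[i] < b[i]) ? a[i] : b[i]. */
void vec_min(int32_t *dst, const int32_t *a, const int32_t *b, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        uint32_t m = slt_mask(a[i], b[i]);
        dst[i] = (int32_t)bit_select(m, (uint32_t)a[i], (uint32_t)b[i]);
    }
}
```

Every lane does the same work regardless of the comparison outcome, which is what makes this style a good fit for execution units that process several elements in parallel.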

The current plan for MRISC32 is to dedicate one vector register to being the vector mask. This register will only be used as a mask for load/store operations (as an opt-in mode for load/store). As it turns out, in the MRISC32 ISA load instructions use at most one vector register read port, and store instructions use at most two. This means that the third read port (already there due to SEL and FMA) is free to be used for the mask register. Another option is to store the relevant bits (e.g. one bit per vector element) of the mask register in a separate copy (internal state) that is wired to the load/store operations only.

For stores, only elements that are kept according to the mask will be written to memory. For loads, the masked out elements will be set to zero.
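
As a rough behavioural model of those two rules, the C sketch below treats the mask as one value per element. The function names are hypothetical, and "kept" is assumed here to mean a nonzero mask element; which bit(s) of the mask register actually decide that is not specified above.

```c
#include <stddef.h>
#include <stdint.h>

/* Masked store: only elements whose mask entry is nonzero are written
   to memory; the other memory locations are left untouched.
   Assumption: "kept" means a nonzero mask element. */
void masked_store(uint32_t *mem, const uint32_t *src,
                  const uint32_t *mask, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (mask[i] != 0)
            mem[i] = src[i];
    }
}

/* Masked load: masked-out elements of the destination are set to zero. */
void masked_load(uint32_t *dst, const uint32_t *mem,
                 const uint32_t *mask, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = (mask[i] != 0) ? mem[i] : 0u;
}
```

Zeroing masked-out elements on loads means the destination never depends on its old contents, so no extra read of the destination register is needed.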


Wed Feb 03, 2021 7:38 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1783
(Welcome!)


Wed Feb 03, 2021 10:07 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I think having a dedicated mask register will make it easier to support parallel processing of elements. Pulling multiple bits from a dedicated mask register is a bit easier than pulling them from a vector register. The mask register is basically a single-bit-wide vector register. Using a mask register makes it easy to manipulate multiple bits at once. Suppose one wants to operate on only elements 7 to 23 of a vector register. With a dedicated mask register, a single instruction taking a single clock cycle can set bits 7 to 23 of the register. If a vector register is used instead, then it takes multiple clock cycles to set the appropriate bits.
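
For illustration, this is what such a "set bits lo to hi" operation computes, written as a small C function. The name `mask_range` and the 64-bit mask width are assumptions made for the sketch, not part of the ISA description.

```c
#include <assert.h>
#include <stdint.h>

/* Build a mask with bits lo..hi (inclusive) set and all others clear.
   Models a hypothetical "set mask range" operation on a 64-bit mask
   register; the width of 64 is an assumption. */
uint64_t mask_range(unsigned lo, unsigned hi)
{
    assert(lo <= hi && hi < 64);
    /* All bits from 0 up to and including hi; hi == 63 is handled
       separately to avoid an undefined shift by 64. */
    uint64_t upto_hi = (hi == 63) ? UINT64_MAX
                                  : ((UINT64_C(1) << (hi + 1)) - 1);
    /* All bits strictly below lo. */
    uint64_t below_lo = (UINT64_C(1) << lo) - 1;
    return upto_hi & ~below_lo;
}
```

For elements 7 to 23 the call would be `mask_range(7, 23)`, which selects 17 elements.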

*************

The instruction width has been expanded to 56 bits from 48. This allows the rounding mode to be specified and the register specs to be expanded to 7 bits from 6 bits. Bit 6 of the register spec now selects between a vector and a scalar register, so any mix of vector and scalar registers may be used. The goal is to keep things simple. The ‘V’ bit in the instruction is gone; instead, if any specified register is a vector register then the instruction is considered a vectored instruction. This is one ‘or’ gate on the MSBs of the register specs.
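
The decode rule can be sketched in C as follows. Only "bit 6 of a 7-bit spec selects vector" and "OR the MSBs together" come from the description above; the function names and the idea of passing the specs as an array are just for the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define REGSPEC_VECTOR_BIT 0x40u /* bit 6 of a 7-bit register spec */

/* An instruction is vectored if any of its register specs names a
   vector register: OR the specs together and test bit 6.
   How many spec fields an instruction has is left to the caller. */
bool is_vector_instruction(const uint8_t *specs, unsigned count)
{
    uint8_t ored = 0;
    for (unsigned i = 0; i < count; ++i)
        ored |= specs[i];
    return (ored & REGSPEC_VECTOR_BIT) != 0;
}

/* Register number within the scalar or vector file (low six bits). */
unsigned regspec_number(uint8_t spec)
{
    return spec & 0x3Fu;
}
```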

A lot of the vector instruction descriptions have been merged with scalar descriptions. There are still a few vector specific instructions. These mainly begin with the letter ‘V’.
The docs have been updated.

_________________
Robert Finch http://www.finitron.ca


Fri Feb 05, 2021 3:35 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I have not posted for a while, so this is a keep-alive type post. I've been busy playing Civ VI.

Just going through the ANY1 instruction set and noticing that there are a lot of legal operations that do not make any sense. I am wondering if there should be some kind of filter for these operations, or whether to just let them stand as they are. For instance, it is possible to perform a calc using vector registers and then store the result to a scalar register. Which result should be in the scalar register? The last one? The first one? Or is it an undefined operation?
What about a vector-scalar operation that stores to a scalar, for instance an accumulate? It is tricky to do in hardware because each calc with a new vector element must use a previously calculated value in a scalar register. E.g. x1 = x1 + v2.
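
To show why the accumulate case is awkward, here is a C sketch of x1 = x1 + v2 in two forms. The first is the serial form, where every addition depends on the previous one. The second is a pairwise (tree) reduction, which is one common way around that dependency; it is shown only as an alternative for comparison, not as what ANY1 does. `MAX_VL` and both function names are assumptions for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_VL 64 /* assumed maximum vector length for this sketch */

/* Serial accumulate: each step needs the sum from the previous step,
   so the element operations cannot overlap. */
int64_t accumulate_serial(int64_t x1, const int64_t *v2, size_t vl)
{
    for (size_t i = 0; i < vl; ++i)
        x1 += v2[i];
    return x1;
}

/* Pairwise reduction: roughly halves the number of live elements per
   pass; the additions within one pass are independent of each other. */
int64_t accumulate_tree(int64_t x1, const int64_t *v2, size_t vl)
{
    int64_t tmp[MAX_VL];
    assert(vl <= MAX_VL);
    for (size_t i = 0; i < vl; ++i)
        tmp[i] = v2[i];
    for (size_t n = vl; n > 1; n = (n + 1) / 2) {
        size_t half = n / 2;
        /* Fold the upper part of the live range onto the lower part. */
        for (size_t i = 0; i < half; ++i)
            tmp[i] += tmp[n - 1 - i];
    }
    return x1 + (vl > 0 ? tmp[0] : 0);
}
```

Both give the same result for integers; for floating point the two orders of addition can round differently.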

_________________
Robert Finch http://www.finitron.ca


Thu Feb 18, 2021 8:21 am

Joined: Wed Feb 03, 2021 7:09 am
Posts: 4
I think that with a very generic instruction format you will find that there are many combinations that either make little/no sense or are superfluous (duplicates). For instance V = S + V can usually be replaced by V = V + S.

The sport, IMO, is to minimize the instruction word size without losing (too much) functionality while still enabling easy decoding. In the process you usually have to sacrifice some functionality, but it's often possible to do so without sacrificing (too much) performance. E.g. large immediates can often be loaded in a separate instruction (often outside of loops) instead of encoding them as part of vector instructions.


Sat Feb 27, 2021 12:47 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Currently the ISA makes use of 56 bits for instructions. I have been wondering about expanding the seven-bit register spec fields to eight bits. The additional bit would indicate that the reg spec field actually holds a seven-bit constant. Expanding the spec fields would use most of a 64-bit instruction word and make it easier to read the register number in machine code. There is really only a handful of instructions where it is useful to specify either a register or a small constant, for instance shift or bit-field instructions. A seven-bit float could also be encoded.
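
A decoder for such an eight-bit field might look like the C sketch below. The bit positions are guesses made for illustration: the new flag is placed in bit 7 and the existing vector/scalar select stays in bit 6. The constant is treated as unsigned here, which is also an assumption.

```c
#include <stdint.h>

/* Hypothetical layout of an 8-bit operand field:
     bit 7     = 1: bits 6..0 are a seven-bit constant
     bit 7     = 0: bit 6 selects vector (1) or scalar (0),
                    bits 5..0 are the register number */
typedef enum {
    OPERAND_SCALAR_REG,
    OPERAND_VECTOR_REG,
    OPERAND_CONSTANT
} operand_kind;

typedef struct {
    operand_kind kind;
    uint8_t value; /* register number or constant */
} operand;

operand decode_operand(uint8_t field)
{
    operand op;
    if (field & 0x80u) {
        op.kind = OPERAND_CONSTANT;
        op.value = field & 0x7Fu;
    } else if (field & 0x40u) {
        op.kind = OPERAND_VECTOR_REG;
        op.value = field & 0x3Fu;
    } else {
        op.kind = OPERAND_SCALAR_REG;
        op.value = field & 0x3Fu;
    }
    return op;
}
```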

_________________
Robert Finch http://www.finitron.ca


Sun Mar 21, 2021 11:07 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Added left and right pair shift instructions. There is lots of room in the instruction to be able to specify the additional register to allow shifting a pair of registers. Shifting a pair of registers allows rotates to be performed in addition to shifts, while using only a single instruction.
Also added field extraction from a pair of registers. Then I realized that extracting a field and a right shift instruction were really two forms of the same instruction. The only difference is that one specifies the field width and the other assumes a maximum field width. So, I eliminated the right pair shift instruction, which may be performed instead with an EXTU instruction.
The extract instructions make use of four source registers.
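
A C model of the idea, assuming 64-bit registers: an unsigned extract from a register pair, with the right pair shift and a rotate both expressed through it. The function names and the argument ranges (offset 0 to 63, width 1 to 64) are choices made for the sketch, not the actual EXTU definition.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned field extract from the 128-bit value hi:lo.
   offset: 0..63 (bit position of the field's LSB in lo)
   width:  1..64 (number of bits kept) */
uint64_t extu_pair(uint64_t hi, uint64_t lo, unsigned offset, unsigned width)
{
    assert(offset < 64 && width >= 1 && width <= 64);
    /* Funnel shift right; offset 0 is special-cased to avoid an
       undefined shift by 64. */
    uint64_t shifted = (offset == 0)
                           ? lo
                           : (lo >> offset) | (hi << (64 - offset));
    uint64_t mask = (width == 64) ? UINT64_MAX
                                  : ((UINT64_C(1) << width) - 1);
    return shifted & mask;
}

/* Right pair shift is the extract with the maximum field width. */
uint64_t shr_pair(uint64_t hi, uint64_t lo, unsigned amount)
{
    return extu_pair(hi, lo, amount, 64);
}

/* Rotate right: a pair shift with the same register in both halves. */
uint64_t ror64(uint64_t x, unsigned amount)
{
    return shr_pair(x, x, amount & 63u);
}
```

The four inputs here (two data registers, offset, width) line up with the four source registers mentioned above.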

Got rid of the register indirect with displacement memory operations. The only difference between those and indexed operations was the size of the constant field and the lack of an index register. Since the constant field is 22 bits for indexed operations, that was deemed large enough for the other operations as well. There is now basically only a single address mode for memory operations, although it is represented as two forms for simplicity.
1a) “Strided” indexed with displacement
EA = d[Ra+Rb * n], where n is the vector element number, Rb is a scalar register or a constant value
1b) “Vector” indexed with displacement
EA = d[Ra+Rb[n]], where n is the vector element number, Rb is a vector register
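
Written out in C, reading d[...] as "displacement d added to the bracketed sum" (that reading of the notation is my assumption), the two forms are:

```c
#include <stdint.h>

/* 1a) Strided: EA = d + Ra + Rb * n
   rb is a scalar register value or a constant (the stride). */
uint64_t ea_strided(int64_t d, uint64_t ra, uint64_t rb, unsigned n)
{
    return (uint64_t)d + ra + rb * (uint64_t)n;
}

/* 1b) Vector indexed: EA = d + Ra + Rb[n]
   rb points at the elements of the index vector register. */
uint64_t ea_vector_indexed(int64_t d, uint64_t ra,
                           const uint64_t *rb, unsigned n)
{
    return (uint64_t)d + ra + rb[n];
}
```

For a scalar access n is 0, so form 1a reduces to d + Ra.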

These are ISA updates only. There is no code yet. Things are moving at a snail's pace.

_________________
Robert Finch http://www.finitron.ca


Mon Apr 05, 2021 2:27 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Added the bounds check (CHK) instruction which will trap if register contents are outside of a bound. While bounds checking can be done with regular compare and branch instructions, a CHK instruction is more code dense which is important considering 64-bit instructions are used.
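
As a sketch of what CHK replaces, here is the check in C. The exact bound semantics are not given above, so lower-inclusive and upper-exclusive is assumed, and the trap is modelled as a flag. All names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of the bounds test. Assumption: lower <= value < upper.
   Real hardware would trap instead of returning false. */
bool chk_in_bounds(int64_t value, int64_t lower, int64_t upper)
{
    return value >= lower && value < upper;
}

/* Array read guarded by the check. Without CHK, the test above costs
   two compare-and-branch pairs per access. */
int64_t load_checked(const int64_t *array, int64_t index,
                     int64_t length, bool *trapped)
{
    *trapped = !chk_in_bounds(index, 0, length);
    return *trapped ? 0 : array[index];
}
```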

The load / store instructions are being worked on. They are turning out to be very powerful. There will be about a dozen basic load / store instructions. There is a single address mode for scalar operations which is scaled indexed addressing with a 22-bit displacement. There are two basic addressing modes for vector operations: strided data access and vector indexed addressing. So, there are three modes each for loads and stores; for vector operations there are also compressed vector loads and stores. Each mode has its own opcode for decoding simplicity. There are potentially many mnemonics for loads and stores since there are four different ALU units each with multiple sizes of operations.

A new address mode was discovered due to the way instructions are encoded: “double vector indexed”. The effective address for this mode is the sum of two vector registers and a constant, EA = d[Va + Vb]. Indexing by a vector register allows scatter and gather operations, and sometimes the instruction mnemonics directly reflect this. The usefulness of being able to index by two vector registers is yet to be seen. It comes about because each register field of an instruction may specify a vector register.
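
A gather using that mode could be modelled in C as below, with byte elements and the effective address used as an offset into a buffer. One possible use is a 2-D lookup, with Va holding row offsets and Vb holding column offsets, though that is just a guess at where it might be handy. The function name is made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Double vector indexed gather: element n is loaded from
   EA[n] = d + Va[n] + Vb[n], used here as a byte offset into mem. */
void gather_double_indexed(uint8_t *dst, const uint8_t *mem, int64_t d,
                           const uint64_t *va, const uint64_t *vb,
                           size_t vl)
{
    for (size_t n = 0; n < vl; ++n)
        dst[n] = mem[(uint64_t)d + va[n] + vb[n]];
}
```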

It is possible to store immediate constants up to seven bits to memory.

Wondering about a data cache bypass bit for loads and stores.

_________________
Robert Finch http://www.finitron.ca


Wed Apr 07, 2021 5:51 am

Joined: Wed Feb 03, 2021 7:09 am
Posts: 4
Regarding shift instructions: Consider extending shift instructions to bit-field instructions. For instance the MC88100 RISC CPU had the instructions "EXT", "EXTU" and "MAK", that can operate as pure ASR, LSR and LSL instructions, but also (optionally) mask the source operand so that you can do shift + and in a single instruction, which is quite common.

I adopted those instructions for MRISC32 (I called the instructions "EBF", "EBFU" and "MKBF")

See:
* http://electro.fisica.unlp.edu.ar/arq/d ... _88000.pdf
* https://mrisc32.bitsnbites.eu/doc/mrisc ... manual.pdf (instructions EBF, EBFU and MKBF).
* https://github.com/mrisc32/mrisc32/issues/119


Wed Apr 07, 2021 8:40 am

Joined: Sun Dec 20, 2020 1:54 pm
Posts: 74
mbitsnbites wrote:
MC88100 RISC CPU had the instructions "EXT", "EXTU" and "MAK", that can operate as pure ASR, LSR and LSL instructions, but also (optionally) mask the source operand so that you can do shift + and in a single instruction, which is quite common.


For what do you use these instructions?


Wed Apr 07, 2021 2:29 pm

Joined: Wed Feb 03, 2021 7:09 am
Posts: 4
DiTBho wrote:
mbitsnbites wrote:
MC88100 RISC CPU had the instructions "EXT", "EXTU" and "MAK", that can operate as pure ASR, LSR and LSL instructions, but also (optionally) mask the source operand so that you can do shift + and in a single instruction, which is quite common.


For what do you use these instructions?


For bit-field operations, which usually means extracting or inserting a chunk of bits from a word that contains packed bit fields. More generally you can combine a shift and a mask operation into a single instruction. For instance consider the following code:

Code:
int foo(int x) {
  return (x >> 3) & 0x3f;
}


In MRISC32 assembler this would be:

Code:
foo:
  ebfu  s1,s1,#(6<<5)|3   ; LSR 3 bits, then keep the low 6 bits
  ret


Likewise, just doing an arithmetic shift (e.g. "return x >> 3;") would work too, like this:

Code:
foo:
  ebf   s1,s1,#3   ; ASR 3 bits, and keep all bits
  ret
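
For readers who prefer C to assembler, here is a rough model of what the two instructions compute. The control word layout (offset in bits 0..4, width in bits 5..9, width 0 meaning "all remaining bits") is inferred only from the two examples above and may not match the real MRISC32 encoding; the `_model` function names are made up, and fields that run past bit 31 are not modelled faithfully.

```c
#include <stdint.h>

/* Unsigned bit-field extract: shift right by offset, keep width bits.
   Assumed control word: bits 4..0 = offset, bits 9..5 = width,
   width 0 = keep everything that is left after the shift. */
uint32_t ebfu_model(uint32_t x, uint32_t ctrl)
{
    uint32_t offset = ctrl & 31u;
    uint32_t width = (ctrl >> 5) & 31u;
    uint32_t shifted = x >> offset;
    return (width == 0) ? shifted : (shifted & ((1u << width) - 1u));
}

/* Signed bit-field extract: same field, sign-extended from its top bit. */
int32_t ebf_model(uint32_t x, uint32_t ctrl)
{
    uint32_t offset = ctrl & 31u;
    uint32_t width = (ctrl >> 5) & 31u;
    if (width == 0)
        width = 32u - offset; /* keep all remaining bits */
    uint32_t field = x >> offset;
    if (width < 32u)
        field &= (1u << width) - 1u;
    /* Sign-extend: flip the field's sign bit, then subtract it. */
    uint32_t sign = 1u << (width - 1u);
    return (int32_t)((field ^ sign) - sign);
}
```

With this model, `ebfu_model(x, (6 << 5) | 3)` matches `(x >> 3) & 0x3f`, and `ebf_model(x, 3)` matches an arithmetic shift right by 3.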


Wed Apr 07, 2021 3:37 pm

Joined: Sun Dec 20, 2020 1:54 pm
Posts: 74
OK, but for which application? I mean, is it useful for crypto loops? Evaluating hashes? Or something else?

I am on an 8-bit machine at the moment. It has only an "unsigned div" but I need a "signed div", so I modified my algorithm to use Arithmetic Shift Right on a 2^x base (x=4..5), and boom, problem solved :D

ASR is useful for this, for example, but why should I have to mask(&sign-ext) the result?
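
One detail worth keeping in mind with the ASR-as-division trick: a plain arithmetic shift rounds toward negative infinity, whereas a truncating divide (like C's `/`) rounds toward zero, so the two differ for negative values that are not exact multiples. A small C sketch of both, with hypothetical helper names and 32-bit values chosen just for the example:

```c
#include <assert.h>
#include <stdint.h>

/* Portable arithmetic shift right; C leaves >> on negative signed
   values implementation-defined, so build it from an unsigned shift. */
int32_t asr32(int32_t x, unsigned k)
{
    assert(k < 32);
    uint32_t ux = (uint32_t)x;
    /* For negative x, fill the k vacated top bits with ones. */
    uint32_t fill = (x < 0 && k > 0) ? ~(UINT32_MAX >> k) : 0u;
    return (int32_t)((ux >> k) | fill);
}

/* Signed division by 2^k that rounds toward zero: add a bias of
   2^k - 1 to negative dividends before shifting. */
int32_t sdiv_pow2(int32_t x, unsigned k)
{
    assert(k < 31);
    int32_t bias = (x < 0) ? (int32_t)((1u << k) - 1u) : 0;
    return asr32(x + bias, k);
}
```

If rounding toward negative infinity is acceptable for the algorithm, the bias step can simply be dropped.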


Wed Apr 07, 2021 4:19 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
As mentioned in a prior post, the EXT and EXTU instructions are present and perform shift pair + and. Shifting a pair of registers allows a bitfield to span an octa-byte boundary, although the compiler does not currently support this feature. It also allows rotate operations to be performed.

There is no need to specify a mask and sign extend result. These options may be wrapped up in mnemonics for the instruction which imply specific values. For instance, using the ASR mnemonic will automatically select the full width mask and sign extension. But ASR is really an alternate mnemonic for the more general EXT instruction.

_________________
Robert Finch http://www.finitron.ca


Thu Apr 08, 2021 3:47 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I am guessing that vector load / store with stride needs only a positive stride. I cannot see a negative stride value being that useful.

It is looking like all the major instructions will fit into a seven-bit opcode; eight bits are reserved for this purpose.

I have been slowly thinking about the pipelining requirements for a simple version of the core to fit in an FPGA. For the vector instructions each element of the vector will probably need to be processed serially. That means a vector element counter is present. Creating an overlapped pipeline will be tricky. It is desirable to have register fetch, decode, execute, and writeback overlapped. There also needs to be a pipeline loop to account for different vector elements.

_________________
Robert Finch http://www.finitron.ca


Mon Apr 12, 2021 2:38 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Some work has been done on the exception handling mechanism for ANY-1. All exceptions cause a transfer of operations to the highest operating mode. From that mode exceptions may be redirected towards lower operating modes. The REX instruction (Redirect EXception) performs a redirection.

Operating modes. The highest operating mode of the core is debug mode, which has access to all registers and features. Modelled after RISC-V, x86 processors from the 286 onward, and others, there are five operating modes: user, supervisor, hypervisor, machine, and debug.
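
A toy C model of the entry-then-redirect flow, for illustration only. The ordering of the enum (user lowest, debug highest) follows the list above; the rule that REX may only target a strictly lower mode is my assumption, as are all the names.

```c
#include <stdbool.h>

/* Operating modes, lowest to highest. */
typedef enum {
    MODE_USER,
    MODE_SUPERVISOR,
    MODE_HYPERVISOR,
    MODE_MACHINE,
    MODE_DEBUG
} op_mode;

/* All exceptions first transfer to the highest operating mode. */
op_mode exception_entry_mode(void)
{
    return MODE_DEBUG;
}

/* Model of a redirect: hand the exception to a lower mode.
   Assumption: the target must be strictly lower than the current mode;
   otherwise nothing changes and false is returned. */
bool rex_redirect(op_mode *current, op_mode target)
{
    if (target >= *current)
        return false;
    *current = target;
    return true;
}
```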

_________________
Robert Finch http://www.finitron.ca


Sat Apr 24, 2021 5:07 am