 ANY-1 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
This project, spurred on by discussions of the CRAY-1, aims to implement a vector machine in an FPGA. Anyone is welcome to join in the ANY-1.

A suggested register file is as follows. It contains more registers than the CRAY-1, since more registers are known to be better up to a point. Ideally 32 generic scalar registers would be supported; however, that may not leave enough room in the instruction for a three-register form. So registers are split between address and data, just as they are on the CRAY. Address registers are limited to 32 bits, since the FPGA board is likely to have less than 1 GB of memory on it. Data registers are 64 bits. Rather than a single vector mask register there are eight. There are 16 vector registers of 64 elements each. 64 elements is convenient because a vector mask register can then be 64 bits wide and transferable to a data register. 64 is also convenient because an element number of 0-63 fits into six bits, which works well with the FPGA's six-input logic elements. 64 is also the number of elements in a CRAY-1 vector register.
The vector length register determines the number of elements of a vector register that are processed.
The vector stride register contains the distance between consecutive elements of a vector stored in memory. It is multiplied by the vector element number to determine the element's offset from the base address.
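As a rough illustration of how the vector length and stride registers would interact, here is a minimal Python sketch. The function name, the byte-granular stride, and the example values are assumptions made for illustration; they are not taken from the ANY-1 design.

```python
def element_addresses(base, stride, vector_length):
    """Return the memory address of each processed vector element.

    base          -- address of element 0 (assumed to come from a register)
    stride        -- distance between consecutive elements, assumed in bytes
    vector_length -- number of elements processed (the vector length register)
    """
    return [base + stride * i for i in range(vector_length)]


# Four 64-bit elements stored back to back.
print([hex(a) for a in element_addresses(0x1000, 8, 4)])
```

A stride larger than the element size walks a column of a row-major matrix, which is the usual reason for having a stride register at all.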
Attachment:
File comment: ANY-1 register file
ANY-1 Register File.png [ 22.16 KiB ]

_________________
Robert Finch http://www.finitron.ca


Wed Jan 20, 2021 9:11 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Off to a good start. Surely the stride register would be 'successively added' to form an address - one would avoid multiplication?


Thu Jan 21, 2021 1:55 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
The stride could be just successively added. I was just seeing if people were awake. The only case I could think of where it might be desirable to multiply occurs when skipping over vector elements during a load or store. If the load/store of some of the elements has been masked off then it would take a clock cycle for every element skipped over if successive addition is used. A six by eight bit multiply can likely be done in a single clock cycle. It is likely a clock cycle for address formation. However, it may be the case that it takes a clock per element anyway unless a bundle of logic is used to skip over multiple elements at once.

Another issue is that successive addition requires a temporary intermediate register to store the intermediate address. Using a multiply means the address does not need to be stored. It could get complicated to use intermediate addresses if there is vector chaining. A separate intermediate register would be required for each chain.

It might be better if the stride value came from another address/data register; that would allow vector chaining. The instruction would look like indexed addressing.
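The trade-off between multiplying and successively adding can be sketched with a toy model in Python. The step counts are only a stand-in for clock cycles, and the multiply version assumes some extra logic (for example a priority encoder on the mask) that jumps straight to the next active element. All names here are hypothetical.

```python
def addresses_by_multiply(base, stride, vl, mask):
    """Form base + stride * i directly for each element enabled in the mask."""
    out, steps = [], 0
    for i in range(vl):
        if (mask >> i) & 1:          # assumed: hardware finds set bits for free
            out.append(base + stride * i)
            steps += 1               # one address formation per active element
    return out, steps


def addresses_by_adding(base, stride, vl, mask):
    """Keep a running address; every element costs a step, masked or not."""
    out, steps = [], 0
    addr = base                      # the temporary intermediate register
    for i in range(vl):
        if (mask >> i) & 1:
            out.append(addr)
        addr += stride               # must advance even over skipped elements
        steps += 1
    return out, steps


print(addresses_by_multiply(0, 8, 4, 0b1001))  # ([0, 24], 2)
print(addresses_by_adding(0, 8, 4, 0b1001))    # ([0, 24], 4)
```

Both produce the same addresses; the difference is the number of steps and the extra state (`addr`) that the adding version has to keep per chain.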

_________________
Robert Finch http://www.finitron.ca


Thu Jan 21, 2021 2:28 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
The suggested instruction format is a fixed 32-bits in size to keep the front end simple. The Cray-1 uses 16-bit or 32-bit parcels. Vector or scalar operation would be determined by bit seven which is in the low order eight bits.
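A minimal Python sketch of what that decode could look like. It assumes bits are numbered from zero and that the opcode occupies bits 0 to 6 directly below the vector flag; the exact layout is in the attached figure, so treat the field positions here as a guess.

```python
def decode_major(instr):
    """Split the low-order eight bits of a 32-bit instruction word.

    Assumed layout: bits 0-6 are the major opcode, bit 7 selects
    vector (1) or scalar (0) operation.
    """
    instr &= 0xFFFFFFFF              # fixed 32-bit instruction
    opcode = instr & 0x7F            # seven opcode bits
    is_vector = bool((instr >> 7) & 1)
    return opcode, is_vector


print(decode_major(0x00000085))  # (5, True)
print(decode_major(0x00000005))  # (5, False)
```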
Attachment:
File comment: ANY-1 Major Opcodes
ANY-1 Major Opcodes.png [ 13.78 KiB ]

_________________
Robert Finch http://www.finitron.ca


Thu Jan 21, 2021 2:48 pm

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
What about floating point? Is that 64 bits internal, or a 64-bit mantissa plus 12(?) bits of sign and exponent?
Ben.


Thu Jan 21, 2021 7:35 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Was that a total of 76-bits? Seems unusual. 80-bit fp is a thought.
I vote for IEEE 64-bit internal to keep things simple. However there should be room left in the instruction set for other options like quad-precision or decimal floating-point. It is one reason why I suggest reserving seven bits for the opcode. It allows a fair number of instructions at the root level.
I see most of floating point being implemented around an FMA instruction, which should do for add, subtract, and multiply. Divide would then use a reciprocal approximation.
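A rough Python sketch of how the four basic operations could hang off a single FMA. Everything here is illustrative: `fma` is emulated with exact rational arithmetic so that it rounds once (finite inputs only, signed zeros ignored), `recip_seed` is a hypothetical low-precision estimate standing in for a reciprocal-approximate instruction, and the Newton-Raphson refinement with four steps is an assumed choice, not something decided in this thread.

```python
import math
from fractions import Fraction


def fma(a, b, c):
    """Emulated fused multiply-add: exact a*b + c, rounded once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def add(a, b):
    return fma(a, 1.0, b)            # a*1 + b


def sub(a, b):
    return fma(a, 1.0, -b)           # a*1 - b


def mul(a, b):
    return fma(a, b, 0.0)            # a*b + 0


def recip_seed(d):
    """Hypothetical reciprocal estimate, good to about 6% (d must be nonzero)."""
    m, e = math.frexp(abs(d))        # abs(d) = m * 2**e with 0.5 <= m < 1
    est = math.ldexp(48.0 / 17.0 - (32.0 / 17.0) * m, -e)
    return math.copysign(est, d)


def recip(d, steps=4):
    """Refine the estimate with Newton-Raphson, two FMAs per step."""
    x = recip_seed(d)
    for _ in range(steps):
        err = fma(-d, x, 1.0)        # err = 1 - d*x
        x = fma(x, err, x)           # x = x + x*err
    return x


def div(a, d):
    return mul(a, recip(d))


print(add(1.5, 2.25), sub(5.0, 2.0), mul(3.0, 4.0), div(1.0, 3.0))
```

The error roughly squares on each refinement step, which is why a crude seed is enough. Note that multiplying by a rounded reciprocal is not guaranteed to give a correctly rounded quotient; a real design would need an extra correction step if IEEE-exact division is required.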

First the integer instructions should be made to work. Then floating-point added in. Or maybe it is better to implement floating-point then integer. Pipelining floating-point ops is bound to be more complex than integer.
**********
Any qualms about the register file? Here are a couple of mine.

1) CRAY-1 just has a single vector mask, but I note x86 vector extensions use eight mask registers. CRAY-2 kept the single vector mask register, so I am wondering if eight is really necessary. Since it will be possible to transfer between a mask register and a data register, multiple masks could be held in data registers, then a single mask register loaded just prior to vector operations. It makes a difference to the number of instruction bits needed to encode a mask register.

2) Splitting the register file between address and data is a somewhat older design. But there are usually about five or six registers permanently allocated to address purposes even when there is only a single file. A0-A3 could be used for link registers.

_________________
Robert Finch http://www.finitron.ca


Fri Jan 22, 2021 4:24 am

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 123
Teaching a compiler to use the address register vs data register split is a pain, but it is doable. With 32-bit instructions, though, I am wondering if it's necessary to have that split? Couldn't you just have 32 completely general purpose registers? I guess they would have to all be 64 bits then....

But it makes the instruction set a lot more orthogonal when there isn't a split. Otherwise you have to have separate load, store, move and add instructions for the address registers and I found it got annoying.

On the other hand this is a vector computer and maybe a compiler isn't on the horizon. Maybe it's better just to make it enjoyable to program in assembly and not worry about compilers. If you want to do serious SIMD programs on x86 you generally do it in assembly -- compilers can "vectorize" your program but it's fiddly and often much better to be done by hand still.

So then, if there's no compiler, the only consideration would be the extra opcodes needed to load/store/move/add the address registers. I guess for those instructions you could widen the register fields to 5 bits so only those instructions would see 32 registers. That does mean a sacrifice of a couple bits of the immediate on load/store though. Probably fine?


Fri Jan 22, 2021 12:38 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
32 regs is a great idea. I would prefer 32 GPRs myself over splitting address and data. The issue then becomes one of encoding things into 32 bits. It may be better to use a 40-bit instruction with wider register specs. Code would probably be about 20% less dense, but the same number of instructions could be fetched per cycle.

A second issue is the desire to have something that looks a bit like the CRAY. This would help with porting software. Maybe some CRAY programmer could be attracted. I suppose different designators could be used for parts of a gpr register file as is done for RISC-V. A gpr register file is likely to be split into at least two separate register files to get high-performance loading of multiple registers at once.

I sketched out a couple of instruction formats for load/stores and FMA. As can be seen, they depend on four-bit register specs. It would not be possible to encode the round mode and mask register in the FMA instruction if register specs were five bits. It would also leave only 11 bits for the displacement constant in memory instructions, which is starting to get too small.

However, things could be made to fit if the round mode and vector mask spec were left out of the instruction.

One instruction I would like to see that just does not fit is the fused dot product instruction which requires four source registers. It could be made to fit with a 40-bit instruction.
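For reference, a small Python sketch of what a fused dot product buys over two separate operations. It assumes the usual meaning, a*b + c*d computed exactly and rounded once, and emulates that with exact rational arithmetic (finite inputs only). The function names are made up for illustration.

```python
from fractions import Fraction


def fused_dot2(a, b, c, d):
    """a*b + c*d from four sources, computed exactly, then rounded once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c) * Fraction(d))


def unfused_dot2(a, b, c, d):
    """Same expression with each multiply and the add rounded separately."""
    return a * b + c * d


a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
# The exact product a*b is 1 - 2**-60, which rounds to 1.0 on its own.
print(unfused_dot2(a, b, -1.0, 1.0))  # 0.0
print(fused_dot2(a, b, -1.0, 1.0))    # -8.673617379884035e-19
```

The unfused version loses the whole result to cancellation, which is the case the instruction is meant to help with.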

Attachment:
File comment: ANY-1 Vector FMA Proposed
ANY-1 Vector FMA.png [ 19.89 KiB ]

Attachment:
File comment: ANY-1 Vector load_store proposed instr.
ANY-1 Vector Load_Store.png [ 44.83 KiB ]

_________________
Robert Finch http://www.finitron.ca


Fri Jan 22, 2021 10:11 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I have created a Github repository for this project:

https://github.com/robfinch/ANY-1

_________________
Robert Finch http://www.finitron.ca


Sat Jan 23, 2021 9:08 am

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 123
The issue with 40-bit instructions might be alignment. At least with 32 or 64 bits you can ignore the last few bits or zero them out to get alignment. An alignment of 5 bytes is awkward, especially if you're trying to calculate a jump and you need to figure out if the address is aligned with code. 5 is 101 in binary... so it's not a simple AND (or shift). Likely you need to do a modulus (or multiply). I guess that could be fixed by making a special program memory address space that is only word addressable... then the bit width doesn't matter at all.
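The alignment point can be shown in a few lines of Python. The helper names are invented for this sketch, and the word-addressed scheme is just one reading of the "special program memory address space" idea.

```python
INSTR_BYTES = 5  # one 40-bit instruction


def aligned_pow2(addr, size):
    """Alignment check for a power-of-two size: a single AND."""
    return addr & (size - 1) == 0


def aligned_40bit(byte_addr):
    """Alignment check for 5-byte instructions: needs a modulus."""
    return byte_addr % INSTR_BYTES == 0


def word_to_byte(word_index):
    """With word-addressed program memory, every index is a valid
    instruction; a multiply is needed only to reach byte memory."""
    return word_index * INSTR_BYTES


print(aligned_pow2(0x1004, 4), aligned_40bit(15), aligned_40bit(16))
print(word_to_byte(3))
```

In the word-addressed case a computed jump target can never be misaligned, so the check disappears entirely and only the fetch path pays for the multiply by five.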

So I would vote either for 64-bit instructions, because, well, this is a vector processor, not a microcontroller. This will give plenty of room for interesting and orthogonal instructions, and later a "compressed" ISA could be designed and maybe the prefetch unit could expand compressed instructions to 64 bits so the main decoder doesn't have to know about them. If you go this route, I would recommend reserving a couple of bits at each 16-bit (or 32-bit) boundary for future use by the compressed ISA -- those bits will come in handy for auto-aligning (or alignment faults) when doing calculated jumps into compressed code.

Or just cut features to make them fit. Since the Cray had separate address and data registers, if your goal is to try to port Cray software to the ANY-1, then it seems reasonable to keep the separated registers.

I guess another option is prefix instructions that carry some state into the next instruction. You have to have extra logic to ensure an interrupt can't happen between the prefix and the following modified instruction. I didn't get as far as designing the interrupts in my CPU to know whether that is annoying or not, but I believe the xr16 did this and it was fine. But if you split the parameters between the prefix and the modified instruction right you can reduce the number of read ports required on the register file, presuming the prefix instruction doesn't get fused in the prefetch.

Anyway, TL;DR: I would recommend either 64-bit instructions with some reserved bits for a future compressed ISA, or sticking with 32 bits and the address/data register split. The latter is likely better for porting Cray software, but the former may end up better for making the ANY-1 fun to write assembly for, since it would afford more powerful instructions.


Sat Jan 23, 2021 12:59 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I think 64-bits would be a good size to package instructions in. If instructions could be made to be less than 64-bits wide, say 48 bits (TBD), then only 48 of 64 bits would need to be implemented in the cache so there would be no wasted space. There would be wasted bits for instructions in main memory, but main memory is cheap.

If going with 64 bits I would also suggest more registers. More vector registers in a vector machine is probably not a bad idea. Since the register files are unified, having more registers might be handy: part of the register file could be allocated for float and part for integer. How about six-bit register specs? 64 entries is also the most that a single FPGA LUT can hold when used as RAM.

64-bit would also allow larger constants to be formed in fewer instructions.

I would say go for a 64-bit ISA with 32-bit compressed instructions, but I think having one fixed size is extremely valuable if one wants to process more than one instruction per clock. ARM went back to a single fixed instruction size for its 64-bit architecture.

PS. 40-bits is not too bad to work with. Jumps and calls can be aligned on 16-byte boundaries.
Attachment:
File comment: ANY-1 major opcode format
ANY-1 Major Opcodes.png [ 14.82 KiB ]

_________________
Robert Finch http://www.finitron.ca


Sat Jan 23, 2021 2:37 pm

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 123
Yeah, I mean, this is probably not going to fit in a low end FPGA, and most medium range FPGAs will have a few megabytes of RAM on them. This computer probably isn't destined for Linux or any large operating system, so I really don't think that the size of the program should be a factor at all other than maybe some consideration for memory bandwidth. But I would imagine a vector processor will sit there in a tight loop running some math heavy something or other, so I don't even think memory bandwidth will be an issue if there's a bit of cache. So, I think 64-bit instructions are totally fine.

An interesting point is that the Cray had a "manager" CPU that would send jobs to the "supercomputer" CPU, and the OS of the Cray ran on that "manager" CPU. I wonder if we should do the same -- have a RISC-V (or an ARM microcontroller for those boards that have one) that runs a monitor and job manager software, which can be written in C, and then the vector processor could more easily just run an assembly language program. That could mean most of the boilerplate and annoying parts of programming in assembly -- writing a bootloader, monitor, and figuring out how to load programs from storage, etc. -- could be skipped and done on the manager CPU. Maybe there could be a way for the manager CPU to do a memory bus request so it can copy a program directly to memory, trigger it to run, and fetch the result from memory afterwards.

So then, if we go that direction, I think the vector processor should focus on being really nice to program in assembly. To that end, I guess the number of registers it should have should be chosen around that goal. How many registers does one need when writing assembly by hand? I guess I am not really sure since I haven't written any code for a vector processor before, nor have I done much SIMD on x86 either.


Sun Jan 24, 2021 7:13 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
this is probably not going to fit in a low end FPGA, and most medium range FPGAs will have a few megabytes of RAM on them
This definitely will not fit into a low end FPGA. I designed a CPU supporting vector operations that was somewhat similar and it turned out to be something like 500,000 logic cells IIRC. I am hoping this one (or a pared down version) will be able to fit in a xc7a200t (200,000 logic cells).

Quote:
I think the vector processor should focus on being really nice to program in assembly.
The human factor is something to keep in mind. I wrote a fair bit of 6502 and 80x88 assembler so I am somewhat biased to a small register set. But I think the machine should stick to a modern design. Lots of registers for a compiler to use, even if an assembly language programmer cannot make use of them all.

Quote:
I guess I am not really sure since I haven't written any code for a vector processor before, nor have I done much SIMD on x86 either.
Neither have I. I think that is what makes this a good project for me to work on.

I like the manager CPU idea. Some of the ZYNQ FPGAs have ARM CPUs built in, but they just do not have enough logic resources (at least the low cost ones) for this project. Any manager CPU must be small to fit into the same FPGA as the vector CPU.

_________________
Robert Finch http://www.finitron.ca


Mon Jan 25, 2021 5:56 pm

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 123
serv / servant is a tiny, slow, bit-serial RISC-V that's probably plenty fast enough for a monitor and bootloader, and I hear it's crazy small.

I did manage to stumble upon another processor inspired by the Cray-1 -- MRISC32. I think the key takeaway from his effort was that all registers should be able to contain floats and integers, and also that all scalar operations should be possible on vectors.

So if we have a 6 bit register field, the lower 32 registers could be scalars and the upper 32 be vectors. If an operand of an instruction (like a jump for example) doesn't make sense to take a vector, then that could be a 5 bit field limiting it to the scalar registers. And, of course, floating point ops would also work on both vectors and scalars.
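A minimal Python sketch of that register-field split. The names and the choice of the top bit as the scalar/vector selector are assumptions that follow directly from "lower 32 scalar, upper 32 vector"; nothing here is a settled encoding.

```python
def decode_reg6(spec):
    """Decode a 6-bit register field: 0-31 scalar, 32-63 vector."""
    if not 0 <= spec < 64:
        raise ValueError("register spec must fit in six bits")
    kind = "vector" if spec & 0x20 else "scalar"
    return kind, spec & 0x1F         # register number within its file


def decode_reg5(spec):
    """A 5-bit field (e.g. a jump target register) can only name scalars."""
    if not 0 <= spec < 32:
        raise ValueError("register spec must fit in five bits")
    return "scalar", spec


print(decode_reg6(5))    # ('scalar', 5)
print(decode_reg6(37))   # ('vector', 5)
```

One consequence is that the vector/scalar decision moves out of the opcode and into the operands, so a single opcode can cover scalar-scalar, vector-scalar, and vector-vector forms.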

I think, if this is possible, it would reduce the number of opcodes you would need to learn. It could also act as a powerful orthogonalizing force on the ISA to consider if each operand could be a scalar or a vector, what would that look like, and how to make it so more instructions can take vectors.

Take SLT (Set Less Than) for example. I think in RISC-V it sets the destination register to 1 if the first source register is less than the second, and to 0 otherwise. It might make sense, like the MRISC32, to make this instruction set the register to -1 (all ones); that way it can be used as a mask. And maybe if the destination is a scalar, it could set each bit to 1 depending on the result of the less-than operation on the corresponding vector element? Then maybe any scalar register could be a vector mask register? And if so, maybe any scalar operation can take a mask to mask its bits in whatever operation it's involved in? This could give you byte, short, and 32-bit operations for free by masking out the upper bits?
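The ideas in that paragraph can be sketched in Python as follows. This is only one interpretation: the function names, the 64-bit register width, and the rule that masked-off elements keep their old value are all assumptions.

```python
MASK64 = (1 << 64) - 1


def slt_scalar(a, b):
    """Scalar SLT returning all ones (-1) instead of 1, so it works as a mask."""
    return MASK64 if a < b else 0


def slt_vector_to_mask(va, vb):
    """Vector compare into a scalar: bit i is set when va[i] < vb[i]."""
    mask = 0
    for i, (x, y) in enumerate(zip(va, vb)):
        if x < y:
            mask |= 1 << i
    return mask


def masked_add(va, vb, vdst, mask):
    """Add only the elements enabled in the mask; others keep vdst's value."""
    return [
        (x + y) & MASK64 if (mask >> i) & 1 else old
        for i, (x, y, old) in enumerate(zip(va, vb, vdst))
    ]


m = slt_vector_to_mask([1, 5, 2], [4, 4, 4])
print(bin(m))                                          # 0b101
print(masked_add([1, 2, 3], [10, 10, 10], [0, 0, 0], m))  # [11, 0, 13]
```

With 64-element vectors and 64-bit scalar registers the mask fits exactly, which is the same convenience noted for the mask registers in the first post.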


Tue Jan 26, 2021 1:35 pm

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
Could a sign-extend instruction (32 bits) work instead?
Sgn32 %A %B
if sign %A is true set %B #FFFFFFFF else SET %B #0
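One way to read the Sgn32 pseudocode above, as a small Python sketch. It assumes the sign is bit 31 of a 32-bit source and that the result is written to the second operand; both are guesses at the intent.

```python
def sgn32(a):
    """Return 0xFFFFFFFF if bit 31 of a is set, otherwise 0."""
    return 0xFFFFFFFF if (a >> 31) & 1 else 0


print(hex(sgn32(0x80000000)))  # 0xffffffff
print(hex(sgn32(0x7FFFFFFF)))  # 0x0
```

This is the same result as an arithmetic shift right by 31 on a 32-bit value, so it may not need its own opcode.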


Tue Jan 26, 2021 6:30 pm