Last visit was: Sun Sep 19, 2021 11:16 am
It is currently Sun Sep 19, 2021 11:16 am



 [ 237 posts ]  Go to page 1, 2, 3, 4, 5 ... 16  Next
 rj16 - a homebrew 16-bit cpu 
Author Message

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 123
Yeah, I named it after myself.... naming things is hard, forgive me.

I got into this rabbit hole by watching Ben Eater build an 8 bit cpu on breadboards on youtube. When he got to the control logic a light bulb went on, and I was gripped by a strong desire to build my own. And so I did, but I gave it 64 KB of memory instead of 16 bytes. The issue though is it didn't have enough registers to be able to do the dance required for indirect memory addressing (loading a pointer from memory and looking up what it pointed to). It could, if I modified the program's instructions as it ran... but... well... let's say it's mostly finished in a box in storage for now.

So that got me thinking I should probably design the ISA for my next CPU before I start building it, especially if I want to build it on breadboards again. That way I don't run into a problem where I miss something fundamental like having pointers. So I started researching and reading the programmers manuals for tons of CPUs, hackaday projects, and eventually I found this forum where John Lluch is documenting his CPU74 project. So thank you again John!

Originally I was most heavily influenced by xr16, CPU74, ARM, RISC-V, and the m68k. It had 8 general purpose registers and 8 address registers, and the ISA had 3 operand instructions. But then I found LCC, read the book, and started trying to teach a compiler about the difference between general purpose registers and addressing registers and well... It can be done, but it's painful to say the least. The advice to keep registers as orthogonal as possible is good advice if you want a C compiler to be easy. But I felt like having 16 registers was important for a CPU named rj16.

So I figured, the SH-2 is out of patent now, which is a 2 operand machine, and I discovered LCC is fairly optimized for 2 operand machines (of which x86 is one) and can remove a lot of the move instructions you might think would be everywhere and slow programs down. So with the extra encoding space freed up by ditching an operand, I moved the design a lot closer to SH-2 and I would say that's now the stronger influence. Although it's still very different.

It now has 16 general purpose 16-bit registers. It doesn't have a zero register any more. It has stack relative, PC relative, absolute and indirect-absolute addressing. Pointers are currently 16-bit. It has a harvard architecture. If it doesn't cause problems, the code address space will be word addressed, and the data address space is byte addressed. So 128 KB of code, 64 KB of data. I am thinking about ways to expand that. And the current revision (v7) of the ISA has 7 and 9 bit immediates (16-bit with a prefix instruction) with opcode encoding space for 108 instructions.

This has been a project I have worked on for a few months a year ago and now I have added a few more months to it this year. I keep re-designing the ISA over and over because I find something I can't resist trying to fix. I do have a mostly working 5-stage pipelined simulator written in Go for the previous revision, v6, of the ISA. I have an assembler based on customasm, and a way to generate the rule definitions customasm requires to generate the binary data from an ISA description written in Go. The simulator is driven from the ISA description too. I can also generate a microcode rom image, though I haven't used that yet. And there's a LCC machine description file, so it has a C compiler, but no linker or other bintools yet.

And I recorded tons of video of me building the Go code and commentary on each revision of the ISA. But I should probably try to edit that video down before uploading it to youtube, and I have no idea when I will get around to that. But so far recording videos has helped reload the context into my brain after taking a break from the project for many months, so I probably wouldn't be still working on it if I hadn't done that, so bonus there.

In future posts I will describe the current ISA revision and my current thoughts of the direction I want to go.

EDIT: Click here to jump to the latest version of the ISA, a 32-bit version (though the machine is still planned to be mostly 16-bits)


Last edited by rj45 on Thu Dec 31, 2020 12:45 am, edited 1 time in total.



Mon Dec 07, 2020 11:58 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1632
rj45 wrote:
But then I found LCC, read the book, and started trying to teach a compiler about the difference between general purpose registers and addressing registers and well... It can be done, but it's painful to say the least. The advice to keep registers as orthogonal as possible is good advice if you want a C compiler to be easy. But I felt like having 16 registers was important for a CPU named rj16.

Great - thanks for sharing your thinking.
Quote:
the code address space will be word addressed, and the data address space is byte addressed.

Interesting!
Quote:
I do have a mostly working 5-stage pipelined simulator written in Go for the previous revision, v6, of the ISA. I have an assembler based on customasm, and a way to generate the rule definitions customasm requires to generate the binary data from an ISA description written in Go. The simulator is driven from the ISA description too. I can also generate a microcode rom image, though I haven't used that yet. And there's a LCC machine description file, so it has a C compiler, but no linker or other bintools yet.

Simulator, assembler, C compiler - very strong position for experimenting! I look forward to following developments.


Mon Dec 07, 2020 1:21 pm
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
Hi Rj45,

It's great that you finally decided to start this thread. I will be following it with interest, and for now I already can't wait for the time you get your videos ready to share :).

Quote:
Originally I was most heavily influenced by xr16, CPU74, ARM, RISC-V, and the m68k. It had 8 general purpose registers and 8 address registers, and the ISA had 3 operand instructions. But then I found LCC, read the book, and started trying to teach a compiler about the difference between general purpose registers and addressing registers and well... It can be done, but it's painful to say the least. The advice to keep registers as orthogonal as possible is good advice if you want a C compiler to be easy. But I felt like having 16 registers was important for a CPU named rj16.

So I figured, the SH-2 is out of patent now, which is a 2 operand machine, and I discovered LCC is fairly optimized for 2 operand machines (of which x86 is one) and can remove a lot of the move instructions you might think would be everywhere and slow programs down. So with the extra encoding space freed up by ditching an operand, I moved the design a lot closer to SH-2 and I would say that's now the stronger influence. Although it's still very different.


The balance between more registers and two operand instructions versus less registers and three operand instructions is interesting. Of course if we are going to make 32 bit CPU with 32 bit long instructions, the choice is very clear: "lots of registers, and 3 operand instructions", but for 16 bit instructions, what's the optimal choice is highly debatable. I did think of that for a long time and although I finally opted for 3-operands 8-registers, I've been more than tempted in numerous occasions to change it all for a 2-operands, 16-registers architecture, just to be able to increase the length of immediate fields. My ultimate decision was based on what I believed that would produce more compact code, and the consideration of the amount of register ics that would be required to implement it in real hardware.

Another interesting consideration is whether the PC and the SP should be part of the general purpose registers or not. I suppose that if you have 16 registers in total, it is just easier and more convenient to dedicate a couple of them for these purposes. That's in fact the most common approach in RISC machines (and some very remarkable ciscs, such as the VAX-11). This helps reusing instruction encodings for functions that otherwise would require explicit instructions. Some processors go even further and include the SR as part of the general set, for example the MSP430, which I guess that has it's advantages too. Since I only have 8 registers I decided to place the PC and the SP apart from the general set, so this means that in reality I have 10 registers if I account for that. This also means that I need explicit instructions to work with stack frames which makes the isa less orthogonal.

As you point out, the "Harvard" model enables for different addressing sizes for program memory and data memory, so in effect with 16 bit registers you can get 128K of word addressed program, concurrently with 64K of byte accessed data. One consideration to be made, is that if your isa is very orthogonal to the point of including the PC as part of the general register set, and you share instruction encodings for PC related operations, then the compiler must be aware of the different addressing sizes of both memories.

All very interesting!


Mon Dec 07, 2020 5:14 pm

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 123
joanlluch wrote:
The balance between more registers and two operand instructions versus less registers and three operand instructions is interesting. Of course if we are going to make 32 bit CPU with 32 bit long instructions, the choice is very clear: "lots of registers, and 3 operand instructions", but for 16 bit instructions, what's the optimal choice is highly debatable. I did think of that for a long time and although I finally opted for 3-operands 8-registers, I've been more than tempted in numerous occasions to change it all for a 2-operands, 16-registers architecture, just to be able to increase the length of immediate fields. My ultimate decision was based on what I believed that would produce more compact code, and the consideration of the amount of register ics that would be required to implement it in real hardware.


I am glad you stuck with the 3 operands, in fact I really like the ISA you came up with. I haven't actually seen too many 16-bit processors do that. And I think you're right, it will be more compact, especially considering you have an LLVM compiler with all of its optimization power as well. I already compromised quite a bit on the compactness front, what with using LCC which doesn't have as many powerful optimizations, as well as other design choices I will get into in future posts. I think I will compensate by coming up with a way to allow programs to use more memory. Still thinking of the best way to do that though.

joanlluch wrote:
Another interesting consideration is whether the PC and the SP should be part of the general purpose registers or not. I suppose that if you have 16 registers in total, it is just easier and more convenient to dedicate a couple of them for these purposes. That's in fact the most common approach in RISC machines (and some very remarkable ciscs, such as the VAX-11). This helps reusing instruction encodings for functions that otherwise would require explicit instructions. Some processors go even further and include the SR as part of the general set, for example the MSP430, which I guess that has it's advantages too. Since I only have 8 registers I decided to place the PC and the SP apart from the general set, so this means that in reality I have 10 registers if I account for that. This also means that I need explicit instructions to work with stack frames which makes the isa less orthogonal.

As you point out, the "Harvard" model enables for different addressing sizes for program memory and data memory, so in effect with 16 bit registers you can get 128K of word addressed program, concurrently with 64K of byte accessed data. One consideration to be made, is that if your isa is very orthogonal to the point of including the PC as part of the general register set, and you share instruction encodings for PC related operations, then the compiler must be aware of the different addressing sizes of both memories.


So, PC is separate because I plan to use a pair of dual port memories for the registers to reduce the chip / LUT count. It would be awesome to have PC in the register file somehow, but I have no idea how that would work with pipelining. That doesn't mean I couldn't route the PC to the ALU inputs when a specific register number is addressed. I would have to think about that. But for now, I just have an "lpa" instruction that lets you load a program address to a register you can later jump to or call.

Currently the thought is to move program literals into data memory and I think that will skirt problems with teaching the compiler that program memory is word addressed not byte addressed. The current design just keeps literals in a separate ROM, so there's no way to read or write program memory other than executing instructions. But if I want a bootloader or OS, I will need to add some way to write programs to RAM somewhere and execute them.

I have really debated pulling the SP out of the register file as well. I already have special instructions for SP relative addressing because I wanted to stick with 2 operands and stack-relative operations are really common. So you have to use register 15 as a stack pointer because of the special instructions. If I did move it out of the register file, then I might be able to move the address calculation into the decode stage of the pipeline since the stack is the only relative addressing I have currently. It's a lot of rework, but it might be worthwhile since it would allow me to move memory access into the execute stage and shave a stage off the pipeline which simplifies a lot.

On the other hand, ZipCPU implements supervisor mode by having a supervisor register bank, and it swaps banks on interrupt/syscall. That seems really appealing, especially seeing how xv6 has to do a lot of work to manage the registers when switching modes. And if the register files are dual port memories, in theory I could have multiple register banks for fast switching between a number of tasks. But then I need to deal with the PC not being banked along with the other registers.... Much thinking still to do on that.


Mon Dec 07, 2020 10:01 pm

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 123
So, I will describe some of the ISA of rj16. Feel free to continue above conversations.

EDIT: Click here to jump to the latest version of the ISA, a 32-bit version (though the machine is still planned to be mostly 16-bits)

Firstly I want to say, if you want to use this design in your own work, commercial or not, go ahead. Consider it MIT licensed. I would love if you let me know, but you don't have to. It's taken me many, many, many hours of thinking to come up with what I think is a good ISA, and I would love if that work is useful to others.

Also, I haven't implemented this ISA yet, and usually when I do that I make tweaks to fix various things I missed, so a few revisions will probably happen.

Here is the instruction formats. All instructions are 16-bits wide. There's one prefix instruction that extends the immediate to 16 bits.

Attachment:
isa_encodings_v7.png


op0-3 are opcode fields. Specific values (prefix codes) select which instruction format is in use, and which of those op field bits form the 7-bit opcode. I managed, after many hours, to simplify it to just 3 special codes -- op0 == 00, op1 == 000 and op1 == 100, and only the first two are used to select the opcode bits from the instruction. That's depicted in the "Opcode Space" box. So the pre-decode logic should be just 2 nor gates and a mux.

With 7-bit opcodes, there's 128 possibilities, but the special prefix codes eliminate 20 of them, so there's a max of 108 instructions. I don't think I will need that many, but there's room to grow if necessary. And given this is a 2 operand machine, there's a lot more specialized instructions that would disappear if there was 3 operands.

rd is the first source register and/or the destination register. rs is the second source register (or a CSR).

For the immediates, they are always sign extended and word-memory instructions shift left 1. If the special imm prefix instruction appears before an instruction, instead of sign extending the immediate in the next instruction, the imm instruction's immediate value forms the high bits. The extended immediate is forgotten after one instruction passes, though. When I get to implementing interrupts, the imm prefix will disable interrupts for a cycle to make sure the immediate can't be paired with some other instruction, and it will execute properly before an interrupt.

For the i9 format, the opcode is indeed 3 bits, so each instruction is encoded twice in the microcode to account for either value of the sign bit. This simplifies the logic for converting from instruction fields to the opcode.

One thing you will notice is that there's no two-register plus immediate format. So you cannot, then, have relative addressing off of an arbitrary register. But there are special instructions to allow you to do relative addressing off the stack pointer, and I am debating on adding this also for a base pointer to make shared libraries easier to implement.

And this is where the instruction compactness has been compromised -- the C compiler generates a lot of pointer arithmetic to do relative addressing off an arbitrary pointer, such as a pointer to a struct. I do have way more opcodes than I need in the register-register format, so I could possibly add a format for register-register immediate, but then the immediate would only be three bits. So those instructions would probably be often prefixed and would be two instructions anyway. In other words, you could think of the add to a temp register to do the relative addressing as a prefix instruction on a load/store.

The big win here, though, over previous versions of the ISA is figuring out a way to get 24 register-immediate instructions. That gives me lots of room to make all the ALU operations orthogonal -- you can do both register-immediate and register-register on all of them, and there's still space for a variety of memory load/store instructions.

In a post coming in the near future, I will share the preliminary opcode map. It's still very much a work in progress though, as I think I want to support proper OSes on this CPU, and I am still learning what instructions and processing modes would be required for that.


You do not have the required permissions to view the files attached to this post.


Last edited by rj45 on Thu Dec 31, 2020 12:46 am, edited 1 time in total.



Tue Dec 08, 2020 2:43 pm
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
Your instruction formats and encodings look really nice, indeed. And as you point out, decoding should be pretty straightforward.

From your description, I do not quite see the format of the "prefix" instruction. I assume it is one of the i9 format, as this is what it seems to be required to complete the 7 bit immediate field of the register-immediate format instructions.

Quote:
In a post coming in the near future, I will share the preliminary opcode map. It's still very much a work in progress though, as I think I want to support proper OSes on this CPU, and I am still learning what instructions and processing modes would be required for that.


For running any sort of compiled code and support "proper" oses, you need at least both byte and word load/stores to data memory and the ability to sign/zero extend bytes either as part or the load/store instruction or as a separate instruction working on registers. This is something that tends to be systematically forgotten by all the homebrew designs that I had the opportunity to look at in the past, although I think that since you already apply sign extensions on the immediate fields, you have already considered this.

Also, if you want to get efficient code from compilers the overall instruction set is very important. For example, compilers tend to love multiple shift instructions. If my experience with the LLVM compiler has to serve on this, I can tell that the compiler uses shifts to implement unbeilable target independent optimisations like crazy, but these optimisations are often undesired and improper on architectures not supporting multiple shifts. For example, the compiler may decide to replace a comparison to the constant 32 (or any 2s complement constant btw), by a 10 bit shift left followed by a 15 bit arithmetic shift right on a 16 bit architecture. Very nice and astute, but look at what happens for the AVR architecture:

[C Code]
Code:
int testSExtICmp_1( int x )  // 1274 (InstCombineCasts:transformSExtICmp)
{
  return (x & 32) == 0 ? -1 : 0;
}


[AVR assembly]
Code:
.globl   testSExtICmp_0
   .p2align   1
   .type   testSExtICmp_0,@function
testSExtICmp_0:
   lsl   r24
   rol   r25
   lsl   r24
   rol   r25
   lsl   r24
   rol   r25
   lsl   r24
   rol   r25
   lsl   r24
   rol   r25
   lsl   r24
   rol   r25
   lsl   r24
   rol   r25
   lsl   r24
   rol   r25
   lsl   r24
   rol   r25
   lsl   r24
   rol   r25
   asr   r25
   ror   r24
   asr   r25
   ror   r24
   asr   r25
   ror   r24
   asr   r25
   ror   r24
   asr   r25
   ror   r24
   asr   r25
   ror   r24
   asr   r25
   ror   r24
   asr   r25
   ror   r24
   asr   r25
   ror   r24
   asr   r25
   ror   r24
   asr   r25
   ror   r24
   asr   r25
   ror   r24
   asr   r25
   ror   r24
   asr   r25
   ror   r24
   asr   r25
   ror   r24
   ret

Well, the known 32 bit processors supporting cheap multiple shifts get that done in only two instructions, a shift left, followed by a shift right, which is the purpose of the optimisation after all, but you just have seen what happened when multiple shifts are not supported on a particular architecture.

I can tell that I had to fight the compiler very hard to prevent this and MANY other shifts related optimisations. There are literally hundreds of cases similar to this. Just for reference, this is what the CPU74 backend implementation produces instead, which has a hooks to prevent such optimisation:

[CPU74]
Code:
   .globl   testSExtICmp_0
testSExtICmp_0:
   and   r0, 32L, r0
   mov   -1, r0
   selcc   0, r0, r0
   ret


And this is the actual intend of the compiler, as shown for the 32 bit ARM

[ARM 32bit]
Code:
   .globl   _testSExtICmp_0         @ -- Begin function testSExtICmp_0
_testSExtICmp_0:
@ %bb.0:                                @ %entry
   lsl   r0, r0, #26
   asr   r0, r0, #31
   mov   pc, lr

As you see, it's just a couple of shifts.

Corollary, is that choosing the right set of instructions to get efficient code of a state of the art compiler is particularly tricky. Less evolved compilers may not be that intrusive on such scenarios, but they would produce overall worse assembly code in most general cases.


Last edited by joanlluch on Tue Dec 08, 2020 6:42 pm, edited 1 time in total.



Tue Dec 08, 2020 3:43 pm

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 123
Yes, I am strongly considering a barrel shifter as a result of this very line of reasoning. LCC doesn't have nearly as many optimizations, but it still does some such things. And maybe one day I will learn LLVM and do a backend there.

I did see a fantastic idea on simplifying barrel shifters in Robert Baruch's recent LMARV reboot videos. The idea is to just implement a single direction barrel shifter, and then put a bit reverser on at the beginning and end of it, which should be just a mux. PCB routing might be really annoying, but solvable. If I am right that could make it just 6 levels of 16-bit muxes, a 1-bit, 2-bit, 4-bit and 8-bit shifter mux plus two reversers.

And as far as a ttl version, I am wondering if it would be possible forgo muxes and use tri-state buffers. That would be even fewer chips. Care would need to be taken with balancing the buffer control signal timing, but I think it would work. I wish I had an oscilloscope so I could build a model and try switching it at high speed to see if it causes any power rail spikes.

I do want to keep this design open to a TTL version, and initially I really didn't want any barrel shifter because a bidirectional one is way too many chips IMHO. But the reverser idea may just change my mind, especially if this could be done with tristate buffers.


Tue Dec 08, 2020 6:16 pm
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
Quote:
I did see a fantastic idea on simplifying barrel shifters in Robert Baruch's recent LMARV reboot videos. The idea is to just implement a single direction barrel shifter, and then put a bit reverser on at the beginning and end of it, which should be just a mux. PCB routing might be really annoying, but solvable. If I am right that could make it just 6 levels of 16-bit muxes, a 1-bit, 2-bit, 4-bit and 8-bit shifter mux plus two reversers.
Well, that guy acts a bit disorganised in my opinion, in Catalan we say something to the effect of "building the house from the roof", or "hitting with the stick of a blind", but he has certainly good ideas and watching his videos is no doubt quite entertaining, he always looks happy too!. [As a bonus he speaks slowly enough that I somehow I am able to understand about 50% of his speech with youtube subtitles disabled (as I'm only able to use the written form of the English language)]

So another feature that compilers make good use of, is a memory addressing mode of the form
Code:
Base + Scale*Index + Displacement
, but of course having all that in single instructions is impossible with 16 bit encodings.

- The "scale" factor is in some cases a generic shifter inserted before the alu, one of the first processors that featured that was the ARM1. Or just an alu feature supporting power of 2 shifts, up to a factor of 8 in a 64 bit processor like the X86-64.

- ''base' is a register that is used to point base addresses of global structs or arrays, or the stack frame

- 'Index' is also a register that is used as an array index that is scaled by the 'scale' factor to access any supported data size

- 'displacement' is a constant value, usually embedded in an instruction encoding field, that is used to access struct members, constant array indices, function arguments passed on the stack, local variables or register spills in a stack frame.

Some processors also support post/pre increments/decrements, in the old 68000 fashion, but I found that compilers do not tend to make a proper use of them, particularly if the above mentioned addressing mode is available.


Tue Dec 08, 2020 11:50 pm

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 123
Here are the currently planned load/store instructions with their opcode assignments:

Attachment:
load-store_before.png


There's 7 bit immediates on both absolute memory instructions as well as stack-relative instructions. The stack-relative addressing will cover register spills, stack local variables, and function arguments with plenty of immediate room, and it's easy to teach LCC to use these instructions.

But there is no instructions for load/store base + immediate other than if base is the stack. So the compiler generates at least an add instruction and needs to use a temporary register whenever it's accessing array indexes or struct members. Those operations are less common, but it does make programs use more memory.

Without any compromises as far as opcode space goes, I could do this:

Attachment:
load-store_after.png


The issue here is that 3 bit immediate is almost useless, and I don't have enough opcodes to do both a 7-bit immediate on stack relative addressing, and also have these 3 bit immediate arbitrary base memory ops. If I make it so some ALU operations don't have immediate variants, maybe I could fit both.

But because that immediate is so small, many operations will end up having an imm prefix instruction anyway, which would be basically the same as having an add instruction as a prefix.

I could get a 4 bit immediate by sacrificing 4 opcodes to that extra immediate bit. I think the SH-2 only had a 4 bit immediate on these instructions. So that would look like this:

Attachment:
load-store_imm4.png


The biggest issue here is losing 4 opcodes to the immediate bit, and losing another 4 opcodes because now the immediate absolute addressing instructions need new opcodes. To make this work I would probably need to remove the possibility of using an immediate from some of the ALU instructions. Either that or have a zero register, but then references to global variables are also now restricted to 4 bit immediates.

A slightly better sacrifice would be to make 4 registers into special "base" registers, and the stack pointer would be one of those special registers. That would look like this:

Attachment:
load-store_special_regs.png


A 5 bit immediate is actually useful in a lot of cases. I think they say the ideal is 6 bits? This is real close. There's no opcode space sacrifice so that's nice. And we can use more than just the stack pointer as a base register. The only issue I have with this is trying to teach LCC that it should store pointers in those base registers, specifically if it needs to do struct or array indexing off a pointer. In those cases LCC would likely use some other register and do the add instruction making this option essentially the same as the original design, but with only 5 bits for an immediate instead of 7 on stack operations.

Now, if we let go of the idea of having everything in a single instruction, and instead consider prefix instructions, now that might be a proper solution. We could have a prefix instruction for Base + Index<<Scale that would modify the load/store immediate instructions into load/store Base + Index<<Scale + Displacement. I need to think about that some more.


You do not have the required permissions to view the files attached to this post.


Last edited by rj45 on Wed Dec 09, 2020 2:32 pm, edited 1 time in total.



Wed Dec 09, 2020 1:52 pm
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
I think all the proposed solutions above are ok, all of them will have situations where one will work better than the others. You start with the advantage of having a big number of registers, so this possibly reduces the pressure to directly support particular addressing modes. I agree that 'address' registers are difficult to teach to a compiler, but if they are also part of the general set as you propose, it is less painful, after all it ends to be a matter of placing constraints to the physical register allocation phase. It most cases register to register moves are not needed.

This is all an interesting game, and something I've been playing not that long ago :-). I ended with 9 bit signed immediates for stack frame adjustment (add to SP) ; 8 bit unsigned immediates for stack based memory access, and 5 bit unsigned immediates for general purpose register base addresses. In all cases, immediates that refer to a word address are encoded shifted right by one, so they are effectively reaching double memory range. On top of that, I also managed to support baseregister+indexregister addressing mode, and also in this case the index register is meant to be scaled by 2 for word accesses. But of course I have a mixed direct and rom based instruction decoding (inspired on the ARM1) that helps on this.


Wed Dec 09, 2020 2:27 pm
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
Getting back to shifts, if you want to natively support any shift amount, then maybe you can consider a "funnel" shift, where two data values (say registers) are concatenated to produce a shifted result. The known 32 or 64 bit processors do not implement this because they already support shifts with native word sizes, so they would not get much advantage of a funnel shifter anyway. However, on a 16 bit processor, a funnel shifter may have some use if there's the intention to comfortably support bigger data sizes.

Imagine for example that you want to perform a 32 bit shift, like in this example.

Code:
long shift32( long a )
{
  return a >> 10;
}

With a funnel shifter this can be performed in two or three instructions on a 16 bit processor.

Otherwise, the required number of instructions is much increased, and requires an 'or' to compose the 32 bit result. See what LLVM generates for AVR, MSP430, and CPU74

[AVR assembly]
Code:
   .globl   shift32
   .p2align   1
   .type   shift32,@function
shift32:
   mov   r18, r24
   mov   r19, r25
   lsl   r18
   rol   r19
   lsl   r18
   rol   r19
   lsl   r18
   rol   r19
   lsl   r18
   rol   r19
   lsl   r18
   rol   r19
   lsl   r18
   rol   r19
   lsr   r23
   ror   r22
   lsr   r23
   ror   r22
   lsr   r23
   ror   r22
   lsr   r23
   ror   r22
   lsr   r23
   ror   r22
   lsr   r23
   ror   r22
   lsr   r23
   ror   r22
   lsr   r23
   ror   r22
   lsr   r23
   ror   r22
   lsr   r23
   ror   r22
   or   r22, r18
   or   r23, r19
   asr   r25
   ror   r24
   asr   r25
   ror   r24
   asr   r25
   ror   r24
   asr   r25
   ror   r24
   asr   r25
   ror   r24
   asr   r25
   ror   r24
   asr   r25
   ror   r24
   asr   r25
   ror   r24
   asr   r25
   ror   r24
   asr   r25
   ror   r24
   ret



[MSP430 assembly]
Code:
   .globl   shift32
   .p2align   1
   .type   shift32,@function
shift32:
   swpb   r12
   mov.b   r12, r12
   clrc
   rrc   r12
   rra   r12
   mov   r13, r14
   add   r14, r14
   add   r14, r14
   add   r14, r14
   add   r14, r14
   add   r14, r14
   add   r14, r14
   bis   r14, r12
   swpb   r13
   sxt   r13
   rra   r13
   rra   r13
   ret


[CPU74 assembly]
Code:
   .globl   shift32
shift32:
   lsrb   r0, r0
   lslb   r1, r2
   or   r0, r2, r0
   asrb   r1, r1
   asr   r1, r1
   lsrc   r0, r0
   asr   r1, r1
   lsrc   r0, r0
   ret


None of these processors support native any-amount shifts, however the MSP430 can do a bit better than the AVR because it has the 'swap byte' instruction. The CPU74 does even better because it has native 8 bit shift instructions. However all of them require an 'or' instruction to catenate 16 bit data shifts into a 32 bit result value.

Of course the ARM32 has it much easier for 32 bit shifts because it's just one instruction. But we can compare with the above code by attempting a shift on a 64 bit value on the ARM. Since 'long' is a 64 bit integer on the ARM, the same C code applies. I just renamed the function to make it clearer:

Code:
   .globl   _shift64                @ -- Begin function shift64
   .p2align   2
   .code   32                      @ @shift64
_shift64:
@ %bb.0:                                @ %entry
   lsr   r0, r0, #10
   orr   r0, r0, r1, lsl #22
   asr   r1, r1, #10
   mov   pc, lr


Note the use of the integrated OR with Shifted Operand (using the barrel shifter before the alu inputs), which also reduces the number of instructions despite not implementing a funnel shifter.

Finally, the esp32 processor is the only one that I am aware of supporting funnel shifts, see page 25 of this document:
https://0x04.net/~mwk/doc/xtensa.pdf

Unfortunately, The esp32 is not supported by LLVM so I can't post the assembly code, but I'm pretty sure it would be just 2 or 3 instructions for the 64 bit 'long' version.


Wed Dec 09, 2020 2:53 pm

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 123
Fascinating. If it's possible to implement a barrel shifter with tristate buffers, it's 12 chips for the bit reverser and 20 chips for the funnel shifter. Both allow you to need hardware for only a single direction of shift. But the funnel shifter is considerably more versatile. Bit-field extraction I believe can greatly speed up floating point emulation, and of course, the compiler is going to assume 32-bit shifts are fast. Hmmmmmmm.....


Wed Dec 09, 2020 5:39 pm
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
For more food for thought, we can also discuss the case of Link Register (LR) versus "old good" stack based subroutine calls.

I have a love-hate relationship with the LR. On the CPU74 I opted for the classic stack based mechanism, at first I even had push/pop register instructions for function prolog/epilogue but at some time I decided to remove them in favour of direct stack frame manipulation. So it's a mixed approach if you want. Calls and Returns still automatically push/pop the PC to/from the stack in a typical CISC fashion, but function prologue/epilogue is implemented in the RISC way.

The LR has many advantages though. If I had 16 registers as you have, I would definitely have defined a register as the LR, and implement "Branch&Link" instruction for calls and a "Branch LR" for returns; like ARM, PowerPC, RiscV and all the other riscs out there. On pipelined processors that's faster than the classic stack based Call/Return sequence, because even if you have to save/restore the LR as part of the function prolog/epiloge code, this potentially only uses one cycle in a pipelined processor. Also, on short 'leaf' functions with little register usage, the LR does not even need to be saved/restored, so the entire call/return sequence is as fast as a couple of unconditional jumps. That's definitely the way to go in your case, I think.

The LR is only potentially detrimental on architectures with a limited number of registers. I found that 8 general purpose registers is the absolute minimum to keep the compiler with just half a smile. So the CPU74 is at the edge of whether a LR is beneficial or not. It is probably better if I'm going to fully pipeline the processor with register save/restores only taking one cycle and call/returns being treated as unconditional jumps. But I think it's detrimental for the non-pipelined version because that generally means more register pressure and more register spills.


Wed Dec 09, 2020 7:32 pm

Joined: Mon Oct 07, 2019 2:41 am
Posts: 256
With my cpu designs, use SP as a frame register.
A call saves the PC @ (S+0). Jump (s+0) returns.
Having a true stack is expensive in cpu logic.
Interupts are pdp8 style. PC->(0) PC=2;

Hardware should have 3 bits for size ( size 0 (8) 1 (16) 2 (32) 3 (64) ) signed or
unsigned. They knew this the 1970's but nobody did that. Compilers are
forced to save to a free register. Convert type. repeat. then eval the expression.
Spills a compiler/langage problem because the language does not map well to the hardware.


Wed Dec 09, 2020 10:53 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1484
Location: Canada
I prefer the subroutine call / return that uses the stack for return addresses. Call is just a store, return is just a load and they can execute just as fast as regular loads and stores. Most often, the address needs to be stacked and unstacked, so the link register value ends up on the stack anyway. Call and ret instructions are more code dense than using a link register which may be important on machines with limited memory. Using a link register complicates things, is that a leaf routine or not? Both means are supported in the RTF64 (stack-based call / ret and link register based jal). The compiler however just uses stack-based call / ret to keep things simple.
I wonder if a stack cache could make stack based call and return just as fast as using a link register.

Quote:
Having a true stack is expensive in cpu logic.
Interupts are pdp8 style. PC->(0) PC=2;
Yes it is, but transistors are pretty inexpensive. I like the idea of a more expensive more capable machine.

_________________
Robert Finch http://www.finitron.ca


Thu Dec 10, 2020 4:41 am WWW
 [ 237 posts ]  Go to page 1, 2, 3, 4, 5 ... 16  Next

Who is online

Users browsing this forum: CCBot, PetalBot and 0 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Jump to:  
Powered by phpBB® Forum Software © phpBB Group
Designed by ST Software