AnyCPU - View topic - GF-RV16 - an experimental 16-bit RISC-V ISA

gfoot

Joined: Sat Oct 04, 2025 10:54 am
Posts: 9

GF-RV16 - an experimental 16-bit RISC-V ISA

Hi all

As I mentioned in my introduction a month or so ago, I've recently found myself going deep into the rabbit hole of designing a CPU that I can build. It's been in the back of my mind for a while, though until recently I never really felt it was something I'd actually go ahead and build. Once I got started working on it, I haven't really been able to stop myself.

I have uploaded everything so far to a github project here: https://github.com/gfoot/gf-rv16/tree/main The README there is quite out of date, but I wrote a lot more documentation in the wiki: https://github.com/gfoot/gf-rv16/wiki

The repo is all software - no working hardware yet - but it includes in particular:

The ISA and ABI, and Instruction Encoding
A partial hardware design
An assembler and a simulator based on simulating microcode that is designed to work on simple hardware
Target source code including a growing runtime library, various test programs, and a cut-down "dc" implementation (stack-based calculator found on Unix systems)

My intention here was always to design something that's practical to build on solderless breadboard, using simple logic chips and GALs but probably nothing more complex than that. This could change, but really the desire to do it this way is what has led to most of the constraints I've placed on the design, and it's those constraints that really give it its identity. Some of the constraints then are that it can't have a large number of registers, and I don't want to deal with very wide bus widths (neither internally within the CPU, nor externally in the memory interface). I decided that the pinout of a 6502 is about right, and is also something I'm very used to.

I had some ideas for super-simple CPUs that would be very easy to build in this fashion but hard to program effectively. They weren't very compelling for me, though, and didn't feel like something I'd actually want to use beyond simple novelty value - and that often leads to unfinished projects, which I already have plenty of! But after reading a post on 6502.org by mysterymath, talking about making a 16-bit CPU based roughly on the RISC-V ISA, something clicked. I realised that it made quite a lot of sense, and was actually a really good fit as an instruction set to layer on top of the simple kinds of core I had already been designing.

This is not what RISC-V was designed for, but I went ahead and thought about an ISA that could work, fitting into a 16-bit instruction encoding. I wanted to stick as close as I could to RV32I - partly because I didn't know it well enough to know whether diverging from it was a good idea or not. When the dust settles there may be some points where it would make sense to just diverge more from it, but it has actually been a surprisingly good fit overall - not perfect by any means, but also not terrible. I expected more difficulties than those that actually came up.

The instruction set is pretty much the whole RISC-V RV32I instruction set, but only supporting 8 registers, and with severely limited widths for immediate operands. Nonetheless I have found it quite comfortable writing code within these restrictions, and the assembler helps out a lot by rewriting certain instructions where necessary so that they can fit into the encoding. There is a full instruction set listed on the ISA documentation page linked above, and also some discussion of the limitations and workarounds.

The rough core design I started with has evolved slightly into this diagram:

Attachment:

corediagram.png

and based on that, fleshing out more aspects of the hardware, I've ended up with this block diagram - there are a couple of missing control signals that I've since spotted, but I think it is otherwise complete:

Attachment:

gf-rv16-blockdiagram.png

I think overall this is actually simpler than I originally thought it would end up being, and nothing in there really scares me a lot. The ALU and shifter modules are going to be quite large, as is the register file, but not unmanageable I think, and there's always the option of using an EEPROM if I don't want to actually build it out - and maybe even static RAM for the register file.

Anyway that is about where I am with this now - I thought I'd share it in case anybody was interested to see, though it is just a hobbyist passion project and probably isn't of much value to anybody else!


Sun Oct 26, 2025 12:58 am

gfoot

Joined: Sat Oct 04, 2025 10:54 am
Posts: 9

Re: GF-RV16 - an experimental 16-bit RISC-V ISA

Definitely, actually writing code for the target was an important step to understand the impact that the restrictions would have, both to check whether it would be practical and to make sure sensible compromises were being made in the encoding. And I came to appreciate that RISC-V in general leans quite heavily on the assembler (and linker) to rewrite certain instructions depending on code layout.

I also made other tools to analyse aspects of the ISA, such as finding correlations between certain control signals in the microcode that should make it easier to pack into a ROM later on. And a tool to analyse my code and report how often each instruction appears with each possible immediate width - this is really valuable when deciding how much of the instruction encoding space to give to each instruction. For example:

Code:

addi8_4           78     -8      7      4          jalr_6             2     24     24      6
addi8_5           18    -15     10      5          jr_4              34      0      2      3
addi8_6            3    -22     27      6          lb_4              17     -2      1      2
addi8_7           24     32     62      7          lb_5               3      8      8      5
addi8_8           13   -127    127      8          lb_6               4     17     25      6
addi8_9            1    128    128      9          lbu_4              2      0      0      1
addi_4             2     -1     -1      1          lbu_6              1     16     16      6
addi_5             5      8      8      5          li_4              52     -1      1      2
andi_4             6     -8      7      4          li_5               7      8     13      5
andi_5             1     15     15      5          lui_12             2  -2048   1024     12
andi_6             1     16     16      6          lui_13             1   2048   2048     13
auipc_16           1 -32768 -32768     16          lui_14             5   4096   4096     14
auipc_addi_12      2   1224   1732     12          lui_15             3  12544  13312     15
auipc_addi_4       1     -8     -8      4          lui_16             5  21248  31488     16
auipc_addi_7       1     40     40      7          lui_4              1      0      0      1
beq_4              3     -2      2      3          lui_9              4   -256   -256      9
beq_5              2      8      8      5          lui_addi_11        3   1023   1023     11
beq_6              4    -30     20      6          lui_addi_12        1   2046   2046     12
beq_7              3     42     52      7          lui_addi_13        1   2092   2092     13
beqz_4            10      4      6      4          lui_addi_9         1    129    129      9
beqz_5            10      8     12      5          lw_4             103     -8      6      4
beqz_6             7     16     24      6          lw_5               7      8     12      5
beqz_7             1     36     36      7          lw_7               5     50     60      7
bge_4              5      4      6      4          lw_l_11            1    950    950     11
bge_6              2     16     16      6          lw_l_12            4   1030   1674     12
bgeu_4             5     -4      6      4          lw_l_4             1     -4     -4      3
bgez_4             3      0      0      1          lw_l_5             1    -10    -10      5
bgez_5             1    -10    -10      5          lw_l_6             1    -24    -24      6
blt_5              2    -12    -10      5          mret_4             1      0      0      1
bltu_4             3      4      6      4          ori_4              8      4      6      4
bltu_5             2      8      8      5          ori_6              2     16     16      6
bltz_4             4      4      4      4          sb_4              17     -2      0      2
bltz_6             1    -18    -18      6          sb_5               1    -12    -12      5
bne_7              2     54     58      7          sb_6               4     16     27      6
bnez_4            11     -8      6      4          slli_4            44      0      3      3
bnez_5             6    -14     14      5          slli_5             2      8      8      5
bnez_6             8    -22     22      6          slti_4             4      0      0      1
bnez_7             1    -48    -48      7          srai_4             3      1      1      2
j_4                4     -8      4      4          srli_4             1      4      4      4
j_5                6    -12     10      5          srli_5             4      8     12      5
j_6                5    -28     28      6          sw_4             104     -8      6      4
j_7                8    -48     60      7          sw_5               6      8     12      5
j_8                1     86     86      8          sw_6               1     24     24      6
j_9                2   -242    146      9          sw_l_12            3   1380   1538     12
jal_10            17   -502    510     10          sw_l_7             3    -60    -44      7
jal_5              3    -16     14      5          xori_4             4      3      7      4
jal_6              6    -30     28      6          xori_5             5      9     14      5
jal_7              5    -40     58      7          xori_6             1     16     16      6
jal_8              7    -90    112      8
jal_9              5   -256    228      9
jal_l_11          31   -936    934     11
jal_l_12           6  -1518   1728     12

Instructions are listed grouped by immediate width in bits, with widths smaller than 4 grouped together with 4. The columns show:

Frequency - how many times this instruction appeared in the program, with this immediate width
Smallest immediate value that was used
Largest immediate value that was used
Largest immediate width that was used (which is a bit redundant now)

For example, the "jal_9" line shows that JAL with a 9-bit immediate appeared 5 times, the smallest value used was -256, and the largest was 228.
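As a rough illustration of how a tally like this can be produced, here is a minimal Python sketch. The function names and the (mnemonic, immediate) input format are made up for the example and are not taken from the actual analysis script. The width is simply the smallest signed two's-complement width that holds the value, which agrees with the rows above (128 needs 9 bits, -8 needs 4), and widths below 4 are folded into the 4-bit bucket.

```python
from collections import defaultdict

def signed_width(value):
    """Smallest two's-complement width in bits that can hold value."""
    if value < 0:
        value = ~value  # -1 -> 0, -8 -> 7, -256 -> 255
    return value.bit_length() + 1

def tally(uses, min_bucket=4):
    """uses: iterable of (mnemonic, immediate) pairs (hypothetical format).

    Returns {"name_width": [count, smallest, largest, widest]}.
    """
    table = defaultdict(lambda: [0, None, None, 0])
    for name, imm in uses:
        width = signed_width(imm)
        row = table[f"{name}_{max(width, min_bucket)}"]
        row[0] += 1
        row[1] = imm if row[1] is None else min(row[1], imm)
        row[2] = imm if row[2] is None else max(row[2], imm)
        row[3] = max(row[3], width)
    return dict(table)

# Made-up sample data, just to exercise the functions.
rows = tally([("jal", -256), ("jal", 228), ("li", -1), ("li", 1), ("li", 0)])
for key in sorted(rows):
    print(key, *rows[key])
```

The fourth value in each row is the widest immediate actually seen, which is why it can be smaller than the bucket number for the 4-bit rows.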

The analysis script also looks for certain sequences of instructions, and combines them into a single virtual instruction - for example, LUI (load upper immediate) loads the upper 8 bits of a register and ADDI8 (add immediate 8-bit) adds a signed 8-bit number to a register, and they are used together to load a 16-bit value - whereas LI (load immediate) loads a sign-extended immediate into a register but only supports 5 or 6 bits. So LUI+ADDI8 is essentially a wider version of LI. The script combines them so we can see that if LI supported wider immediates it would have been used once with 9-bit immediate, three times with an 11-bit immediate, etc.
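One detail worth spelling out is that ADDI8 adds a signed value, so when the low byte of the constant is 0x80 or above, the LUI part has to be one page higher to compensate. A small Python sketch of the split follows; the helper names are hypothetical, and it assumes the LUI operand is written as the full 16-bit value with a zero low byte, as the lui_* rows in the table suggest.

```python
def split_hi_lo(value):
    """Split a 16-bit constant into (LUI operand, ADDI8 operand)."""
    value &= 0xFFFF
    lo = value & 0xFF
    if lo >= 0x80:
        lo -= 0x100                  # ADDI8 sign-extends: 0x80..0xFF are negative
    upper = (value - lo) & 0xFFFF    # multiple of 256, bumped up when lo < 0
    return upper, lo

def combine(upper, lo):
    """What the LUI + ADDI8 pair leaves in the register."""
    return (upper + lo) & 0xFFFF

upper, lo = split_hi_lo(129)   # the single lui_addi_9 case from the table
```

For 129 this gives an upper part of 256 and an ADDI8 operand of -127, and `combine` maps every such pair back to the original constant.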

Another example is AUIPC+ADDI8 which together add a 16-bit offset to the program counter, storing the result in a register. This is especially often followed by a JALR instruction, to jump-and-link to where that register is pointing, so the analysis script fuses all of that together into a virtual "JAL_L" instruction. This lets us see whether it would be valuable to widen the immediate in the JAL instruction so that we wouldn't need to issue this combination as many times - and it would appear to be extremely valuable to make that change in the ISA, as JAL_L_11 is even more common than JAL_10 is (the widest currently supported JAL). However, this depends a lot on how much code you've actually written, and a smart linker could also reduce this impact by putting routines that call each other next to each other in memory. Also, I think however wide JAL becomes, it will never be wide enough if more code is written!
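The fusing step can be sketched as a simple peephole pass. This is only an illustration in Python under assumed conventions: instructions are hypothetical (mnemonic, register, immediate) tuples, the virtual names mirror the ones in the table, and the fused immediate is taken to be the plain sum of the parts. The real script may well work differently.

```python
def fuse(instrs):
    """Fuse LUI/AUIPC + ADDI8 (+ JALR) on the same register into virtual ops."""
    out = []
    i = 0
    while i < len(instrs):
        op, reg, imm = instrs[i]
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        if op in ("lui", "auipc") and nxt and nxt[0] == "addi8" and nxt[1] == reg:
            total = imm + nxt[2]
            after = instrs[i + 2] if i + 2 < len(instrs) else None
            if op == "auipc" and after and after[0] == "jalr" and after[1] == reg:
                # AUIPC + ADDI8 + JALR behaves like a JAL with a wide offset
                out.append(("jal_l", None, total + after[2]))
                i += 3
            else:
                out.append((op + "_addi", reg, total))
                i += 2
        else:
            out.append(instrs[i])
            i += 1
    return out

# Made-up input sequence for illustration.
program = [
    ("lui", "a0", 256), ("addi8", "a0", -127),
    ("auipc", "a1", 1024), ("addi8", "a1", -88), ("jalr", "a1", 0),
    ("addi8", "s0", 1),
]
fused = fuse(program)
```

Feeding the fused list into the width tally is then what produces rows like lui_addi_9 and jal_l_11.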

I'd also highlight the branches in the table above. Comparisons against zero are more common than comparisons between two registers, and BEQZ and BNEZ seem particularly common. That might be just the way I write code, but I think it's also partly due to the limited number of registers available - loading a constant into a register just to make a comparison against it is a bit of a waste of a register. So I find myself writing code a bit differently, so that I can make these comparisons without needing an extra register - here's a fragment from my "gets" implementation:

Code:

.again:
    call    getchar

    addi    a0, a0, -127
    beqz    a0, .backspace
    bgez    a0, .again

    addi    a0, a0, 127 - 10
    beqz    a0, .enter

    addi    a0, a0, 10 - 32
    bltz    a0, .again

    addi    a0, a0, 32

    sb      a0, (s0)
    addi    s0, s0, 1

    call    putchar
    j       .again

You can see that I'm avoiding loading various constants into another register before comparing against them by instead just modifying a0 multiple times and comparing against zero. I don't know whether this is something you'd do in normal RISC-V as well, but it makes a lot of sense with this architecture with limited available registers.
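To check that the chain of adjustments really is equivalent to the obvious comparisons, the fragment can be modelled in Python. This is a sketch that assumes getchar returns a value from 0 to 255 in a0; the function names and the returned labels are just for illustration.

```python
def classify_chained(ch):
    """Mirror the assembly: keep adjusting a0 and only ever compare with zero."""
    a0 = ch
    a0 += -127
    if a0 == 0:
        return "backspace"
    if a0 >= 0:
        return "again"
    a0 += 127 - 10
    if a0 == 0:
        return "enter"
    a0 += 10 - 32
    if a0 < 0:
        return "again"
    a0 += 32                 # back to the original character value
    return ("store", a0)

def classify_direct(ch):
    """The same decisions written as comparisons against constants."""
    if ch == 127:
        return "backspace"
    if ch > 127:
        return "again"
    if ch == 10:
        return "enter"
    if ch < 32:
        return "again"
    return ("store", ch)

mismatches = [c for c in range(256) if classify_chained(c) != classify_direct(c)]
```

The list of mismatches is empty, so over that input range the two versions agree, and the chained one never needs a second register.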

Having gone through writing the assembler, I do wonder how easy it would have been (or would be) to add a new target to GNU binutils that does all these things. Last time I looked into that it seemed very hard to figure out where to get started. Or maybe LLVM's assembler is easier to adapt.

Sun Oct 26, 2025 10:05 am

gfoot

Joined: Sat Oct 04, 2025 10:54 am
Posts: 9

Re: GF-RV16 - an experimental 16-bit RISC-V ISA

This looks interesting, regarding adding new targets to binutils: https://www.opensourceforu.com/2010/01/ ... hitecture/ It at least maps things out a bit.

I did look at LLVM in this regard, but got the impression that their assembler might not be very mature anyway - it sounds like for a long time their tool frontends were being used with the GNU Assembler as the back end. That seems to have changed for some target types now, but not all. I do think that LLVM is very well thought-out and structured - I expect it carries less baggage than GNU binutils - but maybe also less maturity as a result.

Architectures like RISC-V seem to be particularly tricky for binutils to work with, as the linker has to do a lot more work than I believe is necessary in most architectures, changing some instructions around significantly based on where symbols have ended up. These things often contract two instructions into one, but that then moves all the other symbols later in that section, and that could mean that now an instruction that appeared earlier in the section can also be written in a shorter form, so then everything moves yet again... My assembler also has to deal with this, and there are situations where you can get into an infinite loop with instructions becoming shorter in one pass and then longer in the next pass. It is quite complicated.

So if adding a new target to binutils, it would be nice to be able to reuse all that logic that's already there, if possible. It could be interesting to look into one day but is also a bit of a distraction - you have to pick your battles!
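For what it's worth, one common way to avoid the shorter-then-longer oscillation described above is to start with every jump in its short form and only ever let instructions grow. That may give slightly larger code than the best possible layout, but it is guaranteed to terminate. The Python sketch below shows the idea; it is not how my assembler or binutils does it, and the item format, the sizes and the reach are all assumptions made for the example.

```python
SHORT, LONG = 2, 6  # assumed sizes in bytes: one JAL vs AUIPC + ADDI8 + JALR

def initial_size(item):
    kind = item[0]
    if kind == "jump":
        return SHORT          # optimistic starting point
    if kind == "pad":
        return item[1]        # fixed-size filler standing in for other code
    return 0                  # labels occupy no space

def layout(items, sizes):
    """Return ({label: address}, [address of each item]) for the given sizes."""
    labels, addrs, pc = {}, [], 0
    for item, size in zip(items, sizes):
        addrs.append(pc)
        if item[0] == "label":
            labels[item[1]] = pc
        pc += size
    return labels, addrs

def relax(items, reach=512):
    """Grow jumps from SHORT to LONG until every remaining short jump fits."""
    sizes = [initial_size(item) for item in items]
    while True:
        labels, addrs = layout(items, sizes)
        changed = False
        for i, item in enumerate(items):
            if item[0] == "jump" and sizes[i] == SHORT:
                offset = labels[item[1]] - addrs[i]
                if not -reach <= offset < reach:
                    sizes[i] = LONG   # never shrinks again, so this terminates
                    changed = True
        if not changed:
            return sizes

program = [
    ("jump", "end"), ("pad", 506), ("jump", "far"),
    ("label", "end"), ("pad", 600), ("label", "far"),
]
sizes = relax(program)
```

In this example the inner jump has to grow first, and that pushes the "end" label just out of reach of the outer jump, which then has to grow on the following pass. It is the same cascading effect, just running in the growing direction only.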

Sun Oct 26, 2025 4:14 pm

BigEd

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1846

Re: GF-RV16 - an experimental 16-bit RISC-V ISA

May be of interest, I made a new thread Easy RISC-V (online tutorial with emulation in browser)

Tue Oct 28, 2025 9:47 pm

gfoot

Joined: Sat Oct 04, 2025 10:54 am
Posts: 9

Re: GF-RV16 - an experimental 16-bit RISC-V ISA

Talking of instruction-fetch overlapping, here is where I am now with the microcode, including instruction fetch. It's all fully-simulated in the simulator and seems to work well.

First here is a simple instruction - Load Immediate - which just loads an immediate value (embedded in the opcode) into a register:

Code:

mnemonic               .        cycle   last    high    bus_a   bus_b   reg_r   aluop   mar_w   reg_w   reg_win pc_w    mem_w   if
                       .       
LI    rd, imm          .        0       .       low     imm     zero    .       ad0     .       rd      alu     .       .       if
                       .        1       1       high    imm     zero    .       adc     .       rd      alu     .       .       if

It's a two-cycle instruction (as the buses are only 8-bit, almost everything takes at least two cycles). The first is a low cycle, the second a high one - this affects things like register reads, register writes, and memory access. Other than that the two cycles are almost identical.

In each cycle, bus A is driven with the immediate value taken from the instruction encoding - it's probably 5 or 6 bits wide for LI, signed and hence sign-extended. Bus B is zero because we don't really want to do any arithmetic. No registers are read, the ALU runs an ADD operation with 0 carry, the MAR is not written, but a register is written (rd in the instruction encoding), and the value written to the register comes from the ALU. The PC is not written, memory is not written, and both cycles are instruction fetch cycles.
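The low/high cycle pair can be modelled with an 8-bit adder and a carry that is passed from the first cycle to the second. The Python below is only a behavioural sketch with made-up function names, not the simulator's code; 'ad0' is modelled as an add with carry-in forced to 0 and 'adc' as an add using the carry from the previous cycle.

```python
def alu_add(a, b, carry_in):
    """8-bit add; returns (result byte, carry out)."""
    total = (a & 0xFF) + (b & 0xFF) + carry_in
    return total & 0xFF, total >> 8

def sign_extend(imm, bits):
    """Sign-extend a bits-wide immediate to a 16-bit value."""
    imm &= (1 << bits) - 1
    if imm >> (bits - 1):
        imm -= 1 << bits
    return imm & 0xFFFF

def li(imm, bits=6):
    """LI as two cycles: bus A = immediate byte, bus B = zero."""
    value = sign_extend(imm, bits)
    low, carry = alu_add(value & 0xFF, 0, 0)    # cycle 0: low byte, 'ad0'
    high, _ = alu_add(value >> 8, 0, carry)     # cycle 1: high byte, 'adc'
    return (high << 8) | low

def add16(a, b):
    """The same two-cycle pattern with a register on bus B instead of zero."""
    low, carry = alu_add(a & 0xFF, b & 0xFF, 0)
    high, _ = alu_add(a >> 8, b >> 8, carry)
    return (high << 8) | low
```

For LI the carry is never actually set, since bus B is zero, but using the same ad0/adc pair means LI needs no special ALU operation of its own.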

Instruction fetch takes two cycles because instructions are 16-bit. Whether the instruction fetch on a particular cycle is for the low or high byte depends on the cycle number being even or odd - not the "high" control signal (though that usually drives this sort of thing). On an instruction fetch cycle the memory address is driven by the program counter (actually the next value of the program counter), rather than the MAR, with bit 0 of the memory address coming from the low bit of the cycle number (though I might use the microcode address for that in fact). So over the course of these two cycles, the Instruction Register is loaded with a new value ready for the next instruction.
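A short Python sketch of that addressing rule follows. It assumes instructions are 2-byte aligned and stored low byte first (so the even address holds the low byte), which the description above implies but does not state outright; the function name is made up.

```python
def fetch(memory, pcnext, cycles=(0, 1)):
    """Assemble the 16-bit instruction register over two fetch cycles."""
    ir = 0
    for cycle in cycles:
        half = cycle & 1                        # even cycle: low byte, odd: high
        address = (pcnext & 0xFFFE) | half      # bit 0 comes from the cycle number
        ir |= memory[address] << (8 * half)
    return ir

memory = bytearray(0x200)
memory[0x100] = 0x34    # low byte
memory[0x101] = 0x12    # high byte

normal = fetch(memory, 0x100)             # fetch on cycles 0 and 1
late = fetch(memory, 0x100, (9, 10))      # fetch on cycles 9 and 10
```

Because the byte is selected by the cycle number rather than by the order of the fetch cycles, a fetch that starts on an odd cycle reads the high byte first and still ends up with the same instruction word.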

Instructions that modify PC are more complex - here is a Jump-And-Link instruction:

Code:

mnemonic               .        cycle   last    high    bus_a   bus_b   reg_r   aluop   mar_w   reg_w   reg_win pc_w    mem_w   if
                       .       
JAL   ra, imm          .        0       .       low     imm     pc      .       ad0     .       ra      pcnext  pc_w    .       .
                       .        1       .       high    imm     pc      .       adc     .       ra      pcnext  pc_w    .       .
                       .        2       .       .       .       .       .       .       .       .       .       .       .       if
                       .        3       1       .       .       .       .       .       .       .       .       .       .       if

Again the main work is done in two cycles, but we must also spend two cycles on the instruction fetch because we can't do that until PC has been loaded with the new value, so there are two extra rather dead cycles at the end. The first two cycles compute PC+imm via the ALU, writing the result to the program counter ('pc_w' is set), while simultaneously register 'ra' is written with the value taken directly from 'pcnext' (not via the ALU which is already busy).

So the program counter actually consists of (at least) two registers - 'pc' is a static view of the value of the program counter for the instruction that's currently being executed, and feeds into the ALU where necessary as in the above case. 'pcnext' is a counter that holds the next value for the program counter - this is usually 2 greater than 'pc', but during a jump instruction it gets written with something different, as in this case. 'pcnext' is copied to 'pc' at the end of the last cycle of an instruction.
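At the 16-bit level the pc/pcnext arrangement can be sketched like this in Python. The class and function names are made up, and the point at which 'pcnext' advances by 2 is an assumption (here it happens together with the copy at the end of the instruction); the real hardware may increment it at a different moment.

```python
class ProgramCounter:
    def __init__(self, start=0):
        self.pc = start                       # address of the current instruction
        self.pcnext = (start + 2) & 0xFFFF    # address of the next one

    def end_of_instruction(self):
        """Copy pcnext to pc at the end of the last cycle, then count on."""
        self.pc = self.pcnext
        self.pcnext = (self.pc + 2) & 0xFFFF

def jal(state, regs, imm):
    """JAL ra, imm: link from pcnext while the ALU computes pc + imm."""
    regs["ra"] = state.pcnext                   # return address, not via the ALU
    state.pcnext = (state.pc + imm) & 0xFFFF    # ALU result written via pc_w
    state.end_of_instruction()                  # after the two extra fetch cycles

state = ProgramCounter(0x100)
regs = {}
jal(state, regs, 0x20)
```

After the call, 'ra' holds 0x102 (the instruction after the JAL) and execution continues at 0x120.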

Finally here's a more complex instruction - a branch, where we may or may not need to delay the instruction fetch. This particular example is the longest one in my current microcode, it is a particularly awkward case...

Code:

mnemonic               .        cycle   last    high    bus_a   bus_b   reg_r   aluop   mar_w   reg_w   reg_win pc_w    mem_w   if
                       .       
BNE   rs1, rs2, imm    .        0       .       low     0xff    regs    rs2     xor     mar_w   .       .       .       .       if
                       .        1       .       low     mar     regs    rs1     xor     mar_w   .       .       .       .       if
                       .        2       .       low     mar     marn    .       ad0     .       .       .       .       .       .
                       .        3       .       high    0xff    regs    rs2     xor     mar_w   .       .       .       .       .
                       .        4       .       high    mar     regs    rs1     xor     mar_w   .       .       .       .       .
                       .        5       .       high    mar     zero    .       adc     .       .       .       .       .       .
                       .        6       z       high    0xff    zero    .       adc     .       .       .       .       .       .
                       .        7       .       low     imm     pc      .       ad0     .       .       .       pc_w    .       .
                       .        8       .       high    imm     pc      .       adc     .       .       .       pc_w    .       .
                       .        9       .       .       .       .       .       .       .       .       .       .       .       if
                       .        10      1       .       .       .       .       .       .       .       .       .       .       if

It will always cost four cycles to read two 16-bit registers 8 bits at a time, two cycles to set pc, and two cycles for a final instruction fetch, so the shortest this could be is eight cycles; but it takes three more in this implementation.

During the first two cycles we still do an instruction fetch in parallel as usual, to cover the case where the branch is not taken. Some branch instructions like BEQ are also able to early out as soon as the end of the second cycle, if they can already determine that the branch shouldn't be taken. BNE is really hard though.

The first two cycles compute (0xff ^ (rs1 & 0xff) ^ (rs2 & 0xff)) and the third cycle adds on another 1 if the result (temporarily stored in the MAR) was negative. This causes the internal carry flag to get set if and only if the initial result was 0xff, indicating equality of the low bytes of the registers.

The fourth and fifth cycles then compute the same expression but for the high byte, and this time on the sixth cycle we add zero with carry - so if the carry was 1 from the third cycle, and the high byte calculation resulted in 0xff (equality), then when we add the carry, the carry will again be set. The seventh cycle then adds 0xff to zero with carry, resulting in zero if the numbers were equal and 0xff if they were not. The 'last' column shows a 'z' meaning this is conditionally the last cycle - in the case that the ALU result from this cycle was zero. So if the numbers were equal, the ALU result will be zero and this will be the last cycle; the branch is not taken, and the next cycle will begin execution of the instruction that was already fetched in the first two cycles.

If the numbers were not equal, on the other hand, this is then not the last cycle, and the remainder of the instruction is basically the same as the jump instruction above, but without the link register being written.
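The first seven cycles can be modelled row by row from the BNE microcode table. The Python below is a behavioural sketch with made-up names, and it rests on two assumptions: that 'marn' puts a 1 on bus B when the value in the MAR is negative (bit 7 set), and that the XOR cycles leave the carry flag untouched.

```python
def alu_add(a, b, carry_in):
    """8-bit add; returns (result byte, carry out)."""
    total = (a & 0xFF) + (b & 0xFF) + carry_in
    return total & 0xFF, total >> 8

def bne_taken(rs1, rs2):
    """Follow cycles 0..6 of the table; True means the branch is taken."""
    # cycles 0-1: low bytes, 0xff ^ rs2 ^ rs1 kept in the MAR
    mar = 0xFF ^ (rs2 & 0xFF)
    mar ^= rs1 & 0xFF
    # cycle 2: mar + marn, 'ad0' (assumed: marn is 1 when MAR is negative)
    marn = mar >> 7
    _, carry = alu_add(mar, marn, 0)      # carry set only when mar was 0xff
    # cycles 3-4: the same expression for the high bytes (carry assumed kept)
    mar = 0xFF ^ ((rs2 >> 8) & 0xFF)
    mar ^= (rs1 >> 8) & 0xFF
    # cycle 5: mar + 0 + carry, 'adc'
    _, carry = alu_add(mar, 0, carry)     # carry set only when both bytes matched
    # cycle 6: 0xff + 0 + carry, 'adc'; a zero result ends the instruction here
    result, _ = alu_add(0xFF, 0, carry)
    return result != 0

samples = [0x0000, 0x0001, 0x00FF, 0x0100, 0x1234, 0x12FF,
           0x7FFF, 0x8000, 0x80FF, 0xFF00, 0xFFFE, 0xFFFF]
agrees = all(bne_taken(a, b) == (a != b) for a in samples for b in samples)
```

Over these sample values the model agrees with a plain `a != b`, with the zero result on cycle 6 corresponding to the equal, not-taken case.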

This BNE instruction is (hopefully!) the worst case for this 8-bit internal architecture, and there's some potential to add some shortcuts (and complexity) to speed it up. For example making the interface to the registers be 16 bits wide would allow more direct comparisons between them in fewer cycles. But I don't want to commit to that extra complexity yet. Another possibility is allowing more flexible use of the MAR, which is 16-bit but when used as a temporary we only use 8 bits of it. For now though I'm accepting the longer microcode in exchange for maybe slightly simpler control signal complexity.

Wed Oct 29, 2025 7:22 pm
