GF-RV16 - an experimental 16-bit RISC-V ISA
| Author |
Message |
|
gfoot
Joined: Sat Oct 04, 2025 10:54 am Posts: 9
|
Hi all

As I mentioned in my introduction a month or so ago, I've recently found myself going deep into the rabbit hole of designing a CPU that I can build. It's been in the back of my mind for a while, but I never really felt it was something I'd actually like to go ahead and build until recently. Once I got started working on it, though, I haven't really been able to stop myself.

I have uploaded everything so far to a github project here: https://github.com/gfoot/gf-rv16/tree/main

The README there is quite out of date, but I wrote a lot more documentation in the wiki: https://github.com/gfoot/gf-rv16/wiki

The repo is all software - no working hardware yet - but it includes in particular:

My intention here was always to design something that's practical to build on solderless breadboard, using simple logic chips and GALs but probably nothing more complex than that. This could change, but really the desire to do it this way is what has led to most of the constraints I've placed on the design, and it's those constraints that really give it its identity.

Some of the constraints then are that it can't have a large number of registers, and I don't want to deal with very wide bus widths (neither internally within the CPU, nor externally in the memory interface). I decided that the pinout of a 6502 is about right, and is also something I'm very used to.

I had some ideas for super-simple CPUs that would be very easy to build in this fashion but hard to program effectively, but they weren't very compelling for me and didn't feel like something I'd actually want to use beyond its simple novelty value, and that often leads to unfinished projects which I already have plenty of!

But after reading a post on 6502.org by mysterymath, talking about making a 16-bit CPU based roughly on the RISC-V ISA, something clicked and I realised that it really made quite a lot of sense, and was actually a really good fit as an instruction set to layer on top of the simple kinds of core I had already been designing.

This is not what RISC-V was designed for, but I went ahead and thought about an ISA that could work, fitting into a 16-bit instruction encoding. I wanted to stick as close as I could to RV32I - partly because I didn't know it well enough to know whether diverging from it was a good idea or not. When the dust settles there may be some points where it would make sense to just diverge more from it, but it has actually been a surprisingly good fit overall - not perfect by any means, but also not terrible. I expected more difficulties than those that actually came up.

The instruction set is pretty much the whole RISC-V RV32I instruction set, but only supporting 8 registers, and with severely limited widths for immediate operands. Nonetheless I have found it quite comfortable writing code within these restrictions, and the assembler helps out a lot by rewriting certain instructions where necessary so that they can fit into the encoding.

There is a full instruction set listed on the ISA documentation page linked above, and also some discussion of the limitations and workarounds.

The rough core design I started with has evolved slightly into this diagram:
Attachment: corediagram.png

and based on that, fleshing out more aspects of the hardware, I've ended up with this block diagram - there are a couple of missing control signals that I've since spotted, but I think it is otherwise complete:
Attachment: gf-rv16-blockdiagram.png

I think overall this is actually simpler than I originally thought it would end up being, and nothing in there really scares me a lot. The ALU and shifter modules are going to be quite large, as is the register file, but not unmanageable I think, and there's always the option of using an EEPROM if I don't want to actually build it out - and maybe even static RAM for the register file.

Anyway that is about where I am with this now - I thought I'd share it in case anybody was interested to see, though it is just a hobbyist passion project and probably isn't of much value to anybody else!
|
| Sun Oct 26, 2025 12:58 am |
|
 |
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1846
|
Most excellent - thanks for sharing!
> Nonetheless I have found it quite comfortable writing code within these restrictions, and the assembler helps out a lot by rewriting certain instructions where necessary so that they can fit into the encoding.
This is a great place to have reached - shows the value of simulator and assembler, and allows you to experiment before implementation.
|
| Sun Oct 26, 2025 8:24 am |
|
 |
|
gfoot
Joined: Sat Oct 04, 2025 10:54 am Posts: 9
|
Definitely, actually writing code for the target was an important step to understand the impact that the restrictions would have, both to check whether it would be practical and to make sure sensible compromises were being made in the encoding. And I came to appreciate that RISC-V in general leans quite heavily on the assembler (and linker) to rewrite certain instructions depending on code layout.

I also made other tools to analyse aspects of the ISA, such as finding correlations between certain control signals in the microcode that should make it easier to pack into a ROM later on. And a tool to analyse my code and report how often each instruction appears with each possible immediate width - this is really valuable when deciding how much of the instruction encoding space to give to each instruction. For example:
Code:
addi8_4       78  -8     7      4    jalr_6      2   24    24    6
addi8_5       18  -15    10     5    jr_4        34  0     2     3
addi8_6       3   -22    27     6    lb_4        17  -2    1     2
addi8_7       24  32     62     7    lb_5        3   8     8     5
addi8_8       13  -127   127    8    lb_6        4   17    25    6
addi8_9       1   128    128    9    lbu_4       2   0     0     1
addi_4        2   -1     -1     1    lbu_6       1   16    16    6
addi_5        5   8      8      5    li_4        52  -1    1     2
andi_4        6   -8     7      4    li_5        7   8     13    5
andi_5        1   15     15     5    lui_12      2   -2048 1024  12
andi_6        1   16     16     6    lui_13      1   2048  2048  13
auipc_16      1   -32768 -32768 16   lui_14      5   4096  4096  14
auipc_addi_12 2   1224   1732   12   lui_15      3   12544 13312 15
auipc_addi_4  1   -8     -8     4    lui_16      5   21248 31488 16
auipc_addi_7  1   40     40     7    lui_4       1   0     0     1
beq_4         3   -2     2      3    lui_9       4   -256  -256  9
beq_5         2   8      8      5    lui_addi_11 3   1023  1023  11
beq_6         4   -30    20     6    lui_addi_12 1   2046  2046  12
beq_7         3   42     52     7    lui_addi_13 1   2092  2092  13
beqz_4        10  4      6      4    lui_addi_9  1   129   129   9
beqz_5        10  8      12     5    lw_4        103 -8    6     4
beqz_6        7   16     24     6    lw_5        7   8     12    5
beqz_7        1   36     36     7    lw_7        5   50    60    7
bge_4         5   4      6      4    lw_l_11     1   950   950   11
bge_6         2   16     16     6    lw_l_12     4   1030  1674  12
bgeu_4        5   -4     6      4    lw_l_4      1   -4    -4    3
bgez_4        3   0      0      1    lw_l_5      1   -10   -10   5
bgez_5        1   -10    -10    5    lw_l_6      1   -24   -24   6
blt_5         2   -12    -10    5    mret_4      1   0     0     1
bltu_4        3   4      6      4    ori_4       8   4     6     4
bltu_5        2   8      8      5    ori_6       2   16    16    6
bltz_4        4   4      4      4    sb_4        17  -2    0     2
bltz_6        1   -18    -18    6    sb_5        1   -12   -12   5
bne_7         2   54     58     7    sb_6        4   16    27    6
bnez_4        11  -8     6      4    slli_4      44  0     3     3
bnez_5        6   -14    14     5    slli_5      2   8     8     5
bnez_6        8   -22    22     6    slti_4      4   0     0     1
bnez_7        1   -48    -48    7    srai_4      3   1     1     2
j_4           4   -8     4      4    srli_4      1   4     4     4
j_5           6   -12    10     5    srli_5      4   8     12    5
j_6           5   -28    28     6    sw_4        104 -8    6     4
j_7           8   -48    60     7    sw_5        6   8     12    5
j_8           1   86     86     8    sw_6        1   24    24    6
j_9           2   -242   146    9    sw_l_12     3   1380  1538  12
jal_10        17  -502   510    10   sw_l_7      3   -60   -44   7
jal_5         3   -16    14     5    xori_4      4   3     7     4
jal_6         6   -30    28     6    xori_5      5   9     14    5
jal_7         5   -40    58     7    xori_6      1   16    16    6
jal_8         7   -90    112    8
jal_9         5   -256   228    9
jal_l_11      31  -936   934    11
jal_l_12      6   -1518  1728   12
Instructions are listed grouped by immediate width in bits, with widths smaller than 4 grouped together with 4. The columns show:
- Frequency - how many times this instruction appeared in the program, with this immediate width
- Smallest immediate value that was used
- Largest immediate value that was used
- Largest immediate width that was used (which is a bit redundant now)
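As a rough illustration of the bucketing described in the list above (this is not the analysis script from the repo), the sketch below assumes that "immediate width" means the smallest two's-complement width that can hold the value, which matches rows such as addi8_9 (128 needs 9 bits) and lbu_4 (0 needs 1 bit). The names `signed_width` and `tally` are made up for this example.

```python
def signed_width(value):
    """Smallest two's-complement width in bits that can hold value."""
    # For negative values, ~value is the non-negative magnitude to measure.
    return (value if value >= 0 else ~value).bit_length() + 1


def tally(uses):
    """Group (mnemonic, immediate) pairs into rows like the table's.

    Returns {name_width: (frequency, smallest, largest, widest)} where
    widths below 4 are grouped together with 4 (assumed from the post).
    """
    stats = {}
    for mnemonic, imm in uses:
        width = signed_width(imm)
        key = f"{mnemonic}_{max(width, 4)}"
        freq, lo, hi, widest = stats.get(key, (0, imm, imm, width))
        stats[key] = (freq + 1, min(lo, imm), max(hi, imm), max(widest, width))
    return stats


if __name__ == "__main__":
    rows = tally([("jal", -256), ("jal", 228), ("li", 1), ("li", -1)])
    for name in sorted(rows):
        print(name, *rows[name])
```

With this definition both -256 and 228 land in the jal_9 bucket, while 1 and -1 both fall into li_4 with a recorded widest width of 2.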
For example, the "jal_9" line shows that JAL with a 9-bit immediate appeared 5 times, the smallest value used was -256, and the largest was 228.

The analysis script also looks for certain sequences of instructions, and combines them into a single virtual instruction - for example, LUI (load upper immediate) loads the upper 8 bits of a register and ADDI8 (add immediate 8-bit) adds a signed 8-bit number to a register, and they are used together to load a 16-bit value - whereas LI (load immediate) loads a sign-extended immediate into a register but only supports 5 or 6 bits. So LUI+ADDI8 is essentially a wider version of LI. The script combines them so we can see that if LI supported wider immediates it would have been used once with a 9-bit immediate, three times with an 11-bit immediate, etc.

Another example is AUIPC+ADDI8 which together add a 16-bit offset to the program counter, storing the result in a register. This is especially often followed by a JALR instruction, to jump-and-link to where that register is pointing, so the analysis script fuses all of that together into a virtual "JAL_L" instruction. This lets us see whether it would be valuable to widen the immediate in the JAL instruction so that we wouldn't need to issue this combination as many times - and it would appear to be extremely valuable to make that change in the ISA, as JAL_L_11 is even more common than JAL_10 is (the widest currently supported JAL). However, this depends a lot on how much code you've actually written, and a smart linker could also reduce this impact by putting routines that call each other next to each other in memory. Also, I think however wide JAL becomes, it will never be wide enough if more code is written!

I'd also highlight the branches in the table above. Comparisons against zero are more common than comparisons between two registers, and BEQZ and BNEZ seem particularly common.
That might be just the way I write code, but I think it's also partly due to the limited number of registers available - loading a constant into a register just to make a comparison against it is a bit of a waste of a register. So I find myself writing code a bit differently, so that I can make these comparisons without needing an extra register - here's a fragment from my "gets" implementation:
Code:
.again:
    call getchar

    addi a0, a0, -127
    beqz a0, .backspace
    bgez a0, .again

    addi a0, a0, 127 - 10
    beqz a0, .enter

    addi a0, a0, 10 - 32
    bltz a0, .again

    addi a0, a0, 32

    sb a0, (s0)
    addi s0, s0, 1

    call putchar
    j .again
You can see that I'm avoiding loading various constants into another register before comparing against them by instead just modifying a0 multiple times and comparing against zero. I don't know whether this is something you'd do in normal RISC-V as well, but it makes a lot of sense with this architecture with limited available registers. Having gone through writing the assembler, I do wonder how easy it would have been (or would be) to add a new target to GNU binutils that does all these things. Last time I looked into that it seemed very hard to figure out where to get started. Or maybe LLVM's assembler is easier to adapt.
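To make the compare-against-zero chain in the "gets" fragment easier to check, here is a small Python model of it. This is only a sketch: it assumes getchar returns a byte value from 0 to 255, and the names `classify_chained` and `classify_direct` are invented for the example. The first function mirrors the addi/branch sequence step by step, the second uses ordinary comparisons against constants.

```python
def classify_chained(c):
    """Mirror the assembly: adjust a0 repeatedly and only compare with zero."""
    a0 = c - 127             # addi a0, a0, -127
    if a0 == 0:              # beqz a0, .backspace
        return "backspace"
    if a0 >= 0:              # bgez a0, .again
        return "ignore"
    a0 += 127 - 10           # addi a0, a0, 127 - 10
    if a0 == 0:              # beqz a0, .enter
        return "enter"
    a0 += 10 - 32            # addi a0, a0, 10 - 32
    if a0 < 0:               # bltz a0, .again
        return "ignore"
    a0 += 32                 # addi a0, a0, 32 (a0 is the original character again)
    return ("store", a0)


def classify_direct(c):
    """The same decisions written with comparisons against constants."""
    if c == 127:
        return "backspace"
    if c > 127:
        return "ignore"
    if c == 10:
        return "enter"
    if c < 32:
        return "ignore"
    return ("store", c)


if __name__ == "__main__":
    mismatches = [c for c in range(256)
                  if classify_chained(c) != classify_direct(c)]
    print("mismatches:", mismatches)
```

Each adjustment folds "undo the previous constant" and "subtract the next one" into a single addi, which is why no second register is needed.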
|
| Sun Oct 26, 2025 10:05 am |
|
 |
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1846
|
Ah yes, another small feedback loop - you learn to code with the ISA you've designed, and you adjust accordingly. Very satisfying.
It would of course be rather marvellous to connect into an existing binutils - both GNU and LLVM have struck me as quite heavy, but of course some people have the software skills I lack!
|
| Sun Oct 26, 2025 1:59 pm |
|
 |
|
gfoot
Joined: Sat Oct 04, 2025 10:54 am Posts: 9
|
This looks interesting, regarding adding new targets to binutils: https://www.opensourceforu.com/2010/01/ ... hitecture/ It at least maps things out a bit. I did look at LLVM in this regard, but got the impression that their assembler might not be very mature anyway - it sounds like for a long time their tool frontends were being used with the GNU Assembler as the back end. That seems to have changed for some target types now, but not all. I do think that LLVM is very well thought-out and structured - I expect it carries less baggage than GNU binutils - but maybe also less maturity as a result. Architectures like RISC-V seem to be particularly tricky for binutils to work with, as the linker has to do a lot more extra work than I believe is necessary in most architectures, changing some instructions around significantly based on where symbols have ended up. These things often contract two instructions into one, but that then moves all the other symbols later in that section, and that could mean that now an instruction that appeared earlier in the section can also be written in a shorter form, so then everything moves yet again... My assembler also has to deal with this and there are situations where you can get in an infinite loop with instructions becoming shorter in one pass and then longer in the next pass. It is quite complicated. So if adding a new target to binutils, it would be nice to be able to reuse all that logic that's already there, if possible. It could be interesting to look into one day but is also a bit of a distraction, you have to pick your battles!
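For readers who have not met this problem, here is a minimal sketch of one common way to keep such a relaxation loop from oscillating: start with every jump in its short form and only ever grow jumps that do not reach, never shrink them again. This is not a description of how my assembler or binutils works, and it can leave a jump long that a smarter pass would have shortened. The sizes and range are illustrative assumptions (a 2-byte short jump with a 10-bit signed offset, and a 6-byte long form standing in for a three-instruction AUIPC+ADDI8+JALR sequence), and the item format and function names are made up.

```python
SHORT, LONG = 2, 6               # assumed encodings, in bytes
SHORT_RANGE = range(-512, 512)   # assumed reach of the short form


def layout(items, long_jumps):
    """Assign addresses given the set of jump indices currently long."""
    addr, labels, sites = 0, {}, []
    for idx, (kind, arg) in enumerate(items):
        if kind == "label":
            labels[arg] = addr
        elif kind == "jump":
            sites.append((idx, addr, arg))
            addr += LONG if idx in long_jumps else SHORT
        else:                    # "pad": arg is a byte count of other code
            addr += arg
    return labels, sites, addr


def relax(items):
    """Grow-only relaxation: repeat until every short jump reaches."""
    long_jumps = set()
    while True:
        labels, sites, size = layout(items, long_jumps)
        grow = {idx for idx, addr, target in sites
                if idx not in long_jumps
                and labels[target] - addr not in SHORT_RANGE}
        if not grow:
            return long_jumps, size
        long_jumps |= grow       # the set only grows, so the loop must stop


if __name__ == "__main__":
    program = [("jump", "a"), ("pad", 506), ("jump", "b"),
               ("label", "a"), ("pad", 510), ("label", "b")]
    print(relax(program))
```

In the example program the second jump is just out of range, and growing it pushes label "a" far enough away that the first jump has to grow on the following pass, which is the kind of knock-on movement described above.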
|
| Sun Oct 26, 2025 4:14 pm |
|
 |
|
DockLazy
Joined: Sun Mar 27, 2022 12:11 am Posts: 60
|
SRAM is a good choice for a register file as it works well down to a 2 cycle per instruction machine, as register file writeback can happen in parallel with instruction fetch. It also gives the option of having multiple register sets or multithreading with the addition of a thread scheduler.
gfoot wrote:
> I'd also highlight the branches in the table above. Comparisons against zero are more common than comparisons between two registers, and BEQZ and BNEZ seem particularly common. That might be just the way I write code, but I think it's also partly due to the limited number of registers available - loading a constant into a register just to make a comparison against it is a bit of a waste of a register. So I find myself writing code a bit differently, so that I can make these comparisons without needing an extra register - here's a fragment from my "gets" implementation:
I think it was Alpha that used branches that only compared to zero. So it's probably not just your code.
|
| Tue Oct 28, 2025 6:17 am |
|
 |
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1846
|
May be of interest, I made a new thread Easy RISC-V (online tutorial with emulation in browser)
|
| Tue Oct 28, 2025 9:47 pm |
|
 |
|
gfoot
Joined: Sat Oct 04, 2025 10:54 am Posts: 9
|
DockLazy wrote:
> SRAM is a good choice for a register file as it works well down to a 2 cycle per instruction machine, as register file writeback can happen in parallel with instruction fetch. It also gives the option of having multiple register sets or multithreading with the addition of a thread scheduler.
I had thought about it - I would want it to work within a single cycle, but that could probably be arranged with fast-enough RAM. My clocks and buses are going to look something like this:
Attachment: gf-rv16-cycles.png
The PHI2 cycle is a bit like the 6502's, with external memory access happening while PHI2 is high. The CPU activity is mostly synchronised when PHI2 is low, though some things do depend on other transitions and the chains of asynchronous activity can overlap well into the PHI2-high phase (shown here by BUS_C not becoming stable until late in the cycle).
- MCA means "MicroCode Address", which changes in response to PHI1 going high (either being incremented, or loaded with a fresh value if it's the start of an instruction)
- CTL is an aggregate of all internal control signals, which are set based on decoding the next microcode instruction
- BUS_A is one half of the ALU's input buses, and is mostly set based on constant values or immediates encoded in the instruction
- BUS_B is the other half, and this can come from registers or (to some extent) memory look-ups. What is illustrated here is a register lookup, which takes a while before BUS_B is reliable.
- BUS_C is the ALU's output, so there's some delay after BUS_A and BUS_B are settled before this has a good value.
Memory writes always come from BUS_B (as they are always from registers, never involving arithmetic), so as far as external memory is concerned it doesn't matter if BUS_C is quite late in the cycle. But register writes do come from BUS_C (or from memory reads) and these are only really reliable at the falling edge of PHI2. So in a nutshell with SRAM for the register file it should be possible to fit register writes in at the start of PHI2's low period and register reads towards the end of that low period, possibly overlapping into the high period. I haven't thought through exactly how that would affect the control signals though, e.g. to ensure a certain duration for the SRAM write before trying to start the SRAM read. It may be better to actually delay register reads until PHI2 does go high, so that the whole of the low phase is available for any writes to complete. Regarding instruction fetch - I am doing that in parallel with other things already, which may involve other register operations, so I'm not sure that the writeback can be overlapped with it as well, at least not for the entire cycle (only as discussed above).
|
| Wed Oct 29, 2025 6:29 pm |
|
 |
|
gfoot
Joined: Sat Oct 04, 2025 10:54 am Posts: 9
|
Talking of instruction-fetch overlapping, here is where I am now with the microcode, including instruction fetch. It's all fully-simulated in the simulator and seems to work well.

First here is a simple instruction - Load Immediate - which just loads an immediate value (embedded in the opcode) into a register:
Code:
mnemonic   . cycle last high bus_a bus_b reg_r aluop mar_w reg_w reg_win pc_w mem_w if
.
LI rd, imm . 0     .    low  imm   zero  .     ad0   .     rd    alu     .    .     if
           . 1     1    high imm   zero  .     adc   .     rd    alu     .    .     if
It's a two-cycle instruction (as the buses are only 8-bit, almost everything takes at least two cycles). The first is a low cycle, the second a high one - this affects things like register reads, register writes, and memory access. Other than that the two cycles are almost identical.

In each cycle, bus A is driven with the immediate value taken from the instruction encoding - it's probably 5 or 6 bits wide for LI, signed and hence sign-extended. Bus B is zero because we don't really want to do any arithmetic. No registers are read, the ALU runs an ADD operation with 0 carry, the MAR is not written, but a register is written (rd in the instruction encoding), and the value written to the register comes from the ALU. The PC is not written, memory is not written, and both cycles are instruction fetch cycles.

Instruction fetch takes two cycles because instructions are 16-bit. Whether the instruction fetch on a particular cycle is for the low or high byte depends on the cycle number being even or odd - not the "high" control signal (though that usually drives this sort of thing). On an instruction fetch cycle the memory address is driven by the program counter (actually the next value of the program counter), rather than the MAR, with bit 0 of the memory address coming from the low bit of the cycle number (though I might use the microcode address for that in fact). So over the course of these two cycles, the Instruction Register is loaded with a new value ready for the next instruction.

Instructions that modify PC are more complex - here is a Jump-And-Link instruction:
Code:
mnemonic    . cycle last high bus_a bus_b reg_r aluop mar_w reg_w reg_win pc_w mem_w if
.
JAL ra, imm . 0     .    low  imm   pc    .     ad0   .     ra    pcnext  pc_w .     .
            . 1     .    high imm   pc    .     adc   .     ra    pcnext  pc_w .     .
            . 2     .    .    .     .     .     .     .     .     .       .    .     if
            . 3     1    .    .     .     .     .     .     .     .       .    .     if
Again the main work is done in two cycles, but we must also spend two cycles on the instruction fetch because we can't do that until PC has been loaded with the new value, so there are two extra rather dead cycles at the end. The first two cycles compute PC+imm via the ALU, writing the result to the program counter ('pc_w' is set), while simultaneously register 'ra' is written with the value taken directly from 'pcnext' (not via the ALU which is already busy).

So the program counter actually consists of (at least) two registers - 'pc' is a static view of the value of the program counter for the instruction that's currently being executed, and feeds into the ALU where necessary as in the above case. 'pcnext' is a counter that holds the next value for the program counter - this is usually 2 greater than 'pc', but during a jump instruction it gets written with something different, as in this case. 'pcnext' is copied to 'pc' at the end of the last cycle of an instruction.

Finally here's a more complex instruction - a branch, where we may or may not need to delay the instruction fetch. This particular example is the longest one in my current microcode, it is a particularly awkward case...
Code:
mnemonic          . cycle last high bus_a bus_b reg_r aluop mar_w reg_w reg_win pc_w mem_w if
.
BNE rs1, rs2, imm . 0     .    low  0xff  regs  rs2   xor   mar_w .     .       .    .     if
                  . 1     .    low  mar   regs  rs1   xor   mar_w .     .       .    .     if
                  . 2     .    low  mar   marn  .     ad0   .     .     .       .    .     .
                  . 3     .    high 0xff  regs  rs2   xor   mar_w .     .       .    .     .
                  . 4     .    high mar   regs  rs1   xor   mar_w .     .       .    .     .
                  . 5     .    high mar   zero  .     adc   .     .     .       .    .     .
                  . 6     z    high 0xff  zero  .     adc   .     .     .       .    .     .
                  . 7     .    low  imm   pc    .     ad0   .     .     .       pc_w .     .
                  . 8     .    high imm   pc    .     adc   .     .     .       pc_w .     .
                  . 9     .    .    .     .     .     .     .     .     .       .    .     if
                  . 10    1    .    .     .     .     .     .     .     .       .    .     if
It will always cost four cycles to read two 16-bit registers 8-bits at a time, two cycles to set pc, and two cycles for a final instruction fetch, so the shortest this could be is eight cycles; but it takes three more in this implementation. During the first two cycles we still do an instruction fetch in parallel as usual, to cover the case where the branch is not taken. Some branch instructions like BEQ are also able to early out as soon as the end of the second cycle, if they can already determine that the branch shouldn't be taken. BEQ is really hard though. The first two cycles compute (0xff ^ (rs1 & 0xff) ^ (rs2 & 0xff)) and the third cycle adds on another 1 if the result (temporarily stored in the MAR) was negative. This causes the internal carry flag to get set if and only if the initial result was 0xff, indicating equality of the low bytes of the registers. The fourth and fifth cycles then compute the same expression but for the high byte, and this time on the sixth cycle we add zero with carry - so if the carry was 1 from the third cycle, and the high byte calculation resulted in 0xff (equality), then when we add the carry, the carry will again be set. The seventh cycle then adds zero to zero with carry, resulting in zero if the numbers were not equal and one if they were equal. The 'last' column shows a 'z' meaning this is conditionally the last cycle - in the case that the ALU result from this cycle was zero. So if the numbers were not equal, the ALU result will be zero and this will be the last cycle; the next one will begin execution of the instruction that was already fetched in the first two cycles. If the numbers were equal, on the other hand, this is then not the last cycle, and the remainder of the instruction is basically the same as the jump instruction above, but without the link register being written. This BNE instruction is (hopefully!) 
the worst case for this 8-bit internal architecture, and there's some potential to add some shortcuts (and complexity) to speed it up. For example making the interface to the registers be 16 bits wide would allow more direct comparisons between them in fewer cycles. But I don't want to commit to that extra complexity yet. Another possibility is allowing more flexible use of the MAR, which is 16-bit but when used as a temporary we only use 8 bits of it. For now though I'm accepting the longer microcode in exchange for maybe slightly simpler control signal complexity.
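As a sanity check on the carry trick used for the 16-bit equality test, here is a small Python model of cycles 0 to 6 of the branch microcode. It is a sketch under stated assumptions rather than a copy of the simulator: 'marn' is taken to be 1 when bit 7 of the MAR is set and 0 otherwise, the carry flag is taken to change only on the add cycles, and the final cycle follows the table's row 6, which drives 0xff on bus A. The function names are invented for the example.

```python
def add8(a, b, carry_in):
    """8-bit add; returns (result, carry_out)."""
    total = (a & 0xFF) + (b & 0xFF) + carry_in
    return total & 0xFF, total >> 8


def equality_carry(rs1, rs2):
    """Carry flag after cycle 5: 1 if the two 16-bit values are equal."""
    # Cycles 0-1: MAR = 0xff ^ low(rs2) ^ low(rs1); 0xff means low bytes match.
    mar = 0xFF ^ (rs2 & 0xFF) ^ (rs1 & 0xFF)
    # Cycle 2 (ad0): add marn with carry-in 0; only 0xff + 1 carries out.
    marn = 1 if mar & 0x80 else 0     # assumed meaning of 'marn'
    _, carry = add8(mar, marn, 0)
    # Cycles 3-4: the same XOR for the high bytes.
    mar = 0xFF ^ ((rs2 >> 8) & 0xFF) ^ ((rs1 >> 8) & 0xFF)
    # Cycle 5 (adc): add zero plus the earlier carry; carries out only if
    # the high bytes match (0xff) and the low bytes matched too (carry 1).
    _, carry = add8(mar, 0, carry)
    return carry


def cycle6_result(rs1, rs2):
    """Cycle 6 as in the table: 0xff + zero + carry, truncated to 8 bits."""
    result, _ = add8(0xFF, 0, equality_carry(rs1, rs2))
    return result


if __name__ == "__main__":
    for a, b in [(0x1234, 0x1234), (0x1234, 0x1235), (0x1234, 0x1334)]:
        print(hex(a), hex(b), equality_carry(a, b), hex(cycle6_result(a, b)))
```

Under these assumptions the carry after cycle 5 is 1 exactly when the two registers are equal, and the row 6 result is zero in that case and 0xff otherwise.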
|
| Wed Oct 29, 2025 7:22 pm |
|