View unanswered posts | View active topics It is currently Wed Apr 24, 2024 12:27 am



Reply to topic  [ 237 posts ]  Go to page Previous  1, 2, 3, 4, 5 ... 16  Next
 rj16 - a homebrew 16-bit cpu 
Author Message
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
Hi Rob, you have a lot more experience in hardware design than me, as I am just starting my learning process, but while attempting to pipeline my CPU74 I found that classic stack based call/returns are harder than LR based ones.

- The stack based call involves a register update (SP decrement), a store (PC), and a jump (to an immediate address).
Return is a load, a jump, and a register update (SP increment).

- The LR based call is just a register update (LR) and a jump.
Return is just a register based jump.

Both cases are of course ultimately a jump, but the stack based approach involves a load/store in addition to the jump. If we want to resolve the jump address in the decode stage -in order to reduce jump latency for as much as possible- we need to have the jump address ready on that stage.

For the call, the jump address is obtained from the instruction encoding in both approaches, so it is available early and the jump can be resolved on the decode stage.

However, in the case of the stack based Return, the jump address must be obtained from memory. This I think that slows down the pipeline -unless there's some trick that I'm missing- because that produces a hazard that can only be solved by stalling the pipeline. On the contrary, for the LR based Return, we can detect the instruction on the decode stage and update the PC right away, incurring only on a single fetch delay. Of course, in a number of cases the LR will have to be loaded from memory in an earlier instruction, but this does not need to be always the case, and even so, it does not need to be exactly the previous instruction, so there's either no hazards or any hazards are solved without stalling the pipeline.

The summary is that the Stack based approach is always like the worse case of the LR based mechanism. While the LR based mechanism looks generally faster to me


Thu Dec 10, 2020 8:27 am
Profile

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 123
rj16 has had a link register from the beginning. When I first learned of the idea I was quite enamoured by it. I guess I haven't thought critically about whether or not it's a good idea, so this discussion is really good.


Thu Dec 10, 2020 12:05 pm
Profile

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1783
One characteristic of a microarchitecture is how many read and write ports are needed for the register file. Might it be that the stack pointer approach costs more ports (or more cycles if port-limited)?

I'm sure something has been written somewhere about ARM's approach. Might it be that once you have a load/store multiple instruction, which is useful for register spill and fill at subroutine boundaries, you can then include PC and LR in the set in a natural way, and maybe that's in some sense more efficient? I'm a bit fuzzy on this, I'll admit.


Thu Dec 10, 2020 1:01 pm
Profile

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 123
Here's a funnel shifter design:

Attachment:
File comment: Funnel Shifter
funnel_shifter.png
funnel_shifter.png [ 33.07 KiB | Viewed 964 times ]


Internally it only shifts right, but there's a negate circuit on the shift amount input to make it easier to implement shift left instructions. For those, you swap A & B, which would probably happen in a mux before the funnel shifter.

What I wanted to do is see if Digital could fit the shifter in a small CPLD, like the ATF1508, but Digital can't generate equations for more than 24 inputs, so I would have to figure out some other way. There's some really old windows software that in theory could take a verilog file and compile it for the ATF1508... I would want to have high confidence that it would fit before I mess with windows though (I haven't used windows in 20 years).

Anyway, my thought is to save 20 chips by using the smallest programmable logic device possible. Though, now that I have figured out the circuit, I don't think it's 20 chips, I think it's more like 15 or 16. But if there's options for saving the design time and chip costs by cheating a little bit with a programmable logic device, then I would feel better adding it to rj16.


Last edited by rj45 on Fri Dec 11, 2020 12:38 am, edited 1 time in total.



Thu Dec 10, 2020 3:44 pm
Profile

Joined: Mon Oct 07, 2019 2:41 am
Posts: 593
TI had a 16 bit shift unit, 74AS897.
The 74F350 is 4 bit shift unit. This may be hard to find, so replacing it with
a PAL is fair game. 74S350 (TI).
The PAL databook also gave a barrel sifter using PAL's.
Shift right,shift left and scale by 2 or 4 or 8 are the most common shifts.


Thu Dec 10, 2020 4:11 pm
Profile

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 123
Yeah the 74350 I couldn't find for sale (new) anywhere, but it fits in a 22V10 and possibly it can fit in a 16V8. The only issue is I think it's around 9 GALs for the funnel shifter and that's a lot of power draw.

I am pretty sure it will fit in a CPLD now, though. The 74350 is just a 4:1 mux. I am pretty sure the whole funnel is just two layers of 4:1 muxes and a final 2:1 mux. That should be no problem for a CPLD.

So... rj16 is going to get a funnel shifter :D


Fri Dec 11, 2020 3:45 am
Profile
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
BigEd wrote:
One characteristic of a microarchitecture is how many read and write ports are needed for the register file. Might it be that the stack pointer approach costs more ports (or more cycles if port-limited)?

I'm sure something has been written somewhere about ARM's approach. Might it be that once you have a load/store multiple instruction, which is useful for register spill and fill at subroutine boundaries, you can then include PC and LR in the set in a natural way, and maybe that's in some sense more efficient? I'm a bit fuzzy on this, I'll admit.

These are interesting points.

I am unsure if the first statement is true in a general way, particularly if the PC is part of the general register set, but it really can be the case at least in architectures where the PC is implemented as a separate register. That's because the stack based call/return could be the only instruction requiring the PC to be stored/load from memory, and that would imply that additional hardware (such as ports or plexes) could be required only to support these instructions.

About the ARM approach involving multiple load/store instructions, and including the LR as a general register which can be specified in such multiple instructions, I think this is the whole point of the concept. In effect, this helps having a calling/return convention that is relatively cheap compared with a classic stack based convention. Furthermore, the multiple load/store instructions can increment/decrement the base address pointer (in this case the SP) on the same instruction, they are in effect like very efficient Push/Pop instructions on steroids.

Also, ARM multiple load/store instructions always store registers with the lower register number in the lower memory address, starting from the lower register number in the case of stores and the opposite in the case of loads. Since the Link Register (LR) is a high number register, it is always the last to be stored, and it is the first to be load. To my understanding, this is a clever architectural decision to prevent pipeline hazards on any following return from subroutine instruction, which is just a move from LR to PC. To make things even faster, the return instruction is not even resolved on the execute stage but on the the decode stage. The net result is a very fast calling convention that gets very little influence of the pipeline.


Fri Dec 11, 2020 8:41 am
Profile
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
Actually, I now found that I had my understanding or the ARM stack load/store reversed. It turns out that it is even better than I thought!.
In fact, register saves at the function prologue is performed by the STMDB instruction. This means Store Multiple, Decrement Before. We can consider the following instruction that is placed at the begining of a function
Code:
stmdb    sp!, {r0-r4, lr}
This stores registers R0 through R4 and the LR to the stack. The R0 is stored in the lower address, the LR is stored at the higher address, and the SP is decremented accordingly.

At the end of the function, the following instruction is placed
Code:
ldmia    sp!, {r0-r4, pc}
This means Load Multiple, Increment After. Registers R0, through R4 are restored from the stack, but note that the previously saved LR is now being restored directly to the PC, with no additional instruction. The SP is also incremented accordingly. This implies that with one single instruction we are both restoring the stack saved registers, incrementing the SP, and returning from the subroutine. No need for explicit return instruction anymore (!!)


Fri Dec 11, 2020 9:24 am
Profile

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 123
I wonder if ARM implements the multiple register stores/loads without a pipeline stall. If you have the stack in an L1 cache and lots of read/write ports on the register file, and lots of spare transistors lying around wanting to be useful, I think it's entirely possible they do. That essentially makes function calls free -- and in that sense compilers should stop inlining functions as an optimization. Though, inlining allows the compiler optimize with more scope, so I guess it still has some value.

For rj16, one of the architecture decisions I have made is that all instructions should be one cycle, or in other words, they don't stall the pipeline to do another step of computation (though memory access and hazards might stall). This is a big simplification to the microcode -- it's just an opcode to control signal lookup table. But of course that means we sacrifice code density, because it takes multiple instructions to do a lot of common things. Adding an instruction that loads/stores multiple registers would save a lot of code space.

Another constraint on the microcode is trying to fit the microcode into what passes for a ROM on an FPGA. I am still a newbie to FPGAs but I believe there's two options: You can initialize a block RAM to act like a ROM, but they are only 256x16 bits big and the ice40 doesn't have a lot of them. The other option is to build a ROM from LUTs (which are basically programmable 4-input logic gates). That will use a lot of LUTs depending on the design of the microcode, but if the opcodes are laid out in opcode-space with good relationship with the control signals they generate, I am hoping that will help reduce LUT count. But currently the v6 microcode rom is only 96 x 26 bits.

Of course all this optimization is moot if it's built in TTL with a big fat flash ROM for microcode storage. But putting the microcode in GALs would be an option.

Anyway, just as a side note, here's the above funnel shifter using 4:1 muxes. I think the propagation delay on 4:1 muxes is the same as 2:1 muxes, so this should be a faster design (the negate circuit fits in a GAL and can be simplified to one layer of and+or). The first layer shifts by 4, the second by 1. The last mux handles the case where the shift amount is zero when dir is 1, but it might be possible to do without it.

Attachment:
File comment: 4:1 mux funnel shifter
2-level_funnel_shifter.png
2-level_funnel_shifter.png [ 33.06 KiB | Viewed 938 times ]


Fri Dec 11, 2020 12:40 pm
Profile

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1783
I found the oddity I was vaguely remembering. There is an interesting wrinkle in the ARM implementation, which you don't see in the architecture. See this post and the nearby ones and the links within:
Quote:
[Furber] says that initially the order of writes to stack was as you'd expect (which means the stack pointer need only ever be incremented or decremented) but when at a late stage they decided they would want to implement an abort mechanism, they needed the PC to be updated last, so that an aborted multiple load would be restartable.

That was me, perhaps getting close to the truth. Further downthread, we see Ken Shirriff quoted:
Quote:
The solution used in the ARM1 is to always read or write registers starting with the lowest register and the lowest address. In other words, to pop R2, R1, R0, the ARM1 jumps into the middle of the stack and pops R0, R1, R2 in the reverse order. It sounds crazy but it works. (The bit counter determines how many words to shift the starting position.) The consequence of this redesign was that the circuitry to decrement addresses and priority encode in reverse order is never used. This circuity was removed from the ARM2.


Sorry, these ARM details are a little off-topic for your thread, rj45, but hopefully as interesting to you as they are to me.


Fri Dec 11, 2020 1:01 pm
Profile
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
rj45 wrote:
I wonder if ARM implements the multiple register stores/loads without a pipeline stall. If you have the stack in an L1 cache and lots of read/write ports on the register file, and lots of spare transistors lying around wanting to be useful, I think it's entirely possible they do. That essentially makes function calls free -- and in that sense compilers should stop inlining functions as an optimization. Though, inlining allows the compiler optimize with more scope, so I guess it still has some value.

For rj16, one of the architecture decisions I have made is that all instructions should be one cycle, or in other words, they don't stall the pipeline to do another step of computation (though memory access and hazards might stall). This is a big simplification to the microcode --

The ARM multiple load/stores are efficient but definitely not cheap. To start, the processor was not designed for one cycle instructions, as this was rejected early in the development cycle. Very notably, the 3 stage pipelined ARM1 and ARM2 https://en.wikichip.org/wiki/acorn/microarchitectures/arm2, and the more recent ARM Cortex-M0 through Cortex-M4 processors, take 2 or 3 cycles to execute the single word load/store instructions. Despite having "RISC" in their name, the "Advanced Risc Machines" make use of microcode and a decode PLA for instruction decoding http://www.righto.com/2016/02/reverse-engineering-arm1-processors.html. The microcode sequencer implements a loop for the multiple load/store instructions, and these instructions can take one or two cycles per register being stored/loaded according to the linked document.


Fri Dec 11, 2020 9:43 pm
Profile

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 123
Okay, so I want to talk about something hypothetical here. I am really not sure if I want to go down this path, but I would like to explore it a bit and gather people's feedback and thoughts about it.

Say I want to support more than 64KB of data. LCC doesn't support segmented pointers or multiple sizes of pointers (like near or far pointers), banked memory, overlays etc. It's basically either all pointers are 16-bit pointers or all pointers are 32-bit pointers. I'd like C programs to just have a linear address space to deal with, rather than have to manage segments or banks manually.

So, if pointers are 32-bit, there's two options for storing them in registers: some registers can be 32-bits and with some patience you can teach LCC to store 32-bit values only in those registers -- but it's easier not to. The easiest option is storing 32-bit values in adjacent register pairs, and I would prefer that.

But it's easy to make some registers 32-bit because LCC already treats them in a special way: the program counter, link register, stack pointer and a base pointer, so I think I would do that and pull them out of the register file into special registers, adding instructions required to manipulate them. These would usually only be used with relative addressing, so that would save program space. A base pointer isn't strictly necessary, it's just easy to do.

As for addressing modes, I think it makes sense to have base relative addressing for global variables and PC-relative addressing for branches. Not sure yet if literals (constants) will be PC-relative or base-relative. I think also, having the ability to easily use a 16-bit integer as an index into a chunk of memory specified with a 32-bit pointer would be very useful.

So given the formula Base + Index*Scale + Displacement, with base being either PC, SP, BP or a register pair, I think I would want these addressing modes:

1. Base plus immediate displacement
2. Base plus scaled index register plus displacement, the index register and scale would be provided by a prefix instruction

So, what is the simplest (hardware wise) way to implement that?

The special 32-bit base registers with a displacement are probably straight forward. A cheat would be if they produce a carry to the high word, it stalls the processor to calculate the high word but otherwise just passes the high word through.

Using a register pair as a base is more complicated though. I think I have these options, but I am curious if there's more:

1. If I move from a lookup table for control signals, and have multiple micro-ops per instruction, then it's relatively easy to run a 32-bit address 16-bits at a time through the main ALU to a memory address register, then do the load/store/jump.

2. Without micro-ops, it's possible to have a local state machine, and stall the processor while multiple steps happen. And so this is similar to option 1 otherwise.

3. Prefix instructions that calculate the low half of an address into a temporary register, which the following instruction uses after calculating the high half. A large immediate may also need a prefix instruction, so you might have memory ops using several prefixes and so that's not really ideal.

4. Add a read port or two to the register file, and implement a separate 32-bit address adder.

5. Make the register file 32-bits wide with muxes on the outputs to select which half of the register to use. This keeps 2 read ports and a single write port, but the write port would need to be able to separately write each half of the register or the whole register at once. And of course, there would be a 32-bit adder for addresses.

Thoughts? I know 8-bit processors do this all the time. So in some sense I am just trying to double an 8 bit processor to 16-bits. What's the simplest / easiest to implement way you've seen that done?


Sat Dec 12, 2020 4:06 pm
Profile

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1783
> Thoughts? I know 8-bit processors do this all the time. So in some sense I am just trying to double an 8 bit processor to 16-bits. What's the simplest / easiest to implement way you've seen that done?

I think the 6502 perhaps does the least by way of 16 bit affordances: any 16 bit addresses are held in memory, and you can only apply an 8 bit index to them. We had some fun making the simplest-possible 16 bit extension, in the form of the 65Org16, which simply acts as a 6502 but with 16 bit bytes. No facility at all for 8 bit operations. Of course that leaves low-hanging fruit, to add some bytewide operations, or at least a byteswap. And it gives a 32 bit address space for free (a space of 16 bit words) - if you want to make it smaller, to fit a 16 bit address, there's again some optimisation or simplification possible.

The OPC series, similarly, are simple machines, word-oriented, with perhaps a nod to byte swapping.

What's more involved is the usual approach, allowing both byte and word sized operations, and worrying about sign-extending some values (but perhaps not all.) For me, this is all too much, but for others it's exactly what they want to be building.

About complexity though: you have to build it, and test it, and document it, and convince your compiler or your users to make good use of it. So I would always advise trying to simplify: perhaps keep some ideas in reserve, and explore a minimal viable subset.


Sat Dec 12, 2020 5:51 pm
Profile

Joined: Mon Oct 07, 2019 2:41 am
Posts: 593
Lea is a needed instruction, once you have more complex addressing modes.
64Kb is just ample for a tiny OS like cp/m or mini-flex. They have a simple file format that
is compact but at the price of reading small sectors and small floppy discs.
A MMU is needed with a more moden O/S, replacing use of base registers for time sharing.
OS/9 level II (6809) was good example. 74LS610 / 74LS612 does all the work, if can find one.


Sun Dec 13, 2020 12:56 am
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
Thoughts? I know 8-bit processors do this all the time.

The 6502 has an interesting way of allowing access to memory via indirect pointers stored in zero page. The 65816 can use 24-bit pointers stored in zero-page memory even though the registers are only 16-bit. Maybe some sort of memory indirect addressing would work. 6809 also have indirect memory addressing.
I think you’ve hit upon one of the conundrums of computing. Address space limitations.
Quote:
74LS610 / 74LS612 does all the work, if can find one.

74LS612 is a great chip, a little on the sluggish side. It’s really just a dual-ported ram chip. They can also be used as register files for a cpu.

_________________
Robert Finch http://www.finitron.ca


Sun Dec 13, 2020 6:09 am
Profile WWW
Display posts from previous:  Sort by  
Reply to topic   [ 237 posts ]  Go to page Previous  1, 2, 3, 4, 5 ... 16  Next

Who is online

Users browsing this forum: No registered users and 10 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
cron
Powered by phpBB® Forum Software © phpBB Group
Designed by ST Software