NorthWay wrote:
My first thought was to track each register with a single bit that you can clear in supervisor mode, and which is set whenever the register is written to.
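To make the quoted idea concrete, here is a minimal Python sketch of one "written" bit per register. The class `RegisterFile` and its method names are made up for illustration, and `clear_written` only stands in for whatever privileged instruction would clear the bits; real hardware would of course look nothing like this.

```python
class RegisterFile:
    """Toy model: one 'written' bit per register (hypothetical, for illustration)."""

    def __init__(self, count=32):
        self.regs = [0] * count
        self.written = 0  # bit i set => register i was written since the last clear

    def write(self, index, value):
        self.regs[index] = value
        self.written |= 1 << index  # the hardware would set this bit on every write

    def clear_written(self):
        # Stands in for the supervisor-mode operation that clears the tracking bits.
        self.written = 0

    def save_context(self):
        # Store only the registers whose tracking bit is set.
        return {i: self.regs[i]
                for i in range(len(self.regs))
                if (self.written >> i) & 1}


rf = RegisterFile()
rf.clear_written()        # e.g. done by the OS when it schedules the task
rf.write(3, 0xCAFE)
rf.write(7, 42)
saved = rf.save_context()
print(sorted(saved))      # → [3, 7]
```

Only two of the 32 registers get stored here. Note that the sketch has no way of ever clearing a bit from user mode, which is the problem discussed next.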
I think the biggest problem is the opposite: when to mark a register `unused`.
I'd think that in any non-trivial program, callee-saved registers are never unused. There is always a valid value in them: not always one belonging to the immediate caller, but to some function further up the call chain. You could choose not to have callee-saved registers in your ABI, but then they would need to be caller-saved somehow.
You could perhaps have a `call` instruction sequence with instructions that both save registers and mark them as unused. Those that are already marked unused would not have to be saved again.
Then save the bitmask of the saved registers in the unused high bits of the return address on the stack.
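Roughly what I have in mind, as a Python sketch. Everything here is an assumption made for illustration: 48-bit return addresses, 16 registers (so the mask fits in bits 48..63 of a 64-bit word), and the names `Cpu`, `call` and `ret`. It also ignores return values, which a real ABI would have to exempt from the restore.

```python
ADDR_BITS = 48                       # assumed width of a real return address
NUM_REGS = 16                        # assumed register count; mask fits in bits 48..63
ADDR_MASK = (1 << ADDR_BITS) - 1
ALL_REGS = (1 << NUM_REGS) - 1


class Cpu:
    def __init__(self):
        self.regs = [0] * NUM_REGS
        self.unused = ALL_REGS       # bit i set => register i holds nothing live
        self.stack = []

    def write(self, i, value):
        self.regs[i] = value
        self.unused &= ~(1 << i)     # a write makes the register live

    def call(self, return_addr):
        saved = ~self.unused & ALL_REGS          # save live registers only
        for i in range(NUM_REGS):
            if (saved >> i) & 1:
                self.stack.append(self.regs[i])
        self.unused = ALL_REGS                   # callee starts with all registers unused
        # Pack the save mask into the high bits of the return address.
        self.stack.append((saved << ADDR_BITS) | (return_addr & ADDR_MASK))

    def ret(self):
        word = self.stack.pop()
        saved, return_addr = word >> ADDR_BITS, word & ADDR_MASK
        for i in reversed(range(NUM_REGS)):      # pop in the reverse of push order
            if (saved >> i) & 1:
                self.regs[i] = self.stack.pop()
        self.unused = ~saved & ALL_REGS
        return return_addr


cpu = Cpu()
cpu.write(1, 111)
cpu.write(4, 444)
cpu.call(0x1000)          # saves r1 and r4, plus the tagged return address
cpu.write(1, 999)         # callee uses r1
cpu.call(0x2000)          # nested call saves only r1; r4 is still marked unused
depth = len(cpu.stack)
inner = cpu.ret()
outer = cpu.ret()
```

The nested call pushes one register instead of two because `r4` was already saved and marked unused by the outer call.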
Marking registers unused inside a function could be difficult though. You could also perhaps have an instruction fetcher that looks ahead to see when a register next gets overwritten without being read in between, so as to mark registers `unused` a few instructions beforehand. (But I'd think that could be very complicated to implement.)
Another approach would be to make registers expire automatically.
The Mill architecture does not have registers at all, but a "Belt": It works like a stack machine that you can only push to, and the belt size is bounded: older values get pushed off the belt and are lost.
But the belt can also be shorter than its maximum size. It works a little bit like register windows on function calls in that a callee sees only its parameters on the belt: the caller's belt is saved.
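A toy model of how I understand the belt, in Python. This is only a sketch of the behaviour described above, not of the actual Mill hardware; the class `Belt`, the helper `call`, and the belt length of 4 are all made up for illustration.

```python
from collections import deque


class Belt:
    def __init__(self, max_len=8):          # max_len is an illustrative size
        self.slots = deque(maxlen=max_len)

    def push(self, value):
        # The newest value becomes position 0; when full, the oldest falls off.
        self.slots.appendleft(value)

    def get(self, position):
        return self.slots[position]         # operands are named by belt position

    def add(self, a, b):
        # The result has no explicit destination: it is simply pushed.
        self.push(self.get(a) + self.get(b))


def call(caller, function, *arg_positions):
    # The callee gets a fresh belt holding only its arguments.
    callee = Belt(caller.slots.maxlen)
    for pos in reversed(arg_positions):
        callee.push(caller.get(pos))        # first argument ends up at position 0
    result = function(callee)
    caller.push(result)                     # caller's belt was untouched meanwhile


belt = Belt(max_len=4)
for v in (1, 2, 3, 4, 5):
    belt.push(v)                            # 1 falls off the end
belt.add(0, 1)                              # 5 + 4, pushed at the front
call(belt, lambda b: b.get(0) * b.get(1), 1, 2)
print(list(belt.slots))                     # → [20, 9, 5, 4]
```

The call multiplies the caller's positions 1 and 2 (the values 5 and 4), and the callee never sees anything else from the caller's belt.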
Something in-between The Mill and a conventional architecture is the STRAIGHT architecture.
A side-effect of both STRAIGHT and The Mill is that each instruction's "destination register" is implicit and not part of the instruction encoding. STRAIGHT uses that to get many architectural registers, while The Mill gets shorter instructions.
(Do beware, though, that The Mill's Belt is quite heavily patented. I think the original STRAIGHT design avoids infringing it, but a follow-up paper to it does not.)
Mitch Alsup's My 66000 architecture spares registers by having only 32 64-bit registers in total. The register file is unified for both integer and floating-point ops. Instead of architectural SIMD registers, it can enter a vector-loop mode in which regular operations become vector operations. The hardware's vector length is not architecturally visible; it is hidden by the microarchitecture.
When there is an interrupt (such as before a task switch) the processor falls back to scalar operation, so that the OS never has to store more than the 32 64-bit registers.
I'd think there are big drawbacks to this design though. If I understand it right, the vector mode supports only one-dimensional loops, not multi-dimensional ("polyhedral") loop nests. Nor does it make it easier to use fixed-length vectors such as those used for 3D coordinates, colour spaces, etc.
NorthWay wrote:
Apart from sliding register windows, are there any (modern) approaches to loading and storing registers faster than storing them one by one?
The Itanium stores register windows in a separate on-chip memory. It was supposed to have a "register stack engine" that automatically shuffled the register stack to/from a hidden area in main memory during spare bus cycles. I don't know if there ever was any member of the Itanium architecture that got one, however. One reason why the Itanium was known to be slow was that overflow/underflow of the internal stack memory was instead handled in software by the OS.
The Mill architecture, though, is supposed to have such an engine from the start, and with it, it is supposed to be able to do a context switch in a single clock cycle. In the best circumstances, of course.