 Faster process switching approaches? 
Joined: Thu Jan 17, 2013 4:38 pm
Posts: 56
I've been pondering two loosely connected thoughts about process switching. If anyone knows whether something similar has been done before, please let me know:

Apart from sliding register windows, are there any (modern) approaches to loading and storing registers faster than storing them one by one? (Loop with register number in another register??)
A type of "movem" or the ARM A26 equivalent is what I would label as not modern.

My first thought was to track each register with a single bit that you can clear in supervisor mode, and which is set whenever the register is written to.
The idea is to make a new store instruction that will become a NO-OP if the register hasn't been updated between switches.
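
To make the dirty-bit idea concrete, here is a minimal software model in C. Everything in it is hypothetical (the `cpu_t` type, the function names, a 32-register file), and it assumes the save area still holds the values from the previous switch, so skipping an unmodified register is safe. It is a sketch of the proposal, not a description of any existing CPU.

```c
#include <stdint.h>
#include <stddef.h>

#define NUM_REGS 32  /* assumed register count */

/* Hypothetical model: register file plus one "written" bit per register. */
typedef struct {
    uint64_t regs[NUM_REGS];
    uint32_t dirty;  /* bit i set => register i written since the last clear */
} cpu_t;

/* Any ordinary register write also sets the matching tracking bit. */
static void write_reg(cpu_t *cpu, unsigned r, uint64_t value) {
    cpu->regs[r] = value;
    cpu->dirty |= (uint32_t)1 << r;
}

/* Supervisor-only operation: clear all tracking bits. */
static void clear_dirty(cpu_t *cpu) {
    cpu->dirty = 0;
}

/* The proposed store: acts as a no-op when the register is unmodified.
   Returns 1 if a store was performed, 0 if it was skipped. */
static int store_if_dirty(const cpu_t *cpu, unsigned r, uint64_t *save_area) {
    if ((cpu->dirty & ((uint32_t)1 << r)) == 0)
        return 0;
    save_area[r] = cpu->regs[r];
    return 1;
}

/* Save a whole context; returns how many stores actually happened. */
static size_t save_context(const cpu_t *cpu, uint64_t *save_area) {
    size_t stores = 0;
    for (unsigned r = 0; r < NUM_REGS; r++)
        stores += (size_t)store_if_dirty(cpu, r, save_area);
    return stores;
}
```

In this model a task that wrote only two registers since its last switch-in costs two stores instead of 32; the supervisor would call `clear_dirty` right after restoring a context.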

The second thought was to introduce a new store instruction that stores a group of N adjacent registers (together the size of one cache line) to cache only, e.g. registers 0-7, 8-15, 16-23 and 24-31.
That would require datapaths the width of a cache line, so that all N registers can be transferred at once. It would also not need the cache line to be loaded before use (just expunged/invalidated), since the whole line gets overwritten.

Or combine both in the hope of getting some "free" performance at times.
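
A rough C model of the combined scheme, assuming 32 64-bit registers and 64-byte cache lines (so 8 registers per line). The names `line_t`, `store_group` and `save_dirty_groups` are made up for illustration; the `memcpy` stands in for the single line-wide transfer the hardware would do.

```c
#include <stdint.h>
#include <string.h>

#define NUM_REGS      32
#define REGS_PER_LINE 8   /* 8 x 64-bit = one 64-byte line (assumed) */
#define NUM_GROUPS    (NUM_REGS / REGS_PER_LINE)

/* One cache line's worth of saved registers. */
typedef struct {
    uint64_t words[REGS_PER_LINE];
} line_t;

/* Store group g (registers 8g .. 8g+7) as one line-wide transfer.
   The whole line is overwritten, so its old contents never need fetching. */
static void store_group(const uint64_t regs[NUM_REGS], unsigned g, line_t *save) {
    memcpy(save[g].words, &regs[g * REGS_PER_LINE], sizeof(line_t));
}

/* Combined scheme: skip a group when none of its per-register dirty bits
   are set. Returns the number of line-wide stores performed. */
static unsigned save_dirty_groups(const uint64_t regs[NUM_REGS],
                                  uint32_t dirty, line_t *save) {
    unsigned stores = 0;
    for (unsigned g = 0; g < NUM_GROUPS; g++) {
        uint32_t group_mask = (uint32_t)0xFF << (g * REGS_PER_LINE);
        if (dirty & group_mask) {
            store_group(regs, g, save);
            stores++;
        }
    }
    return stores;
}
```

Note that the skip only pays off at group granularity here: one modified register still costs a full line store, which is why the gain would be "free at times" rather than guaranteed.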

---

(What do modern cpus do to switch faster? Opcode fusion?)


Fri Jun 06, 2025 9:02 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1832
I believe the 64-bit ARM approach is to provide instructions that load and store pairs of registers: faster than singles, not too big a hit on interrupt latency, and maybe helpful for the cache. I'm mildly surprised to see that it applies to any pair in any order, not just adjacent odd/even pairs:
Quote:
These LDP and STP pair instructions transfer two registers to and from memory. Registers are processed in operand order, from left-to-right. That is, the first register operand is loaded or stored first, and the second register operand is loaded or stored next.
...
Remember that in AArch64 the stack-pointer must be 128-bit aligned.

(presumably alignment of the stack pointer is also a performance/simplicity/cache-friendly decision?)

It looks like RISC-V has a vector extension which brings in multiple register operations.


Sat Jun 07, 2025 5:34 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2358
Location: Canada
I do not think there is a magic solution that has not been thought of.

For limited context switching, I had a version of the 68k that stored all the registers at once to a wide memory within a clock cycle or two. IIRC the memory was about 600 bits wide to store all 16 regs, the PC and SR. The memory was a dedicated RAM capable of storing up to 512 contexts. This sort of thing may be possible with a small register set for some embedded app that needs super-fast context switching, as long as there are not many threads running. 512 contexts is not very many for a modern machine.
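
As a rough illustration of the dedicated context RAM, here is a small C model. The layout and names (`context_t`, `context_ram`, `switch_context`) are assumptions for the sketch, not the actual design. Counting only 16 32-bit registers, a 32-bit PC and a 16-bit SR gives 560 bits per context, which is in the neighbourhood of the roughly 600 bits mentioned.

```c
#include <stdint.h>

#define NUM_CONTEXTS 512

/* Assumed per-context state: 16 x 32-bit registers, PC and SR. */
typedef struct {
    uint32_t regs[16];  /* D0-D7 and A0-A7 */
    uint32_t pc;
    uint16_t sr;
} context_t;

/* Bits of architectural state per context under the assumptions above. */
enum { CONTEXT_BITS = 16 * 32 + 32 + 16 };  /* 560 */

/* Models the dedicated wide RAM: one entry per context. */
static context_t context_ram[NUM_CONTEXTS];

/* A switch is one wide write followed by one wide read.
   In hardware each struct copy would be a single RAM access. */
static void switch_context(context_t *live, unsigned from, unsigned to) {
    context_ram[from] = *live;
    *live = context_ram[to];
}
```

At 560 bits each, 512 contexts come to about 35 KB of dedicated storage in this model, which gives a feel for why scaling it to thousands of threads gets expensive.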

I do not think there is much beyond storing registers a few at a time in a general-purpose processor. I have heard of at least one design storing registers a cache-line at a time (8 regs for a 64-bit CPU). I think it is just not worth the extra hardware and trouble to try and store them all at once. It is bound to make the register file more complex, more ports, and worse timing for something that does not usually happen that often at gigahertz speeds.

Right now my PC is running about 6,000 threads. Dedicated storage for that many threads would be expensive.

Something that does show up is load-multiple / store-multiple register instructions. While not really loading and storing all registers at the same time, it is good for code density. Subroutine entry and exit do multiple loads and stores all the time. Some machines can do the store multiples in the background as the program continues so it looks like there is no time consumed.

_________________
Robert Finch http://www.finitron.ca


Sat Jun 07, 2025 9:41 am

Joined: Thu Jan 17, 2013 4:38 pm
Posts: 56
I was thinking of a RISC approach to "movem" style instructions, but it violates one principle in that it will do 2 updates from 1 instruction:
Put the bitfield that specifies the registers you want to save in a register itself, and each time you store one register you clear the matching bit in that bitfield register (so the instruction can be restarted easily if needed). This might be a supervisor-only instruction that only accesses user-mode registers.
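
Modelled in C, one execution of such an instruction might behave as below. The names (`user_regs_t`, `store_next`) are invented for the sketch, and it assumes each register has a fixed slot in the save area, so the only state updated besides memory is the mask register.

```c
#include <stdint.h>

#define NUM_REGS 32  /* assumed register count */

/* The user-mode register file that the supervisor wants to save. */
typedef struct {
    uint64_t regs[NUM_REGS];
} user_regs_t;

/* One execution of the hypothetical instruction: store the lowest-numbered
   register whose bit is set in *mask, then clear that bit.
   Returns 1 if a register was stored, 0 if the mask was already empty. */
static int store_next(const user_regs_t *u, uint32_t *mask, uint64_t *save_area) {
    if (*mask == 0)
        return 0;
    unsigned r = 0;
    while (((*mask >> r) & 1u) == 0)  /* find the lowest set bit */
        r++;
    save_area[r] = u->regs[r];        /* fixed slot per register */
    *mask &= *mask - 1;               /* clear the lowest set bit */
    return 1;
}
```

Because the mask always describes exactly the work still to be done, an interrupt between two steps needs no special handling: re-running `store_next` until it returns 0 finishes the save.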


Sat Jun 07, 2025 6:11 pm

Joined: Fri Jan 19, 2024 1:06 pm
Posts: 17
NorthWay wrote:
My first thought was to track each register with a single bit that you can clear in supervisor mode, and which is set whenever the register is written to.

I think the biggest problem is the opposite: when to mark a register `unused`.

I'd think that in any non-trivial program, any callee-saved registers are never unused. There is always a valid value -- although not always belonging to the immediate caller, but somewhere up the call chain. You could choose to not have callee-saved registers in your ABI though, but then they would need to be caller-saved somehow.

You could perhaps have a `call` instruction sequence with instructions that both save registers and mark them as unused. Those that are already marked unused would not have to be saved again.
Then save the bitmask of the saved registers in the unused high bits of the return address on the stack.
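
A small C sketch of how that could fit together. It assumes 16 trackable registers, 48-bit user-space addresses (leaving 16 high bits free in a 64-bit return address) and a toy upward-growing stack; `machine_t` and all function names are hypothetical.

```c
#include <stdint.h>

#define ADDR_BITS 48  /* assumed width of a user-space address */
#define ADDR_MASK ((UINT64_C(1) << ADDR_BITS) - 1)

typedef struct {
    uint64_t regs[16];
    uint16_t live;       /* bit i set => register i holds a value still needed */
    uint64_t stack[64];  /* toy stack, grows upward in this model */
    unsigned sp;         /* next free stack slot */
} machine_t;

/* Save every register in `want` that is still marked live, mark it unused,
   and return the mask of registers that were actually saved. */
static uint16_t save_and_mark_unused(machine_t *m, uint16_t want) {
    uint16_t saved = 0;
    for (unsigned r = 0; r < 16; r++) {
        uint16_t bit = (uint16_t)(1u << r);
        if ((want & bit) && (m->live & bit)) {
            m->stack[m->sp++] = m->regs[r];
            m->live &= (uint16_t)~bit;
            saved |= bit;
        }
    }
    return saved;
}

/* Pack the saved-register mask into the unused high bits of the return address. */
static uint64_t pack_return(uint64_t ret_addr, uint16_t saved_mask) {
    return (ret_addr & ADDR_MASK) | ((uint64_t)saved_mask << ADDR_BITS);
}

static uint64_t unpack_addr(uint64_t packed) {
    return packed & ADDR_MASK;
}

static uint16_t unpack_mask(uint64_t packed) {
    return (uint16_t)(packed >> ADDR_BITS);
}
```

On return, the callee's epilogue would read the mask back with `unpack_mask` to know which registers to reload. The sketch ignores sign-extended (upper-half) addresses, which would need different handling.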

Marking registers unused inside a function could be difficult though. You could also perhaps have an instruction fetcher that looks ahead to see when a register next gets overwritten without being reused, so as to mark registers `unused` a few instructions beforehand. (But I'd think that could be very complicated to implement)

Another approach would be to make registers expire automatically.
The Mill architecture does not have registers at all, but a "Belt": It works like a stack machine that you can only push to, and the belt size is bounded: older values get pushed off the belt and are lost.
But the belt can also be shorter than its maximum size. It works a little bit like register windows on function calls in that a callee sees only its parameters on the belt: The caller's belt is saved.
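
To show the push-only, bounded behaviour, here is a toy belt model in C. It only mirrors the description above (newest value at position 0, oldest values falling off); the belt length of 8 and all names are assumptions, and it makes no claim about how the Mill actually implements this.

```c
#include <stdint.h>

#define BELT_MAX 8  /* assumed maximum belt length */

typedef struct {
    uint64_t slot[BELT_MAX];
    unsigned head;   /* index of the most recently pushed value */
    unsigned count;  /* number of valid values, at most BELT_MAX */
} belt_t;

static void belt_init(belt_t *b) {
    b->head = 0;
    b->count = 0;
}

/* Every result is pushed onto the front; once the belt is full,
   the oldest value is overwritten and thereby lost. */
static void belt_push(belt_t *b, uint64_t v) {
    b->head = (b->head + 1) % BELT_MAX;
    b->slot[b->head] = v;
    if (b->count < BELT_MAX)
        b->count++;
}

/* Read belt position pos (0 = newest). Returns 0 on success,
   -1 if that position holds no value (never written or fallen off). */
static int belt_get(const belt_t *b, unsigned pos, uint64_t *out) {
    if (pos >= b->count)
        return -1;
    *out = b->slot[(b->head + BELT_MAX - pos) % BELT_MAX];
    return 0;
}
```

Operands are named by position rather than by register number, so there is no destination field to encode and nothing to mark as unused: a value simply expires once enough newer results have been pushed.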

Something in-between The Mill and a conventional architecture is the STRAIGHT architecture.
A side effect of both STRAIGHT and The Mill is that each instruction's "destination register" is implicit and not part of the instruction encoding. STRAIGHT uses that to get many architectural registers, while The Mill gets shorter instructions.
(Do beware, though, that The Mill's Belt is quite heavily patented. I think the original STRAIGHT design avoids infringing it, but a follow-up paper to it does not.)

Mitch Alsup's My 66000 architecture spares registers by having only 32 64-bit registers in total. The register file is unified for both integer and floating-point ops. Instead of architectural SIMD registers, it can enter a vector-loop mode in which regular operations become vector operations. The vector length is a property of the microarchitecture and is hidden from software.
When there is an interrupt (such as before a task switch) the processor falls back to scalar operation, so that the OS never has to store more than the 32 64-bit registers.
I'd think there are big drawbacks to this design though. If I understand it right, the vector mode supports only one-dimensional loops, not multi-dimensional ("polyhedral") loop nests. Nor does it make it easier to use fixed-length vectors such as those used for 3D coordinates, colour spaces, etc.

NorthWay wrote:
Apart from sliding register windows, are there any (modern) approaches to loading and storing registers faster than storing them one by one?

The Itanium stores register windows in a separate on-chip memory. It was supposed to have a "register stack engine" that automatically shuffled the register stack to/from a hidden area in main memory during spare bus cycles. I don't know if any member of the Itanium family ever got one, however. One reason why the Itanium was known to be slow was that overflow/underflow of the internal stack memory was instead handled in software by the OS.

The Mill architecture is supposed to have an engine from the start though — and by having it, it is supposed to be able to do a context switch in a single clock cycle. In the best circumstances, of course.


Mon Jun 09, 2025 6:12 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1832
(Thanks for the mention of STRAIGHT - new to me!)


Mon Jun 09, 2025 8:05 pm

Joined: Fri Jan 19, 2024 1:06 pm
Posts: 17
STRAIGHT was a research project. I came across an article about it when looking for information on how to build compilers.


Tue Jun 10, 2025 7:18 pm