
 [ 5 posts ] 
 Decoding wildly complex instruction sets 

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1712
Might be of interest:
First off, let’s talk about how x86 is decoded in the first place: all x86 CPUs you’re likely to interact with can decode (and execute) multiple instructions per cycle. Think about what that means: we have an (aggressively!) variable-length encoding, and we’re continually fetching instructions. These chips can decode (given the right code) 4 instructions per clock cycle. How does that work? They’re variable-length! We may know where the first instruction we’re looking at in this cycle starts, but how does the CPU know where to start decoding the second, third, and fourth instructions? That’s straightforward when your instructions are fixed-size, but for x86 they are most certainly not. And we do need to decide this quickly (within a single cycle), because if we take longer, we don’t know where the last instruction in our current “bundle” ends, and we don’t know where to resume decoding in the next cycle!

You do not have enough time in a 4GHz clock cycle (all 0.25ns of it) to fully decode 4 x86 instructions. For that matter, you don’t even have close to enough time to “fully decode” (what exactly that means is fuzzy, and I won’t try to make it precise here) one. Two basic ways to proceed: the first is simply, don’t do that! Try to avoid it at all costs. Keep extra predecoding information (such as marking the locations where instructions start) in your instruction cache, or keep a separate decoded cache altogether, like Intel's uop caches. This works, but it doesn’t help you the first time round when you’re running code that isn’t currently cached.

Which brings us to option two: deal with it. And the way to do it is pretty much brute force. Keep a queue of upcoming instruction bytes (this ties in with branch target prediction and other things). As long as there’s enough space in there, you just keep fetching another 16 (or whatever) instruction bytes and throw them into the queue.

Then, for every single byte position in that queue, you pretend that an x86 instruction starts at that byte, and determine how long it is. Just the length. No need to know what the instruction is. No need to know what the operands are, or where the bytes denoting these operands are stored, or whether it’s an invalid encoding, or if it’s a privileged instruction that we’re not allowed to execute. None of that matters at this stage. We just want to know “supposing that this is a valid instruction, what is its length?”. But if we add 16 bytes per cycle to the queue, we need 16 of these predecoders in parallel to make sure that we keep up and get an instruction length for every single possible starting location. We can pipeline these predecoders over multiple cycles if necessary; we just keep fetching ahead.
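As a rough illustration, here's a toy length-only predecoder. The opcode-to-length map is a simplified stand-in I made up, not the real x86 tables (which involve prefixes, ModRM, SIB, and so on), but it shows the shape of the job: one such function runs in parallel at every byte offset of the fetch window.

```python
# Toy length-only predecoder. The opcode->length table below is an
# invented, simplified stand-in -- NOT real x86 decoding rules.
TOY_LENGTHS = {
    0x90: 1,  # one-byte opcode
    0xB0: 2,  # opcode + 8-bit immediate
    0xB8: 5,  # opcode + 32-bit immediate
}

def predecode_length(buf, offset):
    """Assumed length of an instruction starting at `offset`.
    Unknown opcodes default to 1 here so the sketch keeps moving;
    a later, full decoder would catch and flush a bad guess."""
    return TOY_LENGTHS.get(buf[offset], 1)

def predecode_window(buf):
    """Run the predecoder at every byte position, like the 16
    parallel predecoders described above."""
    return [predecode_length(buf, i) for i in range(len(buf))]
```

Note that most of these guesses are garbage: only the ones at actual instruction boundaries matter, and we don't know yet which those are.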

Once our queue is sufficiently full and we know that size estimate for every single location in it, then we decide where the instruction boundaries are. That’s the stage that keeps track. It grabs 16 queue entries (or whatever) starting at the location for the current instruction, and then it just needs to “switch through”. “First instruction says size starting from there is 5 bytes, okay; that means second instruction is at byte 5, and the queue entry says that one’s 3 bytes; okay, third instruction starts at byte 8, 6 bytes”. No computation in that stage, just “table lookups” in the small size table we just spent a few cycles computing.
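That boundary-marking walk can be sketched as a pure table chase, where `lengths[i]` is assumed to already hold the predecoded length of a hypothetical instruction starting at byte i:

```python
def mark_boundaries(lengths, start=0, max_insns=4):
    """Chain through a table of precomputed instruction lengths to find
    where up to `max_insns` instructions begin. No decoding happens
    here -- just lookups and adds, exactly as described above."""
    starts = []
    pos = start
    while len(starts) < max_insns and pos < len(lengths):
        starts.append(pos)
        pos += lengths[pos]  # next instruction begins right after this one
    return starts
```

With the example from the text (5 bytes at offset 0, 3 bytes at offset 5, 6 bytes at offset 8), this yields instruction starts at 0, 5, 8, 14.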

That’s one way to do it. As said, very much brute force, but it works. However, if you need 16 predecoders (as you do to sustain a fetch rate of 16 bytes/cycle), then you really want these to be as dumb and simple as you can possibly get away with. These things most certainly don’t care about 6000 different iforms. They just squint at the instruction just enough to figure out the size, and leave the rest for later.

- from ... are-there/

Thu Aug 25, 2016 9:05 pm

Joined: Tue Jan 15, 2013 5:43 am
Posts: 189
They just squint at the instruction just enough to figure out the size
I'll bet even this (comparatively superficial) examination is far from simple! x86 instruction encodings carry a lot of backward-compatibility baggage. They weren't designed with a view to easing the modern-day problem of simultaneous decoding of multiple instructions.


Fri Aug 26, 2016 2:39 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1712
It's possible that a length-determining engine which almost always gets it right for the common instructions would be adequate - if the decoding process is deeply pipelined, some slower logic which detected the rare cases could squash the in-flight decoding which is now known to have gone wrong. It's a horror of complexity though.

I understand it to be the case that the later x86 machines will decode instructions in cache-line-length packets, and will run efficiently with almost any instructions at the beginning of the packet and with a more limited set of instructions at the end of the packet. But rather than my burblings, let's hear from someone who knows what they are talking about: ... 2-and.html

This might also be of interest:

Edit: the resource I wanted to find is now found - Agner Fog's writings at

Sat Aug 27, 2016 4:14 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1735
Location: Canada
Just reading those links, it seems the x86 series still only issues three micro-ops per cycle. No doubt the decoders keep the issue logic busy even with the 1 complex + X simpler decoders. I think the idea of using 16 predecoders, one beginning at each byte, might consume a lot more power than just four decoders. Having the cache indicate the instruction boundaries seems like a reasonable idea. Cache loads can't be that frequent given the size of caches these days. If it takes an extra cycle or two for a pipelined length decoder to work at cache load time, so what? It probably isn't a significant portion of the program execution time. I wonder if there could be a gain in performance from ensuring that all instructions fit within a 16-byte cache line, padding with NOPs if necessary.
It appears also that the x86 uses a 16-byte window into the cache. I've been considering the fact that a CPU core with a cache will likely use a window into the cache when developing an instruction set. It means that information for one instruction can span multiple words if, for instance, the window is four 32-bit words wide. Useful for handling extended constants.

For the 6502 the instruction length can be determined by looking at only a single byte.
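For example, a complete 6502 length predecoder really is just a 256-entry lookup on the opcode byte, since the addressing mode (and hence the 1-3 byte length) is fixed by the opcode. A sketch with a handful of real opcodes filled in (the rest are defaulted to 1 here for brevity; a real table would spell out all 256 entries):

```python
# 6502: the opcode byte alone fixes the instruction length (1-3 bytes).
# Only a few well-known opcodes are filled in; the rest default to 1
# in this sketch, which a full table would of course enumerate.
OPCODE_LEN = [1] * 256
OPCODE_LEN[0xA9] = 2  # LDA #imm  (opcode + immediate)
OPCODE_LEN[0xAD] = 3  # LDA abs   (opcode + 16-bit address)
OPCODE_LEN[0x8D] = 3  # STA abs
OPCODE_LEN[0x4C] = 3  # JMP abs
OPCODE_LEN[0xD0] = 2  # BNE rel   (opcode + 8-bit offset)
OPCODE_LEN[0x60] = 1  # RTS

def insn_length(opcode):
    return OPCODE_LEN[opcode & 0xFF]
```

No prefixes, no escape bytes, no length-modifying operands: one table lookup and you're done, which is exactly what makes parallel predecode cheap on a machine like this.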

Thor has to determine the length of instructions too, to find the second instruction in the cache window, but it's probably simpler to do than on the x86. It might help to add a pre-decoded cache like the x86's.

Reading the article, I get the impression that the x86 is one complex beast to implement when tweaking the maximum performance out of it. At the same time, designers are balancing that against power consumption.

Robert Finch

Sun Aug 28, 2016 2:48 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1712
> For the 6502 the instruction length can be determined by looking at only a single byte.

Although that's true, if you branch into the second or third byte of a cache line (or prefetch buffer) there's still an ambiguity about where an instruction might be in the earlier part of the line (or buffer) - it might be known, if previous bytes are known, or it might not be. Indeed, given the rarely used trick of having a byte be both an instruction and an operand, it's fundamentally ambiguous! I think somehow the machine has to proceed with the idea of confidence, rather than the idea of certainty.

Sun Aug 28, 2016 7:14 am