
 [ 30 posts ]  Go to page Previous  1, 2
 Possible 16-bit design, need guidance on pipelines & fetches 
Author Message

Joined: Sun Jul 05, 2020 9:00 pm
Posts: 17
No problem. I am a tad sensitive on this, but I try not to go full "Gamestop mode" on folks. That's a reference to the person who went off in a GameStop and told the clerk to take it outside, after the clerk referred to this person, who was obviously dressed as a woman, as "sir". To me, that strategy sends the wrong message, since that is typically how a hot-headed guy would respond. I tend to use humor or simply share what is true about me. I tend to get the wrong pronouns online, even on sites where I use names with "girl" or "lady" as part of the screen name.

Still, I haven't figured out where to start on a pipelined, variable-instruction-size design. The Gigatron has single-cycle timing for all instructions.

As a side comment (likely better in its own topic): when/if I get to a video controller and want to do scrolling, the best way I can see would be to use some sort of virtualization of the addresses. Renaming addresses is a whole lot faster than blitting, and any complex calculations can happen during the syncs. That is one thing the Gigatron does. Since everything is bit-banged, and since the X register is the only one other than the PC that has a counter, if you reach the end of the page it addresses, it rolls over. So for a game like Racer (like Pole Position), you can simply change the offset to do side-scrolling, and if it wraps, you can recycle part of the background on the opposite side of the screen.

Wed Feb 24, 2021 8:34 pm

Joined: Sun Dec 20, 2020 1:54 pm
Posts: 74
(Miss Pipeline ... nerd humor!)

Wed Feb 24, 2021 9:14 pm

Joined: Sun Jul 05, 2020 9:00 pm
Posts: 17

Wed Feb 24, 2021 10:26 pm

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 123
(Please forgive me for my gender assumption! I am pleased for any gender to join us! I will have a think about why I assumed that so hopefully I won't make that error in the future. Thank you for pointing it out!)

Wed Feb 24, 2021 11:11 pm

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 123
So, for multi-word/byte fetches, I think it's usually done with a circuit that fetches bytes into a FIFO queue, running any time there's room in the queue, plus a way to flush the queue when a jump happens. The decode unit can then read multiple bytes from the front of the queue, decode them, and have them deleted. If the FIFO isn't full enough yet, you stall the CPU until it is.

The "front" of the queue would allow you to read up to the maximum instruction length in bytes in a single cycle (so if the max length is 5 bytes, you would need to be able to read up to 5 bytes off the front in one cycle). The filler would need to be able to fill the queue in an unaligned manner, which will take some figuring to get right. And you need extra logic to allow fewer than the max-length bytes to be deleted from the front of the queue.

If you are going the instruction-cache route, you would probably want the cache to be wide enough that the FIFO filler circuit can get ahead of the executing program. So if the average instruction length is, say, 2 bytes, you would want the cache to be at least 16 bits wide, if not 32 bits. That way the FIFO filler can refill at least 2 bytes, preferably 4, in a single clock cycle. So while the CPU is executing short instructions (1 or 2 bytes long) it doesn't have to stall unless the FIFO runs dry. Keep in mind the program will likely be executing a loop or something with good cache locality, so most of the time the cache will get hits and the FIFO will be kept full.

I think the x86 used to work this way back in the day (maybe still does?), and I believe this is how some pipelined Z80s work, and I have seen some stack processor designs that do this too.
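To make the mechanism concrete, here's a toy software model of that prefetch FIFO. All the names and sizes (queue depth, 5-byte max instruction) are my own invention for illustration, not from any particular design:

```python
from collections import deque

class PrefetchQueue:
    """Toy model of a byte-wide instruction prefetch FIFO."""

    def __init__(self, memory, depth=8, max_len=5):
        self.mem = memory          # bytes-like program image
        self.depth = depth         # FIFO capacity in bytes
        self.max_len = max_len     # longest possible instruction
        self.fetch_pc = 0          # where the filler reads next
        self.q = deque()

    def fill(self):
        # The filler runs whenever there is room in the queue.
        while len(self.q) < self.depth and self.fetch_pc < len(self.mem):
            self.q.append(self.mem[self.fetch_pc])
            self.fetch_pc += 1

    def peek(self):
        # The decode unit may look at up to max_len bytes at the front.
        return bytes(list(self.q)[:self.max_len])

    def consume(self, n):
        # Delete the n decoded bytes (n <= max_len) from the front,
        # returning False (a stall) if the queue can't supply them yet.
        if len(self.q) < n:
            return False
        for _ in range(n):
            self.q.popleft()
        return True

    def flush(self, new_pc):
        # On a taken jump the queued bytes are stale: discard them
        # and restart the filler at the branch target.
        self.q.clear()
        self.fetch_pc = new_pc
```

In hardware the fill, peek, and consume steps all happen concurrently each cycle, of course; the sketch just shows the bookkeeping.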

Wed Feb 24, 2021 11:30 pm

Joined: Sun Dec 20, 2020 1:54 pm
Posts: 74
The beauty of RISC is ...having fixed length opcodes.

Why? Because
  • it makes the decoding step simple
  • it makes the fetch and decode stages already intrinsically decoupled
  • it makes PC_NEXT as simple as a mux(PC+=4(1),PC+=rel,PC=addr)
  • it only requires a simple (dedicated) adder to evaluate PC+=4(1)
  • it makes disassembly easy, since each instruction is always 4(1) bytes
  • it's proven to consume fewer logic elements
The other side of the coin is that it makes the code density worse :shock:

Which may be a valid and good reason to make hybrid machines, I have to admit.

(1) 32bit machine
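That PC_NEXT mux really is that small. A sketch in Python standing in for the mux logic, assuming a 32-bit machine with 4-byte instructions:

```python
def pc_next(pc, taken_rel=None, taken_abs=None):
    """Next-PC selection for a fixed-length (4-byte) RISC:
    a plain mux between absolute jump, taken branch, and fall-through."""
    if taken_abs is not None:
        return taken_abs        # PC = addr   (jump/call)
    if taken_rel is not None:
        return pc + taken_rel   # PC += rel   (taken branch)
    return pc + 4               # PC += 4     (sequential fetch)
```

With variable-length instructions, the `+ 4` alone would instead depend on decoding the current instruction first, which is exactly the coupling being avoided.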

Thu Feb 25, 2021 11:04 am

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 123
Yeah, I agree, the only reason the extra complexity of variable-length instructions might be worth it is to save memory on program size. Fixed-length instructions are a lot simpler and will run much faster as a result. Simpler == faster in electronics: it produces less heat, there are shorter wires, fewer logic elements, etc.

This is one of the reasons why the Apple M1 (fixed length) is blowing the x86 (variable length) out of the water right now. They were able to make the M1's decode unit decode more instructions per cycle than the x86, which is already at its theoretical maximum, because of how complicated variable-length instructions are to decode. Once the logic gets too complicated it becomes hard to fit the propagation delay within a clock cycle, so the x86 can't really decode any more instructions per clock cycle than it already does. Not without removing instructions from the instruction set, and, well, x86 just keeps adding more. The M1 on the other hand has a much higher limit due to its (mostly) fixed-length instructions, which are much simpler to decode. More instructions decoded per cycle means more instructions that can potentially execute in that cycle, via re-ordering, to make use of the multiple pipelines these processors have.

What I am doing with my CPU is a compromise: it's fixed-width, but only 16 bits (despite being a 32-bit machine). This lets the instructions take pretty much the minimum possible memory for a pure RISC machine. It's really, really difficult to design an ISA that fits in 16 bits with 16 registers, but I have made probably a couple dozen attempts at it and got good at fitting everything in. 8 registers is a tiny bit easier. 4 registers is trivial.

You might wonder how you can fit full-length literals in a fixed-length instruction set... well... you don't. There are studies showing the average number of bits required. Off the top of my head I think it's roughly 6 bits for ALU functions, 5 bits for stack-relative addressing, 10 bits for jump/call, and 7 bits for local branches, but I would look up the proper numbers or build a program to calculate them from input programs. In the unusual case of requiring a longer literal, you either store it separately from the program (like the SuperH) with PC-relative addressing of program memory, or you have special instructions ("prefix" instructions) that load the immediate into a temporary immediate register for the very next instruction (being careful not to allow interrupts between the prefix and its use).

Storing it separately is simpler, and is what I am doing now, but initially I wanted to use prefix instructions. The advantage of separate storage, though, is that multiple instructions can share the same literal, so it should (in theory) use less memory. I think prefix instructions would be faster, though, since there's no extra memory operation.
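A sketch of how the prefix-instruction trick composes a long immediate. The field widths here (8-bit prefix payload, 8-bit inline immediate) are made up purely for illustration:

```python
class PrefixImm:
    """A prefix instruction deposits upper immediate bits into a hidden
    register that the very next instruction combines with its own short
    immediate field."""

    def __init__(self):
        self.imm_reg = 0     # hidden temporary immediate register
        self.valid = False   # set by a prefix, cleared on use

    def exec_prefix(self, payload8):
        # Shift the payload into the hidden register; chaining prefixes
        # extends the immediate by 8 more bits each time.
        self.imm_reg = (self.imm_reg << 8) | payload8 if self.valid else payload8
        self.valid = True

    def full_immediate(self, inline8):
        # The consuming instruction glues the prefix bits on top of its
        # own 8-bit field.  Interrupts must not be taken between the
        # prefix and this use, or the hidden state would be lost.
        if self.valid:
            value = (self.imm_reg << 8) | inline8
            self.imm_reg, self.valid = 0, False
        else:
            value = inline8
        return value
```

The interrupt hazard mentioned above is visible here: `imm_reg`/`valid` are extra architectural state that an interrupt handler would clobber unless interrupts are held off between prefix and use.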

Thu Feb 25, 2021 1:08 pm

Joined: Sun Jul 05, 2020 9:00 pm
Posts: 17
I thought of something: why not go part fixed and part VLIW? Like, what if I used 24 bits total length for instruction words, but with 1-3 instructions each? So it could do up to 6 instructions in 3 cycles. Now, I do see the downsides, since the assembler would need to be intelligent, there could be multiple ISAs, multiple control units and ALUs, and jumps may lack some granularity without using NOPs in the slave-unit fields. The granularity issue would need to be handled by the assembler, since you'd probably want jump targets to be in the first slot of a cluster.

An idea that comes to mind would be using $00 as NOP. That could make bubbling easier when clearing pipelines.
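Combining the two ideas, a 24-bit bundle with $00-as-NOP could be unpacked like this. The three-slots-of-8-bits layout is an assumption I'm making just to illustrate; the real slot widths would depend on the ISA:

```python
def decode_bundle(word24):
    """Split a hypothetical 24-bit bundle into three 8-bit slots.
    A slot value of 0x00 is taken as NOP, so a bundle carries
    1-3 real operations."""
    slots = [(word24 >> 16) & 0xFF, (word24 >> 8) & 0xFF, word24 & 0xFF]
    return [s for s in slots if s != 0x00]
```

Note this also shows why jump granularity suffers: a branch target can only name a whole bundle, so the assembler has to pad with $00 slots to force a target into the first slot.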

And NOPs don't have to be wasted. An interesting use would be to have them modify an internal register that could be used as part of a hardware RNG. Maybe they could increment the hidden register, or work it in nybbles, incrementing one nybble on every NOP and the other on every other NOP, with the roles changing every so many NOPs. I have a lot more to say about RNGs, but I think I will start a thread for that.
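A toy model of that NOP-driven hidden counter, with the two-nybble variant. The swap interval and the exact increment rules are arbitrary choices on my part, just to show the shape of the idea:

```python
class NopEntropy:
    """Hidden byte treated as two nybbles: one incremented on every
    NOP, the other on every second NOP, roles swapping every
    SWAP_EVERY NOPs.  Entropy comes from how many NOPs the running
    program happens to have executed when the register is sampled."""

    SWAP_EVERY = 16

    def __init__(self):
        self.lo = 0        # nybble bumped on every NOP
        self.hi = 0        # nybble bumped on every other NOP
        self.count = 0

    def nop(self):
        self.lo = (self.lo + 1) & 0xF
        if self.count % 2 == 1:
            self.hi = (self.hi + 1) & 0xF
        self.count += 1
        if self.count % self.SWAP_EVERY == 0:
            self.lo, self.hi = self.hi, self.lo   # swap the roles

    def read(self):
        return (self.hi << 4) | self.lo
```

Since the two nybbles advance at different rates and swap roles, the register cycles through states with a longer period than a plain counter would.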

Thu Feb 25, 2021 3:10 pm

Joined: Sun Dec 20, 2020 1:54 pm
Posts: 74
On VLIWs you move the complexity to the software, and I don't know... I owned an Itanium workstation and I sold it, because if the C toolchain was a nightmare to support, trying to do something in assembly was even worse.

I think a good compromise is the Blackfin manufactured and marketed by Analog Devices. It incorporates aspects of DSP architecture and advanced-RISC architecture into a single core, combining digital signal processing and microcontroller functionality, and the combination is designed to improve performance, programmability and power consumption over traditional DSP or RISC architecture designs.

Blackfin has:
  • qty=2, 16-bit hardware MACs
  • qty=2, 40-bit ALUs and accumulators
  • qty=1, 40-bit barrel shifter
  • qty=4, 8-bit video ALUs
Blackfin+ adds:
  • qty=1, 32-bit MAC
  • qty=1, 72-bit accumulator

This allows the processor to execute up to three instructions per clock cycle, depending on the level of optimization performed by the compiler (ADI's cc vs. gcc) or the programmer.

Wow, that's amazing :shock:

In 2018, I bought a Blackfin BF537 EZ-Kit Lite board, a B1000 debugging cable, and a license for VisualDSP++ for something like 700 euros. A lot of money for a hobby toy, but I love it.

Another weird beast is the Mill, by Ivan Godard and his startup Mill Computing, Inc., located in East Palo Alto, California. It may be interesting for you because it uses a very-long-instruction-word encoding to place up to 33 simple operations in one instruction... it's just done weirdly, and for sure stranger than usual.

Anyway, that's somehow a modern example of VLIW you can look at.

Fri Feb 26, 2021 5:47 am

Joined: Sun Dec 20, 2020 1:54 pm
Posts: 74
rj45 wrote:
What I am doing with my CPU is a compromise

Talking about personal projects: after several hundred unsuccessful attempts looking for a decent compromise between { performance, programmability, deterministic cycles, power consumption (1), code density }... I gave up and decided to focus only on { assembly-programmability, power consumption (1), deterministic cycles }.

That's why I am with a multi-cycle design.

(1) in terms of: the fewer bits you flip, the less you consume => don't implement too many unnecessary things that run in parallel

Fri Feb 26, 2021 6:53 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1754
There are so many paths through the tangled world of cpu designs!

I think it's interesting that both ARM and RISC-V have 16-bit forms of their instruction sets; in fact ARM had two attempts at this: Thumb, which is a limited 16-bit subset, and Thumb-2, which is a mixed-length encoding of mostly 16-bit instructions.

In both cases, I think, it illustrates that it's fine to have a different memory bus width than your instruction width: decoupling is the thing, and a prefetch queue, or a cache, gives you that decoupling. It's fine to have a 16 bit machine with an 8 bit wide memory - or vice-versa.

Fri Feb 26, 2021 11:34 am

Joined: Sun Jul 05, 2020 9:00 pm
Posts: 17
I'm thinking more at the moment of converting a Gigatron to a von Neumann architecture. I already see how complex that might be and how it might adversely impact performance. For instance, if everything took 2 cycles, you'd have to clock it twice as fast to properly bit-bang video through the port. A way around that would be to either use DMA for the video or create a specialized memory-to-port instruction. That instruction would pause the program counter, likely shove NOPs into the pipeline for as long as necessary, and run a state machine that sequences the RAM to the port. That way, single-cycle OUTs would be possible, even in an architecture that cannot otherwise do this. Or, taking things further, make the memory go at least twice the execution speed, with proper arbitration circuitry, and do fetches and random accesses in the same cycle.

Now, going with DMA video, where the syncs are outside of software control, I'd likely want to add interrupts. That way, things like software-based sound generation would get regular timing regardless of the code's timing. Of course, we have to consider what an interrupt does. First, it looks up the address of the routine and saves the PC. Then it jumps to the routine. When IRet is reached, it sets the program counter to the next address that needs to run. Parts of this shouldn't be hard to do with the Gigatron design. One would need to get rid of the delay slot, at least temporarily. This can likely be done by temporarily increasing the pipeline size: pause the program counter and insert a read-only register or "wire" with a NOP on it. So the delay slot is overwritten with a NOP for certain types of hard jumps, such as interrupts and calls, and this gives a convenient way to halt the program counter. For other situations, the timing would likely need to be changed. So if you do the memory-to-port state machine idea, care would need to be taken that the next instruction is not overwritten while the PC is paused.
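The squash-the-delay-slot-with-a-NOP scheme can be modeled with a one-deep pipeline. Everything here (the vector address, the resume bookkeeping) is hypothetical, just to check that the sequence of save / squash / IRet hangs together:

```python
NOP = 0x00

class TinyPipe:
    """One-deep pipeline: `slot` holds the delay-slot instruction.
    On an interrupt the PC is saved, the instruction already in the
    slot is replaced by a NOP, and IRet restores the saved PC so the
    squashed instruction is re-fetched."""

    VECTOR = 0x80   # hypothetical interrupt handler address

    def __init__(self, program):
        self.mem = program
        self.pc = 0
        self.slot = NOP       # pipeline primed with a bubble
        self.saved_pc = None

    def step(self):
        # Execute the slot, fetch the next instruction into it.
        insn = self.slot
        self.slot = self.mem[self.pc]
        self.pc += 1
        return insn

    def interrupt(self):
        # The instruction sitting in the slot gets squashed, so the
        # resume address must point back at it.
        self.saved_pc = self.pc - 1
        self.slot = NOP
        self.pc = self.VECTOR

    def iret(self):
        self.slot = NOP
        self.pc = self.saved_pc
```

The key detail the model surfaces: because the slot is squashed, the saved PC must back up by one so the squashed instruction runs after IRet, rather than being silently skipped.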

Modifying the Gigatron to be Von Neumann would allow the machine to do 16-bit data transfers while still keeping 8-bit instructions and operands. It would also be more FPGA friendly if one is using external memory and devices. Since ROM and RAM would be on the same bus, there would be fewer GPIO lines used. Of course, using only the resources of the FPGA would be even easier.

And yes, I'd likely want to take the unused operand space to add secondary opcodes there, so long as they don't access memory, alter flow control, or create any register hazards. Theoretically, this could make 32-bit (and 16.16 fixed point) register additions possible since the slave control unit and ALU could work on different registers in tandem. Adding a carry flag would be helpful since the next instruction could be an Increment-on-Carry.

Sun Feb 28, 2021 5:07 pm

Joined: Sun Dec 20, 2020 1:54 pm
Posts: 74
  0:0f 8e 01 ff 8d 1d 8d 3d ce 00 c4 8d 54 86 00 8d 40 ce 00 ce 8d 69 8d 49 20 ee 18 09 18 8c 00 00
 20:26 f8 39 86 30 b7 10 2b 7f 10 2c 86 0c b7 10 2d 39 b6 10 2e 84 20 27 f9 b6 10 2f 39 7d 10 2e 2a
 40:fb b7 10 2f 39 86 80 b7 10 39 18 ce 03 e8 8d ca 39 86 00 b7 10 30 b6 10 30 84 80 27 f9 b6 10 31
 60:39 a6 00 27 0e 8d d5 b7 70 00 08 18 ce 27 10 8d a9 20 ee 39 84 0f 8b 30 81 39 23 02 8b 07 39 36
 80:44 44 44 44 8d ee a7 02 32 84 0f 8d e7 a7 03 39 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 a0:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 c0:00 00 00 00 41 44 43 2e 63 68 31 3a 20 00 30 78 5f 5f 0d 0a 00 ff ff ff ff ff ff ff ff ff ff ff
 e0:ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

about code density ... the above is working code from a very old (1992) acquisition system that manages an ADC, cleans data, and sends packets over the serial line

The onboard RAM is only 256 bytes (not kilobytes; it's not a mistake), and it contains the stack, the data, and the code areas.

WOW :o :o :o

I am afraid you cannot do it with a common RISC :D

Wed Mar 03, 2021 5:45 pm

Joined: Wed Nov 20, 2019 12:56 pm
Posts: 92
DiTBho wrote:
about code density ... the above is working code from a very old (1992) acquisition system that manages an ADC, cleans data, and sends packets over the serial line

The onboard RAM is only 256 bytes (not kilobytes; it's not a mistake), and it contains the stack, the data, and the code areas.

WOW :o :o :o

That is really impressive - what's the architecture? Z80?

Meanwhile, the remote job-management software for our new laser printer is a 1.5 GB download. And the guys installing the machine couldn't see how utterly insane this is!

In my project the prefetch queue was one of the trickier things to get working reliably, but it allowed me to use 8-bit instructions on a 32-bit bus.

Wed Mar 10, 2021 1:16 pm

Joined: Sun Dec 20, 2020 1:54 pm
Posts: 74
robinsonb5 wrote:
what's the architecture?

6800, 6.8k

Thu Mar 11, 2021 1:30 pm
