 EightThirtyTwo 

Joined: Wed Nov 20, 2019 12:56 pm
Posts: 92
The CPU design I've been tinkering with for the last few months has 32-bit registers and address bus, but instruction words of only eight bits, so I've come to call it "EightThirtyTwo".

Needless to say the eight-bit instruction word format places limits on how many registers can be referenced - I have three bits devoted to a register number, allowing for eight general purpose registers. Instructions take only one operand; where a second operand is required, a ninth register named "tmp" is used implicitly.

R7 is the program counter. Control flow is achieved by manipulating R7 directly - branch to subroutine is done with "exg r7", PC relative branch can be done with "add r7". "Add" is special-cased for r7, writing the register's existing contents to tmp, allowing tmp to serve as a link register. Thus with both calling methods, a subroutine can save tmp as its first operation and restore it to r7 later, avoiding the need for a dedicated "return" instruction.
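As a rough illustration of the link-register behaviour described above, here is a toy Python model of the two call mechanisms. It assumes r7 already points at the following instruction when "add r7" or "exg r7" executes, and that "exg r7" swaps r7 with tmp; the dictionary layout and the function names are invented for this sketch and are not taken from the actual CPU source.

```python
MASK32 = 0xFFFFFFFF

def add_r7(cpu):
    """PC-relative call: r7 += tmp, and the old r7 lands in tmp (assumed semantics)."""
    old_pc = cpu["r7"]
    cpu["r7"] = (old_pc + cpu["tmp"]) & MASK32
    cpu["tmp"] = old_pc

def exg_r7(cpu):
    """Absolute call: swap r7 and tmp, so tmp again holds the return address (assumed)."""
    cpu["r7"], cpu["tmp"] = cpu["tmp"], cpu["r7"]

def mr_r7(cpu):
    """Return: move tmp back into r7."""
    cpu["r7"] = cpu["tmp"]

# Caller: r7 points at the instruction after "add r7", tmp holds the offset.
cpu = {"r7": 0x104, "tmp": 0x20}
add_r7(cpu)
entry_pc = cpu["r7"]          # where the subroutine starts
saved_link = cpu["tmp"]       # the subroutine saves this first (e.g. on the stack)

# ... the subroutine body is free to clobber tmp here ...
cpu["tmp"] = 0xDEAD

# Subroutine exit: reload the saved link into tmp, then move it to r7.
cpu["tmp"] = saved_link
mr_r7(cpu)
resumed_pc = cpu["r7"]
```

The point of the sketch is only that both call styles leave the return address in tmp, which is why no dedicated "return" opcode is needed.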

A 32-bit immediate value can be loaded into tmp with "ldinc r7" followed by ".int <value>"
Shorter immediates can be built up six bits at a time with an "li" instruction.
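To make the "six bits at a time" idea concrete, here is a small Python sketch of how a value might be split into "li" chunks and rebuilt. It assumes that the first "li" in a run sign-extends its 6-bit payload into tmp and that each immediately following "li" shifts tmp left by six bits and ORs in its payload; that is my reading of the scheme rather than something stated above, and the helper names are made up.

```python
MASK32 = 0xFFFFFFFF

def li_chunks(value):
    """Split a 32-bit value into the fewest 6-bit chunks, most significant first."""
    value &= MASK32
    if value & 0x80000000:
        value -= 1 << 32              # treat as signed so small negatives stay short
    n = 1
    while not (-(1 << (6 * n - 1)) <= value < (1 << (6 * n - 1))):
        n += 1
    return [(value >> (6 * i)) & 0x3F for i in reversed(range(n))]

def run_li(chunks):
    """Emulate a run of consecutive li instructions (assumed semantics)."""
    tmp = None
    for chunk in chunks:
        if tmp is None:
            # First li of the run: sign-extend the 6-bit payload.
            tmp = chunk - 64 if chunk & 0x20 else chunk
        else:
            # Subsequent li: shift the previous contents up and merge.
            tmp = (tmp << 6) | chunk
    return tmp & MASK32

chunks_255 = li_chunks(255)            # two chunks, since 255 does not fit in 6 signed bits
rebuilt_255 = run_li(chunks_255)
```

Under these assumptions a value such as 15 needs one "li", 255 needs two, and a full 32-bit constant needs six, at which point the five-byte "ldinc r7" plus ".int" form is shorter.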

Rather than conditional branches, or devoting opcode bits to predication, I've implemented conditional execution, kind of like "sticky" predication. A dedicated "cond" instruction determines whether the following opcodes will be executed or not, and its effect lasts until either another "cond" instruction is encountered or until something manipulates (or would have manipulated) r7.
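The "sticky" behaviour can be sketched as a tiny Python model that only tracks which instructions would execute. It is a simplification under stated assumptions: the Z flag is passed in as a fixed value instead of being computed, only the EQ and NEQ conditions are modelled, and the set of opcodes treated as writing r7 is my guess for illustration.

```python
CONDITIONS = {
    "EQ": lambda z: z,
    "NEQ": lambda z: not z,
}
R7_WRITERS = {"add", "mr", "exg"}   # assumed set of opcodes that can write r7

def run(program, z_flag):
    """Return the instructions that execute, stopping at a taken branch."""
    executed = []
    enabled = True
    for op, arg in program:
        if op == "cond":
            # A cond always takes effect, even inside a skipped stretch.
            enabled = CONDITIONS[arg](z_flag)
            continue
        writes_r7 = op in R7_WRITERS and arg == "r7"
        if enabled:
            executed.append(f"{op} {arg}")
            if writes_r7:
                return executed      # control flow leaves this stretch
        elif writes_r7:
            enabled = True           # a skipped r7 write still ends the stretch
    return executed

loop_tail = [
    ("li", "1"), ("sub", "r5"), ("cond", "NEQ"),
    ("li", "IMW0(PCREL(.copy))"), ("add", "r7"),
    ("ldbinc", "r0"),
]
taken = run(loop_tail, z_flag=False)       # counter not yet zero: branch back
fallthrough = run(loop_tail, z_flag=True)  # counter hit zero: skip the branch
```

In the fall-through case the skipped "add r7" is what re-enables execution, so no explicit "end of condition" opcode is needed after a conditional branch.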

To give a flavour of what code for this ISA looks like, here's an LZ4 decompression routine (transliterated from MC68000 code found in the smallest version of Arnaud Carré’s lz4-68k project on github):
Code:
//   r0 packed buffer
//   r1 destination pointer
//   r2 packed buffer end

lz4_depack:
   stdec   r6 // Save return address on the stack...
   li   PCREL(.tokenLoop)
   add   r7  // branch to .tokenLoop
         
.lenOffset:
   ldbinc   r0 // load byte from address in r0, post increment r0
   mr   r3
   li   8
   ror   r3
   ldbinc   r0
   or   r3
   li   24
   ror   r3

   mt   r4  // move r4 to r5 by way of tmp register.
   mr   r5

   mt   r1
   mr   r4
   mt   r3
   sub   r4

   li   IMW1(PCREL(.readLen-1))
   li   IMW0(PCREL(.readLen))
   add   r7  // branch to .readLen subroutine.  (Add r7 puts the return address in tmp.)

   li   4
   add   r5
.copy:
   ldbinc   r4
   stbinc   r1
   li   1
   sub   r5
   cond   NEQ
     li   IMW0(PCREL(.copy))
     add   r7
         
.tokenLoop:   
   ldbinc   r0 // Load byte with post-increment
   mr   r4
   mr   r5
   li   15
   and   r4
   li   4
   shr   r5
   cond   EQ
     li   IMW1(PCREL(.lenOffset-1))
     li   IMW0(PCREL(.lenOffset))
     add   r7

   li   IMW0(PCREL(.readLen))
   add   r7

.litCopy:
   ldbinc   r0
   stbinc   r1
   li   1
   sub   r5
   cond   NEQ
     li   IMW0(PCREL(.litCopy))
     add   r7

   mt   r2
   cmp   r0
   cond   SGT
     li   IMW1(PCREL(.lenOffset-1))
     li   IMW0(PCREL(.lenOffset))
     add   r7
         
.over:
   ldinc   r6
   mr   r7

.readLen:
   stdec   r6
   li   15
   cmp   r5
   cond   NEQ
     li   IMW0(PCREL(.readEnd))
     add   r7

.readLoop:
   ldbinc   r0
   mr   r3
   add   r5
   li   IMW1(255)
   li   IMW0(255)
   xor   r3
   cond   EQ
     li   IMW0(PCREL(.readLoop))
     add   r7

.readEnd:
   ldinc   r6 // Fetch return address from stack
   mr   r7
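
For anyone who wants to follow the listing without knowing the ISA, here is a plain Python sketch of LZ4 block decoding, which is the algorithm the routine implements. It follows the standard LZ4 block layout (token, optional length bytes, literals, 16-bit little-endian offset, match length plus 4) and is not a line-by-line translation of the assembly; the function names are made up.

```python
def lz4_block_decode(src: bytes) -> bytes:
    """Decode a single raw LZ4 block (no frame header, no checksums)."""
    out = bytearray()
    i = 0

    def read_len(n, i):
        # A nibble of 15 means "more length bytes follow"; 255 means "keep going".
        if n == 15:
            while True:
                b = src[i]
                i += 1
                n += b
                if b != 255:
                    break
        return n, i

    while i < len(src):
        token = src[i]
        i += 1
        # High nibble: literal count. Copy the literals straight through.
        lit, i = read_len(token >> 4, i)
        out += src[i:i + lit]
        i += lit
        if i >= len(src):
            break                      # the last sequence has literals only
        # Two-byte little-endian offset back into the output.
        offset = src[i] | (src[i + 1] << 8)
        i += 2
        # Low nibble: match length, stored minus 4.
        mlen, i = read_len(token & 15, i)
        mlen += 4
        start = len(out) - offset
        for k in range(mlen):          # byte by byte, so overlapping copies work
            out.append(out[start + k])
    return bytes(out)

packed = bytes([0x40]) + b"abcd" + b"\x04\x00" + bytes([0x10]) + b"e"
unpacked = lz4_block_decode(packed)
```

The byte-by-byte match copy corresponds to the .copy loop in the listing, and read_len plays the role of the .readLen subroutine.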


The project can be found on GitHub at https://github.com/robinsonb5/EightThirtyTwo - so far the CPU has full load/store alignment, 32x32->64-bit multiply, interrupts, a build-time big-/little-endian switch, and dual-thread support.

A backend for the VBCC C Compiler is under construction.

I'll maintain some demo projects in a separate repository: https://github.com/robinsonb5/EightThirtyTwoDemos


Fri Nov 22, 2019 1:27 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Very interesting! Is the 'cond' idea novel?


Fri Nov 22, 2019 2:07 pm

Joined: Wed Nov 20, 2019 12:56 pm
Posts: 92
BigEd wrote:
Very interesting! Is the 'cond' idea novel?


I've not seen it used anywhere else, but I can't imagine it's never been done before.
The closest thing I've seen is probably the BTFSC and BTFSS instructions in PIC assembly language - inasmuch as they predicate the next instruction (though only one instruction) using an opcode instead of dedicated bits in the encoding.


Fri Nov 22, 2019 6:09 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
I suppose it's a few bits of state which would need to be preserved if an interrupt hit. (That is, if you have interrupts!)

Thumb2 has a not-quite-general-predication scheme:
Quote:
Thumb-2 instructions do not have the 4-bit condition code field that most Arm instructions have. Instead, Thumb-2 has the IT instruction, which conditionally executes up to four subsequent instructions. The instructions affected by an IT instruction are said to be in an IT block.


Fri Nov 22, 2019 8:07 pm

Joined: Wed Nov 20, 2019 12:56 pm
Posts: 92
BigEd wrote:
I suppose it's a few bits of state which would need to be preserved if an interrupt hit. (That is, if you have interrupts!)


I do have interrupts (or rather, one interrupt signal) but currently expect it to remain asserted until acknowledged by the handler, so that the CPU can service the interrupt when convenient. Because the tmp register is used as a link register, its contents are lost when an interrupt triggers, so I only allow interrupts to happen when an instruction is about to write to tmp - so that losing its old contents won't matter!
(The instruction in question is replaced by one that XORs r7 with itself, while writing its old value to tmp, causing control to jump to location zero with the Z flag set. I currently save the old Z and C flags in the top two bits of tmp, and could in theory save the cond flag there too, but currently I don't allow interrupts to fire until the cond stretch has ended. All of this means that while I do have interrupts, the response time isn't great.)
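
A minimal sketch of the flag-saving trick, in Python for readability. Which of the top two bits holds Z and which holds C is not stated above, so the bit assignment here is purely an assumption, as are the function names; the sketch also assumes the interrupted code lives in the low 30 bits of the address space, which follows from reusing the top two bits.

```python
Z_BIT = 1 << 31                 # assumed position of the saved Z flag
C_BIT = 1 << 30                 # assumed position of the saved C flag
ADDR_MASK = (1 << 30) - 1       # the remaining 30 bits hold the return address

def pack_interrupt_tmp(pc, z, c):
    """Combine the interrupted PC and the flags into one 32-bit tmp value."""
    return (pc & ADDR_MASK) | (Z_BIT if z else 0) | (C_BIT if c else 0)

def unpack_interrupt_tmp(tmp):
    """Recover (pc, z, c) from the value the handler finds in tmp."""
    return tmp & ADDR_MASK, bool(tmp & Z_BIT), bool(tmp & C_BIT)

saved = pack_interrupt_tmp(0x1234, z=True, c=False)
restored = unpack_interrupt_tmp(saved)
```

Saving the cond state the same way would cost one more address bit under this layout.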

Quote:
Thumb2 has a not-quite-general-predication scheme:
Quote:
Thumb-2 instructions do not have the 4-bit condition code field that most Arm instructions have. Instead, Thumb-2 has the IT instruction, which conditionally executes up to four subsequent instructions. The instructions affected by an IT instruction are said to be in an IT block.


Thanks for that; I'm not particularly familiar with ARM (yet) - I've written for it in C but very little assembler - clearly I have some reading to do!


Fri Nov 22, 2019 10:24 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
ARM is quite nice, Thumb a bit less so, in my limited experience. Starting from a 6502 background I found ARM fairly straightforward. But I wouldn't recommend you dive into ARM assembly until you have a good reason!

As it turns out, some of the OPC designs are also quite nice to program for, if you come from a 6502/ARM background. They're pretty simple and regular.

But of course if you're on a path of building your own CPU, you'll be doing your own thing. Kudos for picking something as nontrivial as decompression!


Fri Nov 22, 2019 10:35 pm

Joined: Wed Nov 20, 2019 12:56 pm
Posts: 92
BigEd wrote:
ARM is quite nice, Thumb a bit less so, in my limited experience. Starting from a 6502 background I found ARM fairly straightforward. But I wouldn't recommend you dive into ARM assembly until you have a good reason!


As an Amiga-owning teenager my first exposure to ARM was as the heart of the Acorn Archimedes computers at school. Archimedes owners tended to look down their noses at Amiga owners, so naturally I developed an irrational bias against it back then!

Quote:
As it turns out, some of the OPC designs are also quite nice to program for, if you come from a 6502/ARM background. They're pretty simple and regular.


Yes, as I said in the introductions thread, OPC5 (rather than the LS variant) is the one that interests me most, simply because its logic footprint is unbelievably tiny - on a Cyclone III it's a mere 254 logic elements + 1 block RAM, or 534 if you force the register file to logic instead. EightThirtyTwo is around 1,500 but no block RAM (when built without dual-threading). The f32c MIPS-compatible core is also around 1,500 (plus block RAM) when the bells-and-whistles are disabled. The TG68k MC68020-compatible core is between 4,000 and 6,500 depending on which features are enabled.

Quote:
But of course if you're on a path of building your own CPU, you'll be doing your own thing. Kudos for picking something as nontrivial as decompression!


I picked that mainly because I'd already translated it from 68K to MIPS recently, so it was an interesting comparison. 204 bytes in MIPS vs. 72 bytes in 832, vs 74 bytes for the 68k original. (I'm sure someone who knows MIPS better than me could reduce the 204 somewhat, though.)

There are some projects that I want to port, in the longer term, from a device called MiST, which has a Cyclone III with 25,000 LEs and a supporting low-end ARM µC, to a device called the Turbo Chameleon 64, which has the same FPGA but no supporting µC. This means I need to include an extra CPU in the cores to replace the missing µC. So far I've used ZPUFlex for this, but it's a bit slow and while the code density's not terrible it could be better. I've experimented with the f32c MIPS core, and it's super-fast but the code density's awful. My aim with EightThirtyTwo was to hit the sweet spot, balancing speed, logic footprint and code density (and thus block RAM usage for boot code), but OPC5 might turn out to be even more useful in the long term.


Fri Nov 22, 2019 11:44 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
All very interesting! The OPC machines were initially built on Xilinx, where there's a very efficient implementation of register files within the LUTs. So, it seems, on Cyclone that's not happening, and you either get a block RAM or rather a lot of LUTs. Which makes it even more remarkable that OPC5 is still in the running!

> A backend for the VBCC C Compiler is under construction.
That's rather handy!


Sat Nov 23, 2019 12:41 pm

Joined: Wed Nov 20, 2019 12:56 pm
Posts: 92
BigEd wrote:
All very interesting! The OPC machines were initially built on Xilinx, where there's a very efficient implementation of register files within the LUTs. So, it seems, on Cyclone that's not happening, and you either get a block RAM or rather a lot of LUTs.


That's a property of Altera/Intel devices rather than anything specific to the OPC5 - Xilinx does distributed RAM much more efficiently, it seems. "Rather a lot of LUTs" is still relative, however - it's still a very small CPU!

Quote:
Which makes it even more remarkable that OPC5 is still in the running!


The only CPUs I've seen of similar size have a bit-serial ALU.


Sat Nov 23, 2019 5:36 pm

Joined: Wed Nov 20, 2019 12:56 pm
Posts: 92
I've made some progress on the vbcc C compiler backend for EightThirtyTwo over the last week or so. I now have both varargs and soft-division & modulo working well enough to support printf().
http://retroramblings.net/?p=1315


Fri Nov 29, 2019 10:31 pm

Joined: Wed Nov 20, 2019 12:56 pm
Posts: 92
The C compiler backend is now working well enough to run a dhrystone benchmark!
http://retroramblings.net/?p=1322


Sat Dec 07, 2019 6:26 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Congratulations! It sort of looks vaguely like the feedback from llvm would push you towards a more CISCy machine - is that so?


Sat Dec 07, 2019 6:44 pm

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
I feel terms like RISC don't apply anymore, as marketing and being able to throw millions of transistors err $$$$ around can sell anything. (Intel and IBM) Your computer, you can call it anything.
Oddly I don't believe in complex features to gain speed, other than a few special cases. Idle states are important in that other devices have time on the memory bus. I don't see anything new in computer hardware since November 5th, 1955 a red letter date in science. err wrong invention umm 1965.
VLSI can give faster transistors, but how you use them just changes your computer architecture around.
This design works best when your CPU and memory are the same speed ...
Also you need to keep in mind what a compiler might want to generate as well.


Sat Dec 07, 2019 8:28 pm

Joined: Wed Nov 20, 2019 12:56 pm
Posts: 92
BigEd wrote:
Congratulations! It sort of looks vaguely like the feedback from llvm would push you towards a more CISCy machine - is that so?


Sort of, yes - certainly if registers aren't plentiful then efficient stack manipulation is important. I have a load-indexed instruction which helps, but I can't see any way of implementing a store-indexed instruction with the current design, so for now I'm stuck with "li <offset+4>, addt r6, stmpdec r<n>" - which is still smaller than it would be on MIPS, so I can live with that.

It's not always easy to make good use of CISC-y instructions from a code-generator, either - it took significant effort to make use of load-with-postincrement and store-with-predecrement instructions.

(This is vbcc, not llvm though).


Sat Dec 07, 2019 8:33 pm

Joined: Wed Nov 20, 2019 12:56 pm
Posts: 92
oldben wrote:
I feel terms like RISC don't apply anymore, as marketing and being able to throw millions of transistors err $$$$ around can sell anything. (Intel and IBM) Your computer, you can call it anything.


True - even back in the 90s ARM was loudly touted as RISC, despite having some very CISC-y features like the load/store multiple instructions.

Quote:
Oddly I don't believe in complex features to gain speed, other than a few special cases. Idle states are important in that other devices have time on the memory bus.


Yes, in my case, in the projects where I want to use this CPU it will be somewhat bandwidth starved, which is another reason why code density is one of my primary goals. It might look like I'm chasing raw speed with this project; what I'm actually chasing is efficiency, in the hope that the CPU can run code fast enough even when it's last in the queue for RAM access.

Quote:
I don't see anything new in computer hardware since November 5th, 1955 a red letter date in science. err wrong invention umm 1965.


I've been surprised just how many of these ideas do go back to the 60s or beyond.

Quote:
Also you need to keep in mind what a compiler might want to generate as well.


Or you learn what a compiler might want to generate when you try to implement one!


Sat Dec 07, 2019 8:49 pm