View unanswered posts | View active topics It is currently Thu Mar 28, 2024 11:45 am



Reply to topic  [ 33 posts ]  Go to page 1, 2, 3  Next
 Thoughts about a high-density ISA 
Author Message

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
.
One of the usual design objectives for a homebrew ISA is simplicity: easy to build, easy to verify, perhaps a good compiler target. And maybe with a simple decoder a high clock speed.

But what about density? We know, I think, that the Z80 is pretty good for density, and likewise the x86 isn't bad. Probably a byte-sized instruction, with the assistance of prefix or suffix bytes, can get a lot of mileage in limited program size. Without needing backward compatibility, perhaps some degree of regularity can come out. But slavish regularity is probably a constraint on instruction encoding.

I'm thinking variable-number-of-byte instructions, and byte-addressable memory. Perhaps not too large a register file, although getting operands from registers saves on bits to specify locations. Probably also both 8 bit and 16 operations: can save instruction bytes by offering 16 bit increments, and probably other 16 bit operations. Perhaps a short-mode addressing, again to save on program length - but could have stack-based or indexed addressing, again saving on program space.

If we don't worry at all about complexity, performance, compatibility, compiler friendliness - if we care only for making programs small - what might we end up with?

Edit: a thought... a Domain Specific Language will often be a density win. So this dense ISA should perhaps be designed to make it easy to implement a virtual machine - perhaps, to make it easy to implement a Forth.


Tue Feb 02, 2021 5:53 pm
Profile

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
Micro computers z80 PDP 11 6502 6809 8086 all had 2 data sizes 8 bit byte and 16 bit words, and 16 bit address space.
This could be packed into a 8 or 16 bit opcode. The next size up is a 32 bit cpu,to keep 8 bit bytes.Yyou have the problem of more data type and sizes. This means a 32 bit opcode.The best I can pack has 512KB of memory.
Attachment:
CPU32.txt [589 Bytes]
Downloaded 450 times


Tue Feb 02, 2021 10:06 pm
Profile

Joined: Mon Aug 14, 2017 8:23 am
Posts: 157
Ed,

Introducing 16-bit instructions started with the 8080, continued and extended with the Z80 and probably reached a height with the 6809.

The 6809 had 16-bit instructions for loads, stores, compares, adds, subtracts, transfers, exchanges,
pushes and pulls.

With two stack pointers, and all those 16-bit instructions, the 6809 would have made a great Forth cpu. Unfortunately it was old technology and the clock of 1MHz increasing to 2MHz really could not cut it.

Unfortunately the 6809 was 6X the price of a 6502 and was a little too late to the 8-bit party. Not only that but overshadowed by it's bigger sibling the 68000.

Motorola were never any good at promoting their early microprocessors - which is much of the reason why Chuck Peddle and the 6502 crew bailed out in 1974.

The 6809 also had an extended monitor that almost formed a tiny operating system. Called Assist09 - there is a listing here:

https://github.com/tgtakaoka/ASSIST09/b ... SIST09.LST


Tue Feb 02, 2021 11:31 pm
Profile

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
I know this is probably a bit out of line, but the 32-bit T400/T800 processors from Inmos used a variable length instruction that I found to provide good code density. The instruction set focused on encoding the most frequent instructions in a single byte, and it further encoded additional instructions as multiples of the most basic instructions. There are some issues with the encoding mechanism that complicate the compilation of programs, but those issues can be resolved in a number of ways. Thus, high density code was available for the 32-bit machine. The external 32-bit bus width was used to prefetch as many as four instructions, thereby reducing the memory bandwidth required for instruction fetches versus data reads and writes.

_________________
Michael A.


Wed Feb 03, 2021 12:58 am
Profile

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
I remember Byte Magizine had FORTH cpu design, in that era, with 16 bit instruction word.
Something like even subroutine call, odd Forth stack microcoded operations.
Ben.


Wed Feb 03, 2021 1:03 am
Profile

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Good points: I do wonder if the 6809 has good code density - it certainly has some nice facilities.

Rockwell's 65f11 family has Forth support at instruction level, I think. (Hmm, maybe not...)
See http://archive.6502.org/datasheets/rock ... h_roms.pdf
http://soton.mpeforth.com/flag/jfar/vol ... ticle1.pdf

Maybe I was thinking of the R65C19? Datasheet and explanation within:
http://forum.6502.org/viewtopic.php?p=31222#p31222

And yes, Michael, very good point, the transputer's variable-length stream might be a good approach and is certainly worth considering. As a stack machine, it has even less to say about operands than a machine with a small register file. But IIRC inline constants would take up one byte per nibble of useful value. So, you can cheaply address 16 addresses and 256 addresses isn't too bad.

From the book:
Quote:
Transputer instructions are a single byte in length, although one instruction is merely "execute the contents of the operand register as an instruction," which is the way around a one-byte instruction limitation. Instructions that are one byte in length are called "direct instructions." Instructions that use the "execute the contents of the operand register" trick and are more than one byte in length are called "indirect instructions." Indirect instructions are composed of prefix instructions that load a value into the operand register, followed by an operate instruction that triggers the actual execution.
The idea behind single-byte instructions is to improve the effectiveness of the instruction pre-fetch, which in turn improves processor performance. Since there is an extra word in the processor’s pre-fetch buffer, the transputer rarely has to wait for an instruction fetch before proceeding to decode and execute the next instruction. Since transputer instructions are only one byte in length and one word is fetched from memory at a time, transputers are effectively equipped with a four-byte instruction cache.
The most commonly used instructions were chosen to be the direct instructions, since direct instructions are the most compact (one byte) and execute the fastest. However, it is obviously impossible to code all necessary instructions into a single byte. As a result, some operations in the transputer require more than one instruction


Wed Feb 03, 2021 9:56 am
Profile

Joined: Mon Aug 14, 2017 8:23 am
Posts: 157
Quote:
Rockwell's 65f11 family has Forth support at instruction level, I think. (Hmm, maybe not...)


Ed,

I always thought that the 65F11 was basically a 6502 with an on-chip ROM that contained a 6502 Forth.

I was not fully aware of the R65C19, which clearly has had it's architecture tweaked to make it an efficient host for exectuing the 16-bit Forth VM.

In the late 1990's I worked for a modem manufacturer. Many of the 14.4 and 28.8 kbit modem designs were based on a Rockwell microcontroller (before DSP based designs appeared), which I believe was related to the 65C19.


Wed Feb 03, 2021 11:34 am
Profile

Joined: Wed Nov 20, 2019 12:56 pm
Posts: 92
Good code density was one of my design goals when I started experimenting with EightThirtyTwo. That was the reason I settled on single-byte instructions (along with the fact that I'd already experimented with ZPU, which also uses single-byte instructions.)

I think the best way to think about improving code density is to look at real code, and examine how it could be made tighter. To that end, here's a simplistic LZ4 decompression routine.

Code:
   .global lz4_depack
lz4_depack:   // Source pointer in r0, dest pointer in r1, end of compressed data in r2
   stdec   r6

.tokenLoop:
   ldbinc   r0
   mr   r4
   mr   r5
   li   15
   and   r4
   li   4
   shr   r5
   cond   EQ
     .lipcrel .lenOffset
     add   r7

   .lipcrel .readLen
   add   r7

.litCopy:
   ldbinc   r0
   stbinc   r1
   li   1
   sub   r5
   cond   NEQ
     .lipcrel .litCopy
     add   r7

   mt   r2
   cmp   r0
   cond   SLT
     .lipcrel .lenOffset
     add   r7
         
.over:
   ldinc   r6
   mr   r7
         
.lenOffset:
   ldbinc   r0
   mr   r3
   li   8
   ror   r3
   ldbinc   r0
   or   r3
   li   24
   ror   r3

   mt   r1
   exg   r4
   mr   r5

   mt   r3
   sub   r4

   .lipcrel .readLen
   add   r7

   li   4
   add   r5
.copy:
   ldbinc   r4
   stbinc   r1
   li   1
   sub   r5
   cond   NEQ
     .lipcrel .copy
     add   r7

   .lipcrel .tokenLoop
   add   r7

.readLen:
   stdec   r6
   li   15
   cmp   r5
   cond   NEQ
     .lipcrel .readEnd
     add   r7

.readLoop:
   ldbinc   r0
   mr   r3
   add   r5
   .liconst 0xff
   xor   r3
   cond   EQ
     .lipcrel .readLoop
     add   r7

.readEnd:
   ldinc   r6
   mr   r7


In 832 code it comes out at 72 bytes.
Interestingly, the m68k code it's based on is also exactly 72 bytes! (https://github.com/arnaud-carre/lz4-68k ... allest.asm)

I found some Z80 source in an old Google Groups post: (https://groups.google.com/g/lz4c/c/A6TLHThL0c8) - this assembles to 93 bytes.

An ARM Thumb implementation comes to 84 bytes (https://community.arm.com/developer/ip- ... -and-later)

I translated it to MIPS and got 204 (!) bytes - though someone more skilled than me with MIPS could no doubt beat that!

So what could be done to make this smaller still? I'm wondering what would happen if the number of bits per instruction wasn't a multiple of 8? The ARM Thumb code was 42 instructions, 16 bits each, so 84 bytes. Could we get the number of bits per instruction down to, say, 12, while keeping enough expressiveness in the code that we don't need more than a negligible increase the instruction count?


Thu Feb 04, 2021 10:11 pm
Profile

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 123
A pipe dream would be to implement LZ4 at the hardware level and compress the instruction stream. But then you have all kinds of trouble with interrupts and branches.

Other more sane ideas that haven't yet been mentioned:

- Function prologues and epilogues use a lot of bytes, so making those super short with multi-load/store instructions helps a lot.
- Pulling literals/immediates out of the instruction stream into shared pools (like the SuperH does), where multiple instructions can refer to the same constant with a shorter PC-relative offset
- Having a barrel shifter and multiply/divide instructions and other common things that would have to be a function call if they were missing.
- Looking for common sequences of instructions that could become a single instruction
- Maybe some serious thought into running the ISA through an optimization algorithm with a variety of test programs to see if a few more bytes can be squeezed out


Fri Feb 05, 2021 9:42 pm
Profile

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
Function calls are expensive, but how often do you call a subroutine. Recursive routines like in compiler only nest a few times. The same goes for * / % , they are not used that often. Floating point does require some thought.
The other question, is what memory size you are looking at? You have 64KB 256KB 1024KB 32BIT 64BIT and constants for that. Modern software tends to wasteful of memory so you need a large offset for the stack or frame pointer,
Ben.


Sat Feb 06, 2021 12:32 am
Profile

Joined: Mon Aug 14, 2017 8:23 am
Posts: 157
Ben,

I'll put a stake or two in the ground:

1. 8 or 16-bit machine with 64K addressable words of memory.

2. "3M" workstation - at least a megabyte of memory, a megapixel display and a million operations per second i.e. 1 MIPS. Also costing 1 megapenny so $10,000

The PDP-11 probably satisfied #1 about 1969.

#2 was achieved around 1980-1986 by Sun Microsystems.

Where we go from there probably advances on a logarithmic scale.

The recently announced Pi Pico exceeds #2 for a $4 asking price.

The typical smart phone probably exceeds #2 by 2 orders of magnitude in terms of MIPS, and memory, plus a 4 million pixel colour display.


Sat Feb 06, 2021 12:58 am
Profile

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Some interesting ideas, for sure!

Coding up LZ4 for various architectures is an excellent idea, and a great contribution. I wonder: how does RISC-V's compressed instruction set score?

From Wikipedia:
Quote:
optional (but recommended) variable-length compressed instruction set, RVC, that includes 16-bit instructions. Like ARM's Thumb and the MIPS16, the compressed instructions are simply aliases for a subset of the larger instructions. Unlike ARM's Thumb or the MIPS compressed set, space was reserved from the beginning so there is no separate operating mode. Standard and compressed instructions may be intermixed freely

A prototype of RVC was tested in 2011. The prototype code was 20% smaller than an x86 PC and MIPS compressed code, and 2% larger than ARM Thumb-2 code

Standard RVC requires occasional use of 32-bit instructions. Several nonstandard RVC proposals are complete, requiring no 32-bit instructions, and are said to have higher densities than standard


Sat Feb 06, 2021 8:23 am
Profile

Joined: Wed Nov 20, 2019 12:56 pm
Posts: 92
rj45 wrote:
A pipe dream would be to implement LZ4 at the hardware level and compress the instruction stream. But then you have all kinds of trouble with interrupts and branches.


That would be amazing, but LZ4 doesn't play well with random access - it requires the ability to look back at what's already been decoded. Some kind of Huffman-like encoding might be more realistic, though - perhaps with 4, 8 and 12-bit instructions with some kind of dictionary lookup to the actual instructions? Effective addresses could just be shifted one bit so the bottom bit selects the low or high nybble? Or maybe 4, 6, 8 and 10-bit codes?

Quote:
- Function prologues and epilogues use a lot of bytes, so making those super short with multi-load/store instructions helps a lot.


True - but that gets more important the more registers you have.
With only 8 registers, with r7 as PC and r6 as the stack, my worst-case epilogue is:
Code:
   ldinc   r6
   mr   r5
   ldinc   r6
   mr   r4
   ldinc   r6
   mr   r3
   ldinc   r6
   mr   r7

(r0, r1 and r2 are scratch) - so that's 8 bytes.
but when compiling for size, I define this once, with the label _functiontail: then branch to that + an offset, depending on which
registers I need to restore.
I have one functiontail per compilation unit, and can jump to it from +/- 2 kilobytes away with 3 bytes
Code:
   .lipcrel _functiontail+n // this takes 1 byte if within +31/-32 bytes, 2 bytes if within +2047/-2048 bytes...
   add r7


Quote:
- Pulling literals/immediates out of the instruction stream into shared pools (like the SuperH does), where multiple instructions can refer to the same constant with a shorter PC-relative offset


That's an interesting concept. I'd also wondered whether some kind of shared literal / immediate could work alongside instructions with inconvenient bit lengths, like maybe pack 5 11-bit instructions and a shared 9-bit literal into a 64-bit word?

Quote:
- Looking for common sequences of instructions that could become a single instruction


In the LZ4 case above I could shave off 3 bytes if I'm definitely running in little-endian mode - there's currently some cumbersome shifting going on to read a 16-bit value in an endian-agnostic way.

The other thing that occurs is that a bit-split instruction would be useful. Say you have two values packed into a single byte or word and want to separate them - currently you need to copy, shift one copy, and mask the other.
With such an instruction, one could replace:
Code:
   ldbinc   r0
   mr   r4
   mr   r5
   li   15
   and   r4
   li   4
   shr   r5


with

Code:
   ldbinc   r0
   mr   r4
   li   4 // number of bits to split
   split r4
   mr r5


This could certainly find uses in soft-floating point code.

BigEd wrote:
Some interesting ideas, for sure!

Coding up LZ4 for various architectures is an excellent idea, and a great contribution. I wonder: how does RISC-V's compressed instruction set score?


Interesting question, once again. I'm not familiar enough with RISC-V to say for sure, but I'd be surprised if for this code it were markedly different from Thumb.

The one thing to bear in mind with LZ4 though is that the working set is pretty small. How cumbersome it is to spill data onto the stack when the working set gets bigger is an important consideration. (And it's where 832 starts to fall down.)


Sat Feb 06, 2021 1:46 pm
Profile

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
For a 1970's era computer I have a nice compact ISA with 512KB ram with 2901's (1975)
1974 Ram cost per bit in cents
is .3 * (.72)^(t-1974) from computer engineering - digital press. ~ 1 cent a bit 1970 ~ .05 cents a bit 1980.
96KB is what would call a useable amount of memory so that would be in 1974 at least $2500.A pdp 8/e
was ~ $4500 with 4K ram. PDP 11 16KW ram was ~$5,600, http://woffordwitch.com/Witch.asp
This seems to a good all around ISA for the memory and TTL logic at the time.

As for modern systems,.I have no idea where the memory is going so I can not say how to improve things,
I need a static profile of a OS for memory offsets and usage to do that. I have always felt that laguages needed
a 2 pass to sort data and code segments. That is how would improve the ISA,

Modern software are now more coded as objects, so hardware for objects might be included
as function block, as software DMA device.


Sat Feb 06, 2021 8:30 pm
Profile

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
robinsonb5 wrote:
Some kind of Huffman-like encoding might be more realistic, though - perhaps with 4, 8 and 12-bit instructions with some kind of dictionary lookup to the actual instructions? Effective addresses could just be shifted one bit so the bottom bit selects the low or high nybble?


I was thinking about that: for a conventional 8 bit ISA like the 6502's, it's certainly tempting to do some static analysis of frequencies and see how huffman encoding would work.

But as you note, you might need to jump half-way into a byte. Or, there could be a NOP before any unaligned target address: probably not a 4 bit NOP, as that's really expensive, so perhaps a 12 bit NOP.

For a larger instruction set such as a 32 bit RISC, a frequency analysis of instruction words feels near-meaningless. Just possibly a CISC instruction set like x86 could be compressed this way - but I'd go for 6502 as a starting point.


Sat Feb 06, 2021 10:29 pm
Profile
Display posts from previous:  Sort by  
Reply to topic   [ 33 posts ]  Go to page 1, 2, 3  Next

Who is online

Users browsing this forum: No registered users and 7 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
cron
Powered by phpBB® Forum Software © phpBB Group
Designed by ST Software