Reply to topic  [ 24 posts ]  Go to page 1, 2  Next
 Stuck between a hard place and a 16-bit rock. 

Joined: Sun Mar 27, 2022 12:11 am
Posts: 40
What are everyone's thoughts on breaking the 16-bit memory barrier?

I've been a bit stuck on this problem. 64k is plenty for retro gaming, but doesn't quite cut it if you want to implement a self-hosted compiler, have decent online help, a good UI, etc. To put it in perspective, 64k is 800 lines of 80-character text, or one 320x200 8-bit image. My instinct is to just build a 32-bit computer, and I've almost pulled the trigger on that a couple of times. But I think it violates the aesthetics of a TTL computer - or at least the number of chips gets a bit silly.

The current plan is to build a 16-bit RISC barrel processor with a hardwired decoder. It's a Harvard architecture and the address space can be stretched a little, but data structures would be limited to less than 64k, so it's not much use.

One idea was to implement a VM for programs that need the extra address space. The performance hit from that is probably too much.
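For what it's worth, the VM idea can be sketched roughly like this: a hypothetical two-register-stack bytecode machine with a 32-bit virtual PC running on a 16-bit host. All names and opcodes here are invented for illustration, not any real design:

```c
#include <stdint.h>

/* Hypothetical sketch only: a tiny bytecode VM that gives programs a
 * 32-bit address space on a 16-bit host.  Opcode names are invented. */

enum { OP_LIT, OP_ADD, OP_HALT };

typedef struct {
    uint32_t pc;    /* 32-bit virtual program counter */
    uint32_t a, b;  /* two-register expression stack  */
    uint8_t *mem;   /* virtual memory; a real host would bank-switch here */
} vm_t;

static uint8_t fetch8(vm_t *vm) { return vm->mem[vm->pc++]; }

uint32_t vm_run(vm_t *vm)
{
    for (;;) {
        switch (fetch8(vm)) {
        case OP_LIT:            /* push literal: b <- a, a <- next byte */
            vm->b = vm->a;
            vm->a = fetch8(vm);
            break;
        case OP_ADD:            /* a <- a + b */
            vm->a += vm->b;
            break;
        default:                /* OP_HALT */
            return vm->a;
        }
    }
}
```

The dispatch loop is where the performance hit concentrates: every bytecode pays the fetch/decode overhead, so the cost of the VM depends almost entirely on how cheap that loop can be made.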


Sun Mar 05, 2023 12:44 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
DockLazy wrote:
What are everyone's thoughts on breaking the 16-bit memory barrier?

I've been a bit stuck on this problem. 64k is plenty for retro gaming, but doesn't quite cut it if you want to implement a self-hosted compiler, have decent online help, a good UI, etc. To put it in perspective, 64k is 800 lines of 80-character text, or one 320x200 8-bit image. My instinct is to just build a 32-bit computer, and I've almost pulled the trigger on that a couple of times. But I think it violates the aesthetics of a TTL computer - or at least the number of chips gets a bit silly.

The current plan is to build a 16-bit RISC barrel processor with a hardwired decoder. It's a Harvard architecture and the address space can be stretched a little, but data structures would be limited to less than 64k, so it's not much use.

One idea was to implement a VM for programs that need the extra address space. The performance hit from that is probably too much.


A RISCy design, if I say so. Not having byte addressing gives you ample space for a GUI with bank select.
20 or 32 bits is the only way to expand while keeping the data path simple.
The Xerox Alto is a good example of a TTL computer with a GUI. I like the portrait display, since it is about 600x800.
(None of these cheap TV displays like the IBM PC.)
https://archive.org/details/byte-magazi ... ew=theater


Sun Mar 05, 2023 3:30 am

Joined: Sun Oct 14, 2018 5:05 pm
Posts: 59
DockLazy wrote:
What are everyone's thoughts on breaking the 16-bit memory barrier?

I've been a bit stuck on this problem. 64k is plenty for retro gaming, but doesn't quite cut it if you want to implement a self-hosted compiler, have decent online help, a good UI, etc. To put it in perspective, 64k is 800 lines of 80-character text, or one 320x200 8-bit image. My instinct is to just build a 32-bit computer, and I've almost pulled the trigger on that a couple of times. But I think it violates the aesthetics of a TTL computer - or at least the number of chips gets a bit silly.

The current plan is to build a 16-bit RISC barrel processor with a hardwired decoder. It's a Harvard architecture and the address space can be stretched a little, but data structures would be limited to less than 64k, so it's not much use.

One idea was to implement a VM for programs that need the extra address space. The performance hit from that is probably too much.


The BBC Micro (6502, 32KB RAM) could run a 16-bit BCPL compiler (from tape, if you really wanted to stick to that 64KB limit). The Apple II with 64KB RAM could run the Aztec C shell and C compiler. Many Z80 CP/M systems could run C and FORTRAN (and COBOL?) compilers. Older systems? A PDP-8 with 32K of core can run an OS and a FORTRAN compiler...

But you hit the same issues I've hit, in that trying to run a modern compiler is nigh on impossible. There are some C-like systems out there that run on the ATmega328p (Arduino) though - Bitlash is one - so if you had (up to) 64K Flash and 64K RAM, Harvard style, then something a bit more sophisticated ought to be possible.

Hope it goes well.

-Gordon


Sun Mar 05, 2023 7:06 am

Joined: Sun Oct 14, 2018 5:05 pm
Posts: 59
DockLazy wrote:

One idea was to implement a VM for programs that need the extra address space. The performance hit from that is probably too much.


Adding another data point here - my systems went from a "re-creation" of a system I used in the 80s (8-bit 6502 CPU with a 16-bit VM to run BCPL) to a hybrid 8/16-bit CPU (65C816) running a 32-bit VM, letting me write an OS that can run and compile BCPL programs. While not fast, it's very usable - there are no on-board graphics, so I'm never going to write some high-speed shooter, but it would be more than fast enough for an Invaders/Tetris-style game or some sort of CAD application. The base CPU runs at 16MHz.

Simple demo here:

https://youtu.be/ZL1VI8ezgYc

-Gordon


Sun Mar 05, 2023 7:11 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
I am tempted to say modern compilers will not even run with 32-bit addressing.
All the older Unix compilers seem to be too updated to run under a small Unix system any more,
and source can't be found for the original compilers. You might find something adaptable on a VAX
system running under SIMH, from around 1982 to 1990. C seems to have a lot of hidden
dependencies on details of code layout and stack frames.
Ben.


Sun Mar 05, 2023 7:59 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
I like the idea of a 32-bit VM - that can make use of lots of banked memory, presumably, without the application needing to know or care.

George Foot's recent explorations of a SuperShadow board are very interesting too: offering a 64k application space and a 64k OS space. It's a different kind of division than I and D. It comes out, AFAICT, a lot cleaner than any other banking scheme for the 6502, and very minimal in hardware. (I don't know anything about how the larger-RAM Z80 systems went about their business.)

All that said, adventures in 32-bit or 24-bit computing are entirely valid! Or 16-bit address spaces with 16-bit words - that gives you twice as much.
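To make the "application needn't know or care" point concrete, here's a rough sketch of how a VM read can hide banked physical memory behind a flat virtual address. The window size and bank count are arbitrary assumptions for this sketch, not the SuperShadow scheme or any particular board:

```c
#include <stdint.h>

/* Illustrative banking model: a flat virtual address is split into a
 * bank number and an offset into a 16KB window.  Sizes are invented. */

#define BANK_BITS 14u
#define BANK_SIZE (1u << BANK_BITS)

static uint8_t  ram[4][BANK_SIZE];  /* stand-in for the physical banks */
static unsigned cur_bank = ~0u;     /* currently latched bank          */

static void select_bank(unsigned bank) { cur_bank = bank; }  /* latch write */

uint8_t vm_read(uint32_t vaddr)
{
    unsigned bank   = vaddr >> BANK_BITS;
    unsigned offset = vaddr & (BANK_SIZE - 1u);
    if (bank != cur_bank)   /* only touch the latch on a bank crossing */
        select_bank(bank);
    return ram[bank][offset];
}
```

Sequential accesses mostly stay within one bank, so the latch write is rare; that's why this kind of scheme costs so little on average even though a worst-case access pattern pays the switch on every reference.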


Sun Mar 05, 2023 8:19 am

Joined: Sun Dec 19, 2021 1:36 pm
Posts: 68
Location: Michigan USA
I can totally relate to this discussion. My first project was the TTL-Retro. It was a sort-of 12-bit TTL stack computer, but the memory addressing was really a kludge.

http://www.mtmscientific.com/stack.html

Now the 16-bit addressing of a 64K flat memory space, evenly split between ROM and SRAM, feels downright luxurious on my current project, the LALU computer. Still, both of these systems have gravitated to an interpreted language using RPN and stacks. Getting to something like C seems like a real challenge, but I keep looking.
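As a sketch of the kind of RPN interpreter such stack machines end up running (the token encodings here are invented for the example):

```c
#include <stdint.h>

/* A minimal RPN evaluator - the software shape of a stack machine.
 * Token values are made up for illustration. */

enum { RPN_PUSH = 0x100, RPN_ADD, RPN_MUL, RPN_END };

int16_t rpn_eval(const int16_t *tok)
{
    int16_t stack[16];
    int sp = 0;
    for (;;) {
        switch (*tok++) {
        case RPN_PUSH: stack[sp++] = *tok++;             break;
        case RPN_ADD:  --sp; stack[sp - 1] += stack[sp]; break;
        case RPN_MUL:  --sp; stack[sp - 1] *= stack[sp]; break;
        default:       return stack[sp - 1];             /* RPN_END */
        }
    }
}
```

Part of why RPN systems keep winning on small machines is visible here: there is no parser and no expression tree, just a loop and a stack.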

Probably you have seen this discussion over at Hackaday, but it is still an interesting read.

https://hackaday.com/2015/07/31/build-y ... easy-part/


Sun Mar 05, 2023 11:39 am

Joined: Sun Oct 14, 2018 5:05 pm
Posts: 59
BigEd wrote:
I like the idea of a 32-bit VM - that can make use of lots of banked memory, presumably, without the application needing to know or care.


Well, me too... which is what I did on the 65C816 - so could it be done on other systems? I'm sure it could, and for a while I flirted with the idea of it on the Commander X16 project, which has an interesting banked RAM system.

The trade-off is, as usual, speed - but trading speed for the flexibility of an easier-to-use high-level language? Worth it in my books. An 8-bit CPU with 64K of "ROM" and 64K (banks) of RAM? It should be very capable, especially if the CPU is in software (FPGA) and might be tweakable to help run the VM...

I crudely estimated my system to be something like a 250kHz CPU running the bytecode instructions - the timings are very variable, but even so it's very interactive (the bytecode instruction set is very CISC).

So I can edit and compile smallish programs on it without getting too frustrated, and so on. The underlying CPU did not make it easy, though, so I had to include some unwieldy workarounds for some of its shortcomings. E.g. just having one instruction to load a byte into a 16-bit register and zero the top 8 bits would give a significant speedup... (So if you ever design a CPU to run a bytecode, think of stuff like that!)
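The wished-for instruction is easy to show in C terms (function names here are mine): without a zero-extending byte load, every opcode fetch pays an extra mask step.

```c
#include <stdint.h>

/* What an '816-style fetch has to do: a 16-bit load followed by a
 * mask to clear the top byte before the opcode can index a table. */
uint16_t fetch_op_masked(const uint16_t *mem, uint16_t pc)
{
    return (uint16_t)(mem[pc] & 0x00FFu);
}

/* What an lbu-style zero-extending byte load gives you in one step. */
uint16_t fetch_op_zext(const uint8_t *mem, uint16_t pc)
{
    return mem[pc];   /* widened to 16 bits with the top byte zeroed */
}
```

One C statement either way, but on hardware the first version is two instructions in the hottest loop of the interpreter, which is exactly where single-cycle savings multiply.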

-Gordon


Sun Mar 05, 2023 5:13 pm

Joined: Sun Oct 14, 2018 5:05 pm
Posts: 59
drogon wrote:
BigEd wrote:
I like the idea of a 32-bit VM - that can make use of lots of banked memory, presumably, without the application needing to know or care.


Well, me too... which is what I did on the 65C816 - so could it be done on other systems? I'm sure it could, and for a while I flirted with the idea of it on the Commander X16 project, which has an interesting banked RAM system.

The trade-off is, as usual, speed - but trading speed for the flexibility of an easier-to-use high-level language? Worth it in my books. An 8-bit CPU with 64K of "ROM" and 64K (banks) of RAM? It should be very capable, especially if the CPU is in software (FPGA) and might be tweakable to help run the VM...


Following up on myself, after some pondering over the past few hours: one thought (and I'm sure many others have had something along the same lines) might be to have a blisteringly fast but relatively simple CPU - one designed to do no more than run a bytecode interpreter. I think the term "millicode" has been coined for this in the past, and I heard some RISC-V people talking about something similar a while back. It might be an interesting step from actual microcode to a high-level/general-purpose CPU...

Also, thinking about the number of distinct instructions I used on both the 65C816 and RISC-V to implement my bytecode VM - it's less than half the instruction set on the '816. It's hard to gauge on RISC-V because of the registers, but there is scope for an even more reduced instruction set CPU to execute it...

-Gordon


Sun Mar 05, 2023 8:55 pm

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
My 16+ bit computer started off with the idea of Small-C 1.0 from Dr. Dobb's as the software base.
The project stalled when the Small-C compiler for DOS ran out of memory trying to cross-compile the C compiler.
A set-condition instruction was added for logic operations. The run-time library was to be in ROM. Rather than a VM,
most code is compiled directly. Rather than having a clean stack-based model, Small-C kept changing the run-time
libraries with each version, and you could not self-compile to bootstrap a new version, for either the 8080 or 8086
code.
DOSBox has bugs, since it was made to play games, not run real programs, and finding a vintage DOS assembler
and linker is a problem. The FPGA software has bugs and is outdated, making revising things a problem.
I use Altera's AHDL, since VHDL and Verilog are both defective products as far as hardware languages go.
Even if I could buy the old software, I still can't pay the old prices: $800 to $1600. Microsoft sold
DOS cheap, and then overcharged for other stuff.
Ben.


Mon Mar 06, 2023 6:12 pm

Joined: Sun Oct 14, 2018 5:05 pm
Posts: 59
oldben wrote:
My 16+ bit computer started off with the idea of Small-C 1.0 from Dr. Dobb's as the software base.
The project stalled when the Small-C compiler for DOS ran out of memory trying to cross-compile the C compiler.
A set-condition instruction was added for logic operations. The run-time library was to be in ROM. Rather than a VM,
most code is compiled directly. Rather than having a clean stack-based model, Small-C kept changing the run-time
libraries with each version, and you could not self-compile to bootstrap a new version, for either the 8080 or 8086
code.


I have to admit to spending more than a small amount of time recently looking for a C compiler (Tiny, Small or otherwise) that might fit the bill of being self-hosting on my Ruby 816 BCPL system. That would require the compiler to compile itself (although, to be honest, I can live without that as long as it can cross-compile itself), but the key thing would be for it to be modifiable to output code that runs under my bytecode VM.

(and yes, I'm aware that this may seem like somewhat an anachronism, given that C was born from BCPL via B)

However, I have not succeeded and have more or less given up for now. Some are just too complex, or too hard-wired to the 8086, etc., to make it feasible for me in a sensible timescale, so I'll stick with BCPL for now.

-Gordon


Mon Mar 06, 2023 8:45 pm

Joined: Sun Mar 27, 2022 12:11 am
Posts: 40
Sorry for the late replies, you guys have given me a lot to think about.
oldben wrote:
A RISCy design, if I say so. Not having byte addressing gives you ample space for a GUI with bank select.
20 or 32 bits is the only way to expand while keeping the data path simple.
The Xerox Alto is a good example of a TTL computer with a GUI. I like the portrait display, since it is about 600x800.
(None of these cheap TV displays like the IBM PC.)
https://archive.org/details/byte-magazi ... ew=theater

No byte access has a relatively large performance cost when dealing with text and pixels: lots of shifts, masking, and read-modify-writes.
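The cost is easy to see modelled in C: on a word-addressed machine, every byte store turns into a read-modify-write of the containing word (the byte order within the word is an arbitrary choice for this sketch):

```c
#include <stdint.h>

/* Byte access on a word-addressed 16-bit machine.  A byte load needs a
 * shift and mask; a byte store needs a full read-modify-write. */

uint8_t load_byte(const uint16_t *mem, uint32_t byte_addr)
{
    uint16_t w = mem[byte_addr >> 1];               /* word fetch     */
    return (byte_addr & 1) ? (uint8_t)(w >> 8)      /* odd: high byte */
                           : (uint8_t)(w & 0xFF);   /* even: low byte */
}

void store_byte(uint16_t *mem, uint32_t byte_addr, uint8_t v)
{
    uint16_t w = mem[byte_addr >> 1];               /* read...        */
    if (byte_addr & 1)
        w = (uint16_t)((w & 0x00FF) | (v << 8));    /* ...modify...   */
    else
        w = (uint16_t)((w & 0xFF00) | v);
    mem[byte_addr >> 1] = w;                        /* ...write       */
}
```

On a byte-addressed machine each of these is a single memory operation; here the store is two memory cycles plus ALU work, which is exactly where text and pixel code bleeds performance.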

Funny you should bring up the Alto. What I'm doing is kind of an accidental modern take on the Alto. It may be a bit RISCy but my design cuts out the microcode middleman and exposes everything to the programmer.

That article brings up some things I'd forgotten about the Alto, like the fact that it is effectively a VM. I've had a video demoing the Alto queued up on YouTube for a while; I might watch it tonight to get an idea of what it is capable of.

drogon wrote:
The BBC Micro (6502, 32KB RAM) could run a 16-bit BCPL compiler (from tape, if you really wanted to stick to that 64KB limit). The Apple II with 64KB RAM could run the Aztec C shell and C compiler. Many Z80 CP/M systems could run C and FORTRAN (and COBOL?) compilers. Older systems? A PDP-8 with 32K of core can run an OS and a FORTRAN compiler...

But you hit the same issues I've hit, in that trying to run a modern compiler is nigh on impossible. There are some C-like systems out there that run on the ATmega328p (Arduino) though - Bitlash is one - so if you had (up to) 64K Flash and 64K RAM, Harvard style, then something a bit more sophisticated ought to be possible.

Hope it goes well.

-Gordon

I've looked into the CP/M compilers a little bit, though I have not actually used them yet. Building a Z80 emulator for CP/M is on my to-do list.

My issue is that to get everything to fit you need to cut corners somewhere; with only 64k, those cuts will be everywhere. These kinds of creative constraints are a lot of fun for the artsy stuff, like music and games, but they lead to pain when it comes to tooling.

drogon wrote:
Adding another data point here - my systems went from a "re-creation" of a system I used in the 80s (8-bit 6502 CPU with a 16-bit VM to run BCPL) to a hybrid 8/16-bit CPU (65C816) running a 32-bit VM, letting me write an OS that can run and compile BCPL programs. While not fast, it's very usable - there are no on-board graphics, so I'm never going to write some high-speed shooter, but it would be more than fast enough for an Invaders/Tetris-style game or some sort of CAD application. The base CPU runs at 16MHz.

Simple demo here:

https://youtu.be/ZL1VI8ezgYc

-Gordon

That is very cool!

Performance concerns are probably a bit overblown on my part. I grew up with an 8-bit Atari computer with probably the slowest of all BASICs. I don't think it even managed to interpret a single line per frame, yet it was still very usable. It could have done with some more useful graphics functions though - something that actually used the hardware the computer had...

BigEd wrote:
I like the idea of a 32-bit VM - that can make use of lots of banked memory, presumably, without the application needing to know or care.

George Foot's recent explorations of a SuperShadow board are very interesting too: offering a 64k application space and a 64k OS space. It's a different kind of division than I and D. It comes out, AFAICT, a lot cleaner than any other banking scheme for the 6502, and very minimal in hardware. (I don't know anything about how the larger-RAM Z80 systems went about their business.)

All that said, adventures in 32-bit or 24-bit computing are entirely valid! Or 16-bit address spaces with 16-bit words - that gives you twice as much.

A VM seems like the sensible approach; it might also help with the software-sharing problem homebrew computers have.

It's highly likely the 32-bit-wide instructions will be kept. In addition, adding an 8-bit extension to the register file to store the bank address is possible.

Another option might be to add another read port to the register file, to concatenate a 32-bit address from two 16-bit registers. Extending the ALU to support 20-24-bit addition (for addresses only) isn't too expensive. This would allow load/stores with a 24-bit base plus or minus a 16-bit offset.
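Sketching that addressing mode in C (register names are invented): the effective address would be the 8-bit bank register concatenated with a 16-bit base, plus a signed 16-bit displacement, wrapped to 24 bits:

```c
#include <stdint.h>

/* Hypothetical effective-address calculation for a 24-bit load/store:
 * rBank (8 bits) concatenated with rBase (16 bits), plus a signed
 * 16-bit displacement, using a 24-bit-wide adder. */

uint32_t ea_24(uint8_t bank, uint16_t base, int16_t disp)
{
    uint32_t addr = ((uint32_t)bank << 16) | base;       /* rBank:rBase */
    return (addr + (uint32_t)(int32_t)disp) & 0xFFFFFFu; /* 24-bit wrap */
}
```

The appeal of this shape is that only the address path widens: the register file and data ALU stay 16 bits, and the 24-bit adder exists just for effective-address formation.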

The problem with going from 16 bits to 32 bits is that the chip count doesn't just double. For example, a 16-bit barrel shifter is roughly 24 chips, while the 32-bit version is 56 chips. The same goes for load/store alignment and the zero-detection circuit.
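The chip count tracks the shifter's mux-stage structure: a 16-bit barrel shifter is four rows of 2:1 muxes (shift by 1, 2, 4, 8), while a 32-bit one needs five rows, each twice as wide. This little model mimics that staging (assuming a logical right shift):

```c
#include <stdint.h>

/* Model of a 16-bit logical-right barrel shifter as four mux stages,
 * one per bit of the shift amount (shift by 1, 2, 4, 8).  In TTL each
 * stage is a row of 2:1 muxes across the full word width, which is
 * where the rough 24-chip count comes from; a 32-bit shifter needs
 * five stages of double width, hence the super-linear growth. */

uint16_t bshift_r16(uint16_t v, unsigned amount)
{
    for (unsigned stage = 0; stage < 4; stage++)   /* stages 1,2,4,8 */
        if (amount & (1u << stage))
            v >>= (1u << stage);
    return v;
}
```

Counting mux-bits makes the scaling visible: 4 stages x 16 bits vs. 5 stages x 32 bits is a 2.5x growth, which lines up with the 24-to-56 chip jump quoted above.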

Not that I want to discourage anyone from pursuing a 32-bit TTL computer - it's just that for my design goals it gets a bit expensive.

mmruzek wrote:
I can totally relate to this discussion. My first project was the TTL-Retro. It was a sort-of 12-bit TTL stack computer, but the memory addressing was really a kludge.

http://www.mtmscientific.com/stack.html

Now the 16-bit addressing of a 64K flat memory space, evenly split between ROM and SRAM, feels downright luxurious on my current project, the LALU computer. Still, both of these systems have gravitated to an interpreted language using RPN and stacks. Getting to something like C seems like a real challenge, but I keep looking.

Probably you have seen this discussion over at Hackaday, but it is still an interesting read.

https://hackaday.com/2015/07/31/build-y ... easy-part/

Nice! I'm always surprised how simple hardware decoders can be.

That article is very true. I'm very much in the camp of fitting the computer to my needs. That said, being the only person making software for your computer is a bit of a problem - this is actually what drove me to look at VMs and emulators in the first place.

There are a few retro VMs out there, like CHIP-8, the various dialects of BASIC, and the Infocom Z-machine, that should be portable to almost anything. There are also some more modern examples, like the minimalist Forth-inspired VM named Uxn: https://100r.co/site/uxn.html

drogon wrote:
drogon wrote:
BigEd wrote:
I like the idea of a 32-bit VM - that can make use of lots of banked memory, presumably, without the application needing to know or care.


Well, me too... which is what I did on the 65C816 - so could it be done on other systems? I'm sure it could, and for a while I flirted with the idea of it on the Commander X16 project, which has an interesting banked RAM system.

The trade-off is, as usual, speed - but trading speed for the flexibility of an easier-to-use high-level language? Worth it in my books. An 8-bit CPU with 64K of "ROM" and 64K (banks) of RAM? It should be very capable, especially if the CPU is in software (FPGA) and might be tweakable to help run the VM...


Following up on myself, after some pondering over the past few hours: one thought (and I'm sure many others have had something along the same lines) might be to have a blisteringly fast but relatively simple CPU - one designed to do no more than run a bytecode interpreter. I think the term "millicode" has been coined for this in the past, and I heard some RISC-V people talking about something similar a while back. It might be an interesting step from actual microcode to a high-level/general-purpose CPU...

Also, thinking about the number of distinct instructions I used on both the 65C816 and RISC-V to implement my bytecode VM - it's less than half the instruction set on the '816. It's hard to gauge on RISC-V because of the registers, but there is scope for an even more reduced instruction set CPU to execute it...

-Gordon

An instruction scratchpad or a ROM (some FPGAs support preloading block RAM from the bitstream) would be great for a smaller FPGA, where it's a bit of a challenge to fit any decent-sized cache. That would free up a huge amount of external-memory bandwidth for things like graphics and sound.

Have you profiled the RISC-V version of your VM? I'm curious about the average instruction count per bytecode executed.


Tue Mar 07, 2023 6:53 am

Joined: Sun Oct 14, 2018 5:05 pm
Posts: 59
DockLazy wrote:
That is very cool!


Thanks.

DockLazy wrote:
Have you profiled the RISC-V version of your VM? I'm curious about the average instruction count per bytecode executed.


Not exactly but I'll give a couple of examples:

This:
https://unicorn.drogon.net/nextOpcode.txt

Is the bytecode instruction fetch and dispatch, side by side on both CPUs.

This code is executed for every single bytecode instruction.

The '816 version is at best 29 clock cycles and at worst 37. Cycles are "wasted" because I have to fetch a 16-bit value from RAM and then mask off the top 8 bits before I can use the code as an index into the jump table.

The RISC-V version is much simpler, as that's not needed. The shift is *2 on the '816 but *4 on the RV, but that's fine if the RV has a barrel shifter (1 cycle each). If we naively assumed 1 cycle per instruction on the RV side then it's 6 cycles vs. 29; even at 2 cycles per instruction, 24 vs. 29.

Opcode execution is where there are real advantages - take ADD for example:

'816 version:

Code:
.proc   ccADD
        .a16
        .i16

        clc
        lda     regB+0
        adc     regA+0
        sta     regA+0
        lda     regB+2
        adc     regA+2
        sta     regA+2
        nextOpcode
.endproc


vs. RISC-V:

Code:
ccADD:
        add     regA,regB,regA
        nextOpcode


The RISC-V model keeps everything in registers (absolutely everything - the VM requires no RAM to run!), while the '816 version is dealing with multi-word adds to add the full 32-bit values together. Even using direct (zero) page on the '816, it's still many more cycles and more bytes of code: just 4 bytes on the RV (one instruction fetch) vs. 13 bytes on the '816 - 7 instruction fetches requiring 20 clock cycles (not counting the 'nextOpcode' part, which is a macro expansion of the code above).

My crude estimation is that the RV version is some 5 times faster, clock cycle for clock cycle (on an ESP32-C3 CPU). Some operations are obviously going to be much faster - MUL, DIV, MOD for example - and oftentimes the '816 has to deal with taking the banks of 64KB and making that all transparent to the BCPL program via the VM (literally any time a data value is loaded or stored).

Other big speed-ups come when loading values into the registers. The VM has a 2-register stack, so loading a value pushes register A into register B - on the '816 this is 4 instructions to copy 4 bytes (load/store 16-bit values twice); on the RV it's one instruction to copy regA to regB. (And of course there's the added headache on the '816 of fetching the value to be pushed into regA from somewhere in that 64KB segmented memory system...)
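The 2-register stack push described there, written out in C for clarity (the structure and names are mine, not the actual VM's):

```c
#include <stdint.h>

/* The VM's 2-register stack: loading a value pushes A into B.  On the
 * '816 this costs four 16-bit load/stores; on RISC-V, with both slots
 * held in machine registers, it is a single register move plus the
 * load of the new value. */

typedef struct { uint32_t a, b; } vmstack_t;

void vm_push(vmstack_t *r, uint32_t v)
{
    r->b = r->a;   /* old top slides into the second slot */
    r->a = v;      /* new value becomes the top of stack  */
}
```

The asymmetry is the whole story: the operation is identical at the VM level, but whether the two slots live in RAM or in CPU registers decides whether it costs one cycle or twenty.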

I must find the energy to get back into this, though. The port to RV was a nice challenge and actually turned out to be quite easy (after I'd written the RV emulator in BCPL on my '816 system - I've written about that elsewhere!), as was making it run on real hardware (ESP32-C3). What it lacks on real hardware, however, is a filing system, so that's on the to-do list - right now it gets as far as booting, printing "Hello, world", then halting.

Cheers,

-Gordon


Tue Mar 07, 2023 9:28 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
Rather than using a barrel shifter, I just loop on the shift until the counter is done. Here I traded
slower shifts for opcode space for byte, short and long data types. The CISC format works best
for me, as I have just enough room for 8 registers, like the PDP-11.
I just have to test the long instructions (3 16-bit words) and my 48-bit TTL design is done.
The main advantage is that I now have 48-bit software floating point, the same width as long ints.
A short is 16 bits, and bytes are 8 bits. I did not expect to need 48 bits, but with all the new features,
like index registers, byte addressing, and more than 64KB of RAM, it grew bigger than the 18-bit
machine I thought was a huge computer in the 1980s, compared to a PDP-8.
Ben.


Tue Mar 07, 2023 10:18 am

Joined: Sun Mar 27, 2022 12:11 am
Posts: 40
drogon wrote:
The RISC-V version is much simpler, as that's not needed. The shift is *2 on the '816 but *4 on the RV, but that's fine if the RV has a barrel shifter (1 cycle each). If we naively assumed 1 cycle per instruction on the RV side then it's 6 cycles vs. 29; even at 2 cycles per instruction, 24 vs. 29. -snip-

So clock for clock, RISC-V is interpreting (simple) bytecodes roughly as fast as a 68k or 8086 can execute native instructions! I suppose I shouldn't really be surprised - that was kind of the point of the move to RISC.

Looking at the RISC-V code you posted, ignoring the 32-bit registers for a second but assuming a 16-bit ALU: my computer would have the same instruction count except for "inc regPC". That would be 3 instructions: low add, carry check, and add the carry result to the high half. A full 32-bit add would be 4 instructions.
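That sequence - low add, carry check, add the carry into the high half - can be modelled in C over 16-bit halves:

```c
#include <stdint.h>

/* A 32-bit add built from 16-bit ALU operations, mirroring the "low
 * add, carry check, add CC result to high" instruction sequence. */

uint32_t add32_via16(uint32_t x, uint32_t y)
{
    uint16_t xl = (uint16_t)x, yl = (uint16_t)y;
    uint16_t lo = (uint16_t)(xl + yl);              /* low add       */
    uint16_t carry = lo < xl;                       /* carry check   */
    uint16_t hi = (uint16_t)((uint16_t)(x >> 16)
                           + (uint16_t)(y >> 16)
                           + carry);                /* high + carry  */
    return ((uint32_t)hi << 16) | lo;
}
```

The carry check exploits unsigned wraparound: the low-half sum is smaller than either operand exactly when a carry out of bit 15 occurred, which is what an add-with-carry instruction would capture in the flags.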


Wed Mar 08, 2023 7:28 am