 [ 36 posts ]  Go to page Previous  1, 2, 3
 Tugman 18-bit CPU 
Author Message

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2219
Location: Canada
Quote:
Amazingly, no extra resources used (after some reshuffling of the condition muxes), and fMax edged up to 53.811.

I find FPGAs can be a bit mysterious on logic use. Sometimes adding code decreases the amount of logic the FPGA uses. It is a case of 'try-it-and-see'.

Quote:
I started with a J1, a much less capable CPU, and actually reduced size, increased fMax, and added a ****ton of instructions and capabilities!
The J1 is an excellent starting place. I did some experimentation with it too.

Stack machines are great in a lot of ways, but are not the best performers. They chew up cycles pushing and popping values to the stack. A pipelined RISC machine would likely have better performance at the same clock rate.

Quote:
Software and OS design has a big impact on a design layout, so I hope your design can handle bigger software projects.

Yes, what software one wants to run makes a difference. The address space is always expanding. A modern OS needs some sort of MMU.

_________________
Robert Finch http://www.finitron.ca


Sun Sep 22, 2024 6:30 am WWW

Joined: Tue Sep 03, 2024 6:20 pm
Posts: 26
Rewrote the stacks... I like them more so far.

Some minor changes: instead of manual control of write-to-stack plus a two-bit incrementor, I implemented a higher-level control: the two bits now stand for nothing, push, pop, and write. On the return stack, the last one increments the TOR. Writes are implied on push and write (or increment/write).

Back to 497/55.37 MHz, and everything is working.

Oldben, your system sounds interesting! This CPU was never meant for a large system - it maxes out at 8K words. In fact, I want to get it to work with a single 1Kx18 BRAM as a main use case - as an embedded controller inside the FPGA. The small footprint combined with a fast build (still under 15 seconds) makes it pretty useful, and I can see having several of these if needed.

I do have some ideas about how to make a much larger system using mostly the same computational platform (and a radically different instruction sequencer/jump logic). But that's for later.

The CPU has pretty good instruction density -- because you can do so many things at once. I think it can give a RISC CPU a good run for its money. And if you ever look at the code C compilers generate upon subroutine entry and exit, and to set up parameters for function calls, it is hardly better than stack machines with carefully crafted code! And then there is the near-zero overhead for interrupts, and zero-overhead returns from calls! And instant task switching, and co-routines -- there is no state! I think people gave up on stack machines too early.

In the meantime, I have to get it to work with a smaller memory and maybe clean up the I/O, which is a bit messy.


Sun Sep 22, 2024 7:45 am

Joined: Tue Sep 03, 2024 6:20 pm
Posts: 26
Here is a snippet that copies n words of memory from source to destination:

Code:
;============================================================================
; memcpy (cnt-1,dst,src)
;                                DATASTACK        RSTACK
    push                       ;(--cnt,dst)       (--src)
.loop:                       
    op OP_B,B_TOS,TORI         ; --                 --src+       issue read,++
    op OP_B,B_MEM,RPUSH        ;(cnt,dst--cnt,val)  --src+,dst
    op OP_B,B_NOS,DPOP,WM,TORI ;(cnt,val--cnt)      --src+,dst+  write,drop,inc
    op OP_ADD,B_N1             ;(cnt--cnt-1)     
    pop                        ;(--cnt,dst)         src,dst--src
    jnc .loop


That is pretty impressive for a stack machine. Just look at how much is going on here!

In fact, the first instruction in the loop is there just to increment the source on the RSTACK: after the pop, src is in TOR, a read is issued during the jump, and the value is available at loop entry... But I read it again, for the damn TORI increment. In the next instruction I have to push the destination onto the RSTACK, and cannot increment src...

But even so, a classic RISC machine, especially one without autoincrement, is no better... You have to load, increment, store, increment, update the count, and test/jump. Six things.
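
To make the six-step count concrete, here is a small Python sketch that performs the copy the way a RISC machine without autoincrement would, tallying one step per operation. The function name and the step-counting convention are illustrative assumptions, not a model of any particular RISC ISA.

```python
def risc_style_copy(mem, src, dst, count):
    """Copy `count` words inside `mem`, counting one step per RISC-like operation."""
    steps = 0
    while count > 0:
        val = mem[src]      # 1. load
        steps += 1
        src += 1            # 2. increment source pointer
        steps += 1
        mem[dst] = val      # 3. store
        steps += 1
        dst += 1            # 4. increment destination pointer
        steps += 1
        count -= 1          # 5. update the counter
        steps += 1
        steps += 1          # 6. test and jump back, tallied once per copied word
    return steps


memory = list(range(8)) + [0] * 8
total = risc_style_copy(memory, src=0, dst=8, count=8)
# Six steps per word copied, matching the six-instruction loop above.
```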

Not that you want to write code like this all the time -- more obvious (and verbose) Forth-like instructions are available. But it's nice to be able to optimize a tight loop every now and then.

I've been running at 55 MHz without problems for a while now. Running out of a single BRAM (1K words), less than 6% of the resources. The system without the UARTs is actually under 400 LUTs, something like a couple of hundred Xilinx Spartan-3 slices. It is the smallest CPU I've worked with (maybe PicoBlaze is smaller, but it is really specialized, has no memory, and is more like an old PIC). I much prefer Forth in hardware.


Sun Sep 22, 2024 2:11 pm

Joined: Mon Oct 07, 2019 2:41 am
Posts: 679
enso1 wrote:
I've been running at 55 MHz without problems for a while now. Running out of a single BRAM (1K words), less than 6% of the resources. The system without the UARTs is actually under 400 LUTs, something like a couple of hundred Xilinx Spartan-3 slices. It is the smallest CPU I've worked with (maybe PicoBlaze is smaller, but it is really specialized, has no memory, and is more like an old PIC). I much prefer Forth in hardware.

I looked up the PicoBlaze. It dates back to the time when, with a little floor planning, you could do wonders, and your logic did not get messed up the way it does today. You also had features that worked well with a little planning, like internal tristate buffers and a free 16x1 register file.


Mon Sep 23, 2024 6:12 am

Joined: Tue Sep 03, 2024 6:20 pm
Posts: 26
Yes, KCPSM was my early inspiration, and I got really good at working with gates. Verilog looks like C, and it's easy to forget what you are doing and write tons of crap that sort of works, then get a bigger FPGA. Long ago I built some tools for instantiating circuits directly, wiring them together, and constructing larger circuits (and placing them manually, like a jigsaw puzzle)... Personally, I think small FPGAs are great -- and builds are fast. I am at 6% of this $20 FPGA, and I can rebuild the system in under 15 seconds.

Speaking of minimalism, I am probably yak shaving at the edge of the tarpit, but it's just so much fun I can't stop.

I've been a bit annoyed with memory access on stack machines -- writing a simple memory copier is a serious puzzle of rotating variables around. I've already been playing with TOR (top of return stack) addressing memory and being capable of autoincrementing, which cost me about 15% of the CPU for extra muxes and another incrementor... But it made a huge difference -- a six-cycle memcopy loop is nothing to sneeze at...

But I think I have a much better idea: I have a free instruction slot for a return-stack operation, which I am currently using for increment-in-place. If I make it a swap with increment, I can keep two pointers alternating on the return stack, and eliminate a big mux as there is not much going up above TOR... Check this out:
Code:
;============================================================================
; memcpy(cnt,src,dst--)
;
       push                        ;(--cnt,src   --dst
       push                        ;(--cnt       --dst,src
       jz  .done
.loop: op  OP_B,B_MEM,DPUSH,RSWP   ;(--cnt,val   --src+1,dst
       op  OP_B,B_NOS,DPOP,WM,RSWP ;(--cnt       --dst+1,src+1
       op  OP_ADD,B_N1             ;(--cnt-1                    read issued
       jnz .loop                   ;


Yes, that is a 4-cycle memory-copy loop, with a counter. That's crazy.
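
For readers who want to trace the alternating-pointer trick, here is a behavioural Python sketch of that loop. It assumes RSWP means "increment TOR, then swap it with the entry below", which is what the stack comments in the listing suggest; the function names and the flat cycle count of four per iteration are illustrative assumptions, not taken from the actual hardware.

```python
def rswp(rstack):
    """Assumed RSWP semantics: increment TOR, then swap it with the entry below."""
    rstack[-1] += 1
    rstack[-1], rstack[-2] = rstack[-2], rstack[-1]


def memcpy_rswp(mem, cnt, src, dst):
    """Copy `cnt` words using two pointers that alternate on the return stack."""
    rstack = [dst, src]              # after the two pushes: src is TOR, dst below
    cycles = 0
    while cnt != 0:                  # stands in for jz .done / jnz .loop
        val = mem[rstack[-1]]        # read through TOR (src)
        rswp(rstack)                 # now src+1 is below and dst is TOR
        mem[rstack[-1]] = val        # write through TOR (dst)
        rswp(rstack)                 # now dst+1 is below and src+1 is TOR
        cnt -= 1                     # OP_ADD with B_N1 adds -1 to the counter
        cycles += 4                  # three ops plus the jump
    return cycles


mem = [10, 11, 12, 13, 0, 0, 0, 0]
cycles = memcpy_rswp(mem, cnt=4, src=0, dst=4)
```

Each pointer is incremented at the moment it leaves the top of the return stack, which is why no separate increment instruction appears in the loop.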

It's not just a fluke... Here is my program loader, receiving a number of words and storing them in memory:
Code:
.loop: jsr     rx32bits               ;--val           --cnt+,dst
       op      OP_B,B_TOS, WM,RSWP    ;--val           --dst+,cnt
       op      OP_B,B_TOR, RSWP       ;--cnt           --cnt+,dst
       jmid    .loop                  ;--


The loop counter needs to be negated as it's counting up, and we loop while it's negative... But man, I am really on fire here.
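
The negated up-counter can be sketched in Python as follows. The 18-bit width comes from the thread title; the function name, the use of a Python iterable in place of the receive routine, and placing the sign test at the top of the loop are simplifying assumptions.

```python
WIDTH = 18
MASK = (1 << WIDTH) - 1          # 18-bit word mask
SIGN = 1 << (WIDTH - 1)          # sign bit of an 18-bit two's-complement word


def load_words(words_in, count, dst, mem):
    """Store `count` received words at mem[dst...], counting up from -count."""
    cnt = (-count) & MASK        # negated counter, as an 18-bit value
    source = iter(words_in)      # stands in for the receive subroutine
    while cnt & SIGN:            # keep looping while the counter is negative
        mem[dst] = next(source) & MASK
        dst += 1                 # destination pointer autoincrements
        cnt = (cnt + 1) & MASK   # counter autoincrements toward zero
    return dst


ram = [0] * 8
end = load_words([1, 2, 3], count=3, dst=2, mem=ram)
```

Because both the pointer and the counter only ever increment, the same increment hardware serves both, and the loop condition reduces to a sign-bit test.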

I will now try it, and hopefully it will not cause bloat. I am hoping for the opposite.


Mon Sep 23, 2024 7:28 am

Joined: Tue Sep 03, 2024 6:20 pm
Posts: 26
I actually implemented it, and need to think about it for a bit. While trying to convince myself it's a good idea, the code looked great, swapping the return stack just in time. After doing it and trying it out, somehow it always swapped at the wrong time, and I had to waste cycles syncing it back up, with convoluted code...

Also it costs a lot -- an additional read/write port is necessary to do a swap. These Gowin devices do not support dual-ported register arrays, so the whole thing costs 50-100 cells just to do a swap.

An incrementor on TOR, on the other hand, can be smaller, and is very useful and understandable...

So I am leaning to leaving the incrementor...


Fri Sep 27, 2024 6:22 pm