 [ 9 posts ] 
 Seeking the smallest 68000 implementation... 

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1799
I've been experimenting with the BlackIce FPGA board (or, to be more accurate, Dave [Hoglet] has been forging ahead and I've been trying to keep up with progress.)

Dave has implemented a series of 8-bit retrocomputers on the board, based on 6502 and on Z80. In the case of his Acorn Atom model, he also managed to fit in a SID, which is a great deal bigger than the rest of the machine because it needs multipliers for the signal mixing, and the Lattice FPGA doesn't have multipliers as such, so they use lots of resources.

Yesterday at the CCH in Cambridge, Revaldinho and I were chatting with Ken Boak, who was demoing BlackIce, and Ken wondered whether a 68000 model could fit into the Lattice FPGA. A good question. I suspect it would be a challenge. Ken wondered if the 68k could be emulated by a simpler engine. Perhaps a picocoded machine to run the nanocode which underlies the 68000's microcode...

Anyone have any ideas, or experience with small 68000 implementations?



Mon Sep 18, 2017 11:13 am

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
To the best of my understanding, the ROM-based controls in the 68000 instruction sequencer can best be understood as a width constrained ROM micro-program, i.e. micro-code, and another ROM used to expand the encoded control fields in the micro-program ROM, i.e. nano-code. I can't see a need to expand the definition to include a third level of micro-program ROM, i.e. pico-code.

By allowing the micro-code to use a vertical format, which uses a second-level ROM to expand the encoded control fields, Motorola was able to make substantial savings in the total ROM needed to implement the 68000. If you examine the two halves of the micro-program of my M65C02/M65C02A, you will see that both ROMs contain a lot of redundant data. This is due to the nature of the ROM-based control stores used in their implementation. The use of a PLA in the implementation of the actual 6502/65C02 control sequencer allows for a substantially more efficient implementation than the ROM-based micro-programmable approach that I used for these two cores.
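To make the ROM saving concrete, here is a minimal Python sketch of a two-level control store. It is a toy model, not the 68000's actual layout: the sizes (512 micro-instructions, 64-bit control words, 100 distinct patterns) are made up, and it ignores the next-address and sequencing fields that a real micro-word also carries. The names `split_control_store` and `rom_bits` are hypothetical.

```python
# Toy model of a two-level (micro-code + nano-code) control store.
# All sizes and control-word contents are illustrative assumptions.

def split_control_store(wide_words):
    """Split wide control words into a table of unique words (nano)
    and a list of narrow indices into that table (micro)."""
    nano = []        # unique wide control words
    index_of = {}    # control word -> position in nano
    micro = []       # one narrow index per micro-instruction
    for word in wide_words:
        if word not in index_of:
            index_of[word] = len(nano)
            nano.append(word)
        micro.append(index_of[word])
    return micro, nano


def rom_bits(micro, nano, wide_width):
    """Return (flat_bits, two_level_bits) for the two organisations."""
    index_width = max(1, (len(nano) - 1).bit_length())
    flat = len(micro) * wide_width
    two_level = len(micro) * index_width + len(nano) * wide_width
    return flat, two_level


# Hypothetical program: 512 steps, but only 100 distinct control patterns.
WIDTH = 64
program = [(i * 37) % 100 for i in range(512)]

micro, nano = split_control_store(program)
flat, two_level = rom_bits(micro, nano, WIDTH)
print(f"flat ROM: {flat} bits, two-level ROM: {two_level} bits")
```

The more often the same wide control pattern is reused, the more the second-level table pays for itself; with little reuse the index ROM is pure overhead.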

I am of the opinion that the additional complexity of the 68000 drove Motorola into using a ROM-based micro-programmed implementation. I suspect that the additional complexity of the 68000 ISA would have resulted in a PLA that was too large (i.e. wide) to provide the desired operating speed. In order to reduce the total chip area dedicated to the control sequencer, a pipelined vertical micro-program was a good compromise.

In the case of my soft-core implementations, I found myself needing to use a PLA, but that is a logic structure not available in FPGAs except as a discrete logic implementation. As many others have demonstrated, a micro-programmable approach to implementing the 6502/65C02 architecture is not required. The resulting control logic and state machine are very manageable using modern tool sets, and I suspect that the approach could be used with a modern re-implementation of the 68000. A micro-programmable approach may be easier to debug and update, but will suffer from the speed limitations of the row-column decoder/multiplexer needed to implement the block RAM structures in modern FPGAs. In the final analysis, there is a fundamental speed limit for my soft-cores: the block RAMs used for the micro-program ROMs are an order of magnitude (or more) slower than the LUTs in the FPGAs.

_________________
Michael A.


Mon Sep 18, 2017 1:33 pm

Joined: Tue Feb 10, 2015 7:07 am
Posts: 52
Ed,

Here's one concrete data point: the TG68K core in the Matchbox.
Code:
Slice Logic Utilization:
 Number of Slice Registers:             833  out of  11440     7%
 Number of Slice LUTs:                 3235  out of   5720    56%
    Number used as Logic:              3163  out of   5720    55%
    Number used as Memory:               72  out of   1440     5%
       Number used as RAM:               72

These numbers are for the whole design, including the Tube, so you can probably knock 10% off.

Also, note the LUTs are 6-input.

For reference, the iCE40HX8K has 7,680 4-input LUTs.

Unfortunately the TG68 core is VHDL, so you can't easily run it through IceStorm. Translation would be possible using vhd2vl with some manual intervention.

But I would start by trying the ao68000 core, which is Verilog.
https://github.com/alfikpl/ao68000

According to the spec
Quote:
Uses about 4810 LE on Altera Cyclone II and about 45600 bits of RAM for microcode.

A Cyclone II LE contains a 4-input LUT and a register, so this should fit in the iCE40HX8K....

Looking at the Verilog, you'll need to replace the ALTSYNCRAM ram blocks with simple behavioural equivalents.
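As a rough guide to what a "simple behavioural equivalent" has to reproduce, here is a small Python model of a single-port RAM with a registered read, which is the style the iCE40 block RAMs support. This is a sketch under assumptions: the port names (`addr`, `wren`, `data`, `q`) are illustrative, and the read-before-write behaviour shown is one possible choice; the actual ALTSYNCRAM parameters used in ao68000 would need checking before writing the Verilog.

```python
class SyncRam:
    """Single-port RAM with a registered read (one-cycle read latency).

    Assumption: on a simultaneous read and write to the same address,
    the read returns the OLD contents (read-before-write).
    """

    def __init__(self, depth, width, init=None):
        self.mask = (1 << width) - 1
        self.mem = [0] * depth
        if init:
            # Optional initial contents, e.g. a microcode image.
            for addr, value in enumerate(init):
                self.mem[addr] = value & self.mask
        self.q = 0  # registered read output

    def clock(self, addr, wren=False, data=0):
        """Model one rising clock edge; return q as seen after the edge."""
        self.q = self.mem[addr]          # read happens first
        if wren:
            self.mem[addr] = data & self.mask
        return self.q


ram = SyncRam(depth=16, width=8, init=[1, 2, 3])
first = ram.clock(1)                         # plain read
during = ram.clock(1, wren=True, data=0x1FF) # write, old data still read
after = ram.clock(1)                         # new data, masked to 8 bits
print(first, during, after)
```

The point of the model is the timing: the value appears on `q` only after a clock edge, so any logic that expected an asynchronous read would need restructuring.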

Dave


Mon Sep 18, 2017 5:53 pm

Joined: Tue Feb 10, 2015 7:07 am
Posts: 52
Hmmm, the ao68000 also contains a 17x17 signed multiplier.

Dave


Mon Sep 18, 2017 6:22 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1799
Cheers Dave! Interesting, so the 68k might not be such a monster, modulo the big multiplier.


(Oh but also Lattice have no distributed RAM so the register file might be an issue.)


Mon Sep 18, 2017 7:24 pm

Joined: Tue Feb 10, 2015 7:07 am
Posts: 52
BigEd wrote:
Cheers Dave! Interesting, so the 68k might not be such a monster, modulo the big multiplier.

According to this:
Quote:
The multiplication algorithm implemented requires 38+2n clocks, where n is defined as:
MULU: n = the number of ones in the <ea>
MULS: n = concatenate the <ea> with a zero as the LSB; n is the resultant number of 10 or 01 patterns in the 17-bit source; i.e., worst case happens when the source is $5555

So I guess the original uses something like shift-and-add, with a 16-bit adder?
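The quoted timing rule is easy to check numerically. The Python sketch below computes the 38+2n clock counts exactly as the quote defines n, and pairs them with a plain shift-and-add multiplier that counts its additions. That the original hardware works this way is only the guess made above, not something the quote states; the function names are hypothetical. The 10/01 pattern count for MULS would also be consistent with a Booth-style recoding, but that is an inference too.

```python
def mulu_clocks(src):
    """38 + 2n, where n is the number of ones in the 16-bit source."""
    n = bin(src & 0xFFFF).count("1")
    return 38 + 2 * n


def muls_clocks(src):
    """38 + 2n, where n is the number of 10 or 01 patterns in the
    17-bit value formed by appending a zero LSB to the source."""
    ext = (src & 0xFFFF) << 1                  # 17-bit value, LSB = 0
    differing = (ext ^ (ext >> 1)) & 0xFFFF    # 16 adjacent bit pairs
    return 38 + 2 * bin(differing).count("1")


def shift_add_mulu(a, b):
    """Unsigned 16x16 -> 32-bit multiply by shift-and-add.

    Returns (product, number_of_additions); one addition is done
    per set bit of b, matching n in the MULU rule.
    """
    a &= 0xFFFF
    b &= 0xFFFF
    product = 0
    adds = 0
    for bit in range(16):
        if (b >> bit) & 1:
            product += a << bit
            adds += 1
    return product, adds


for src in (0x0000, 0x5555, 0xFFFF):
    print(f"${src:04X}: MULU {mulu_clocks(src)} clocks, "
          f"MULS {muls_clocks(src)} clocks")
```

Under this reading, the fixed 38 clocks cover the 16 shift steps plus overhead, and each extra addition costs 2 clocks.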
BigEd wrote:
(Oh but also Lattice have no distributed RAM so the register file might be an issue.)

There should be enough block RAM.

Here are some real results for you:
Code:
seed: 1
device: 8k
read_chipdb +/share/arachne-pnr/chipdb-8k.bin...
  supported packages: cb132, cb132:4k, cm121, cm121:4k, cm225, cm225:4k, cm81, cm81:4k, ct256, tq144:4k
read_blif test.blif...
prune...
read_pcf blackice.pcf...
instantiate_io...
pack...

After packing:
IOs          122 / 167
GBs          0 / 8
  GB_IOs     0 / 8
LCs          7191 / 7680
  DFF        832
  CARRY      826
  CARRY, DFF 26
  DFF PASS   220
  CARRY PASS 59
BRAMs        15 / 32
WARMBOOTs    0 / 1
PLLs         0 / 2

place_constraints...
promote_globals...
  promoted CLK_I$2, 877 / 877
  promoted $abc$48572$n4696, 665 / 665
  promoted $abc$48572$n3479, 80 / 80
  promoted $abc$48572$n3805, 49 / 49
  promoted $abc$48572$n2814, 37 / 37
  promoted $abc$48572$n3770, 37 / 37
  promoted $abc$48572$n256, 87 / 87
  promoted 7 nets
    2 sr/we
    4 cen/wclke
    1 clk
  7 globals
    2 sr/we
    4 cen/wclke
    1 clk
realize_constants...
  realized 0, 1
place...
  initial wire length = 131136
  at iteration #50: temp = 12.2702, wire length = 138329
  at iteration #100: temp = 6.29881, wire length = 93060
  at iteration #150: temp = 3.23344, wire length = 60879
  at iteration #200: temp = 1.35197, wire length = 42932
  at iteration #250: temp = 0.00812532, wire length = 32436
  at iteration #300: temp = 1.44961e-07, wire length = 32075
  final wire length = 32068

After placement:
PIOs       78 / 167
PLBs       951 / 960
BRAMs      15 / 32

  place time 149.96s
route...
  pass 1, 947 shared.
  pass 2, 723 shared.
  pass 3, 599 shared.
  pass 4, 555 shared.
  pass 5, 589 shared.
  pass 6, 617 shared.
  pass 7, 640 shared.
  pass 8, 660 shared.
  pass 9, 685 shared.
  pass 10, 727 shared.
  pass 11, 741 shared.
  pass 12, 717 shared.
  pass 13, 750 shared.
  pass 14, 752 shared.
  pass 15, 835 shared.
  pass 16, 765 shared.
  pass 17, 814 shared.
  pass 18, 833 shared.
  pass 19, 849 shared.
  pass 20, 883 shared.
  pass 21, 806 shared.
  pass 22, 822 shared.
  pass 23, 705 shared.
  pass 24, 697 shared.
  pass 25, 853 shared.
  pass 26, 692 shared.
  pass 27, 789 shared.
  pass 28, 765 shared.
  pass 29, 800 shared.
  pass 30, 733 shared.
  pass 31, 768 shared.
  pass 32, 777 shared.
  pass 33, 714 shared.
  pass 34, 679 shared.
  pass 35, 639 shared.
  pass 36, 543 shared.
  pass 37, 479 shared.
  pass 38, 488 shared.
  pass 39, 506 shared.
  pass 40, 493 shared.
  pass 41, 507 shared.
  pass 42, 542 shared.
  pass 43, 493 shared.
  pass 44, 489 shared.
  pass 45, 492 shared.
  pass 46, 497 shared.
  pass 47, 443 shared.
  pass 48, 462 shared.
  pass 49, 423 shared.
  pass 50, 422 shared.
  pass 51, 359 shared.
  pass 52, 328 shared.
  pass 53, 267 shared.
  pass 54, 293 shared.
  pass 55, 282 shared.
  pass 56, 220 shared.
  pass 57, 263 shared.
  pass 58, 247 shared.
  pass 59, 250 shared.
  pass 60, 206 shared.
  pass 61, 201 shared.
  pass 62, 181 shared.
  pass 63, 134 shared.
  pass 64, 162 shared.
  pass 65, 164 shared.
  pass 66, 127 shared.
  pass 67, 112 shared.
  pass 68, 75 shared.
  pass 69, 57 shared.
  pass 70, 44 shared.
  pass 71, 58 shared.
  pass 72, 46 shared.
  pass 73, 54 shared.
  pass 74, 46 shared.
  pass 75, 28 shared.
  pass 76, 19 shared.
  pass 77, 17 shared.
  pass 78, 31 shared.
  pass 79, 32 shared.
  pass 80, 21 shared.
  pass 81, 21 shared.
  pass 82, 21 shared.
  pass 83, 20 shared.
  pass 84, 27 shared.
  pass 85, 18 shared.
  pass 86, 10 shared.
  pass 87, 7 shared.
  pass 88, 6 shared.
  pass 89, 4 shared.
    shared net #4625 (demand = 2).
      used by wire $abc$58379$n6279
      used by wire registers_m.pc[23]
    shared net #9632 (demand = 2).
      used by wire $abc$58379$n6279
      used by wire DAT_I[10]$2
    shared net #102099 (demand = 2).
      used by wire $abc$48572$n250
      used by wire $abc$58379$n3863
    shared net #105519 (demand = 2).
      used by wire $abc$48572$n182
      used by wire $abc$58379$n3863
  pass 90, 2 shared.
    shared net #125179 (demand = 2).
      used by wire $abc$58379$n3044_1
      used by wire registers_m.ir[7]
    shared net #125274 (demand = 2).
      used by wire $abc$58379$n5904
      used by wire registers_m.ir[7]
  pass 91, 0 shared.

After routing:
span_4     20765 / 29696
span_12    3716 / 5632

  route time 1149.62s
write_txt test.txt...
// Reading input .asc file..
// Reading 8k chipdb file..
// Creating timing netlist..
Total number of logic levels: 15
Total path delay: 39.38 ns (25.40 MHz)

So very tight indeed!
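To put numbers on "tight", here is a short Python sketch that turns the used/total pairs from the log above into percentages and converts the reported path delay into a clock frequency. The figures are copied from the log; the dictionary layout and the `utilisation` helper are just for illustration.

```python
# (used, total) pairs copied from the arachne-pnr log above.
report = {
    "LCs": (7191, 7680),
    "PLBs": (951, 960),
    "BRAMs": (15, 32),
    "span_4": (20765, 29696),
    "span_12": (3716, 5632),
}


def utilisation(used, total):
    """Return resource utilisation as a percentage."""
    return 100.0 * used / total


for name, (used, total) in report.items():
    print(f"{name:8s} {used:6d} / {total:6d}  {utilisation(used, total):5.1f}%")

# Convert the reported critical path delay to a maximum clock frequency.
path_delay_ns = 39.38
fmax_mhz = 1000.0 / path_delay_ns
print(f"fmax is about {fmax_mhz:.1f} MHz")
```

Logic cells and PLBs are both well above 90% used, while only about half of the block RAMs are taken, which fits the long routing run.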

Dave


Mon Sep 18, 2017 8:10 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1799
Wow that struggled! Surprising that the speed isn't too bad.

Thanks for running it through.


Mon Sep 18, 2017 8:40 pm

Joined: Tue Feb 10, 2015 7:07 am
Posts: 52
What about the J68?

viewtopic.php?f=13&t=347


Mon Sep 18, 2017 8:42 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1799
I'd like to say I'd never heard of it, but I don't think I can get away with it - thanks for digging!


Tue Sep 19, 2017 5:45 pm