GF-RV16 - an experimental 16-bit RISC-V ISA
| Author |
Message |
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2423 Location: Canada
|
I have been following along on your project. I liked to see the priority you have given to testing. Testing is a lot of the work. I am used to working with FPGAs. But I think CPLDs are similar. I have found that large elsif trees tend to generate large priority tree logic during synthesis that is chained together resulting in a lot of hardware that is slow. Synthesis may optimize it or it may not depending on how good a job it does. It is often better to use a case statement where possible as that tends not to generate cascaded priority logic. Verilog’s case statement (casez) supports don’t care bits. I suspect the same is available in VHDL.
Fewer micro-code entry points could be used, and some of the lower order micro-code address bits would not need to be specified if micro-code routines were spaced out on a power-of-two address. Could four bits of the opcode be used directly as part of the micro-code address? It would save on the decode logic. If ROMs (EPROMs) are used for micro-code they tend to be available in larger sizes so some of the low order bits can be wasted.
To get rid of some of the cases, you could have branches to the same two micro-code addresses (true and false) for all the conditional branch instructions. Using the branch outcome as a bit determining the micro-code address. Then have the condition as an input to the module determining the branch outcome. For ALU operations the same micro-code address could be used as well, with the operation code as input to the module.
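As a rough illustration of the conditional-branch idea, the Python model below folds the branch outcome into the lowest bit of the micro-code address, so that every conditional branch shares the same two entry points. The base address `BRANCH_BASE`, the condition names and both helper functions are hypothetical and not taken from the GF-RV16 design; this is a sketch of the principle only.

```python
# Hypothetical model: all conditional branches share one pair of micro-code
# entry points; the branch outcome supplies the lowest address bit.
BRANCH_BASE = 0x40  # assumed entry address, aligned so that bit 0 is free

# Assumed condition set; the real ISA's conditions may differ.
CONDITIONS = {
    "eq": lambda a, b: a == b,
    "ne": lambda a, b: a != b,
    "lt": lambda a, b: a < b,
    "ge": lambda a, b: a >= b,
}


def branch_outcome(cond: str, a: int, b: int) -> bool:
    """Separate module: decides taken / not taken from the condition code."""
    return CONDITIONS[cond](a, b)


def branch_entry(cond: str, a: int, b: int) -> int:
    """Micro-code entry address: base for 'false', base + 1 for 'true'."""
    return BRANCH_BASE | int(branch_outcome(cond, a, b))
```

The condition code only ever reaches `branch_outcome`, so the micro-code address logic needs no per-branch cases.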
_________________Robert Finch http://www.finitron.ca
|
| Sun Nov 09, 2025 3:41 pm |
|
 |
|
gfoot
Joined: Sat Oct 04, 2025 10:54 am Posts: 25
|
robfinch wrote: I have found that large elsif trees tend to generate large priority tree logic during synthesis that is chained together resulting in a lot of hardware that is slow. Synthesis may optimize it or it may not depending on how good a job it does. It is often better to use a case statement where possible as that tends not to generate cascaded priority logic. Verilog’s case statement (casez) supports don’t care bits. I suspect the same is available in VHDL.
Thanks Rob - yes VHDL has the "case?" statement, with a question mark at the end - however it doesn't seem to be very well supported, at least the GHDL engine that I'm running on doesn't like it. Hence all the "elsif" statements. It's good to know that if I do synthesize it in an FPGA at some point then it'll be more of a problem, I have no experience of that, but at the moment I think I'm OK with how it is, as any hardware implementation is likely to ultimately come from cupl anyway - I'm just using VHDL/GHDL as a development tool to help figure out how I could break it down into modules, and verify how the modules would interact with each other. If I do eventually synthesize it from the VHDL then I guess I could probably just write the logic expressions for the individual microcode address bits in the VHDL, like I'm doing in the cupl code.
Quote: Fewer micro-code entry points could be used, and some of the lower order micro-code address bits would not need to be specified if micro-code routines were spaced out on a power-of-two address. Could four bits of the opcode be used directly as part of the micro-code address? It would save on the decode logic. If ROMs (EPROMs) are used for micro-code they tend to be available in larger sizes so some of the low order bits can be wasted.
Good points indeed - and if the next stage (decoding microcode to control signals) is done in SPLDs, the microcode could still be quite sparse without adding extra "cost". There are some especially interesting possibilities if the high bits of the microcode address are constant throughout the instruction. I need to dig a bit more into the microcode decoding stage though to figure out what the implications would be for that.
Quote: To get rid of some of the cases, you could have branches to the same two micro-code addresses (true and false) for all the conditional branch instructions. Using the branch outcome as a bit determining the micro-code address. Then have the condition as an input to the module determining the branch outcome. For ALU operations the same micro-code address could be used as well, with the operation code as input to the module.
I hadn't thought of branching in the microcode itself. I think I want to avoid it though because I prefer to reduce the number of different sources that something like the microcode address pointer gets loaded from. Maybe that's something to come back to later though.
I did look for instructions that were identical to the final sequences of other instructions, but "jump" was the only one I spotted - being the same as the last few microcode entries for some of the branch instructions.
|
| Sun Nov 09, 2025 6:19 pm |
|
 |
|
oldben
Joined: Mon Oct 07, 2019 2:41 am Posts: 866
|
gfoot wrote: I hadn't thought of branching in the microcode itself. I think I want to avoid it though because I prefer to reduce the number of different sources that something like the microcode address pointer gets loaded from. Maybe that's something to come back to later though.
I did look for instructions that were identical to the final sequences of other instructions, but "jump" was the only one I spotted - being the same as the last few microcode entries for some of the branch instructions.
I have my microcode in a large CPLD, and use WinCUPL. Anything is better than Verilog or VHDL, as I like to work at the gate logic level, for example:
a.d=a&c#b&!c;a.clk=clk1;
What I have done is write a microcode assembler that reads a text file and creates the control logic file. An @ toggles passing data through. When it is not passing data through, it expects microcode to assemble. When done, it appends the logic tables for the ROM lookup to the output. I use a 256x16 ROM. This way I can keep my microcode defines separate from the microcode decode logic.
Code:
* OCT 17 2025 18 bit classic computer. This is a generic micocode source a 18 bit computer with byte addressible memory. This part is microcode assembler (micode.exe) *
@ // enter CPLD logic compiler /* OCT 17 - KAL,KAH,BUS 18 bit classic computer Swr version. Single IRQ trap. This is a generic micocode source a 18 bit computer with byte addressible memory. This part contains the pal logic and the rom lookup tables. 2901 ALU order code.No restart from HLT unless followed by a nop. 3x clock 14.3 Mhz ~ .85 uS. IRQ included. 74lS189,74H04,74HC14 ######____## ___#####____ >< CONTROL/MAR ########### RD (STROBE)
QUICK 102 macro cells
CPU model 18 9 1 +---+---+---+---+---+---+ | : | 0 A GP A E +-----------------------+ | : | 1 B GP B F +-----------------------+ | : | 2 C GP C G +-----------------------+ | : | 3 D GP D H +-----------------------+
+---+---+---+---+---+---+ | : | 4 #/Z (PC) I +-----------------------+ | : | 5 X J +-----------------------+ | : | 6 Y K +-----------------------+ | : | 7 S L +-----------------------+
+-+ | | CF CARRY FLAG SET IF CARRY / A,B,C,D +-+ SET ON SHIFT OUTPUT
ALU FUNCTIONS AC IX CC SHIFT # OPCODE FUNC 0 ST/STB STORE Ae A 0 0 1 ADD/ADC ADD Bf B z 0 2 SUB/SBC SUB A-B Cg C s C 3 CAD/CAC SUB B-A Dh D s+z S 4 AND AND Pi #/Z ~c 0 5 OR OR Xj X ~c+z 0 6 XOR XOR Yk Y true C 7 LD/LDB LOAD Sl S true S
E..L LOAD STORE BASE REGS
876 543 210 987 654 321 +---+---+---+---+---+---+ |1YO|OOO:AAA|320|XXX:+##| AUTO +-----------------------+ |1XO|OOO:AAA|321|XXX:+##| INDEXED +-----------------------+ |0YO|OOO:AAA|###|###:###| BYTE/CTL 2 hlt 0 di 1 ei +-----------------------+ |1-O|OOO:AAA|001|###:...| SHIFT 0..7 is limit +-----------------------+ counter incriments ST OP 0 SWR REG% 1 NOP dsp SFT # 2 JSR JCC+ 3 LEA SCC 4 R+ R+ 5 X+# X+# 6 R+ R+ 7 X+# X+# TRAP PUSH PC, PC = AT 2 */
// PROPERTY Atmel {pin_keep = off }; // PROPERTY Atmel {soft_buffer = on}; // QUICK NAME CTL.PLD ; PARTNO ; DATE 2025-10-17; REVISION ; DESIGNER ; COMPANY ; ASSEMBLY ; LOCATION ; DEVICE F1508PLCC84 ;
PINNODE = [OP4..1,PSW,SYNC,BUSY,BOUNCE,RS,PAN1,PAN2,GO];
PINNODE = [LCTR,CC,CF,WCY,SFT,TS,CT,TSFT];
PINNODE = [SX,LD,TWO,WRD,OP,RA,RX,BYY];
PINNODE = [Y,AUX,EF,RK,EFC,TSW,BY,WR,RD];
PINNODE = [CTR3..1,CTL,TST,IQ,IRQ,NO,DSP,ODD,YT];
PINNODE = [IR18..2,TIR,TIN,TMAR,G2..1,TP,RUN,BANK];
PIN 2 = ACK; // PHASE 2 CLOCK (AP IN)
PIN 4 = K1; // DATA MULX
PIN 5 = K2; // DATA MULX
PIN 70 = K3; // DATA MULX
PIN 8 = CP; // PHASE 1 CLOCK OUT
PIN 9 = SHO; // SHIFT OUT -L
PIN 25 = IRQI; // IRQ IN ACTIVE HIGH FROM 74LS14
PIN 1 = CLR; // RESET ACTIVE HIGH FROM 74LS14
PIN 11 = SH1; // RAM REG 00 LOAD 01 LOAD UNSIGNED
PIN 12 = SH2; // 1X SHIFT
PIN 63 = C18; // CARRY FROM ALU
PIN 6 = APP; // ADDRESS CLOCK
PIN 67 = ALU3; // ALU FUNC
PIN 65 = ALU2; // ALU FUNC
PIN 64 = ALU1; // ALU FUNC
/* 00- ADD 01- SUB 100 AND 101 OR 110 XOR 111 LOAD */
PIN 27 = SWR1; // BOT ALD -L
PIN 28 = SWR2; // BOT+1 EXAM -L
PIN 29 = SWR3; // TOP-1 DEP -L
PIN 30 = SWR4; // TOP R/S -L
PIN 31 = SWR5; // PC/AC
// UN A SIGNED NODES
PINNODE = AD0..7; // AD7..AD0 MICRO CODE ROM ADDRESS
PIN = IR; // TRUE IF INSTRUCTION FETCH
//PIN = STOP;
PIN 33 = BI18;
PIN 34 = BI17;
PIN 35 = BI16;
PIN 36 = BI15;
PIN 37 = BI14;
PIN 39 = BI13;
PIN 40 = BI12;
PIN 41 = BI11;
PIN 44 = BI10;
PIN 45 = BI9;
PIN 46 = BI8;
PIN 48 = BI7;
PIN 49 = BI6;
PIN 50 = BI5;
PIN 51 = BI4;
PIN 57 = YY; // Y MULX FOR MAR,JAM 0 SUM 1 RAM REG
PIN 54 = MR; // MEMORY REQUEST
PIN 55 = MW; // MEM WRITE
PIN 56 = MB; // MEM BYTE
PIN 52 = DK; // DCLOCK ACTIVE HIGH READ PANEL SWITCH // DISPLAY MAR
PIN 60 = MAR; // LOAD MAR REG AND WR,BY FLAGS
PIN 61 = IN; // LOAD INPUT
PIN 71 = CY0; // CARRY OUT
PIN 79 = WE_; // 219 WRITE STROBE
PIN 68 = NSGN; // SIGN BIT RAM -L
PIN 73 = BOUT; // DATA BUS OUTPUT ENABLE WR&MR
PIN 74 = RG3; // 219 RAM AD 3
PIN 75 = RG2; // 219 RAM AD 2
PIN 76 = RG1; // 219 RAM AD 1
PIN 77 = RG4; // 219 RAM AD 4
PIN 58 = EQ; // EQ FLAG IN
PIN 69 = NODD; // ODD BIT FROM RAM -L
PIN 81 = CI; // 4X CLOCK
PIN 83 = CLK; // PHASE 1 CLOCK (CP IN )
/* MICRO CODE STUFF
ROM A OUTPUTS SH,NO,LD,RA,TWO,WRD,OP,WR
"SX", 0X0000C 3 bit data sign extended
"WRD",0X00004 word data
"2" ,0X00008 constant 2
"OP", 0X00002 alu opcode
"SUB",0X00002 subtract
"WR", 0X00001 WR/rd bus flag
"AC" ,0X00010 select ac
"LD" ,0X00020 load operation
"NO" ,0X00040 no load ram
BY,RX,XIR,IN,RD,MAR,Y,AUX
"BY", 0X08000 BY/wrd bus flag
"Y", 0X00200 select sum/ram to mar
"IR", 0X03000 fetch
"IN" ,0X01000 input data
"CTL",0X00100 control state AUX !RA !RX decode IX2
"TST",0X00100 test acc and jump if true to 4 AUX AX
"DSP",0X00100 display mar and read swr AUX RX
"SFT",0X00100 shift a input
"PC" ,0X00000 pc select
"MAR",0X00400 load mar
"IX" ,0X04000 rx index reg select
"RD", 0X00800 memory request
"SWR",0X04010 ac/pc switch display */
IR.D = TIR; // intruction fetch IR.CK = ACK;
IN.D = TIN; // load b input reg IN.CK = ACK;
SFT = TSFT;
MAR.D = TMAR; // load mar register MAR.CK = ACK;
// alu operation ST = !IR15&!IR14&!IR13; OP4 = IR16; OP3 = IR15; OP2 = IR14; OP1 = IR13;
// front panel switches and irq
PSW.D = SWR5; PSW.CK = CLK;
SYNC.D = (!SWR1#!SWR2#!SWR3#!SWR4)&!CLR; SYNC.CK = CLK; SYNC.CE = IR; SYNC.AR = CLR;
BUSY.D = SYNC&!CLR; BUSY.CK = CLK; BUSY.CE = IR; BUSY.AR = CLR;
BOUNCE.D = BUSY; BOUNCE.CK = CLK; BOUNCE.CE = IR; BOUNCE.AR = CLR;
OK = SYNC&BUSY&!BOUNCE;
RS.D = !SWR4&OK; RS.CK = CLK; RS.CE = IR; RS.AR = CLR;
PAN1.D = OK&(!SWR1#!SWR3); PAN1.CK = CLK; PAN1.CE = IR; PAN1.AR = CLR;
PAN2.D = OK&(!SWR2#!SWR3); PAN2.CK = CLK; PAN2.CE = IR; PAN2.AR = CLR; // false panel true running
RUN.D = RUN & !STOP # !RUN & RS;
RUN.CK = CLK; RUN.CE = IR; RUN.AR = CLR;
TP = EF & !AUX & IQ & RUN & IR ;
/* MAIN IRQ ENABLE 00 01 10 10 up 10 01 00 down */
EF.D = CTL & OP4 #!CTL & RUN & EF & !TP; EF.CK = CLK; EF.CE = IR; EF.AR = CLR;
/* READ IRQ ON FALLING EDGE OF CLOCK */
IRQ.D = IRQI & RUN & !AUX ; IRQ.CK = CLK; IRQ.CE = IR; IRQ.AR = CLR;
IQ.D = IRQ & RUN & !AUX ; IQ.CK = CLK; IQ.CE = IR; IQ.AR = CLR;
BANK = !OP1&!OP2&!OP3&OP4 # OP1& OP2& OP3&OP4;
// SELECT ALU REGISTER
RG1 = RA&!RX & IR10 & RUN # !RA& RX & IR4; RG2 = RA&!RX & IR11 & RUN # !RA& RX & IR5; RG3 = RA&!RX & IR12 & RUN # !RA& RX & IR6 # !RA&!RX // PC # RA&RX&PSW; // PC,AC
// SELECT BANK EFGHIJKL RG4 = RA&!RX & BANK & RUN # !RA& RX & NO & IR17;
// WORD ZLD = !RA&RX & IR6 & !IR5 & !IR4 & NO & K1 & !K2 &!IR17; CTL = AUX & !RA & !RX ; TST = AUX & RA;
// FRONT PANEL DK DK = DSP; DSP.CK = CI; DSP.D = AUX & RX & !CLR & !CP; DSP.AR = CLR;
BOUT = MW&MR&!CLR;
// INTERNAL TIMING AND FLIP FLOPS // RAM STROBE
WE_ = !(!NO&!CLK&ACK);
G1.D = !G1&!G2&!CLR; G2.D = G1&!CLR; G1.CK = CI; G2.CK = CI;
// INTERNAL CLOCK GEN 4 PHASE // START HIGH CP.CK = CI; CP.AP = CLR; CP.CE = G2; CP.D = !APP; // SET HIGH ON CLEAR
APP.CK = CI; APP.AR = CLR; APP.CE = G2; APP.D = CP&!CLR;
// MEMORY CONTROL // G BUS BYTE REQUEST MB.D = BY; MB.CK = CLK; MB.AR = CLR; // WR/rd BUS REQUEST MW.D = WR; MW.CK = CLK; MW.AR = CLR;
// MEMORY REQ STROBE
MR.D = RD; MR.CK = ACK; MR.AR = CLR;
STOP = CTL&YT # RUN&!SWR1&SYNC # RUN&RS;
LCTR = TST &(CC$OP4)// NORMAL TEST # TST & OP3&OP2; // TRUE TEST YT = IR17;
YY.D = !RX & Y // SELECT RAM FOR OUTPUT PC++,AC++ # RX & Y & !YT; // RAM FOR OUTPUT R++ YY.CK = ACK;
CY0.D = OP4 & OP & !OP3 & OP2 & CF # !OP4 & OP & !OP3 & OP2 // SUB OPS # OP & !OP3 &!OP2 &!OP1;
CY0.CK = ACK;
WCY = OP & !OP3 & (OP2#OP1) & !IR12; // ABCD ONLY
BYY = RUN&(!IR18#IR8); K1 = WRD; K2 = TWO;
K3 = K1&!K2&!(BYY&OP);
// WORD K1' SGN IN K2'
ALU1.CK = ACK; ALU2.CK = ACK; ALU3.CK = ACK; ALU1.D = LD # ZLD # OP &OP1;
ALU2.D = LD # ZLD # OP & !OP3 &!OP2 &!OP1 // SUBTRACT # OP & OP2 ; ALU3.D = LD #ZLD # OP3&OP;
SH1 = SFT & OP3; // DOWN SHIFT RIGHT // SHIFT 0..7 SH2 = SFT ; INC = SFT & !(IR6&IR5&IR4);
SHO = OP2&OP1&NSGN // SHIFT OUT SIGN # OP2&!OP1&CF;
TS.D = NSGN; TS.CK = ACK; // TEMP SIGN FLAG RAM 18 ODD.D = NODD; ODD.CK = ACK;
CT = SFT&TS&!OP3 // SHIFT UP, CF = SIGN # SFT&ODD&OP3 // SHIFT DOWN, CF = ODD #!SFT&!WCY&CF #!SFT&WCY&C18;
CF.D = CT; CF.CK = CLK; CC = ( // REGULAR CC
OP1&EQ # OP2&TS # OP3&!CF );
$REPEAT N = [7..18]
IR{N}.CK = CLK; IR{N}.CE = IR;
$REPEND
IR4.CK = CLK; IR5.CK = CLK; IR6.CK = CLK;
IR4.CE = IR # INC; IR5.CE = IR # INC; IR6.CE = IR # INC;
IR4.D = !BI4& !INC # !IR4&INC; IR5.D = !BI5& !INC # (IR5$IR4)&INC; IR6.D = !BI6& !INC # (IR6$(IR5&IR4))&INC;
// AC SET ON TRAP $REPEAT N = [10..12] IR{N}.D = !BI{N}#TP; $REPEND // TRAP DON'T CARE $REPEAT N = [7..9] IR{N}.D = !BI{N}; $REPEND // TRAP CLEAR FOR CTL OPCODE $REPEAT N = [13..18] IR{N}.D = !BI{N}&!TP; $REPEND
$REPEAT N = [3..1]
CTR{N}.CK = CLK; CTR{N}.AR = CLR; AD{N-1} = CT{N}; $REPEND
CT1 = CTR1; CT2 = CTR2; CT3 = CTR3;
CTR1.D = !LCTR & !IR & !CT1 & !INC ; CTR2.D = !LCTR & !IR & (CT2$CT1) # TP; CTR3.D = !IR & (CT3$(CT1&CT2)) # LCTR;
AD3 = !RUN & PAN1 # RUN & IR7 & IR18; AD4 = !RUN & PAN2 # RUN & IR8 # RUN & !IR18; AD5 = RUN & IR9 # RUN & !IR18; AD6 = !IR15&!IR14&!IR13#!RUN; // STORE AD7 = !RUN#!IR18; // PANEL,QUICK
@ / START OF MICRO CODE
#000 / REG% REGISTER OP'S IX NO MAR IN / XXXX PC 2 Y MAR AC OP WRD IR RD
#010 / SHIFT 0..7 1..8 TIMES / IR6..4 IS INCRIMENTED N-1 TIMES AC SFT MAR / SHIFT A PC 2 Y MAR PC IR RD
#020 / JCC IX SX Y MAR AC TST RD IN PC 2 Y MAR AC IR RD #024 / MET CC PC LD WRD MAR PC 2 IR RD
#030 / SCC AC = # AC TST MAR PC 2 Y MAR AC LD IR RD #034 / MET CC PC 2 Y MAR AC LD SX IR RD #040 / R+ IX SX Y MAR PC RD IN PC 2 Y MAR AC OP WRD IR RD
#050 / R PC 2 Y MAR PC RD IN IX NO WRD MAR PC RD IN PC 2 Y MAR AC OP WRD IR RD
#060 / R+ IX SX Y MAR BY PC RD IN BY PC 2 Y MAR AC WRD OP IR RD
#070 / R PC 2 Y MAR PC RD IN IX NO WRD MAR BY PC RD IN BY PC 2 Y MAR AC OP WRD IR RD
/ STORE OPS
#110 / NOP (DISPLAY) AC MAR DSP IN PC 2 Y MAR PC IR RD #100 / READ SWR AC MAR DSP IN PC 2 Y MAR AC LD WRD IR RD
#120 / JSR -S R+ IX SX Y MAR PC RD IN AC 2 SUB MAR WR PC RD WR PC LD WRD MAR PC 2 IR RD
#130 /LEA PC 2 Y MAR PC RD IN IX NO WRD IN MAR PC 2 Y MAR AC LD WRD IR RD
#140 / R+ IX SX Y MAR WR AC RD WR PC 2 Y MAR PC IR RD
#150 / R PC 2 Y MAR PC RD IN IX NO WRD MAR WR AC RD WR PC 2 Y MAR PC IR RD
#160 / R+ IX SX Y MAR WR BY AC RD WR BY PC 2 Y MAR PC IR RD
#170 / R BYTE PC 2 Y MAR PC RD IN IX NO WRD MAR WR BY AC RD WR BY PC 2 Y MAR PC IR RD
#260 /QUICK PC 2 Y MAR AC OP WRD IR RD
#360 / CTL PC 2 Y MAR CTL IR RD / TRAP / PUSH PC PC = @(2)
PC SUB 2 AC SUB 2 MAR WR PC RD WR PC 2 LD MAR PC RD IN PC LD WRD /WRAP AROUND
/ FRONT PANEL
#300 / IDLE
SWR MAR DSP IN / DATA IN HERE AC PC IR IN / TOGGLE IN OFF
#310 / LOAD ADR PC LD WRD AC LD WRD PC PC IR #320 / READ MEM -> AC PC 2 Y MAR AC RD IN AC LD WRD AC IR
#330 / WRITE MEM -> AC AC WRD LD PC 2 Y MAR WR AC RD WR PC IR
@ /* S S G A V A C C G W V R R R H H C N P K K C C L L N C E C G G G 1 O P D P 2 1 C K R K D I _ C 4 1 2 ------------------------------------------- / 11 9 7 5 3 1 83 81 79 77 75 \ / 10 8 6 4 2 84 82 80 78 76 \ SH2 | 12 (*) 74 | RG3 VCC | 13 73 | BOUT IR | 14 72 | GND | 15 71 | CY0 | 16 70 | K3 | 17 69 | NODD | 18 68 | NSGN GND | 19 67 | ALU3 | 20 66 | VCC | 21 65 | ALU2 | 22 ATF1508 64 | ALU1 | 23 84-Lead PLCC 63 | C18 | 24 62 | IRQI | 25 61 | IN VCC | 26 60 | MAR SWR1 | 27 59 | GND SWR2 | 28 58 | EQ SWR3 | 29 57 | YY SWR4 | 30 56 | MB SWR5 | 31 55 | MW GND | 32 54 | MR \ 34 36 38 40 42 44 46 48 50 52 / \ 33 35 37 39 41 43 45 47 49 51 53/ -------------------------------------------- B B B B B V B B B G V B B B G B B B B D V I I I I I C I I I N C I I I N I I I I K C 1 1 1 1 1 C 1 1 1 D C 1 9 8 D 7 6 5 4 C 8 7 6 5 4 3 2 1 0
*/
@ / ALL DONE
|
| Sun Nov 09, 2025 10:52 pm |
|
 |
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1861
|
> The work involved in generating the extended bits is a little heavier than I'd like - e.g. a "less than" comparison basically involves performing a three-bit subtraction. It is on the critical path, though
Hmm, I did wonder about that. But a three bit subtraction here is just a one bit function of 6 input bits - I wonder to what degree a synthesiser could crunch that down into a small fast implementation. (It's a different problem in CPLD, where we use minterms, than in FPGA, where we have LUTs, and a 64 bit mini-rom can be essentially one unit delay.)
|
| Mon Nov 10, 2025 8:35 am |
|
 |
|
gfoot
Joined: Sat Oct 04, 2025 10:54 am Posts: 25
|
BigEd wrote: > The work involved in generating the extended bits is a little heavier than I'd like - e.g. a "less than" comparison basically involves performing a three-bit subtraction. It is on the critical path, though
Hmm, I did wonder about that. But a three bit subtraction here is just a one bit function of 6 input bits - I wonder to what degree a synthesiser could crunch that down into a small fast implementation. (It's a different problem in CPLD, where we use minterms, than in FPGA, where we have LUTs, and a 64 bit mini-rom can be essentially one unit delay.)
I ran it through a Quine-McCluskey simplifier I've been using to see what it could do - here are the product terms it came up with for comparing two 3-bit values:
Code:
lo   hi
00-  --1
00-  -1-
0--  -11
0--  1--
-0-  1-1
-0-  11-
---  111
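A term list like this is easy to check by brute force over all 64 input pairs. The Python sketch below is not the simplifier itself; it only evaluates the seven terms listed above, assuming each pattern is written most significant bit first and that "-" means don't-care. The function names are made up for the example, and it deliberately asserts nothing about the lo == hi case.

```python
from itertools import product

# (lo pattern, hi pattern), most significant bit first; '-' is a don't-care.
TERMS = [
    ("00-", "--1"), ("00-", "-1-"), ("0--", "-11"), ("0--", "1--"),
    ("-0-", "1-1"), ("-0-", "11-"), ("---", "111"),
]


def matches(pattern: str, value: int) -> bool:
    """True if the 3-bit value agrees with every non-don't-care position."""
    bits = format(value, "03b")
    return all(p in ("-", b) for p, b in zip(pattern, bits))


def less_than(lo: int, hi: int) -> bool:
    """OR of all product terms, as the PLD's sum-of-products would compute it."""
    return any(matches(pl, lo) and matches(ph, hi) for pl, ph in TERMS)


def check_terms() -> bool:
    """Some term must match when lo < hi, and none may match when lo > hi."""
    for lo, hi in product(range(8), repeat=2):
        if lo < hi and not less_than(lo, hi):
            return False
        if lo > hi and less_than(lo, hi):
            return False
    return True
```

Calling `check_terms()` returns `True` for the list above under the stated bit-order assumption.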
If "lo < hi" one of these terms will match, and if "lo > hi" then none of them will match. If "lo == hi" then a term may or may not match - it doesn't matter for my purposes so the algorithm is allowed to go whichever way results in fewer product terms in the end. This simplifier might still not be doing the best job, it's not something I know a lot about.
I've been using it to analyse the complexity of my microcode ( https://raw.githubusercontent.com/gfoot ... rocode.txt). I have a little under 256 distinct microcode addresses, and it looks like about 30 bits' worth of control signal information coming out. Encoding those would require 4 8-bit EPROMs, or 2 16-bit ones. I expect it would all fit in one CPLD. Using ATF22V10s instead, though, is limiting due to the number of product terms supported for each output, and that's what I wanted to analyse.
For each signal that I need, I made my program loop through the microcode and gather the addresses where the signal should be 1, and separately the addresses where I don't care what value the signal takes. It then feeds these into the simplifier and outputs an estimate of how many macrocells it will take to compute the signal, followed by the number of product terms that remain:
Code:
signal      macrocells  product terms
ALUOP0:     2           18
ALUOP1:     3           34
ALUOP2:     3           25
ALUOP3:     1           6
HIGH:       4           42
IF:         4           45
MEMW:       1           2
PCW:        2           15
MARW:       3           31
ENDZ:       3           29
ENDNZ:      3           31
REGW_SRC0:  2           14
REGW_SRC1:  1           2
REGW:       2           22
B_MARNR:    1           7
B_MEMR:     1           3
B_PCR:      2           14
B_REGR:     3           35
B_ZERO:     2           21
REGSEL_W0:  1           6
REGSEL_W1:  1           8
REGSEL_W2:  1           3
REGSEL_R0:  2           15
REGSEL_R1:  1           12
REGSEL_R2:  1           3
BUS_A0:     2           20
BUS_A1:     2           20
BUS_A2:     2           24
BUS_A3:     2           21
Each of these is a single bit. The ones with numbers work together as you'd expect, e.g. REGSEL_W[0..2] form a 3-bit enum value identifying which register to write to (not a register number), and ALUOP[0..3] specify an ALU operation, which takes four bits. These will get decoded further externally.
I thought it was interesting to see which terms were particularly expensive. Something like "HIGH" is fairly randomly set or clear on each microcode instruction, and is never "don't care". "IF" is also often set and never "don't care". I guess ~45 product terms is probably just what it takes to encode a random bit value across the 8-bit input range. Other signals that are sometimes "don't care" are a lot cheaper, and also signals that are frequently zero - like ALUOP3 - are much cheaper as well.
B_REGR is set whenever REGSEL_R[0..2] has a specific value; when REGSEL_R[0..2] are "don't care", B_REGR is zero. B_REGR ends up a lot more expensive than REGSEL_R[0..2] are individually. I did try using a specific value of REGSEL_R[0..2] to mean "don't read a register" instead of using a separate signal (B_REGR) but that made all the bits of REGSEL_R more expensive. I think having an explicit extra bit, and more "don't cares", gives the simplifier more options.
I adjusted the ALUOP enum order to reduce the number of product terms required for its bits - not very methodically but it made a big difference. There may be more opportunities to reorder other enums but it's a bit hard to predict and the numbers depend a lot on the specific microcode content.
Overall though, it looks like if I wanted to decode this using ATF22V10s it'd need about six of them (each one has 10 output pins, and the numbers in the first column add up to about 60 macrocells). I had hoped it could be done with fewer, but it's not too bad, and the EPROM option would require several EPROMs so isn't necessarily any better. EPROMs are also slower.
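The device estimate is just a sum and a ceiling, which the short Python sketch below reproduces from the first column of the listing above (read here as macrocells per signal). It is only a lower bound: it ignores the differing product-term capacities of the macrocells and the need to keep chained macrocells on one device.

```python
import math

# First column of the listing above: estimated macrocells per signal.
MACROCELLS = {
    "ALUOP0": 2, "ALUOP1": 3, "ALUOP2": 3, "ALUOP3": 1,
    "HIGH": 4, "IF": 4, "MEMW": 1, "PCW": 2, "MARW": 3,
    "ENDZ": 3, "ENDNZ": 3, "REGW_SRC0": 2, "REGW_SRC1": 1, "REGW": 2,
    "B_MARNR": 1, "B_MEMR": 1, "B_PCR": 2, "B_REGR": 3, "B_ZERO": 2,
    "REGSEL_W0": 1, "REGSEL_W1": 1, "REGSEL_W2": 1,
    "REGSEL_R0": 2, "REGSEL_R1": 1, "REGSEL_R2": 1,
    "BUS_A0": 2, "BUS_A1": 2, "BUS_A2": 2, "BUS_A3": 2,
}
OUTPUTS_PER_22V10 = 10  # one macrocell per output pin

total = sum(MACROCELLS.values())
devices = math.ceil(total / OUTPUTS_PER_22V10)
print(total, devices)  # prints: 58 6
```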
So this is a bit inconclusive at the moment - still, maybe good enough! And an interesting journey with the Quine-McCluskey algorithm. I would like to get this working using simpler devices than CPLDs if possible, but at the same time this would be an easy fit for a CPLD so maybe a future iteration can be based on that.
|
| Mon Nov 10, 2025 1:52 pm |
|
 |
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1861
|
Always interesting to see the effects of declaring don't-care values, and of reordering arbitrary sub-encodings...
|
| Mon Nov 10, 2025 2:06 pm |
|
 |
|
gfoot
Joined: Sat Oct 04, 2025 10:54 am Posts: 25
|
I haven't had much time lately, but today I did spend a bit of time writing some code to try to pack the microcode decoding into ATF22V10s. Each device has 10 macrocells - two with 8 product terms, two with 10, two with 12, two with 14, and two with 16. My code is probably not optimal, but it does successfully find a packing of the various bits I currently have defined into 5 devices:
Code:
   pin 14     pin 23     pin 15     pin 22     pin 16     pin 21     pin 17     pin 20     pin 18     pin 19
0  ALUOP0     B_PCR      ALUOP0     B_PCR      !ALUOP1    !ALUOP1    !ALUOP1    !REGSEL_R0 !IF        !IF
1  B_MARNR    REGSEL_W0  HIGH       REGSEL_R1  HIGH       MARW       HIGH       MARW       ENDZ       ENDZ
2  REGSEL_W1  ALUOP3     PCW        PCW        A_CONST2   !REGW_SRC0 ENDNZ      REGW       ENDNZ      REGW
3  REGW_SRC1  REGSEL_W2  A_MAR      A_MAR      REGSEL_R2  A_MARSX    A_CONST0   ALUOP2     A_CONST0   ALUOP2
4  MEMW       B_MEMR     !B_REGR    B_ZERO     !B_REGR    A_IMM      !B_REGR    A_IMM      A_CONST1   B_ZERO
The leftmost columns are 8-term pins, the rightmost ones are 16-term pins. Some of the signals require multiple macrocells chained together - these always have to be on the same device as each other of course. There are quite a lot of spare product terms here, but as all of the macrocells are used, in this particular case I think this is the best that can be done. In general I think my packing code is probably not always going to find the best packing though.
I made the packing code run the Quine-McCluskey algorithm twice - once coding for 1s and again coding for 0s - and pick whichever resulted in fewer product terms. This is why some signals have a ! before them - it means that in the PLD they'll be calculated inverted, and re-inverted at the output pin.
The code also looks for opportunities to share macrocells between output signals, but doesn't yet apply these optimisations. For example it finds that ENDZ and ENDNZ have 18 product terms in common, and together they currently use four macrocells; if instead one macrocell was allocated to 16 of the common product terms, they could both use that one along with one other, meaning they'd only need three in total. This could probably save 5 or 6 macrocells overall I think. This is all very dependent on the exact contents of the microcode, and the order in which the instructions appear in the microcode, so changes there could push this over the edge.
We were talking about sparse encodings before, and I think Rob suggested feeding some bits from the encoded instruction straight through, so I also prototyped the effect this would have on the microcode decoder. I currently have about 240 active microcode entries, with 8-bit microcode addresses. To apply Rob's suggestion, I considered using 12-bit microcode addresses, with the top 4 bits coming directly from the encoded instruction, the next 4 coming from the instruction decoder, and the bottom 4 always starting at 0 for each instruction.
This spreads the microcode out over the 12 bit range (4096 total addresses) and unused addresses are passed to the Quine-McCluskey simplifier as "don't cares" for all bits. The QM simplifier runs much more slowly, but the result is that a lot fewer product terms are required overall - another example of sparser data allowing simpler decoding. The result then fits into one fewer ATF22V10 device:
Code:
   pin 14     pin 23     pin 15     pin 22     pin 16     pin 21     pin 17     pin 20     pin 18     pin 19
0  !REGW_SRC0 B_MARNR    MARW       REGSEL_W0  PCW        MARW       ALUOP1     ALUOP1     ALUOP0     IF
1  !REGSEL_W2 ALUOP3     REGSEL_W1  REGSEL_R2  !ENDZ      !ENDZ      HIGH       HIGH       B_ZERO     !REGSEL_R0
2  REGW       A_MARSX    !ENDNZ     !B_REGR    !ENDNZ     !B_REGR    A_IMM      A_CONST1   A_MAR      REGW
3  MEMW       REGW_SRC1  A_CONST0   ALUOP2     A_CONST0   ALUOP2     B_PCR      REGSEL_R1  A_CONST2   B_MEMR
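For concreteness, the 12-bit address layout described in this post can be modelled in a few lines of Python. The function and parameter names are hypothetical, which four instruction bits are passed through is left open, and the wrap-around of the step counter is an assumption of the sketch rather than something the design specifies.

```python
def microcode_address(instr_bits: int, decoder_bits: int, step: int = 0) -> int:
    """Compose a 12-bit address: [11:8] raw instruction bits,
    [7:4] instruction decoder output, [3:0] step within the instruction."""
    for name, value in (("instr_bits", instr_bits),
                        ("decoder_bits", decoder_bits),
                        ("step", step)):
        if not 0 <= value <= 0xF:
            raise ValueError(f"{name} must fit in 4 bits")
    return (instr_bits << 8) | (decoder_bits << 4) | step


def next_address(addr: int) -> int:
    """Advance one microcode step: only the low 4 bits count, the upper
    8 bits stay fixed (wrapping at 16 steps is assumed here)."""
    return (addr & 0xFF0) | ((addr + 1) & 0x00F)
```

For example, `microcode_address(0xA, 0x3)` gives `0xA30`, and `next_address(0xA30)` gives `0xA31`.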
This addressing method would also reduce the complexity of the microcode address counter, as only the low 4 bits would need to be able to count up. And as Rob said, it should make the instruction decoder a bit simpler as well, as it will have fewer bits to output now. I'm hoping I didn't miss something here - I'll need to get the results into VHDL and add some tests to confirm that it really works. Does anybody know of freely-available tools that can do this kind of packing automatically? Can VHDL/Verilog do this for you, and write out the final product terms? The code I've written here seems to work OK but if there are good off-the-shelf tools for it then I'd like to try them too.
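To make the packing problem concrete for anyone who wants to try other tools on it, here is a minimal first-fit-decreasing sketch in Python. It is not the packing code described above: it assumes that chaining macrocells costs no extra product terms, it ignores inversion and term sharing, and the names `pick_cells` and `pack` are invented for the example. Only the macrocell capacities come from this thread.

```python
from typing import Dict, List, Optional

# Product-term capacity of the ten ATF22V10 macrocells, as described earlier.
CELL_TERMS = [8, 8, 10, 10, 12, 12, 14, 14, 16, 16]


def pick_cells(terms: int, free: List[int]) -> Optional[List[int]]:
    """Choose free cells on ONE device whose capacities sum to >= terms."""
    fitting = [c for c in free if c >= terms]
    if fitting:
        return [min(fitting)]  # smallest single cell that is big enough
    chosen, remaining = [], terms
    for cap in sorted(free, reverse=True):  # otherwise chain the largest cells
        chosen.append(cap)
        remaining -= cap
        if remaining <= 0:
            return chosen
    return None  # this device cannot hold the signal


def pack(signals: Dict[str, int]) -> List[Dict[str, List[int]]]:
    """First-fit decreasing; returns one {signal: [cell sizes]} dict per device."""
    free_cells: List[List[int]] = []
    placed: List[Dict[str, List[int]]] = []
    for name, terms in sorted(signals.items(), key=lambda kv: kv[1], reverse=True):
        for i, free in enumerate(free_cells):
            chosen = pick_cells(terms, free)
            if chosen is not None:
                break
        else:  # no existing device had room: open a new one
            free_cells.append(list(CELL_TERMS))
            placed.append({})
            i = len(free_cells) - 1
            chosen = pick_cells(terms, free_cells[i])
            if chosen is None:
                raise ValueError(f"{name} needs more terms than one device offers")
        for cap in chosen:
            free_cells[i].remove(cap)
        placed[i][name] = chosen
    return placed
```

With the earlier product-term counts for IF (45), HIGH (42) and MEMW (2), `pack` places all three on a single device, chaining three cells for IF and four for HIGH. A greedy pass like this will not generally find the best packing.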
|
| Sun Nov 16, 2025 3:51 am |
|
 |
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1861
|
Interesting, not just a mapping problem but a packing problem. I wonder if the larger synthesis toolchains might be capable of helping, but I can't immediately think of how to describe the problem. I think perhaps the state of the art would operate at larger scale - multi-FPGA. And therefore maybe not very helpful.
|
| Sun Nov 16, 2025 9:23 am |
|
 |
|
gfoot
Joined: Sat Oct 04, 2025 10:54 am Posts: 25
|
Yes, it's probably unusual to bother trying to fit things into multiples of these devices, and I can imagine that tools for fitting across devices probably assume that you are at least using more capable devices in the first place! However, applying known techniques to unusual-shaped problems is something generative AI can be quite good at, so I thought I'd give it a go: https://gemini.google.com/share/79b521f65e7b
It did OK - its first attempt was pretty good, it immediately stated pretty much all the considerations I had also identified, and was able to split large numbers of product terms across macrocells and fit them into devices bearing in mind that the macrocells don't all have the same widths of product term inputs. But it didn't try to keep all the macrocells concerned with each output on the same device, so it would have required cross-device wiring, extra input pins, and more latency. I asked it to change that, and it seemed to start making mistakes which I had to spot and correct a few times. If I didn't already know there was a better solution, I might not have spotted the miscounting error. But it does seem to have ended up with a viable mapping in the end. It also sounds like if I gave it the product terms it might be able to also find cases where intermediate results can be shared between output terms, though that's another layer of complexity that would be a lot harder to spot errors in.
Edit - looking again it seems to have got the pin assignments wrong, e.g. claiming that pin 23 supports 16 PTs.
|
| Sun Nov 16, 2025 11:05 am |
|
 |
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1861
|
That's something I wouldn't have thought of trying... remarkable responses.
|
| Sun Nov 16, 2025 8:53 pm |
|