TOYF processor design --- by Hugh Aguilar --- October 2017

Abstract: The TOYF (Toy Forth) processor is less complicated than my 65ISR processor design, and more Forth-centric. The intention is that the TOYF would be easier to implement in HDL than the 65ISR. The TOYF is somewhat similar to the MiniForth at Testra, which is also 16-bit Harvard (built on the Lattice isp1048 PLD). The MiniForth was later upgraded to an FPGA and the name changed to RACE. I wrote MFX, the development system for the RACE --- the TOYF development system is based on MFX. The TOYF has 125 instructions, which is quite a lot, but they are all simple single-cycle instructions that don't do much. The TOYF does out-of-order execution, so as many as three instructions can execute in one clock cycle. The 65ISR has instructions that work with 1-bit data, so it is a much better choice for machines with a lot of switches. The 65ISR-abu addresses 16MB of memory, so it is a much better choice for applications that need to buffer entire files. The TOYF is mostly a fun processor for Forth enthusiasts, but it could be used in applications that have minimal I/O. The TOYF doesn't actually have interrupts --- it polls the I/O after executing each primitive in the main program. The TOYF would be practical as a micro-controller if it had a 65ISR-chico as a coprocessor to handle the I/O interrupts. The TOYF provides better support for the main program than the 65ISR does because it has Forth's traditional two stacks.

Section 1.) the architecture

We have the following registers (those that access data-memory have an Rx alternate name):

CF      1-bit   carry flag              CF should be 0 prior to SBC of low bytes (like the MC6800, not the 6502)
VF      1-bit   overflow flag
PC      16-bit  program counter         this points into the machine-code (the only register that points to code-memory)
AX R0   16-bit  general purpose
BX R1   16-bit  general purpose         this is the top value of the single-stack
DX R2   16-bit  general purpose
IP R3   16-bit  instruction pointer     this points into the Forth threaded-code
LF R4   5-bit   local-frame pointer     high-byte is $00, low-byte is 111xxxxx --- this points at the local frame
RS R5   5-bit   return-stack pointer    high-byte is $00, low-byte is 111xxxxx --- this points at the return-stack
SS R6   5-bit   single-stack pointer    high-byte is $00, low-byte is 110xxxxx --- this is the data-stack for single-cell data
LX R7   16-bit  load register           most memory reads use LX as the destination (primarily for use by LDR)
MA R8   16-bit  memory-access pointer   all memory loads and stores use MA as the address

The processor is Harvard architecture. Code-memory and data-memory can be accessed concurrently in the same clock cycle. Both have a 16-bit data-bus and are word-addressable. The two stacks are both in zero-page. The RS stack is in the upper 32 words and the SS stack is in the 32 words below the RS stack. I/O ports and important global variables can be in the lower 192 words of zero-page. Presumably the code-memory (32KW max) would be internal to the FPGA so that it is fast. If the zero-page RAM is internal to the FPGA, it should be fast too. When MA points outside of zero-page, some extra clock cycles are added on for the access to external data-memory. I would expect maybe 100 MHz normally, dropping down to 25 MHz when accessing external data-memory. There are no instructions that access two adjacent memory locations, so the processor is neither little-endian nor big-endian. As a convention though, little-endian should be used, except for the Forth data-stack, which is traditionally big-endian.
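The $00 high-byte of LF, RS, and SS puts everything they address in the lowest 256 words of data-memory. A sketch of the resulting zero-page layout (word addresses, following the split described above):

    $0000-$00BF   192 words   I/O ports and important global variables
    $00C0-$00DF    32 words   SS single-stack (low-byte 110xxxxx)
    $00E0-$00FF    32 words   RS return-stack and local frames (low-byte 111xxxxx)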
Code-memory has a 15-bit address-bus (32KW). The cfa has the high-bit clear for primitives, so 32KW is all we can address. The data-bus is 16-bit. I was originally going to have 24-bit wide opcodes, but then I found 16-bit to be adequate. If the TOYF is to be upgraded however, the opcodes will have to be made wider to accommodate some more groups of instructions.

Data-memory has a 16-bit data-bus (64KW). We address words, not bytes! ADD 1,BX advances BX to the next 16-bit word. Forth threaded code must be in the upper 32KW because the cfa has the high-bit set for colon words. Code-memory and data-memory are 32KW and 64KW respectively --- or 64KB and 128KB in terms of bytes.

Each opcode has fields, with an instruction in each field. All of these instructions execute concurrently in one clock cycle. The assembler rearranges the instructions and packs them into the fields, inserting NOP instructions as necessary. The goal is to minimize how many NOP instructions get inserted, while guaranteeing that the program does the same thing as it would if we had one instruction per opcode in the same order that the instructions appeared in the source-code. The opcode has these fields:

    MMMMMAAAAAABBBBB
    FEDCBA9876543210    these are the bit positions of the fields

There are three fields: MMMMM AAAAAA BBBBB. We are executing up to three instructions per clock cycle.

To help the assembler pack instructions into opcodes efficiently (minimum NOP instructions inserted), the programmer should hold data in a register for a while before using that register. Don't modify a register and then immediately use that register for something. Try to interleave instructions that modify different registers, as these instructions can parallelize (see the sketch at the end of this section). This is very similar to Pentium programming, in which UV-either and V-specific instructions were interleaved. The TOYF is more complicated though, because there are three "pipes" (opcode fields) rather than just two.

Not counting NOP, there are 31 MMMMM instructions, 63 AAAAAA instructions, and 31 BBBBB instructions. So, the TOYF has 125 instructions total, which is a lot, but this shouldn't make implementation unreasonably difficult. The TOYF has simple instructions that execute in one clock cycle, whereas the 65ISR has complicated multi-cycle instructions. These are completely different designs --- the TOYF is RISC and the 65ISR is CISC --- the 65ISR is a throwback to the 1980s. The TOYF is a throwback to the 1990s (it is similar to the MiniForth that Testra came out with in 1995).

Upon start-up, RS and SS are set to zero. They can't be modified except by being incremented and decremented.
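As an illustration of that interleaving advice (a sketch using three instructions from the next section; the packing shown is what the assembler could do, not a promise about any particular program), these three instructions touch different registers and belong to different fields:

    pls             ; group-M: move SS to MA and bump SS (pops a cell of the single-stack)
    mov 1,AX        ; group-A: load a small constant into AX
    mov BX,DX       ; group-B: save the current top-of-stack in DX

Written in this order, none of them depends on the others, so the assembler can pack all three into a single opcode. Follow any of them immediately with an instruction that uses the register it just wrote, and that second instruction is forced into a later opcode.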
Section 2.) the instruction set

In the following instruction descriptions, the | separates events that are done concurrently.

These are group-M instructions that modify LX PC MA IP LF RS SS:

    nop
    mov AX,LX       ; move AX to LX                 ; this allows EXECUTE and QEXECUTE to work
    mov -1,LX       ; move -1 to LX                 ; a CAD LX,IP after this will back up to redo the primitive
    ldr LX          ; load (MA) into LX
    ldr IP          ; load (MA) into IP
    ldr LF          ; load (MA) into LF
    str IP          ; store IP to (MA)
    str AX          ; store AX to (MA)
    str BX          ; store BX to (MA)
    str LX          ; store LX to (MA)              ; there is no STR DX, so use: MOV DX,LX STR LX
    mov RS,LF       ; move RS to LF
    mov LF,RS       ; move LF to RS
    add AX,RS       ; move RS + AX to RS            ; this discards some items from the return-stack
    mov IP,MA       ; move IP to MA
    cad LX,IP       ; if CF then move IP + LX to IP
    add 1,IP        ; move IP + 1 to IP
    pol             ; move POLL to PC
    nxt             ; move IP + 1 to IP and to MA | move NEXT to PC
    isr             ; if CF then move FULL to PC
    nxp             ; if high-bit of LX then move LX to PC (note that PC is 15 bits, so the high bit is discarded)
    nxf             ; store IP to (MA) | move LX to IP and to MA | move NEXT to PC
    mov AX,MA       ; move AX to MA
    mov BX,MA       ; move BX to MA
    mov DX,MA       ; move DX to MA
    mov LF,MA       ; move LF to MA                 ; this is the address of the parent's LF under the local variables
    lcl AX,MA       ; move LF + AX to MA            ; if AX is the local offset, this sets MA to the local's address
    mov S0,MA       ; move SS to MA                 ; this is the address of the second-of-stack (BX is top-of-stack)
    mov S1,MA       ; move SS + 1 to MA             ; this is the address of the third-of-stack
    phr             ; move RS - 1 to RS and to MA
    plr             ; move RS to MA | move RS + 1 to RS
    phs             ; move SS - 1 to SS and to MA
    pls             ; move SS to MA | move SS + 1 to SS

These are group-A instructions that modify AX:

    nop
    mov LX,AX       ; move LX to AX
    mov BX,AX       ; move BX to AX
    mov DX,AX       ; move DX to AX
    mov LF,AX       ; move LF to AX                 ; this is needed because we don't have a STR LF instruction
    sum BX,AX       ; if CF then move AX + BX to AX
    add BX,AX       ; move AX + BX to AX
    sub BX,AX       ; move AX - BX to AX
    add DX,AX       ; move AX + DX to AX
    and BX,AX       ; move BX -and- AX to AX
    ior BX,AX       ; move BX -ior- AX to AX
    xor BX,AX       ; move BX -xor- AX to AX
    add 1,AX        ; move AX + 1 to AX
    sub 1,AX        ; move AX - 1 to AX
    not AX          ; move -1 -xor- AX to AX
    neg AX          ; move 0 - AX to AX
    shl AX          ; shift AX left, move 0 into the low-bit
    add LX,AX       ; move AX + LX to AX
    sub LX,AX       ; move AX - LX to AX
    and LX,AX       ; move LX -and- AX to AX
    ior LX,AX       ; move LX -ior- AX to AX
    xor LX,AX       ; move LX -xor- AX to AX
    swp AX          ; exchange the low byte and high byte in AX  ; useful for accessing bytes
    byt AX          ; move AX -and- $FF to AX       ; useful for masking a byte
    mov 0,AX        ; move 0 to AX                  ; 0, 1, etc. are small constants --- use NEG AX to negate them
    mov 1,AX        ; move 1 to AX
    mov 2,AX        ; move 2 to AX
    mov 3,AX        ; move 3 to AX
    mov 4,AX        ; move 4 to AX
    mov 5,AX        ; move 5 to AX
    mov 6,AX        ; move 6 to AX
    mov 7,AX        ; move 7 to AX
    mov 8,AX        ; move 8 to AX
    mov 9,AX        ; move 9 to AX
    mov 10,AX       ; move 10 to AX                 ; 10, 20, etc. are useful as the base to add a value to, for a sum up to 99
    mov 20,AX       ; move 20 to AX
    mov 30,AX       ; move 30 to AX
    mov 40,AX       ; move 40 to AX
    mov 50,AX       ; move 50 to AX
    mov 60,AX       ; move 60 to AX
    mov 70,AX       ; move 70 to AX
    mov 80,AX       ; move 80 to AX
    mov 90,AX       ; move 90 to AX
    mov 100,AX      ; move 100 to AX                ; 100, 200, etc. are useful as the base to add a value to, for a sum up to 999
    mov 200,AX      ; move 200 to AX
    mov 300,AX      ; move 300 to AX
    mov 400,AX      ; move 400 to AX
    mov 500,AX      ; move 500 to AX
    mov 600,AX      ; move 600 to AX
    mov 700,AX      ; move 700 to AX
    mov 800,AX      ; move 800 to AX
    mov 900,AX      ; move 900 to AX
    mov $10,AX      ; move $10 to AX                ; 1, 2, 4, 8, $10, $20, etc. are useful as bit masks for 1-bit I/O ports
    mov $20,AX      ; move $20 to AX
    mov $40,AX      ; move $40 to AX
    mov $80,AX      ; move $80 to AX
    mov $100,AX     ; move $100 to AX
    mov $200,AX     ; move $200 to AX
    mov $400,AX     ; move $400 to AX
    mov $800,AX     ; move $800 to AX
    mov $1000,AX    ; move $1000 to AX
    mov $2000,AX    ; move $2000 to AX
    mov $4000,AX    ; move $4000 to AX
    mov $8000,AX    ; move $8000 to AX

These are group-B instructions that modify BX DX VF CF:

    nop
    mov LX,DX       ; move LX to DX
    mov AX,DX       ; move AX to DX
    mov BX,DX       ; move BX to DX
    add AX,DX       ; move AX + DX to DX
    add AD,BX       ; move AX + DX to BX
    mov LX,BX       ; move LX to BX
    mov AX,BX       ; move AX to BX
    mov DX,BX       ; move DX to BX
    mov MA,BX       ; move MA to BX
    xch DX,BX       ; exchange the DX and BX values
    neg BX          ; move 0 - BX to BX
    mov CF,BX       ; move CF to BX                 ; if CF then this sets BX to TRUE (TRUE is 1, not the traditional -1)
    flg BX          ; if BX<>0 then move 1 to BX    ; normalizes a nonzero value into a TRUE flag
    add 1,BX        ; move BX + 1 to BX
    sub 1,BX        ; move BX - 1 to BX
    sgn DX,BX       ; if BX < 0 then move 0 - BX to BX | if DX < 0 then move 0 - DX to DX | set VF to the -xor- of the original sign bits
    vng BX          ; if VF then move 0 - BX to BX
    bit DX,BX       ; shift DX right, move 0 into the high-bit, move the low-bit into CF | shift BX left, move 0 into the low-bit
    ror 1,BX        ; shift BX right, move CF into the high-bit, move the low-bit into CF
    rol 1,BX        ; shift BX left, move CF into the low-bit, move the high-bit into CF
    add LX,BX       ; move BX + LX to BX, move the carry into CF, set VF if the BX high-bit changes
    add AX,BX       ; move AX + BX to BX, move the carry into CF, set VF if the BX high-bit changes
    adc AX,BX       ; move AX + BX + CF to BX, move the carry into CF, set VF if the BX high-bit changes
    sbc AX,BX       ; move BX - AX - CF to BX, move the borrow into CF, set VF if the BX high-bit changes
    abs BX          ; if BX < 0 then move 0 - BX to BX | move the original BX bit-15 to CF
    mov 0,CF        ; move 0 to CF
    mov VF,CF       ; move VF to CF
    not CF          ; move -not- CF to CF
    mov BX,CF       ; move the -ior- of all BX bits to CF        ; set CF if BX is nonzero
    mov BN,CF       ; move BX bit-15 to CF                       ; set CF if BX is negative
    mov BL,CF       ; move BX bit-15 -xor- VF to CF              ; set CF if SBC AX,BX indicates AX > BX

What follows are some example primitives and macros. This is a signed comparison primitive:

GREATER: ; x y -- flag ; flag is TRUE if x > y
    mov 0,CF
    pls
    ldr LX
    mov LX,AX       ; AX= X  BX= Y
    sbc AX,BX       ; BX= Y - X
    mov BL,CF       ; CF set if Y < X
    mov CF,BX       ; BX= flag
    pol

macro ZTOR ; push a zero onto the return-stack ; this is: 0 >R
    mov 0,AX
    phr
    str AX
    endm

macro TOR ; this is: >R
    phr
    str BX          ; push the item to the return-stack
    drop            ; remove the item from the single-stack
    endm

RFETCH: ; -- val ; this is: R@
    dup
    plr
    ldr LX
    mov LX,BX
    phr             ; this just undoes the PLR earlier
    pol

Note that we don't have interrupts, so there is no danger of an interrupt occurring between the PLR and the PHR to clobber the stack.

ZERO8: ; --
    ztor
ZERO7: ; --
    ztor
ZERO6: ; --
    ztor
ZERO5: ; --
    ztor
ZERO4: ; --
    ztor
ZERO3: ; --
    ztor
ZERO2: ; --
    ztor
ZERO1: ; --
    ztor
    pol

ENTER8: ; a b c d e f g h --
    tor
ENTER7: ; a b c d e f g --
    tor
ENTER6: ; a b c d e f --
    tor
ENTER5: ; a b c d e --
    tor
ENTER4: ; a b c d --
    tor
ENTER3: ; a b c --
    tor
ENTER2: ; a b --
    tor
ENTER1: ; a --
    tor
ENTER0: ; --
    mov LF,AX
    phr
    str AX          ; push the old LF to the return-stack
    mov RS,LF       ; set LF to the new local-frame
    pol

At the front of a colon word that has locals, we optionally have a ZEROx primitive for any zero-initialized locals. Then we have a mandatory ENTERx primitive for the stack-initialized locals (use ENTER0 if there aren't any). Any number of locals can be supported. A maximum of 10 seems reasonable --- it is bad Forth style to use a lot of locals. It is okay for each TOR to fall into the next one, because they aren't going to parallelize (they're group-M instructions). It is also okay for TOR to fall into ENTER0 because the MOV LF,AX will parallelize with the PHR anyway. All the other instructions are group-M and won't parallelize with each other. In code that primarily moves data around in memory, all the instructions are going to be group-M (the 'M' means memory). You get better parallelization when there are calculations being done with AX BX and DX --- these parallelize with group-M. Fortunately, there is usually a lot of calculation being done (that is where the term "computer" comes from).
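As a sketch of the frame that this entry code builds (the counts here are hypothetical, and the local numbering follows from LCL AX,MA adding the offset to LF): a colon word that begins with ZERO2 followed by ENTER3, and is called with a b c on the single-stack, ends up with this frame on the return-stack (which grows downward):

    LF+5   0        ; zero-initialized local (L5)
    LF+4   0        ; zero-initialized local (L4)
    LF+3   c        ; L3
    LF+2   b        ; L2
    LF+1   a        ; L1
    LF     old LF   ; the parent's local-frame pointer

The deepest argument lands in L1 because each TOR pushes the current top-of-stack, so ENTERx pushes c first and a last, just above the old LF.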
LEAVE1: ; -- ; assumes that there is one local on the stack
    mov 1,AX
    plr
    ldr LF          ; restore the old LF
    add AX,RS       ; discard the local variable
EXIT: ; --
    plr
    ldr IP          ; restore the old IP
    pol

Our EXIT primitive is for simple colon words. Our LEAVEx primitive is for colon words that started with ENTERx. We need LEAVE2 LEAVE3 ... LEAVE10 available, assuming that there is a maximum of 10 locals (ZEROx and ENTERx combined).

QEXIT: ; -- ; assumes that this is a quotation called with QEXECUTE
    plr
    ldr IP          ; pull the old IP
    plr
    ldr LF          ; pull the HOF's LF
    pol

Our QEXIT primitive is for quotations. This is like EXIT except that it restores the HOF's LF too.

LOCAL1_ADR: ; -- adr
    mov 1,AX
    dup
    lcl AX,MA
    mov MA,BX
    pol

LOCAL1_FETCH: ; -- val
    mov 1,AX
    dup
    lcl AX,MA
    ldr LX
    mov LX,BX
    pol

LOCAL1_STORE: ; val --
    mov 1,AX
    lcl AX,MA
    str BX
    drop
    pol

LOCAL1_PLUS_STORE: ; val --
    mov 1,AX
    lcl AX,MA
    ldr LX
    add LX,BX
    str BX
    drop
    pol

L1..L10 can be done like this efficiently. Locals above this would need some arithmetic. Note that there is no L0, because at LF we have the old LF value, not a local variable.

LOCAL11_ADR: ; -- adr
    mov 10,AX
    mov AX,DX
    dup
    mov 1,AX
    add DX,AX
    lcl AX,MA
    mov MA,BX
    pol

Our LOCAL11_ADR is an example of the arithmetic needed to get above L10. This isn't very efficient, but it should be rare that anybody has this many locals.
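As a hedged sketch of how these pieces fit together (the word SUM3, the locals syntax, and the names LOCAL2_FETCH and LOCAL3_FETCH are assumptions that follow the LOCAL1_FETCH pattern; it is also assumed that LEAVE3 falls into EXIT the way LEAVE1 does above), a colon word such as:

    : SUM3 { a b c -- sum }  a b + c + ;

might compile into threaded code along these lines:

    ENTER3          ; move a b c from the single-stack into locals L1 L2 L3
    LOCAL1_FETCH    ; a
    LOCAL2_FETCH    ; a b
    PLUS            ; a+b
    LOCAL3_FETCH    ; a+b c
    PLUS            ; a+b+c
    LEAVE3          ; restore the parent's LF, discard the three locals, and exit

PLUS is the primitive shown further down.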
LIT0: ; -- 0
    mov 0,AX
    phs
    str BX          ; the PHS and STR BX are the DUP macro
    mov AX,BX
    nxt

The constants 0 1 2 3 4 5 6 7 8 9 10 are easy and fast because the code can be written as above. There are three opcodes total, despite there being five instructions. The MOV 0,AX and PHS parallelize into one opcode. The STR BX and MOV AX,BX parallelize into one opcode. The NXT doesn't parallelize with anything, so it is an opcode by itself.

LIT29: ; -- 29
    mov 20,AX
    phs
    str BX          ; the PHS and STR BX are the DUP macro
    mov AX,DX
    mov 9,AX
    add AD,BX
    nxt

LIT29 parallelizes pretty well. There are three opcodes total, despite there being seven instructions. The MOV 20,AX and PHS parallelize into one opcode. The STR BX and MOV AX,DX and MOV 9,AX all parallelize into one opcode. The ADD AD,BX and NXT parallelize into one opcode. Parallelization is good! This is what makes a processor fast --- instructions executing in parallel. Both LIT0 and LIT29 get a NXT because they are so short that we don't need to poll the I/O afterward.

LIT: ; -- val
    dup
    add 1,IP
    mov IP,MA
    ldr LX
    mov LX,BX
    pol

LIT_PLUS: ; n -- n+val
    add 1,IP
    mov IP,MA
    ldr LX
    add LX,BX
    pol

LIT and LIT_PLUS are generic for any constant. They each use two cells in the threaded code.

PR0_FETCH: ; -- val ; this is pseudo-register 0 ; leaves the address in DX
    mov 0,AX        ; AX= adr
    dup
    mov AX,DX
    mov AX,MA
    ldr LX
    mov LX,BX
    nxt

PR0_STORE: ; val -- ; this is pseudo-register 0 ; leaves the address in DX
    mov 0,AX        ; AX= adr
    mov AX,DX
    mov AX,MA
    str BX
    drop
    nxt

Global variables are slow and bloaty because we need to put the address in the threaded code and load it with LIT. We can have pseudo-registers in low memory that are much faster. PR0_FETCH is seven opcodes and PR0_STORE is six opcodes. If we have primitives that are specific to these pseudo-registers, they should be quite fast. It is also possible to put important I/O ports in these low addresses so that the I/O code is fast. The MOV AX,DX parallelizes, so it doesn't cost anything --- having the address in DX might be useful.

PREP_REG_COPY: ; src dst cnt -- cnt ; sets AX=src DX=dst
    pls
    ldr DX
    pls
    ldr AX
    nxt

REG_COPY: ; needs AX=src DX=dst BX=cnt ; moves a datum from (AX) to (DX) post-incrementing, and decrements the cnt
    mov AX,MA
    add 1,AX
    sub 1,BX
    mov BX,CF       ; CF set if BX is non-zero
    ldr LX
    mov DX,MA
    xch DX,BX
    str LX
    add 1,BX
    xch DX,BX
    mov -1,LX       ; this will cause NXT to back up and do the REG_COPY primitive again
    cad LX,IP       ; if CF then move IP + LX to IP
    nxt

Our REG_COPY takes 7 clock cycles per word copied, which is pretty fast considering how many group-M instructions we have. MOV AX,MA and ADD 1,AX and SUB 1,BX all parallelize into one opcode. MOV BX,CF and LDR LX parallelize into one opcode. MOV DX,MA and XCH DX,BX parallelize into one opcode. STR LX and ADD 1,BX parallelize into one opcode. XCH DX,BX and MOV -1,LX parallelize into one opcode. CAD LX,IP doesn't parallelize with anything, so it is an opcode by itself. NXT doesn't parallelize with anything, so it is an opcode by itself. REG_COPY can be used to do very fast block moves! The problem is that it requires that NXT rather than POL be used, so a large block move could delay the POLL code too long. Because of this, it is best to only copy small blocks of data.

SLOW_COPY: ; src dst cnt -- ; copy SRC (post-increment) to DST (post-increment) and decrement CNT
    mov S1,MA
    ldr AX          ; AX= src
    mov S0,MA
    ldr DX          ; DX= dst
    mov AX,MA
    ldr LX
    mov DX,MA
    str LX
    add 1,AX
    mov S1,MA
    str AX
    mov DX,AX
    mov S0,MA
    add 1,AX
    str AX
    sub 1,BX
    pol

The SLOW_COPY primitive keeps the src and dst in memory on the stack, rather than in registers. This takes 13 clock cycles, which is quite a lot. It ends in POL though, so the POLL code is getting executed.

FAST_NIP: ; a b -- b ; only 2 opcodes (compared to 3 for NIP) but doesn't set DX
    pls
    nxt

NIP: ; a b -- b ; hold A in DX
    pls
    ldr LX
    mov LX,DX
    nxt

DROP: ; a -- ; hold A in DX
    mov BX,DX
    drop
    nxt

DUP: ; a -- a a ; hold A in DX
    mov BX,DX
    dup
    nxt

OVER: ; a b -- a b a ; hold B in DX
    mov BX,DX
    dup
    mov S1,MA
    ldr LX
    mov LX,BX
    nxt

SWAP: ; a b -- b a ; hold B in DX
    mov BX,DX
    mov S0,MA
    ldr LX
    str BX
    mov LX,BX
    nxt

ROT: ; a b c -- b c a ; hold C in DX
    mov S0,MA
    ldr LX
    mov LX,AX       ; AX= B
    mov S1,MA
    ldr LX          ; LX= A
    str AX          ; store B at S1
    mov S0,MA
    mov BX,DX
    str BX          ; store C at S0
    mov LX,BX       ; move A to the top-of-stack
    nxt

NROT: ; a b c -- c a b ; hold A in DX
    mov S0,MA
    ldr LX
    mov LX,AX       ; AX= B
    mov S1,MA
    ldr LX          ; LX= A
    str BX          ; store C at S1
    mov S0,MA
    str LX          ; store A at S0 (there is no STR DX)
    mov LX,DX       ; DX= A
    mov AX,BX       ; move B to the top-of-stack
    nxt

Note that NIP and DROP set DX to the discarded value. Also, DUP OVER SWAP ROT and NROT all set DX to the value at S0. We may have primitives that can take advantage of having DX already loaded with this data, and we can save a memory access. Setting DX to this value doesn't cost anything because the MOV BX,DX gets parallelized (the exception is NIP though). Any primitive that sets DX for use in the next primitive must end with NXT rather than POL, so that DX doesn't get clobbered. Note that SWAP is more efficient than OVER --- this is backwards from most Forth systems, in which SWAP and ROT are slow. This is because the TOYF has plenty of registers, and instructions that use different registers can parallelize.
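As a sketch (these two primitives are illustrative and not part of the listings above --- the labels and the exact coding are assumptions), ordinary fetch and store could be written with the instructions from Section 2:

FETCH: ; adr -- val ; this is: @
    mov BX,MA       ; MA= adr
    ldr LX          ; LX= the value at (adr)
    mov LX,BX       ; replace the address with the value
    pol

STORE: ; val adr -- ; this is: !
    mov S0,MA       ; MA= the address of the second-of-stack (val)
    ldr LX          ; LX= val
    mov BX,MA       ; MA= adr
    str LX          ; store val at (adr)
    pls             ; discard val's cell from the single-stack
    drop            ; discard adr; this loads the new top-of-stack into BX
    pol

STORE is almost entirely group-M instructions, so it hardly parallelizes --- this is the "code that primarily moves data around in memory" case mentioned earlier. When the address is outside of zero-page, the external-memory clock cycles from Section 1 get added on as well.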
PLUS: ; a b -- a+b
    pls
    ldr LX
    add LX,BX
    nxt

MINUS: ; a b -- a-b
    neg BX
    pls
    ldr LX
    add LX,BX
    nxt

FAST_PLUS: ; a b -- a+b ; assumes A is in DX already
    pls             ; nip the A value that we already have in DX
    mov BX,AX
    add AD,BX
    nxt

FAST_MINUS: ; a b -- a-b ; assumes A is in DX already
    pls             ; nip the A value that we already have in DX
    neg BX
    mov BX,AX
    add AD,BX
    nxt

Our PLUS and MINUS primitives are typically compiled for + and - in the source-code. Our FAST_PLUS and FAST_MINUS primitives are compiled when we know that the A value is already in DX. Note that the PLS parallelizes, so it doesn't cost anything. PLUS and MINUS are each three clock cycles, whereas FAST_PLUS and FAST_MINUS are two and three clock cycles respectively. This isn't much of an improvement, so it is hardly worth the effort. An optimizing Forth compiler can generate slightly better code than a traditional non-optimizing Forth compiler. The TOYF is pretty efficient though, so the non-optimizing Forth compiler should generally be adequate.

Even if a primitive is long, if it parallelizes well, then it will take few clock cycles to execute. In most primitives there is at least some parallelization. Opcodes packing two instructions halve the code size and double the speed --- this is pretty common. Opcodes packing three instructions are, of course, super-efficient. :-)
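One more hedged sketch (the label AND and its coding are illustrative, not from the listings above): the logic operations live in group-A, so a bitwise AND primitive has to route through AX, and putting the MOV BX,AX early lets it share an opcode with the PLS:

AND: ; a b -- a-and-b
    pls             ; MA= the address of A; pop its cell
    mov BX,AX       ; AX= B (this parallelizes with the PLS)
    ldr LX          ; LX= A
    and LX,AX       ; AX= A -and- B
    mov AX,BX       ; BX= the result
    nxt

The PLS and MOV BX,AX can share an opcode, the LDR LX and the AND LX,AX each need their own, and the MOV AX,BX packs with the NXT --- about four opcodes for six instructions, with the assembler deciding the final packing.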