TOYF processor design --- by Hugh Aguilar --- copyright October 2017
contact: hughaguilar96@gmail.com

Abstract:

The TOYF (Toy Forth) processor is less complicated than my 65ISR processor design, and more Forth-centric. The intention is that the TOYF would be easier to implement in HDL than the 65ISR. The TOYF is somewhat similar to the MiniForth at Testra, which is also 16-bit Harvard (built on the Lattice isp1048 PLD). The MiniForth was later upgraded to an FPGA and the name changed to RACE. I wrote the MFX, the development system for the RACE --- the TOYF development system is based on MFX.

The TOYF has 125 instructions, which is quite a lot, but they are all simple single-cycle instructions that don't do much. The TOYF does out-of-order execution, so as many as three instructions can execute in one clock cycle.

The 65ISR has instructions that work with 1-bit data, so it is a much better choice for machines with a lot of switches. The 65ISR-abu addresses 16MB of memory, so it is a much better choice for applications that need to buffer entire files. The TOYF is mostly a fun processor for Forth enthusiasts, but it could be used in applications that have minimal I/O. The TOYF doesn't actually have interrupts --- it polls the I/O after executing each primitive in the main program. The TOYF would be practical as a micro-controller if it had a 65ISR-chico as a coprocessor to handle the I/O interrupts. The TOYF provides better support for the main program than the 65ISR does because it has Forth's traditional two stacks.

Section 1.)
the architecture

We have the following registers (those that access data-memory have an Rx alternate name):

CF    1-bit  carry flag             CF should be 0 prior to SBC of low bytes (like the MC6800, not the 6502)
VF    1-bit  overflow flag
PC    16-bit program counter        this is the machine-code pointer (the only register that points to code-memory)
AX R0 16-bit general purpose
BX R1 16-bit general purpose        this is the top value of the single-stack
CX R2 16-bit counter                for the LOOP primitive
DX R3 16-bit general purpose        this is sometimes used to pass data between primitives
IP R4 16-bit instruction pointer    this is the Forth threaded-code pointer
LF R5 5-bit  local frame pointer    high-byte is $00, low-byte is 111xxxxx --- this is the local-frame pointer
RS R6 5-bit  return stack pointer   high-byte is $00, low-byte is 111xxxxx --- this is the return-stack pointer
SS R7 5-bit  single stack pointer   high-byte is $00, low-byte is 110xxxxx --- this is the data-stack for single-cell data
LX R8 16-bit primarily for use by LDR    most memory reads use LX as the destination
MA R9 16-bit memory access pointer       all memory loads and stores use MA as the address

The processor is Harvard architecture. Code-memory and data-memory can be accessed concurrently in the same clock cycle. Both have a 16-bit data-bus and are word-addressable. The two stacks are both in zero-page. The RS stack is in the upper 32 words and the SS stack is in the 32 words below the RS stack. I/O ports and important global variables can be in the lower 192 words of zero-page. Presumably the code-memory (32KW max) would be internal to the FPGA so that it is fast. If zero-page RAM is internal to the FPGA, it should be fast too. When MA points outside of zero-page, some extra clock cycles are added on for the access to external memory. I would expect maybe 100 MHz normally, but dropping down to 25 MHz when accessing external data-memory.
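The bit patterns above pin the two stacks to fixed zero-page windows. Here is a quick Python check of the address decoding --- a sketch based on the stated patterns; the wrap-around on the first push is my inference from the 5-bit pointer width:

```python
def rs_address(rs):
    """RS/LF decode: high-byte $00, low-byte 111xxxxx -> words $E0..$FF."""
    return 0xE0 | (rs & 0x1F)

def ss_address(ss):
    """SS decode: high-byte $00, low-byte 110xxxxx -> words $C0..$DF."""
    return 0xC0 | (ss & 0x1F)

def phr(rs):
    """PHR: move RS - 1 to RS (5-bit, so the first push from RS=0 wraps)."""
    return (rs - 1) & 0x1F

# The return-stack occupies the top 32 words of zero-page,
# the single-stack the 32 words just below it.
assert (rs_address(0), rs_address(31)) == (0xE0, 0xFF)
assert (ss_address(0), ss_address(31)) == (0xC0, 0xDF)
assert rs_address(phr(0)) == 0xFF    # the first push lands in the top word
```

The lower 192 words ($00..$BF) are left for I/O ports and important globals.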
There are no instructions that access two adjacent memory locations, so the processor is neither little-endian nor big-endian. As a convention though, little-endian should be used, except for the Forth data stack, which is traditionally big-endian.

Code-memory has a 15-bit address-bus (32KW). The cfa has the high-bit clear for primitives, so 32KW is all we can address. The data-bus is 16-bit. I was originally going to have 24-bit wide opcodes, but then I found 16-bit to be adequate. If the TOYF is to be upgraded, however, the opcodes will have to be made wider to accommodate some more groups of instructions.

Data-memory has a 16-bit data-bus (64KW). We address words, not bytes! MOV 1,AX ADD AX,BX advances BX to the next 16-bit word. Forth code must be in the upper 32KW because the cfa has the high-bit set for colon words. Code-memory and data-memory are 32KW and 64KW respectively --- or 64KB and 128KB in terms of bytes.

Each opcode has fields, with an instruction in each field. All of these instructions execute concurrently in one clock cycle. The assembler rearranges the instructions and packs them into the fields, inserting NOP instructions as necessary. The goal is to minimize how many NOP instructions get inserted while guaranteeing that the program does the same thing as it would if we had one instruction per opcode in the same order that the instructions appeared in the source-code. The opcode has these fields:

MMMMMAAAAAABBBBB
FEDCBA9876543210    these are the bit positions of the fields

There are three fields: MMMMM AAAAAA BBBBB. We are executing up to three instructions per clock cycle. To help the assembler pack instructions into opcodes efficiently (minimum NOP instructions inserted), the programmer should hold data in a register for a while before using that register. Don't modify a register and then immediately use that register for something. Try to interleave instructions that modify different registers, as these instructions can parallelize.
This is very similar to Pentium programming, in which UV-either and V-specific instructions were interleaved. The TOYF is more complicated though, because there are three "pipes" (opcode fields) rather than just two.

Not counting NOP, there are 31 MMMMM instructions, 63 AAAAAA instructions, and 31 BBBBB instructions. So, the TOYF has 125 instructions total, which is a lot, but this shouldn't make implementation unreasonably difficult. The TOYF has simple instructions that execute in one clock cycle, whereas the 65ISR has complicated multi-cycle instructions. These are completely different designs --- the TOYF is RISC and the 65ISR is CISC --- the 65ISR is a throwback to the 1980s. The TOYF is a throwback to the 1990s (it is similar to the MiniForth that Testra came out with in 1995).

Upon start-up, RS and SS are set to zero. They can't be modified except by being incremented and decremented.

Section 2.)
the instruction set

In the following instruction descriptions, the | separates events that are done concurrently.
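The opcode layout and the assembler's packing problem can be sketched in Python. This is a simplified model --- the real assembler also reorders instructions and checks data dependencies; the function names here are illustrative:

```python
def pack(m, a, b):
    """Pack one group-M, group-A, and group-B instruction into a 16-bit
    opcode, using the MMMMMAAAAAABBBBB bit layout given above."""
    assert 0 <= m < 32 and 0 <= a < 64 and 0 <= b < 32
    return (m << 11) | (a << 5) | b

def unpack(op):
    """Split a 16-bit opcode back into its three fields."""
    return (op >> 11) & 0x1F, (op >> 5) & 0x3F, op & 0x1F

def count_opcodes(groups):
    """Greedy in-order packer: each opcode holds at most one instruction
    per field, so a new opcode starts whenever a field is already taken
    (the vacant fields get NOPs)."""
    opcodes, used = 0, set()
    for g in groups:
        if not used or g in used:
            opcodes += 1
            used = set()
        used.add(g)
    return opcodes

assert unpack(pack(3, 17, 9)) == (3, 17, 9)
assert pack(0, 0, 0) == 0                        # NOP in every field
# Interleaved groups pack tightly ...
assert count_opcodes(['M', 'A', 'B', 'M', 'A', 'B']) == 2
# ... while a run of same-group instructions cannot parallelize at all.
assert count_opcodes(['M', 'M', 'M']) == 3
```

This is why the advice above says to interleave instructions that modify different registers: same-group runs serialize.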
These are group-M instructions that modify LX PC MA IP LF RS SS:

nop
mov AX,LX   ; move AX to LX             ; this allows EXECUTE and QEXECUTE to work
ldr LX      ; load (MA) into LX
ldr IP      ; load (MA) into IP
ldr LF      ; load (MA) into LF
str AX      ; store AX to (MA)          ; the CX DX and LF registers can't be stored, so move to AX then do: STR AX
str BX      ; store BX to (MA)
str LX      ; store LX to (MA)
mov RS,LF   ; move RS to LF
mov LF,RS   ; move LF to RS
add AX,RS   ; move RS + AX to RS        ; this discards some items from the return-stack
mov IP,MA   ; move IP to MA
mov AX,IP   ; move AX to IP             ; this initializes IP in the FULL code
cad LX,IP   ; if CF then move IP + LX to IP
add 1,IP    ; move IP + 1 to IP
pol         ; move POLL to PC
nxt         ; move IP + 1 to IP and to MA | move NEXT to PC
isr         ; if CF then move FULL to PC
nxp         ; if high-bit of LX then move LX to PC (note that PC is 15 bits, so the high bit is discarded)
nxf         ; store IP to (MA) | move LX to IP and to MA | move NEXT to PC
mov AX,MA   ; move AX to MA
mov BX,MA   ; move BX to MA
mov DX,MA   ; move DX to MA
mov LF,MA   ; move LF to MA             ; this is the address of the parent's LF under the local variables
lcl AX,MA   ; move LF + AX to MA        ; if AX is the local offset, sets MA to the local's address
mov S0,MA   ; move SS to MA             ; this is the address of the second-of-stack (BX is top-of-stack)
mov S1,MA   ; move SS + 1 to MA         ; this is the address of the third-of-stack
mov S2,MA   ; move SS + 2 to MA         ; this is the address of the fourth-of-stack
phr         ; move RS - 1 to RS and to MA
plr         ; move RS to MA | move RS + 1 to RS
phs         ; move SS - 1 to SS and to MA
pls         ; move SS to MA | move SS + 1 to SS

These are group-A instructions that modify AX CX:

nop
mov LX,AX   ; move LX to AX
mov BX,AX   ; move BX to AX
mov DX,AX   ; move DX to AX
mov LF,AX   ; move LF to AX             ; this is needed because we don't have a STR LF instruction
sum BX,AX   ; if low-bit of DX then move AX + BX to AX
add BX,AX   ; move AX + BX to AX
sub BX,AX   ; move AX - BX to AX
add DX,AX   ; move AX + DX to AX
and BX,AX   ; move BX -and- AX to AX
ior BX,AX   ; move BX -ior- AX to AX
xor BX,AX   ; move BX -xor- AX to AX
add 1,AX    ; move AX + 1 to AX
sub 1,AX    ; move AX - 1 to AX
not AX      ; move -1 -xor- AX to AX
neg AX      ; move 0 - AX to AX
shl AX      ; shift AX left, move 0 into low-bit
add LX,AX   ; move AX + LX to AX
sub LX,AX   ; move AX - LX to AX
and LX,AX   ; move LX -and- AX to AX
ior LX,AX   ; move LX -ior- AX to AX
xor LX,AX   ; move LX -xor- AX to AX
swp AX      ; exchange low byte and high byte in AX  ; useful for accessing bytes
byt AX      ; move AX -and- $FF to AX                ; useful for masking a byte
mov LX,CX   ; move LX to CX
xch CX,AX   ; exchange CX and AX values              ; also called: XCH AX,CX
sub 1,CX    ; move CX - 1 to CX
mov -1,AX   ; move -1 to AX
mov 0,AX    ; move 0 to AX              ; 0, 1, etc. are small constants
mov 1,AX    ; move 1 to AX
mov 2,AX    ; move 2 to AX              ; use NEG AX to get -2, etc.
mov 3,AX    ; move 3 to AX
mov 4,AX    ; move 4 to AX
mov 5,AX    ; move 5 to AX
mov 6,AX    ; move 6 to AX
mov 7,AX    ; move 7 to AX
mov 8,AX    ; move 8 to AX
mov 9,AX    ; move 9 to AX
mov 10,AX   ; move 10 to AX             ; 10, 20, etc. are useful as the base to add a value, for a sum up to 99
mov 20,AX   ; move 20 to AX
mov 30,AX   ; move 30 to AX
mov 40,AX   ; move 40 to AX
mov 50,AX   ; move 50 to AX
mov 60,AX   ; move 60 to AX
mov 70,AX   ; move 70 to AX
mov 80,AX   ; move 80 to AX
mov 90,AX   ; move 90 to AX
mov 100,AX  ; move 100 to AX            ; 100, 200, etc. are useful as the base to add a value, for large sums
mov 200,AX  ; move 200 to AX            ; SHL AX to get 400; SHL AX SHL AX to get 800
mov 300,AX  ; move 300 to AX            ; SHL AX to get 600; MOV AX,LX SHL AX ADD LX,AX to get 900
mov 500,AX  ; move 500 to AX            ; SHL AX to get 1000, etc.
mov 700,AX  ; move 700 to AX
mov $10,AX  ; move $10 to AX            ; 1, 2, 4, 8, $10, $20, etc. are useful as bit masks for 1-bit I/O ports
mov $20,AX  ; move $20 to AX
mov $40,AX  ; move $40 to AX
mov $80,AX  ; move $80 to AX
mov $100,AX  ; move $100 to AX
mov $200,AX  ; move $200 to AX
mov $400,AX  ; move $400 to AX
mov $800,AX  ; move $800 to AX
mov $1000,AX ; move $1000 to AX
mov $2000,AX ; move $2000 to AX
mov $4000,AX ; move $4000 to AX
mov $8000,AX ; move $8000 to AX

These are group-B instructions that modify BX DX VF CF:

nop
mov LX,DX   ; move LX to DX
mov AX,DX   ; move AX to DX
mov BX,DX   ; move BX to DX
add AX,DX   ; move AX + DX to DX
add AD,BX   ; move AX + DX to BX
mov LX,BX   ; move LX to BX
mov AX,BX   ; move AX to BX
mov DX,BX   ; move DX to BX
mov MA,BX   ; move MA to BX
xch DX,BX   ; exchange DX and BX values              ; also called: XCH BX,DX
neg BX      ; move 0 - BX to BX
mov CF,BX   ; move CF to BX             ; if CF then sets BX to TRUE (TRUE is 1, not the traditional -1)
flg BX      ; if BX<>0 then move 1 to BX             ; normalizes a nonzero value into a TRUE flag
sgn DX,BX   ; if BX < 0 then move 0 - BX to BX | if DX < 0 then move 0 - DX to DX | set VF to the -xor- of the original sign bits
vng BX      ; if VF then move 0 - BX to BX           ; used by the SMULT primitive to fix the sign afterward
adj DX,BX   ; shift DX right, leave high-bit as is, move low-bit into CF | shift BX left, move 0 into low-bit
ror BX      ; shift BX right, move CF into high-bit, move low-bit into CF
rol BX      ; shift BX left, move CF into low-bit, move high-bit into CF
add LX,BX   ; move BX + LX to BX, move carry into CF, set VF if BX and LX had the same high-bit and it changes
adc AX,BX   ; move AX + BX + CF to BX, move carry into CF, set VF if BX and AX had the same high-bit and it changes
sbc AX,BX   ; move BX - AX - CF to BX, move borrow into CF, set VF if BX and AX had the same high-bit and it changes
abs BX      ; if BX < 0 then move 0 - BX to BX | move original BX high-bit to CF
mov 0,CF    ; move 0 to CF
mov VF,CF   ; move VF to CF
not CF      ; move -not- CF to CF
and DX,CF   ; move DX<>0 -and- CF to CF              ; DX<>0 is an -ior- of all the bits in DX
ior DX,CF   ; move DX<>0 -ior- CF to CF              ; DX<>0 is an -ior- of all the bits in DX
not BX,CF   ; move ~(BX<>0) to CF      ; set CF if BX is zero     ; used in 0BRANCH
flg CX,CF   ; move (CX<>0) to CF       ; set CF if CX is nonzero  ; used in LOOP
mov BN,CF   ; move BX high-bit to CF            ; set CF if BX is negative
mov BL,CF   ; move BX high-bit -xor- VF to CF   ; set CF if SBC AX,BX indicates BX < AX (signed)

Here is how a signed comparison is coded (with X under Y on the stack):

mov 0,CF
pls
ldr LX
mov LX,AX   ; AX=X BX=Y
sbc AX,BX   ; BX= Y-X
mov BL,CF   ; CF set if Y < X

macro ZTOR  ; this is: 0>R
mov 0,AX
phr
str AX
endm

macro TOR   ; this is: >R
phr
str BX      ; push the item to the return-stack
drop        ; remove the item from the single-stack
endm

RFETCH: ; -- val    ; this is: R@
dup
plr
ldr LX
mov LX,BX
phr         ; this just undoes the PLR earlier
pol

Note that we don't have interrupts, so there is no danger of an interrupt occurring between PLR and PHR to clobber the stack.

ZERO8: ; --
ztor
ZERO7: ; --
ztor
ZERO6: ; --
ztor
ZERO5: ; --
ztor
ZERO4: ; --
ztor
ZERO3: ; --
ztor
ZERO2: ; --
ztor
ZERO1: ; --
ztor
pol

ENTER8: ; a b c d e f g h --
tor
ENTER7: ; a b c d e f g --
tor
ENTER6: ; a b c d e f --
tor
ENTER5: ; a b c d e --
tor
ENTER4: ; a b c d --
tor
ENTER3: ; a b c --
tor
ENTER2: ; a b --
tor
ENTER1: ; a --
tor
ENTER0: ; --
mov LF,AX
phr
str AX      ; push the old LF to the return-stack
mov RS,LF   ; set the LF to the new local-frame
pol

At the front of a colon word that has locals, we optionally have a ZEROx primitive for any zero-initialized locals. Then we have a mandatory ENTERx primitive for the stack-initialized locals (use ENTER0 if there aren't any). Any number of locals can be supported. A maximum of 10 seems reasonable --- it is bad Forth style to use a lot of locals. It is okay for each TOR to fall into the next one, because they aren't going to parallelize (they're group-M instructions). It is also okay for TOR to fall into ENTER0 because the MOV LF,AX will parallelize with the PHR anyway. All the other instructions are group-M and won't parallelize with each other.
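The SBC AX,BX / MOV BL,CF condition used for signed comparison can be modeled in Python. This is a sketch: it uses the conventional two's-complement borrow and subtraction-overflow rules, which is my reading of the instruction descriptions; the function names are illustrative:

```python
MASK, SIGN = 0xFFFF, 0x8000

def sbc(ax, bx, cf):
    """SBC AX,BX: BX - AX - CF; returns (result, borrow, signed overflow)."""
    raw = bx - ax - cf
    res = raw & MASK
    borrow = 1 if raw < 0 else 0
    # signed overflow for subtraction: the operands had different
    # high-bits and the result's high-bit differs from BX's
    vf = 1 if ((bx ^ ax) & SIGN) and ((bx ^ res) & SIGN) else 0
    return res, borrow, vf

def bl(res, vf):
    """MOV BL,CF: BX high-bit -xor- VF = signed less-than after SBC."""
    return (1 if res & SIGN else 0) ^ vf

def signed_less(y, x):
    """The comparison sequence: MOV 0,CF, AX=X, BX=Y, SBC AX,BX, MOV BL,CF."""
    res, _, vf = sbc(x & MASK, y & MASK, 0)
    return bl(res, vf) == 1

assert signed_less(-5, 3)              # Y < X
assert not signed_less(3, -5)
assert signed_less(-32768, 32767)      # correct even where subtraction wraps
```

The -xor- with VF is what makes the comparison valid across the full signed range; the high-bit alone (MOV BN,CF) would get the wrapping cases wrong.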
In code that primarily moves data around in memory, all the instructions are going to be group-M (the 'M' means memory). You get better parallelization when there are calculations being done with AX BX and DX --- these parallelize with group-M. Fortunately, there is usually a lot of calculation being done (that is where the term "computer" comes from).

LEAVE1: ; --    ; assumes that there is one local on the stack
mov 1,AX
plr
ldr LF      ; restore the old LF
add AX,RS   ; discard the local variable

EXIT: ; --
plr
ldr IP      ; restore the old IP
pol

Our EXIT primitive is for simple colon words. Our LEAVEx primitive is for colon words that started with ENTERx. We need LEAVE2 LEAVE3 ... LEAVE10 available, assuming that there is a maximum of 10 locals (ZEROx and ENTERx combined).

QEXIT: ; --     ; assumes that this is a quotation called with QEXECUTE
plr
ldr IP      ; pull old IP
plr
ldr LF      ; pull HOF's LF
pol

Our QEXIT primitive is for quotations. This is like EXIT except that it restores the HOF's LF too.

LOCAL1_ADR: ; -- adr
mov 1,AX    ; AX= offset to local
dup
lcl AX,MA
mov MA,BX
pol

LOCAL1_FETCH: ; -- val
mov 1,AX    ; AX= offset to local
dup
lcl AX,MA
ldr LX
mov LX,BX
pol

LOCAL1_STORE: ; val --
mov 1,AX    ; AX= offset to local
lcl AX,MA
str BX
drop
pol

LOCAL1_PLUS_STORE: ; val --
mov 1,AX    ; AX= offset to local
lcl AX,MA
ldr LX
add LX,BX
str BX
drop
pol

L1..L10 can be done like this efficiently. Locals above this would need some arithmetic. Note that there is no L0 because at LF we have the old LF value, not a local variable.

LOCAL11_ADR: ; -- adr
mov 10,AX
mov AX,DX
dup
mov 1,AX
add DX,AX   ; AX= offset to local
lcl AX,MA
mov MA,BX
pol

Our LOCAL11_ADR is an example of the arithmetic needed to get above L10. This isn't very efficient, but it should be rare that anybody has this many locals. Most likely, I will just set 10 as the limit of how many locals the compiler supports.
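The ENTERx / LOCALx / LEAVEx frame discipline can be sketched as a toy Python model. The starting RS and LF values are hypothetical, and I am assuming L1 is the first-listed (deepest) stack-initialized local, which follows from TOR popping the top of the data stack first:

```python
mem = {}
RS, LF = 32, 0          # hypothetical start-up values for the model

def phr(val):
    """PHR then STR: decrement RS, then store at (MA)."""
    global RS
    RS -= 1
    mem[RS] = val

def enter(args):
    """ENTERx: TOR each argument, push the parent's LF, point LF at the frame."""
    global LF
    for a in reversed(args):    # TOR pops the top of the data stack first
        phr(a)
    phr(LF)                     # MOV LF,AX / PHR / STR AX
    LF = RS                     # MOV RS,LF

def local_fetch(n):
    """LOCALn_FETCH: LCL AX,MA with AX = n, then LDR.  There is no L0 ---
    the word at LF itself holds the parent's saved LF."""
    return mem[LF + n]

def leave(nlocals):
    """LEAVEx: PLR / LDR LF restores the parent's frame; ADD AX,RS discards."""
    global LF, RS
    LF = mem[RS]
    RS += 1 + nlocals

enter([10, 20])             # a b -- with a=10, b=20
assert local_fetch(1) == 10 and local_fetch(2) == 20
leave(2)
assert (RS, LF) == (32, 0)  # frame fully unwound
```

Nested calls work the same way: each frame's base word links back to the parent's LF, so LEAVEx unwinds one frame at a time.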
LOCAL1_PRE_INC_FETCH: ; -- val
mov 1,AX    ; AX= offset to local
dup
lcl AX,MA
ldr LX
mov LX,BX
mov 1,AX
add AX,BX
str BX
pol

LOCAL1_POST_INC_FETCH: ; -- val
mov 1,AX    ; AX= offset to local
dup
lcl AX,MA
ldr LX
mov LX,BX
mov LX,AX
add 1,AX
str AX
pol

Our LOCAL1_PRE_INC_FETCH and LOCAL1_POST_INC_FETCH are examples of using local variables as pointers. Doing a pre-decrement or post-decrement would be similar.

LOCAL1_DX_ADR: ; sets DX to the address
xch DX,BX
mov 1,AX    ; AX= offset to local
lcl AX,MA
mov MA,BX
xch DX,BX
nxt

LOCAL1_DX_FETCH: ; sets DX to the value
mov 1,AX    ; AX= offset to local
lcl AX,MA
ldr LX
mov LX,DX
nxt

LOCAL1_DX_STORE: ; stores DX to the local
mov 1,AX    ; AX= offset to local
lcl AX,MA
mov DX,AX   ; DX can't be stored directly, so move to AX
str AX
pol

LOCAL1_DX_PLUS_STORE: ; adds DX to the local
mov 1,AX    ; AX= offset to local
lcl AX,MA
mov DX,AX
ldr LX
add LX,AX
str AX
pol

Some primitives leave a value in DX rather than push it to the stack. Above we have LOCAL1_DX_ADR and LOCAL1_DX_FETCH that leave the value in DX and end in NXT. We also have LOCAL1_DX_STORE and LOCAL1_DX_PLUS_STORE that use DX.

We need to do POL reasonably often, because otherwise we risk losing I/O data. The danger is that we may have several short primitives that end in NXT all in a row, and the overall time will be too long. If this danger is problematic, the compiler can be made smart enough to deal with it. We can have two versions of most of the primitives, one that ends in NXT and the other that ends in POL. The compiler can choose which to use based on a count of how many opcodes we have had since the last POL. This is complicated --- maybe worry about that sometime in the future.

In MFX (for the MiniForth) I had "ghost primitives." For these I saved the source-code, but didn't assemble them into machine-code. The first time they were used, they were assembled into machine-code. Afterward, if they were used again, I had the cfa already. All of these primitives that access local variables should be ghost primitives.
Dozens or hundreds of them can be meta-compiled, but they won't use code-space unless they are actually used. Most programs won't need the primitives that are specific to more than four local variables.

LIT0: ; -- 0
mov 0,AX    ; AX= value
phs
str BX      ; the PHS and STR BX are the DUP macro
mov AX,BX
nxt

The constants 0 1 2 3 4 5 6 7 8 9 10 are easy and fast because the code can be written as above. There are three opcodes total, despite having five instructions. The MOV 0,AX and PHS parallelize into one opcode. The STR BX and MOV AX,BX parallelize into one opcode. The NXT doesn't parallelize with anything, so it is an opcode by itself.

LIT29: ; -- 29
mov 20,AX   ; AX= big part of value
phs
str BX      ; the PHS and STR BX are the DUP macro
mov AX,DX
mov 9,AX    ; AX= small part of value
add AD,BX   ; BX= value
nxt

LIT29 parallelizes pretty well. There are three opcodes total, despite having seven instructions. The MOV 20,AX and PHS parallelize into one opcode. The STR BX and MOV AX,DX and MOV 9,AX all parallelize into one opcode. The ADD AD,BX and NXT parallelize into one opcode. Parallelization is good! This is what makes a processor fast --- instructions executing in parallel. Both LIT0 and LIT29 got a NXT because they are so short that we don't need to poll I/O afterward.

LIT0_DX: ; sets DX= 0
mov 0,AX    ; AX= value
mov AX,DX
nxt

LIT29_DX: ; sets DX= 29
mov 20,AX   ; AX= big part of value
mov AX,DX
mov 9,AX    ; AX= small part of value
add AX,DX
nxt

Our LIT0_DX and LIT29_DX leave the value in DX rather than push it to the stack. These get a NXT out of necessity, because POL would execute code that might clobber the DX register. LIT29_DX has three opcodes: MOV 20,AX doesn't parallelize with anything, so it is an opcode by itself. MOV AX,DX and MOV 9,AX parallelize into one opcode. ADD AX,DX and NXT parallelize into one opcode. This is no better than what LIT29 had, which was also three opcodes.
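The two-part constant idiom in LIT29, and the SHL tricks noted in the group-A listing, can be checked in Python --- each line models one TOYF instruction (a sketch of the arithmetic, not of the opcode packing):

```python
def lit29():
    ax = 20                    # MOV 20,AX   (big part of value)
    dx = ax                    # MOV AX,DX
    ax = 9                     # MOV 9,AX    (small part of value)
    return ax + dx             # ADD AD,BX

def lit900():
    ax = 300                   # MOV 300,AX
    lx = ax                    # MOV AX,LX
    ax = (ax << 1) & 0xFFFF    # SHL AX  -> 600
    return ax + lx             # ADD LX,AX -> 900

assert lit29() == 29
assert lit900() == 900
```

So any constant of the form base + digit, or a small multiple of a loadable base, costs only two or three group-A instructions, which usually parallelize with the DUP's group-M instructions.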
Parallelization works better in long primitives because more is being done, and more registers are being used. In LIT29 there was a DUP, which included two group-M instructions, both of which parallelized with group-A and group-B. In LIT29_DX we have only group-A and group-B instructions, plus NXT which is group-M, so fewer registers are being used. Sometimes an optimization, such as passing data in registers rather than in memory, doesn't help at all. LIT29_DX might still be useful though, because the following primitive that uses DX may be faster than if it used the stack.

LIT: ; -- val
add 1,IP
mov IP,MA
ldr LX
dup
mov LX,BX
nxt

LIT_PLUS: ; n -- n+val
add 1,IP
mov IP,MA
ldr LX
add LX,BX
nxt

LIT and LIT_PLUS are generic for any constant. They each use two cells in the threaded code.

LIT_DX: ; sets DX= val
add 1,IP
mov IP,MA
ldr LX
mov LX,DX
nxt

LIT_PLUS_DX: ; n --     ; sets DX= n+val
mov BX,AX   ; move BX to AX before we DROP
pls
ldr LX
mov LX,BX   ; the PLS, LDR LX, and MOV LX,BX are the DROP macro
add 1,IP
mov IP,MA
ldr LX
add LX,AX
mov AX,DX
nxt

Our LIT is six opcodes and our LIT_DX is four opcodes, so we have an improvement. Our LIT_PLUS is four opcodes and our LIT_PLUS_DX is seven opcodes, which is much worse. There is very little parallelization. Optimization is not always an improvement!

PR0_FETCH: ; -- val     ; this is pseudo-register 0  ; leaves the address in DX
mov 0,AX    ; AX= adr
dup
mov AX,DX
mov AX,MA
ldr LX
mov LX,BX
nxt

PR0_STORE: ; val --     ; this is pseudo-register 0  ; leaves the address in DX
mov 0,AX    ; AX= adr
mov AX,DX
mov AX,MA
str BX
drop
nxt

Global variables are slow and bloaty because we need to put the address in the threaded code and load it with LIT. We can have pseudo-registers in low memory that are much faster. PR0_FETCH is seven opcodes and PR0_STORE is six opcodes. If we have primitives that are specific to these pseudo-registers, they should be quite fast. It is also possible to put important I/O ports in these low addresses so that the I/O code is fast.
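The way LIT and LIT_PLUS consume their extra cell can be sketched with a toy inner interpreter in Python. This is an illustrative model of the threaded-code walk, not of the NEXT/FULL machine code itself; IP here indexes the current cell, and the trailing increment plays the role of NXT:

```python
def run(threaded, stack):
    """Execute a list of threaded-code cells.  A literal's value occupies
    the cell after the primitive, exactly as in LIT / LIT_PLUS above."""
    ip = 0
    while ip < len(threaded):
        op = threaded[ip]
        if op == 'LIT':            # ADD 1,IP / MOV IP,MA / LDR LX / DUP / MOV LX,BX
            ip += 1
            stack.append(threaded[ip])
        elif op == 'LIT_PLUS':     # ADD 1,IP / MOV IP,MA / LDR LX / ADD LX,BX
            ip += 1
            stack[-1] += threaded[ip]
        ip += 1                    # NXT: move IP + 1 to IP

stack = []
run(['LIT', 29, 'LIT_PLUS', 13], stack)
assert stack == [42]
```

This is why each generic literal costs two cells in the threaded code, whereas the dedicated LIT0..LIT29-style primitives cost only one.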
The MOV AX,DX parallelizes, so it doesn't cost anything --- having the address in DX might be useful.

PR0_FETCH_DX: ; sets DX= value  ; this is pseudo-register 0
mov 0,AX    ; AX= adr
mov AX,MA
ldr LX
mov LX,DX
nxt

PR0_STORE_DX: ; stores DX to pseudo-register 0
mov 0,AX    ; AX= adr
mov AX,MA
mov DX,AX
str AX
pol

Our PR0_FETCH_DX and PR0_STORE_DX use DX rather than the stack. The compiler has to do some optimization. Consider this code sequence:

0 @ 1 !

This compiles into two primitives:

PR0_FETCH_DX PR1_STORE_DX

This is just peephole optimization --- it is easy to make a compiler generate code like this. Note that in PR0_STORE_DX the MOV DX,AX parallelizes, so our lack of a STR DX instruction doesn't hurt us.

MOVE_POST_INC: ; src dst --     ; needs DX=src CX=count
sub 1,CX    ; decrement count
flg CX,CF   ; CF= nonzero?
mov 1,AX    ; AX= value to be added to the SRC and DST pointers
mov DX,MA
ldr LX
not CF      ; CF= zero?
mov BX,MA
str LX
mov -1,AX
mov AX,LX   ; LX= offset back to this primitive
add AX,DX
add AX,BX
cad LX,IP   ; if CF then move IP + LX to IP
pol

Our MOVE_POST_INC primitive moves a block of data using post-increment. This parallelizes into eight opcodes. This works for large blocks, as we do POL after every word moved. SUB 1,CX and FLG CX,CF parallelize. MOV 1,AX and MOV DX,MA parallelize. LDR LX and NOT CF parallelize. MOV BX,MA doesn't parallelize. STR LX and MOV -1,AX parallelize. MOV AX,LX and ADD AX,DX parallelize. ADD AX,BX and CAD LX,IP parallelize. POL doesn't parallelize.

MOVE_PRE_DEC: ; src dst --      ; needs DX=src CX=count
sub 1,CX    ; decrement count
flg CX,CF   ; CF= nonzero?
mov -1,AX   ; AX= value to be added to the SRC and DST pointers
add AX,DX
add AX,BX
mov DX,MA
ldr LX
not CF      ; CF= zero?
mov BX,MA
str LX
mov AX,LX   ; LX= offset back to this primitive
cad LX,IP   ; if CF then move IP + LX to IP
pol

Our MOVE_PRE_DEC primitive moves a block of data using pre-decrement. This parallelizes into nine opcodes. SUB 1,CX doesn't parallelize. FLG CX,CF and MOV -1,AX parallelize.
ADD AX,DX doesn't parallelize. ADD AX,BX and MOV DX,MA parallelize. LDR LX and NOT CF parallelize. MOV BX,MA doesn't parallelize. STR LX and MOV AX,LX parallelize. CAD LX,IP doesn't parallelize. POL doesn't parallelize.

Our MOVE_POST_INC and MOVE_PRE_DEC primitives are somewhat slow. They are heavy on group-M instructions because they are moving data between memory locations. Adding -1 to IP with CAD LX,IP is a trick for making a primitive iterate over itself.

FAST_NIP: ; a b -- b    ; only 2 opcodes (compared to 3 for NIP) but doesn't set DX
pls
nxt

NIP: ; a b -- b         ; holds A in DX
pls
ldr LX
mov LX,DX
nxt

FAST_RIP: ; a b c -- b c    ; 5 opcodes (compared to 6 for RIP) but doesn't set DX
pls
ldr LX
mov S0,MA
str LX
nxt

RIP: ; a b c -- b c     ; holds A in DX
pls
ldr LX
mov LX,AX   ; AX= B
mov S0,MA
ldr LX
mov LX,DX   ; DX= A
str AX
nxt

DROP: ; a --            ; holds A in DX
mov BX,DX
drop
nxt

DUP: ; a -- a a         ; holds A in DX
mov BX,DX
dup
nxt

OVER: ; a b -- a b a    ; holds B in DX
mov BX,DX
dup
mov S1,MA
ldr LX
mov LX,BX
nxt

SWAP: ; a b -- b a      ; holds B in DX
mov BX,DX
mov S0,MA
ldr LX
str BX
mov LX,BX
nxt

ROT: ; a b c -- b c a   ; holds C in DX
mov S0,MA
ldr LX
mov LX,AX   ; AX= B
mov S1,MA
ldr LX      ; LX= A
str AX      ; store B at S1
mov S0,MA
mov BX,DX
str BX      ; store C at S0
mov LX,BX   ; move A to TOS
nxt

NROT: ; a b c -- c a b  ; holds A in DX
mov S0,MA
ldr LX
mov LX,AX   ; AX= B
mov S1,MA
ldr LX
mov LX,DX   ; DX= A
str BX      ; store C at S1
mov S0,MA
str DX      ; store A at S0
mov AX,BX   ; move B to TOS
nxt

Note that NIP RIP and DROP set DX to the discarded value. Also, DUP OVER SWAP ROT and NROT all set DX to the value at S0. We may have primitives that can take advantage of having DX already loaded with this data, so we can save a memory access. Setting DX to this value doesn't cost anything, because the MOV BX,DX gets parallelized (the exceptions are NIP and RIP). Note that SWAP is more efficient than OVER --- this is backwards from most Forth systems, in which SWAP and ROT are slow.
This is because the TOYF has plenty of registers, and instructions that use different registers can parallelize.

PLUS: ; a b -- a+b
pls
ldr LX
add LX,BX
nxt

MINUS: ; a b -- a-b
neg BX
pls
ldr LX
add LX,BX
nxt

FAST_PLUS: ; b -- a+b   ; assumes A is in DX already
mov BX,AX
add AD,BX
nxt

FAST_MINUS: ; b -- a-b  ; assumes A is in DX already
neg BX
mov BX,AX
add AD,BX
nxt

Our PLUS and MINUS primitives are typically compiled for + and - in the source-code. Our FAST_PLUS and FAST_MINUS primitives are compiled when we know the A value is already in DX. Note that the PLS parallelizes, so it doesn't cost anything. PLUS and MINUS are each three clock cycles, whereas FAST_PLUS and FAST_MINUS are two and three clock cycles respectively. This isn't much of an improvement, so it is hardly worth the effort.

An optimizing Forth compiler can generate slightly better code than a traditional non-optimizing Forth compiler. The TOYF is pretty efficient though, so the non-optimizing Forth compiler should generally be adequate. Even if a primitive is long, if it parallelizes well, then it will take few clock cycles to execute. In most primitives there is at least some parallelization. Opcodes packing two instructions will halve the code size and double the speed --- this is pretty common. Opcodes packing three instructions are, of course, super-efficient. :-)
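To close, the stack conventions used by the shuffling primitives above (BX as top-of-stack, S0 as second-of-stack, DX holding a spare copy) can be modeled in Python --- a sketch for checking stack effects, with illustrative names:

```python
class ToyStack:
    """Single-stack model: bx is top-of-stack, ss[0] is S0 (second-of-stack)."""
    def __init__(self, items):          # items listed bottom .. top
        *rest, self.bx = items
        self.ss = rest[::-1]            # ss[0] = S0, ss[1] = S1, ...
        self.dx = None

    def swap(self):                     # MOV BX,DX / MOV S0,MA / LDR LX / STR BX / MOV LX,BX
        self.dx = self.bx               # SWAP holds B in DX
        self.ss[0], self.bx = self.bx, self.ss[0]

    def over(self):                     # MOV BX,DX / DUP / MOV S1,MA / LDR LX / MOV LX,BX
        self.dx = self.bx               # OVER holds B in DX
        self.ss.insert(0, self.bx)      # the DUP macro: PHS / STR BX
        self.bx = self.ss[1]            # fetch the old S0 (now at S1)

    def items(self):
        return self.ss[::-1] + [self.bx]

t = ToyStack([1, 2])
t.swap()
assert t.items() == [2, 1] and t.dx == 2

t = ToyStack([1, 2])
t.over()
assert t.items() == [1, 2, 1]
```

Checking ROT and NROT against a model like this is a quick way to confirm the `a b c -- b c a` style stack comments in the listings.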