TOYF processor design --- by Hugh Aguilar --- October 2017

Abstract: The TOYF (Toy Forth) processor is less complicated than my 65ISR processor design, and more Forth-centric. The intention is that the TOYF would be easier to implement in HDL than the 65ISR. The TOYF is somewhat similar to the MiniForth at Testra, which is also a 16-bit Harvard design (built on the Lattice isp1048 PLD). The MiniForth was later upgraded to an FPGA and renamed RACE. I wrote MFX, the development system for the RACE --- the TOYF development system is based on MFX.

The TOYF has 125 instructions, which is quite a lot, but they are all simple single-cycle instructions that don't do much. The TOYF does out-of-order execution, so as many as three instructions can execute in one clock cycle.

The 65ISR has instructions that work with 1-bit data, so it is a much better choice for machines with a lot of switches. The 65ISR-abu addresses 16MB of memory, so it is a much better choice for applications that need to buffer entire files. The TOYF is mostly a fun processor for Forth enthusiasts, but it could be used in applications that have minimal I/O. The TOYF doesn't actually have interrupts --- it polls the I/O after executing each primitive in the main program. The TOYF would be practical as a micro-controller if it had a 65ISR-chico as a coprocessor to handle the I/O interrupts. The TOYF provides better support for the main program than the 65ISR does because it has Forth's traditional two stacks.

Section 1.) the architecture

We have the following registers (those that access data-memory have an Rx alternate name):

CF      1-bit   carry flag              CF should be 0 prior to SBC of low bytes (like the MC6800, not the 6502)
VF      1-bit   overflow flag
PC      16-bit  program counter         points to the machine-code (the only register that points to code-memory)
AX  R0  16-bit  general purpose
BX  R1  16-bit  general purpose         this is the top value of the single-stack
DX  R2  16-bit  general purpose
IP  R3  16-bit  instruction pointer     points to the Forth threaded-code
LF  R4  5-bit   local frame pointer     high-byte is $00, low-byte is 111xxxxx --- this is the local-frame pointer
RS  R5  5-bit   return stack pointer    high-byte is $00, low-byte is 111xxxxx --- this is the return-stack pointer
SS  R6  5-bit   single stack pointer    high-byte is $00, low-byte is 110xxxxx --- this is the data-stack for single-cell data
LX  R7  16-bit  primarily for use by LDR    most memory reads use LX as the destination
MA  R8  16-bit  memory access pointer   all memory loads and stores use MA as the address

The processor is Harvard architecture. Code-memory and data-memory can be accessed concurrently in the same clock cycle. Both have a 16-bit data-bus and are word-addressable. The two stacks are both in zero-page. The RS stack is in the upper 32 words, and the SS stack is in the 32 words below the RS stack. I/O ports and important global variables can be in the lower 192 words of zero-page. Presumably the code-memory (32KW max) would be internal to the FPGA so that it is fast. If the zero-page RAM is internal to the FPGA, it should be fast too. When MA points outside of zero-page, some extra clock cycles are added on for the access to external memory. I would expect maybe 100 MHz normally, dropping down to 25 MHz when accessing external data-memory. There are no instructions that access two adjacent memory locations, so the processor is neither little-endian nor big-endian.
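The 5-bit stack pointers imply fixed zero-page windows. As an illustration only (the function names `rs_address` and `ss_address` are mine, not part of any TOYF tool), here is a Python sketch of the address ranges and the 32-word wraparound, assuming the 111xxxxx / 110xxxxx low-byte patterns given above:

```python
# The 5-bit stack pointers concatenate a fixed high part with 5 low bits:
# RS and LF live in 111xxxxx ($00E0-$00FF), SS in 110xxxxx ($00C0-$00DF),
# leaving words $0000-$00BF of zero-page for I/O ports and globals.

def rs_address(rs: int) -> int:
    """Zero-page word address of return-stack slot rs (0..31)."""
    return 0xE0 | (rs & 0x1F)

def ss_address(ss: int) -> int:
    """Zero-page word address of single-stack slot ss (0..31)."""
    return 0xC0 | (ss & 0x1F)

assert rs_address(0) == 0xE0 and rs_address(31) == 0xFF
assert ss_address(0) == 0xC0 and ss_address(31) == 0xDF
# Decrementing a pointer below 0 wraps within the 32-word window,
# so the first push after start-up (RS = 0) lands at the top word:
assert rs_address(0 - 1) == 0xFF
```

This is consistent with RS and SS starting at zero: the first PHR decrements RS and the push lands at the highest word of the return-stack area.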
As a convention though, little-endian should be used, except for the Forth data stack, which is traditionally big-endian.

Code-memory has a 15-bit address-bus (32KW). The cfa has the high-bit clear for primitives, so 32KW is all we can address. The data-bus is 16-bit. I was originally going to have 24-bit wide opcodes, but then I found 16-bit to be adequate. If the TOYF is to be upgraded, however, the opcodes will have to be made wider to accommodate some more groups of instructions.

Data-memory has a 16-bit data-bus (64KW). We address words, not bytes! ADD 1,BX advances BX to the next 16-bit word. Forth code must be in the upper 32KW because the cfa has the high-bit set for colon words. Code-memory and data-memory are 32KW and 64KW respectively --- or 64KB and 128KB in terms of bytes.

Each opcode has fields, with an instruction in each field. All of these instructions execute concurrently in one clock cycle. The assembler rearranges the instructions and packs them into the fields, inserting NOP instructions as necessary. The goal is to minimize how many NOP instructions get inserted while guaranteeing that the program does the same thing as it would if we had one instruction per opcode, in the same order that the instructions appeared in the source-code. The opcode has these fields:

MMMMMAAAAAABBBBB
FEDCBA9876543210    these are the bit positions of the fields

There are three fields: MMMMM AAAAAA BBBBB. We are executing up to three instructions per clock cycle. To help the assembler pack instructions into opcodes efficiently (minimum NOP instructions inserted), the programmer should hold data in a register for a while before using that register. Don't modify a register and then immediately use that register for something. Try to interleave instructions that modify different registers, as these instructions can parallelize. This is very similar to Pentium programming, in which UV-either and V-specific instructions were interleaved.
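The field layout above can be sketched as an encoder/decoder. This is my illustration --- the document assigns bit positions to the fields but not binary encodings to individual instructions, so the slot numbers here are hypothetical:

```python
# Sketch of the 16-bit opcode layout: MMMMM in bits 15-11, AAAAAA in
# bits 10-5, BBBBB in bits 4-0. The field widths allow 31 group-M,
# 63 group-A, and 31 group-B instructions, plus a NOP (slot 0) in each.

def pack_opcode(m: int, a: int, b: int) -> int:
    """Pack three instruction-slot numbers into one 16-bit opcode."""
    assert 0 <= m <= 31 and 0 <= a <= 63 and 0 <= b <= 31
    return (m << 11) | (a << 5) | b

def unpack_opcode(opcode: int) -> tuple:
    """Split a 16-bit opcode back into its M, A, and B fields."""
    return ((opcode >> 11) & 0x1F, (opcode >> 5) & 0x3F, opcode & 0x1F)

assert unpack_opcode(pack_opcode(17, 42, 9)) == (17, 42, 9)
assert pack_opcode(0, 0, 0) == 0  # NOP in all three fields
```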
The TOYF is more complicated though, because there are three "pipes" (opcode fields) rather than just two. Not counting NOP, there are 31 MMMMM instructions, 63 AAAAAA instructions, and 31 BBBBB instructions. So, the TOYF has 125 instructions total, which is a lot, but this shouldn't make implementation unreasonably difficult. The TOYF has simple instructions that execute in one clock cycle, whereas the 65ISR has complicated multi-cycle instructions. These are completely different designs --- the TOYF is RISC and the 65ISR is CISC --- the 65ISR is a throwback to the 1980s. The TOYF is a throwback to the 1990s (it is similar to the MiniForth that Testra came out with in 1995).

Upon start-up, RS and SS are set to zero. They can't be modified except by being incremented and decremented.

Section 2.) the instruction set

In the following instruction descriptions, the | separates events that are done concurrently.

These are group-M instructions that modify LX PC MA IP LF RS SS:

nop
mov AX,LX   ; move AX to LX ; this allows EXECUTE and QEXECUTE to work
mov -1,LX   ; move -1 to LX ; a CAD LX,IP after this will back up to redo the primitive
ldr LX      ; load (MA) into LX
ldr IP      ; load (MA) into IP
ldr LF      ; load (MA) into LF
str AX      ; store AX to (MA) ; there is no STR DX, so use: MOV DX,AX STR AX
str BX      ; store BX to (MA)
str LX      ; store LX to (MA)
mov RS,LF   ; move RS to LF
mov LF,RS   ; move LF to RS
add AX,RS   ; move RS + AX to RS ; this discards some items from the return-stack
mov IP,MA   ; move IP to MA
mov AX,IP   ; move AX to IP ; this initializes IP in the FULL code
cad LX,IP   ; if CF then move IP + LX to IP
add 1,IP    ; move IP + 1 to IP
pol         ; move POLL to PC
nxt         ; move IP + 1 to IP and to MA | move NEXT to PC
isr         ; if CF then move FULL to PC
nxp         ; if high-bit of LX then move LX to PC (note that PC is 15 bits, so the high bit is discarded)
nxf         ; store IP to (MA) | move LX to IP and to MA | move NEXT to PC
mov AX,MA   ; move AX to MA
mov BX,MA   ; move BX to MA
mov DX,MA   ; move DX to MA
mov LF,MA   ; move LF to MA ; this is the address of the parent's LF under the local variables
lcl AX,MA   ; move LF + AX to MA ; if AX is the local offset, this sets MA to the local's address
mov S0,MA   ; move SS to MA ; this is the address of the second-of-stack (BX is top-of-stack)
mov S1,MA   ; move SS + 1 to MA ; this is the address of the third-of-stack
phr         ; move RS - 1 to RS and to MA
plr         ; move RS to MA | move RS + 1 to RS
phs         ; move SS - 1 to SS and to MA
pls         ; move SS to MA | move SS + 1 to SS

These are group-A instructions that modify AX:

nop
mov LX,AX   ; move LX to AX
mov BX,AX   ; move BX to AX
mov DX,AX   ; move DX to AX
mov LF,AX   ; move LF to AX ; this is needed because we don't have a STR LF instruction
sum BX,AX   ; if CF then move AX + BX to AX
add BX,AX   ; move AX + BX to AX
sub BX,AX   ; move AX - BX to AX
add DX,AX   ; move AX + DX to AX
and BX,AX   ; move BX -and- AX to AX
ior BX,AX   ; move BX -ior- AX to AX
xor BX,AX   ; move BX -xor- AX to AX
add 1,AX    ; move AX + 1 to AX
sub 1,AX    ; move AX - 1 to AX
not AX      ; move -1 -xor- AX to AX
neg AX      ; move 0 - AX to AX
shl AX      ; shift AX left, move 0 into low-bit
add LX,AX   ; move AX + LX to AX
sub LX,AX   ; move AX - LX to AX
and LX,AX   ; move LX -and- AX to AX
ior LX,AX   ; move LX -ior- AX to AX
xor LX,AX   ; move LX -xor- AX to AX
swp AX      ; exchange low byte and high byte in AX ; useful for accessing bytes
byt AX      ; move AX -and- $FF to AX ; useful for masking a byte
mov 0,AX    ; move 0 to AX ; 0, 1, etc. are small constants --- use NEG AX to negate them
mov 1,AX    ; move 1 to AX
mov 2,AX    ; move 2 to AX
mov 3,AX    ; move 3 to AX
mov 4,AX    ; move 4 to AX
mov 5,AX    ; move 5 to AX
mov 6,AX    ; move 6 to AX
mov 7,AX    ; move 7 to AX
mov 8,AX    ; move 8 to AX
mov 9,AX    ; move 9 to AX
mov 10,AX   ; move 10 to AX ; 10, 20, etc. are useful as the base to add a value, for a sum up to 99
mov 20,AX   ; move 20 to AX
mov 30,AX   ; move 30 to AX
mov 40,AX   ; move 40 to AX
mov 50,AX   ; move 50 to AX
mov 60,AX   ; move 60 to AX
mov 70,AX   ; move 70 to AX
mov 80,AX   ; move 80 to AX
mov 90,AX   ; move 90 to AX
mov 100,AX  ; move 100 to AX ; 100, 200, etc. are useful as the base to add a value, for a sum up to 999
mov 200,AX  ; move 200 to AX
mov 300,AX  ; move 300 to AX
mov 400,AX  ; move 400 to AX
mov 500,AX  ; move 500 to AX
mov 600,AX  ; move 600 to AX
mov 700,AX  ; move 700 to AX
mov 800,AX  ; move 800 to AX
mov 900,AX  ; move 900 to AX
mov $10,AX  ; move $10 to AX ; 1, 2, 4, 8, $10, $20, etc. are useful as bit masks for 1-bit I/O ports
mov $20,AX  ; move $20 to AX
mov $40,AX  ; move $40 to AX
mov $80,AX  ; move $80 to AX
mov $100,AX ; move $100 to AX
mov $200,AX ; move $200 to AX
mov $400,AX ; move $400 to AX
mov $800,AX ; move $800 to AX
mov $1000,AX ; move $1000 to AX
mov $2000,AX ; move $2000 to AX
mov $4000,AX ; move $4000 to AX
mov $8000,AX ; move $8000 to AX

These are group-B instructions that modify BX DX VF CF:

nop
mov LX,DX   ; move LX to DX
mov AX,DX   ; move AX to DX
mov BX,DX   ; move BX to DX
add AX,DX   ; move AX + DX to DX
add AD,BX   ; move AX + DX to BX
mov LX,BX   ; move LX to BX
mov AX,BX   ; move AX to BX
mov DX,BX   ; move DX to BX
mov MA,BX   ; move MA to BX
xch DX,BX   ; exchange the DX and BX values
neg BX      ; move 0 - BX to BX
mov CF,BX   ; move CF to BX ; if CF then this sets BX to TRUE (TRUE is 1, not the traditional -1)
flg BX      ; if BX<>0 then move 1 to BX ; normalizes a nonzero value into a TRUE flag
add 1,BX    ; move BX + 1 to BX
sub 1,BX    ; move BX - 1 to BX
sgn DX,BX   ; if BX < 0 then move 0 - BX to BX | if DX < 0 then move 0 - DX to DX | set VF to the -xor- of the original sign bits
vng BX      ; if VF then move 0 - BX to BX
bit DX,BX   ; shift DX right, move 0 into high-bit, move low-bit into CF | shift BX left, move 0 into low-bit
ror 1,BX    ; shift BX right, move CF into high-bit, move low-bit into CF
rol 1,BX    ; shift BX left, move CF into low-bit, move high-bit into CF
add LX,BX   ; move BX + LX to BX, move carry into CF, set VF if BX high-bit changes
add AX,BX   ; move AX + BX to BX, move carry into CF, set VF if BX high-bit changes
adc AX,BX   ; move AX + BX + CF to BX, move carry into CF, set VF if BX high-bit changes
sbc AX,BX   ; move BX - AX - CF to BX, move borrow into CF, set VF if BX high-bit changes
abs BX      ; if BX < 0 then move 0 - BX to BX | move original BX bit-15 to CF
mov 0,CF    ; move 0 to CF
mov VF,CF   ; move VF to CF
not CF      ; move -not- CF to CF
mov BX,CF   ; move -ior- of all BX bits to CF ; set CF if BX is nonzero
mov BN,CF   ; move BX bit-15 to CF ; set CF if BX is negative
mov BL,CF   ; move BX bit-15 -xor- VF to CF ; set CF if SBC AX,BX indicates AX > BX

Somebody wrote:

> I have pointed out quite some time ago that keeping the DO..LOOP index
> in memory slows down tight loops considerably (by up to a factor of
> 5-6); that would be easy to fix.

TOYF keeps its index in memory! TOYF is going to get ripped apart on comp.lang.forth as a disgrace to the Forth community! This speed issue with loops is more difficult to fix than it may seem to the Forth-200x committee, however. If you have interrupts after arbitrary instructions, then you have to save/restore all of your registers in the ISRs. Also, he mentions "tight loops," presumably meaning loops that don't call functions that also have loops. If your loops are going to be nestable, however, then you have to save/restore your index and limit registers. Also, if I had more registers, then I would need one or two new groups of instructions, and hence a wider opcode.

I think that non-professional programmers are overly impressed by benchmarks. Somebody examines a small benchmark function and then says that using registers provides a six-fold speed-up (woo hoo!). All benchmark functions are small because the code has to fit on one page of a magazine article. In real programs, however, reentrancy is important. This implies holding local data on a stack, as in TOYF.
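Returning to the group-B arithmetic: the CF/VF behavior of ADD, ADC, and SBC can be modeled in software. This is a sketch, not part of any TOYF toolchain; I read "set VF if BX high-bit changes" as the usual two's-complement signed-overflow rule, which is an assumption on my part:

```python
# Model of the group-B ADD AX,BX / ADC AX,BX / SBC AX,BX flag behavior,
# assuming 16-bit registers and conventional signed-overflow detection.

MASK = 0xFFFF
SIGN = 0x8000

def adc(ax: int, bx: int, cf: int):
    """BX = AX + BX + CF; returns (bx, cf, vf). ADD AX,BX is the cf=0 case."""
    total = ax + bx + cf
    result = total & MASK
    carry = 1 if total > MASK else 0
    # signed overflow: operands agree in sign, result disagrees
    vf = 1 if (~(ax ^ bx) & (ax ^ result) & SIGN) else 0
    return result, carry, vf

def sbc(ax: int, bx: int, cf: int):
    """BX = BX - AX - CF; returns (bx, cf, vf) with cf as the borrow."""
    total = bx - ax - cf
    result = total & MASK
    borrow = 1 if total < 0 else 0
    vf = 1 if ((ax ^ bx) & (bx ^ result) & SIGN) else 0
    return result, borrow, vf

# CF should be 0 before subtracting the low words (like the MC6800):
assert sbc(3, 10, 0) == (7, 0, 0)
assert sbc(10, 3, 0) == (0xFFF9, 1, 0)   # 3 - 10 = -7, borrow set
assert adc(0xFFFF, 1, 0) == (0, 1, 0)    # unsigned carry out
assert adc(0x7FFF, 1, 0)[2] == 1         # signed overflow
```

Under this model, MOV BL,CF (BX bit-15 -xor- VF) after SBC AX,BX yields the standard signed less-than test on BX - AX.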
The 65ISR doesn't support reentrancy, and doesn't support a main-program very well. The 65ISR is for event-driven micro-controllers, and all the ISRs are small independent programs. The TOYF is exactly the opposite --- it supports a large main-program well, but has minimal support for I/O events.

0BRANCH: ; flag --
    add 1,IP
    mov IP,MA
    ldr LX ; LX= offset
    mov BX,CF ; CF= nonzero?
    drop
    not CF ; CF= zero?
    cad LX,IP ; if CF then move IP + LX to IP
    pol

BRANCH: ; --
    mov 0,CF
    add 1,IP
    mov IP,MA
    ldr LX ; LX= offset
    not CF
    cad LX,IP ; if CF then move IP + LX to IP
    pol

Both 0BRANCH and BRANCH use a relative offset, as is traditional in Forth. Note that the instructions modifying CF are group-B, so they parallelize with the instructions nearby. These primitives aren't as slow as their length makes them appear to be.

JUMP: ; --
    add 1,IP
    mov IP,MA
    ldr IP
    nxt

JUMP uses an absolute address and is faster than BRANCH.

LESS_THAN: ; X Y -- X<Y
    mov 0,CF
    pls
    ldr LX
    mov LX,AX ; AX=X BX=Y
    sbc AX,BX ; BX= Y-X
    mov BL,CF ; CF set if Y<X
    mov CF,BX ; BX= flag
    pol

macro ZTOR ; this is: 0>R
    mov 0,AX
    phr
    str AX
endm

macro TOR ; this is: >R
    phr
    str BX ; push the item to the return-stack
    drop   ; remove the item from the single-stack
endm

RFETCH: ; -- val ; this is: R@
    dup
    plr
    ldr LX
    mov LX,BX
    phr ; this just undoes the PLR earlier
    pol

Note that we don't have interrupts, so there is no danger of an interrupt occurring between the PLR and the PHR to clobber the stack.

ZERO8: ; --
    ztor
ZERO7: ; --
    ztor
ZERO6: ; --
    ztor
ZERO5: ; --
    ztor
ZERO4: ; --
    ztor
ZERO3: ; --
    ztor
ZERO2: ; --
    ztor
ZERO1: ; --
    ztor
    pol

ENTER8: ; a b c d e f g h --
    tor
ENTER7: ; a b c d e f g --
    tor
ENTER6: ; a b c d e f --
    tor
ENTER5: ; a b c d e --
    tor
ENTER4: ; a b c d --
    tor
ENTER3: ; a b c --
    tor
ENTER2: ; a b --
    tor
ENTER1: ; a --
    tor
ENTER0: ; --
    mov LF,AX
    phr
    str AX ; push the old LF to the return-stack
    mov RS,LF ; set the LF to the new local-frame
    pol

At the front of a colon word that has locals, we optionally have a ZEROx primitive for any zero-initialized locals.
Then we have a mandatory ENTERx primitive for the stack-initialized locals (use ENTER0 if there aren't any). Any number of locals can be supported. A maximum of 10 seems reasonable --- it is bad Forth style to use a lot of locals. It is okay for each TOR to fall into the next one, because they aren't going to parallelize (they're group-M instructions). It is also okay for TOR to fall into ENTER0 because the MOV LF,AX will parallelize with the PHR anyway. All the other instructions are group-M and won't parallelize with each other. In code that primarily moves data around in memory, all the instructions are going to be group-M (the 'M' means memory). You get better parallelization when there are calculations being done with AX BX and DX --- these parallelize with group-M. Fortunately, there is usually a lot of calculation being done (that is where the term "computer" comes from).

LEAVE1: ; -- ; assumes that there is one local on the stack
    mov 1,AX
    plr
    ldr LF ; restore the old LF
    add AX,RS ; discard the local variable
EXIT: ; --
    plr
    ldr IP ; restore the old IP
    pol

Our EXIT primitive is for simple colon words. Our LEAVEx primitive is for colon words that started with ENTERx. We need LEAVE2 LEAVE3 ... LEAVE10 available, assuming that there is a maximum of 10 locals (ZEROx and ENTERx combined).

QEXIT: ; -- ; assumes that this is a quotation called with QEXECUTE
    plr
    ldr IP ; pull the old IP
    plr
    ldr LF ; pull the HOF's LF
    pol

Our QEXIT primitive is for quotations. This is like EXIT except that it restores the HOF's LF too.

LOCAL1_ADR: ; -- adr
    mov 1,AX ; AX= offset to local
    dup
    lcl AX,MA
    mov MA,BX
    pol

LOCAL1_FETCH: ; -- val
    mov 1,AX ; AX= offset to local
    dup
    lcl AX,MA
    ldr LX
    mov LX,BX
    pol

LOCAL1_STORE: ; val --
    mov 1,AX ; AX= offset to local
    lcl AX,MA
    str BX
    drop
    pol

LOCAL1_PLUS_STORE: ; val --
    mov 1,AX ; AX= offset to local
    lcl AX,MA
    ldr LX
    add LX,BX
    str BX
    drop
    pol

L1..L10 can be done like this efficiently. Locals above this would need some arithmetic.
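The frame discipline of ZEROx/ENTERx and the LOCALx offsets can be illustrated with a schematic Python model. This is my own sketch: the list stands in for the 32-word return-stack area (addresses grow downward in hardware; here the list index plays that role), with the saved LF at offset 0 and locals at offsets 1 and up:

```python
# Schematic model of building a local frame: ZEROx pushes a zero per
# zero-initialized local, ENTERx moves stack-initialized locals from the
# data stack (TOS first), and ENTER0 saves the caller's LF and points LF
# at the new frame, so frame[0] = saved LF and frame[n] = local Ln.

def enter(sstack, old_lf, n_zeroed, n_from_stack):
    """Returns (frame, sstack): frame[0] is the saved LF, frame[1] is L1."""
    pushed = []
    for _ in range(n_zeroed):            # ZEROx: ztor
        pushed.append(0)
    for _ in range(n_from_stack):        # ENTERx: tor (pops TOS first)
        pushed.append(sstack.pop())
    pushed.append(old_lf)                # ENTER0: push the old LF
    frame = pushed[::-1]                 # last-pushed word sits at LF+0
    return frame, sstack

frame, ss = enter([7, 8], 0xFF, 1, 2)    # ZERO1 then ENTER2, stack: 7 8 --
assert frame == [0xFF, 7, 8, 0]          # saved LF, L1=7, L2=8, zeroed local
assert ss == []
```

Note how the leftmost stack argument ends up as L1, because TOR pops the top of the stack first and the zeroed locals were pushed earliest.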
Note that there is no L0, because at LF we have the old LF value, not a local variable.

LOCAL11_ADR: ; -- adr
    mov 10,AX
    mov AX,DX
    dup
    mov 1,AX
    add DX,AX ; AX= offset to local
    lcl AX,MA
    mov MA,BX
    pol

Our LOCAL11_ADR is an example of the arithmetic needed to get above L10. This isn't very efficient, but it should be rare that anybody has this many locals. Most likely, I will just set 10 as the limit of how many locals the compiler supports.

LOCAL1_PRE_INC_FETCH: ; -- val
    mov 1,AX ; AX= offset to local
    dup
    lcl AX,MA
    ldr LX
    mov LX,BX
    add 1,BX
    str BX
    pol

LOCAL1_POST_INC_FETCH: ; -- val
    mov 1,AX ; AX= offset to local
    dup
    lcl AX,MA
    ldr LX
    mov LX,BX
    mov LX,AX
    add 1,AX
    str AX
    pol

Our LOCAL1_PRE_INC_FETCH and LOCAL1_POST_INC_FETCH are examples of using local variables as pointers. Doing a pre-decrement or post-decrement would be similar.

LOCAL1_DX_ADR: ; sets DX to the address
    xch DX,BX
    mov 1,AX ; AX= offset to local
    lcl AX,MA
    mov MA,BX
    xch DX,BX
    nxt

LOCAL1_DX_FETCH: ; sets DX to the value
    mov 1,AX ; AX= offset to local
    lcl AX,MA
    ldr LX
    mov LX,DX
    nxt

LOCAL1_DX_STORE: ; stores DX to the local
    mov 1,AX ; AX= offset to local
    lcl AX,MA
    mov DX,AX ; there is no STR DX
    str AX
    pol

LOCAL1_DX_PLUS_STORE: ; adds DX to the local
    mov 1,AX ; AX= offset to local
    lcl AX,MA
    mov DX,AX
    ldr LX
    add LX,AX
    str AX
    pol

Some primitives leave a value in DX rather than push it to the stack. These must end in NXT rather than POL to preserve DX. Above we have LOCAL1_DX_ADR and LOCAL1_DX_FETCH, which leave the value in DX and end in NXT. We also have LOCAL1_DX_STORE and LOCAL1_DX_PLUS_STORE, which use DX. These end in POL because DX is no longer in use. Normally, if a primitive uses DX then it should end in POL --- the previous primitive had to end in NXT because it left DX. We need to do POL reasonably often because otherwise we risk losing I/O data. The danger is that we have several short primitives in a row, all of which end in NXT, and the overall time will be too long.
If this danger is problematic, the compiler can be made smart enough to deal with it. We can have two versions of most of the primitives, one that ends in NXT and the other that ends in POL. The compiler can choose which to use based on a count of how many opcodes we have had since the last POL. This is complicated --- maybe worry about that sometime in the future.

In MFX (for the MiniForth) I had "ghost primitives." For these I saved the source-code but didn't assemble it into machine-code. The first time a ghost primitive was used, it was assembled into machine-code. Afterward, if they were used again, I had the cfa already. All of these primitives that access local variables should be ghost primitives. Dozens or hundreds of them can be meta-compiled, but they won't use code-space unless they are actually used. Most programs won't need the primitives that are specific to more than four local variables.

LIT0: ; -- 0
    mov 0,AX
    phs
    str BX ; the PHS and STR BX are the DUP macro
    mov AX,BX
    nxt

The constants 0 1 2 3 4 5 6 7 8 9 10 are easy and fast because the code can be written as above. There are three opcodes total, despite there being five instructions. The MOV 0,AX and PHS parallelize into one opcode. The STR BX and MOV AX,BX parallelize into one opcode. The NXT doesn't parallelize with anything, so it is an opcode by itself.

LIT29: ; -- 29
    mov 20,AX
    phs
    str BX ; the PHS and STR BX are the DUP macro
    mov AX,DX
    mov 9,AX
    add AD,BX
    nxt

LIT29 parallelizes pretty well. There are three opcodes total, despite there being seven instructions. The MOV 20,AX and PHS parallelize into one opcode. The STR BX and MOV AX,DX and MOV 9,AX all parallelize into one opcode. The ADD AD,BX and NXT parallelize into one opcode. Parallelization is good! This is what makes a processor fast --- instructions executing in parallel. Both LIT0 and LIT29 get a NXT because they are so short that we don't need to poll I/O afterward.
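A compiler could build such literals mechanically from the MOV n,AX constants listed in group-A, using the tens and hundreds as bases and 0..10 as the value added on. The decomposition scheme below is my illustration, not something specified in this document:

```python
# Sketch of splitting a small literal into a base constant plus a small
# add, matching the style of LIT29 (mov 20,AX ... mov 9,AX add AD,BX).

BASES = [100, 200, 300, 400, 500, 600, 700, 800, 900,
         10, 20, 30, 40, 50, 60, 70, 80, 90, 0]

def decompose(n: int):
    """Split n into (base, rest) where both are MOV constants, or
    return None when no single base+small pair covers it."""
    for base in BASES:
        rest = n - base
        if 0 <= rest <= 10:
            return base, rest
    return None  # needs a chain of adds, or the generic LIT

assert decompose(29) == (20, 9)   # exactly the LIT29 pattern
assert decompose(7) == (0, 7)     # small constants need no base
assert decompose(110) == (100, 10)
```

Values that don't decompose this way (or the $-prefixed bit masks) would fall back on the generic LIT primitive, at the cost of an extra cell in the threaded code.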
LIT0_DX: ; sets DX= 0
    mov 0,AX
    mov AX,DX
    nxt

LIT29_DX: ; sets DX= 29
    mov 20,AX
    mov AX,DX
    mov 9,AX
    add AX,DX
    nxt

Our LIT0_DX and LIT29_DX leave the value in DX rather than push it to the stack. These get a NXT out of necessity, because POL would execute code that might clobber the DX register. LIT29_DX has three opcodes: MOV 20,AX doesn't parallelize with anything, so it is an opcode by itself. MOV AX,DX and MOV 9,AX parallelize into one opcode. ADD AX,DX and NXT parallelize into one opcode. This is no better than LIT29, which was also three opcodes. Parallelization works better in long primitives because more is being done, and more registers are being used. In LIT29 there was a DUP, which included two group-M instructions, both of which parallelized with group-A and group-B. In LIT29_DX we have only group-A and group-B instructions, plus the NXT which is group-M, so fewer registers are being used. Sometimes an optimization, such as passing data in registers rather than in memory, doesn't help at all. LIT29_DX might still be useful though, because the following primitive that uses DX may be faster than if it used the stack.

LIT: ; -- val
    add 1,IP
    mov IP,MA
    ldr LX
    dup
    mov LX,BX
    nxt

LIT_PLUS: ; n -- n+val
    add 1,IP
    mov IP,MA
    ldr LX
    add LX,BX
    nxt

LIT and LIT_PLUS are generic for any constant. They each use two cells in the threaded code.

LIT_DX: ; sets DX= val
    add 1,IP
    mov IP,MA
    ldr LX
    mov LX,DX
    nxt

LIT_PLUS_DX: ; n -- ; sets DX= n+val
    mov BX,AX ; move BX to AX before we DROP
    pls
    ldr LX
    mov LX,BX ; this is the DROP macro
    add 1,IP
    mov IP,MA
    ldr LX
    add LX,AX
    mov AX,DX
    nxt

Our LIT is six opcodes and our LIT_DX is four opcodes, so we have an improvement. Our LIT_PLUS is four opcodes and our LIT_PLUS_DX is seven opcodes, which is much worse. There is very little parallelization. Optimization is not always an improvement!
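How LIT reads its inline value can be shown with a toy threaded-code model. The list-of-tokens layout here is schematic (my illustration), not a byte-accurate TOYF image:

```python
# Toy model of LIT: the literal occupies the cell after LIT's own cell
# in the threaded code, and IP is advanced past it before NXT continues.

def lit(threaded, ip, stack):
    """Execute LIT at threaded[ip]; returns the new ip."""
    ip += 1                     # add 1,IP: point MA/IP at the inline cell
    stack.append(threaded[ip])  # ldr LX / dup / mov LX,BX: push the value
    return ip + 1               # nxt: continue with the next token

code = ["LIT", 1000, "LIT", 29, "PLUS"]
stack = []
ip = lit(code, 0, stack)
ip = lit(code, ip, stack)
assert stack == [1000, 29] and code[ip] == "PLUS"
```

LIT_DX is the same walk over the threaded code, except that the value lands in DX instead of being pushed.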
PR0_FETCH: ; -- val ; this is pseudo-register 0 ; leaves the address in DX
    mov 0,AX ; AX= adr
    dup
    mov AX,DX
    mov AX,MA
    ldr LX
    mov LX,BX
    nxt

PR0_STORE: ; val -- ; this is pseudo-register 0 ; leaves the address in DX
    mov 0,AX ; AX= adr
    mov AX,DX
    mov AX,MA
    str BX
    drop
    nxt

Global variables are slow and bloaty because we need to put the address in the threaded code and load it with LIT. We can have pseudo-registers in low memory that are much faster. PR0_FETCH is seven opcodes and PR0_STORE is six opcodes. If we have primitives that are specific to these pseudo-registers, they should be quite fast. It is also possible to put important I/O ports in these low addresses so that the I/O code is fast. The MOV AX,DX parallelizes, so it doesn't cost anything --- having the address in DX might be useful.

PR0_FETCH_DX: ; sets DX= value ; this is pseudo-register 0
    mov 0,AX ; AX= adr
    mov AX,MA
    ldr LX
    mov LX,DX
    nxt

PR0_STORE_DX: ; stores DX to pseudo-register 0
    mov 0,AX ; AX= adr
    mov AX,MA
    mov DX,AX
    str AX
    pol

Our PR0_FETCH_DX and PR0_STORE_DX use DX rather than the stack. The compiler has to do some optimization. Consider this code sequence: 0 @ 1 ! This compiles into two primitives: PR0_FETCH_DX PR1_STORE_DX. This is just peephole optimization --- it is easy to make a compiler generate code like this. Note that in PR0_STORE_DX the MOV DX,AX parallelizes, so our lack of a STR DX instruction doesn't hurt us.

PREP_REG_COPY: ; src dst cnt -- cnt ; sets AX=src DX=dst
    pls
    ldr DX
    pls
    ldr AX
    nxt

REG_COPY: ; needs AX=src DX=dst BX=cnt ; moves a datum from (AX) to (DX), post-incrementing, and decrements the cnt
    mov AX,MA
    add 1,AX
    sub 1,BX
    mov BX,CF ; CF set if BX is non-zero
    ldr LX
    mov DX,MA
    xch DX,BX
    str LX
    add 1,BX
    xch DX,BX
    mov -1,LX ; this will cause NXT to back up and do the REG_COPY primitive again
    cad LX,IP ; if CF then move IP + LX to IP
    nxt

Our REG_COPY takes 7 clock cycles per word copied, which is pretty fast considering how many group-M instructions we have.
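The net effect of repeatedly re-executing REG_COPY (via the MOV -1,LX / CAD LX,IP trick) can be modeled functionally. This Python sketch is schematic, not cycle-accurate:

```python
# Functional model of the REG_COPY loop: copy cnt words from src to dst
# with post-increment, re-running the primitive until the count hits zero.

def reg_copy(mem, src, dst, cnt):
    """Copy cnt words from mem[src:] to mem[dst:]; returns (src, dst) after."""
    while True:
        datum = mem[src]        # mov AX,MA / ldr LX
        src += 1                # add 1,AX
        cnt -= 1                # sub 1,BX
        mem[dst] = datum        # mov DX,MA / str LX
        dst += 1                # add 1,BX (with BX and DX exchanged)
        if cnt == 0:            # CF clear: fall through to the next primitive
            return src, dst
        # CF set: cad LX,IP backs IP up so NXT re-enters REG_COPY

mem = list(range(10)) + [0] * 10
assert reg_copy(mem, 0, 10, 4) == (4, 14)
assert mem[10:14] == [0, 1, 2, 3]
```

Note that, like the hardware loop, this model always copies at least one word, since the count is tested only after the decrement.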
MOV AX,MA and ADD 1,AX and SUB 1,BX all parallelize into one opcode. MOV BX,CF and LDR LX parallelize into one opcode. MOV DX,MA and XCH DX,BX parallelize into one opcode. STR LX and ADD 1,BX parallelize into one opcode. XCH DX,BX and MOV -1,LX parallelize into one opcode. CAD LX,IP doesn't parallelize with anything, so it is an opcode by itself. NXT doesn't parallelize with anything, so it is an opcode by itself.

REG_COPY can be used to do very fast block moves! The problem is that it requires that NXT rather than POL be used, so a large block move could delay the POLL code too long. Because of this, it is best to copy only small blocks of data.

SLOW_COPY: ; src dst cnt -- ; copy SRC (post-increment) to DST (post-increment) and decrement CNT
    mov S1,MA
    ldr AX ; AX= src
    mov S0,MA
    ldr DX ; DX= dst
    mov AX,MA
    ldr LX
    mov DX,MA
    str LX
    add 1,AX
    mov S1,MA
    str AX
    mov DX,AX
    mov S0,MA
    add 1,AX
    str AX
    sub 1,BX
    pol

The SLOW_COPY primitive keeps the src and dst in memory on the stack, rather than in registers. This takes 13 clock cycles, which is quite a lot. It ends in POL though, so the POLL code is getting executed.

FAST_NIP: ; a b -- b ; only 2 opcodes (compared to 3 for NIP) but doesn't set DX
    pls
    nxt

NIP: ; a b -- b ; hold A in DX
    pls
    ldr LX
    mov LX,DX
    nxt

DROP: ; a -- ; hold A in DX
    mov BX,DX
    drop
    nxt

DUP: ; a -- a a ; hold A in DX
    mov BX,DX
    dup
    nxt

OVER: ; a b -- a b a ; hold B in DX
    mov BX,DX
    dup
    mov S1,MA
    ldr LX
    mov LX,BX
    nxt

SWAP: ; a b -- b a ; hold B in DX
    mov BX,DX
    mov S0,MA
    ldr LX
    str BX
    mov LX,BX
    nxt

ROT: ; a b c -- b c a ; hold C in DX
    mov S0,MA
    ldr LX
    mov LX,AX ; AX= B
    mov S1,MA
    ldr LX ; LX= A
    str AX ; store B at S1
    mov S0,MA
    mov BX,DX
    str BX ; store C at S0
    mov LX,BX ; move A to TOS
    nxt

NROT: ; a b c -- c a b ; hold A in DX
    mov S0,MA
    ldr LX
    mov LX,AX ; AX= B
    mov S1,MA
    ldr LX
    mov LX,DX ; DX= A
    str BX ; store C at S1
    mov S0,MA
    str DX ; store A at S0
    mov AX,BX ; move B to TOS
    nxt

Note that NIP and DROP set DX to the discarded value.
Also, DUP OVER SWAP ROT and NROT all set DX to the value at S0. We may have primitives that can take advantage of having DX already loaded with this data, and we can save a memory access. Setting DX to this value doesn't cost anything because the MOV BX,DX gets parallelized (the exception is NIP though). Any primitive that sets DX for use in the next primitive must end with NXT rather than POL so that DX doesn't get clobbered. Note that SWAP is more efficient than OVER --- this is backwards from most Forth systems, in which SWAP and ROT are slow. This is because the TOYF has plenty of registers, and instructions that use different registers can parallelize.

PLUS: ; a b -- a+b
    pls
    ldr LX
    add LX,BX
    nxt

MINUS: ; a b -- a-b
    neg BX
    pls
    ldr LX
    add LX,BX
    nxt

FAST_PLUS: ; a b -- a+b ; assumes A is in DX already
    pls ; nip the A value that we already have in DX
    mov BX,AX
    add AD,BX
    nxt

FAST_MINUS: ; a b -- a-b ; assumes A is in DX already
    pls ; nip the A value that we already have in DX
    neg BX
    mov BX,AX
    add AD,BX
    nxt

Our PLUS and MINUS primitives are typically compiled for + and - in the source-code. Our FAST_PLUS and FAST_MINUS primitives are compiled when we know that the A value is already in DX. Note that the PLS parallelizes, so it doesn't cost anything. PLUS and MINUS are each three clock cycles, whereas FAST_PLUS and FAST_MINUS are two and three clock cycles respectively. This isn't much of an improvement, so it is hardly worth the effort. An optimizing Forth compiler can generate slightly better code than a traditional non-optimizing Forth compiler. The TOYF is pretty efficient though, so a non-optimizing Forth compiler should generally be adequate. Even if a primitive is long, if it parallelizes well then it will take few clock cycles to execute. In most primitives there is at least some parallelization. Opcodes packing two instructions will halve the code size and double the speed --- this is pretty common.
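The PLUS versus FAST_PLUS distinction can be modeled on a schematic data stack. This Python sketch is my illustration of the peephole idea, assuming 16-bit wraparound; the stack list stands in for BX plus the SS memory area:

```python
# Model of PLUS vs FAST_PLUS: PLUS fetches its second operand from the
# memory stack, while FAST_PLUS assumes the previous primitive already
# left that operand in DX and merely discards the stale stack copy.

MASK = 0xFFFF

def plus(stack):
    """a b -- a+b"""
    b = stack.pop()
    a = stack.pop()                    # pls / ldr LX
    stack.append((a + b) & MASK)       # add LX,BX

def fast_plus(stack, dx):
    """a b -- a+b, with a known to be in DX already"""
    b = stack.pop()
    stack.pop()                        # pls: nip the copy of a
    stack.append((dx + b) & MASK)      # mov BX,AX / add AD,BX

s1, s2 = [3, 4], [3, 4]
plus(s1)
fast_plus(s2, 3)                       # DX holds the 3 from a prior primitive
assert s1 == s2 == [7]
```

Both paths compute the same result; the only win for FAST_PLUS is skipping one memory read, which matches the modest one-cycle saving noted above.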
Opcodes packing three instructions are, of course, super-efficient. :-)