TOYF processor design --- by Hugh Aguilar --- copyright November 2017
contact: hughaguilar96@gmail.com

Abstract: The TOYF (Toy Forth) processor is less complicated than my 65ISR processor design, and more Forth-centric. The intention is that the TOYF would be easier to implement in HDL than the 65ISR. The TOYF is somewhat similar to the MiniForth at Testra, which is also a 16-bit Harvard design (built on the Lattice isp1048 PLD). The MiniForth was later upgraded to an FPGA and the name changed to RACE. I wrote MFX, the development system for the RACE --- the TOYF development system is based on MFX. The TOYF has 125 instructions, which is quite a lot, but they are all simple single-cycle instructions that don't do much. The TOYF does out-of-order execution, so as many as three instructions can execute in one clock cycle. The 65ISR has instructions that work with 1-bit data, so it is a much better choice for machines with a lot of switches. The 65ISR-abu addresses 16MB of memory, so it is a much better choice for applications that need to buffer entire files. The TOYF is mostly a fun processor for Forth enthusiasts, but it could be used in applications that have minimal I/O. The TOYF doesn't actually have interrupts --- it polls the I/O after executing each primitive in the main program. The TOYF would be practical as a micro-controller if it had a 65ISR-chico as a coprocessor to handle the I/O interrupts. The TOYF provides better support for the main program than the 65ISR does because it has Forth's traditional two stacks. A language standard limits us to a few primitives; because the TOYF has an assembly-language, though, there can be a lot of primitives.

Section 1.)
the architecture

We have the following registers (those that access data-memory have an Rx alternate name):

CF      1-bit carry flag          CF should be 0 prior to SBC of low bytes (like the MC6800, not the 6502)
VF      1-bit overflow flag
PC      16-bit program counter    points into the machine-code (the only register that points to code-memory)
AX  R0  16-bit general purpose
BX  R1  16-bit general purpose    the top value of the single-stack
CX  R2  16-bit counter            for the LOOP primitive
DX  R3  16-bit general purpose    sometimes used to pass data between primitives
IP  R4  16-bit instruction pointer    points into the Forth threaded-code
LF  R5  5-bit local-frame pointer     high-byte is $00, low-byte is 111xxxxx
RS  R6  5-bit return-stack pointer    high-byte is $00, low-byte is 111xxxxx
SS  R7  5-bit single-stack pointer    high-byte is $00, low-byte is 110xxxxx --- the data-stack for single-cell data
LX  R8  16-bit load register          primarily for use by LDR --- most memory reads use LX as the destination
MA  R9  16-bit memory-access pointer  all memory loads and stores use MA as the address

The processor is Harvard architecture. Code-memory and data-memory can be accessed concurrently in the same clock cycle. Both have a 16-bit data-bus and are word-addressable. The two stacks are both in zero-page. The RS stack is in the upper 32 words, and the SS stack is in the 32 words below the RS stack. I/O ports and important global variables can be in the lower 192 words of zero-page. Presumably the code-memory (32KW max) would be internal to the FPGA so that it is fast. If the zero-page RAM is internal to the FPGA, it should be fast too. When MA points outside of zero-page, some extra clock cycles are added on for the access to external memory. I would expect maybe 40 MHz normally, dropping down to 20 MHz when accessing external data-memory. The MSP430 runs at 25 MHz
max --- it is my goal that TOYF Forth programs run faster than equivalent MSP430 C programs. There are no instructions that access two adjacent memory locations, so the processor is neither little-endian nor big-endian. As a convention though, little-endian should be used, except for the Forth data-stack, which is traditionally big-endian. Code-memory has a 15-bit address-bus (32KW). The cfa has the high-bit clear for primitives, so 32KW is all we can address. The data-bus is 16-bit. I was originally going to have 24-bit wide opcodes, but then I found 16-bit to be adequate. If the TOYF is to be upgraded, however, the opcodes will have to be made wider to accommodate some more groups of instructions. Data-memory has a 16-bit data-bus (64KW). We address words, not bytes! MOV 1,AX ADD AX,BX advances BX to the next 16-bit word. Forth code must be in the upper 32KW because the cfa has the high-bit set for colon words. Code-memory and data-memory are 32KW and 64KW respectively --- or 64KB and 128KB in terms of bytes. Each opcode has fields, with an instruction in each field. All of these instructions execute concurrently in one clock cycle. The assembler rearranges the instructions and packs them into the fields, inserting NOP instructions as necessary. The goal is to minimize how many NOP instructions get inserted, while guaranteeing that the program does the same thing as it would if we had one instruction per opcode, in the same order that the instructions appeared in the source-code. The opcode has these fields:

MMMMMAAAAAABBBBB
FEDCBA9876543210    (these are the bit positions of the fields)

There are three fields: MMMMM AAAAAA BBBBB. We are executing up to three instructions per clock cycle. To help the assembler pack instructions into opcodes efficiently (minimum NOP instructions inserted), the programmer should hold data in a register for a while before using that register. Don't modify a register and then immediately use that register for something.
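The three-field layout above can be modeled with a small Python sketch (the helper names pack_opcode and unpack_opcode are hypothetical, and encoding field value 0 as NOP is an assumption --- the document doesn't specify the binary encodings of the mnemonics):

```python
# The 16-bit opcode: bits 15-11 hold the M field, bits 10-5 the A field,
# and bits 4-0 the B field (the MMMMMAAAAAABBBBB layout above).

def pack_opcode(m: int, a: int, b: int) -> int:
    assert 0 <= m < 32 and 0 <= a < 64 and 0 <= b < 32
    return (m << 11) | (a << 5) | b

def unpack_opcode(op: int):
    return (op >> 11) & 0x1F, (op >> 5) & 0x3F, op & 0x1F

NOP = 0  # assumption: field value 0 means NOP
assert unpack_opcode(pack_opcode(3, 17, 9)) == (3, 17, 9)
assert pack_opcode(NOP, NOP, NOP) == 0
```

The field widths (5, 6, 5 bits) match the instruction counts given below: 31 group-M, 63 group-A, and 31 group-B instructions, each plus a NOP.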
Try to interleave instructions that modify different registers, as these instructions can parallelize. This is very similar to Pentium programming, in which either-pipe (UV) and V-specific instructions were interleaved. The TOYF is more complicated though, because there are three "pipes" (opcode fields) rather than just two. Not counting NOP, there are 31 MMMMM instructions, 63 AAAAAA instructions, and 31 BBBBB instructions. So the TOYF has 125 instructions total, which is a lot, but this shouldn't make implementation unreasonably difficult. The TOYF has simple instructions that execute in one clock cycle, whereas the 65ISR has complicated multi-cycle instructions. These are completely different designs --- the TOYF is RISC and the 65ISR is CISC --- the 65ISR is a throwback to the 1980s. The TOYF is a throwback to the 1990s (it is similar to the MiniForth that Testra came out with in 1995). Upon start-up, RS and SS are set to zero. They can't be modified except by being incremented and decremented.

Section 2.)
the instruction set

In the following instruction descriptions, the | separates events that are done concurrently.
These are group-M instructions that modify LX PC MA IP LF RS SS:

nop
mov AX,LX   ; move AX to LX ; this allows EXECUTE and QEXECUTE to work
ldr LX      ; load (MA) into LX
ldr IP      ; load (MA) into IP
ldr LF      ; load (MA) into LF
str AX      ; store AX to (MA) ; the CX DX and LF registers can't be stored, so move to AX then do: STR AX
str BX      ; store BX to (MA)
str LX      ; store LX to (MA)
mov RS,LF   ; move RS to LF
mov LF,RS   ; move LF to RS
add AX,RS   ; move RS + AX to RS ; this discards some items from the return-stack
mov IP,MA   ; move IP to MA
mov AX,IP   ; move AX to IP ; this initializes IP in the FULL code
cad LX,IP   ; if CF then move IP + LX to IP
lit IP,MA   ; move IP + 1 to IP and to MA
pol         ; move POLL to PC
nxt         ; move IP + 1 to IP and to MA | move NEXT to PC
isr         ; if CF then move FULL to PC
nxp         ; if high-bit of LX then move LX to PC (note that PC is 15 bits, so the high bit is discarded)
nxf         ; store IP to (MA) | move LX to IP and to MA | move NEXT to PC
mov AX,MA   ; move AX to MA
mov BX,MA   ; move BX to MA
mov DX,MA   ; move DX to MA
mov LF,MA   ; move LF to MA ; this is the address of the parent's LF under the local variables
lcl AX,MA   ; move LF + AX to MA ; if AX is the local offset, sets MA to the local's address
mov S0,MA   ; move SS to MA ; this is the address of the second-of-stack (BX is top-of-stack)
mov S1,MA   ; move SS + 1 to MA ; this is the address of the third-of-stack
mov RS,MA   ; move RS to MA
phr         ; move RS - 1 to RS and to MA
plr         ; move RS to MA | move RS + 1 to RS
phs         ; move SS - 1 to SS and to MA
pls         ; move SS to MA | move SS + 1 to SS

These are group-A instructions that modify AX CX:

nop
mov LX,AX   ; move LX to AX
mov BX,AX   ; move BX to AX
mov DX,AX   ; move DX to AX
mov LF,AX   ; move LF to AX ; this is needed because we don't have a STR LF instruction
sum BX,AX   ; if low-bit of DX then move AX + BX to AX
add BX,AX   ; move AX + BX to AX
sub BX,AX   ; move AX - BX to AX
add LX,AX   ; move AX + LX to AX
add DX,AX   ; move AX + DX to AX
and BX,AX   ; move BX -and- AX to AX
ior BX,AX   ; move BX -ior- AX to AX
xor BX,AX   ; move BX -xor- AX to AX
flg CF,AX   ; move CF to AX ; sets AX= 1 if CF, else AX= 0 (TRUE is 1, not the traditional -1)
not AX      ; move -1 -xor- AX to AX
neg AX      ; move 0 - AX to AX
shl AX      ; shift AX left, move 0 into low-bit
add LX,AX   ; move AX + LX to AX
sub LX,AX   ; move AX - LX to AX
and LX,AX   ; move LX -and- AX to AX
ior LX,AX   ; move LX -ior- AX to AX
xor LX,AX   ; move LX -xor- AX to AX
and CF,AX   ; move CF -and- AX to AX ; sets AX= 1 if CF -and- low-bit of AX, else AX= 0
swp AX      ; exchange low byte and high byte in AX ; useful for accessing bytes
mov LX,CX   ; move LX to CX ; used by LIST_START etc.
xch CX,AX   ; exchange CX and AX values ; also called: XCH AX,CX
sub 1,CX    ; move CX - 1 to CX ; used in countdown loops
bit CX      ; shift CX left, move -not- CF into low-bit ; used in division with CX being the quotient
cad LX,AX   ; if CF then move AX + LX to AX ; used in division for undoing trial subtraction
mov -1,AX   ; move -1 to AX
mov 0,AX    ; move 0 to AX ; 0, 1, etc. are small constants
mov 1,AX    ; move 1 to AX
mov 2,AX    ; move 2 to AX ; use NEG AX to get -2, etc.
mov 3,AX    ; move 3 to AX
mov 4,AX    ; move 4 to AX
mov 5,AX    ; move 5 to AX
mov 6,AX    ; move 6 to AX
mov 7,AX    ; move 7 to AX
mov 8,AX    ; move 8 to AX
mov 9,AX    ; move 9 to AX
mov 10,AX   ; move 10 to AX ; 10, 20, etc. are useful as the base to add a value to, for a sum up to 99
mov 20,AX   ; move 20 to AX
mov 30,AX   ; move 30 to AX
mov 40,AX   ; move 40 to AX
mov 50,AX   ; move 50 to AX
mov 60,AX   ; move 60 to AX
mov 70,AX   ; move 70 to AX
mov 80,AX   ; move 80 to AX
mov 90,AX   ; move 90 to AX
mov 100,AX  ; move 100 to AX ; for 200, use: SHL AX
mov 300,AX  ; move 300 to AX ; we can efficiently make constants up to 499
mov $FF,AX  ; move 255 to AX ; useful for masking bytes
mov $10,AX  ; move $10 to AX ; 1, 2, 4, 8, $10, $20, etc. are useful as bit masks for 1-bit I/O ports
mov $20,AX  ; move $20 to AX ; these have to be fast because they are done in the POLL code
mov $40,AX  ; move $40 to AX
mov $80,AX  ; move $80 to AX
mov $100,AX  ; move $100 to AX
mov $200,AX  ; move $200 to AX
mov $400,AX  ; move $400 to AX
mov $800,AX  ; move $800 to AX
mov $1000,AX ; move $1000 to AX
mov $2000,AX ; move $2000 to AX
mov $4000,AX ; move $4000 to AX
mov $8000,AX ; move $8000 to AX

These are group-B instructions that modify BX DX VF CF:

nop
mov LX,DX   ; move LX to DX
mov AX,DX   ; move AX to DX
mov BX,DX   ; move BX to DX
add AX,DX   ; move AX + DX to DX
add AD,BX   ; move AX + DX to BX
mov LX,BX   ; move LX to BX
mov AX,BX   ; move AX to BX
mov DX,BX   ; move DX to BX
mov MA,BX   ; move MA to BX
xch DX,BX   ; exchange DX and BX values ; also called: XCH BX,DX
neg BX      ; move 0 - BX to BX
mov CF,BX   ; move CF to BX ; sets BX= 1 if CF, else BX= 0
and AX,CF   ; move AX<>0 -and- CF to CF
sgn DX,BX   ; if BX < 0 then move 0 - BX to BX | if DX < 0 then move 0 - DX to DX | set VF to -xor- of the original sign bits
vng BX      ; if VF then move 0 - BX to BX ; used by the SMULT primitive to fix the sign afterward
adj DX,BX   ; shift DX right, move 0 into high-bit | shift BX left, move 0 into low-bit, move high-bit into CF
ror BX      ; shift BX right, move CF into high-bit, move low-bit into CF
rol BX      ; shift BX left, move CF into low-bit, move high-bit into CF
add LX,BX   ; move BX + LX to BX, move carry into CF, set VF if BX and LX had the same high-bit and it changes
adc AX,BX   ; move AX + BX + CF to BX, move carry into CF, set VF if BX and AX had the same high-bit and it changes
sbc AX,BX   ; move BX - AX - CF to BX, move borrow into CF, set VF if BX and AX had the same high-bit and it changes
abs BX      ; if BX < 0 then move 0 - BX to BX | move original BX high-bit to CF
mov 0,CF    ; move 0 to CF
mov VF,CF   ; move VF to CF
not CF      ; move -not- CF to CF
cmv AX,DX   ; if CF then move AX to DX
flg DX,CF   ; move DX<>0 to CF
not BX,CF   ; move BX=0 to CF ; set CF if BX is zero ; used in 0BRANCH
flg CX,CF   ; move CX<>0 to CF ; set CF if CX is nonzero ; used in LOOP
mov BN,CF   ; move BX high-bit to CF ; set CF if BX is negative
mov BL,CF   ; move BX high-bit -xor- VF to CF ; set CF if SBC AX,BX indicates AX < BX

NZ: ; n -- flag ; effectively the same as: 0<>
flg BX,CF
mov CF,BX
nxt

In TOYF a TRUE is 1 and a FALSE is 0. This is unlike ANS-Forth, in which a TRUE is -1. NZ normalizes a non-zero value into a TRUE flag. NZ isn't needed by 0BRANCH --- NZ is mostly done in preparation for logic operations on flags.

FLIP: ; n -- ~n ; effectively the same as: -1 XOR
mov -1,AX
xor BX,AX
mov AX,BX
nxt

Note that NOT isn't the same as FLIP. This has been a point of confusion in Forth for decades. FLIP is rarely needed --- I can't actually think of any use for it. In general, you want to normalize flags and then do logic on them.

macro DROP
pls
ldr LX
mov LX,BX
endm

macro DUP
phs
str BX
endm

It is possible to implement DO LOOP in TOYF because we have an overflow flag (the MiniForth didn't have an overflow flag). I don't really like DO LOOP however, because it is very confusing for novices to learn, so I may not bother with it. Much more common will be a countdown loop, as shown below:

PRE_LOOP: ; count -- ; push old CX to return-stack, set DX= count, branch to POST_LOOP if zero
mov BX,AX
mov BX,DX   ; DX= count
xch CX,AX   ; CX= count, AX= old CX
phr
mov 0,CF
str AX      ; push old CX to the return-stack
lit IP,MA
ldr LX      ; LX= offset to POST_LOOP
not BX,CF   ; CF= zero?
drop        ; --
cad LX,IP   ; if CF then move IP + LX to IP
nxt

LOOP: ; branch if CX is non-zero, then decrement CX
lit IP,MA
ldr LX      ; LX= offset to just past PRE_LOOP
flg CX,CF   ; CF= nonzero?
cad LX,IP   ; if CF then move IP + LX to IP
sub 1,CX    ; post-decrement CX
pol

POST_LOOP: ; -- ; pull CX from return-stack
plr
ldr LX
mov LX,AX
xch AX,CX   ; CX= old CX pulled from the return-stack
nxt

For a countdown loop, use: PRE_LOOP (offset to POST_LOOP) code... LOOP (offset to just past PRE_LOOP) POST_LOOP --- the code will execute as many times as count indicates.
At POST_LOOP, our CX is -1 before it gets restored. FIND_MATCH: ; flag -- ; skip LIST_LOOP and go to POST_MATCH mov 1,AX not BX,CF ; CF= zero? drop not CF ; CF= non-zero? mov AX,LX cad LX,IP ; if CF then move IP + LX to IP nxt POST_MATCH: ; -- cx ; pull CX from return-stack xch CX,AX dup mov AX,BX ; -- cx ; CX is -1 if LOOP fell through, else it is how many iterations remain plr ldr LX mov LX,CX ; CX= old CX pulled from return-stack nxt To early-exit, use: PRE_LOOP (offset to POST_LOOP) code... FIND_MATCH LIST_LOOP (offset to just past PRE_LOOP) POST_MATCH The code matches the node, leaves the node on the stack, and pushes a flag indicating if the node matched. FIND_MATCH consumes the flag and skips over LIST_LOOP to POST_MATCH if the flag was non-zero. POST_MATCH is like POST_LOOP except that it pushes CX to the data-stack. If there was a match, this is how many iterations were not done, but if there was no match then this is -1. Afterward the number of successful iterations (not counting the one that early-exited) is: original-remaining-1 PRE_LOOP sets DX to the original --- this can be used after the loop if nothing in the loop modifies DX. If DX is modified in the loop though, then the original should be stored somewhere, such as on the return-stack. PRE_LIST_LOOP: ; node -- node ; sets CX to node, DX= 0 ; return-stack: -- old-CX prior-node mov 0,AX mov AX,DX ; DX= early-exit is false mov BX,AX xch CX,AX ; CX= node, AX= old CX phr str AX ; push old CX to return-stack mov 0,AX phr str AX ; push 0 to return-stack as prior-node lit IP,MA ldr LX ; LX= offset to POST_LIST_LOOP not BX,CF ; CF= nil? cad LX,IP ; if CF then move IP + LX to IP nxt LIST_START: ; node -- node ; needs CX=node (can't be nil), sets CX= next node xch CX,AX ; AX= node mov AX,MA ldr LX mov LX,CX ; CX= next node nxt FIND_LIST_MATCH: ; node flag -- node ; set DX= early-exit? mov 1,AX not BX,CF ; CF= zero? not CF ; CF= non-zero? 
mov AX,LX ; LX= offset over LIST_LOOP to POST_LIST mov 0,AX mov AX,DX ; DX= false (assumes no early-exit as default) mov 1,AX cad LX,IP ; if CF then move IP + LX to IP cmv AX,DX ; if an early-exit then DX= true nxt LIST_LOOP: ; node -- next-node ; needs CX=next-node, DX=false; moves CX to BX mov RS,MA str BX ; store node to prior-node slot on return-stack FAST_LIST_LOOP: ; like LIST_LOOP except doesn't update prior-node on the return-stack xch CX,AX mov AX,BX ; update BX with next-node xch CX,AX lit IP,MA ldr LX ; LX= offset to LIST_START flg CX,CF ; CF= new node is not nil cad LX,IP ; if CF then move IP + LX to IP pol POST_LIST: ; node -- prior-node node early-exit? ; return_stack: old-CX prior-node -- plr ldr LX phs str LX ; pop prior-node from return-stack and push it under node on the data-stack dup mov DX,BX ; -- prior-node node early-exit? plr ldr LX mov LX,CX ; CX= old CX pulled from return-stack nxt To traverse a list, use: PRE_LIST_LOOP (offset to POST_LIST) LIST_START code... LIST_LOOP (offset to LIST_START) POST_LIST The code inside of the loop will be given the current node on the data-stack. It does not consume this. To obtain a matching node in a list, use: PRE_LIST_LOOP (offset to POST_LIST) LIST_START code... FIND_LIST_MATCH LIST_LOOP (offset to LIST_START) POST_LIST The code inside of the loop will be given the current node on the data-stack. It does not consume this. This code must push a flag indicating if a match was found or not. FIND_LIST_MATCH consumes this flag. POST_LIST leaves this on the data-stack: -- prior-node node early-exit? If the loop completed, then node will be the tail node and early-exit? will be FALSE. If the loop early-exited, then node will be the matching node and early-exit? will be TRUE. Prior-node is the node prior to the node returned, or nil if the node returned is the first node. It is okay for the code to modify the link in the node (such as when moving the node to another list). 
This is because CX already holds the next-node so we aren't depending upon the old node's link anymore. If this is done, the prior-node returned by POST_LIST will not be meaningful, so discard it. If prior-node is not going to be used anyway, then use FAST_LIST_LOOP rather than LIST_LOOP. This will save two clock cycles per iteration because MOV RS,MA STR BX doesn't parallelize at all. Your prior-node after POST_LIST will still be nil, so just discard it. We could also have faster versions of PRE_LIST_LOOP and POST_LIST that don't mess with prior-node at all. In my experience in writing the novice-package, a merge-sort is not significantly faster than an insertion-sort. The insertion-sort has the advantage of not using any extra memory. My novice-package merge-sort used the return-stack. On the TOYF though, the return-stack is only 32 elements max. To write a merge-sort on the TOYF, you would need a stack in the heap for temporary storage. This is way more trouble than it is worth. Just use the insertion-sort --- it is reasonably fast. For insertion-sort to work, we need the prior-node to the matching node. POST_LIST above provides this. PRE_NTH: ; head index -- head ; sets DX= index mov 1,AX mov AX,LX ; LX= offset to skip over NTH if index is zero mov BX,DX not BX,CF ; CF= index is zero? drop cad LX,IP ; if CF then move IP + LX to IP nxt NTH: ; node -- next-node ; needs DX= index flg DX,CF ; CF= not zero? mov -1,AX mov BX,MA ldr LX mov LX,BX ; -- next-node add AX,DX ; post-decrement DX mov AX,LX ; LX= offset back to this primitive again cad LX,IP ; if CF then move IP + LX to IP pol To calculate the N'th node in a list, use: PRE_NTH NTH The index is 0 for the 1st node, 1 for the 2nd node, etc.. Use GOOD_LENGTH ahead of time to ensure that the list is long enough to have a valid node at the given index. 
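The prior-node bookkeeping that PRE_LIST_LOOP, LIST_LOOP, and POST_LIST implement can be sketched in Python (Node and find_with_prior are hypothetical stand-ins; nil is modeled as None, and the body's match test is modeled as a predicate):

```python
# Traverse a singly linked list, testing each node (including the tail),
# and return (prior-node, node, early-exit?) just as POST_LIST leaves
# them: on a match, node is the matching node and early-exit? is TRUE;
# otherwise node is the tail and early-exit? is FALSE.

class Node:
    def __init__(self, val, link=None):
        self.val = val
        self.link = link

def find_with_prior(head, match):
    prior, node = None, head        # prior-node starts out nil
    while node is not None:
        nxt = node.link             # CX holds the next node before the body runs
        if match(node):
            return prior, node, True    # early-exit: node is the match
        if nxt is None:
            return prior, node, False   # loop completed: node is the tail
        prior, node = node, nxt
    return None, None, False        # nil head: the loop body never runs
```

For an insertion-sort, the returned prior-node is exactly what is needed to splice a new node into place ahead of the matching node.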
PRE_GOOD_LENGTH: ; head count -- count ; sets DX= head, skips one thread if head is nil pls ldr DX ; -- count ; DX= head mov 1,AX mov AX,LX ; LX= offset skipping over next thread flg DX,CF not CF ; CF= nil? cad LX,IP ; if CF then move IP + LX to IP nxt GOOD_LENGTH: ; count -- remaining-count ; needs DX=node mov 1,AX sub AX,BX ; decrement BX mov 0,AX ; AX= nil mov DX,MA ldr LX mov LX,DX ; DX= new-node not BX,CF ; CF= count zero? cmv AX,DX ; change DX into nil if count is zero, to stop the iteration mov -1,AX mov AX,LX ; LX= offset back to this primitive again flg DX,CF ; CF= not nil? cad LX,IP ; if CF then move IP + LX to IP pol To ensure that a list is at least COUNT nodes long, use: PRE_GOOD_LENGTH GOOD_LENGTH QUERY_ZERO This leaves a TRUE if the list is long enough, and a FALSE if the list is too short. PRE_LENGTH: ; head -- 0 ; sets DX= head, skips one thread if head is nil mov BX,DX mov 1,AX mov AX,LX ; LX= offset skipping over next thread mov 0,AX not BX,CF ; CF= nil? mov AX,BX ; BX= 0 cad LX,IP ; if CF then move IP + LX to IP nxt LENGTH: ; 0 -- count ; needs DX=node, finishes with DX= nil, BX= count ; DX can't be nil initially mov 1,AX add AX,BX ; increment BX mov -1,AX mov DX,MA ldr LX mov LX,DX ; DX= new node mov AX,LX ; LX= offset back to this primitive again flg DX,CF ; CF= not nil? cad LX,IP ; if CF then move IP + LX to IP pol To obtain the length, use: PRE_LENGTH LENGTH PRE_LENGTH will skip over LENGTH if head is nil, leaving 0 on the stack. LENGTH will set BX= BX+1, set DX= new-node, then branch back to itself if DX is not nil. PRE_TAIL: ; head -- head ; sets DX= head, skips one thread if head is nil mov BX,DX mov 1,AX mov AX,LX ; LX= offset skipping over next thread not BX,CF ; CF= nil? 
cad LX,IP ; if CF then move IP + LX to IP nxt TAIL: ; head -- tail ; needs DX=node; finishes with DX= nil, BX= tail ; DX can't be nil initially mov -1,AX mov DX,BX ; BX= old node mov DX,MA ldr LX mov LX,DX ; DX= new node mov AX,LX ; LX= offset back to this primitive again flg DX,CF ; CF= not nil? cad LX,IP ; if CF then move IP + LX to IP pol To obtain the tail, use: PRE_TAIL TAIL PRE_TAIL will skip over TAIL if head is nil, leaving the nil on the stack. TAIL will set BX= old node, DX= new node, then branch back to itself if DX is not nil. The tail can also be obtained with LIST_LOOP, but that is much slower if all you need is the tail. PRE_LINK: ; 1st-head 2nd-head -- 1st-head 2nd-head 1st-head ; sets DX= 1st-head, skip two threads if 1st-head nil mov S0,MA mov 2,AX ldr LX dup mov 0,CF mov LX,BX ; -- 1st-head 2nd-head 1st-head mov LX,DX ; DX= 1st-head mov AX,LX ; LX= offset past two threads not BX,CF ; CF= 1st-head is nil? cad LX,IP ; if CF then move IP + LX to IP pol GOOD_LINK: ; 1st-head 2nd-head 1st-tail -- 1st-head ; TAIL fell through, so 1st-head is not nil pls ldr LX ; LX= 2nd-head mov BX,MA str LX ; store 2nd-head into 1st-tail link field lit IP,MA ; skip over BAD_LINK nxt BAD_LINK: ; 1st-head 2nd-head 1st-head -- 2nd-head ; assumes 1st-head is nil drop ; -- 1st-head 2nd-head pls ; -- 2nd-head nxt To link two lists, do this: PRE_LINK TAIL GOOD_LINK BAD_LINK PRE_LINK will skip to BAD_LINK if 1st-head is nil, otherwise fall into TAIL, which falls into GOOD_LINK. GOOD_LINK will link the lists, then skip over BAD_LINK. If BAD-LINK executes it returns 2nd-head (might be nil). The CX register is only used for iteration. Any other use would corrupt any loop that you are inside of. The POLL code has to save and restore CX and DX if it uses them. 0BRANCH: ; flag -- lit IP,MA ldr LX ; LX= offset not BX,CF ; CF= zero? 
drop
cad LX,IP   ; if CF then move IP + LX to IP
pol

BRANCH: ; --
mov 0,CF
lit IP,MA
ldr LX      ; LX= offset
not CF
cad LX,IP   ; if CF then move IP + LX to IP
pol

Both 0BRANCH and BRANCH use a relative offset, as is traditional in Forth. Note that the instructions modifying CF are group-B, so they parallelize with the instructions nearby. These primitives aren't as slow as their length makes them appear to be.

JUMP: ; --
lit IP,MA
ldr IP
nxt

JUMP uses an absolute address and is faster than BRANCH.

LESS_THAN: ; X Y -- X<Y
mov 0,CF
pls
ldr LX
mov LX,AX   ; AX=X BX=Y
sbc AX,BX   ; BX= Y-X
mov BL,CF   ; CF set if X < Y
mov CF,BX   ; BX= flag
nxt

macro ZTOR ; push a zero to the return-stack
mov 0,AX
phr
str AX
endm

macro TOR ; this is: >R
phr
str BX      ; push the item to the return-stack
drop        ; remove the item from the single-stack
endm

RFETCH: ; -- val ; this is: R@
dup
mov RS,MA
ldr LX
mov LX,BX
pol

ZERO8: ; --
ztor
ZERO7: ; --
ztor
ZERO6: ; --
ztor
ZERO5: ; --
ztor
ZERO4: ; --
ztor
ZERO3: ; --
ztor
ZERO2: ; --
ztor
ZERO1: ; --
ztor
pol

ENTER8: ; a b c d e f g h --
tor
ENTER7: ; a b c d e f g --
tor
ENTER6: ; a b c d e f --
tor
ENTER5: ; a b c d e --
tor
ENTER4: ; a b c d --
tor
ENTER3: ; a b c --
tor
ENTER2: ; a b --
tor
ENTER1: ; a --
tor
ENTER0: ; --
mov LF,AX
phr
str AX      ; push the old LF to the return-stack
mov RS,LF   ; set the LF to the new local-frame
pol

At the front of a colon word that has locals, we optionally have a ZEROx primitive for any zero-initialized locals. Then we have a mandatory ENTERx primitive for the stack-initialized locals (use ENTER0 if there aren't any). Any number of locals can be supported. A maximum of 10 seems reasonable --- it is bad Forth style to use a lot of locals. It is okay for each TOR to fall into the next one, because they aren't going to parallelize (they're group-M instructions). It is also okay for TOR to fall into ENTER0 because the MOV LF,AX will parallelize with the PHR anyway. All the other instructions are group-M and won't parallelize with each other.
In code that primarily moves data around in memory, all the instructions are going to be group-M (the 'M' means memory). You get better parallelization when there are calculations being done with AX BX and DX --- these parallelize with group-M. Fortunately, there is usually a lot of calculation being done (that is where the term "computer" comes from). LEAVE1: ; -- ; assumes that there is one local on the stack mov 1,AX plr ldr LF ; restore the old LF add AX,RS ; discard the local variable EXIT: ; -- plr ldr IP ; restore the old IP pol Our EXIT primitive is for simple colon words. Our LEAVEx primitive is for colon words that started with ENTERx. We need LEAVE2 LEAVE3 ... LEAVE10 available, assuming that there is a maximum of 10 locals (ZEROx and ENTERx combined). QEXIT: ; -- ; assumes that this is a quotation called with QEXECUTE plr ldr IP ; pull old IP plr ldr LF ; pull HOF's LF pol Our QEXIT primitive is for quotations. This is like EXIT except that it restores the HOF's LF too. LOCAL1_ADR: ; -- adr mov 1,AX ; AX= offset to local dup lcl AX,MA mov MA,BX pol LOCAL1_FETCH: ; -- val mov 1,AX ; AX= offset to local dup lcl AX,MA ldr LX mov LX,BX pol LOCAL1_STORE: ; val -- mov 1,AX ; AX= offset to local lcl AX,MA str BX drop pol LOCAL1_PLUS_STORE: ; val -- mov 1,AX ; AX= offset to local lcl AX,MA ldr LX add LX,BX str BX drop pol L1..L10 can be done like this efficiently. Locals above this would need some arithmetic. Note that there is no L0 because at LF we have the old LF value, not a local variable. LOCAL11_ADR: ; -- adr mov 10,AX mov AX,DX dup mov 1,AX add DX,AX ; AX= offset to local lcl AX,MA mov MA,BX pol Our LOCAL11_ADR is an example of the arithmetic needed to get above L10. This isn't very efficient, but it should be rare that anybody has this many locals. Most likely, I will just set 10 as the limit of how many locals the compiler supports. 
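The local-frame discipline described above can be sketched in Python (ReturnStack and its methods are hypothetical helpers; the modeling of the 32-word region as list indices 0..31 is an assumption):

```python
# The return-stack occupies the top 32 words of zero-page and grows
# downward (PHR does RS-1). ZEROx pushes zero-initialized locals, ENTERx
# pushes the stack-initialized locals and then the parent's LF, and points
# LF at the new frame; LCL AX,MA reaches local i at address LF+i.
# There is no local 0, because LF+0 holds the old LF.

RS_EMPTY = 32  # empty stack: RS points just past the modeled region

class ReturnStack:
    def __init__(self):
        self.mem = [0] * 32
        self.rs = RS_EMPTY
        self.lf = 0                      # dummy initial LF

    def push(self, val):                 # PHR then STR
        self.rs -= 1
        self.mem[self.rs] = val

    def enter(self, stack_args, zero_count):
        for _ in range(zero_count):      # ZEROx: one ZTOR per zero local
            self.push(0)
        for v in reversed(stack_args):   # ENTERx: TOR pops the top first
            self.push(v)
        self.push(self.lf)               # push the parent's LF...
        self.lf = self.rs                # ...and point LF at the new frame

    def local_addr(self, i):             # LCL AX,MA with AX = i (i >= 1)
        return self.lf + i

    def local(self, i):
        return self.mem[self.local_addr(i)]

    def leave(self, n_locals):           # LEAVEn: restore LF, discard locals
        self.lf = self.mem[self.rs]
        self.rs += 1 + n_locals
```

With ENTER2 on two stack items plus one zero-initialized local, local 1 and local 2 are the stack items and local 3 is the zero, matching the numbering used by the LOCALx primitives.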
LOCAL1_PRE_INC_FETCH: ; -- val ; sets DX= original value
mov 1,AX    ; AX= offset to local
dup
lcl AX,MA
ldr LX
mov LX,BX
mov BX,DX
mov 1,AX
add AX,BX
str BX
pol

LOCAL1_POST_INC_FETCH: ; -- val ; sets DX= incremented value
mov 1,AX    ; AX= offset to local
dup
lcl AX,MA
ldr LX
mov LX,BX
mov 1,AX
mov LX,DX
add AX,DX
str DX
pol

Our LOCAL1_PRE_INC_FETCH and LOCAL1_POST_INC_FETCH are examples of using local variables as pointers. Doing a pre-decrement or post-decrement would be similar.

LOCAL1_DX_ADR: ; sets DX to the address
xch DX,BX
mov 1,AX    ; AX= offset to local
lcl AX,MA
mov MA,BX
xch DX,BX
nxt

LOCAL1_DX_FETCH: ; sets DX to the value
mov 1,AX    ; AX= offset to local
lcl AX,MA
ldr LX
mov LX,DX
nxt

LOCAL1_DX_STORE: ; stores DX to the local
mov 1,AX    ; AX= offset to local
lcl AX,MA
mov DX,AX
str AX
pol

LOCAL1_DX_PLUS_STORE: ; adds DX to the local
mov 1,AX    ; AX= offset to local
lcl AX,MA
mov DX,AX
ldr LX
add LX,AX
str AX
pol

Some primitives leave a value in DX rather than push it to the stack. Above we have LOCAL1_DX_ADR and LOCAL1_DX_FETCH that leave the value in DX and end in NXT. We also have LOCAL1_DX_STORE and LOCAL1_DX_PLUS_STORE that use DX. We need to do POL reasonably often, because otherwise we risk losing I/O data. The danger is that we may have several short primitives that end in NXT all in a row, and the overall time between polls will be too long. If this danger is problematic, the compiler can be made smart enough to deal with it. We can have two versions of most of the primitives, one that ends in NXT and the other that ends in POL. The compiler can choose which to use based on a count of how many opcodes we have had since the last POL. This is complicated --- maybe worry about that sometime in the future. In MFX (for the MiniForth) I had "ghost primitives." For these I saved the source-code, but didn't assemble them into machine-code. The first time they were used, they assembled into machine-code. Afterward, if they were used, I had the cfa already.
All of these primitives that access local variables should be ghost primitives. Dozens or hundreds of them can be meta-compiled, but they won't use code-space unless they are actually used. Most programs won't need the primitives that are specific to more than four local variables.

LIT0: ; -- 0
mov 0,AX    ; AX= value
phs
str BX      ; the PHS and STR BX are the DUP macro
mov AX,BX
nxt

The constants 0 1 2 3 4 5 6 7 8 9 10 are easy and fast because the code can be written as above. There are three opcodes total, despite there being five instructions. The MOV 0,AX and PHS parallelize into one opcode. The STR BX and MOV AX,BX parallelize into one opcode. The NXT doesn't parallelize with anything, so it is an opcode by itself.

LIT29: ; -- 29
mov 20,AX   ; AX= big part of value
phs
str BX      ; the PHS and STR BX are the DUP macro
mov AX,DX
mov 9,AX    ; AX= small part of value
add AD,BX   ; BX= value
nxt

LIT29 parallelizes pretty well. There are three opcodes total, despite there being seven instructions. The MOV 20,AX and PHS parallelize into one opcode. The STR BX and MOV AX,DX and MOV 9,AX all parallelize into one opcode. The ADD AD,BX and NXT parallelize into one opcode. Parallelization is good! This is what makes a processor fast --- instructions executing in parallel. Both LIT0 and LIT29 got a NXT because they are so short that we don't need to poll I/O afterward.

LIT0_DX: ; sets DX= 0
mov 0,AX    ; AX= value
mov AX,DX
nxt

LIT29_DX: ; sets DX= 29
mov 20,AX   ; AX= big part of value
mov AX,DX
mov 9,AX    ; AX= small part of value
add AX,DX
nxt

Our LIT0_DX and LIT29_DX leave the value in DX rather than push it to the stack. These get a NXT out of necessity, because POL would execute code that might clobber the DX register. LIT29_DX has three opcodes: MOV 20,AX doesn't parallelize with anything, so it is an opcode by itself. MOV AX,DX and MOV 9,AX parallelize into one opcode. ADD AX,DX and NXT parallelize into one opcode. This is no better than what LIT29 had, which was also three opcodes.
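The packing of LIT0 and LIT29 into three opcodes can be reproduced with a toy Python model of the packing rule (the function and data names are hypothetical, the read/write sets are abbreviated from the instruction listings, and this greedy in-order packer omits the reordering that the real assembler also does). Within one opcode all instructions execute concurrently, so reads see the values from before the opcode; an instruction may join the current opcode only if its field is free and it doesn't read a register written earlier in that same opcode:

```python
# Each instruction = (name, field, reads, writes); field is 'M', 'A', or 'B'.

def pack(program):
    bundles = []
    cur = {}                     # field -> (name, reads, writes)
    for name, field, reads, writes in program:
        written = {r for (_, _, w) in cur.values() for r in w}
        if field in cur or (set(reads) & written):
            bundles.append(cur)  # field taken, or read-after-write: new opcode
            cur = {}
        cur[field] = (name, reads, writes)
    if cur:
        bundles.append(cur)
    return bundles

# LIT0 from the text: three opcodes despite five instructions.
LIT0 = [
    ("mov 0,AX",  "A", [],           ["AX"]),
    ("phs",       "M", ["SS"],       ["SS", "MA"]),
    ("str BX",    "M", ["BX", "MA"], []),
    ("mov AX,BX", "B", ["AX"],       ["BX"]),
    ("nxt",       "M", ["IP"],       ["IP", "MA", "PC"]),
]

# LIT29 from the text: three opcodes despite seven instructions.
LIT29 = [
    ("mov 20,AX", "A", [],           ["AX"]),
    ("phs",       "M", ["SS"],       ["SS", "MA"]),
    ("str BX",    "M", ["BX", "MA"], []),
    ("mov AX,DX", "B", ["AX"],       ["DX"]),
    ("mov 9,AX",  "A", [],           ["AX"]),
    ("add AD,BX", "B", ["AX", "DX"], ["BX"]),
    ("nxt",       "M", ["IP"],       ["IP", "MA", "PC"]),
]

assert len(pack(LIT0)) == 3
assert len(pack(LIT29)) == 3
```

Note how MOV 9,AX joins the same opcode as MOV AX,DX: writing a register that a concurrent instruction reads is fine, because the read sees the pre-opcode value.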
Parallelization works better in long primitives because more is being done, and more registers are being used. In LIT29 there was a DUP which included two group-M instructions, both of which parallelized with group-A and group-B. In LIT29_DX we have only group-A and group-B instructions, plus NXT that is group-M, so fewer registers being used. Sometimes optimization, such as passing data in registers rather than in memory, doesn't help at all. LIT29_DX might still be useful though because the following primitive that uses DX may be faster than if it used the stack. LIT: ; -- val lit IP,MA ldr LX dup mov LX,BX nxt LIT_PLUS: ; n -- n+val lit IP,MA ldr LX add LX,BX nxt LIT and LIT_PLUS are generic for any constant. They each use two cells in the threaded code. LIT_DX: ; sets DX= val lit IP,MA ldr LX mov LX,DX nxt LIT_PLUS_DX: ; n -- ; sets DX= n+val mov BX,AX ; move BX to AX before we DROP pls ldr LX mov LX,BX ; this is the DROP macro lit IP,MA ldr LX add LX,AX mov AX,DX nxt Our LIT is six opcodes and our LIT_DX is four opcodes, so we have an improvement. Our LIT_PLUS is four opcodes and our LIT_PLUS_DX is seven opcodes, which is much worse. There is very little parallelization. Optimization is not always an improvement! PR0_FETCH: ; -- val ; this is pseudo-register 0 ; leaves the address in DX mov 0,AX ; AX= adr dup mov AX,DX mov AX,MA ldr LX mov LX,BX nxt PR0_STORE: ; val -- ; this is pseudo-register 0 ; leaves the address in DX mov 0,AX ; AX= adr mov AX,DX mov AX,MA str BX drop nxt Global variables are slow and bloaty because we need to put the address in the threaded code and load it with LIT. We can have pseudo-registers in low memory that are much faster. PR0_FETCH is seven opcodes and PR0_STORE is six opcodes. If we have primitives that are specific to these pseudo-registers, they should be quite fast. It is also possible to put important I/O ports in these low addresses so the I/O code is fast. 
The MOV AX,DX parallelizes, so it doesn't cost anything --- having the address in DX might be useful.

PR0_FETCH_DX:       ; sets DX= value ; this is pseudo-register 0
    mov 0,AX        ; AX= adr
    mov AX,MA
    ldr LX
    mov LX,DX
    nxt

PR0_STORE_DX:       ; stores DX to pseudo-register 0
    mov 0,AX        ; AX= adr
    mov AX,MA
    mov DX,AX
    str AX
    pol

Our PR0_FETCH_DX and PR0_STORE_DX use DX rather than the stack. The compiler has to do some optimization. Consider this code sequence:

    0 @ 1 !

This compiles into two primitives:

    PR0_FETCH_DX PR1_STORE_DX

This is just peephole optimization --- it is easy to make a compiler generate code like this. Note that in PR0_STORE_DX the MOV DX,AX parallelizes, so our lack of a STR DX instruction doesn't hurt us.

macro MOVE_WORD     ; dst -- dst ; needs DX=src
    mov DX,MA
    ldr LX
    mov BX,MA
    str LX
    endm

macro MOVE_WORD_POST_INC ; dst -- dst ; needs DX=src, AX=1
    mov DX,MA
    add AX,DX
    ldr LX
    mov BX,MA
    add AX,BX
    str LX
    endm

macro MOVE_WORD_PRE_DEC ; dst -- dst ; needs DX=src, AX=-1
    add AX,DX
    mov DX,MA
    ldr LX
    add AX,BX
    mov BX,MA
    str LX
    endm

If the size of the block is known at compile-time, the above macros can be used to make a primitive that moves the block. If MOVE_WORD_PRE_DEC is used repeatedly, the STR LX at the end parallelizes with the ADD AX,DX at the front. This should only be done for small blocks. If the block is too big, then POL doesn't get executed often enough.

MOVE_POST_INC:      ; dst -- dst ; needs DX=src CX=count
    sub 1,CX        ; decrement count
    flg CX,CF       ; CF= nonzero?
    mov 1,AX        ; AX= value to be added to SRC and DST pointers
    mov DX,MA
    ldr LX
    not CF          ; CF= zero?
    mov BX,MA
    str LX
    mov -1,AX
    mov AX,LX      ; LX= offset back to this primitive
    add AX,DX
    add AX,BX
    cad LX,IP      ; if CF then move IP + LX to IP
    pol

Our MOVE_POST_INC primitive moves a block of data using post-increment. This parallelizes into eight opcodes. This works for large blocks, as we do POL after every word moved.
SUB 1,CX and FLG CX,CF parallelize.
MOV 1,AX and MOV DX,MA parallelize.
LDR LX and NOT CF parallelize.
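The peephole optimization described above can be sketched in a few lines of Python: when a small literal is immediately followed by @ or !, the compiler emits a pseudo-register primitive instead of LIT plus a generic fetch or store. The token format and function name are my own; only the PRn_FETCH_DX / PRn_STORE_DX primitive names come from the text.

```python
# A sketch of the peephole pass: "0 @" becomes PR0_FETCH_DX and
# "1 !" becomes PR1_STORE_DX. Anything else passes through unchanged.

def peephole(tokens):
    out, i = [], 0
    while i < len(tokens):
        t = tokens[i]
        if t.isdigit() and i + 1 < len(tokens) and tokens[i + 1] in "@!":
            kind = "FETCH" if tokens[i + 1] == "@" else "STORE"
            out.append(f"PR{t}_{kind}_DX")
            i += 2                  # consumed the literal and the @/!
        else:
            out.append(t); i += 1
    return out

optimized = peephole(["0", "@", "1", "!"])
print(optimized)                    # ['PR0_FETCH_DX', 'PR1_STORE_DX']
```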
MOV BX,MA doesn't parallelize.
STR LX and MOV -1,AX parallelize.
MOV AX,LX and ADD AX,DX parallelize.
ADD AX,BX and CAD LX,IP parallelize.
POL doesn't parallelize.

MOVE_PRE_DEC:       ; dst -- dst ; needs DX=src CX=count
    sub 1,CX        ; decrement count
    flg CX,CF       ; CF= nonzero?
    mov -1,AX       ; AX= value to be added to SRC and DST pointers
    add AX,DX
    add AX,BX
    mov DX,MA
    ldr LX
    not CF          ; CF= zero?
    mov BX,MA
    str LX
    mov AX,LX       ; LX= offset back to this primitive
    cad LX,IP       ; if CF then move IP + LX to IP
    pol

Our MOVE_PRE_DEC primitive moves a block of data using pre-decrement. This parallelizes into nine opcodes.
SUB 1,CX doesn't parallelize.
FLG CX,CF and MOV -1,AX parallelize.
ADD AX,DX doesn't parallelize.
ADD AX,BX and MOV DX,MA parallelize.
LDR LX and NOT CF parallelize.
MOV BX,MA doesn't parallelize.
STR LX and MOV AX,LX parallelize.
CAD LX,IP doesn't parallelize.
POL doesn't parallelize.

Our MOVE_POST_INC and MOVE_PRE_DEC primitives are somewhat slow. They are heavy on group-M instructions because they are moving data between memory locations. Adding -1 to IP with CAD LX,IP is a trick for making a primitive iterate over itself.
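The CAD LX,IP trick above can be sketched as a Python loop over a two-cell thread: while the count is nonzero, the primitive adds -1 to IP so the inner interpreter fetches the same primitive again, moving one word per pass. The interpreter loop and names are illustrative; only the overall control flow comes from the listing.

```python
# A sketch of MOVE_POST_INC re-executing itself via CAD LX,IP:
# adding -1 to IP makes the inner interpreter fetch the same
# primitive again until the count (CX) reaches zero.

def move_post_inc(mem, src, dst, count):
    thread = ["MOVE_POST_INC", "DONE"]
    ip = 0
    while thread[ip] != "DONE":
        op = thread[ip]; ip += 1
        if op == "MOVE_POST_INC":
            count -= 1              # sub 1,CX
            mem[dst] = mem[src]     # ldr LX / str LX through MA
            src += 1; dst += 1      # add AX,DX / add AX,BX with AX=1
            if count != 0:
                ip -= 1             # cad LX,IP with LX=-1: run again
    return mem

mem = [1, 2, 3, 0, 0, 0]
result = move_post_inc(mem, 0, 3, 3)
print(result)                       # [1, 2, 3, 1, 2, 3]
```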
FAST_NIP:           ; a b -- b ; only 2 opcodes (compared to 3 for NIP) but doesn't set DX
    pls
    nxt

NIP:                ; a b -- b ; hold A in DX
    pls
    ldr LX
    mov LX,DX
    nxt

FAST_RIP:           ; a b c -- b c ; 5 opcodes (compared to 6 for RIP) but doesn't set DX
    pls
    ldr LX
    mov S0,MA
    str LX
    nxt

RIP:                ; a b c -- b c ; hold A in DX
    pls
    ldr LX
    mov LX,AX       ; AX= B
    mov S0,MA
    ldr LX
    mov LX,DX       ; DX= A
    str AX
    nxt

DROP:               ; a -- ; hold A in DX
    mov BX,DX
    drop
    nxt

DUP:                ; a -- a a ; hold A in DX
    mov BX,DX
    dup
    nxt

OVER:               ; a b -- a b a ; hold B in DX
    mov BX,DX
    dup
    mov S1,MA
    ldr LX
    mov LX,BX
    nxt

SWAP:               ; a b -- b a ; hold B in DX
    mov BX,DX
    mov S0,MA
    ldr LX
    str BX
    mov LX,BX
    nxt

ROT:                ; a b c -- b c a ; hold C in DX
    mov S0,MA
    ldr LX
    mov LX,AX       ; AX= B
    mov S1,MA
    ldr LX          ; LX= A
    str AX          ; store B at S1
    mov S0,MA
    mov BX,DX
    str BX          ; store C at S0
    mov LX,BX       ; move A to TOS
    nxt

NROT:               ; a b c -- c a b ; hold A in DX
    mov S0,MA
    ldr LX
    mov LX,AX       ; AX= B
    mov S1,MA
    ldr LX
    mov LX,DX       ; DX= A
    str BX          ; store C at S1
    mov S0,MA
    str DX          ; store A at S0
    mov AX,BX       ; move B to TOS
    nxt

Note that NIP RIP and DROP set DX to the discarded value. Also, DUP OVER SWAP ROT and NROT all set DX to the value at S0. We may have primitives that can take advantage of having DX already loaded with this data, and we can save a memory access. Setting DX to this value doesn't cost anything because the MOV BX,DX gets parallelized (the exceptions are NIP and RIP). Note that SWAP is more efficient than OVER --- this is backwards from most Forth systems, in which SWAP and ROT are slow. This is because the TOYF has plenty of registers, and instructions that use different registers can parallelize.

PLUS:               ; a b -- a+b
    pls
    ldr LX
    add LX,BX
    nxt

MINUS:              ; a b -- a-b
    neg BX
    pls
    ldr LX
    add LX,BX
    nxt

DX_PLUS:            ; b -- a+b ; assumes A is in DX already
    mov BX,AX
    add AD,BX
    nxt

DX_MINUS:           ; b -- a-b ; assumes A is in DX already
    neg BX
    mov BX,AX
    add AD,BX
    nxt

Our PLUS and MINUS primitives are typically compiled for + and - in the source-code.
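The register-transfer steps of ROT can be checked against its stack comment with a small Python model, again with BX as the top-of-stack register and a list standing in for the S0/S1 memory slots (S0 is the topmost memory slot). This is my own sketch of the semantics, not TOYF code.

```python
# A Python model of ROT ( a b c -- b c a ; hold C in DX ).
# BX is the TOS register; s[-1] models S0 and s[-2] models S1.

def rot(bx, s):
    lx = s[-1]          # mov S0,MA / ldr LX ; LX= B
    ax = lx             # mov LX,AX
    lx = s[-2]          # mov S1,MA / ldr LX ; LX= A
    s[-2] = ax          # str AX    ; store B at S1
    dx = bx             # mov BX,DX
    s[-1] = bx          # str BX    ; store C at S0
    bx = lx             # mov LX,BX ; move A to TOS
    return bx, s, dx

result = rot(3, [1, 2])
print(result)           # (1, [2, 3], 3)  i.e. stack 1 2 3 -- 2 3 1, DX=3
```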
Our DX_PLUS and DX_MINUS primitives are compiled when we know the A value is already in DX. Note that the PLS parallelizes, so it doesn't cost anything. PLUS and MINUS are each three clock cycles, whereas DX_PLUS and DX_MINUS are two and three clock cycles respectively. This isn't much of an improvement, so it is hardly worth the effort. An optimizing Forth compiler can generate slightly better code than a traditional non-optimizing Forth compiler. The TOYF is pretty efficient though, so the non-optimizing Forth compiler should generally be adequate. Even if a primitive is long, if it parallelizes well, then it will take few clock cycles to execute. In most primitives there is at least some parallelization. Opcodes packing two instructions halve the code size and double the speed --- this is pretty common. Opcodes packing three instructions are, of course, super-efficient. :-)
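The speed-up from packing can be put in back-of-the-envelope terms: if every opcode takes one clock cycle and packs k instructions, a primitive of n instructions runs in ceil(n/k) cycles. The numbers below are illustrative, not measured on hardware.

```python
# Rough cycle counts for an n-instruction primitive at a given average
# packing density (instructions per opcode, one opcode per cycle).
from math import ceil

def cycles(n_instructions, packing):
    return ceil(n_instructions / packing)

for packing in (1, 2, 3):
    print(packing, cycles(12, packing))  # 12 cycles, then 6, then 4
```

Packing two instructions per opcode halves the cycle count, which is why even modest parallelization pays off.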