TOYF processor design --- by Hugh Aguilar --- copyright March 2018 contact: hughaguilar96@gmail.com Abstract: The TOYF (Toy Forth) processor is less complicated than my 65ISR processor design, and more Forth-centric. The intention is that the TOYF would be easier to implement in HDL than the 65ISR. The TOYF is somewhat similar to the MiniForth at Testra that is also 16-bit Harvard (built on the Lattice isp1048 PLD). The MiniForth was later upgraded to an FPGA and the name changed to RACE. I wrote the MFX, the development system for the RACE --- the TOYF development system is based on MFX. The TOYF has 125 instructions, which is quite a lot, but they are all simple single-cycle instructions that don't do much. The TOYF does out-of-order execution, so as many as three instructions can execute in one clock cycle. The opcode fields are: 6-bit group-A (32-bit ALU), 5-bit group-M (16-bit ALU), 5-bit group-M (no ALU). The 65ISR supports 1-bit data and 256-byte buffers, so it is oriented toward I/O handling. The 65ISR-abu addresses 16MB of memory, so it is a better choice for applications that need to buffer entire files. The TOYF is designed to support the PID algorithm for motion-control. It is also fun because it is Forth rather than C. The TOYF doesn't have interrupts --- a paced loop would typically be used --- this is adequate for motion-control. The TOYF would be practical for I/O-heavy applications if it were a dual-core system. This could be expensive though. The TOYF provides better support for the main-program than the 65ISR does because it has Forth's traditional two stacks. A language standard limits us to a few primitives. TOYF has an assembly-language though, so there can be a lot of primitives. Classic Forth processors (B16, RTX2000, BUGS18, etc.) have a very few primitives. The TOYF can have hundreds of primitives! The TOYF also supports local variables and quotations --- the only other Forth processor with locals is the MiniForth/RACE. The TOYF quotations can access the parent function's locals, despite the fact that the HOF has locals of its own. Section 1.) the architecture We have the following registers (those that access data-memory have an Rx alternate name): CF 1-bit carry flag CF should be 0 prior to SBC of low bytes (like the MC6800 not the 6502) VF 1-bit overflow flag PC 16-bit program counter this is the machine-code (the only register that points to code-memory) AX R0 16-bit general purpose BX R1 16-bit general purpose this is the top value of the data-stack CX R2 16-bit used in countdown loops and list loops, and multiplication DX R3 16-bit general purpose this is sometimes used to hold the second value of the data-stack EX R4 16-bit used in list loops, and multiplication IP R5 16-bit instruction pointer this is the Forth threaded-code HF R6 5-bit local frame pointer of HOF high-byte is $00, low-byte is 010xxxxx --- this holds LF during a quotation LF R7 5-bit local frame pointer high-byte is $00, low-byte is 010xxxxx --- this is the local-frame pointer RS R8 5-bit return stack pointer high-byte is $00, low-byte is 010xxxxx --- this is the return-stack pointer SS R9 5-bit single stack pointer high-byte is $00, low-byte is 011xxxxx --- this is the data-stack for single-cell data LX RA 16-bit primarily for use by LDR most memory reads use LX as the destination MA RB 16-bit memory access pointer all memory loads and stores use MA as the address The processor is Harvard Architecture. Code-memory and data-memory can be accessed concurrently in the same clock cycle. Both have a 16-bit data-bus and are word addressable. Code-memory has a 15-bit address-bus and data-memory has a 16-bit address-bus. I/O ports and important global variables can be in the lower 64 words of zero-page. The two stacks are above that. An FPGA that has 128 words (256 bytes) of RAM internal will include all of the above in RAM. Presumably the code-memory (32KW max) would be internal to the FPGA so that it is fast. Forth processors that do Forth instructions such as DUP SWAP + etc. in one clock cycle, typically have the entire data-stack in registers. This allows them to push and pull data to the data-stack in parallel with the operation, such as addition for + etc.. This requires the data-stack to be small, such as 8 elements. This can result in stack-overflow causing hard-to-find bugs. Because they don't support local variables, they need all their data on the data-stack, or in global variables. This isn't a bad design for small programs, but it is error-prone in large programs. This is especially error-prone if code-libraries are used, so you don't know how deep into the data-stack your sub-functions go. I would not use such a processor unless the program was quite small and could be analyzed thoroughly for data-stack usage. The TOYF provides 32 elements for data-stack and return-stack each. This is more than enough for any practical program. It is possible that some recursive programs could overflow one or the other stack, but this is very unlikely. The TOYF uses BX as TOS. The TOYF uses DX as SOS sometimes (I would predict about 1/3 of the time). This helps the speed a lot. Registers use a lot of FPGA resources --- the TOYF keeps the data-stack in memory to allow a smaller FPGA to be used. There are no instructions that access two adjacent memory locations, so the processor is neither little-endian or big-endian. As a convention though, little-endian should be used, except for the Forth data stack that is big-endian traditionally. Code-memory has a 15-bit address-bus (32KW). The cfa has the high-bit clear for primitives, so 32KW is all we can address. Data-memory has a 16-bit address-bus (64KW) and a 16-bit data-bus, and the data registers are all 16-bit. The opcodes are 16-bit wide. We have group-A (64 instructions), group-B (32 instructions) and group-M (32-instructions). If the opcode were upgraded to 18-bit wide, group-B and group-M could each be upgraded to 64 instructions. This is a possibility for a super version of the TOYF --- this might be useful for an application-specific FPGA. Data-memory has a 16-bit data-bus (64KW). We address words not bytes! MOV 1,AX ADD AX,BX advances BX to the next 16-bit word. Forth threaded-code must be in the upper 32KW because the cfa has the high-bit set for colon words. Code-memory and data-memory are 32KW and 64KW respectively --- or 64KB and 128KB in terms of bytes. Each opcode has fields, with an instruction in each field. All of these instructions execute concurrently in one clock cycle. The assembler rearranges the instructions and packs them into the fields inserting NOP instructions as necessary. The goal is to minimize how many NOP instructions get inserted while guaranteeing that the program does the same thing as it would if we had one instruction per opcode in the same order that the instructions appeared in the source-code. The opcode has these fields: MMMMMAAAAAABBBBB FEDCBA9876543210 these are the bit positions of the fields There are three fields: MMMMM AAAAAA BBBBB We are executing up to three instructions per clock cycle. To help the assembler pack instructions into opcodes efficiently (minimum NOP instructions inserted), the programmer should hold data in a register for a while before using that register. Don't modify a register and then immediately use that register for something. Try to interleave instructions that modify different registers as these instructions can parallelize. This is very similar to Pentium programming in which UV-either and V-specific instructions were interleaved. The TOYF is more complicated though because there are three "pipes" (opcode fields) rather than just two. The TOYF arranges the out-of-order execution at compile-time though, whereas the Pentium does this at run-time. Not counting NOP, there are 31 MMMMM instructions, 63 AAAAAA instructions, and 31 BBBBB instructions. So, the TOYF has 125 instructions total, which is a lot, but this shouldn't make implementation unreasonably difficult. The TOYF has simple instructions that execute in one clock cycle, whereas the 65ISR has complicated multi-cycle instructions. These are completely different designs --- the TOYF is RISC and the 65ISR is CISC --- the 65ISR is a throwback to the 1980s. The TOYF is a throwback to the 1990s (it is loosely derived from the MiniForth that Testra came out with in 1995). Upon start-up, IP is set to zero. Use BCS to initialize it. Upon start-up, PC is set to 3. This is the start-up code just past the NEXT code. Upon start-up, RS and SS are set to zero. They can't be modified except incremented and decremented. Section 2.) the instruction set There are no addressing modes. All of the instructions are inherent. None of the opcodes have an operand. Most processors have an addressing-mode in which the operand is the address in memory to access. TOYF doesn't have that. The assembler packs the instructions together into the opcodes. TOYF has out-of-order execution. The assember is single-pass, which is typical of Forth assemblers. To assemble the current instruction, the following rules are followed: 1.) The current instruction has to be assembled after any label that appeared previously in the source-code. 2.) The current instruction has to be assembled after any instruction that modifies a register(s) that it uses. 3.) The current instruction has to be assembled after any instruction that modifies a register(s) that it modifies. 4.) The current instruction has to be concurrent with or after any instruction that uses a register(s) that it modifies. To assemble the current instruction, this is done: A.) Search back into the machine-code as far as possible to find an opcode that meets the four rules above. B.) Starting at that opcode, search forward to find an opcode with an empty group-field appropriate for our instruction. C.) If there is no empty slot available, then append a new opcode onto the end of the machine-code D.) Assemble our current instruction in the opcode we have obtained above. If we have ADD AX,BX already assembled, then a MOV 0,CF would have to assemble in an opcode after it. Rule #3 says an instruction that modifies a register (CF) has to be after a previous instruction that modifies that register. If we have MOV 0,AX already assembled, then a MOV 0,CF could assemble anywhere. The MOV 0,AX is irrelevant. It could assemble prior to the MOV 0,AX or in the same opcode or in an opcode afterward. So long as the four rules are followed, instructions may be assembled out of order to how they appeared in the source-code. If we have SUM BX,AX already assembled, then a MOV 0,CF could assemble in the same opcode or in an opcode afterward. The MOV 0,CF could not assemble in an opcode prior to the SUM BX,AX however, because that would violate rule #4. Rule #4 says that the current instruction (MOV 0,CF) has to assemble after or with any instruction that uses CF. If an instruction appears in the source-code after any instruction that could modify PC (BCS, NXT, NXP, etc.), then it will not pack with or before that PC-modifier. This means that code can go after the BCS and before the NXT. If the assembler were multi-pass, then rule #3 could be modified to allow the current instruction to be assembled prior to any instruction that modifies any register(s) that it modifies if no other instruction uses that register. This might allow for better packing in some circumstances. Writing a multi-pass assembler is above my pay grade however. The single-pass assembler provides pretty good packing. This is all that I did at Testra when I wrote MFX for the MiniForth. I was getting paid $10/hour at Testra. For that, you get a single-pass assembler that follows simple rules. The assembly-language programmer doesn't have to worry about any of this out-of-order complexity. You just assume that your program does the same thing as if the instructions were assembled one per opcode in the same order as they appeared in your source-code. Don't let the out-of-order execution freak you out! The complexity is under the hood. This is an example primitive: DX_MINMAX: ; a b -- max ; sets DX= min mov BX,AX mov S0,MA ldr LX mov LX,DX ; DX= a mov AX,LX ; LX= b mov 0,CF ; the TOYF's SBC needs CF set to 0 as done in the MC6800, not to 1 as done in the popular 6502 sbc AX,BX ; CF= a > b mov 1,AX ; AX= offset to branch to next primitive mov LX,BX ; BX= b add IP,AX not CF ; CF= a <= b bcs ; if DX is minimum and BX is maximum, then we are done mov DX,AX xch AX,BX ; they are backwards, so fix them before NXT ; we don't have an XCH DX,BX instruction anymore nxt group-A group-B group-M 1st opcode: MOV BX,AX MOV 0,CF MOV S0,MA ; MOV 0,CF moved all the way back here because it has no dependencies 2nd opcode: LDR LX ; LDR LX often doesn't pack well because everything after it depends on LX 3rd opcode: MOV 1,AX MOV LX,DX MOV AX,LX ; it is legal to use a register and store into that register concurrently 4th opcode: ADD IP,AX SBC AX,BX ; BX gets set, but all we needed was CF --- the BX result was unneeded 5th opcode: MOV LX,BX 6th opcode: NOT CF BCS ; BCS will terminate the primitive if a<=b 7th opcode: MOV DX,AX 8th opcode: MOV BX,AX MOV AX,BX NXT ; this will only execute if a>b Typical efficiency is to pack 2 instructions per opcode. Some primitives are heavy on group-M instructions and don't pack this efficiently. Sometimes you get better efficiency. In the following instruction descriptions, the | separates events that are done concurrently. These are group-A instructions that modify AX CX: nop mov LX,AX ; move LX to AX mov BX,AX ; move BX to AX mov DX,AX ; move DX to AX mov EX,AX ; move EX to AX sum EB,CA ; if low-bit of DX, then move CX:AX + EX:BX to CX:AX ; used in multiplication vng AX ; if VF then move 0 - AX to AX ; used by SMULT to fix the sign afterward vng CA ; if VF then move 0 - CX:AX to CX:AX ; used by DX_SMULT32 to fix the sign afterward add DX,AX ; move AX + DX to AX shl CA ; shift CX:AX left 1 bit ; used in division to shift numerator left sub DB,CA ; move CX:AX - DX:BX to CX:AX ; used in division to do trial subtraction cad DB,CA ; if CF then move CX:AX + DX:BX to CX:AX ; used in division to undo trial subtraction and CF,AX ; move AX<>0 -and- CF to AX ; bit-0 is the result, and the rest are zero'd ior CF,AX ; move AX<>0 -ior- CF to AX ; bit-0 is the result, and the rest are zero'd xor CF,AX ; move AX<>0 -xor- CF to AX ; bit-0 is the result, and the rest are zero'd set CF,LX,AX ; if CF then move LX -ior- AX to AX else move LX -and- ~AX to AX ; AX should be the bit-mask and LX the datum not AX ; move -1 -xor- AX to AX cng AX ; if CF then move 0 - AX to AX sh2 0,AX ; shift AX left 2 times, move %00 into low-bits sh2 1,AX ; shift AX left 2 times, move %01 into low-bits sh2 2,AX ; shift AX left 2 times, move %10 into low-bits sh2 3,AX ; shift AX left 2 times, move %11 into low-bits ; SH2 can be used to construct literal values shl AX ; shift AX left 1 time, move 0 into low-bit ; this multiplies AX by 2 shr 1,CA ; shift CX:AX right 1 time, move 0 into high-bits ; this divides CX:AX by 2 shr 4,CA ; shift CX:AX right 4 times, move 0 into high-bits ; SHR is used to divide the product by unity add IP,AX ; move AX + IP to AX add LX,IP,AX ; move LX + IP to AX add RS,AX ; move AX + RS to AX add LF,AX ; move AX + LF to AX add LX,AX ; move AX + LX to AX sub LX,AX ; move AX - LX to AX and LX,AX ; move LX -and- AX to AX ior LX,AX ; move LX -ior- AX to AX xor LX,AX ; move LX -xor- AX to AX byt AX ; move AX -and- $FF to AX swp AX ; exchange low byte and high byte in AX ; BYT and SWP are useful for accessing bytes mov MA,CX ; move MA to CX ; allows MA to hold CX during multiplication mov LX,CX ; move LX to CX ; used by LIST_START etc. xch CX,AX ; exchange CX and AX values ; also called: XCH AX,CX sub 1,CX ; move CX - 1 to CX ; used in countdown loops mov -1,AX ; move -1 to AX ; -1 is needed for primitives that conditionally iterate over themselves mov 0,AX ; move 0 to AX mov 1,AX ; move 1 to AX ; also called: MOV ^0,AX mov 2,AX ; move 2 to AX ; also called: MOV ^1,AX mov 3,AX ; move 3 to AX mov 4,AX ; move 4 to AX ; also called: MOV ^2,AX mov 5,AX ; move 5 to AX mov 6,AX ; move 6 to AX mov 7,AX ; move 7 to AX mov 8,AX ; move 8 to AX ; also called: MOV ^3,AX ; literals from 1 to 8 are useful for local-variable offsets mov $10,AX ; move $10 to AX ; also called: MOV ^4,AX ; these with 1 bit set are masks for accessing bits in I/O mov $20,AX ; move $20 to AX ; also called: MOV ^5,AX mov $40,AX ; move $40 to AX ; also called: MOV ^6,AX mov $80,AX ; move $80 to AX ; also called: MOV ^7,AX mov $100,AX ; move $100 to AX ; also called: MOV ^8,AX mov $200,AX ; move $200 to AX ; also called: MOV ^9,AX mov $400,AX ; move $400 to AX ; also called: MOV ^A,AX mov $800,AX ; move $800 to AX ; also called: MOV ^B,AX mov $1000,AX ; move $1000 to AX ; also called: MOV ^C,AX mov $2000,AX ; move $2000 to AX ; also called: MOV ^D,AX mov $4000,AX ; move $4000 to AX ; also called: MOV ^E,AX mov $8000,AX ; move $8000 to AX ; also called: MOV ^F,AX undefined undefined These are group-B instructions that modify BX DX EX VF CF: nop mov LX,DX ; move LX to DX mov AX,DX ; move AX to DX mov BX,DX ; move BX to DX add AX,DX ; move DX + AX to DX add AX,BX ; move BX + AX to BX add AD,BX ; move AX + DX to BX, move carry to CF mov LX,BX ; move LX to BX mov AX,BX ; move AX to BX mov DX,BX ; move DX to BX mov LX,EX ; move LX to EX bit EX ; shift EX left 1 bit, move ~CF into low-bit of EX ; used in division with EX = quotient mov CF,BX ; move CF to BX ; sets BX= 1 if CF, else BX= 0 sgn DX,BX ; if BX < 0 then move 0 - BX to BX | if DX < 0 then move 0 - DX to DX | set VF to -xor- original sign bits adj DX,EB ; shift DX right, move 0 into high-bit | shift EX:BX left, move 0 into low-bit, move high-bit into CF ror BX ; shift BX right, move CF into high-bit, move low-bit into CF rol BX ; shift BX left, move CF into low-bit, move high-bit into CF adc LX,BX ; move LX + BX + CF to BX, move carry into CF, set VF if LX and AX had same high-bit and it changes sbc AX,BX ; move BX - AX - CF to BX, move borrow into CF, set VF if BX and AX had same high-bit and it changes neg BX ; move 0 - BX to BX mov 0,CF ; move 0 to CF xor VF,CF ; move VF -xor- CF to CF ; set CF=0 first to just move VF to CF not CF ; move -not- CF to CF cmv AX,DX ; if CF then move AX to DX flg AX,CF ; move AX<>0 to CF ; set CF if AX is nonzero flg DX,CF ; move DX<>0 to CF ; set CF if DX is nonzero not BX,CF ; move BX=0 to CF ; set CF if BX is zero flg CX,CF ; move CX<>0 to CF ; set CF if CX is nonzero ; used in LOOP mov CN,CF ; move CX high-bit to CF ; set CF if CX is negative mov BN,CF ; move BX high-bit to CF ; set CF if BX is negative undefined undefined These are group-M instructions that modify LX PC MA IP LF RS SS: nop mov AX,LX ; move AX to LX ; this allows EXECUTE and QEXECUTE to work ldr LX,BX ; load (MA) into LX | move BX to MA ; this is for moving blocks of data, usually from DX to BX ldr LX ; load (MA) into LX ldr LF ; load (MA) into LF str AX ; store AX to (MA) str BX ; store BX to (MA) str LX ; store LX to (MA) mov LF,HM ; move LF to HF and to MA mov AX,MA ; move AX to MA mov BX,MA ; move BX to MA mov CX,MA ; move CX to MA mov DX,MA ; move DX to MA mov R0,MA ; move RS to MA | move CX to LX ; this is the address of the first item on the return-stack mov S0,MA ; move SS to MA ; this is the address of the second-of-stack (BX is top-of-stack) mov S1,MA ; move SS + 1 to MA ; this is the address of the third-of-stack phs ; move SS - 1 to SS and to MA pls ; move SS to MA | move SS + 1 to SS phr ; move RS - 1 to RS and to MA plr ; move RS to MA | move RS + 1 to RS phr LX ; move RS - 1 to RS and to MA | load (MA) into LX ; for ENTER1 etc. rst LF ; load (MA) into LF | move AX to RS ; for LEAVE1 etc. lit ; move IP + 1 to IP and to MA ; preps literal value fetch lit BX ; store BX to (MA) | move IP + 1 to IP and to MA ; for literal primitives bcs ; if CF then ( move AX to IP and to MA | move NEXT to PC ) ; branch on CF set nxt AX ; move IP + 1 to IP and to MA | move NEXT to PC | store AX to (MA) ; for storing to memory nxt LF ; move IP + 1 to IP and to MA | move NEXT to PC | store LF to (MA) | move RS to LF ; for ENTER0 etc. nxt IP ; move IP + 1 to IP and to MA | move NEXT to PC | load (MA) into IP ; for EXIT nxt IP,LF ; move IP + 1 to IP and to MA | move NEXT to PC | load (MA) into IP | move HF to LF ; for QEXIT nxt ; move IP + 1 to IP and to MA | move NEXT to PC ; for ending primitives nxp ; if high-bit of LX clear, then move LX to PC, else move RS - 1 to RS and to MA ; needs LX = cfa nxf ; store IP to (MA) | move LX to IP and to MA | move NEXT to PC ; for colon words Group-M does not need an ALU because the only arithmetic done is incrementing and decrementing of registers. Group-A needs a 32-bit ALU and Group-B needs a 16-bit ALU. We have some undefined instructions that can be used for application-specific purposes. Also, if more are needed, some of the existing instructions can be repurposed. For BCS the offset should be 0 to iterate over the same primitive, -1 to go back one primitive, etc.. An offset of 1 would be the same as NXT that adds 1 to IP internally. An offset of 2 would skip over 1 primitive, etc. (unintuitive!) MOV R0,MA also moves CX to LX for use by LIST_LOOP and SHORT_LIST_LOOP --- be careful of this if LX is already in use! macro EXECUTE_LX mov 1,AX ; every primitive starts with AX = 1 --- this speeds up some primitives by one clock cycle nxp ; either do primitive or prep pushing IP nxf ; push IP and do function endm Every primitive starts with AX = 1 because this is done in the EXECUTE_LX macro. We don't want to set CF inside of EXECUTE_LX because sometimes primitives pass data in CF to the next primitive. Examples of this are the NEXT_MATCH and LIST_LOOP primitives that pass CF to the POST_LIST primitive. Also, the last instruction of the primitive is usually group-B, so a group-B in EXECUTE_LX doesn't pack with it. Notice that there can't be any code between NXP and NXF because NXP may change the PC so nothing afterward will execute. If we get interrupts, they would only happen on the NXP instruction. They can't happen at arbitrary instructions because we sometimes have data under either the data-stack or return-stack pointer. NEXT: ; this is at address 0 of code-memory and is hard-coded into NXT and NXF ldr LX execute_LX Normally, every primitive ends with a NXT instruction. The NXT will hopefully pack with a group-A or group-B instruction nearby. Most primitives are dominated by group-M instructions, so packing more group-M instructions doesn't work very well. In some cases, a primitive is dominated by group-A and group-B instructions, so instead of NXT the following can be inlined: lit ldr LX execute_LX This is an example: ABS: ; n -- |n| ; absolute value mov BN,CF mov BX,AX lit cng AX ldr LX mov AX,BX execute_LX The LIT packs with surrounding code. The LDR LX packs with surrounding code. The MOV 1,AX and NXP in EXECUTE_LX pack with surrounding code (because MOV AX,BX is group-B and they are group-A and group-M). By comparison, if we used NXT it would pack with MOV AX,BX and it does LIT internally so that is parallelized as well, but the LDR LX in NEXT would cost an extra clock cycle because it doesn't pack with anything. Using EXECUTE_LX can save one clock cycle if both LIT and LDR LX pack, and the last instruction of the primitive is group-B. The downside is that the primitive is one more opcode in length as compared to NXT (because NXP and NXF each take an opcode). Note that we have multiple versions of NXT that internally do another group-M operation in parallel, to save one clock cycle. For example, instead of: STR AX NXT we have: NXT AX These are combination instructions (have to be special-cased by the assembler because otherwise they won't pack together): xch AX,LX ; MOV AX,LX and MOV LX,AX packed together ; also called XCH LX,AX xch AX,BX ; MOV AX,BX and MOV BX,AX packed together ; also called XCH BX,AX xch AX,DX ; MOV AX,DX and MOV DX,AX packed together ; also called XCH DX,AX xad AX,DX ; ADD AX,DX and MOV DX,AX packed together xad DX,AX ; ADD DX,AX and MOV AX,DX packed together xan LX,AX ; AND LX,AX and MOV AX,LX packed together xio LX,AX ; IOR LX,AX and MOV AX,LX packed together xxo LX,AX ; XOR LX,AX and MOV AX,LX packed together There are some others that can be done like this if needed. Section 3.) some basic example code LIT_23241 ; -- 23,241 ; binary: 0101,1010,1100,1001 ; inlining the NEXT code phs lit BX ldr LX mov 5,AX ; leftmost bits: 101 sh2 2,AX ; leftmost bits: 10 sh2 2,AX ; leftmost bits: 10 sh2 3,AX ; leftmost bits: 11 sh2 0,AX ; leftmost bits: 00 sh2 2,AX ; leftmost bits: 10 sh2 1,AX ; leftmost bits: 01 mov AX,BX execute_LX Literal value primitives follow the pattern of LIT_23241 shown above --- LIT BX was provided to boost their speed. Note that LIT BX and LDR LX each pack with one of the ,AX instructions, so they cost nothing. The MOV AX,BX is group-B so it packs with the MOV 1,AX and NXP in EXECUTE_LX for good speed. Small values are faster than large values because there are fewer ,AX instructions needed, and the ,AX instructions don't pack with each other at all. macro DUP ; n -- n n phs str BX end macro DROP ; n -- pls ldr LX mov LX,BX endm We have some producers that leave SOS in DX rather than in memory. Also, we have some consumers that expect SOS in DX rather than in memory. The compiler has to notice the possibility of using these as part of its optimization. QEXECUTE: ; cfa -- mov LF,HM ; move LF to HF and to MA ldr LF ; load LF with parent function's LF EXECUTE: ; cfa -- mov BX,AX drop mov AX,LX execute_LX LOCAL1_QEXECUTE: ; this executes a quotation cfa held in local1 ; mov n,AX ; AX= offset to local --- needed for any offset other than 1 mov LF,HM ; move LF to HF and to MA ldr LF ; load LF with parent function's LF add LF,AX mov AX,MA ldr LX execute_LX A function with locals starts with one of ZERO1 ZERO2 etc. if there are any zero-initialized locals. It then does ENTER2 ENTER1 or ENTER0 for how many stack-initialized locals there are. ENTER0 is for when there are some zero-initialized locals but no stack-initialized locals. ENTER0 is not needed when there are no locals at all. Nothing is needed in this case. ZERO2: ; -- mov 0,AX phr str AX ZERO1: ; -- mov 0,AX phr nxt AX ENTER2: ; a b -- phr str BX ; transfer B pls phr LX str LX ; transfer A pls phr LX mov LX,BX ; -- nxt LF ; push old LF to the return-stack, and move RS to LF as the new local-frame ENTER1: ; a -- phr str BX ; transfer A pls phr LX mov LX,BX ; -- nxt LF ; push old LF to the return-stack, and move RS to LF as the new local-frame ENTER0: ; -- phr nxt LF ; push old LF to the return-stack, and move RS to LF as the new local-frame Every colon word ends with LEAVEx corresponding to how many locals there were, or just EXIT if no locals. LEAVE1: ; -- ; assumes that there is one local on the stack ; mov n,AX ; AX= size of local frame --- needed for any size other than 1 plr add RS,AX ; prep RST LF rst LF ; pull old LF and discard local variables EXIT: ; -- plr nxt IP ; restore the old IP QEXIT: ; -- ; assumes that this is a quotation called with QEXECUTE plr nxt IP,LF ; restore the old IP and restore LF Originally, QEXECUTE would push the HOF's LF to the return-stack, and QEXIT would pull it off. Now QEXECUTE moves the HOF's LF to the HF register (HF means HOF's LF) and QEXIT moves it back. This is faster because we don't have memory access. This means that quotations can'be be nested though. A quotation can't call a HOF that executes another quotation. This limitation should never be a problem. The compiler can do error-checking to catch this at compile-time LESS_THAN: ; x y -- xx? xor VF,CF ; adjust for sign ; this would not be used in the unsigned version mov CF,BX lit ldr LX execute_LX GREATER_THAN: ; x y -- x>y ; a NOT of this is LESS_THAN_OR_EQUAL mov 0,CF pls ldr LX mov LX,AX ; AX=x BX=y sbc AX,BX ; BX= y-x mov BN,CF ; CF= x>y? xor VF,CF ; adjust for sign ; this would not be used in the unsigned version mov CF,BX lit ldr LX execute_LX All of the comparison and test primitives should be straight-forward. BRANCH-1: ; -- ; depends on AX=1 already flg AX,CF ; CF= 1 mov -1,AX ; AX= offset back to previous primitive add IP,AX bcs ; NXT not needed because CF certainly set 0BRANCH-1: ; flag -- not BX,CF ; CF= zero? drop mov -1,AX ; AX= offset back to previous primitive add IP,AX bcs nxt BRANCH: ; -- ; depends on AX=1 already flg AX,CF ; CF= 1 lit ldr LX ; LX= offset add LX,IP,AX bcs ; NXT not needed because CF certainly set 0BRANCH: ; flag -- lit ldr LX ; LX= offset not BX,CF ; CF= zero? drop add LX,IP,AX bcs nxt CF0BRANCH: ; -- ; branches if CF=0 lit ldr LX ; LX= offset not CF ; CF= zero? add LX,IP,AX bcs nxt Some of our logic primitives, such as with 1-bit I/O ports, returns the flag in CF --- use CF0BRANCH for this. Passing a flag in CF is faster than pushing it to the data-stack and then pulling it off the data-stack again. GREATER_THAN_BRANCH: ; x y -- mov 0,CF pls ldr LX mov LX,AX ; AX=x BX=y sbc AX,BX ; BX= y-x mov BN,CF ; CF= x>y? xor VF,CF ; adjust for sign ; this would not be used in the unsigned version lit ldr LX ; LX= offset to branch drop ; this is like GREATER_THAN except that we don't store CF into BX, but instead branch on it add LX,IP,AX bcs nxt TRUE: ; -- true ; depends on AX=1 already dup mov AX,BX nxt DX_TRUE: ; -- true ; sets DX= SOS ; depends on AX=1 already mov BX,DX mov AX,BX nxt Our TRUE is 1 --- unlike in ANS-Forth in which it is -1 Note that having AX=1 already done, does not help, because the MOV 1,AX would parallelize anyway. TO_R: ; val -- ; this is: >R mov BX,AX drop phr nxt AX R_FETCH: ; -- val ; this is: R@ mov R0,MA ; also moves CX to LX, but we don't care because LX is not in use yet ldr LX dup mov LX,BX nxt R_FROM: ; -- val ; this is: R> plr ldr LX dup mov LX,BX nxt For the most part, local variables should be used. >R R@ R> can also be used though. DX_DUP: ; n -- n ; sets DX= n mov BX,DX nxt DX_PREP: ; a b -- b ; sets DX= a ; preps a primitive that needs SOS in DX pls ldr LX mov LX,DX nxt NIP: ; a b -- b pls nxt RIP: ; a b c -- b c pls ldr LX mov LX,AX mov S0,MA nxt AX OVER: ; a b -- a b a mov S0,MA ldr LX dup mov LX,BX nxt DX_OVER: ; a b -- a a ; sets DX= b mov S0,MA ldr LX mov BX,DX mov LX,BX nxt TUCK: ; a b -- b a b mov S0,MA ldr LX mov LX,AX ; AX= a str BX phs nxt AX DX_TUCK: ; a b -- b b ; sets DX= a mov S0,MA ldr LX mov LX,DX str BX nxt SWAP: ; a b -- b a mov BX,AX mov S0,MA ldr LX mov LX,BX nxt AX SLOW_SWAP: ; a b -- b a mov S0,MA ldr LX str BX mov LX,BX nxt SLOW_SWAP was the original version that is obsolete. In SWAP we use NXT AX to save one clock cycle. Note that the MOV BX,AX parallelizes, so it doesn't cost anything. The idea is to have fewer group-M instructions that don't parallelize with each other. Use MOV BX,AX that is group-A. ROT and NROT also both now use NXT AX for greater efficiency from their previous versions. DX_SWAP: ; a b -- a ; sets DX= b mov BX,DX pls ldr LX mov LX,BX nxt ROT: ; a b c -- b c a mov BX,AX ; AX= c mov S0,MA ldr LX mov LX,BX ; BX= b mov S1,MA ldr LX ; LX= a str BX mov LX,BX mov S0,MA nxt AX DX_ROT: ; a b c -- b a ; sets DX= c mov BX,DX ; DX= c pls ldr LX mov LX,AX ; AX= b mov S0,MA ldr LX ; LX= a mov LX,BX nxt AX NROT: ; a b c -- c a b mov BX,AX ; AX= c mov S1,MA ldr LX mov LX,BX ; BX= a mov S0,MA ldr LX ; LX= b str BX mov S1,MA mov LX,BX nxt AX DX_NROT: ; a b c -- c b ; sets DX= a mov S1,MA ldr LX mov LX,DX ; DX= a mov BX,AX ; AX= c pls ldr LX mov LX,BX ; BX= b nxt AX NOT: ; n -- ~flag ; 0= in ANS-Forth not BX,CF ; CF= zero? mov CF,BX nxt DX_DUP_NOT: ; n -- ~flag ; sets DX= N ; 0= in ANS-Forth lit mov BX,DX ldr LX not BX,CF ; CF= zero? mov CF,BX execute_LX FLAG: ; n -- flag ; 0<> in ANS-Forth not BX,CF not CF ; CF= nonzero? mov CF,BX nxt DX_DUP_FLAG: ; n -- flag ; sets DX= N ; 0<> in ANS-Forth lit mov BX,DX ldr LX flg DX,CF ; CF= nonzero? mov CF,BX execute_LX FLIP: ; n -- ~n ; effectively the same as: -1 XOR mov BX,AX not AX mov AX,BX lit ldr LX execute_LX Note that NOT isn't the same as FLIP. This has been a point of confusion in Forth for decades. FLIP is rarely needed --- I can't actually think of any use for it. In general, you want to normalize flags and then do logic on them. MINUS: ; a b -- a-b neg BX pls ldr LX mov 0,CF adc LX,BX nxt PLUS: ; a b -- a+b pls ldr LX mov 0,CF adc LX,BX nxt DX_MINUS: ; b -- DX-b ; expects the SOS to be in DX rather than in memory ; depends on AX=1 already flg AX,CF ; CF= 1 mov BX,AX lit cng AX ; AX= -BX ldr LX add AD,BX ; BX= DX-BX execute_LX DX_PLUS: ; b -- DX+b ; expects the SOS to be in DX rather than in memory mov BX,AX add AD,BX nxt ABS: ; n -- |n| ; absolute value mov BN,CF mov BX,AX lit cng AX ldr LX mov AX,BX execute_LX MIN: ; a b -- a|b ; unsigned minimum pls mov BX,DX ; DX= BX ldr LX mov LX,AX sbc AX,BX ; CF= AX > BX not CF ; CF = AX <= BX ; MAX is just like MIN except without this line lit cmv AX,DX ; sets DX= minimum ldr LX mov DX,BX execute_LX DX_MINMAX: ; a b -- max ; sets DX= min mov BX,AX mov S0,MA ldr LX mov LX,DX ; DX= a mov AX,LX ; LX= b mov 0,CF ; the TOYF's SBC needs CF set to 0 as done in the MC6800, not to 1 as done in the popular 6502 sbc AX,BX ; CF= a > b mov 1,AX ; AX= offset to branch to next primitive mov LX,BX ; BX= b add IP,AX not CF ; CF= a <= b bcs ; if DX is minimum and BX is maximum, then we are done mov DX,AX xch AX,BX ; they are backwards, so fix them before NXT ; we don't have an XCH DX,BX instruction anymore nxt Section 4.) some example code with global variales FETCH: ; adr -- val mov BX,MA ldr LX mov LX,BX nxt FETCH1: ; -- val ; global variable at address 1 ; mov n,AX ; AX= adr --- needed for any adr other than 1 dup mov AX,MA ldr LX mov LX,BX nxt STORE: ; val adr -- pls ldr LX ; LX= val mov BX,MA str LX drop nxt DX_STORE: ; adr -- ; expects DX= SOS mov DX,AX ; AX= val mov BX,DX ; DX= adr drop mov DX,MA nxt AX SLOW_STORE1: ; val -- ; global variable at address 1 ; mov n,AX ; AX= adr --- needed for any adr other than 1 mov AX,MA str BX drop nxt STORE1: ; val -- ; global variable at address 1 ; sets DX= adr ; mov n,AX ; AX= adr --- needed for any adr other than 1 mov AX,DX mov BX,AX drop mov DX,MA nxt AX SLOW_STORE1 refrains from using DX, but this slows it down to 5 clock cycles. STORE1 is 4 clock cycles. The global variables [0,8] are fastest, so they should generally be the I/O ports. They are still pretty slow --- accessing I/O is going to be a drag. Section 5.) some example code with moving blocks of data macro MOVE_WORD ; dst -- dst ; needs DX=src mov DX,MA ldr LX,BX str LX endm macro MOVE_WORD_POST_INC ; dst -- dst ; needs DX=src, AX=1 mov DX,MA ldr LX,BX add AX,DX add AX,BX str LX endm macro MOVE_WORD_PRE_DEC ; dst -- dst ; needs DX=src, AX=-1 add AX,DX add AX,BX mov DX,MA ldr LX,BX str LX endm If the size of the block is known at compile-time, the above macros can be used to make a primitive to move the block. If MOVE_WORD_PRE_DEC is used repeatedly, the STR LX at the end parallelizes with the ADD AX,DX at the front. MOVE_POST_INC: ; dst -- dst ; needs DX=src CX=count AX=1 ; CX can't be 0 sub 1,CX ; decrement count flg CX,CF ; CF= nonzero? mov DX,MA not CF ; CF= zero? ldr LX,BX add AX,DX add AX,BX mov 0,AX ; AX= offset back to ourself str LX add IP,AX bcs nxt MOVE_PRE_DEC: ; dst -- dst ; needs DX=src CX=count ; CX can't be 0 sub 1,CX ; decrement count flg CX,CF ; CF= nonzero? mov -1,AX ; AX= value to be added to SRC and DST pointers add AX,DX add AX,BX mov DX,MA not CF ; CF= zero? ldr LX,BX mov 0,AX ; AX= offset back to ourself str LX add IP,AX bcs nxt Our MOVE_POST_INC and MOVE_PRE_DEC primitives are somewhat slow. They are heavy on group-M instructions because they are moving data between memory locations. Section 6.) some example code with local variables LOCAL1_ADR: ; -- adr ; mov n,AX ; AX= offset to local --- needed for any offset other than 1 dup add LF,AX mov AX,BX nxt LOCAL1_FETCH: ; -- val ; mov n,AX ; AX= offset to local --- needed for any offset other than 1 dup add LF,AX mov AX,MA ldr LX mov LX,BX nxt LOCAL1_STORE: ; val -- ; mov n,AX ; AX= offset to local --- needed for any offset other than 1 add LF,AX mov AX,MA str BX drop nxt DX_DUP_LOCAL1_FETCH ; -- val ; sets DX= SOS ; mov n,AX ; AX= offset to local --- needed for any offset other than 1 add LF,AX mov AX,MA mov BX,DX ldr LX mov LX,BX nxt DX_LOCAL1_STORE: ; val -- ; expects SOS in DX rather than in memory LOCAL1_STORE: ; val -- ; mov n,AX ; AX= offset to local --- needed for any offset other than 1 add LF,AX mov AX,MA str BX mov DX,BX nxt LOCAL1_PLUS_STORE: ; val -- ; mov x,AX ; AX= offset to local --- needed for any offset other than 1 add LF,AX mov AX,MA ldr LX ; LX= current value mov 0,CF adc LX,BX str BX drop nxt DX_LOCAL1_PLUS_STORE: ; val -- ; expects SOS in DX rather than in memory ; mov x,AX ; AX= offset to local --- needed for any offset other than 1 add LF,AX mov AX,MA mov BX,AX ldr LX ; LX= current value add LX,AX mov DX,BX nxt AX LOCAL1_PLUS_STORE is 7 clock cycles, because nothing parallelizes. This is very inefficient! DX_LOCAL1_PLUS_STORE is 3 clock cycles --- this shows the advantage of getting SOS into DX as much as possible. LOCAL1_PRE_INC_FETCH: ; -- val ; depends on AX = 1 dup mov AX,BX ; BX= 1 ; mov x,AX ; AX= offset to local --- needed for any offset other than 1 add LF,AX mov AX,MA ldr LX mov 0,CF adc LX,BX ; BX= 1+val str BX nxt LOCAL1_POST_INC_FETCH: ; -- val ; depends on AX = 1 dup ; mov x,AX ; AX= offset to local --- needed for any offset other than 1 add LF,AX mov AX,MA ldr LX mov LX,BX ; mov 1,AX ; needed if the offset was anything other than 1 add LX,AX nxt AX In MFX (for the MiniForth) I had "ghost primitives." For these I saved the source-code, but didn't assemble them into machine-code. The first time they were used, they assembled into machine-code. Afterward, if they were used, I had the cfa already. All of these primitives that access local variables should be ghost primitives. Dozens or hundreds of them can be meta-compiled, but they won't use code-space unless they are actually used. Most programs won't need the primitives that are specific to more than four local variables. Section 7.) some example code with multiplication macro MULTIPLY ; multiply DX * BX setting CX:AX= product, LX= old EX, MA= old CX mov EX,AX mov AX,LX ; LX= old EX mov 0,AX mov AX,EX ; EX= 0 xch AX,CX ; CX= 0 mov AX,MA ; MA= old CX mov 0,AX ; AX= 0 rept 16 ; the instructions in here parallelize, so we have 1 clock cycle per iteration, for 16 total sum EB,CA ; if low-bit of DX, then move CX:AX + EX:BX to CX:AX adj DX,EB ; shift DX right | shift EX:BX left endr ; CX:AX= product endm DX_SMULT: ; y -- x*y ; needs DX = x ; signed sgn DX,BX ; this would not be used in DX_UMULT multiply ; needs DX= x, BX = y, sets CX:AX= product mov LX,EX ; EX= old EX mov MA,CX ; CX= old CX vng AX ; this would not be used in DX_UMULT mov AX,BX lit ldr LX execute_LX Use DX_SMULT for * in Forth. DX_SMULT32: ; y -- product_high ; needs DX = x, sets DX= product_low ; signed sgn DX,BX ; this would not be used in DX_UMULT32 multiply ; needs DX= x, BX = y, sets CX:AX= product vng CA ; this would not be used in DX_UMULT32 mov AX,DX ; DX= product_low xch CX,AX mov AX,BX ; BX= product_high mov LX,EX ; EX= old EX mov MA,CX ; CX= old CX nxt Use DX_SMULT32 for mixed-precision arithmetic, such as */ the scalar. DX_ UMULT64: ; y -- product ; needs DX = x ; assumes unity = 64 ; unsigned multiply ; needs DX= x, BX = y, sets CX:AX= product mov LX,EX ; EX= old EX mov MA,CX ; CX= old CX shr 4,CA shr 1,CA shr 1,CA ; all of these SHR effectively divide CX:AX by 64 mov AX,BX lit ldr LX execute_LX Use DX_SMULT64 when unity is 64 --- any power of 2 for unity is fast. DX_SUM32: ; y -- ; needs DX = x ; return: low high -- low high ; signed sgn DX,BX multiply ; needs DX= x, BX = y, sets CX:AX= product vng CA mov AX,BX ; BX= low xch CX,AX ; AX= high mov MA,CX ; CX= old CX mov LX,EX ; EX= old EX plr ; now R0 is the low-word mov R0,MA ldr LX mov 0,CF adc LX,BX str BX phr ; now R0 is the high-word again --- and MA is set to R0 mov AX,BX ; BX= high ldr LX mov 0,CF adc LX,BX str BX drop nxt Use DX_SMULT32 to approximate polynomials, summing up the 32-bit products in an accumulator on the return-stack. Afterward, divide the sum by unity for the result. A unity that is a power of 2 is fast. Note that we don't have MOV R1,MA so we need to use PLR and PHR to adjust RS. Adjusting RS like this is one reason we can't have interrupts after arbitrary instructions. It would be possible to have interrupts only after NXP though. Section 8.) some example code with division macro DIV_BIT ; needs CX:AX = numerator, DX:BX = denominator, sets EX= quotient, CX:AX= remainder shl CA ; CX:AX= 2*numerator (remainder) sub DB,CA ; CX:AX= remainder - denominator mov CN,CF ; CF= CX:AX negative? bit EX ; if negative, set quotient bit to 0, else set quotient bit to 1 cad DB,CA ; if negative, then undo trial subtraction endm DX_SDIV: ; denominator -- quotient ; needs DX = numerator, sets DX= remainder ; signed sgn dx,bx ; this would not be used in DX_UDIV mov EX,AX phr str AX ; push old EX to return-stack mov DX,AX xch CX,AX ; CX= numerator, AX= old CX phr str AX ; push old CX to return-stack mov 0,AX ; CX:AX= numerator shifted left 16 bits mov AX,DX ; DX:BX= denominator (high word set to zero) rept 16 div_bit ; sets EX= quotient, CX= remainder endr xch CX,AX ; AX= low-word of remainder, CX= high-word of remainder (should be 0 by this time) mov AX,DX ; DX= remainder mov EX,AX vng AX ; this would not be used in DX_UDIV mov AX,BX ; BX= quotient plr ldr LX mov LX,CX ; pull old CX from return-stack plr ldr LX mov LX,AX mov AX,EX ; pull old EX from return-stack nxt Our DX_SDIV takes a 16-bit numerator and converts it into a 32-bit numerator by shifting left by 16 bits. It is easy to write a function that takes a 32-bit numerator, such as provided by DX_SMULT32 above. Section 9.) some example code with countdown loops It is possible to implement DO LOOP in TOYF because we have an overflow flag (the MiniForth didn't have an overflow flag). I don't really like DO LOOP however, because it is very confusing for novices to learn, so I may not bother with this. Much more common will be a countdown loop as shown below: PRE_LOOP: ; count -- ; push old CX to return-stack, set CX= count, branch to POST_LOOP if zero mov BX,AX xch CX,AX ; CX= count, AX= old CX phr str AX ; push old CX to return-stack lit ldr LX ; LX= offset to POST_LOOP not BX,CF ; CF= zero? drop ; -- add LX,IP,AX bcs nxt LOOP: ; branch if CX is non-zero, then decrement CX lit ldr LX ; LX= offset to just past PRE_LOOP flg CX,CF ; CF= nonzero? sub 1,CX ; post-decrement CX add LX,IP,AX bcs nxt INC_LOOP: ; n -- n+1 ; depends upon AX = 1 add AX,BX lit ldr LX ; LX= offset to just past PRE_LOOP flg CX,CF ; CF= nonzero? sub 1,CX ; post-decrement CX add LX,IP,AX bcs nxt LOCAL1_INC_LOOP: ; -- ; mov x,AX ; AX= offset to local --- needed for any offset other than 1 add LF,AX mov AX,MA ; mov 1,AX ; AX= increment --- needed if a MOV x,AX was used above ldr LX ; LX= current value add LX,AX lit ldr LX ; LX= offset to just past PRE_LOOP flg CX,CF ; CF= nonzero? sub 1,CX ; post-decrement CX add LX,IP,AX bcs nxt AX POST_LOOP: ; -- ; pull CX from return-stack plr ldr LX mov LX,AX xch AX,CX ; CX= old CX pulled from return-stack nxt For a countdown loop, use: PRE_LOOP (offset to POST_LOOP) code... LOOP (offset to just past PRE_LOOP) POST_LOOP The code will execute as many times as count indicates. At POST_LOOP, our CX is -1 before it gets restored. Use INC_LOOP for the common case of incrementing an index or pointer on the data-stack. Use LOCALn_INC_LOOP for the even more common case of incrementing an index or pointer in a local variable. Section 10.) some example code with list loops PRE_LIST_LOOP: ; node -- ; sets EX= node, CX= node ; return-stack: -- old-EX old-CX prior-node mov EX,AX phr str AX ; push old EX to return-stack mov 0,AX mov BX,AX xch CX,AX ; CX= node, AX= old CX phr str AX ; push old CX to return-stack mov 0,AX phr str AX ; push 0 to return-stack as prior-node not BX,CF ; CF= nil? mov BX,AX drop mov AX,LX mov LX,EX ; EX= node lit ldr LX ; LX= offset to POST_LIST_LOOP add LX,IP,AX bcs nxt LIST_START: ; -- ; needs CX = node (can't be nil), sets CX= next node ; EX is still the node mov CX,MA ldr LX mov LX,CX ; CX= next node nxt LIST_START_LOCAL1_QEXECUTE: ; depends on AX = 1 mov CX,MA ldr LX ; mov n,AX ; AX= offset to local --- needed for any offset other than 1 mov LF,HM ; move LF to HF and to MA mov LX,CX ; CX= next node ldr LF ; load LF with parent function's LF add LF,AX mov AX,MA ldr LX execute_LX In many cases, executing a quotation is the first thing that is done inside of a list loop. That is what LIST_START_LOCALn_QEXECUTE is for. FIND_LIST_MATCH: ; flag -- ; set CF= early-exit? ; depends upon AX = 1 not BX,CF not CF ; CF= true? mov 2,AX ; AX= offset over LIST_LOOP to POST_LIST add IP,AX bcs ; if CF then jump to POST_LIST nxt ; else fall into LIST_LOOP LIST_LOOP: ; -- ; needs CX = next-node ; sets EX= CX, CF= false mov R0,MA ; also moves CX to LX mov EX,AX str AX ; store node to prior-node slot on return-stack mov LX,EX ; update EX with next-node xch AX,CX lit ldr LX ; LX= offset to LIST_START flg CX,CF ; CF= new node is not nil add LX,IP,AX bcs ; if CF then jump to LIST_START nxt ; else fall into POST_LIST SHORT_LIST_LOOP: ; -- ; needs CX = next-node ; sets EX= CX, CF= false mov R0,MA ; also moves CX to LX mov EX,AX str AX ; store node to prior-node slot on return-stack xch AX,CX mov -1,AX ; AX= offset to LIST_START_LOCALn_QEXECUTE that is just prior to SHORT_LIST_LOOP mov LX,EX ; update EX with next-node flg CX,CF ; CF= new node is not nil add IP,AX bcs ; if CF then jump to LIST_START nxt ; else fall into POST_LIST In many cases, LIST_START_LOCALn_QEXECUTE is the only thing that is done inside of a list loop. That is what SHORT_LIST_LOOP is for. It is possible to have something other than LIST_START_LOCALn_QEXECUTE though. POST_LIST: ; -- prior-node node early-exit? ; needs CF= early-exit? ; return_stack: old-EX old-CX prior-node -- dup plr ldr LX mov LX,BX ; -- prior-node mov EX,AX dup mov AX,BX ; -- prior-node node dup mov CF,BX ; -- prior-node node early-exit? plr ldr LX mov LX,CX ; CX= old CX pulled from return-stack plr ldr LX mov LX,EX ; EX= old EX pulled from return-stack nxt To traverse a list, use: PRE_LIST_LOOP (offset to POST_LIST) LIST_START code... LIST_LOOP (offset to LIST_START) POST_LIST The code inside of the loop will be given the current node in EX --- it leaves EX unmodified. To obtain a matching node in a list, use: PRE_LIST_LOOP (offset to POST_LIST) LIST_START code... FIND_LIST_MATCH LIST_LOOP (offset to LIST_START) POST_LIST The code inside of the loop will be given the current node in EX --- it leaves EX unmodified. This code must push a flag indicating if a match was found or not. FIND_LIST_MATCH consumes this flag. POST_LIST leaves this on the data-stack: -- prior-node node early-exit? If the loop completed, then node will be the tail node and early-exit? will be FALSE. If the loop early-exited, then node will be the matching node and early-exit? will be TRUE. Prior-node is the node prior to the node returned, or nil if the node returned is the first node. It is okay for the code to modify the link in the node (such as when moving the node to another list). This is because CX already holds the next-node so we aren't depending upon the old node's link anymore. If this is done, the prior-node returned by POST_LIST will not be meaningful, so discard it. The CX register is only used for iteration. Any other use would corrupt any loop that you are inside of. NODE: ; -- node ; use inside of a list loop to obtain the current node dup mov EX,AX mov AX,BX nxt NODE_PLUS: ; field-offset -- field-adr ; use inside of a list loop to obtain a field in the current node mov EX,AX add AX,BX nxt NODE and NODE_PLUS are two easy examples of accessing the node inside of a list loop. Note that modifying EX is not allowed. Also, if list loops are nested, NODE etc. refer to the inner loop's current node. If you want to use the outer loop's node, it must be copied somewhere, such as the data-stack or a local variable. A local variable would work well, because a quotation can be passed in that can access it. NODE1: ; -- field-adr ; use inside of a list loop to obtain field1 in the current node mov n,AX ; needed for any field-offset other than 1 dup mov AX,BX mov EX,AX add AX,BX nxt The field-offset is generally known at compile-time, so custom primitives can be generated. Our NODE1 provides the field-address assuming a field-offset of 1. In my experience in writing the novice-package, a merge-sort is not significantly faster than an insertion-sort. The insertion-sort has the advantage of not using any extra memory. My novice-package merge-sort used the return-stack. On the TOYF though, the return-stack is only 32 elements max. To write a merge-sort on the TOYF, you would need a stack in the heap for temporary storage. This is way more trouble than it is worth. Just use the insertion-sort --- it is reasonably fast. For insertion-sort to work, we need the prior-node to the matching node. POST_LIST above provides this. Section 11.) some example code for lists PRE_NTH: ; head index -- head ; sets DX= index mov 2,AX ; AX= offset to skip over NTH if index is zero mov BX,DX not BX,CF ; CF= index is zero? drop add IP,AX bcs nxt NTH: ; node -- next-node ; needs DX= index ; depends on AX = 1 flg DX,CF ; CF= not zero? mov BX,MA add AX,DX ; post-decrement DX mov 0,AX ; AX= offset back to this primitive again ldr LX mov LX,BX ; -- next-node add IP,AX bcs nxt To calculate the N'th node in a list, use: PRE_NTH NTH The index is 0 for the 1st node, 1 for the 2nd node, etc.. Use GOOD_LENGTH ahead of time to ensure that the list is long enough to have a valid node at the given index. PRE_GOOD_LENGTH: ; head count -- count ; sets DX= head, skips one thread if head is nil pls ldr LX mov LX,DX ; -- count ; DX= head mov 2,AX ; AX= offset skipping over next thread flg DX,CF not CF ; CF= nil? add IP,AX bcs nxt GOOD_LENGTH: ; count -- remaining-count ; needs DX=node mov -1,AX ; AX= -1 mov DX,MA ldr LX mov LX,DX ; DX= new-node not BX,CF ; CF= count zero? add AX,BX ; post-decrement BX mov 0,AX ; AX= nil, and offset back to this primitive again cmv AX,DX ; change DX into nil if count is zero, to stop the iteration flg DX,CF ; CF= not nil? add IP,AX bcs nxt To ensure that a list is at least COUNT nodes long, use: PRE_GOOD_LENGTH GOOD_LENGTH NOT This leaves a TRUE if the list is long enough, and a FALSE if the list is too short. PRE_LENGTH: ; head -- 0 ; sets DX= head, skips one thread if head is nil ; depends on AX = 1 mov BX,DX mov 0,AX not BX,CF ; CF= nil? mov AX,BX ; -- 0 ; initial count is zero mov 2,AX ; AX= offset skipping over next thread add IP,AX bcs nxt LENGTH: ; 0 -- count ; needs DX=node, finishes with DX= nil, BX= count ; needs AX = 1, DX <> NIL add AX,BX ; increment BX mov 0,AX ; AX= offset back to this primitive again mov DX,MA ldr LX mov LX,DX ; DX= new node flg DX,CF ; CF= not nil? add IP,AX bcs nxt To obtain the length, use: PRE_LENGTH LENGTH PRE_LENGTH will skip over LENGTH if head is nil, leaving 0 on the stack. LENGTH will set BX= BX+1, set DX= new-node, then branch back to itself if DX is not nil. PRE_TAIL: ; head -- head ; sets DX= head, skips one thread if head is nil ; depends on AX = 1 mov BX,DX mov 2,AX ; AX= offset skipping over next thread not BX,CF ; CF= nil? add IP,AX bcs nxt TAIL: ; head -- tail ; needs DX=node; finishes with DX= nil, BX= tail ; DX can't be nil initially mov DX,BX ; BX= old node mov DX,MA ldr LX mov LX,DX ; DX= new node mov 0,AX ; AX= offset back to this primitive again flg DX,CF ; CF= not nil? add IP,AX bcs nxt To obtain the tail, use: PRE_TAIL TAIL PRE_TAIL will skip over TAIL if head is nil, leaving the nil on the stack. TAIL will set BX= old node, DX= new node, then branch back to itself if DX is not nil. The tail can also be obtained with LIST_LOOP, but that is much slower if all you need is the tail. PRE_LINK: ; 1st-head 2nd-head -- 1st-head 2nd-head 1st-head ; depends on AX = 1 ; branch past LINK if 1st-head is nil mov S0,MA ldr LX ; LX= 1st-head dup mov LX,BX ; -- 1st-head 2nd-head 1st-head not BX,CF not CF ; CF= 1st-head is not nill, so we can do LINK add IP,AX ; AX was 1, so BCS will do the same as NXT --- it will do the next primitive which is TAIL bcs mov 3,AX ; AX= offset past LINK flg AX,CF ; CF= 1 drop ; -- 1st-head 2nd-head pls ; -- 2nd-head add IP,AX bcs ; branch past LINK --- no need for NXT as CF is certainly set LINK: ; 1st-head 2nd-head 1st-tail -- 1st-head ; TAIL fell through, so 1st-head is not nil pls ldr LX,BX ; -- 1st-head 1st-tail ; LX= 2nd-head str LX ; store 2nd-head into 1st-tail link field drop ; -- 1st-head nxt To link two lists, do this: PRE_LINK TAIL LINK The following is a previous version of the code for linking two lists: PRE_LINK: ; 1st-head 2nd-head -- 1st-head 2nd-head 1st-head ; branch to BAD_LINK if 1st-head nil mov 3,AX ; AX= offset past TAIL and GOOD_LINK to BAD_LINK mov S0,MA ldr LX ; LX= 1st-head dup mov LX,BX ; -- 1st-head 2nd-head 1st-head not BX,CF ; CF= 1st-head is nil? add IP,AX bcs nxt GOOD_LINK: ; 1st-head 2nd-head 1st-tail -- 1st-head ; TAIL fell through, so 1st-head is not nil mov 2,AX ; AX= offset past BAD_LINK pls ldr LX,BX ; LX= 2nd-head flg AX,CF ; CF= 1 str LX ; store 2nd-head into 1st-tail link field add IP,AX drop ; -- 1st-head bcs ; no need for NXT because CF is certainly set BAD_LINK: ; 1st-head 2nd-head 1st-head -- 2nd-head ; assumes 1st-head is nil drop ; -- 1st-head 2nd-head pls ; -- 2nd-head nxt To link two lists, do this: PRE_LINK TAIL GOOD_LINK BAD_LINK PRE_LINK will skip to BAD_LINK if 1st-head is nil, otherwise fall into TAIL, which falls into GOOD_LINK. GOOD_LINK will link the lists, then skip over BAD_LINK. If BAD-LINK executes it returns 2nd-head (might be nil). This is somewhat more complicated, and it uses 3 cells in the threaded code compared to 2 for the first version. section 11.) some example code with 1-bit I/O Normally we would have our I/O ports in low memory, because it is slightly faster to set AX to low literal values. If several 1-bit I/O ports will be logic'd together, it is best if all of them are in the same cell. This is so MOV AX,MA LDR LX need only be done once to get all of them into LX where they can be extracted one at a time. CF_STORE7_5; ; -- ; needs CF=flag, stores CF into address 7, bit-5 mov 7,AX mov AX,MA ldr LX mov ^5,AX set CF,LX,AX ; if CF then move LX -ior- AX to AX else move LX -and- ~AX to AX nxt AX FETCH7_5: ; -- flag ; flag at address 7, bit-5 dup mov 7,AX mov AX,MA ldr LX mov ^5,AX and LX,AX flg AX,CF ; note that we can't use MOV AX,BX because AX is not normalized mov CF,BX ; BX is now 1 or 0 --- it is normalized lit ldr LX execute_LX CF_FETCH7_5xor9_ior_7: ; -- ; CF= flag at address 7, bit 5 -xor- bit 9 mov 7,AX mov AX,MA ldr LX mov ^5,AX and LX,AX flg AX,CF ; CF= bit-5 mov ^9,AX and LX,AX ; AX= bit-9 xor CF,AX flg AX,CF ; CF= bit-5 -xor- bit-9 mov ^7,AX and LX,AX ior CF,AX flg AX,CF ; CF= (bit-5 -xor- bit-9) -ior- bit-7 lit ldr LX execute_LX CF_FETCH7_5xor9_and_3iorB: ; -- flag ; flag at address 7, (bit 5 -xor- bit 9) -and- (bit 3 -ior- bit B) ; uses DX mov 7,AX mov AX,MA ldr LX mov ^5,AX and LX,AX flg AX,CF ; CF= bit-5 mov ^9,AX and LX,AX ; AX= bit-9 xor CF,AX mov AX,DX ; DX= bit-5 -xor- bit-9 mov ^3,AX and LX,AX flg AX,CF ; CF= bit-3 mov ^B,AX and LX,AX ; AX= bit-B ior CF,AX mov AX,LX ; LX= bit-3 -ior- bit-B mov DX,AX ; AX= bit-5 -xor- bit-9 and LX,AX flg AX,CF ; CF= (bit-5 -xor- bit-9) -and- (bit-3 -ior- bit-B) lit ldr LX execute_LX