TOYF processor design --- by Hugh Aguilar --- copyright November 2018
contact: hughaguilar96@gmail.com

Abstract:

The TOYF (Toy Forth) processor is somewhat similar to the MiniForth at Testra, which is also 16-bit Harvard (built on the Lattice isp1048 PLD). The MiniForth was later upgraded to an FPGA and the name changed to RACE. I wrote the MFX, the development system for the RACE --- the TOYF development system is based on MFX.

The TOYF has a maximum of 125 instructions. At this time however, 9 are undefined. These are available for application-specific uses. The TOYF does out-of-order execution, so as many as three instructions can execute in one clock cycle. The opcode fields are: 6-bit group-A (32-bit ALU), 5-bit group-B (16-bit ALU), 5-bit group-M (no ALU).

The TOYF has only a crude IRQ facility that is mostly provided to support a UART. In general, you would program the TOYF with a paced-loop. The TOYF has support for multiplication and division. It could be used for motion-control. The TOYF has support for 1-bit data, so it could be used as a PLC (Programmable Logic Controller). The TOYF has a RND instruction. It could be used for a video-game machine. It would need a coprocessor to do the audio though. The TOYF might need to be ponied up, so it would have a coprocessor (such as a C8051) that could also do sound.

A language standard limits us to a few primitives. TOYF has an assembly-language though, so there can be a lot of primitives. Classic Forth processors (B16, RTX2000, BUGS18, etc.) have very few primitives. The TOYF can have hundreds of primitives! The TOYF also supports local variables and quotations --- the only other Forth processor with locals is the MiniForth/RACE. The TOYF quotations can access the parent function's locals, despite the fact that the HOF has locals of its own.

Section 1.) the architecture
We have the following registers (those that access data-memory have an Rx alternate name):

CF      1-bit   carry flag                   CF should be 0 prior to SBC of low bytes (like the MC6800, not the 6502)
VF      1-bit   overflow flag
IF      1-bit   IRQ flag                     this masks IRQs --- a 1 blocks IRQs --- it is 1 on start-up
IV      2-bit   zero or ISR vector           high-byte is $FF, low-byte is xx000000 --- this is the ISR vector
PC      16-bit  program counter              this is the machine-code (the only register that points to code-memory)
AX  R0  16-bit  general purpose
BX  R1  16-bit  general purpose              this is the top value of the data-stack
CX  R2  16-bit  used in countdown loops and list loops; also multiplication and division
DX  R3  16-bit  general purpose              mostly used for multiplication and division, and sometimes as the second value of the data-stack
EX  R4  16-bit  used in list loops; also multiplication and division
IP  R5  16-bit  instruction pointer          this is the Forth threaded-code
HF  R6  5-bit   local frame pointer of HOF   high-byte is $00, low-byte is 010xxxxx --- this holds LF during a quotation
LF  R7  5-bit   local frame pointer          high-byte is $00, low-byte is 010xxxxx --- this is the local-frame pointer
RS  R8  5-bit   return stack pointer         high-byte is $00, low-byte is 010xxxxx --- this is the return-stack pointer
SS  R9  5-bit   single stack pointer         high-byte is $00, low-byte is 011xxxxx --- this is the data-stack for single-cell data
LX  RA  16-bit  primarily for use by LDR     most memory reads use LX as the destination
MA  RB  16-bit  memory access pointer        all memory loads and stores use MA as the address

The IV register is normally zero, but if there is an IRQ (Interrupt ReQuest), it is the address of the ISR (Interrupt Service Routine) of the highest-priority IRQ that is pending. The ISR is a colon word that ends in ISR_EXIT (like EXIT but also does XOR 1,IF to unmask IRQs). The IF mask is set on start-up and XOR 1,IF is used after initialization to enable interrupts. We don't have CLI and SEI like in most processors. The programmer knows at compile-time whether XOR 1,IF is masking or unmasking.

The processor is Harvard Architecture. Code-memory and data-memory can be accessed concurrently in the same clock cycle. Both have a 16-bit data-bus and are word addressable. Code-memory has a 15-bit address-bus and data-memory has a 16-bit address-bus. I/O ports and important global variables can be in the lower 64 words of zero-page. The two stacks are above that. An FPGA that has 128 words (256 bytes) of internal RAM will include all of the above in RAM. Presumably the code-memory (32KW max) would be internal to the FPGA so that it is fast.

Forth processors that do Forth instructions such as DUP SWAP + etc. in one clock cycle typically have the entire data-stack in registers. This allows them to push and pull data to the data-stack in parallel with the operation, such as the addition for + etc.. This requires the data-stack to be small, such as 8 elements. This can result in stack-overflow causing hard-to-find bugs. Because they don't support local variables, they need all their data on the data-stack, or in global variables. This isn't a bad design for small programs, but it is error-prone in large programs. It is especially error-prone if code-libraries are used, because you don't know how deep into the data-stack your sub-functions go. I would not use such a processor unless the program was quite small and could be analyzed thoroughly for data-stack usage. The TOYF provides 32 elements each for the data-stack and the return-stack. This is more than enough for any practical program.
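The stack-pointer encodings above (RS, LF and HF at 010xxxxx, SS at 011xxxxx) imply a simple data-memory map for the bottom 128 words. Here it is as a C sketch, purely as a reader's reference model --- the symbol names are mine, not part of the TOYF design:

enum {
    ZERO_PAGE_LO    = 0x0000,   /* I/O ports and important globals: the lower 64 words            */
    ZERO_PAGE_HI    = 0x003F,
    RETURN_STACK_LO = 0x0040,   /* 010xxxxx: the 32-word return-stack (RS, LF and HF point here)  */
    RETURN_STACK_HI = 0x005F,
    DATA_STACK_LO   = 0x0060,   /* 011xxxxx: the 32-word single-cell data-stack (SS points here)  */
    DATA_STACK_HI   = 0x007F    /* everything above fits in 128 words (256 bytes) of FPGA RAM     */
};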
It is possible that some recursive programs could overflow one or the other stack, but this is very unlikely. The TOYF uses BX as TOS. The TOYF uses DX as SOS sometimes (I would predict about 1/3 of the time). This helps the speed a lot. Registers use a lot of FPGA resources --- the TOYF keeps the data-stack in memory to allow a smaller FPGA to be used.

There are no instructions that access two adjacent memory locations, so the processor is neither little-endian nor big-endian. As a convention though, little-endian should be used, except for the Forth data-stack, which is traditionally big-endian.

Code-memory has a 15-bit address-bus (32KW). The cfa has the high-bit clear for primitives, so 32KW is all we can address. Data-memory has a 16-bit address-bus (64KW) and a 16-bit data-bus, and the data registers are all 16-bit. The opcodes are 16 bits wide. We have group-A (64 instructions), group-B (32 instructions) and group-M (32 instructions). If the opcode were upgraded to 18 bits wide, group-B and group-M could each be upgraded to 64 instructions. This is a possibility for a super version of the TOYF --- this might be useful for an application-specific FPGA.

Data-memory has a 16-bit data-bus (64KW). We address words, not bytes! ADD 1,BD advances BX and DX to the next 16-bit words. Forth threaded-code must be in the upper 32KW because the cfa has the high-bit set for colon words. Code-memory and data-memory are 32KW and 64KW respectively --- or 64KB and 128KB in terms of bytes.

Each opcode has fields, with an instruction in each field. All of these instructions execute concurrently in one clock cycle. The assembler rearranges the instructions and packs them into the fields, inserting NOP instructions as necessary. The goal is to minimize how many NOP instructions get inserted while guaranteeing that the program does the same thing as it would if we had one instruction per opcode, in the same order that the instructions appeared in the source-code.

The opcode has these fields:
MMMMMAAAAAABBBBB
FEDCBA9876543210    these are the bit positions of the fields
There are three fields: MMMMM AAAAAA BBBBB --- we are executing up to three instructions per clock cycle.

To help the assembler pack instructions into opcodes efficiently (minimum NOP instructions inserted), the programmer should hold data in a register for a while before using that register. Don't modify a register and then immediately use that register for something. Try to interleave instructions that modify different registers, as these instructions can parallelize. This is very similar to Pentium programming, in which UV-either and V-specific instructions were interleaved. The TOYF is more complicated though, because there are three "pipes" (opcode fields) rather than just two. The TOYF arranges the out-of-order execution at compile-time, whereas the Pentium does this at run-time.

Not counting NOP, there are 31 MMMMM instructions, 63 AAAAAA instructions, and 31 BBBBB instructions. So, the TOYF has 125 instructions total, which is a lot, but this shouldn't make implementation unreasonably difficult. The TOYF has simple instructions that execute in one clock cycle, rather than complicated multi-cycle instructions.

An FPGA can support an 18-bit opcode. The TOYF is currently only using a 16-bit opcode. It might be possible to upgrade to an 18-bit opcode and get a 2-bit field in addition to our group-A, group-B and group-M. I can't think of any general-purpose use for this, but there might be an application-specific use.
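To make the 16-bit field layout concrete, here is a small C sketch that splits an opcode into its three fields using the bit positions shown above. This is just a reader's model (the type and function names are mine), not part of the TOYF tools:

#include <stdint.h>
#include <stdio.h>

/* Field layout from the diagram above: MMMMM in bits 15-11, AAAAAA in bits 10-5, BBBBB in bits 4-0. */
typedef struct { unsigned m, a, b; } toyf_fields;

static toyf_fields decode_opcode(uint16_t opcode)
{
    toyf_fields f;
    f.m = (opcode >> 11) & 0x1F;   /* group-M slot: 32 possible instructions */
    f.a = (opcode >> 5)  & 0x3F;   /* group-A slot: 64 possible instructions */
    f.b =  opcode        & 0x1F;   /* group-B slot: 32 possible instructions */
    return f;
}

int main(void)
{
    toyf_fields f = decode_opcode(0xA8F3);            /* arbitrary example opcode */
    printf("M=%u A=%u B=%u\n", f.m, f.a, f.b);
    return 0;
}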
We also have a lot of undefined instructions in group-A and group-B that could have some application-specific use.

Upon start-up, IP is set to zero. Use BCS to initialize it. Upon start-up, PC is set to 3. This is the start-up code just past the NEXT code. Upon start-up, RS and SS are set to zero. They can't be modified except incremented and decremented.

Section 2.) the instruction set

There are no addressing modes. All of the instructions are inherent. None of the opcodes have an operand. Most processors have an addressing-mode in which the operand is the address in memory to access. TOYF doesn't have that.

The assembler packs the instructions together into the opcodes. TOYF has out-of-order execution. The assembler is single-pass, which is typical of Forth assemblers. To assemble the current instruction, the following rules are followed:
1.) The current instruction has to be assembled after any label that appeared previously in the source-code.
2.) The current instruction has to be assembled after any instruction that modifies a register(s) that it uses.
3.) The current instruction has to be assembled after any instruction that modifies a register(s) that it modifies.
4.) The current instruction has to be concurrent with or after any instruction that uses a register(s) that it modifies.
To assemble the current instruction, this is done:
A.) Search back into the machine-code as far as possible to find an opcode that meets the four rules above.
B.) Starting at that opcode, search forward to find an opcode with an empty group-field appropriate for our instruction.
C.) If there is no empty slot available, then append a new opcode onto the end of the machine-code.
D.) Assemble our current instruction in the opcode we have obtained above.

If we have ADC LX,BX already assembled, then a MOV 0,CF would have to assemble in an opcode after it. Rule #3 says an instruction that modifies a register (CF) has to be after a previous instruction that modifies that register.

If we have MOV 0,AX already assembled, then a MOV 0,CF could assemble anywhere. The MOV 0,AX is irrelevant. The MOV 0,CF could assemble prior to the MOV 0,AX, or in the same opcode, or in an opcode afterward. So long as the four rules are followed, instructions may be assembled out of order compared to how they appeared in the source-code.

If we have SUM BX,AX already assembled, then a MOV 0,CF could assemble in the same opcode or in an opcode afterward. The MOV 0,CF could not assemble in an opcode prior to the SUM BX,AX however, because that would violate rule #4. Rule #4 says that the current instruction (MOV 0,CF) has to assemble after or with any instruction that uses CF.

If an instruction appears in the source-code after any instruction that could modify PC (BCS, NXT, NXP, etc.), then it will not pack with or before that PC-modifier. This means that code can go after the BCS and before the NXT.

If the assembler were multi-pass, then rule #3 could be modified to allow the current instruction to be assembled prior to any instruction that modifies any register(s) that it modifies, if no other instruction uses that register. This might allow for better packing in some circumstances. Writing a multi-pass assembler is above my pay grade however. The single-pass assembler provides pretty good packing. This is all that I did at Testra when I wrote MFX for the MiniForth. I was getting paid $10/hour at Testra. For that, you get a single-pass assembler that follows simple rules.
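Here is a minimal C sketch of that slot-search, assuming a crude bitmask model of the registers and flags --- the data structures and names are mine, not MFX's, and step C (appending a new all-NOP opcode) is left to the caller:

#include <stdint.h>
#include <stddef.h>

enum { GROUP_M, GROUP_A, GROUP_B };

typedef struct {
    int      group;      /* which opcode field this instruction needs         */
    uint32_t uses;       /* bitmask of registers and flags that it reads      */
    uint32_t modifies;   /* bitmask of registers and flags that it writes     */
} insn;

typedef struct {
    insn    *slot[3];    /* one slot per field; NULL means NOP                */
    uint32_t uses;       /* union of the packed instructions' uses            */
    uint32_t modifies;   /* union of the packed instructions' modifies        */
    int      barrier;    /* set where a label lands or a PC-modifier packed   */
} opcode;

/* Pack one instruction into the opcode list, following rules 1-4 and steps A-D above.
   Returns the opcode index used, or -1 meaning "append a new all-NOP opcode and retry". */
static int pack(opcode *code, int len, insn *x)
{
    int lo = 0;
    for (int i = len - 1; i >= 0; i--) {          /* step A: search back for the earliest legal opcode */
        if (code[i].barrier ||                    /* rule 1, and PC-modifiers                          */
            (code[i].modifies & x->uses) ||       /* rule 2                                            */
            (code[i].modifies & x->modifies)) {   /* rule 3                                            */
            lo = i + 1;                           /* must assemble strictly after this opcode          */
            break;
        }
        if (code[i].uses & x->modifies) {         /* rule 4: concurrent with or after is allowed       */
            lo = i;
            break;
        }
    }
    for (int i = lo; i < len; i++)                /* step B: search forward for an empty field         */
        if (code[i].slot[x->group] == NULL) {
            code[i].slot[x->group] = x;           /* step D: assemble it here                          */
            code[i].uses     |= x->uses;
            code[i].modifies |= x->modifies;
            return i;
        }
    return -1;                                    /* step C: the caller appends a new opcode           */
}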
The assembly-language programmer doesn't have to worry about any of this out-of-order complexity. You just assume that your program does the same thing as if the instructions were assembled one per opcode, in the same order as they appeared in your source-code. Don't let the out-of-order execution freak you out! The complexity is under the hood.

This is an example primitive:

DX_MINMAX: ; a b -- max ; sets DX= min
    xch BX,DX   ; DX= b
    pls
    ldr LX      ; LX= a
    mov LX,BX   ; BX= a
    mov DX,AX   ; AX= b
    mov 0,CF    ; the TOYF's SBC needs CF set to 0 as done in the MC6800, not to 1 as done in the popular 6502
    sbc AX,BX   ; CF= b(AX,DX) > a(BX) ; CF indicates that b(DX) > a(BX)
    not CF      ; CF= b(AX,DX) <= a(BX) ; CF indicates that b(DX) is the minimum and a(BX) is the maximum
    mov LX,BX   ; BX= a ; the SBC AX,BX screwed up BX, so BX has to be restored
    mov 1,AX    ; AX= offset to branch to next primitive
    add IP,AX
    bcs         ; if BX is maximum and DX is minimum, then we are done
SWAP_DX: ; a -- b ; needs DX=b, sets DX= a
    xch BX,DX
    nxt

                group-A         group-B         group-M
1st opcode:                     XCH BX,DX       PLS
2nd opcode:     MOV DX,AX       MOV 0,CF        LDR LX      ; MOV 0,CF moved all the way back here because it has no dependencies
3rd opcode:                     MOV LX,BX
4th opcode:     MOV 1,AX        SBC AX,BX                   ; it is legal to use a register and store into that register concurrently
5th opcode:     ADD IP,AX       NOT CF
6th opcode:                     MOV LX,BX       BCS         ; BCS will terminate the primitive if BX>=DX
7th opcode:                     XCH BX,DX       NXT         ; swap BX and DX so BX>=DX then terminate

Our DX_MINMAX above had 14 instructions that packed into 7 opcodes, so it came out to an average of 2 instructions per opcode. It is typical to get about 2 instructions per opcode in primitives. Some primitives are heavy on one particular group of instructions and don't pack this efficiently, but others get better efficiency.

In the following instruction descriptions, the | separates events that are done concurrently.
These are group-A instructions that modify AX CX IF: nop xor 1,IF ; move 1 -xor- IF to IF ; toggle the IF mask mov LX,AX ; move LX to AX mov BX,AX ; move BX to AX mov DX,AX ; move DX to AX mov EX,AX ; move EX to AX sum EB,CA ; if low-bit of DX, then move CX:AX + EX:BX to CX:AX ; used in multiplication vng AX ; if VF then move 0 - AX to AX ; used by SMULT to fix the sign afterward vng CA ; if VF then move 0 - CX:AX to CX:AX ; used by DX_SMULT32 to fix the sign afterward shl CA ; shift CX:AX left 1 bit ; used in division to shift numerator left sub DB,CA ; move CX:AX - DX:BX to CX:AX ; used in division to do trial subtraction cad DB,CA ; if CF then move CX:AX + DX:BX to CX:AX ; used in division to undo trial subtraction and CF,AX ; move AX<>0 -and- CF to AX ; bit-0 is the result, and the rest are zero'd ior CF,AX ; move AX<>0 -ior- CF to AX ; bit-0 is the result, and the rest are zero'd xor CF,AX ; move AX<>0 -xor- CF to AX ; bit-0 is the result, and the rest are zero'd set CF,LX,AX ; if CF then move LX -ior- AX to AX else move LX -and- ~AX to AX ; AX should be the bit-mask and LX the datum cng AX ; if CF then move 0 - AX to AX sh2 0,AX ; shift AX left 2 times, move %00 into low-bits ; also called: SHL 2,AX ; this multiplies AX by 4 sh2 1,AX ; shift AX left 2 times, move %01 into low-bits sh2 2,AX ; shift AX left 2 times, move %10 into low-bits sh2 3,AX ; shift AX left 2 times, move %11 into low-bits shl 1,AX ; shift AX left 1 time, move 0 into low-bit ; this multiplies AX by 2 shl 4,AX ; shift AX left 4 times, move 0 into low-bits ; this multiplies AX by 16 shr 1,CA ; shift CX:AX right 1 time, move 0 into high-bits ; this divides CX:AX by 2 shr 4,CA ; shift CX:AX right 4 times, move 0 into high-bits ; SHR is used to divide the product by unity add IP,AX ; move AX + IP to AX add RS,AX ; move AX + RS to AX add LF,AX ; move AX + LF to AX add LX,AX ; move AX + LX to AX sub LX,AX ; move AX - LX to AX and LX,AX ; move LX -and- AX to AX ior LX,AX ; move LX -ior- AX to AX xor LX,AX ; move LX -xor- AX to AX not AX ; move -1 -xor- AX to AX swp AX ; exchange low byte and high byte in AX ; BYT AX,BX and SWP are useful for accessing bytes mov MA,CX ; move MA to CX ; allows MA to hold CX during multiplication mov LX,CX ; move LX to CX ; used by LIST_START etc. xch CX,AX ; exchange CX and AX values ; also called: XCH AX,CX sub 1,CX ; move CX - 1 to CX ; used in countdown loops rnd AX ; move bit-1 -xor- bit-2 -xor- bit-4 -xor- bit-15 to bit-0 | shift left bits 0..14 to bits 1..15 (discarding bit-15) mov -1,AX ; move -1 to AX ; -1 is needed for primitives that conditionally iterate over themselves mov 0,AX ; move 0 to AX mov 1,AX ; move 1 to AX mov 2,AX ; move 2 to AX mov 3,AX ; move 3 to AX mov 4,AX ; move 4 to AX mov 5,AX ; move 5 to AX mov 6,AX ; move 6 to AX mov 7,AX ; move 7 to AX mov 8,AX ; move 8 to AX mov 9,AX ; move 9 to AX mov 10,AX ; move 10 to AX ; also called: MOV $A,AX mov 11,AX ; move 11 to AX ; also called: MOV $B,AX mov 12,AX ; move 12 to AX ; also called: MOV $C,AX mov 13,AX ; move 13 to AX ; also called: MOV $D,AX mov 14,AX ; move 14 to AX ; also called: MOV $E,AX mov 15,AX ; move 15 to AX ; also called: MOV $F,AX undefined undefined undefined undefined undefined undefined undefined Literal values from -1 to 15 can be constructed in AX with one instruction. For larger values, use SH2 to append 2 bits at a time. Also, SHL 4,AX and SWP AX are useful for setting the high bits when the low bits are zero. 
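As a model of the recipe suggested above (MOV the top four bits, then append the rest two bits at a time with SH2), here is a small C sketch; it ignores the NOT AX, SHL 4,AX and SWP AX shortcuts, so it always emits the obvious seven-instruction sequence. The function name is mine:

#include <stdint.h>
#include <stdio.h>

/* Print the naive MOV / SH2 sequence that builds the 16-bit value n in AX. */
static void emit_literal(uint16_t n)
{
    printf("mov %u,AX\n", (n >> 12) & 0xF);        /* the top 4 bits are always in the 0..15 range   */
    for (int shift = 10; shift >= 0; shift -= 2)   /* append the remaining 12 bits, 2 bits at a time */
        printf("sh2 %u,AX\n", (n >> shift) & 3);
}

int main(void)
{
    emit_literal(23241);   /* prints the same MOV/SH2 sequence used by LIT_23241 in Section 3 */
    return 0;
}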
In general, small values are faster than big values, so 1-bit I/O ports should be in the lower bits if possible. These are group-B instructions that modify BX DX EX VF CF: nop mov LX,DX ; move LX to DX mov AX,DX ; move AX to DX add 1,BD ; move BX + 1 BX | move DX + 1 to DX ; used when moving blocks of data with post-increment sub 1,BD ; move BX - 1 BX | move DX - 1 to DX ; used when moving blocks of data with pre-decrement mov LX,BX ; move LX to BX mov AX,BX ; move AX to BX xch BX,DX ; exchange BX and DX ; also called: XCH DX,BX byt AX,BX ; move AX -and- $FF to BX mov AX,EX ; move AX to EX bit EX ; shift EX left 1 bit, move ~CF into low-bit of EX ; used in division with EX = quotient mov CF,BX ; move CF to BX ; sets BX= 1 if CF, else BX= 0 sgn DX,BX ; if BX < 0 then move 0 - BX to BX | if DX < 0 then move 0 - DX to DX | set VF to -xor- original sign bits adj DX,EB ; shift DX right, move 0 into high-bit | shift EX:BX left, move 0 into low-bit, move high-bit into CF ror BX ; shift BX right, move CF into high-bit, move low-bit into CF rol BX ; shift BX left, move CF into low-bit, move high-bit into CF adc LX,BX ; move BX + LX + CF to BX, move carry into CF, set VF if LX and AX had same high-bit and it changes sbc AX,BX ; move BX - AX - CF to BX, move borrow into CF, set VF if BX and AX had same high-bit and it changes neg BX ; move 0 - BX to BX mov 0,CF ; move 0 to CF xor VF,CF ; move VF -xor- CF to CF ; set CF=0 first to just move VF to CF not CF ; move -not- CF to CF cmv AX,DX ; if CF then move AX to DX flg AX,CF ; move AX<>0 to CF ; set CF if AX is nonzero flg BX,CF ; move BX<>0 to CF ; set CF if BX is nonzero ; this one is only marginally useful flg CX,CF ; move CX<>0 to CF ; set CF if CX is nonzero ; used in LOOP flg DX,CF ; move DX<>0 to CF ; set CF if DX is nonzero not BX,CF ; move BX=0 to CF ; set CF if BX is zero mov CN,CF ; move CX high-bit to CF ; set CF if CX is negative mov BN,CF ; move BX high-bit to CF ; set CF if BX is negative undefined undefined If it is necessary to make room for another instruction in group-B, FLG BX,CF could be discarded. Use NOT BX,CF NOT CF instead. These are group-M instructions that modify LX PC MA IP LF HF RS SS: nop mov AX,LX ; move AX to LX ; this allows EXECUTE and QEXECUTE to work ldr LX,BX ; load (MA) into LX | move BX to MA ; this is for moving blocks of data, usually from DX to BX ldr LX ; load (MA) into LX ldr LF ; load (MA) into LF ldr IP ; load (MA) into IP str AX ; store AX to (MA) str BX ; store BX to (MA) str LX ; store LX to (MA) mov LF,HM ; move LF to HF and to MA mov AX,MA ; move AX to MA ; there is no MOV BX,MA however NXP moves BX to MA for FETCH etc.. mov CX,MA ; move CX to MA mov DX,MA ; move DX to MA mov R0,MA ; move RS to MA ; this is the address of the first item on the return-stack mov S0,MA ; move SS to MA ; this is the address of the second-of-stack (BX is top-of-stack) mov S1,MA ; move SS + 1 to MA ; this is the address of the third-of-stack phs ; move SS - 1 to SS and to MA pls ; move SS to MA | move SS + 1 to SS phr ; move RS - 1 to RS and to MA plr ; move RS to MA | move RS + 1 to RS phr LX ; move RS - 1 to RS and to MA | load (MA) into LX ; for ENTER1 etc. plr IV ; if IV -and- ~IF then ( move 1 to IF | move 0 to IV | move IV to IP and to MA | move NEXT to PC ) else ( move RS to MA | move RS + 1 to RS ) rst LF ; load (MA) into LF | move AX to RS ; for LEAVE1 etc. 
lit ; move IP + 1 to IP and to MA ; preps literal value fetch lit BX ; store BX to (MA) | move IP + 1 to IP and to MA ; for literal primitives bcs ; if CF then ( move AX to IP and to MA | move NEXT to PC ) ; branch on CF set nxt AX ; move IP + 1 to IP and to MA | move NEXT to PC | store AX to (MA) ; for storing to memory nxt LF ; move IP + 1 to IP and to MA | move NEXT to PC | store LF to (MA) | move RS to LF ; for ENTER0 etc. nxt QX ; move IP + 1 to IP and to MA | move NEXT to PC | move HF to LF ; for QEXIT nxt ; move IP + 1 to IP and to MA | move NEXT to PC ; for ending primitives nxp ; if high-bit of LX clear, then ( move LX to PC | move BX to MA | move DX to LX ) else ( move RS - 1 to RS and to MA ) ; needs LX = cfa nxf ; store IP to (MA) | move LX to IP and to MA | move NEXT to PC ; for colon words NXP expects LX to contain the cfa. If the cfa is of a primitive, NXP executes it, else NXP does PHR in preparation for NXF. NXF stores the IP to the return-stack, moves the CFA to IP and to MA, and jumps to NEXT. If NXP executes a primitive, it does some prep for the primitive: It moves BX to MA to boost the speed of FETCH etc.. It moves DX to LX to boost the speed of DX_AND DX_OR DX_XOR etc.. Group-M does not need an ALU because the only arithmetic done is incrementing and decrementing of registers. Group-A needs a 32-bit ALU and Group-B needs a 16-bit ALU. For BCS the offset should be 0 to iterate over the same primitive, -1 to go back one primitive, etc.. An offset of 1 would be the same as NXT that adds 1 to IP internally. An offset of 2 would skip over 1 primitive, etc. (unintuitive!) macro EXECUTE_LX mov 1,AX ; every primitive starts with AX=1 --- this speeds up various primitives nxp ; either do primitive or prep pushing IP --- every primitive starts with MA=BX to speed up FETCH etc. nxf ; push IP and do function endm Every primitive starts with AX = 1 because this is done in the EXECUTE_LX macro. We don't want to set CF inside of EXECUTE_LX because sometimes primitives pass data in CF to the next primitive. Examples of this are the NEXT_MATCH and LIST_LOOP primitives that pass CF to the POST_LIST primitive. Also, the last instruction of the primitive is usually group-B, so a group-B in EXECUTE_LX doesn't pack with it. Notice that there can't be any code between NXP and NXF because NXP may change the PC so nothing afterward will execute. NEXT: ; this is at address 0 of code-memory and is hard-coded into NXT and NXF ldr LX execute_LX Normally, every primitive ends with a NXT instruction. The NXT will hopefully pack with a group-A or group-B instruction nearby. Most primitives are dominated by group-M instructions, so packing more group-M instructions doesn't work very well. In some cases, a primitive is dominated by group-A and group-B instructions, so instead of NXT the following can be inlined: lit ldr LX execute_LX This is an example: ABS: ; n -- |n| ; absolute value mov BN,CF mov BX,AX lit cng AX ldr LX mov AX,BX execute_LX The LIT packs with surrounding code. The LDR LX packs with surrounding code. The MOV 1,AX and NXP in EXECUTE_LX pack with surrounding code (because MOV AX,BX is group-B and they are group-A and group-M). By comparison, if we used NXT it would pack with MOV AX,BX and it does LIT internally so that is parallelized as well, but the LDR LX in NEXT would cost an extra clock cycle because it doesn't pack with anything. Using EXECUTE_LX can save one clock cycle if both LIT and LDR LX pack, and the last instruction of the primitive is group-B. 
The downside is that the primitive is one more opcode in length as compared to NXT (because NXP and NXF each take an opcode).

Note that we have multiple versions of NXT that internally do another group-M operation in parallel, to save one clock cycle. For example, instead of: STR AX NXT we have: NXT AX

These are combination instructions (these have to be special-cased by the assembler because otherwise they won't pack together):
xch AX,LX ; MOV AX,LX and MOV LX,AX packed together ; also called XCH LX,AX
xch AX,BX ; MOV AX,BX and MOV BX,AX packed together ; also called XCH BX,AX
xch AX,DX ; MOV AX,DX and MOV DX,AX packed together ; also called XCH DX,AX
xch AX,EX ; MOV AX,EX and MOV EX,AX packed together ; also called XCH EX,AX
xan LX,AX ; AND LX,AX and MOV AX,LX packed together
xio LX,AX ; IOR LX,AX and MOV AX,LX packed together
xxo LX,AX ; XOR LX,AX and MOV AX,LX packed together
There are some others that can be done like this if needed.

Section 3.) some basic example code

LIT_23241: ; -- 23,241 ; binary: 0101,1010,1100,1001 ; inlining the NEXT code
    phs
    lit BX
    ldr LX
    mov 5,AX    ; leftmost bits: 0101
    sh2 2,AX    ; leftmost bits: 10
    sh2 2,AX    ; leftmost bits: 10
    sh2 3,AX    ; leftmost bits: 11
    sh2 0,AX    ; leftmost bits: 00
    sh2 2,AX    ; leftmost bits: 10
    sh2 1,AX    ; leftmost bits: 01
    mov AX,BX
    execute_LX

Literal value primitives follow the pattern of LIT_23241 shown above --- LIT BX was provided to boost their speed. Note that LIT BX and LDR LX each pack with one of the ,AX instructions, so they cost nothing. The MOV AX,BX is group-B so it packs with the MOV 1,AX and NXP in EXECUTE_LX for good speed. Small values are faster than large values because there are fewer ,AX instructions needed, and the ,AX instructions don't pack with each other at all. Sometimes SHL 4,AX and SWP AX can be used when you have blocks of zeros in the low bits.

LIT_N250: ; -- -250 ; binary: 1111,1111,0000,0110 ; this is the obvious way to do it
    phs
    lit BX
    ldr LX
    mov $F,AX   ; leftmost bits: 1111
    sh2 3,AX    ; leftmost bits: 11
    sh2 3,AX    ; leftmost bits: 11
    shl 4,AX    ; leftmost bits: 0000
    sh2 1,AX    ; leftmost bits: 01
    sh2 2,AX    ; leftmost bits: 10
    mov AX,BX
    execute_LX

Here it is again using NOT AX to save 2 clock cycles:

LIT_N250: ; -- -250 ; binary: 1111,1111,0000,0110 ~0000,0000,1111,1001
    phs
    lit BX
    ldr LX
    mov $F,AX   ; leftmost bits: 1111
    sh2 2,AX    ; leftmost bits: 10
    sh2 1,AX    ; leftmost bits: 01
    not AX
    mov AX,BX
    execute_LX

The NOT AX instruction was mostly provided for generating small negative literals.

LIT_4096: ; -- 4096 ; binary: 0001,0000,0000,0000
    phs
    lit BX
    ldr LX
    mov 1,AX    ; leftmost bits: 0001
    shl 4,AX    ; leftmost bits: 0000
    swp AX      ; leftmost bits: 0000,0000
    mov AX,BX
    execute_LX

The SWP AX instruction is useful for 1-bit masks in the high byte.

macro DUP ; n -- n n
    phs
    str BX
endm

macro DROP ; n --
    pls
    ldr LX
    mov LX,BX
endm

We have some producers that leave SOS in DX rather than in memory. Also, we have some consumers that expect SOS in DX rather than in memory. The compiler has to notice the possibility of using these as part of its optimization.
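Stepping back to the NEXT code and the EXECUTE_LX macro: here is a rough C model of that dispatch, showing only the decision made on the cfa's high-bit and the register prep, not the clock-by-clock behavior. All of the names are mine, and the primitive call is only a stand-in:

#include <stdint.h>

#define CFA_IS_PRIMITIVE(cfa)  (((cfa) & 0x8000) == 0)   /* primitives: high-bit clear; colon words: high-bit set */

static uint16_t data_mem[1 << 16];      /* word-addressed data-memory (model)            */
static uint16_t IP, MA, LX, AX, BX, DX, RS;

static void run_primitive(uint16_t pc)  /* stand-in for jumping into machine-code at pc; */
{                                       /* a real primitive ends with NXT, which does    */
    (void)pc;                           /* IP + 1 -> IP and MA, then jumps back to NEXT  */
}

/* One pass through NEXT (at address 0 of code-memory): LDR LX, then EXECUTE_LX (MOV 1,AX / NXP / NXF). */
static void next_step(void)
{
    LX = data_mem[MA];                  /* LDR LX: fetch the cfa that IP points at (MA was left equal to IP) */
    AX = 1;                             /* MOV 1,AX: every primitive starts with AX = 1                      */
    if (CFA_IS_PRIMITIVE(LX)) {         /* NXP, primitive case                                               */
        uint16_t pc = LX;
        MA = BX;                        /* prep so FETCH etc. are fast                                       */
        LX = DX;                        /* prep so DX_AND, DX_OR, DX_XOR etc. are fast                       */
        run_primitive(pc);
    } else {                            /* NXP (PHR prep) then NXF, colon-word case                          */
        RS = RS - 1;
        MA = RS;
        data_mem[MA] = IP;              /* NXF: push the old IP onto the return-stack                        */
        IP = LX;                        /* IP now walks the colon word's threaded-code                       */
        MA = LX;                        /* the next next_step() fetches the colon word's first cfa           */
    }
}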
QEXECUTE: ; cfa -- mov LF,HM ; move LF to HF and to MA ldr LF ; load LF with parent function's LF EXECUTE: ; cfa -- mov BX,AX drop mov AX,LX execute_LX LOCAL1_QEXECUTE: ; this executes a quotation cfa held in local1 ; depends on AX=1 already ; mov n,AX ; AX= offset to local ; this instruction needed for any local offset other than 1 mov LF,HM ; move LF to HF and to MA ldr LF ; load LF with parent function's LF add LF,AX mov AX,MA ldr LX execute_LX A function with locals starts with one of ZERO1 ZERO2 etc. if there are any zero-initialized locals. It then does ENTER2 ENTER1 or ENTER0 for how many stack-initialized locals there are. ENTER0 is for when there are some zero-initialized locals but no stack-initialized locals. ENTER0 is not needed when there are no locals at all. Nothing is needed in this case. ZERO2: ; -- mov 0,AX phr str AX ZERO1: ; -- mov 0,AX phr nxt AX ENTER2: ; a b -- phr str BX ; transfer B pls phr LX str LX ; transfer A pls phr LX mov LX,BX ; -- nxt LF ; push old LF to the return-stack, and move RS to LF as the new local-frame ENTER1: ; a -- phr str BX ; transfer A pls phr LX mov LX,BX ; -- nxt LF ; push old LF to the return-stack, and move RS to LF as the new local-frame ENTER0: ; -- phr nxt LF ; push old LF to the return-stack, and move RS to LF as the new local-frame Every colon word ends with LEAVEx corresponding to how many locals there were, or just EXIT if no locals. macro _EXIT ; -- plr IV ; if an IRQ is pending, will goto the ISR for that IRQ ldr IP ; restore the old IP nxt endm LEAVE1: ; -- ; assumes that there is one local on the stack ; depends on AX=1 already ; mov n,AX ; AX= size of local frame ; this instruction needed for any number of locals other than 1 plr add RS,AX ; prep RST LF rst LF ; pull old LF and discard local variables EXIT: ; -- _exit QEXIT: ; -- ; assumes that this is a quotation called with QEXECUTE plr ldr IP ; restore the old IP nxt QX ; restore the old LF Originally, QEXECUTE would push the HOF's LF to the return-stack, and QEXIT would pull it off. Now QEXECUTE moves the HOF's LF to the HF register (HF means HOF's LF) and QEXIT moves it back. This is faster because we don't have memory access. This means that quotations can't be be nested though. A quotation can't call a HOF that executes another quotation. This limitation should never be a problem. The compiler can do error-checking to catch this at compile-time Note that EXIT uses PLR IV rather than PLR. This works the same as PLR if there is no IRQ pending. If there is an IRQ pending, and IF isn't on, then it goes to the ISR (a colon word) for that IRQ rather than do an exit. If more than one IRQ is pending, the highest-priority IRQ will be chosen by the hardware to get its address put in IV. The ISRs are at $FFC0 $FF80 $FF40 $FF00 (highest-priority to lower priority). Every ISR must end in ISR_EXIT, which will exit back to the caller of the interrupted function that never did its EXIT. PLR IV should only be used in EXIT --- PLR should be used any other time data is pulled from the return-stack PLR IV can also be used in LEAVE2 etc. that do an exit as part of their operation. PLR IV can't be used in QEXIT because the ISR ends in ISR_EXIT --- this won't undo the quotation properly. ISR_EXIT: ; -- ; like EXIT except changes IF back to 0 again xor 1,IF _exit We don't normally want ISRs to get interrupted, so PLR IV sets the IF mask to 1 when it executes an ISR. Interrupts are blocked during execution of the ISR, but ISR-EXIT sets IF back to 0 so the main-program can be interrupted. 
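Here is a small C model of the PLR IV decision described above; only the decision logic is modeled, not the cycle timing, and the names are mine:

#include <stdint.h>

static uint16_t IV;     /* 0, or the vector of the highest-priority pending ISR ($FF00/$FF40/$FF80/$FFC0) */
static uint16_t IF;     /* the IRQ mask: 1 blocks IRQs; it is 1 on start-up                               */
static uint16_t RS, MA, IP;

/* PLR IV, as used at the top of _EXIT: either divert to the pending ISR or do an ordinary PLR. */
static void plr_iv(void)
{
    if (IV != 0 && IF == 0) {
        IF = 1;                 /* block further IRQs until ISR_EXIT does XOR 1,IF                 */
        IP = IV;                /* start interpreting the ISR colon word ...                       */
        MA = IV;                /* ... and PC goes to NEXT, so the rest of EXIT does not execute   */
        IV = 0;
        /* The interrupted function's EXIT never completed, so the ISR's own ISR_EXIT              */
        /* later pops that IP and returns to the caller of the interrupted function.               */
    } else {
        MA = RS;                /* ordinary PLR: EXIT will LDR IP from the return-stack            */
        RS = RS + 1;
    }
}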
The XOR 1,IF won't parallelize with the PLR IV in _EXIT, so this PLR IV can be interrupted if another IRQ is pending. The downside of XOR 1,IF not parallelizing, however, is that ISR_EXIT takes one clock cycle more than EXIT does. This isn't too bad --- most processors burn up a lot of clock cycles saving and restoring registers.

The ISR can have local variables. Locals are actually faster than globals (because small literals are fast). We would need ISR_LEAVEx primitives, because we can't use the LEAVEx primitives that don't clear IF. Initializing the local-frame with ZEROx ENTER0, and getting rid of it with ISR_LEAVEx, adds a lot of overhead however. For the most part, if your ISR needs locals, it is too big and complicated --- it is doing too much.

No registers need to be saved and restored for the ISR. The ISR must ISR_EXIT with the return-stack and data-stack as they were, so RS SS and BX are unchanged. IP can be corrupted because ISR_EXIT will restore the calling function's IP from the return-stack (this is how we get back). CX and EX will be unchanged so long as loops are concluded before EXIT is done. AX DX VF CF can be corrupted by the ISR without harm, because they are always invalid when EXIT is done in the main-program. DX is only used to pass data between primitives, but never between colon words, so corrupting it causes no harm.

Upon start-up, IF is set to 1 to block interrupts. During initialization, XOR 1,IF should be done to enable interrupts. After that, IF is normally 0 while the main-program is executing, and 1 while ISRs are executing.

We effectively only service IRQs inside of EXIT (or the exit code in LEAVE2 etc.). This should be adequate because EXIT generally executes pretty often (Forth code tends to be heavily factored into colon words). Some words however may be iterative, and the loop contains only primitives, so EXIT isn't being done for a long time. This may cause interrupt latency to be excessively long. In this case, the loop should contain a call to the NOOP colon word:
: NOOP ( -- ) ; \ does nothing except EXIT --- allows any pending IRQ to be serviced by the PLR IV inside of EXIT.

Even given judicious use of NOOP, there may still be too much interrupt latency. Some primitives are pretty lengthy (multiplication and division, for example), so even bracketed by NOOP too much time passes. The TOYF clock-speed should be pretty fast (maybe 100 MHz), so hopefully EXIT will be done often enough for the I/O. The TOYF only has 64KW of data-memory, so you can't really buffer a lot of I/O data anyway. Even if you have a high-speed UART, you don't have to keep up with it. For such small files the upload speed doesn't matter much anyway. A timer for an ADC or DAC might be problematic though, because you do have to keep up with it --- it has a specified speed.

For the most part, we are still relying on a paced loop, as with the original design. The interrupts are for non-critically timed I/O, especially a UART --- we aren't going to expect ISRs to execute right away. We want the ISRs to be very short and quick, so our paced-loop doesn't run past its time allotment.

My original thought was to have interrupts on the NXP instruction. They can't happen at arbitrary instructions because we sometimes have data under either the data-stack or return-stack pointer. If NXP is interrupted, the ISR would be written in machine-code as a primitive. This is a bad idea however. For one thing, there are no conditional branches in machine-code.
Conditional execution is done by changing IP and then doing a NXT, so it is only done in colon words. The ISR has to be a colon word so it can do conditional execution, which is almost always needed. If NXP is interrupted, then it has to save a lot of state information, start up a colon word, then afterward restore the state. It is much simpler to have PLR IV that is interrupted, only in EXIT, because there are no registers to save and restore.

RND_BIT: ; seed-adr -- [0,1] ; depends upon MA=BX
    ldr LX
    mov LX,AX
    rnd AX
    flg AX,CF
    mov CF,BX   ; BX= low bit of AX
    nxt AX      ; update 16-bit seed

RND_BYTE: ; seed-adr -- [0,255] ; depends upon MA=BX
    ldr LX
    mov LX,AX
    rnd AX
    rnd AX
    rnd AX
    rnd AX
    rnd AX
    rnd AX
    rnd AX
    rnd AX
    byt AX,BX   ; BX= low 8 bits of AX
    nxt AX      ; update 16-bit seed

RND_BIT and RND_BYTE aren't adequate for encryption. They are provided for use in games. The TOYF uses a paced-loop which should work for games --- 100 Hz (10ms) would update the display fast enough to look like motion. The major problem is that there is no low-latency IRQ that can be used to generate sound. A separate sound chip would be needed for music.

It may be necessary to use a coprocessor to pony up the TOYF (the TOYF may not be able to start itself on power-up without help). The coprocessor (a C8051, for example) could also act as the sound chip. The TOYF would squirt a "sound file" over by way of the UART. The coprocessor would interpret this sound-file to play a tune by toggling a speaker. The sound quality should be comparable to the video games of the late 1980s (probably not at the level of the C64's 6581 SID though). The 1 MHz Apple-II could do speech synthesis (somewhat robotic, but understandable). A 72 MHz C8051 should be significantly better.

The Super Nintendo (SNES) used a 3.58 MHz Ricoh 5A22 (a 65c816 derivative). It came out in 1990 and went kaput about 10 years later. The SNES was programmed in assembly-language (using self-modifying code for speed), and supported some pretty cool games. The TOYF should be an order of magnitude faster, and you get to program in Forth. High-school students might like it. :-)

LESS_THAN: ; x y -- x<y
    xor VF,CF   ; adjust for sign ; this would not be used in the unsigned version
    mov CF,BX
    lit
    ldr LX
    execute_LX

GREATER_THAN: ; x y -- x>y ; a NOT of this is LESS_THAN_OR_EQUAL
    mov 0,CF
    pls
    ldr LX
    mov LX,AX   ; AX=x BX=y
    sbc AX,BX   ; BX= y-x
    mov BN,CF   ; CF= x>y?
    xor VF,CF   ; adjust for sign ; this would not be used in the unsigned version
    mov CF,BX
    lit
    ldr LX
    execute_LX

All of the comparison and test primitives should be straight-forward.

BRANCH: ; -- ; depends on AX=1 already
    flg AX,CF   ; CF= 1
    lit
    ldr LX      ; LX= offset
    mov LX,AX
    add IP,AX
    bcs         ; NXT not needed because CF certainly set

0BRANCH: ; flag --
    not BX,CF   ; CF= zero?
    drop
    lit
    ldr LX      ; LX= offset
    mov LX,AX
    add IP,AX
    bcs
    nxt

CF0BRANCH: ; -- ; branches if CF=0
    lit
    ldr LX      ; LX= offset
    not CF      ; CF= zero?
    mov LX,AX
    add IP,AX
    bcs
    nxt

Some of our logic primitives, such as with 1-bit I/O ports, return the flag in CF --- use CF0BRANCH for this. Passing a flag in CF is faster than pushing it to the data-stack and then pulling it off the data-stack again.

BRANCH-1: ; -- ; depends on AX=1 already
    flg AX,CF   ; CF= 1
    mov -1,AX   ; AX= offset back to previous primitive
    add IP,AX
    bcs         ; NXT not needed because CF certainly set

0BRANCH-1: ; flag --
    not BX,CF   ; CF= zero?
    drop
    mov -1,AX   ; AX= offset back to previous primitive
    add IP,AX
    bcs
    nxt

CF0BRANCH-1: ; -- ; branches if CF=0
    not CF      ; CF= zero?
mov -1,AX ; AX= offset back to previous primitive add IP,AX bcs nxt It is faster, and results in smaller threaded-code, if BRANCH-1 0BRANCH-1 CF0BRANCH-1 are used instead of BRANCH 0BRANCH CF0BRANCH. Primitives like BRANCH-1 o0BRANCH-1 CF0BRANCH-1 should be provided for all the common small sizes of code blocks. Also, if the code block consists of a call to a colon word, then EXIT is done which allows IRQs to be serviced. GREATER_THAN_BRANCH: ; x y -- mov 0,CF pls ldr LX mov LX,AX ; AX=x BX=y sbc AX,BX ; BX= y-x mov BN,CF ; CF= x>y? xor VF,CF ; adjust for sign ; this would not be used in the unsigned version drop ; this is like GREATER_THAN except that we don't store CF into BX, but instead branch on it lit ldr LX ; LX= offset to branch mov LX,AX add IP,AX bcs nxt TRUE: ; -- true ; depends on AX=1 already dup mov AX,BX nxt DX_TRUE: ; -- true ; sets DX= SOS ; depends on AX=1 already xch BX,DX mov AX,BX nxt Our TRUE is 1 --- unlike in ANS-Forth in which it is -1 TO_R: ; val -- ; this is: >R mov BX,AX drop phr nxt AX R_FETCH: ; -- val ; this is: R@ mov R0,MA ldr LX dup mov LX,BX nxt R_FROM: ; -- val ; this is: R> plr ldr LX dup mov LX,BX nxt For the most part, local variables should be used. >R R@ R> can also be used though. DX_DUP: ; n -- n ; sets DX= n mov BX,AX mov AX,DX nxt DX_PREP: ; a b -- b ; sets DX= a ; preps a primitive that needs SOS in DX pls ldr LX mov LX,DX nxt DX_POST: ; b -- a b ; needs DX=a ; puts the SOS onto the memory stack phs str DX nxt NIP: ; a b -- b pls nxt RIP: ; a b c -- b c pls ldr LX mov LX,AX mov S0,MA nxt AX OVER: ; a b -- a b a mov S0,MA ldr LX dup mov LX,BX nxt DX_OVER: ; a b -- a a ; sets DX= b mov S0,MA ldr LX xch BX,DX mov LX,BX nxt TUCK: ; a b -- b a b mov S0,MA ldr LX mov LX,AX ; AX= a str BX phs nxt AX DX_TUCK: ; a b -- b b ; sets DX= a mov S0,MA ldr LX mov LX,DX str BX nxt SWAP: ; a b -- b a mov BX,AX mov S0,MA ldr LX mov LX,BX nxt AX SLOW_SWAP: ; a b -- b a mov S0,MA ldr LX str BX mov LX,BX nxt SLOW_SWAP was the original version that is obsolete. In SWAP we use NXT AX to save one clock cycle. Note that the MOV BX,AX parallelizes, so it doesn't cost anything. The idea is to have fewer group-M instructions that don't parallelize with each other. Use MOV BX,AX that is group-A. ROT and NROT also both now use NXT AX for greater efficiency from their previous versions. DX_SWAP: ; a b -- a ; sets DX= b xch BX,DX pls ldr LX mov LX,BX nxt ROT: ; a b c -- b c a mov BX,AX ; AX= c mov S0,MA ldr LX mov LX,BX ; BX= b mov S1,MA ldr LX ; LX= a str BX mov LX,BX mov S0,MA nxt AX DX_ROT: ; a b c -- b a ; sets DX= c xch BX,DX ; DX= c pls ldr LX mov LX,AX ; AX= b mov S0,MA ldr LX ; LX= a mov LX,BX nxt AX NROT: ; a b c -- c a b mov BX,AX ; AX= c mov S1,MA ldr LX mov LX,BX ; BX= a mov S0,MA ldr LX ; LX= b str BX mov S1,MA mov LX,BX nxt AX DX_NROT: ; a b c -- c b ; sets DX= a mov S1,MA ldr LX mov LX,DX ; DX= a mov BX,AX ; AX= c pls ldr LX mov LX,BX ; BX= b nxt AX In ANS-Forth, NOT AND OR XOR all do a logical operation on all the bits. The programmer almost never wants this though. In our Forth, all of these logical operators first normalize the parameters. That is, if it is non-zero, it is changed to 1. Then the logical operation is done, so the result is normalized. We also have MASK_NOT MASK-AND MASK-OR MASK-XOR that operate on all the bits, for the rare cases in which this is needed. NOT: ; n -- ~flag ; 0= in ANS-Forth ; normalizes the flags not BX,CF ; CF= zero? mov CF,BX nxt FLAG: ; n -- flag ; 0<> in ANS-Forth ; normalizes the flags flg BX,CF ; CF= nonzero? 
mov CF,BX nxt DX_DUP_FLAG: ; n -- flag ; sets DX= N ; 0<> in ANS-Forth ; normalizes the flags lit xch BX,DX ldr LX flg DX,CF ; CF= nonzero? mov CF,BX execute_LX DX_XOR: ; b -- DX -xor- b ; expects the SOS to be in DX rather than in memory ; normalizes the flags flg BX,CF mov DX,AX xor CF,AX mov AX,BX lit ldr LX execute_LX XOR: ; a b -- a -xor- b ; expects the SOS to be in DX rather than in memory ; normalizes the flags pls ldr LX mov LX,AX xor CF,AX mov AX,BX nxt DX_MASK_XOR: ; b -- DX -xor- b ; expects the SOS to be in DX rather than in memory ; depends on LX=DX mov BX,AX xor LX,AX mov AX,BX nxt MASK_XOR: ; a b -- a -xor- b mov BX,AX pls ldr LX xor LX,AX mov AX,BX nxt SLOW_MASK_NOT: ; n -- ~n ; effectively the same as: -1 XOR mov BX,AX not AX mov AX,BX nxt SLOW_MASK_NOT is the obvious way to do it, but it is inefficient. MASK_NOT: ; n -- ~n ; effectively the same as: -1 XOR neg BX sub 1,BD ; this screws up DX, but we aren't using DX anyway so it doesn't matter nxt EQUAL: ; x y -- x=y mov BX,AX ; AX= y pls ldr LX ; LX= x xor LX,AX flg AX,CF ; CF= x<>y not CF mov CF,BX lit ldr LX execute_LX DX_EQUAL: ; y -- x=y ; DX = x ; thanks to NXP, LX contains DX when DX_EQUAL is executed mov BX,AX ; AX= y xor LX,AX flg AX,CF ; CF= x<>y not CF mov CF,BX lit ldr LX execute_LX MINUS: ; a b -- a-b neg BX pls ldr LX mov 0,CF adc LX,BX nxt PLUS: ; a b -- a+b pls ldr LX mov 0,CF adc LX,BX nxt DX_MINUS: ; b -- DX-b ; expects the SOS to be in DX rather than in memory ; depends on LX=DX AX=1 already flg AX,CF ; CF= 1 mov BX,AX cng AX ; AX= -BX add LX,AX mov AX,BX ; BX= DX-BX lit ldr LX execute_LX DX_PLUS: ; b -- DX+b ; expects the SOS to be in DX rather than in memory ; depends on LX=DX already mov BX,AX add LX,AX mov AX,BX lit ldr LX execute_LX ABS: ; n -- |n| ; absolute value mov BN,CF mov BX,AX lit cng AX ldr LX mov AX,BX execute_LX DX_MINMAX: ; a b -- max ; sets DX= min xch BX,DX ; DX= b pls ldr LX ; LX= a mov LX,BX ; BX= a mov DX,AX ; AX= b mov 0,CF ; the TOYF's SBC needs CF set to 0 as done in the MC6800, not to 1 as done in the popular 6502 sbc AX,BX ; CF= b(AX,DX) > a(BX) ; CF indicates that b(DX) > a(BX) not CF ; CF= b(AX,DX) <= a(BX) ; CF indicates that b(DX) is the minimum and a(BX) is the maximum mov LX,BX ; BX= a ; the SBC AX,BX screwed up BX, so BX has to be restored mov 1,AX ; AX= offset to branch to next primitive add IP,AX bcs ; if BX is maximum and DX is minimum, then we are done SWAP_DX: ; a -- b ; needs DX=b, sets DX= a xch BX,DX nxt DX_MAXMIN: ; a b -- min ; sets DX= max This is just like DX_MINMAX above except without the NOT CF in there. MIN and MAX are just DX_MAXMIN and DX_MINMAX except that you ignore the DX value. MINMAX and MAXMIN can be accomplished with DX_MINMAX and DX_MAXMIN followed by POST_DX. SWAP_DX_POST: ; b -- b a ; needs DX=a ; puts the SOS onto the memory stack (DX is now the new SOS b) phs str BX xch DX,BX nxt Section 4.) some example code with global variales FETCH: ; adr -- val ; expects MA=BX ldr LX mov LX,BX nxt FETCH depends on MA getting set by NXP to the needed value. 
STORE: ; val adr -- mov BX,AX pls ldr LX ; LX= val mov AX,MA str LX drop nxt DX_STORE: ; adr -- ; expects DX=SOS mov DX,AX ; AX= val xch BX,DX ; DX= adr drop mov DX,MA nxt AX DX_STORE assembles as: group-A group-B group-M 1st opcode: MOV DX,AX XCH BX,DX PHS ; the PHS comes from the DROP macro 2nd opcode: STR BX ; the STR BX comes from the DROP macro 3rd opcode: MOV DX,MA 4th opcode: NXT AX This is an alternate version of DX_STORE that takes advantage of MA=BX that we now have: DX_STORE: ; adr -- ; expects DX=SOS, MA=BX (MA is set to BX by NXP) str DX drop nxt This assembles as: group-A group-B group-M 1st opcode: STR DX 2nd opcode: PHS ; the PHS comes from the DROP macro 3rd opcode: STR BX ; the STR BX comes from the DROP macro 4th opcode: NXT It seems as if DX_STORE could benefit from MA=BX that we now have, but it actually comes out to 4 clock-cycles either way. The global variables [0,8] are fastest, so they should generally be the I/O ports. They are still pretty slow --- accessing I/O is going to be a drag. Section 5.) some example code with moving blocks of data macro MOVE_WORD ; dst -- dst ; needs DX=src mov DX,MA ldr LX,BX str LX endm macro MOVE_WORD_POST_INC ; dst -- dst ; needs DX=src mov DX,MA ldr LX,BX add 1,BD str LX endm macro MOVE_WORD_PRE_DEC ; dst -- dst ; needs DX=src add 1,BD mov DX,MA ldr LX,BX str LX endm If the size of the block is known at compile-time, the above macros can be used to make a primitive to move the block. If MOVE_WORD_PRE_DEC is used repeatedly, the STR LX at the end parallelizes with the ADD 1,BD at the front. MOVE_POST_INC: ; dst -- dst ; needs DX=src CX=count ; CX can't be 0 sub 1,CX ; decrement count flg CX,CF ; CF= nonzero? mov DX,MA not CF ; CF= zero? ldr LX,BX add 1,BD mov 0,AX ; AX= offset back to ourself str LX add IP,AX bcs nxt MOVE_PRE_DEC: ; dst -- dst ; needs DX=src CX=count ; CX can't be 0 sub 1,CX ; decrement count flg CX,CF ; CF= nonzero? sub 1,BD mov DX,MA not CF ; CF= zero? ldr LX,BX mov 0,AX ; AX= offset back to ourself str LX add IP,AX bcs nxt Our MOVE_POST_INC and MOVE_PRE_DEC primitives are somewhat slow. They are heavy on group-M instructions because they are moving data between memory locations. They also cause a lot of interrupt latency when moving a large block of data due to EXIT not being done. PRE_MOVE: ; src dst count -- ; used prior to MOVE_POST_INC and MOVE_PRE_DEC mov BX,AX flg AX,CF ; CF= non-zero? xch CX,AX ; CX= count, AX= old CX phr str AX ; push old CX to return-stack pls ldr LX mov LX,DX ; DX= src, BX=dst not CF ; CF= zero? mov 2,AX ; AX= offset to POST_MOVE add IP,AX bcs nxt POST_MOVE: ; -- ; used after MOVE_POST_INC and MOVE_PRE_DEC plr ldr LX mov LX,CX ; restore old CX nxt Section 6.) 
some example code with local variables LOCAL1_ADR: ; -- adr ; depends on AX=1 ; mov n,AX ; AX= offset to local ; needed for any local other than the first dup add LF,AX mov AX,BX nxt LOCAL1_FETCH: ; -- val ; depends on AX=1 ; mov n,AX ; AX= offset to local ; needed for any local other than the first dup add LF,AX mov AX,MA ldr LX mov LX,BX nxt LOCAL1_STORE: ; val -- ; depends on AX=1 ; mov n,AX ; AX= offset to local ; needed for any local other than the first add LF,AX mov AX,MA str BX drop nxt DX_DUP_LOCAL1_FETCH ; -- val ; depends on AX=1 ; sets DX= SOS ; mov n,AX ; AX= offset to local ; needed for any local other than the first add LF,AX mov AX,MA xch BX,DX ldr LX mov LX,BX nxt DX_LOCAL1_STORE: ; val -- ; depends on AX=1 ; expects SOS in DX rather than in memory ; mov n,AX ; AX= offset to local ; needed for any local other than the first add LF,AX mov AX,MA str BX xch DX,BX nxt LOCAL1_PLUS_STORE: ; val -- ; depends on AX=1 ; mov n,AX ; AX= offset to local ; needed for any local other than the first add LF,AX mov AX,MA ldr LX ; LX= current value mov 0,CF adc LX,BX str BX drop nxt DX_LOCAL1_PLUS_STORE: ; val -- ; depends on AX=1 ; expects SOS in DX rather than in memory ; mov n,AX ; AX= offset to local ; needed for any local other than the first add LF,AX mov AX,MA mov BX,AX ldr LX ; LX= current value add LX,AX xch DX,BX nxt AX LOCAL1_PLUS_STORE is 7 clock cycles, because nothing parallelizes. This is very inefficient! DX_LOCAL1_PLUS_STORE is 3 clock cycles --- this shows the advantage of getting SOS into DX as much as possible. LOCAL1_PRE_INC_FETCH: ; -- val ; depends on AX=1 dup mov AX,BX ; BX= 1 ; mov n,AX ; AX= offset to local ; needed for any local other than the first add LF,AX mov AX,MA ldr LX mov 0,CF adc LX,BX ; BX= 1+val str BX nxt LOCAL1_POST_INC_FETCH: ; -- val ; depends on AX=1 dup ; mov n,AX ; AX= offset to local ; needed for any local other than the first add LF,AX mov AX,MA ldr LX mov LX,BX mov 1,AX ; AX= offset to local add LX,AX nxt AX In MFX (for the MiniForth) I had "ghost primitives." For these I saved the source-code, but didn't assemble them into machine-code. The first time they were used, they assembled into machine-code. Afterward, if they were used, I had the cfa already. All of these primitives that access local variables should be ghost primitives. Dozens or hundreds of them can be meta-compiled, but they won't use code-space unless they are actually used. Most programs won't need the primitives that are specific to more than four local variables. Section 7.) some example code with multiplication macro MOV_LXEX xch AX,LX mov AX,EX xch AX,LX endm macro MULTIPLY ; multiply DX * BX setting CX:AX= product, LX= old EX, MA= old CX mov 0,AX xch AX,EX ; EX= 0, AX= old EX mov AX,LX ; LX= old EX mov 0,AX xch AX,CX ; CX= 0, AX= old CX mov AX,MA ; MA= old CX mov 0,AX ; AX= 0 rept 16 ; the instructions in here parallelize, so we have 1 clock cycle per iteration, for 16 total sum EB,CA ; if low-bit of DX, then move CX:AX + EX:BX to CX:AX adj DX,EB ; shift DX right | shift EX:BX left endr ; CX:AX= product endm DX_SMULT: ; y -- x*y ; needs DX = x ; signed sgn DX,BX ; this would not be used in DX_UMULT multiply ; needs DX= x, BX = y, sets CX:AX= product mov_LXEX ; EX= old EX mov MA,CX ; CX= old CX vng AX ; this would not be used in DX_UMULT mov AX,BX lit ldr LX execute_LX Use DX_SMULT for * in Forth. 
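Here is a C model of what the MULTIPLY macro computes --- sixteen SUM EB,CA / ADJ DX,EB steps accumulating DX * BX into CX:AX. It models only the arithmetic of the unsigned core, not the packing or the clock cycles; DX_SMULT additionally wraps this with SGN DX,BX beforehand and VNG AX afterward to handle the signs. The function name is mine:

#include <stdint.h>
#include <stdio.h>
#include <assert.h>

/* Model of the 16 SUM EB,CA / ADJ DX,EB iterations: CX:AX accumulates DX * BX. */
static uint32_t multiply_model(uint16_t dx, uint16_t bx)
{
    uint32_t ca = 0;               /* CX:AX, the 32-bit accumulator, cleared before the loop */
    uint32_t eb = bx;              /* EX:BX, the multiplicand, shifted left each iteration   */
    for (int i = 0; i < 16; i++) {
        if (dx & 1)                /* SUM EB,CA: add only when the low bit of DX is set      */
            ca += eb;
        dx >>= 1;                  /* ADJ DX,EB: next multiplier bit ...                     */
        eb <<= 1;                  /* ... and double the multiplicand                        */
    }
    return ca;
}

int main(void)
{
    assert(multiply_model(300, 500) == 150000u);   /* 16x16 -> 32-bit product */
    printf("ok\n");
    return 0;
}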
DX_SMULT32: ; y -- product_high ; needs DX = x, sets DX= product_low ; signed
    sgn DX,BX   ; this would not be used in DX_UMULT32
    multiply    ; needs DX= x, BX = y, sets CX:AX= product
    vng CA      ; this would not be used in DX_UMULT32
    mov AX,DX   ; DX= product_low
    xch CX,AX
    mov AX,BX   ; BX= product_high
    xch LX,AX
    mov AX,EX   ; EX= old EX
    mov MA,CX   ; CX= old CX
    nxt

Use DX_SMULT32 for mixed-precision arithmetic, such as */ (the scaler).

DX_UMULT64: ; y -- product ; needs DX = x ; assumes unity = 64 ; unsigned
    multiply    ; needs DX= x, BX = y, sets CX:AX= product
    mov_LXEX    ; EX= old EX
    mov MA,CX   ; CX= old CX
    shr 4,CA
    shr 1,CA
    shr 1,CA    ; all of these SHR effectively divide CX:AX by 64
    mov AX,BX
    lit
    ldr LX
    execute_LX

Use DX_UMULT64 when unity is 64 --- any power of 2 for unity is fast.

DX_SUM32: ; y -- ; needs DX = x ; return: low high -- low high ; signed
    sgn DX,BX
    multiply    ; needs DX= x, BX = y, sets CX:AX= product
    vng CA
    mov AX,BX   ; BX= low
    xch CX,AX   ; AX= high
    mov MA,CX   ; CX= old CX
    mov_LXEX    ; EX= old EX
    plr         ; now R0 is the low-word
    mov R0,MA
    ldr LX
    mov 0,CF
    adc LX,BX
    str BX
    phr         ; now R0 is the high-word again --- and MA is set to R0
    mov AX,BX   ; BX= high
    ldr LX
    mov 0,CF
    adc LX,BX
    str BX
    drop
    nxt

Use DX_SUM32 to approximate polynomials, summing up the 32-bit products in an accumulator on the return-stack. Afterward, divide the sum by unity for the result. A unity that is a power of 2 is fast. Note that we don't have MOV R1,MA so we need to use PLR and PHR to adjust RS. Adjusting RS like this is one reason we can't have interrupts after arbitrary instructions. It would be possible to have interrupts only after NXP though.

Section 8.) some example code with division

macro DIV_BIT ; needs CX:AX = numerator, DX:BX = denominator, sets EX= quotient, CX:AX= remainder
    shl CA      ; CX:AX= 2*numerator (remainder)
    sub DB,CA   ; CX:AX= remainder - denominator
    mov CN,CF   ; CF= CX:AX negative?
    bit EX      ; if negative, set quotient bit to 0, else set quotient bit to 1
    cad DB,CA   ; if negative, then undo trial subtraction
endm

DX_SDIV: ; denominator -- quotient ; needs DX = numerator, sets DX= remainder ; signed
    sgn DX,BX   ; this would not be used in DX_UDIV
    mov EX,AX
    phr
    str AX      ; push old EX to return-stack
    mov DX,AX
    xch CX,AX   ; CX= numerator, AX= old CX
    phr
    str AX      ; push old CX to return-stack
    mov 0,AX    ; CX:AX= numerator shifted left 16 bits
    mov AX,DX   ; DX:BX= denominator (high word set to zero)
    rept 16
    div_bit     ; sets EX= quotient, CX= remainder
    endr
    xch CX,AX   ; AX= low-word of remainder, CX= high-word of remainder (should be 0 by this time)
    mov AX,DX   ; DX= remainder
    mov EX,AX
    vng AX      ; this would not be used in DX_UDIV
    mov AX,BX   ; BX= quotient
    plr
    ldr LX
    mov LX,CX   ; pull old CX from return-stack
    plr
    ldr LX
    mov LX,AX
    mov AX,EX   ; pull old EX from return-stack
    nxt

Our DX_SDIV takes a 16-bit numerator and converts it into a 32-bit numerator by shifting left by 16 bits. It is easy to write a function that takes a 32-bit numerator, such as the one provided by DX_SMULT32 above.

Section 9.) some example code with countdown loops

It is possible to implement DO LOOP in TOYF because we have an overflow flag (the MiniForth didn't have an overflow flag). I don't really like DO LOOP however, because it is very confusing for novices to learn, so I may not bother with this. Much more common will be a countdown loop as shown below:

PRE_LOOP: ; count -- ; push old CX to return-stack, set CX= count, branch to POST_LOOP if zero
    mov BX,AX
    xch CX,AX   ; CX= count, AX= old CX
    phr
    str AX      ; push old CX to return-stack
    not BX,CF   ; CF= zero?
drop ; -- lit ldr LX ; LX= offset to POST_LOOP mov LX,AX add IP,AX bcs nxt LOOP: ; branch if CX is non-zero, then decrement CX lit ldr LX ; LX= offset to just past PRE_LOOP flg CX,CF ; CF= nonzero? sub 1,CX ; post-decrement CX mov LX,AX add IP,AX bcs nxt INC_LOOP: ; n -- n+1 ; depends upon AX=1 add BX,AX mov AX,BX ; BX incremented lit ldr LX ; LX= offset to just past PRE_LOOP flg CX,CF ; CF= nonzero? sub 1,CX ; post-decrement CX mov LX,AX add IP,AX bcs nxt LOCAL1_INC_LOOP: ; -- mov 1,AX ; AX= offset to local add LF,AX mov AX,MA mov 1,AX ; AX= increment ldr LX ; LX= current value add LX,AX lit ldr LX ; LX= offset to just past PRE_LOOP flg CX,CF ; CF= nonzero? sub 1,CX ; post-decrement CX mov LX,AX add IP,AX bcs nxt AX POST_LOOP: ; -- ; pull CX from return-stack plr ldr LX mov LX,AX xch AX,CX ; CX= old CX pulled from return-stack nxt For a countdown loop, use: PRE_LOOP (offset to POST_LOOP) code... LOOP (offset to just past PRE_LOOP) POST_LOOP The code will execute as many times as count indicates. At POST_LOOP, our CX is -1 before it gets restored. Use INC_LOOP for the common case of incrementing an index or pointer on the data-stack. Use LOCALn_INC_LOOP for the even more common case of incrementing an index or pointer in a local variable. Section 10.) some example code with list loops PRE_LIST_LOOP: ; node -- ; sets EX= node, CX= node ; return-stack: -- old-EX old-CX prior-node mov EX,AX phr str AX ; push old EX to return-stack mov 0,AX mov BX,AX xch CX,AX ; CX= node, AX= old CX phr str AX ; push old CX to return-stack mov 0,AX phr str AX ; push 0 to return-stack as prior-node not BX,CF ; CF= nil? mov BX,AX drop mov AX,EX ; EX= node lit ldr LX ; LX= offset to POST_LIST_LOOP mov LX,AX add IP,AX bcs nxt LIST_START: ; -- ; needs CX=node (can't be nil), sets CX= next node ; EX is still the node mov CX,MA ldr LX mov LX,CX ; CX= next node nxt LIST_START_LOCAL1_QEXECUTE: ; needs CX=node (can't be nil), sets CX= next node ; EX is still the node mov CX,MA ldr LX mov 1,AX ; AX= offset to local mov LF,HM ; move LF to HF and to MA mov LX,CX ; CX= next node ldr LF ; load LF with parent function's LF add LF,AX mov AX,MA ldr LX execute_LX In many cases, executing a quotation is the first thing that is done inside of a list loop. That is what LIST_START_LOCALn_QEXECUTE is for. FIND_LIST_MATCH: ; flag -- ; set CF= early-exit? ; depends upon AX=1 flg BX,CF ; CF= true? mov 2,AX ; AX= offset over LIST_LOOP to POST_LIST add IP,AX bcs ; if CF then jump to POST_LIST nxt ; else fall into LIST_LOOP LIST_LOOP: ; -- ; needs CX = next-node ; sets EX= CX, CF= false mov EX,AX mov R0,MA str AX ; store node to prior-node slot on return-stack xch CX,AX ; CX= EX, AX= CX mov AX,EX ; update EX with next-node lit ldr LX ; LX= offset to LIST_START flg CX,CF ; CF= new node is not nil mov LX,AX add IP,AX bcs ; if CF then jump to LIST_START nxt ; else fall into POST_LIST SHORT_LIST_LOOP: ; -- ; needs CX = next-node ; sets EX= CX, CF= false mov EX,AX mov R0,MA str AX ; store node to prior-node slot on return-stack xch CX,AX ; CX= EX, AX= CX mov AX,EX ; update EX with next-node mov -1,AX ; AX= offset to LIST_START_LOCALn_QEXECUTE that is just prior to SHORT_LIST_LOOP flg CX,CF ; CF= new node is not nil add IP,AX bcs ; if CF then jump to LIST_START nxt ; else fall into POST_LIST In many cases, LIST_START_LOCALn_QEXECUTE is the only thing that is done inside of a list loop. That is what SHORT_LIST_LOOP is for. It is possible to have something other than LIST_START_LOCALn_QEXECUTE though. POST_LIST: ; -- prior-node node early-exit? 
POST_LIST: ; -- prior-node node early-exit? ; needs CF= early-exit? ; return_stack: old-EX old-CX prior-node --
    dup
    plr
    ldr LX
    mov LX,BX ; -- prior-node
    mov EX,AX
    dup
    mov AX,BX ; -- prior-node node
    dup
    mov CF,BX ; -- prior-node node early-exit?
    plr
    ldr LX
    mov LX,CX ; CX= old CX pulled from return-stack
    plr
    ldr LX
    mov LX,AX
    mov AX,EX ; EX= old EX pulled from return-stack
    nxt

To traverse a list, use:

    PRE_LIST_LOOP (offset to POST_LIST) LIST_START code... LIST_LOOP (offset to LIST_START) POST_LIST

The code inside of the loop will be given the current node in EX --- it leaves EX unmodified.

To obtain a matching node in a list, use:

    PRE_LIST_LOOP (offset to POST_LIST) LIST_START code... FIND_LIST_MATCH LIST_LOOP (offset to LIST_START) POST_LIST

The code inside of the loop will be given the current node in EX --- it leaves EX unmodified. This code must push a flag indicating if a match was found or not. FIND_LIST_MATCH consumes this flag.

POST_LIST leaves this on the data-stack: -- prior-node node early-exit?
If the loop completed, then node will be the tail node and early-exit? will be FALSE. If the loop early-exited, then node will be the matching node and early-exit? will be TRUE. Prior-node is the node prior to the node returned, or nil if the node returned is the first node.

It is okay for the code to modify the link in the node (such as when moving the node to another list). This is because CX already holds the next-node so we aren't depending upon the old node's link anymore. If this is done, the prior-node returned by POST_LIST will not be meaningful, so discard it.

The CX register is only used for iteration. Any other use would corrupt any loop that you are inside of.

NODE: ; -- node ; use inside of a list loop to obtain the current node
    dup
    mov EX,AX
    mov AX,BX
    nxt

NODE_PLUS: ; field-offset -- field-adr ; use inside of a list loop to obtain a field in the current node
    mov EX,AX
    add BX,AX
    mov AX,BX
    nxt

NODE and NODE_PLUS are two easy examples of accessing the node inside of a list loop. Note that modifying EX is not allowed. Also, if list loops are nested, NODE etc. refer to the inner loop's current node. If you want to use the outer loop's node, it must be copied somewhere, such as the data-stack or a local variable. A local variable would work well, because a quotation can be passed in that can access it.

NODE1: ; -- field-adr ; use inside of a list loop to obtain field1 in the current node
    \ mov n,AX ; this instruction needed for any field other than 1
    dup
    mov AX,BX
    mov EX,AX
    add BX,AX
    mov AX,BX
    nxt

The field-offset is generally known at compile-time, so custom primitives can be generated. Our NODE1 provides the field-address assuming a field-offset of 1.

In my experience in writing the novice-package, a merge-sort is not significantly faster than an insertion-sort. The insertion-sort has the advantage of not using any extra memory. My novice-package merge-sort used the return-stack. On the TOYF though, the return-stack is only 32 elements max. To write a merge-sort on the TOYF, you would need a stack in the heap for temporary storage. This is way more trouble than it is worth. Just use the insertion-sort --- it is reasonably fast. For insertion-sort to work, we need the prior-node to the matching node. POST_LIST above provides this.
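As an illustration of the insertion-sort described above (a hedged C sketch, with the node layout and function names invented for the example), each node is pulled off the unsorted list and linked into the sorted list in front of the first node with a larger key, using the prior-node exactly as POST_LIST provides it:

    #include <stddef.h>

    struct node { struct node *link; int key; };   /* link in the first cell */

    /* Scan the sorted list for the first node with a larger key (the
       "matching node"), keeping the prior node so the new node can be
       linked in ahead of the match. */
    static struct node *insert_sorted(struct node *sorted, struct node *n)
    {
        struct node *prior = NULL;                 /* prior-node from POST_LIST */
        struct node *cur = sorted;
        while (cur != NULL && cur->key <= n->key) {
            prior = cur;
            cur = cur->link;
        }
        n->link = cur;                             /* link in front of the match */
        if (prior == NULL)
            return n;                              /* new head of the sorted list */
        prior->link = n;
        return sorted;
    }

    static struct node *insertion_sort(struct node *head)
    {
        struct node *sorted = NULL;
        while (head != NULL) {
            struct node *next = head->link;        /* detach the front node */
            sorted = insert_sorted(sorted, head);
            head = next;
        }
        return sorted;
    }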
Section 11.) some example code for lists

PRE_NTH: ; head index -- head ; sets DX= index
    mov 2,AX ; AX= offset to skip over NTH if index is zero
    not BX,CF ; CF= index is zero?
    xch BX,DX ; DX= index
    drop ; BX= node
    add IP,AX
    bcs
    mov BX,LX ; LX= node if we are doing NTH (during NTH the BX register is invalid)
    nxt

NTH: ; invalid -- next-node ; needs LX=node DX=index
    flg DX,CF ; CF= not zero?
    sub 1,BD ; post-decrement DX (also decrements BX, but that doesn't matter because BX is invalid)
    mov LX,MA
    ldr LX ; LX= next node
    mov 0,AX ; AX= offset back to this primitive again
    add IP,AX
    bcs
    mov LX,BX ; -- next-node if we aren't doing NTH anymore
    nxt

To calculate the N'th node in a list, use:

    PRE_NTH NTH

The index is 0 for the 1st node, 1 for the 2nd node, etc.. Use GOOD_LENGTH ahead of time to ensure that the list is long enough to have a valid node at the given index.

PRE_GOOD_LENGTH: ; head count -- count ; sets DX= head, skips one thread if head is nil
    pls
    ldr LX
    mov LX,DX ; -- count ; DX= head
    mov 2,AX ; AX= offset skipping over next thread
    flg DX,CF
    not CF ; CF= nil?
    add IP,AX
    bcs
    nxt

GOOD_LENGTH: ; count -- remaining-count ; needs DX=node
    mov -1,AX ; AX= -1
    mov DX,MA
    ldr LX
    mov LX,DX ; DX= new-node
    not BX,CF ; CF= count zero?
    add BX,AX
    mov AX,BX ; post-decrement BX
    mov 0,AX ; AX= nil, and offset back to this primitive again
    cmv AX,DX ; change DX into nil if count is zero, to stop the iteration
    flg DX,CF ; CF= not nil?
    add IP,AX
    bcs
    nxt

To ensure that a list is at least COUNT nodes long, use:

    PRE_GOOD_LENGTH GOOD_LENGTH NOT

This leaves a TRUE if the list is long enough, and a FALSE if the list is too short.

PRE_LENGTH: ; head -- 0 ; sets DX= head, skips one thread if head is nil ; depends on AX=1
    not BX,CF ; CF= nil?
    xch BX,DX
    mov 0,AX
    mov AX,BX ; -- 0 ; initial count is zero
    mov 2,AX ; AX= offset skipping over next thread
    add IP,AX
    bcs
    nxt

LENGTH: ; 0 -- count ; needs DX=node, finishes with DX= nil, BX= count ; needs AX=1, DX<>NIL
    add BX,AX
    mov AX,BX ; increment BX
    mov 0,AX ; AX= offset back to this primitive again
    mov DX,MA
    ldr LX
    mov LX,DX ; DX= new node
    flg DX,CF ; CF= not nil?
    add IP,AX
    bcs
    nxt

To obtain the length, use:

    PRE_LENGTH LENGTH

PRE_LENGTH will skip over LENGTH if head is nil, leaving 0 on the stack. LENGTH will set BX= BX+1, set DX= new-node, then branch back to itself if DX is not nil.

PRE_TAIL: ; head -- head ; skips over TAIL if head is nil
    mov 2,AX ; AX= offset skipping over next thread
    not BX,CF ; CF= nil?
    add IP,AX
    bcs
    nxt

TAIL depends on the head not being nil initially, which PRE_TAIL guarantees.

TAIL: ; node -- tail ; needs BX=node; finishes with DX= nil, BX= tail ; depends on MA=BX
    ldr LX
    mov LX,DX ; DX= new node
    mov 0,AX ; AX= offset back to this primitive again
    flg DX,CF ; CF= not nil?
    add IP,AX
    xch DX,BX ; update BX assuming DX is not nil, hold old BX in DX in case we are done
    bcs ; if BCS is not taken, TAIL is done --- BX is now nil and DX holds the old node (the tail)
    xch DX,BX ; restore BX to tail (it got set to nil in the earlier XCH DX,BX) and we are done
    nxt

To obtain the tail, use:

    PRE_TAIL TAIL

PRE_TAIL will skip over TAIL if head is nil, leaving the nil on the stack. TAIL will set DX= new node, exchange DX with BX, then branch back to itself if the new node is not nil. The tail can also be obtained with LIST_LOOP, but that is much slower if all you need is the tail.
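For reference, here is what NTH, LENGTH and TAIL compute, written as a small C sketch. The struct and function names are mine; the preconditions in the comments mirror what PRE_NTH, PRE_LENGTH, PRE_TAIL and GOOD_LENGTH guarantee:

    #include <stddef.h>

    struct node { struct node *link; int payload; };   /* link in the first cell */

    /* NTH: index 0 is the first node.  Assumes GOOD_LENGTH was used ahead of
       time, so the list is long enough and no nil check is needed here. */
    static struct node *nth(struct node *head, unsigned index)
    {
        while (index--)
            head = head->link;
        return head;
    }

    /* LENGTH: PRE_LENGTH already handled the nil-head case by leaving 0. */
    static unsigned length(struct node *head)
    {
        unsigned count = 0;
        while (head != NULL) {
            count++;
            head = head->link;
        }
        return count;
    }

    /* TAIL: PRE_TAIL guarantees the head is not nil. */
    static struct node *tail(struct node *head)
    {
        while (head->link != NULL)
            head = head->link;
        return head;
    }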
PRE_LINK: ; 1st-head 2nd-head -- 1st-head 2nd-head 1st-head ; depends on AX=1 ; branch past LINK if 1st-head is nil
    mov S0,MA
    ldr LX ; LX= 1st-head
    dup
    mov LX,BX ; -- 1st-head 2nd-head 1st-head
    flg BX,CF ; CF= 1st-head is not nil, so we can do LINK
    add IP,AX ; AX was 1, so BCS will do the same as NXT --- it will do the next primitive which is TAIL
    bcs
    mov 3,AX ; AX= offset past LINK
    flg AX,CF ; CF= 1
    drop ; -- 1st-head 2nd-head
    pls ; -- 2nd-head
    add IP,AX
    bcs ; branch past LINK --- no need for NXT as CF is certainly set

LINK: ; 1st-head 2nd-head 1st-tail -- 1st-head ; TAIL fell through, so 1st-head is not nil
    pls
    ldr LX,BX ; -- 1st-head 1st-tail ; LX= 2nd-head
    str LX ; store 2nd-head into 1st-tail link field
    drop ; -- 1st-head
    nxt

To link two lists, do this:

    PRE_LINK TAIL LINK

The following is a previous version of the code for linking two lists:

PRE_LINK: ; 1st-head 2nd-head -- 1st-head 2nd-head 1st-head ; branch to BAD_LINK if 1st-head nil
    mov 3,AX ; AX= offset past TAIL and GOOD_LINK to BAD_LINK
    mov S0,MA
    ldr LX ; LX= 1st-head
    dup
    mov LX,BX ; -- 1st-head 2nd-head 1st-head
    not BX,CF ; CF= 1st-head is nil?
    add IP,AX
    bcs
    nxt

GOOD_LINK: ; 1st-head 2nd-head 1st-tail -- 1st-head ; TAIL fell through, so 1st-head is not nil
    mov 2,AX ; AX= offset past BAD_LINK
    pls
    ldr LX,BX ; LX= 2nd-head
    flg AX,CF ; CF= 1
    str LX ; store 2nd-head into 1st-tail link field
    add IP,AX
    drop ; -- 1st-head
    bcs ; no need for NXT because CF is certainly set

BAD_LINK: ; 1st-head 2nd-head 1st-head -- 2nd-head ; assumes 1st-head is nil
    drop ; -- 1st-head 2nd-head
    pls ; -- 2nd-head
    nxt

To link two lists, do this:

    PRE_LINK TAIL GOOD_LINK BAD_LINK

PRE_LINK will skip to BAD_LINK if 1st-head is nil, otherwise fall into TAIL, which falls into GOOD_LINK. GOOD_LINK will link the lists, then skip over BAD_LINK. If BAD_LINK executes it returns 2nd-head (might be nil). This is somewhat more complicated, and it uses 3 cells in the threaded code compared to 2 for the first version.

Section 12.) some example code with 1-bit I/O

Normally we would have our I/O ports in low memory, because it is slightly faster to set AX to low literal values. If several 1-bit I/O ports will be logic'd together, it is best if all of them are in the same cell. This is so MOV AX,MA LDR LX need only be done once to get all of them into LX, where they can be extracted one at a time.
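As a C sketch of the idea (my own function name and types; it assumes the port cell has already been read once into a variable, corresponding to the single MOV AX,MA LDR LX), extracting and combining several 1-bit ports from one cell looks like this. The bit positions mirror CF_FETCH7_5xor9_ior_7 below:

    #include <stdint.h>
    #include <stdbool.h>

    /* lx is the I/O cell at address 7, read once before this is called. */
    static bool combined_flags(uint16_t lx)
    {
        bool bit5 = (lx >> 5) & 1;
        bool bit9 = (lx >> 9) & 1;
        bool bit7 = (lx >> 7) & 1;
        return (bit5 ^ bit9) | bit7;   /* (bit-5 -xor- bit-9) -ior- bit-7 */
    }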
CF_STORE7_5: ; -- ; needs CF=flag, stores CF into address 7, bit-5
    mov 7,AX
    mov AX,MA
    ldr LX
    mov ^5,AX
    set CF,LX,AX ; if CF then move LX -ior- AX to AX else move LX -and- ~AX to AX
    str AX ; store the updated cell back to address 7
    nxt

FETCH7_5: ; -- flag ; flag at address 7, bit-5
    dup
    mov 7,AX
    mov AX,MA
    ldr LX
    mov ^5,AX
    and LX,AX
    flg AX,CF ; note that we can't use MOV AX,BX because AX is not normalized
    mov CF,BX ; BX is now 1 or 0 --- it is normalized
    lit
    ldr LX
    execute_LX

CF_FETCH7_5xor9_ior_7: ; -- ; CF= flag at address 7, (bit 5 -xor- bit 9) -ior- bit 7
    mov 7,AX
    mov AX,MA
    ldr LX
    mov ^5,AX
    and LX,AX
    flg AX,CF ; CF= bit-5
    mov ^9,AX
    and LX,AX ; AX= bit-9
    xor CF,AX
    flg AX,CF ; CF= bit-5 -xor- bit-9
    mov ^7,AX
    and LX,AX
    ior CF,AX
    flg AX,CF ; CF= (bit-5 -xor- bit-9) -ior- bit-7
    lit
    ldr LX
    execute_LX

CF_FETCH7_5xor9_and_3iorB: ; -- ; CF= flag at address 7, (bit 5 -xor- bit 9) -and- (bit 3 -ior- bit B) ; uses DX
    mov 7,AX
    mov AX,MA
    ldr LX
    mov ^5,AX
    and LX,AX
    flg AX,CF ; CF= bit-5
    mov ^9,AX
    and LX,AX ; AX= bit-9
    xor CF,AX
    mov AX,DX ; DX= bit-5 -xor- bit-9
    mov ^3,AX
    and LX,AX
    flg AX,CF ; CF= bit-3
    mov ^B,AX
    and LX,AX ; AX= bit-B
    ior CF,AX
    mov AX,LX ; LX= bit-3 -ior- bit-B
    mov DX,AX ; AX= bit-5 -xor- bit-9
    and LX,AX
    flg AX,CF ; CF= (bit-5 -xor- bit-9) -and- (bit-3 -ior- bit-B)
    lit
    ldr LX
    execute_LX
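The read-modify-write that CF_STORE7_5 performs can be modeled in C as follows (a sketch only; the pointer parameter and function name are mine, standing in for MOV AX,MA, LDR LX and STR AX):

    #include <stdint.h>
    #include <stdbool.h>

    /* Read the cell, then either OR the mask in or AND its complement out,
       depending on the flag, and write the cell back. */
    static void store_bit5(volatile uint16_t *cell, bool flag)
    {
        uint16_t value = *cell;          /* LDR LX                    */
        uint16_t mask  = 1u << 5;        /* MOV ^5,AX                 */
        if (flag)
            value |= mask;               /* SET CF,LX,AX: LX -ior- AX */
        else
            value &= (uint16_t)~mask;    /* LX -and- ~AX              */
        *cell = value;                   /* STR AX                    */
    }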