65ISR processor design --- by Hugh Aguilar --- September 2017

Abstract:

The 65ISR is derived from the 6502, but it only supports ISRs. It is an 8-bit processor. The full version, called the 65ISR-abu, has a W register and can access 16MB of memory. The small version, called the 65ISR-chico, lacks the W register. This would most likely be used as a coprocessor. The MIRQ interrupt is the innovative part of the 65ISR design --- nothing like this is found in any other processor. All variables are in zero-page. There is only indirect access to other memory. We have a page,Y addressing-mode that is useful for circular buffers and small arrays. We have a bank,W addressing-mode that is useful for accessing alternate 64KB banks, such as in a RAM-disk. The 65ISR has 1-bit variables similar to the i8032. These are useful for state-machines, such as in a PLC.

Section 1.) the registers

The 65ISR is a little-endian 8-bit processor. RAM is in the lower part of memory and non-volatile memory is in the upper part. We have the following registers:

A     8-bit accumulator
Y     8-bit index register
W    16-bit word register        unsupported in the 65ISR-chico
PC   16-bit program counter      12-bit in the 65ISR-chico: 1111,xxxx,xxxx,xxxx
P     6-bit processor status flags

The P register contains these flags:

bit 0   C-flag   this indicates a carry
bit 1   Z-flag   this indicates a zero
bit 2   N-flag   this indicates a negative
bit 3   V-flag   this indicates an overflow
bit 4   F-flag   this is used as a flag
bit 5   M-flag   this masks the MIRQ; every IRQ automatically sets this to 0 on start

All interrupts are masked while code is executing. Interrupts can only occur after the RTI instruction. If no interrupts are pending, then MIRQ executes unless the M-flag is masking it. If no interrupts are pending and the M-flag is set, the 65ISR goes into low-power mode until an IRQx interrupt trips.

The 65ISR executes interrupts quickly because no registers need to be saved and restored.
The BRK punctuating the main-program is typically done when no registers hold valid data, so only the PC has to be saved. The processor should be easier to implement in HDL because interrupts only occur after RTI. The MiniForth from Testra only allowed interrupts after the NXT instruction --- this is where the idea came from.

The C-flag is the same as on the 6502. ADC adds C and SBC subtracts ~C. You can use CLC before ADC to have no carry, but we also have an ADD instruction that does this automatically. You use SEC before SBC to have no borrow. We don't have a SUB instruction, so this has to be done manually. Some processors, such as the MC6805, subtract C rather than ~C, so on those you use CLC before SBC to have no borrow. The Z-flag, N-flag and V-flag are all the same as on the 6502 --- likely the same as on most other processors.

Section 2.) conditional execution

All of the instructions have two versions, depending upon the high bit of the opcode. If the high bit is set, then the instruction is conditional on the F-flag: it does nothing if the F-flag is clear, but does its usual thing if the F-flag is set. If the high bit is clear, then the instruction does its usual thing unconditionally.

The idea of conditional execution of instructions is borrowed from the ARM Cortex. The 65ISR system is simpler because we only have "execute if F-flag set" and "execute unconditionally." By comparison, the ARM Cortex has a whole slew of ways to conditionally execute an instruction. The 65ISR system is more useful though, because we are using the F-flag as an accumulator with our 1-bit flags. Be careful that the F-flag doesn't get modified in the conditional code --- it needs to hold the flag all the way through.

In the assembler code, a '?' on the front of the instruction name indicates that the conditional version is to be used. If there is no '?' on the front of the instruction name, then the unconditional version is compiled.
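The Python model below shows how the two versions of an opcode relate. It treats bit 7 as the '?' marker and the low seven bits as the operation. The helper names (`decode`, `should_execute`, `assemble`) and the opcode numbers in the usage lines are hypothetical. Only the high-bit rule comes from this section.

```python
COND_BIT = 0x80  # high bit of the opcode selects the conditional ('?') version


def decode(opcode: int):
    """Split an 8-bit opcode into (base operation, is_conditional)."""
    return opcode & 0x7F, bool(opcode & COND_BIT)


def should_execute(opcode: int, f_flag: bool) -> bool:
    """Unconditional forms always run; '?' forms run only when the F-flag is set."""
    _, conditional = decode(opcode)
    return (not conditional) or f_flag


def assemble(mnemonic: str, opcodes: dict) -> int:
    """A leading '?' on the mnemonic selects the conditional version."""
    if mnemonic.startswith("?"):
        return opcodes[mnemonic[1:]] | COND_BIT
    return opcodes[mnemonic]


# Usage with a made-up opcode table (these numbers are not real 65ISR encodings):
table = {"STA": 0x12, "BRA": 0x30}
assert assemble("?STA", table) == 0x92
assert not should_execute(0x92, f_flag=False)   # ?STA is skipped when F is clear
assert should_execute(0x12, f_flag=False)       # STA always runs
```

In this model `?BRA` is simply BRA with the high bit set. That is why the document can say BR1 and ?BRA are the same instruction.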
Micro-controllers do a lot of conditional execution, so this should make a difference in the overall speed. We are avoiding a BR0 instruction, which isn't much, but if the conditional code is short then the savings are significant. A good example of this would be debouncing switches. Factory machines usually have a lot of switches. The 65ISR is designed especially to support 1-bit variables and 1-bit ports --- our "killer app." ;-) PLCs are programmed with ladder diagrams, but 65ISR assembly-language could be used just as easily. We avoid the need for bit masks, which are inefficient and also make for complicated, error-prone assembly-language code.

In many cases, the 65ISR will not be run at a very high clock speed, and an inexpensive slow version will be used. Because the 65ISR has an efficient design though, it will be faster than a Dallas 80c320 or similar.

Section 3.) the interrupts

IRQ0 execution begins at $FC00. If more than one interrupt is pending, IRQ0 is the highest priority, etc.
IRQ1 execution begins at $FC40.
IRQ2 execution begins at $FC80.
IRQ3 execution begins at $FCC0.
IRQ4 execution begins at $FD00.
IRQ5 execution begins at $FD40.
IRQ6 execution begins at $FD80.
IRQ7 execution begins at $FDC0.
IRQ8 execution begins at $FE00.
IRQ9 execution begins at $FE40.
IRQA execution begins at $FE80.
IRQB execution begins at $FEC0.
IRQC execution begins at $FF00. If more than one interrupt is pending, IRQC is the lowest priority.
BIRQ execution begins at $FF40. This is done when the BRK instruction is executed.
MIRQ execution begins at $FF80. This is done when no IRQx interrupt is pending and the M-flag is set to zero.
start-up execution begins at $FFC0. This is done on power-up.

When any of the above begin, the PC is loaded with the start address shown, and all the other registers and flags are set to zero.

The MIRQ only executes if the M-flag is set to zero. It is normally set to 0, but the BLK instruction sets it to 1.
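The vector table is a regular grid: IRQn starts at $FC00 + n*$40, so each entry point is $40 (64) bytes after the previous one. The Python sketch below computes that grid. It also models the choice made after each RTI, using the priority and M-flag rules from Sections 1 and 3. Representing the pending lines as a set of IRQ numbers is an assumption made for illustration.

```python
IRQ_BASE = 0xFC00
SLOT = 0x40                                  # each entry point is 64 bytes after the previous one
NUM_IRQX = 13                                # IRQ0 .. IRQC
BIRQ, MIRQ, START_UP = 0xFF40, 0xFF80, 0xFFC0


def irq_vector(n: int) -> int:
    """Start address of IRQn, for n = 0 .. 12."""
    if not 0 <= n < NUM_IRQX:
        raise ValueError("the 65ISR has at most 13 IRQx lines")
    return IRQ_BASE + n * SLOT


def next_after_rti(pending: set, m_flag: int):
    """Address that runs after RTI, or None when the 65ISR drops into low-power wait."""
    if pending:
        return irq_vector(min(pending))      # IRQ0 has the highest priority
    if m_flag == 0:
        return MIRQ                          # continue the main-program
    return None                              # M-flag set by BLK: wait for an IRQx


for n in range(NUM_IRQX):
    print(f"IRQ{n:X} ${irq_vector(n):04X}")
```

BIRQ, MIRQ and start-up occupy slots 13, 14 and 15 of the same $40 grid. The whole vector area is therefore $FC00..$FFFF, which is where the 1KB minimum for code comes from.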
The BLK instruction (pronounced "block") blocks the main-program from executing until at least one IRQx has executed. The IRQx interrupts end in RTI --- this unmasks the interrupts so another interrupt can execute. Normally the M-flag is left set to 0 so the main-program can execute when no IRQx interrupts are pending. In programs that are entirely event-driven, and don't have a main-program, every IRQx should do BLK before RTI.

The MIRQ ISR is effectively the main-program. When no interrupts are pending and the M-flag is set to 0, MIRQ executes. The MIRQ ISR starts out with a JMP through a zero-page vector (the vector should be initialized during start-up). The main-program is broken up into chunks that are punctuated with BRK instructions. The BRK instruction stores the address after it to the zero-page vector, then executes the code at $FF40, which is typically RTI. If no IRQx interrupts need servicing, the MIRQ ISR executes again, jumping through that vector. If any interrupts are pending, they get serviced, and the MIRQ ISR executes when no more IRQx interrupts are pending.

From the perspective of the programmer, BRK has no effect because execution continues as if it were a NOP instruction. The programmer should be aware however, that no registers are saved through a BRK (assuming that BIRQ does an RTI). Registers (A, Y, W and the flags in P) need to be saved manually, or better yet, BRK should be done when they hold nothing valid. The BRK instruction can be thought of as polling the I/O.

BLK BRK can also be used. This blocks the main-program from executing again until at least one IRQx has been done. This is similar to a WAI instruction on a traditional processor, as we wait for some input from the outside world.

The main-program could set the vector manually with STW and then do RTI, rather than use BRK. This could be useful when the destination address is calculated somehow, and is already in W.
A typical example would be the fast-main-subroutines discussed later in this document.

There can be any number of IRQx lines, up to 13 maximum --- implement as many as needed for the application. For code, 1KB is the minimum because we need memory at $FC00 for IRQ0.

Section 4.) the addressing-modes

We have the following addressing-modes:

inherent   no operand is provided
#byte      8-bit immediate value
#word      16-bit immediate value
A          8-bit A register value
zadr       8-bit address in zero-page
page,Y     8-bit page value is the high-byte and Y is the low-byte, to form a 16-bit address
1,Y        1 is the high-byte and Y is the low-byte, to form a 16-bit address (like page,Y except explicit for page-one)
bank,W     8-bit bank value is the high-byte and W is the low-word, to form a 24-bit address
flag       8-bit index to one of 256 1-bit variables located at $00..$1F

The A addressing-mode is essentially just the inherent addressing-mode because there is no operand after the opcode.

Typically we have I/O memory-mapped in zero-page, as well as all of our data. The Y register and the page,Y addressing-mode are mostly used for 256-byte circular buffers located in RAM above zero-page. Page-one is especially efficient, so it can be used for a high-priority buffer. On the 65ISR-chico, use separate pages for low and high bytes when buffering 16-bit data. Because page-one is efficient, an alternate use would be to put a stack here --- useful if reentrancy is needed.

A file that fits in one 64KB bank can be addressed with the bank,W addressing-mode. If a file is too big, then put the even bytes in one 64KB bank and the odd bytes in another 64KB bank. The file will have to contain an even number of bytes, so it can be padded with a zero if necessary. This technique can be extended to use any number of 64KB banks, for very large files.

The (zadr),Y addressing-mode was the hallmark of the 6502. The 65ISR doesn't have that though.
A pointer can be loaded into W though, which would then be used like the (zadr) addressing-mode of the 65c02. By holding a pointer in W in a loop, there are fewer memory accesses than with the (zadr),Y addressing-mode. Switching between two pointers inside the loop, such as in a block move, is less efficient though.

At one time I wanted to have an X register that would be used as a data-stack, as traditionally done in Forth. I now think it is possible to write a Forth compiler that simulates a stack, but uses direct addressing internally. There are already many examples of C and Pascal compilers (from ByteCraft) that use only direct addressing internally. With this technique you don't get reentrancy, but our ISRs can't be interrupted anyway, so reentrancy is less important. The compiler has to be smart enough however, to not reuse zero-page memory in subroutines that are in the same call-chain.

The 65ISR-chico lacks subroutines and indirect memory-access through pointers, which are needed in any high-level language. The 65ISR-chico would generally be programmed in assembly-language, although a BASIC-like language is possible. The 65ISR-abu should generally be programmed in a high-level language. Reusing zero-page memory is complicated, but a compiler can do it --- in assembly-language, this can be error-prone.

Section 5.) the instructions

The flags are not affected unless specifically stated (unlike the 6502, our LDA doesn't affect the flags).

BLK        set M-flag to 1 (this blocks the MIRQ from executing)
FBK        load PC with $FF40 (pronounced "fast break")
BRK zadr   store PC+1 to memory >> do FBK (pronounced "break")
RTI        unmask the interrupts >> run the highest-priority pending IRQx, else MIRQ if the M-flag is 0, else go into a low-power wait mode until the next IRQx

The M-flag is automatically set to 0 when an ISR begins. BLK can be used to set it to 1 prior to the RTI or BRK. If the M-flag is set to 1, the MIRQ won't execute again until at least one IRQx interrupt has executed (and doesn't do a BLK).
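A toy model makes the round trip through MIRQ_VECTOR easier to see. In the Python sketch below, which is our own simplification, each step is one ISR or one main-program chunk ending in RTI. A chunk returns the pair (index of the chunk that follows its BRK, M-flag value). The first value stands in for the address BRK stores into MIRQ_VECTOR, and the second stands in for an optional BLK. Registers are not modeled, which matches the rule that nothing survives a BRK anyway. The names `run_mirq` and `chunks` are illustrative.

```python
def run_mirq(chunks, irq_arrivals, steps):
    """Main-program chunks split by BRK, sharing the CPU with IRQx ISRs.

    chunks       -- list of callables; chunk k runs up to its BRK and returns
                    (index of the next chunk, M-flag after it)
    irq_arrivals -- dict: step number -> IRQ numbers that become pending then
    Each step is one ISR or one main-program chunk that ends in RTI.
    """
    mirq_vector = 0          # initialized during start-up
    m_flag = 0
    pending = set()
    trace = []
    for step in range(steps):
        pending.update(irq_arrivals.get(step, ()))
        if pending:                    # any IRQx wins over the main-program
            n = min(pending)           # IRQ0 has the highest priority
            pending.discard(n)
            m_flag = 0                 # ISRs start with M=0; these ones do no BLK
            trace.append(f"IRQ{n:X}")
        elif m_flag == 0:              # MIRQ: JMP through MIRQ_VECTOR
            trace.append(f"chunk{mirq_vector}")
            mirq_vector, m_flag = chunks[mirq_vector]()
        else:                          # M=1 and nothing pending: low-power wait
            trace.append("sleep")
    return trace


chunks = [
    lambda: (1, 0),   # chunk0 ... BRK
    lambda: (2, 0),   # chunk1 ... BRK
    lambda: (0, 1),   # chunk2 ... BLK BRK (wait for input before looping)
]
print(run_mirq(chunks, {1: [5, 2], 6: [0]}, 8))
# ['chunk0', 'IRQ2', 'IRQ5', 'chunk1', 'chunk2', 'sleep', 'IRQ0', 'chunk0']
```

IRQ5 and IRQ2 become pending after chunk0. They are serviced in priority order before chunk1 resumes. The BLK in chunk2 then keeps the processor asleep until IRQ0 arrives.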
Typically the BIRQ ISR ($FF40) does RTI only. It can do other things though, which could be useful in a debugger. FBK is useful because it can replace RTI in a fast-main-subroutine, and it is the same size (one byte). This matters because the BIRQ ISR may contain something other than RTI, such as a debugger.

JMP zadr    load PC with the 16-bit value in memory (jump through a zero-page vector)
JMP #word   load PC with value
NOP         do nothing
BRA #byte   add signed value to PC
BEQ #byte   if Z then add signed value to PC
BNE #byte   if ~Z then add signed value to PC
BCS #byte   if C then add signed value to PC
BCC #byte   if ~C then add signed value to PC
BMI #byte   if N then add signed value to PC
BPL #byte   if ~N then add signed value to PC
BVS #byte   if V then add signed value to PC
BLT #byte   if N<>V then add signed value to PC
BGT #byte   if N=V and ~Z then add signed value to PC
BR1 #byte   if F then add signed value to PC
BR0 #byte   if ~F then add signed value to PC

The NOP instruction is primarily provided so machine-code can be patched, such as with an old-school monitor. Note that BR1 and ?BRA are the same.

LD0         load F-flag with 0
LD1         load F-flag with 1
LDF flag    load F-flag from 1-bit variable
STF flag    store F-flag to 1-bit variable
RSF flag    store F-flag to 1-bit variable >> do RTI
EOF flag    logical exclusive-or F-flag with 1-bit variable
IOF flag    logical inclusive-or F-flag with 1-bit variable
ANF flag    logical and F-flag with 1-bit variable
NTF         logical not F-flag
TFC         transfer F-flag to C-flag
TCF         transfer C-flag to F-flag

The instructions that use the F-flag and 1-bit variables are primarily for state-machines, such as used in PLCs. The 1-bit variable area ($00..$1F) can also be used for I/O, such as control and status ports, in which you need to access only one bit.
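The F-flag instructions act as a 1-bit accumulator over a bit-addressed area of zero-page. The Python model below assumes that flag n is bit (n mod 8) of byte n div 8, because the document does not state the bit order within a byte. The motor rung uses hypothetical flag numbers START, STOP and MOTOR. It shows the kind of ladder-logic line the 65ISR aims at, computing MOTOR = (START or MOTOR) and not STOP without any bit masks.

```python
class FlagMachine:
    """F-flag plus the 256 1-bit variables that live in zero-page $00..$1F."""

    def __init__(self):
        self.zp = bytearray(256)   # zero-page memory
        self.f = 0                 # the F-flag

    @staticmethod
    def locate(flag):
        # Assumption: flag n is bit (n % 8) of byte n // 8.
        return flag >> 3, flag & 7

    def read(self, flag):
        addr, bit = self.locate(flag)
        return (self.zp[addr] >> bit) & 1

    def ldf(self, flag):           # LDF: F = variable
        self.f = self.read(flag)

    def stf(self, flag):           # STF: variable = F
        addr, bit = self.locate(flag)
        self.zp[addr] = (self.zp[addr] & ~(1 << bit)) | (self.f << bit)

    def eof(self, flag):           # EOF: F = F xor variable
        self.f ^= self.read(flag)

    def iof(self, flag):           # IOF: F = F or variable
        self.f |= self.read(flag)

    def anf(self, flag):           # ANF: F = F and variable
        self.f &= self.read(flag)

    def ntf(self):                 # NTF: F = not F
        self.f ^= 1


# Hypothetical flag numbers for a start/stop motor rung (a classic PLC seal-in circuit).
START, STOP, MOTOR = 0, 1, 8


def motor_rung(m):
    """MOTOR = (START or MOTOR) and not STOP, using only F-flag instructions."""
    m.ldf(START)
    m.iof(MOTOR)
    m.stf(MOTOR)
    m.ldf(STOP)
    m.ntf()
    m.anf(MOTOR)
    m.stf(MOTOR)
```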
CLV         load V-flag with 0
SEV         load V-flag with 1
CLC         load C-flag with 0
SEC         load C-flag with 1
EOR zadr    logical exclusive-or value with A, setting N= high bit, V= 2nd high bit, C= low-bit
IOR zadr    logical inclusive-or value with A, setting N= high bit, V= 2nd high bit, C= low-bit
AND zadr    logical and value with A, setting N= high bit, V= 2nd high bit, C= low-bit
NOT A       logical not A, setting N= high bit, V= 2nd high bit, C= low-bit

Some 65c02 programs used the C-flag as a flag. This can be done on the 65ISR too, although the F-flag is provided for that purpose. Using the C-flag doesn't work very well because a lot more instructions modify the C-flag on the 65ISR than did on the 65c02. The V-flag can also be used as a flag. This doesn't work very well either, because all the arithmetic instructions modify it. Using the C-flag and V-flag as flags was pretty hokey --- the 65ISR has the F-flag and all the 1-bit variables, which work a lot better.

LDY #byte   load Y with value
LDY zadr    load Y with value
STY zadr    store Y to memory
TAY         transfer A to Y
TYA         transfer Y to A
CPY #byte   subtract value from Y without modifying Y, but setting C Z N flags
CPY zadr    subtract value from Y without modifying Y, but setting C Z N flags
INY         move Y plus 1 to Y, setting C Z N flags
DEY         move Y minus 1 to Y, setting C Z N flags
ILY zadr    load Y with value >> move Y + 1 to Y >> move Y to memory     pre-increment of memory value
LIY zadr    load Y with value >> move Y + 1 to memory                     post-increment of memory value
L2Y zadr    load Y with value >> move Y + 2 to memory                     post-increment of memory value

Note that INY and DEY set the C-flag, whereas on the 6502 the INY and DEY instructions don't set the C-flag. This is mostly useful for DEY, to indicate when Y has crossed the $00/$FF boundary.
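These logical instructions copy bits 7, 6 and 0 of the result into N, V and C. As a result, one instruction exposes three bits of a status byte to BMI, BVS and BCS. The small Python model below captures that rule. It leaves Z out because the table does not say how these instructions set it. The function names are illustrative. The $FF operand in the usage lines stands for a hypothetical zero-page byte, since AND only has the zadr form.

```python
def logic_flags(result: int) -> dict:
    """N, V and C as set by EOR, IOR, AND and NOT: bits 7, 6 and 0 of the result."""
    result &= 0xFF
    return {"N": (result >> 7) & 1, "V": (result >> 6) & 1, "C": result & 1}


def and_op(a: int, value: int):
    """AND zadr: A = A and value; returns (new A, flags)."""
    a = a & value & 0xFF
    return a, logic_flags(a)


# A status byte ANDed with a zero-page byte holding $FF lands in A unchanged, and its
# bits 7, 6 and 0 become N, V and C for BMI, BVS and BCS to test.
a, flags = and_op(0b0100_0001, 0xFF)
print(a, flags)   # 65 {'N': 0, 'V': 1, 'C': 1}
```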
LDW #word    load W with 16-bit value                                             unsupported in the 65ISR-chico
LDW zadr     load W with 16-bit value, latching the word if it is an I/O port     unsupported in the 65ISR-chico
LDW page,Y   load W with 16-bit value                                             unsupported in the 65ISR-chico
LDW 1,Y      load W with 16-bit value                                             unsupported in the 65ISR-chico
STW zadr     store W to memory, latching the word if it is an I/O port            unsupported in the 65ISR-chico
STW page,Y   store W to memory                                                    unsupported in the 65ISR-chico
STW 1,Y      store W to memory                                                    unsupported in the 65ISR-chico
ADW #byte    add signed value to W, setting C Z N V flags                         unsupported in the 65ISR-chico
ASW zadr     move W plus 16-bit value to W, setting C Z N V flags >> do STW       unsupported in the 65ISR-chico
MUL          unsigned multiply A times Y, set W= product                          unsupported in the 65ISR-chico

The ASW instruction, and the INC instruction (shown later), are intended for adding the partial products in a 16x16 multiply.

SLW          shift W left, moving 0 into low bit, moving high-bit into C-flag, setting N Z flags
ADW zadr     move W plus 16-bit value to W, setting C Z N V flags

The SLW and ADW instructions are intended for use in a 16/8 division. Both are unsupported in the 65ISR-chico. See the DIV macro for an example of how 16/8 division is done.
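The notes above say that ASW and INC add the partial products of a multiply, and that SLW and ADW do division, but they leave the sequences to the reader. The Python sketch below models both at the instruction level. The function names are illustrative:

- `mul16` builds a 32-bit product from four MUL results. It adds each cross term with ASW and passes the carry on with INC.
- `div16` runs the DIV macro exactly as it is written later in this document. It shifts quotient bits in rather than presetting them to 1, which is equivalent.

Our own reading of the DIV macro, not stated in the original, is that sixteen steps with D = d<<8 yield floor(N*256/d). That is the quotient with 8 integer bits in the high byte and 8 fraction bits in the low byte. The sign test behind BPL is only reliable when d <= 128 and N < d*256, so those limits are worth checking against the intended design.

```python
def mul8(a: int, y: int) -> int:
    """MUL: unsigned A times Y; the 16-bit product goes to W."""
    return (a & 0xFF) * (y & 0xFF)


def asw(mem: bytearray, addr: int, w: int) -> int:
    """ASW zadr: W = W + little-endian word at addr, stored back; returns the C-flag."""
    total = w + (mem[addr] | (mem[addr + 1] << 8))
    mem[addr] = total & 0xFF
    mem[addr + 1] = (total >> 8) & 0xFF
    return total >> 16


def inc(mem: bytearray, addr: int, c: int) -> int:
    """INC zadr: add the C-flag to one byte; returns the new C-flag."""
    total = mem[addr] + c
    mem[addr] = total & 0xFF
    return total >> 8


def mul16(a: int, b: int) -> int:
    """16x16 -> 32-bit multiply from four MUL partial products."""
    a0, a1, b0, b1 = a & 0xFF, a >> 8, b & 0xFF, b >> 8
    r = bytearray(4)                              # product, little-endian
    r[0:2] = mul8(a0, b0).to_bytes(2, "little")   # low partial product
    r[2:4] = mul8(a1, b1).to_bytes(2, "little")   # high partial product
    for p in (mul8(a0, b1), mul8(a1, b0)):        # cross terms sit at byte 1
        c = asw(r, 1, p)
        inc(r, 3, c)                              # propagate the carry into byte 3
    return int.from_bytes(r, "little")


def div_step(w: int, d: int):
    """One DIV macro: SLW; ADW T; BPL L1; LD0 STF B; ADW D.  Returns (W, quotient bit)."""
    t = (-d) & 0xFFFF              # T = D negated
    w = (w << 1) & 0xFFFF          # SLW (the bit shifted out is ignored, as in the macro)
    w = (w + t) & 0xFFFF           # ADW T: W = W*2 - D
    if (w & 0x8000) == 0:          # BPL taken: keep the preset 1 bit
        return w, 1
    return (w + d) & 0xFFFF, 0     # clear the bit and restore W


def div16(n: int, d8: int):
    """Sixteen DIV steps with D = d8 << 8, most-significant quotient bit first."""
    d = d8 << 8
    w, q = n & 0xFFFF, 0
    for _ in range(16):
        w, bit = div_step(w, d)
        q = (q << 1) | bit
    return q, w
```

For example, `div16(12345, 100)` returns a quotient of 31603 in this model. Its high byte is 123, which equals 12345 // 100.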
JSR #word    move PC+2 to W >> do JMP                  unsupported in the 65ISR-chico
RTS          move W to PC                              unsupported in the 65ISR-chico
RT0          move W to PC >> load F-flag with 0        unsupported in the 65ISR-chico
RT1          move W to PC >> load F-flag with 1        unsupported in the 65ISR-chico

LDA #byte    load A with value
LDA zadr     load A with value
LDA page,Y   load A with value
LDA 1,Y      load A with value
LDA bank,W   load A with value                                                      unsupported in the 65ISR-chico
STA zadr     store A to memory
STA page,Y   store A to memory
STA 1,Y      store A to memory
STA bank,W   store A to memory                                                      unsupported in the 65ISR-chico
LIA bank,W   load A with value >> move W plus 1 to W, setting C Z N V flags by A    unsupported in the 65ISR-chico
SIA bank,W   store A to memory >> move W plus 1 to W, setting C Z N V flags by W    unsupported in the 65ISR-chico

The SIA instruction sets the flags according to W. The N-flag is bit-15, the V-flag is bit-14, the C-flag is bit-13 and the Z-flag is set if W=0. These flags tell you when W has passed the 8KB (C), 16KB (V), 32KB (N) or 64KB (Z, when W wraps to zero) boundary. This is useful for telling you if your file has overflowed (assuming it is limited to 8KB, 16KB, 32KB or 64KB in size). The ISR that loads a file can be quite fast, and yet will still be able to catch file overflow errors. These errors would indicate a bug in the desktop-computer program that is uploading the file into the 65ISR.

The LIA instruction sets the flags according to the value of A. The N-flag is bit-7, the Z-flag is set if A=0, the V-flag is set if A=32 (an ascii space), and the C-flag is set to 1. It is common to have space-delimited ascii text, which is what the V-flag looks for. The C-flag is 1 in preparation for CPA. The N-flag indicates the high-bit is set for the datum; given 7-bit ascii, this could imply some kind of formatting info. If you assume that files have a zero sentinel for EOF, then the Z-flag tells you when you are at the end of your file. In general, the 65ISR load instructions (LDA LDY etc.) don't set flags --- LIA is an exception to this rule though.
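The Python model below covers bank,W addressing and the flag rules for SIA and LIA. It also includes the even/odd split from Section 4 for files larger than one bank. Memory is a dict keyed by 24-bit address, and unwritten locations read as zero. The function names are illustrative; only the flag assignments come from the text. Because C tracks bit 13 of W, a file stored from W=0 first sets C at the 8KB mark.

```python
def bank_address(bank: int, w: int) -> int:
    """bank,W: the 8-bit bank is the high byte and W the low word of a 24-bit address."""
    return ((bank & 0xFF) << 16) | (w & 0xFFFF)


def sia(mem: dict, bank: int, w: int, a: int):
    """SIA bank,W: store A, then W = W + 1; the flags come from the new W."""
    mem[bank_address(bank, w)] = a & 0xFF
    w = (w + 1) & 0xFFFF
    flags = {"N": (w >> 15) & 1, "V": (w >> 14) & 1, "C": (w >> 13) & 1, "Z": int(w == 0)}
    return w, flags


def lia(mem: dict, bank: int, w: int):
    """LIA bank,W: load A, then W = W + 1; the flags come from A."""
    a = mem.get(bank_address(bank, w), 0)        # unwritten memory reads as 0 in this model
    w = (w + 1) & 0xFFFF
    flags = {"N": (a >> 7) & 1, "V": int(a == 32), "C": 1, "Z": int(a == 0)}
    return a, w, flags


def split_even_odd(data: bytes):
    """Spread a file over two banks: even-offset bytes in one, odd-offset bytes in the other."""
    if len(data) % 2:
        data += b"\x00"                          # pad to an even length with a zero
    return data[0::2], data[1::2]


mem, w = {}, 0
for i in range(0x2000):                          # store an 8KB file into bank 3
    w, flags = sia(mem, 3, w, i)
print(hex(w), flags)   # 0x2000 {'N': 0, 'V': 0, 'C': 1, 'Z': 0}
```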
EXA zadr     exchange A with memory
EXA page,Y   exchange A with memory
EXA 1,Y      exchange A with memory
EXA bank,W   exchange A with memory                                   unsupported in the 65ISR-chico
ADD #byte    move A plus value to A, setting C Z N V flags
ADD zadr     move A plus value to A, setting C Z N V flags
ASA zadr     do ADD >> do STA
ADC #byte    move A plus value plus C to A, setting C Z N V flags
ADC zadr     move A plus value plus C to A, setting C Z N V flags

The ADD instruction is the same as CLC ADC but slightly faster.

SBC #byte    move A minus value minus ~C to A, setting C Z N V flags
SBC zadr     move A minus value minus ~C to A, setting C Z N V flags

Note that SBC requires the C-flag to be set to 1 prior to subtracting the low bytes in a 16-bit subtraction. We used to have a SUB instruction that was the same as SEC SBC but slightly faster, but it got discarded.

CPA #byte    subtract value from A and subtract ~C without modifying A, but setting C Z N flags
CPA zadr     subtract value from A and subtract ~C without modifying A, but setting C Z N flags

Note that CPA requires the C-flag to be set to 1 prior to comparing the low bytes in a 16-bit compare. The CPA of the high bytes will use the C-flag from the CPA of the low bytes. This is different from CPY, which doesn't use the C-flag, but does set the flags just like CPA does.

NEG A        negate A, setting Z N flags, and setting C= Z
NEG zadr     negate memory value, setting Z N flags, and setting C= Z

Note that the C-flag is set to the Z-flag in order to allow the DNEGATE macro to work. C=0 means the negation borrowed (the value was nonzero), which is the same borrow convention that SBC and DEC use. The idea came from the i8086, whose NEG sets its carry with the opposite polarity (CF=1 when the value was nonzero), because the i8086 subtracts C rather than ~C.

INC zadr     add C-flag to 8-bit memory value, setting C Z N flags
DEC zadr     subtract ~C-flag from 8-bit memory value, setting C Z N flags

Note that INC and DEC only increment or decrement depending upon what the C-flag is, unlike the 6502 where they always do it.

RNC          clock LFSR in W, setting C-flag to a pseudo-random value     unsupported in the 65ISR-chico

See the RND_F macro later in this document for a software equivalent of the algorithm used in the RNC instruction.
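Two details in this part of the instruction set are easy to check in software.

- DNEGATE (NEG adr+1, NEG adr, DEC adr+1) only works if NEG leaves C=0 when it borrows, because DEC subtracts ~C. With C= ~Z, the same three instructions give the wrong answer for every input.
- The RND_F macro later in this document describes the RNC LFSR at the flag level.

The Python model below checks both. The function names are illustrative. For `rnd_f` it makes two assumptions. First, flag offset k of SEED is bit (15 - k) of a 16-bit word whose left byte is the high byte. Second, ROR shifts the C-flag into the high bit. Under that reading the taps match the polynomial x^16 + x^14 + x^13 + x^11 + 1, a standard maximal-length choice, so any nonzero seed cycles through all 65535 nonzero states.

```python
def neg(mem: bytearray, addr: int) -> int:
    """NEG zadr: two's-complement negate one byte; returns C = Z (C=0 means it borrowed)."""
    mem[addr] = (-mem[addr]) & 0xFF
    return int(mem[addr] == 0)


def dec(mem: bytearray, addr: int, c: int) -> int:
    """DEC zadr: subtract ~C (that is, 1 - c) from one byte; returns C=0 on a borrow."""
    v = mem[addr] - (1 - c)
    mem[addr] = v & 0xFF
    return int(v >= 0)


def dnegate(mem: bytearray, adr: int) -> None:
    """DNEGATE adr: NEG adr+1; NEG adr; DEC adr+1 on a little-endian 16-bit value."""
    neg(mem, adr + 1)
    c = neg(mem, adr)
    dec(mem, adr + 1, c)


def rnd_f(seed: int):
    """One RND_F step on a 16-bit seed; returns (new seed, random bit).

    Assumed layout: flag offset k of SEED is bit (15 - k), so the taps at
    offsets 15, 4, 2 and 1 are bits 0, 11, 13 and 14, and the two RORs
    shift the whole word right with the new bit entering bit 15.
    """
    bit = (seed ^ (seed >> 11) ^ (seed >> 13) ^ (seed >> 14)) & 1
    return (seed >> 1) | (bit << 15), bit
```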
TCL          transfer C-flag to low bit of A
TCH          transfer C-flag to high bit of A
TLC          transfer low bit of A to C-flag
THC          transfer high bit of A to C-flag
ROR A        shift A right, moving C-flag into high-bit, moving low-bit into C-flag, setting N Z flags
ROR zadr     shift memory value right, moving C-flag into high-bit, moving low-bit into C-flag, setting N Z flags
ROL A        shift A left, moving C-flag into low-bit, moving high-bit into C-flag, setting N Z flags
ROL zadr     shift memory value left, moving C-flag into low-bit, moving high-bit into C-flag, setting N Z flags

Section 6.) ISRs and subroutines

This is how the MIRQ ISR would typically be written:

MIRQ_VECTOR:    dw 1            ; vector to next code chunk of MIRQ ISR

        org $FF80               ; this is the start of the MIRQ ISR
MIRQ:   JMP MIRQ_VECTOR

The MIRQ ISR is the main-program. At start, it has to jump through a vector to where it left off previously. This vector is typically set by the BRK instruction. Note that registers aren't saved and restored automatically. The code chunks punctuated by BRK should be pretty short so we don't block IRQx interrupts for very long.

The 65ISR-chico doesn't support subroutines at all. On the 65ISR-abu, subroutines are called with JSR or JSx, so they start with the return-address in W. None of our subroutines support recursion --- presumably recursion won't be needed in a micro-controller anyway. We have four kinds of subroutines:

fast-ISR-subroutines
    These hold the return-address in W and they end with RTS.
    These can't use W for data, and they can't call other subroutines.
slow-ISR-subroutines
    These hold the return-address in a zero-page vector that they own (not used by other subroutines) and end in JMP indirect.
    These can use W for data and they can call fast-ISR-subroutines or slow-ISR-subroutines.
fast-main-subroutines
    These hold the return-address in MIRQ_VECTOR and end with RTI (or FBK).
    These can use W for data, but they can't call other subroutines. They can't use BRK internally.
slow-main-subroutines
    These hold the return-address in a zero-page vector that they own and end in JMP indirect.
    These can use W for data and they can call fast-main and slow-main subroutines. They can use BRK internally.

Subroutines are either for use in ISRs or the main-program, but not both. It may be necessary to have duplicates of certain subroutines so both the ISRs and the main-program can be supported. This will result in some redundancy, but FLASH memory is cheap these days, so don't worry about it.

Fast-ISR-subroutines are hamstrung by not being able to use W for data, but they are fast in and out. Slow-ISR-subroutines can use W for data, but they are somewhat slower because the return-address is saved in memory. Fast-main-subroutines can't call other subroutines, and they can't use BRK internally, so they should be pretty short. The advantage of these is that calling one does RTI internally, so like BRK it allows pending IRQx interrupts to execute. Slow-main-subroutines can call fast-main or slow-main subroutines. They can use BRK internally, so they can be pretty long.

Some 65ISR systems won't have a main-program at all, but will be entirely event-driven. If there is a main-program, it should be punctuated with BRK (or calls to fast-main-subroutines that end in RTI internally). We need to do BRK or RTI pretty frequently so the IRQx interrupts aren't blocked for too long.

How often should BRK or RTI be done in a main-program? This depends upon the application. If the clock is fast and the I/O is slow, some interrupt latency is acceptable. For the most part though, interrupt latency should be minimized, so BRK or RTI should be done frequently. The assumption is that the main-program is not doing anything time-critical, so punctuating it with BRK or RTI is okay. The programmer has to calculate how often BRK or RTI are done in the main-program for each application. How much interrupt latency is acceptable? How fast is the clock?
How many clock cycles are there per instruction on average? As a rule-of-thumb, 10 to 20 instructions between BRK or RTI should be acceptable for most systems.

Later on we have a DIV macro that calculates one bit in the quotient, so sixteen are needed to do a 16/8 division. The entire 16/8 division is somewhat lengthy. Hopefully this doesn't need to be done in an IRQx ISR. Calculations involving division belong in the main-program and are presumably not time-critical. An example would be updating the display, which involves converting a 16-bit number into an ascii string of digits. The 16/8 division could be a slow-main-subroutine, and it could do BRK after every two bits of the quotient. Similarly, the 16x16 multiply subroutine will likely have to do BRK a few times internally because it is lengthy too.

The 65ISR mostly shines when there is no main-program, or at least the main-program is not time-critical. If a BLK is done before BRK or RTI in the main-program, the processor will wait for an IRQx to happen. When the IRQx is triggered, there is very little interrupt latency because no registers have to be saved/restored.

In an application with a main-program that is time-critical, a more traditional processor may be a better choice. The STM8, for example, supports a main-program. The problem is that its interrupts save/restore 9 bytes, which is quite a lot. You get a main-program, but you have a lot of interrupt latency. The IRET instruction takes a whopping 11 clock cycles! Entering an ISR takes 10 clock cycles, although with WFI or TRAP you get to save the registers ahead of time.

The advantage of the 65ISR is somewhat subtle. The 65ISR does not allow ISRs to be interrupted. This means that the code doesn't have to be reentrant, so we can use direct addressing of zero-page variables rather than hold local data on a stack. Holding local data on a stack is slow because the indexed addressing mode is needed.
On most processors, the indexed addressing-mode is slower than the direct zero-page addressing-mode. It may seem that having to punctuate the main-program with BRK is a hassle, but there is a hidden benefit: all the code (both ISRs and the main-program) gets to use direct addressing of zero-page variables. Dodging the requirement for code to be reentrant simplifies the HDL for the hardware and speeds up the software.

Section 7.) some sample code

; For comparing flags, use these macros for improved readability:

macro FNE flag          ; flag not equal
    EOF flag
endm

macro FEQ flag          ; flag equal
    EOF flag
    NTF
endm

; In the GET and PUT macros, PAGE is the buffer, SRC is the available data and DST is the free area.
; SRC indexes the next byte to read and DST indexes the next free slot, so SRC=DST means the buffer is empty.
; One slot is always left unused: the buffer is full when advancing DST would make it equal SRC.

macro GET page src dst err  ; load A from buffer --- jump to ERR if there is no data
    LDY src
    CPY dst
    BEQ err
    LDA page,Y
    INY
    STY src
endm

macro PUT page src dst err  ; store A into buffer --- jump to ERR if there is no room
    LDY dst
    STA page,Y      ; write into the free slot first (harmless if the buffer turns out to be full)
    INY
    CPY src         ; the buffer is full if advancing DST would make it equal SRC
    BEQ err
    STY dst         ; commit the byte
endm

; This version of GET and PUT has the disadvantage of being lengthy.
; The following versions of GET and PUT are shorter and faster:

macro GET page src dst err  ; load A from buffer --- jump to ERR if there is no data
    LIY src         ; Y= value prior to increment in memory
    CPY dst
    BEQ err         ; if this branch is taken, then the increment should not have been done
    LDA page,Y
endm

macro PUT page src dst err  ; store A into buffer --- jump to ERR if there is no room
    LIY dst         ; Y= value prior to increment in memory
    STA page,Y      ; write into the free slot first (harmless if the buffer turns out to be full)
    INY
    CPY src         ; the buffer is full if the incremented DST equals SRC
    BEQ err         ; if this branch is taken, then the increment should not have been done
endm

; In this version, if we BEQ to the ERR code, the increment needs to be undone because it stepped onto the other index.
; Branching to the ERR code should (hopefully) be pretty rare, so undoing the bad increment doesn't hurt efficiency at all.
; GET and PUT need to be fast. A lot of ISRs only do some I/O and access a buffer, and that is it.
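The GET and PUT macros implement a one-page circular buffer. The Python model below uses the same index discipline: SRC is the next byte to read, DST is the next free slot, and one slot is always left empty so that SRC=DST can only mean "empty". A PUT that compares DST with SRC before advancing would be testing the empty condition, not the full one. This model instead checks whether advancing DST would make it equal SRC. It also writes the byte before committing DST, which is harmless because the slot at DST never holds live data. The class and method names are illustrative.

```python
class Ring:
    """256-byte circular buffer in one page, indexed by SRC and DST."""

    def __init__(self):
        self.page = bytearray(256)
        self.src = 0                   # next byte to read
        self.dst = 0                   # next free slot

    def put(self, a: int) -> bool:
        y = self.dst
        self.page[y] = a & 0xFF        # STA page,Y into the free slot
        y = (y + 1) & 0xFF             # INY
        if y == self.src:              # CPY src / BEQ err: buffer full
            return False               # DST is left unchanged
        self.dst = y                   # commit the byte
        return True

    def get(self):
        y = self.src
        if y == self.dst:              # CPY dst / BEQ err: no data
            return None
        self.src = (y + 1) & 0xFF      # LIY post-increment
        return self.page[y]


r = Ring()
for i in range(255):
    r.put(i)
print(r.put(99))   # False: 255 bytes fill the page, one slot stays free
```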
; The following version is for 16-bit data:

macro GET page src dst err  ; load W from buffer --- jump to ERR if there is no data
    L2Y src         ; Y= value prior to increment in memory
    CPY dst
    BEQ err         ; if this branch is taken, then the increment should not have been done
    LDW page,Y
endm

macro PUT page src dst err  ; store W into buffer --- jump to ERR if there is no room
    L2Y dst         ; Y= value prior to increment in memory
    STW page,Y      ; write into the free slot first (harmless if the buffer turns out to be full)
    INY
    INY
    CPY src         ; the buffer is full if the incremented DST equals SRC
    BEQ err         ; if this branch is taken, then the increment should not have been done
endm

; This is essentially the same thing except that L2Y is used rather than LIY, and the data is in W rather than A.
; Note that BRK and RTI do not save the registers.
; If GET or PUT can't be done and we need to wait, when our ISR restarts it will be at the beginning again.
; If an ISR is going to resume from where a BRK or RTI left off, it must manually store the context in variables.

macro ASHR              ; arithmetic shift right A
    THC
    ROR A
endm

macro SHR               ; logical shift right A
    CLC
    ROR A
endm

macro SHL               ; logical shift left A
    CLC
    ROL A
endm

macro DNEGATE adr       ; ADR is the address of a 16-bit value in zero-page
    NEG adr+1
    NEG adr
    DEC adr+1
endm

; ASHR SHR and SHL could each be made into an instruction to be slightly faster, but there wouldn't be much of a speed boost.
; DNEGATE could be an instruction too, but this is unlikely to get enough use to justify it.
; There are lots of code segments that could be made into instructions. With an FPGA, that option is always available. :-)

macro DIV D T B         ; W=numerator, D=denominator, T=-D, B=quotient bit; done 16 times, once for each bit of the quotient
    SLW
    ADW T               ; W= W*2-D (partial remainder)
    BPL L1              ; if W>=0 then leave the quotient bit set to one (all quotient bits are set to one prior to starting)
    LD0
    STF B               ; set the quotient bit to zero
    ADW D               ; restore W
L1:
endm

; The DIV macro calculates one bit of the quotient. The quotient should be preset with all 1 bits.
; D is a 16-bit denominator: the 8-bit denominator shifted left by 8 bits.
; T is D negated (see the DNEGATE macro above).
; The bits should be calculated from most-significant to least-significant (right to left because we are little-endian).
; Execute DIV sixteen times. Whatever is left in W is the remainder (should be 8-bit because the denominator was 8-bit).
; It is possible to write a version of DIV that early-exits when the remainder is zero.
; Each iteration will be slower because there is a BEQ in there that does the early exit.
; The problem with this is that you have to assume the worst-case scenario when there is no early exit.
; Being faster most of the time isn't helpful because sooner or later you will get the worst-case scenario.
; MUL is an instruction because this has to be fast for PID control. DIV isn't used much and doesn't have to be fast.
; The most common use for DIV is dividing a 16-bit number by ten to convert it into decimal digits.
; Our DIV macro above is pretty fast though (compared to the 80c320, for example) --- it should be adequate for most uses.

macro RND_C seed        ; SEED is the leftmost flag of the 16-bit seed; works on the 65ISR-abu but not the 65ISR-chico
    LDW seed/8
    RNC
    STW seed/8
endm

macro RND_F seed        ; SEED is the leftmost flag of the 16-bit seed; works on the 65ISR-abu and the 65ISR-chico
    LDF seed+15
    EOF seed+4
    EOF seed+2
    EOF seed+1          ; F-flag is the random bit
    TFC
    ROR seed/8          ; shift left byte
    ROR seed/8+1        ; shift right byte
endm

; RND_F uses the same algorithm as the RNC instruction. It is just shown here to illustrate how RNC works internally.
; RND_F does the same thing as RND_C except that it sets the F-flag rather than the C-flag.

seed:   dw 1            ; RND_A: the 16-bit LFSR seed
ra:     dw 1            ; RND_A: return-address

RND_A:                  ; needs Y = how many bits (1 minimum and 8 maximum); sets A to a random value
    STW ra
    LDA #0              ; this is the initial value of the byte we are generating
    LDW seed
L1: RNC
    ROL A
    DEY
    BNE L1
    STW seed
    JMP ra              ; this could also be done with: LDW ra RTS

; RND_A should be random enough for games.
; There is no memory access in the loop, so it is fast.
; Note how RND_A can't hold the return-address in W because W is used internally.

I:      db 1                ; RC4: index initialized to $00
J:      db 1                ; RC4: index initialized to $00
S       equ $1              ; RC4: S array page initialized to contain the numbers [0,255] jumbled
K       equ $2              ; INIT_RC4 local: key string
KL:     db 1                ; INIT_RC4 local: key length
KI:     db 1                ; INIT_RC4 local: index into K (kept separate from J, which is the index into S)

macro EXIJ                  ; exchange bytes at S(I) and S(J); needs Y=I; leaves Y=I and A=S(I)
    LDA S,Y
    LDY J
    EXA S,Y
    LDY I
    STA S,Y
endm

INIT_RC4:                   ; needs the K array and KL; initializes the S array; initializes I and J to zero
    LDY #0
L1: TYA
    STA S,Y                 ; S(Y)= Y
    INY
    BNE L1                  ; fill the S array, leave Y = 0
    STY I                   ; I= 0 index into S
    STY J                   ; J= 0 index into S
    STY KI                  ; KI= 0 index into K
L2:                         ; begin
    LDY KI
    LDA K,Y                 ; A= K(KI)
    INY
    CPY KL
    BNE L3
    LDY #0
L3: STY KI                  ; KI= (KI + 1) mod KL
    LDY I
    ADD S,Y                 ; A= K(KI) + S(I)
    ASA J                   ; J= K(KI) + S(I) + J   A= J
    EXIJ                    ; swap S(I) and S(J)    Y= I  A= S(I)
    INY
    STY I                   ; I= I + 1
    BNE L2                  ; loop until I=0
    STY J                   ; J= 0
    RTS

RC4:                        ; set A to the next output byte, using the S array and the I J indices provided by INIT_RC4
    ILY I                   ; I= I+1   Y= I
    LDA S,Y
    ASA J                   ; J= J + S(I)   A= J
    EXIJ                    ; swap S(I) and S(J)   Y= I  A= S(I)
    LDY J
    ADD S,Y                 ; A= S(I) + S(J)
    TAY
    LDA S,Y                 ; A= S(A)
    RTS

; RC4 is very fast, but requires a page of RAM for the S array. You get speed at the cost of increased memory usage.
; RC4 also needs a page to hold the key, although this can be used for something else after INIT_RC4 completes.
; Memory conservation is only an issue in the 65ISR-chico, which may have only 1KB or 2KB of RAM.
; The 65ISR-chico could be used in a smart-card, in which case AES would likely be required.

macro randomize_high_bit    ; do this before encrypting 7-bit ascii in A
    RND_C seed
    TCH
endm

macro crypt var             ; set A= encryption or decryption of VAR byte
    JSR RC4
    EOR var
endm

macro zero_high_bit         ; do this after decrypting 7-bit ascii in A
    CLC
    TCH
endm

; In 7-bit ascii, the high-bit of each char is always zero. This is known-plaintext that the attacker can use.
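INIT_RC4 and RC4 are meant to be the usual RC4 key schedule and keystream generator, and crypt XORs each keystream byte into the data. The Python sketch below models them for checking the byte-level logic on a desktop machine. The names init_rc4, RC4, next_byte and crypt are illustrative, and ki is a separate index into the key, kept apart from the j index into S as the standard key schedule requires.

```python
def init_rc4(key: bytes) -> list[int]:
    """Key schedule modeled on INIT_RC4; returns the 256-entry S array."""
    if not 1 <= len(key) <= 256:
        raise ValueError("key must be 1 to 256 bytes long")
    s = list(range(256))              # S(Y)= Y for every Y
    j = 0                             # index into S
    ki = 0                            # separate index into the key
    for i in range(256):              # I runs from 0 until it wraps back to 0
        j = (j + s[i] + key[ki]) & 0xFF
        ki = (ki + 1) % len(key)      # the key index wraps at the key length
        s[i], s[j] = s[j], s[i]       # EXIJ: swap S(I) and S(J)
    return s


class RC4:
    """Keystream generator modeled on the RC4 routine; I and J start at zero."""

    def __init__(self, key: bytes) -> None:
        self.s = init_rc4(key)
        self.i = 0
        self.j = 0

    def next_byte(self) -> int:
        s = self.s
        self.i = (self.i + 1) & 0xFF                    # ILY I
        self.j = (self.j + s[self.i]) & 0xFF            # ASA J
        s[self.i], s[self.j] = s[self.j], s[self.i]     # EXIJ
        return s[(s[self.i] + s[self.j]) & 0xFF]        # A= S(S(I) + S(J))

    def crypt(self, data: bytes) -> bytes:
        """XOR each byte with the keystream (JSR RC4, then EOR var)."""
        return bytes(b ^ self.next_byte() for b in data)
```

RC4(b"Key").crypt(b"Plaintext") should give the commonly published test-vector ciphertext BBF316E8D940AF0AD3 (hex). Running crypt again with a fresh RC4(b"Key") restores the plaintext, since encryption and decryption are the same XOR.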
; An easy solution is to randomize the high-bit before encrypting, then zero the high-bit after decrypting.
; If speed is very important though, then don't bother. Even given known-plaintext, RC4 is pretty secure.
; If extended ascii is needed anyway (to provide Spanish chars, for example), the high-bit is no longer always zero, so the known-plaintext problem is a non-issue.

Section 8.) some example applications for the 65ISR-abu and 65ISR-chico

The 1-bit instructions make the 65ISR a good choice for factory machines that have a lot of switches.

The 65ISR-chico could be used as a coprocessor to do I/O in the background for a larger processor that has a main-program. This speeds up the main-program because it gets interrupted less often. This isn't necessarily important though. If all of the I/O is on the coprocessor, the I/O itself is handled only slightly faster (because the 65ISR-chico doesn't save/restore context). If there are two ISRs that have to be fast, they should be on separate processors so they don't delay each other.

The 65ISR-chico has to be programmed in assembly-language, and it doesn't have subroutines. The 65ISR-abu provides subroutines and indirect memory-access through pointers, which a high-level language needs. The 65ISR-abu can also access 16MB of memory, which is useful for buffering entire files.

Arcade games would benefit from having a lot of memory, because they may have multiple large graphics files to work with. Some machines have multiple games, and these could benefit from using binary overlays. The 65ISR-abu does have the RNC instruction for games, which could boost the speed somewhat. The Super Nintendo used the 65c816 processor, but the 65ISR-abu is a better design in many ways. The Super Nintendo is obsolete anyway --- afaik, game machines use 32-bit or 64-bit processors now.

CNC machines would benefit from having a lot of memory. A large file of preprocessed data can be uploaded to the 65ISR-abu. The idea with preprocessing is to avoid doing calculations while objects are moving --- because the objects have momentum!
The major bottleneck in many programs is the 16x16 multiplication. The 65ISR-abu only has an 8x8 multiply, but it does have the ASW and INC instructions, which speed up the summation of the partial products, so this is pretty fast. The 65ISR-abu can also do a 16/8 division with the DIV macro, which is useful because dividing by 10 is needed for converting numbers into decimal strings.

TCP/IP needs a lot of memory. Even a small website with minimal graphics can use a lot of memory. It is a good idea to have factory machines connected to the internet so the factory owner can monitor them from home. Also, there may be micro-controllers in remote locations, such as those used for gate-access, that need to be monitored.

Smart-phones are insecure because the NSA has backdoors. Fax machines are easier to build and can be made secure. I can foresee copy centers throughout America offering secure fax-machine usage with a public-key cryptography system. This should make the NSA snoops miserable --- although not as miserable as they deserve to be.

The eZ80 processor was designed specifically to support TCP/IP --- or any other application that needs a lot of memory. The eZ80 was described as "a poor man's ARM" --- but it became obsolete as the ARM went down in price. At this time, the STM8 and MSP430 are in the market niche that the eZ80 used to occupy. The 65ISR is arguably a better design than any of these processors --- certainly a different design, anyway.

For the most part, the 65ISR is a hobby project for me. I'm not expecting to take over the micro-controller world. I think the 65ISR has a higher fun-quotient than the MSP430, which is a warmed-over PDP11 (16 registers instead of 8). The 65ISR might appeal to programmers who have nostalgia for the 65c02 --- the 65c02 was just a lot cooler than the Z80!
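The 16x16 multiplication built from the 65ISR-abu's 8x8 multiply can be sketched as four partial products summed into a 32-bit result. In this Python model, mul8 is a hypothetical stand-in for the MUL instruction, and the correspondence of the 16-bit add to ASW and of the carry into the top byte to INC is an assumption about how those instructions would be used, not the author's code.

```python
def mul8(a: int, b: int) -> int:
    """Stand-in for the 8x8 MUL instruction: two bytes in, a 16-bit product out."""
    if not (0 <= a < 0x100 and 0 <= b < 0x100):
        raise ValueError("MUL operands must be 8-bit")
    return a * b


def mul16(x: int, y: int) -> int:
    """16x16 -> 32-bit multiply built from four 8x8 partial products."""
    if not (0 <= x < 0x10000 and 0 <= y < 0x10000):
        raise ValueError("operands must be 16-bit")
    xl, xh = x & 0xFF, x >> 8
    yl, yh = y & 0xFF, y >> 8
    low = mul8(xl, yl)            # product bytes 0-1
    high = mul8(xh, yh)           # product bytes 2-3
    for cross in (mul8(xl, yh), mul8(xh, yl)):
        mid = (low >> 8) | ((high & 0xFF) << 8)   # product bytes 1-2
        mid += cross                              # 16-bit add of the cross term
        carry, mid = mid >> 16, mid & 0xFFFF
        low = (low & 0xFF) | ((mid & 0xFF) << 8)  # write back byte 1
        high = (high & 0xFF00) | (mid >> 8)       # write back byte 2
        high += carry << 8                        # the carry out increments byte 3
    return (high << 16) | low
```

Only the two cross terms need a carry into the top byte; the low and high partial products can be stored directly because they occupy separate bytes of the result.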