65ISR processor design --- by Hugh Aguilar --- September 2017


Abstract:

The 65ISR is derived from the 6502, but it only supports ISRs. It is an 8-bit processor.
The full version, called the 65ISR-abu, has a W register and can access 16MB of memory.
The small version, called the 65ISR-chico, lacks the W register. This would most likely be used as a coprocessor.

The MIRQ interrupt is the innovative part of the 65ISR design --- nothing like this is found in any other processor.

All variables are in zero-page. There is only indirect access to other memory.
We have a page,Y addressing-mode that is useful for circular buffers and small arrays.
We have a bank,W addressing-mode that is useful for accessing alternate 64KB banks, such as in a RAM-disk.

The 65ISR has 1-bit variables similar to the i8032. These are useful for state-machines, such as in a PLC.


Section 1.) the registers

The 65ISR is a little-endian 8-bit processor. RAM is in the lower part of memory and non-volatile in the upper part.

We have the following registers:
A       8-bit accumulator
Y       8-bit index register
W       16-bit word register            unsupported in the 65ISR-chico
PC      16-bit program counter          12-bit in the 65ISR-chico: 1111,xxxx,xxxx,xxxx
P       5-bit processor status flags

The P register contains these flags:
bit 0   C-flag  this indicates a carry
bit 1   Z-flag  this indicates a zero
bit 2   N-flag  this indicates a negative
bit 3   V-flag  this indicates an overflow
bit 4   M-flag  this masks the MIRQ; every IRQ automatically sets this to 0 on start

All interrupts are masked while code is executing. Interrupts can only occur after the RTI or POL instructions.
If no interrupts are pending, then MIRQ executes unless M-flag is masking it.
If no interrupts are pending and M-flag is set, the 65ISR goes into low-power mode until an IRQx interrupt trips.

The 65ISR executes interrupts quickly because no registers need to be saved and restored.
The POL punctuating the main-program is typically done when no registers are valid, so only the PC has to be saved.
The processor should be easier to implement in HDL because interrupts only occur after RTI and POL.
The MiniForth from Testra only allowed interrupts after the NXT instruction --- this is where the idea came from.

The C-flag is the same as on the 6502. ADC adds C and SBC subtracts ~C.
You can use CLC before ADC to have no carry. We have an ADD instruction though, that does this automatically.
You can use SEC before SBC to have no borrow. We have a SUB instruction though, that does this automatically.
Some processors, such as the MC6805, subtract C rather than ~C so you use CLC before SBC to have no borrow.

The Z-flag, N-flag and V-flag are all the same as on the 6502 --- likely the same as all other processors.


Section 2.) the interrupts

IRQ0      execution begins at $FC00.    If more than one interrupt is pending, IRQ0 is the highest priority, then IRQ1, and so on.
IRQ1      execution begins at $FC40.
IRQ2      execution begins at $FC80.
IRQ3      execution begins at $FCC0.
IRQ4      execution begins at $FD00.
...
IRQ13     execution begins at $FF40.    If more than one interrupt is pending, IRQ13 is the lowest priority.
MIRQ      execution begins at $FF80.    This is done when no IRQx interrupt is pending and the M-flag is set to zero.
start-up  execution begins at $FFC0.    This is done on power-up.

When any of the above begin, only the PC is initialized. The registers and flags are all set to zero.

The MIRQ only executes if the M-flag is set to zero. It is normally set to 0, but the BLK instruction sets it to 1.
The BLK instruction (pronounced: "block") blocks the main-program from executing until at least one IRQx has executed.
The IRQx interrupts end in RTI --- this unmasks the interrupts so another interrupt can execute.
Normally the M-flag is left set to 0 so the main-program can execute when no IRQx interrupts are pending.
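
The choice of what runs after an RTI or POL can be sketched in Python. This is only a model of the rules in this section, not part of the design: the function names are made up for illustration, and irq_entry assumes that the 64-byte spacing shown for IRQ0 through IRQ4 continues for the higher-numbered lines.

```python
IRQ_BASE = 0xFC00       # IRQ0 entry point (from Section 2)
SLOT_SIZE = 0x40        # spacing between entry points (from Section 2)
MIRQ_ENTRY = 0xFF80     # MIRQ entry point (from Section 2)
STARTUP_ENTRY = 0xFFC0  # power-up entry point (from Section 2)


def irq_entry(n):
    """Entry address of IRQn, assuming the 64-byte spacing continues."""
    return IRQ_BASE + n * SLOT_SIZE


def next_entry(pending, m_flag):
    """Pick what runs after RTI or POL.

    pending: set of IRQ numbers currently requesting service.
    m_flag:  True when BLK has masked the MIRQ.
    Returns an entry address, or None for the low-power wait.
    """
    if pending:
        return irq_entry(min(pending))  # lowest number is the highest priority
    if not m_flag:
        return MIRQ_ENTRY               # the main-program resumes
    return None                         # sleep until an IRQx trips
```

For example, next_entry({3, 1}, False) selects the IRQ1 entry point, and next_entry(set(), True) returns None because BLK has masked the MIRQ and nothing else is pending.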
In programs that are entirely event-driven, and don't have a main-program, every IRQx should do BLK before RTI.

The MIRQ ISR is effectively the main-program. When no interrupts are pending and the M-flag is set to 0, MIRQ executes.
The MIRQ ISR starts out with a JMP through a zero-page vector (the vector should be initialized during start-up).
The main-program is broken up into chunks that are punctuated with POL instructions.
The POL instruction stores the address after it to the zero-page vector, then does RTI.
If no IRQx interrupts need servicing, the MIRQ ISR executes again, jumping through that vector.
If any interrupts are pending, they get serviced, and the MIRQ ISR executes when no more IRQx interrupts are pending.

From the perspective of the programmer, POL has no effect because execution continues as if it were a NOP instruction.
The programmer should be aware however, that no registers are saved through a POL.
Registers (the flags, A, Y and W) need to be saved manually, or better yet, POL should be done when they are invalid.
The POL instruction can be thought of as polling the I/O, which is where it gets its name.

BLK POL can also be used. This blocks the main-program from executing again until at least one IRQx has been done.
This is similar to a WAI instruction on a traditional processor, as we wait for some input from the outside world.

The main-program could set the vector manually with STW and then do RTI, rather than use POL.
This could be useful when the destination address is calculated somehow, and is in W.

There can be any number of IRQx lines up to 14 maximum --- implement as many as needed for the application.
For code, 1KB is the minimum because we need memory at $FC00 for IRQ0.


Section 3.) the addressing-modes

We have the following addressing-modes:
inherent        --- no operand is provided
#byte           8-bit immediate value
#word           16-bit immediate value
A               8-bit A register value
zadr            8-bit address in zero-page
page,Y          8-bit page value is the high-byte and Y is the low-byte, to form a 16-bit address
bank,W          8-bit bank value is the high-byte and W is the low-word, to form a 24-bit address
flag            8-bit index to one of 256 1-bit variables located at $00..$1F

The A addressing-mode is essentially just the inherent addressing-mode because there is no operand after the opcode.

Typically we have I/O memory-mapped in zero-page, as well as all of our data.

The Y register and the page,Y addressing-mode are mostly used for 256-byte circular buffers located in RAM above zero-page.
On the 65ISR-chico, use separate pages for low and high bytes when buffering 16-bit data.

A file that fits in one 64KB bank can be addressed with the bank,W addressing-mode.
If a file is too big, then put the even bytes in one 64KB bank and the odd bytes in another 64KB bank.
The file will have to contain an even number of bytes. It can be padded with a zero if necessary.
This technique can be extended to use any number of 64KB banks, for very large files.

The (zadr),Y addressing-mode was the hallmark of the 6502. The 65ISR doesn't have that though.
A pointer can be loaded into W though, which would then be used like the (zadr) addressing-mode of the 65c02.
By holding a pointer in W in a loop, there are fewer memory accesses than with the (zadr),Y addressing-mode.
Switching between two pointers inside the loop, such as in a block move, is less efficient though.

At one time I wanted to have an X register that would be used as a data-stack as traditionally done in Forth.
I now think it is possible to write a Forth compiler that simulates a stack, but uses direct addressing internally.
There are already many examples of C and Pascal compilers (from ByteCraft) that use only direct addressing internally.
With this technique you don't get reentrancy, but our ISRs can't be interrupted anyway, so reentrancy is less important.
The compiler has to be smart enough however, to not reuse zero-page memory in subroutines that are in the same call-chain.

The 65ISR-chico lacks subroutines and indirect memory-access through pointers, which are needed in any high-level language.
The 65ISR-chico would generally be programmed in assembly-language, although a BASIC-like language is possible.
The 65ISR-abu should generally be programmed in a high-level language.
Reusing zero-page memory is complicated, but a compiler can do it --- in assembly-language, this can be error-prone.


Section 4.) the instructions

The flags are not affected unless specifically stated (unlike the 6502, our LDA doesn't affect the flags).

BLK             set M-flag to 1 (this blocks the MIRQ from executing)
RTI             unmask the interrupts >> go into a low-power wait mode until the next interrupt
POL zadr        store PC+1 to memory >> do RTI

The M-flag is automatically set to 0 when an ISR begins. BLK can be used to set it to 1 prior to the RTI or POL at the end.
If M-flag is set to 1, the MIRQ won't execute again until at least one IRQx interrupt has executed (and doesn't do a BLK).

JMP zadr        load PC with value
JMP #word       load PC with value
NOP             do nothing
BRA #byte       add signed value to PC
BEQ #byte       if Z then add signed value to PC
BNE #byte       if ~Z then add signed value to PC
BCS #byte       if C then add signed value to PC
BCC #byte       if ~C then add signed value to PC
BMI #byte       if N then add signed value to PC
BPL #byte       if ~N then add signed value to PC
BVS #byte       if V then add signed value to PC
BLT #byte       if N<>V then add signed value to PC
BGT #byte       if N=V and ~Z then add signed value to PC

The NOP instruction is primarily provided so machine-code can be patched, such as with an old-school monitor.
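
The way POL breaks the main-program into chunks that resume through the zero-page vector can be modelled in a few lines of Python. This is a sketch under assumptions, not a description of the hardware: each chunk is a function that returns the next chunk (standing in for the address that POL stores), and the names run, chunk_a, chunk_b, irq0 and arrivals are made up for illustration.

```python
log = []


def chunk_a():
    log.append("main: chunk A")
    return chunk_b          # POL: the address after POL goes into the vector


def chunk_b():
    log.append("main: chunk B")
    return None             # end of the demo, nothing left to resume


def irq0():
    log.append("IRQ0")


def run(first_chunk, arrivals):
    """arrivals[k] lists the ISRs that become pending while chunk k runs."""
    vector = first_chunk    # models the zero-page MIRQ vector
    pending = []
    step = 0
    while vector is not None:
        pending.extend(arrivals.get(step, []))  # requests arrive during the chunk
        vector = vector()   # MIRQ: jump through the vector, run to the next POL
        while pending:      # after POL/RTI, pending IRQx are serviced first
            pending.pop(0)()
        step += 1


run(chunk_a, {0: [irq0]})
```

After the call, log holds the chunk A entry, then the IRQ0 entry, then the chunk B entry: the interrupt that arrived during chunk A waits for the POL, and the main-program then resumes at chunk B. Nothing in the model carries register contents from one chunk to the next, which mirrors the rule that POL does not save registers.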

CLC             set C-flag to 0
SEC             set C-flag to 1
LDC flag        load C-flag from 1-bit variable
STC flag        store C-flag to 1-bit variable
EOC flag        logical exclusive-or C-flag with 1-bit variable
IOC flag        logical inclusive-or C-flag with 1-bit variable
ANC flag        logical and C-flag with 1-bit variable
NTC             logical not C-flag
RNC             clock an LFSR in W, setting C-flag to a pseudo-random value     unsupported in the 65ISR-chico
TCL             transfer C-flag to low bit of A
TCH             transfer C-flag to high bit of A
TLC             transfer low bit of A to C-flag
THC             transfer high bit of A to C-flag

The instructions that use the C-flag and 1-bit variables are primarily for state-machines, such as used in PLCs.
This area can also be used for I/O, such as control and status ports, in which you need to access only one bit.
See the RND_C macro later in this document for an equivalent in software for the algorithm used in the RNC instruction.

EOR zadr        logical exclusive-or value with A, setting N= high bit, V= 2nd high bit, C= low-bit
IOR zadr        logical inclusive-or value with A, setting N= high bit, V= 2nd high bit, C= low-bit
AND zadr        logical and value with A, setting N= high bit, V= 2nd high bit, C= low-bit
NOT A           logical not A, setting N= high bit, V= 2nd high bit, C= low-bit

LDY #byte       load Y with value
LDY zadr        load Y with value
STY zadr        store Y to memory
TAY             transfer A to Y
TYA             transfer Y to A
CPY #byte       subtract value from Y without modifying Y, but setting Z N flags, and setting C= ~Z
CPY zadr        subtract value from Y without modifying Y, but setting Z N flags, and setting C= ~Z
INY             move Y plus 1 to Y setting C Z N flags
DEY             move Y minus 1 to Y setting C Z N flags
ILY zadr        load Y with value >> move Y + 1 to Y >> move Y to memory        pre-increment of memory value
LIY zadr        load Y with value >> move Y + 1 to memory                       post-increment of memory value
L2Y zadr        load Y with value >> move Y + 2 to memory                       post-increment of memory value

Note that INY DEY set the C-flag. On the 6502 the INY DEY instructions don't set the C flag.
This is mostly useful for DEY to indicate when we have crossed the boundary.

LDW #word       load W with 16-bit value                                        unsupported in the 65ISR-chico
LDW zadr        load W with 16-bit value, latching the word if it is an I/O port        unsupported in the 65ISR-chico
LDW page,Y      load W with 16-bit value                                        unsupported in the 65ISR-chico
STW zadr        store W to memory, latching the word if it is an I/O port       unsupported in the 65ISR-chico
STW page,Y      store W to memory                                               unsupported in the 65ISR-chico
ADW #byte       add signed value to W setting C Z N V flags                     unsupported in the 65ISR-chico
ASW zadr        move W plus 16-bit value to W setting C Z N V flags >> do STW   unsupported in the 65ISR-chico
MUL             unsigned multiply A times Y, set W= product                     unsupported in the 65ISR-chico

The ASW instruction, and the INC instruction (shown later), are intended for adding the partial products in a 16x16 multiply.

SLW             shift W left, moving 0 into low bit, moving high-bit into C-flag, setting N Z flags
ADW zadr        move W plus 16-bit value to W setting C Z N V flags

The SLW and ADW instructions are intended for use in a 16/8 division. Both are unsupported in the 65ISR-chico.
See the DIV macro for an example of how 16/8 division is done.

JSR #word       move PC+2 to W >> do JMP                unsupported in the 65ISR-chico
RTS             move W to PC                            unsupported in the 65ISR-chico
JS0             move PC to W >> move $100 to PC         unsupported in the 65ISR-chico
JS2             move PC to W >> move $120 to PC         unsupported in the 65ISR-chico
JS4             move PC to W >> move $140 to PC         unsupported in the 65ISR-chico
JS6             move PC to W >> move $160 to PC         unsupported in the 65ISR-chico
JS8             move PC to W >> move $180 to PC         unsupported in the 65ISR-chico
JSA             move PC to W >> move $1A0 to PC         unsupported in the 65ISR-chico
JSC             move PC to W >> move $1C0 to PC         unsupported in the 65ISR-chico
JSE             move PC to W >> move $1E0 to PC         unsupported in the 65ISR-chico

Eight common subroutine calls can be made fast and small.
Each subroutine can be up to 32 bytes long, which is quite a lot. These are in RAM.
They should generally be initialized by the start-up code. They could be used for jitting though.

LDA #byte       load A with value
LDA zadr        load A with value
LDA page,Y      load A with value
LDA bank,W      load A with value                       unsupported in the 65ISR-chico
STA zadr        store A to memory
STA page,Y      store A to memory
STA bank,W      store A to memory                       unsupported in the 65ISR-chico
EXA zadr        exchange A with memory
EXA page,Y      exchange A with memory
EXA bank,W      exchange A with memory                  unsupported in the 65ISR-chico

ADD #byte       move A plus value to A setting C Z N V flags
ADD zadr        move A plus value to A setting C Z N V flags
ASA zadr        do ADD >> do STA
ADC #byte       move A plus value plus C to A setting C Z N V flags
ADC zadr        move A plus value plus C to A setting C Z N V flags
SUB #byte       move A minus value to A setting C Z N V flags
SUB zadr        move A minus value to A setting C Z N V flags
SBC #byte       move A minus value minus ~C to A setting C Z N V flags
SBC zadr        move A minus value minus ~C to A setting C Z N V flags
NEG A           negate A setting Z N flags, and setting C= ~Z
NEG zadr        negate memory value setting Z N flags, and setting C= ~Z
INC zadr        add C-flag to 8-bit memory value setting C Z N flags
DEC zadr        subtract ~C-flag from 8-bit memory value setting C Z N flags

Note that INC and DEC only increment or decrement depending upon what the C-flag is, unlike the 6502 that always does it.

ROR A           shift A right, moving C-flag into high-bit, moving low-bit into C-flag, setting N Z flags
ROR zadr        shift memory value right, moving C-flag into high-bit, moving low-bit into C-flag, setting N Z flags
ROL A           shift A left, moving C-flag into low-bit, moving high-bit into C-flag, setting N Z flags
ROL zadr        shift memory value left, moving C-flag into low-bit, moving high-bit into C-flag, setting N Z flags


Section 5.) ISRs and subroutines

This is how the MIRQ ISR would typically be written:

MIRQ_VECTOR:    dw 1                    ; vector to next code chunk of MIRQ ISR

                org $FF80               ; this is the start of the MIRQ ISR
MIRQ:           JMP MIRQ_VECTOR

The MIRQ ISR is the main-program. At start, it has to jump through a vector to where it left off previously.
This vector is typically set by the POL instruction. Note that registers aren't saved and restored automatically.
The code chunks punctuated by POL should be pretty short so we don't block IRQx interrupts for very long.

The 65ISR-chico doesn't support subroutines at all.
On the 65ISR-abu subroutines are called with JSR or JSx, so they start with the return-address in W.
None of our subroutines support recursion --- presumably recursion won't be needed in a micro-controller anyway.

We have four kinds of subroutines:

fast-ISR-subroutines
        These hold the return-address in W and they end with RTS.
        These can't use W for data, and they can't call other subroutines.

slow-ISR-subroutines
        These hold the return-address in a zero-page vector that they own (not used by other subroutines) and end in JMP indirect.
        These can use W for data and they can call fast-ISR-subroutines or slow-ISR-subroutines.

fast-main-subroutines
        These hold the return-address in MIRQ_VECTOR and end with RTI.
        These can use W for data and they can't call other subroutines. They can't use POL internally.

slow-main-subroutines
        These hold the return-address in a zero-page vector that they own and end in JMP indirect.
        These can use W for data and they can call fast-main and slow-main subroutines. They can use POL internally.

Subroutines are either for use in ISRs or the main-program, but not both.
It may be necessary to have duplicates of certain subroutines so both ISRs and the main-program can be supported.
This will result in some redundancy, but FLASH memory is cheap these days, so don't worry about it.

Fast-ISR-subroutines are hamstrung by not being able to use W for data, but they are fast in and out.
Slow-ISR-subroutines can use W for data, but they are somewhat slower because the return-address is saved in memory.

Fast-main-subroutines can't call other subroutines, and they can't use POL internally. They should be pretty short.
The advantage of these is that calling one does RTI internally; it is like POL in allowing pending IRQx interrupts to execute.
Slow-main-subroutines can call fast-main or slow-main subroutines. They can use POL internally, so they can be pretty long.

Some 65ISR systems won't have a main-program at all, but will be entirely event-driven.
If there is a main-program, it should be punctuated with POL (or calls to fast-main-subroutines that end in RTI internally).
We need to do POL or RTI pretty frequently so the IRQx interrupts aren't blocked for too long.

How often should POL or RTI be done in a main-program? This depends upon the application.
If the clock is fast and the I/O is slow, some interrupt latency is acceptable.
For the most part though, interrupt latency should be minimized, so POL or RTI should be done frequently.
The assumption is that the main-program is not doing anything time-critical so punctuating it with POL or RTI is okay.

The programmer has to calculate how often POL or RTI are done in the main-program for each application.
How much interrupt latency is acceptable? How fast is the clock? How many clock cycles are there per instruction on average?
As a rule-of-thumb, 10 to 20 instructions between POL or RTI should be acceptable for most systems.

Later on we have a DIV macro that calculates one bit in the quotient, so sixteen are needed to do a 16/8 division.
The entire 16/8 division is somewhat lengthy. Hopefully this doesn't need to be done in an IRQx ISR.
Calculations involving division belong in the main-program and are presumably not time-critical.
An example would be updating the display, which involves converting a 16-bit number into an ascii string of digits.
The 16/8 division could be a slow-main-subroutine, and it could do POL after every two bits of the quotient.
Similarly, the 16x16 multiply subroutine will likely have to do POL a few times internally because it is lengthy too.

The 65ISR mostly shines when there is no main-program, or at least the main-program is not time-critical.
If a BLK is done before POL or RTI in the main-program, the processor will wait for an IRQx to happen.
When the IRQx is triggered, there is very little interrupt latency because no registers have to be saved/restored.

In an application with a main-program that is time-critical, a more traditional processor may be a better choice.
The STM8, for example, supports a main-program. The problem is that its interrupts save/restore 9 bytes, which is quite a lot.
You get a main-program, but you have a lot of interrupt latency. The IRET instruction takes a whopping 11 clock cycles!
Entering an ISR takes 10 clock cycles, although with WFI or TRAP you get to save the registers ahead of time.

The advantage of the 65ISR is somewhat subtle. The 65ISR does not allow ISRs to be interrupted.
This means that the code doesn't have to be reentrant.
This means that we can use direct addressing of zero-page variables rather than hold local data on a stack.
Holding local data on a stack is slow because the indexed addressing mode is needed.
On most processors, the indexed addressing-mode is slower than the direct zero-page addressing-mode.

It may seem that having to punctuate the main-program with POL is a hassle, but there is a hidden benefit.
The benefit is that all the code (both ISRs and the main-program) gets to use direct addressing of zero-page variables.
Dodging the requirement for code to be reentrant simplifies the HDL for the hardware and speeds up the software.


Section 6.) some sample code

; In the GET and PUT macros, PAGE is the buffer, SRC is the available data and DST is the free area.

macro GET page src dst err      ; load A from buffer --- jump to ERR if there is no data
        LDY src
        CPY dst
        BEQ err
        LDA page,Y
        INY
        STY src
endm

macro PUT page src dst err      ; store A into buffer --- jump to ERR if there is no room
        LDY dst
        CPY src
        BEQ err
        STA page,Y
        INY
        STY dst
endm

; This version of GET and PUT has the disadvantage of being lengthy.
; The following versions of GET and PUT are each one instruction shorter:

macro GET page src dst err      ; load A from buffer --- jump to ERR if there is no data
        LDY src
        CPY dst
        BEQ err
        LDA page,Y
        INC src                 ; C-flag was set to 1 by CPY
endm

macro PUT page src dst err      ; store A into buffer --- jump to ERR if there is no room
        LDY dst
        CPY src
        BEQ err
        STA page,Y
        INC dst                 ; C-flag was set to 1 by CPY
endm

; This version of GET and PUT also has the disadvantage of being lengthy, and INC does an extra memory access.
; The following versions of GET and PUT are each one instruction shorter, and we have one less memory access:

macro GET page src dst err      ; load A from buffer --- jump to ERR if there is no data
        LIY src                 ; Y= value prior to increment in memory
        CPY dst
        BEQ err                 ; if this branch is taken, then the increment should not have been done
        LDA page,Y
endm

macro PUT page src dst err      ; store A into buffer --- jump to ERR if there is no room
        LIY dst                 ; Y= value prior to increment in memory
        CPY src
        BEQ err                 ; if this branch is taken, then the increment should not have been done
        STA page,Y
endm

; In this version, if we BEQ to the ERR code, the increment needs to be undone because it stepped over the other index.
; Branching to the ERR code should (hopefully) be pretty rare, so undoing the bad increment doesn't hurt efficiency much.
; GET and PUT need to be fast. A lot of ISRs only do some I/O and access a buffer, and that is it.
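
The index arithmetic behind GET and PUT can be modelled in Python to check the wrap-around and the undo on the error path. This is a sketch under assumptions, not a transcription of the macros: the class name Ring is made up, and the sketch keeps one slot empty so that SRC = DST can mean "no data" while DST + 1 = SRC means "no room". The macros as printed test SRC = DST in both GET and PUT, so the full test in put below is an assumption about the intent.

```python
class Ring:
    """Model of a 256-byte page indexed by two 8-bit counters."""

    def __init__(self):
        self.page = bytearray(256)
        self.src = 0    # index of the next byte to read
        self.dst = 0    # index of the next free slot

    def get(self):
        y = self.src
        self.src = (y + 1) & 0xFF       # LIY src: post-increment in memory
        if y == self.dst:               # CPY dst / BEQ err
            self.src = y                # the ERR code undoes the increment
            raise IndexError("no data")
        return self.page[y]             # LDA page,Y

    def put(self, value):
        y = self.dst
        nxt = (y + 1) & 0xFF
        if nxt == self.src:             # assumed full test: one slot stays empty
            raise IndexError("no room")
        self.page[y] = value & 0xFF     # STA page,Y
        self.dst = nxt
```

With this convention the buffer holds at most 255 bytes, and both indices wrap from $FF to $00 without any extra code because they are 8-bit values.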

; The following version is for 16-bit data:

macro GET page src dst err      ; load W from buffer --- jump to ERR if there is no data
        L2Y src                 ; Y= value prior to increment in memory
        CPY dst
        BEQ err                 ; if this branch is taken, then the increment should not have been done
        LDW page,Y
endm

macro PUT page src dst err      ; store W into buffer --- jump to ERR if there is no room
        L2Y dst                 ; Y= value prior to increment in memory
        CPY src
        BEQ err                 ; if this branch is taken, then the increment should not have been done
        STW page,Y
endm

; This is essentially the same thing except that L2Y is used rather than LIY, and the data is in W rather than A.

; Note that POL and RTI do not save the registers.
; If GET or PUT can't be done and we need to wait, when our ISR restarts it will be at the beginning again.
; If an ISR is going to resume where a POL or RTI left off, it must manually store the context in variables.

macro ASHR              ; arithmetic shift right A
        THC
        ROR A
endm

macro SHR               ; logical shift right A
        CLC
        ROR A
endm

macro SHL               ; logical shift left A
        CLC
        ROL A
endm

macro DNEGATE adr       ; ADR is the address of a 16-bit value in zero-page
        NEG adr+1
        NEG adr         ; C= 1 if the low byte is non-zero
        NTC             ; DEC subtracts ~C, so invert C to borrow only when the low byte is non-zero
        DEC adr+1
endm

; ASHR SHR and SHL could each be made into an instruction to be slightly faster, but there wouldn't be much of a speed boost.
; DNEGATE could be an instruction too, but this is unlikely to get enough use to justify it.
; There are lots of code segments that could be made into instructions. With an FPGA, that option is always available. :-)

macro DIV D T B         ; W=numerator, D=denominator, T=-D, B=bit       this is done 16 times for each B in quotient
        SLW
        ADW T           ; W= W*2-D (partial remainder)
        BPL L1          ; if W>=0 then leave the quotient bit set to one (all quotient bits set to one prior to starting)
        CLC
        STC B           ; set the quotient bit to zero
        ADW D           ; restore W
L1:
endm

; The DIV macro calculates one bit of the quotient. The quotient should be preset with all 1 bits.
; D is a 16-bit denominator; the 8-bit denominator shifted left by 8 bits.
; T is D negated (see the DNEGATE macro above).
; The bits should be calculated from most-significant to least-significant (right to left because we are little-endian).
; Execute DIV sixteen times. Whatever is left in W is the remainder (should be 8-bit because the denominator was 8-bit).

; MUL is an instruction because this has to be fast for PID control. DIV isn't used much and doesn't have to be fast.
; The most common use for DIV is dividing a 16-bit number by ten to convert it into decimal digits.
; Our DIV macro above is pretty fast though (compared to the 65c02, for example) --- it should be adequate for most uses.

macro RND_C seed        ; SEED is the leftmost flag of the 16-bit seed  works on the 65ISR-abu but not the 65ISR-chico
        LDW seed/8
        RNC
        STW seed/8
endm

macro RND_C seed        ; SEED is the leftmost flag of the 16-bit seed  works on the 65ISR-abu and the 65ISR-chico
        LDC seed+15
        EOC seed+4
        EOC seed+2
        EOC seed+1      ; C-flag is the random bit
        ROR seed/8      ; shift left byte
        ROR seed/8+1    ; shift right byte
        LDC seed+0      ; this was C-flag before it got shifted into the leftmost bit
endm

; The two equivalent versions of RND_C are shown here to illustrate how the RNC instruction works.

seed:   dw 1            ; RND_A: address of a 16-bit LFSR seed
ra:     dw 1            ; RND_A: return-address

RND_A:                  ; needs Y = how many bits (1 minimum and 8 maximum); sets A to a random value
        STW ra
        LDA #0          ; this is the initial value of the byte we are generating
        LDW seed
L1:     RNC
        ROL A
        DEY
        BNE L1
        STW seed
        LDW ra
        RTS

; RND_A should be random enough for games. There is no memory access in the loop, so it is fast.
; Note how RND_A can't hold the return-address in W because W is used internally.
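
The software version of RND_C can be modelled in Python to see what the LFSR does. This is a sketch under assumptions: the seed is treated as one 16-bit integer whose bit 15 is the leftmost flag (seed+0) and whose bit 0 is the rightmost flag (seed+15), the function names rnc and rnd_a are made up, and how the flags map onto the two bytes of W in hardware is not modelled.

```python
def rnc(w):
    """One LFSR step on a 16-bit value; returns (new_w, carry)."""
    # flags seed+15, seed+4, seed+2, seed+1 are bits 0, 11, 13, 14
    bit = ((w >> 0) ^ (w >> 11) ^ (w >> 13) ^ (w >> 14)) & 1
    # the two RORs shift the seed right; the new bit enters on the left
    w = (w >> 1) | (bit << 15)
    return w, bit       # LDC seed+0 leaves the new bit in the C-flag


def rnd_a(w, nbits):
    """Collect nbits (1 to 8) random bits, as RND_A does; returns (new_w, a)."""
    a = 0
    for _ in range(nbits):
        w, c = rnc(w)
        a = ((a << 1) | c) & 0xFF   # ROL A shifts the C-flag into the low bit
    return w, a
```

The seed must not be zero, because a zero seed stays zero forever. With these taps the model is expected to step through all 65535 non-zero values before it repeats.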

I:      db 1            ; RC4: index                    initialized to $00
J:      db 1            ; RC4: index                    initialized to $00
S       equ $1          ; RC4: S array page             initialized to contain the numbers [0,255] jumbled
K       equ $2          ; INIT_RC4 local: key string
KL:     db 1            ; INIT_RC4 local: key length
KI:     db 1            ; INIT_RC4 local: index into K (kept separate from J, which holds the running sum)

macro EXIJ              ; exchange bytes at S(I) and S(J); needs Y=I; leaves Y=I and A=S(I)
        LDA S,Y
        LDY J
        EXA S,Y
        LDY I
        STA S,Y
endm

INIT_RC4:               ; needs the K array and KL; initializes the S array; initializes I and J to zero
        LDY #0
L1:     TYA
        STA S,Y         ; S(Y)= Y
        INY
        BNE L1          ; fill the S array, leave Y = 0
        STY I           ; I= 0          index into S
        STY J           ; J= 0          running sum
        STY KI          ; KI= 0         index into K
L2:     LDY KI
        LDA K,Y         ; A= K(KI)
        INY
        CPY KL
        BNE L3
        LDY #0
L3:     STY KI          ; KI= (KI + 1) mod KL
        LDY I
        ADD S,Y         ; A= K(KI) + S(I)
        ASA J           ; J= K(KI) + S(I) + J           A= J
        EXIJ            ; swap S(I) and S(J)            Y= I    A= S(I)
        INY
        STY I           ; I= I + 1
        BNE L2          ; loop until I=0
        STY J           ; J= 0
        RTS

RC4:                    ; set A to output byte, using the S array and the I J indices provided by INIT_RC4
        ILY I           ; I= I+1        Y= I
        LDA S,Y
        ASA J           ; J= J + S(I)   A= J
        EXIJ            ; swap S(I) and S(J)            Y= I    A= S(I)
        LDY J
        ADD S,Y         ; A= S(I) + S(J)
        TAY
        LDA S,Y         ; A= S(A)
        RTS

; RC4 is very fast, but requires a page of RAM for the S array. You get speed at the cost of increased memory usage.
; RC4 also needs a page to hold the key, although this can be used for something else after INIT_RC4 completes.
; Memory conservation is only an issue in the 65ISR-chico that may have only 1KB or 2KB of RAM.
; The 65ISR-chico could be used in a smart-card, in which case AES would likely be required.

macro randomize_high_bit        ; do this before encrypting 7-bit ascii in A
        RND_C seed
        TCH
endm

macro crypt var                 ; set A= encryption or decryption of VAR byte
        JSR RC4
        EOR var
endm

macro zero_high_bit             ; do this after decrypting 7-bit ascii in A
        CLC
        TCH
endm

; In 7-bit ascii, the high-bit of each char is always zero. This is known-plaintext that the attacker can use.
; An easy solution is to randomize the high-bit before encrypting, then zero the high-bit after decrypting.
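
The high-bit trick can be tried out in Python against a reference model of the cipher. This is a sketch under assumptions: it assumes INIT_RC4 and RC4 are meant to implement the standard RC4 key schedule and output step, it uses Python's secrets.randbits as a stand-in for RND_C, and the function names are made up for illustration.

```python
import secrets


def init_rc4(key):
    """Standard RC4 key schedule; returns the cipher state."""
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) & 0xFF
        s[i], s[j] = s[j], s[i]
    return {"s": s, "i": 0, "j": 0}


def rc4_byte(st):
    """Standard RC4 output step; returns one keystream byte."""
    s = st["s"]
    st["i"] = i = (st["i"] + 1) & 0xFF
    st["j"] = j = (st["j"] + s[i]) & 0xFF
    s[i], s[j] = s[j], s[i]
    return s[(s[i] + s[j]) & 0xFF]


def encrypt_ascii(st, text):
    out = bytearray()
    for ch in text:
        a = (ch & 0x7F) | (secrets.randbits(1) << 7)    # randomize_high_bit
        out.append(a ^ rc4_byte(st))                    # crypt
    return bytes(out)


def decrypt_ascii(st, data):
    # crypt, then zero_high_bit
    return bytes((b ^ rc4_byte(st)) & 0x7F for b in data)
```

Encrypting the same 7-bit text twice with the same key will usually give different ciphertext, because the high bits are random, but decrypting either one gives back the original text. The sender and receiver each need their own state from init_rc4 with the same key.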
; If speed is very important though, then don't bother. Even given known-plaintext, RC4 is pretty secure.
; Extended ascii may be needed anyway, to provide Spanish chars, so the known-plaintext problem is a non-issue.


Section 7.) some example applications for the 65ISR-abu and 65ISR-chico

The 65ISR-chico would be useful in micro-controllers that involve a lot of I/O, because it has low interrupt latency.
The advantage over the venerable i8032 is that an FPGA can be modified to support custom I/O.
Also, if there is a speed bottleneck, the FPGA processor can be upgraded with new instructions to fix the problem.
This works especially well when the code can be parallelized internally, rather than done sequentially.

The 65ISR-chico could be used as a coprocessor to do I/O in the background for a larger processor that has a main-program.
This speeds up the main-program because it gets interrupted less often. This isn't necessarily important though.
If all I/O is on the coprocessor then it is only slightly faster (because the 65ISR-chico doesn't save/restore context).
If there are two ISRs that have to be fast, they should be on separate processors so they don't delay each other.

The 65ISR-chico has to be programmed in assembly-language, and it doesn't have subroutines.
The 65ISR-abu provides subroutines and indirect memory-access through pointers, which are needed in high-level language.
The 65ISR-abu also can access 16MB of memory, which is useful for buffering entire files.

Arcade games would benefit from having a lot of memory. They may have multiple large graphics files to work with.
Some machines have multiple games. They could benefit from using binary overlays.
The 65ISR-abu does have the RNC instruction for games, which could boost the speed somewhat.
The Super Nintendo used the 65c816 processor, but the 65ISR-abu is a better design in many ways.
The Super Nintendo is obsolete anyway --- afaik, game machines use 32-bit processors now.

CNC machines would benefit from having a lot of memory. A large file of preprocessed data can be uploaded to the 65ISR-abu.
The idea with preprocessing is to avoid doing calculations while objects are moving --- because the objects have momentum!

The major bottleneck in many programs is the 16x16 multiplication. The 65ISR-abu only has an 8x8 multiply.
The 65ISR-abu does have the ASW and INC instructions that speed up the summation of the partial products. This is pretty fast.
The 65ISR-abu has a 16/8 division that can be useful --- dividing by 10 is needed for converting numbers into decimal strings.

TCP/IP needs a lot of memory. Even a small website with minimal graphics can use a lot of memory.
It is a good idea to have factory machines connected to the internet so the factory owner can monitor them from home.
Also, there may be micro-controllers in remote locations, such as used for gate-access, that need to be monitored.

Smart-phones are insecure because the NSA has backdoors. Fax machines are easier to build and can be made secure.
I can foresee copy centers throughout America offering secure fax-machine usage with a public-key cryptography system.
This should make the NSA snoops miserable --- although not as miserable as they deserve to be.

The eZ80 processor was designed specifically to support TCP/IP --- or any other application that needs a lot of memory.
The eZ80 was described as: "a poor man's ARM" --- the eZ80 became obsolete though --- the ARM went down in price.
At this time, the STM8 and MSP430 are in the market niche that the eZ80 used to occupy.
The 65ISR is arguably a better design than any of these processors --- certainly a different design, anyway.

For the most part, the 65ISR is a hobby project for me. I'm not expecting to take over the micro-controller world.
I think the 65ISR has a higher fun-quotient than the MSP430, which is a warmed-over PDP11 (16 registers instead of 8).
The 65ISR might appeal to programmers who have nostalgia for the 65c02 --- the 65c02 was just a lot cooler than the Z80!