65ISR processor design --- by Hugh Aguilar --- September 2017

Abstract:
The 65ISR is derived from the 6502, but it only supports ISRs. It is an 8-bit processor.
The full version, called the 65ISR-abu, has a W register and can access 16MB of memory.
The small version, called the 65ISR-chico, lacks the W register. This would most likely be used as a coprocessor.
The VIRQ interrupt is the innovative part of the 65ISR design --- nothing like this is found in any other processor.

All variables are in zero-page. There is only indirect access to other memory.
We have a page,Y addressing-mode that is useful for circular buffers and small arrays.
We have a bank,W addressing-mode that is useful for accessing alternate 64KB banks, such as in a RAM-disk.

The 65ISR has 1-bit variables similar to the i8032. These are useful for state-machines, such as in a PLC.


Section 1.) the registers

The 65ISR is a little-endian 8-bit processor. RAM is in the lower part of memory and non-volatile in the upper part.

We have the following registers:
A       8-bit accumulator
Y       8-bit index register
W       16-bit word register            unsupported in the 65ISR-chico
PC      16-bit program counter          12-bit in the 65ISR-chico: 1111,xxxx,xxxx,xxxx
P       4-bit processor status flags

The P register contains these flags:
bit 0   C-flag      this indicates a carry
bit 1   Z-flag      this indicates a zero
bit 2   N-flag      this indicates a negative
bit 3   V-flag      this indicates an overflow

The C-flag is the same as on the 6502. ADC adds C and SBC subtracts ~C.
You can use CLC before ADC to have no carry. We have an ADD instruction though, that does this automatically.
You use SEC before SBC to have no borrow. We have a SUB instruction too, that does this automatically.
Some processors, such as the MC6805, subtract C rather than ~C so you use CLC before SBC to have no borrow.

The Z-flag, N-flag and V-flag are all the same as on the 6502 --- likely the same as all other processors.


Section 2.) the interrupts

IRQ0      execution begins at $FC00.
If more than one interrupt is pending, IRQ0 is the highest priority, etc.
IRQ1      execution begins at $FC40.
IRQ2      execution begins at $FC80.
IRQ3      execution begins at $FCC0.
IRQ4      execution begins at $FD00.
...
IRQ13     execution begins at $FF40.    If more than one interrupt is pending, IRQ13 is the lowest priority.
VIRQ      execution begins at $FF80.    This is done when no interrupt is pending and the V-flag is set to 1.
start-up  execution begins at $FFC0.    This is done on power-up.

When any of the above begin, only the PC is initialized. The registers and flags are all set to zero.
The VIRQ only executes if the V-flag was previously set to 1.

The IRQx interrupts typically end in RTI --- this unmasks the interrupts and sets V-flag to 1 (allowing VIRQ).
The IRQx interrupts can end in WAI --- this unmasks the interrupts and sets V-flag to 0 (disallowing VIRQ).
WAI would typically only be used in programs that don't have a main-program, but are entirely event-driven.

The VIRQ ISR is effectively the main-program. When no interrupts are pending and the V-flag is set to 1, VIRQ executes.
The VIRQ ISR starts out with a JMP through a zero-page vector (the vector should be initialized during start-up).
The main-program is broken up into chunks that are punctuated with PSV instructions.
The PSV instruction stores the address after it to the zero-page vector, then does RTI.
If no IRQx interrupts need servicing, the VIRQ ISR executes again, jumping through that vector.
If any interrupts are pending, they get serviced, and the VIRQ ISR executes when no more IRQx interrupts are pending.
From the perspective of the programmer, PSV has no effect because execution continues as if it were a NOP instruction.
The programmer should be aware however, that no registers are saved through a PSV.
Registers (the flags, A, Y and W) need to be saved manually, or better yet, PSV should be done when they are invalid.

We also have PCV that is like PSV except uses WAI rather than RTI internally.
It is for when there is no main-program.

The chico version doesn't have PSV or PCV so saving the vector has to be done manually using the A register.
The 65ISR-chico will be less efficient than the 65ISR-abu because it lacks the W register, but it can do most everything.
Saving the vector can also be done manually on the 65ISR-abu. For example, AGAIN could set the vector to the BEGIN address.

The 'P' in the PSV and PCV instructions stands for "poll" --- it is effectively an instruction that polls the I/O.

There can be any number of IRQx lines up to 14 maximum --- implement as many as needed for the application.
For code, 1KB is the minimum because we need memory at $FC00 for IRQ0.


Section 3.) the addressing-modes

We have the following addressing-modes:
#byte     8-bit immediate value
#word     16-bit immediate value
Z         zero value
A         8-bit A register value
zadr      8-bit address in zero-page
page,Y    8-bit page value is the high-byte and Y is the low-byte, to form a 16-bit address
bank,W    8-bit bank value is the high-byte and W is the low-word, to form a 24-bit address
flag      8-bit index to one of 256 1-bit variables located at $00..$1F

The Z addressing-mode is equivalent to #0 as an operand, but there is no operand so the instruction is smaller and faster.
Z can be used explicitly --- in general though, just use #byte and let the assembler go with Z when possible.

Typically we have I/O memory-mapped in zero-page, as well as all of our data.

The Y register and the page,Y addressing-mode are mostly used for 256-byte circular buffers located in RAM above zero-page.
In order to buffer 16-bit words rather than 8-bit bytes, use separate pages for low and high bytes.

A file can fit in one 64KB bank and be addressed with the bank,W addressing-mode.
If a file is too big, then put the even bytes in one 64KB bank and the odd bytes in another 64KB bank.
The file will have to contain an even number of bytes. It can be padded with a zero if necessary.
This technique can be extended to use any number of 64KB banks, for very large files.

The (zadr),Y addressing-mode was the hallmark of the 6502. The 65ISR doesn't have that though.
A pointer can be loaded into W though, which would then be used like the (zadr) addressing-mode of the 65c02.
By holding a pointer in W in a loop, there are fewer memory accesses than with the (zadr),Y addressing-mode.
Switching between two pointers inside the loop, such as in a block move, is less efficient though.

At one time I wanted to have an X register that would be used as a data-stack as traditionally done in Forth.
I now think it is possible to write a Forth compiler that simulates a stack, but uses direct addressing internally.
There are already many examples of C and Pascal compilers (from ByteCraft) that use only direct addressing internally.
With this technique you don't get reentrancy, but our ISRs can't be interrupted anyway, so reentrancy is less important.
The compiler has to be smart enough however, to not reuse zero-page memory in subroutines that are in the same call-chain.

The 65ISR-chico lacks subroutines and indirect memory-access through pointers, which are needed in any high-level language.
The 65ISR-chico would generally be programmed in assembly-language, although a BASIC-like language is possible.
The 65ISR-abu should generally be programmed in a high-level language.
Reusing zero-page memory is complicated, but a compiler can do it --- in assembly-language, this can be error-prone.


Section 4.) the instructions

The flags are not affected unless specifically stated (unlike the 6502, our LDA doesn't affect the flags).

RTI             set V-flag to 1 >> unmask the interrupts
WAI             set V-flag to 0 >> unmask the interrupts >> go into a low-power wait mode
PSV zadr        load W with PC+1 >> store W to memory >> do RTI         unsupported in the 65ISR-chico
PCV zadr        load W with PC+1 >> store W to memory >> do WAI         unsupported in the 65ISR-chico

JMP zadr        load PC with value
JMP #word       load PC with value
BRA #byte       add signed value to PC
BEQ #byte       if Z then add signed value to PC
BNE #byte       if ~Z then add signed value to PC
BCS #byte       if C then add signed value to PC
BCC #byte       if ~C then add signed value to PC
BMI #byte       if N then add signed value to PC
BPL #byte       if ~N then add signed value to PC
BVS #byte       if V then add signed value to PC
BVC #byte       if ~V then add signed value to PC
BLT #byte       if N<>V then add signed value to PC
BGT #byte       if N=V and ~Z then add signed value to PC

CLC             set C-flag to 0
SEC             set C-flag to 1
LDC flag        load C-flag from 1-bit variable
STC flag        store C-flag to 1-bit variable
EOC flag        logical exclusive-or C-flag with 1-bit variable
IOC flag        logical inclusive-or C-flag with 1-bit variable
ANC flag        logical and C-flag with 1-bit variable
NTC             logical not C-flag
RNC             clock an LFSR in W, setting C-flag to a pseudo-random value     unsupported in the 65ISR-chico
TCL             transfer C-flag to low bit of A
TCH             transfer C-flag to high bit of A
TLC             transfer low bit of A to C-flag
THC             transfer high bit of A to C-flag
TCV             transfer C-flag to V-flag
TVC             transfer V-flag to C-flag

The instructions that use the C-flag and 1-bit variables are primarily for state-machines, such as used in PLCs.
This area can also be used for I/O, such as control and status ports, in which you need to access only one bit.
See the RND_C macro later in this document for an equivalent in software for the algorithm used in the RNC instruction.
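
As a rough illustration of the 1-bit variables, here is a minimal Python sketch of the flag area and the C-flag instructions, followed by a PLC-style latch. The names FlagArea, latch, START, STOP, MOTOR and TMP are hypothetical. The byte address (flag index divided by 8) follows the RND_C macro's use of seed/8; the bit order within a byte (flag 0 is the high bit) is an assumption inferred from that macro, not something the instruction list states.

```python
class FlagArea:
    """Model of the 256 one-bit variables held in zero-page bytes $00..$1F."""

    def __init__(self):
        self.mem = bytearray(32)    # bytes $00..$1F
        self.c = 0                  # the C-flag

    def _locate(self, flag):
        # Byte address is flag // 8. Bit order is an assumption:
        # flag 0 is taken to be the high ("leftmost") bit of its byte.
        return flag >> 3, 7 - (flag & 7)

    def _read(self, flag):
        addr, bit = self._locate(flag)
        return (self.mem[addr] >> bit) & 1

    def ldc(self, flag):            # LDC: load C-flag from 1-bit variable
        self.c = self._read(flag)

    def stc(self, flag):            # STC: store C-flag to 1-bit variable
        addr, bit = self._locate(flag)
        self.mem[addr] = (self.mem[addr] & ~(1 << bit) & 0xFF) | (self.c << bit)

    def eoc(self, flag):            # EOC: exclusive-or C-flag with variable
        self.c ^= self._read(flag)

    def ioc(self, flag):            # IOC: inclusive-or C-flag with variable
        self.c |= self._read(flag)

    def anc(self, flag):            # ANC: and C-flag with variable
        self.c &= self._read(flag)

    def ntc(self):                  # NTC: not C-flag
        self.c ^= 1


START, STOP, MOTOR, TMP = 0, 1, 2, 3    # hypothetical flag assignments


def latch(f):
    """MOTOR = (START or MOTOR) and not STOP, using only C-flag operations."""
    f.ldc(STOP)
    f.ntc()
    f.stc(TMP)      # TMP = not STOP
    f.ldc(START)
    f.ioc(MOTOR)
    f.anc(TMP)
    f.stc(MOTOR)
```

Each step of the latch corresponds to one instruction from the list above, so the whole rung would be seven instructions with no masking or shifting of bytes.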

EOR zadr        logical exclusive-or value with A, setting N= high bit, V= 2nd high bit, C= low-bit
IOR zadr        logical inclusive-or value with A, setting N= high bit, V= 2nd high bit, C= low-bit
AND zadr        logical and value with A, setting N= high bit, V= 2nd high bit, C= low-bit
NOT A           logical not A, setting N= high bit, V= 2nd high bit, C= low-bit

LDY Z           load Y with zero
LDY #byte       load Y with value
LDY zadr        load Y with value
STY zadr        store Y to memory
TAY             transfer A to Y
TYA             transfer Y to A
CPY #byte       subtract value from Y without modifying Y, but setting Z N flags, and setting C= ~Z
CPY zadr        subtract value from Y without modifying Y, but setting Z N flags, and setting C= ~Z
INY             move Y plus 1 to Y setting C Z N flags
DEY             move Y minus 1 to Y setting C Z N flags
ILY zadr        load Y with value >> move Y + 1 to Y >> move Y to memory        pre-increment of memory value
LIY zadr        load Y with value >> move Y + 1 to memory                       post-increment of memory value
L2Y zadr        load Y with value >> move Y + 2 to memory                       post-increment of memory value
LUP #byte       do DEY >> do BNE                                                (pronounced: "loop")

Note that INY DEY LUP all set the C-flag. On the 6502 the INY DEY instructions don't set the C flag.
This is mostly useful for DEY to indicate when we crossed the boundary.
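
The difference between ILY, LIY and L2Y is easy to get wrong, so here is a small Python sketch of the three, following the descriptions in the list above. The function names and the bytearray standing in for zero-page are hypothetical; wrapping at 8 bits is assumed because Y and the memory value are both 8-bit, and flag effects are not modelled.

```python
def ily(mem, zadr):
    """ILY: pre-increment. Memory is incremented and Y gets the new value."""
    y = (mem[zadr] + 1) & 0xFF
    mem[zadr] = y
    return y


def liy(mem, zadr):
    """LIY: post-increment. Y gets the old value, memory gets old + 1."""
    y = mem[zadr]
    mem[zadr] = (y + 1) & 0xFF
    return y


def l2y(mem, zadr):
    """L2Y: post-increment by 2. Y gets the old value, memory gets old + 2."""
    y = mem[zadr]
    mem[zadr] = (y + 2) & 0xFF
    return y


zero_page = bytearray(256)
zero_page[0x10] = 0xFF
first = liy(zero_page, 0x10)    # Y = $FF, memory wraps to $00
second = ily(zero_page, 0x10)   # memory becomes $01, Y = $01
third = l2y(zero_page, 0x10)    # Y = $01, memory becomes $03
```

LIY and L2Y suit buffer indices, where the old index addresses the data and the stored index already points at the next element.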

LDW #word       load W with 16-bit value                                                unsupported in the 65ISR-chico
LDW zadr        load W with 16-bit value, latching the word if it is an I/O port        unsupported in the 65ISR-chico
LDW page,Y      load W with 16-bit value                                                unsupported in the 65ISR-chico
STW zadr        store W to memory, latching the word if it is an I/O port               unsupported in the 65ISR-chico
STW page,Y      store W to memory                                                       unsupported in the 65ISR-chico
ASW zadr        move W plus 16-bit value to W setting C Z N V flags >> do STW           unsupported in the 65ISR-chico
ADW #byte       add signed value to W setting C Z N V flags                             unsupported in the 65ISR-chico
MUL             unsigned multiply A times Y, set W= product                             unsupported in the 65ISR-chico
DIV             unsigned divide W by Y, set W= quotient, A= remainder                   unsupported in the 65ISR-chico

JSR #word       move PC+2 to W >> do JMP                unsupported in the 65ISR-chico
RTS             move W to PC                            unsupported in the 65ISR-chico
JS0             move PC to W >> move $100 to PC         unsupported in the 65ISR-chico
JS2             move PC to W >> move $120 to PC         unsupported in the 65ISR-chico
JS4             move PC to W >> move $140 to PC         unsupported in the 65ISR-chico
JS6             move PC to W >> move $160 to PC         unsupported in the 65ISR-chico
JS8             move PC to W >> move $180 to PC         unsupported in the 65ISR-chico
JSA             move PC to W >> move $1A0 to PC         unsupported in the 65ISR-chico
JSC             move PC to W >> move $1C0 to PC         unsupported in the 65ISR-chico
JSE             move PC to W >> move $1E0 to PC         unsupported in the 65ISR-chico

Eight common subroutine calls can be made fast and small. Each subroutine can be up to 32 bytes long, which is quite a lot.
These are in RAM. They should generally be initialized by the start-up code. They could be used for jitting though.

LDA Z           load A with zero
LDA #byte       load A with value
LDA zadr        load A with value
LDA page,Y      load A with value
LDA bank,W      load A with value               unsupported in the 65ISR-chico
STA zadr        store A to memory
STA page,Y      store A to memory
STA bank,W      store A to memory               unsupported in the 65ISR-chico
EXA zadr        exchange A with memory
EXA page,Y      exchange A with memory
EXA bank,W      exchange A with memory          unsupported in the 65ISR-chico

ADD #byte       move A plus value to A setting C Z N V flags
ADD zadr        move A plus value to A setting C Z N V flags
ASA zadr        do ADD >> do STA
ADC #byte       move A plus value plus C to A setting C Z N V flags
ADC zadr        move A plus value plus C to A setting C Z N V flags
SUB #byte       move A minus value to A setting C Z N V flags
SUB zadr        move A minus value to A setting C Z N V flags
SBC #byte       move A minus value minus ~C to A setting C Z N V flags
SBC zadr        move A minus value minus ~C to A setting C Z N V flags
NEG A           negate A setting Z N flags, and setting C= ~Z
NEG zadr        negate memory value setting Z N flags, and setting C= ~Z
INC zadr        add C-flag to 8-bit memory value setting C Z N flags
DEC zadr        subtract ~C-flag from 8-bit memory value setting C Z N flags

Note that INC and DEC only increment or decrement depending upon what the C-flag is, unlike the 6502 that always does it.
The ASW and INC instructions are intended to boost the speed of adding partial products in a 16x16 multiply.

ROR A           shift A right, moving C-flag into high-bit, moving low-bit into C-flag, setting N Z flags
ROR zadr        shift memory value right, moving C-flag into high-bit, moving low-bit into C-flag, setting N Z flags
ROL A           shift A left, moving C-flag into low-bit, moving high-bit into C-flag, setting N Z flags
ROL zadr        shift memory value left, moving C-flag into low-bit, moving high-bit into C-flag, setting N Z flags


Section 5.) some sample code

; In the GET and PUT macros, PAGE is the buffer, SRC is the available data and DST is the free area.

macro GET page src dst err      ; load A from buffer --- jump to ERR if there is no data
    LDY src
    CPY dst
    BEQ err
    LDA page,Y
    INY
    STY src
endm

macro PUT page src dst err      ; store A into buffer --- jump to ERR if there is no room
    LDY dst
    CPY src
    BEQ err
    STA page,Y
    INY
    STY dst
endm

; This version of GET and PUT has the disadvantage of being lengthy.
; The following version of GET and PUT are each one instruction shorter:

macro GET page src dst err      ; load A from buffer --- jump to ERR if there is no data
    LDY src
    CPY dst
    BEQ err
    LDA page,Y
    INC src                     ; C-flag was set to 1 by CPY
endm

macro PUT page src dst err      ; store A into buffer --- jump to ERR if there is no room
    LDY dst
    CPY src
    BEQ err
    STA page,Y
    INC dst                     ; C-flag was set to 1 by CPY
endm

; This version of GET and PUT also has the disadvantage of being lengthy, and INC does an extra memory access.
; The following version of GET and PUT are each one instruction shorter, and we have one less memory access:

macro GET page src dst err      ; load A from buffer --- jump to ERR if there is no data
    LIY src                     ; Y= value prior to increment in memory
    CPY dst
    BEQ err                     ; if this branch is taken, then the increment should not have been done
    LDA page,Y
endm

macro PUT page src dst err      ; store A into buffer --- jump to ERR if there is no room
    LIY dst                     ; Y= value prior to increment in memory
    CPY src
    BEQ err                     ; if this branch is taken, then the increment should not have been done
    STA page,Y
endm

; In this version, if we BEQ to the ERR code, the increment needs to be undone because it stepped over the other index.
; Branching to the ERR code should (hopefully) be pretty rare, so undoing the bad increment doesn't hurt efficiency at all.

; GET and PUT need to be fast. A lot of ISRs only do some I/O and access a buffer, and that is it.
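
For readers who want to experiment with the buffer logic off-target, here is a minimal Python sketch of a one-page circular buffer with 8-bit indices. It is not a translation of the macros: the class and method names are hypothetical, the empty test (src equals dst) matches GET, and the full test (dst plus 1 equals src, keeping one slot free) is an assumption on my part, chosen so that a buffer with src = dst is treated as empty and still accepts data.

```python
class RingBuffer:
    """256-byte circular buffer addressed like page,Y with 8-bit indices."""

    def __init__(self):
        self.page = bytearray(256)  # one page of RAM above zero-page
        self.src = 0                # index of the next byte to read
        self.dst = 0                # index of the next free slot

    def get(self):
        """Return the next byte, or None if there is no data."""
        y = self.src
        if y == self.dst:           # CPY dst / BEQ err
            return None
        a = self.page[y]            # LDA page,Y
        self.src = (y + 1) & 0xFF   # the index wraps like an 8-bit register
        return a

    def put(self, a):
        """Store one byte; return False if there is no room."""
        y = self.dst
        nxt = (y + 1) & 0xFF
        if nxt == self.src:         # assumed full test: one slot stays empty
            return False
        self.page[y] = a & 0xFF     # STA page,Y
        self.dst = nxt
        return True
```

With this convention the buffer holds at most 255 bytes. Because the indices are 8-bit and the buffer is exactly one page, wrapping needs no compare or mask on the real processor, which is why the macros can be so short.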

; The following version is for 16-bit data:

macro GET page src dst err      ; load W from buffer --- jump to ERR if there is no data
    L2Y src                     ; Y= value prior to increment in memory
    CPY dst
    BEQ err                     ; if this branch is taken, then the increment should not have been done
    LDW page,Y
endm

macro PUT page src dst err      ; store W into buffer --- jump to ERR if there is no room
    L2Y dst                     ; Y= value prior to increment in memory
    CPY src
    BEQ err                     ; if this branch is taken, then the increment should not have been done
    STW page,Y
endm

; This is essentially the same thing except that L2Y is used rather than LIY, and the data is in W rather than A.

; Note that WAI does not save the registers.
; If GET or PUT can't be done and we need to wait, when our ISR restarts it will be at the beginning again.
; If an ISR is going to recover from a WAI where it left off, it must manually store the context in variables.

macro DNEGATE adr               ; ADR is of a 16-bit value in zero-page
    NEG adr+1
    NEG adr
    DEC adr+1
endm

macro ASHR                      ; arithmetic shift right A
    THC
    ROR A
endm

macro SHR                       ; logical shift right A
    CLC
    ROR A
endm

macro SHL                       ; logical shift left A
    CLC
    ROL A
endm

; ASHR SHR and SHL could each be made into an instruction to be slightly faster, but there wouldn't be much of a speed boost.
; DNEGATE could be an instruction too, but this is unlikely to get enough use to justify it.
; There are lots of code segments that could be made into instructions. With an FPGA, that option is always available. :-)

VIRQ_VECTOR:    dw 1            ; vector to next code chunk of VIRQ ISR

    org $FF80                   ; this is the start of the VIRQ ISR
VIRQ:
    LDW virq_vector             ; W= code chunk
    RTS                         ; the RTS isn't returning from a subroutine --- it is jumping through a vector in W

; The VIRQ ISR is the main-program. At start, it has to jump through a vector to where it left off previously.
; This vector is typically set by the PSV instruction. Note that registers aren't saved and restored automatically.
; The code chunks punctuated by PSV should be pretty short so we don't block IRQx interrupts for very long.

macro RND_C seed                ; SEED is the leftmost flag of the 16-bit seed
                                ; this works on the 65ISR but not on the 65ISR-chico
    LDW seed/8
    RNC
    STW seed/8
endm

macro RND_C seed                ; SEED is the leftmost flag of the 16-bit seed
                                ; this works on the 65ISR and also on the 65ISR-chico
    LDC seed+15
    EOC seed+4
    EOC seed+2
    EOC seed+1                  ; C-flag is the random bit
    ROR seed/8                  ; shift left byte
    ROR seed/8+1                ; shift right byte
    LDC seed+0                  ; this was C-flag before it got shifted into the leftmost bit
endm

; The two equivalent versions of RND_C are shown here to illustrate how the RNC instruction works.

seed:   dw 1                    ; RND_A: address of a 16-bit LFSR seed
ra:     dw 1                    ; RND_A: return-address

RND_A:                          ; needs Y = how many bits (1 minimum and 8 maximum); sets A to a random value
    STW ra
    LDA #0                      ; this is the initial value of the byte we are generating
    LDW seed
L1: RNC
    ROL A
    LUP L1
    STW seed
    LDW ra
    RTS

; RND_A should be random enough for games. There is no memory access in the loop, so it is fast.
; Note how RND_A can't hold the return-address in W because W is used internally.
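
The LFSR step can be checked off-target with a short Python sketch that follows the software version of RND_C. The function names are hypothetical. The state is treated as a 16-bit number whose most significant bit is the leftmost flag, so flag seed+k is bit (15 - k); how that number maps onto the bytes loaded by LDW is not modelled here and would need checking against the real RNC hardware.

```python
def rnc(w):
    """One LFSR step on a 16-bit state; returns (new_state, c_flag)."""
    # Taps are flags seed+15, seed+4, seed+2, seed+1 counted from the left,
    # which are bits 0, 11, 13 and 14 when the leftmost flag is bit 15.
    bit = ((w >> 0) ^ (w >> 11) ^ (w >> 13) ^ (w >> 14)) & 1
    # The two RORs shift the whole word right; the new bit enters on the left.
    w = (w >> 1) | (bit << 15)
    return w, bit


def rnd_a(w, nbits):
    """Mirror of RND_A: collect nbits (1..8) pseudo-random bits into A."""
    a = 0
    for _ in range(nbits):
        w, c = rnc(w)
        a = ((a << 1) | c) & 0xFF   # ROL A: C-flag moves into the low bit
    return w, a


state, value = rnd_a(0x0001, 3)     # three steps from a seed of 1
```

Because the bit shifted out on the right is one of the taps, each step can be reversed, so a nonzero seed never decays to zero. A seed of zero stays at zero, so the start-up code should store something nonzero.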

I:      db 1                    ; RC4: index initialized to $00
J:      db 1                    ; RC4: index initialized to $00
S       equ $1                  ; RC4: S array page initialized to contain the numbers [0,255] jumbled
K       equ $2                  ; INIT_RC4 local: key string
KL:     db 1                    ; INIT_RC4 local: key length

macro EXIJ                      ; exchange bytes at S(I) and S(J); needs Y=I; leaves Y=I and A=S(I)
    LDA S,Y
    LDY J
    EXA S,Y
    LDY I
    STA S,Y
endm

INIT_RC4:                       ; needs the K array and KL; initializes the S array; initializes I and J to zero
    LDY #0
L1: TYA
    STA S,Y                     ; S(Y)= Y
    INY
    BNE L1                      ; fill the S array, leave Y = 0
    STY I                       ; I= 0          index into S
    STY J                       ; J= 0          index into K
L2:                             ; begin
    LDY J
    LDA K,Y                     ; A= K(J)
    INY
    CPY KL
    BNE L3
    LDY #0
L3: STY J                       ; J= (J + 1) mod KL
    LDY I
    ADD S,Y                     ; A= K(J) + S(I)
    ASA J                       ; J= K(J) + S(I) + J            A= J
    EXIJ                        ; swap S(I) and S(J)            Y= I    A= S(I)
    INY
    STY I                       ; I= I + 1
    BNE L2                      ; loop until I=0
    STY J                       ; J= 0
    RTS

RC4:                            ; set A to output byte, using the S array and the I J indices provided by INIT_RC4
    ILY I                       ; I= I+1        Y= I
    LDA S,Y
    ASA J                       ; J= J + S(I)   A= J
    EXIJ                        ; swap S(I) and S(J)            Y= I    A= S(I)
    LDY J
    ADD S,Y                     ; A= S(I) + S(J)
    TAY
    LDA S,Y                     ; A= S(A)
    RTS

; RC4 is very fast, but requires a page of RAM for the S array. You get speed at the cost of increased memory usage.
; RC4 also needs a page to hold the key, although this can be used for something else after INIT_RC4 completes.
; Memory conservation is only an issue in the 65ISR-chico that may have only 1KB or 2KB of RAM.
; The 65ISR-chico could be used in a smart-card, in which case AES would likely be required.

macro randomize_high_bit        ; do this before encrypting 7-bit ascii in A
    RND_C seed
    TCH
endm

macro crypt var                 ; set A= encryption or decryption of VAR byte
    JSR RC4
    EOR var
endm

macro zero_high_bit             ; do this after decrypting 7-bit ascii in A
    CLC
    TCH
endm

; In 7-bit ascii, the high-bit of each char is always zero. This is known-plaintext that the attacker can use.
; An easy solution is to randomize the high-bit before encrypting, then zero the high-bit after decrypting.
; If speed is very important though, then don't bother. Even given known-plaintext, RC4 is pretty secure.
; Extended ascii may be needed anyway, to provide Spanish chars, so the known-plaintext problem is a non-issue.


Section 6.) some example applications for the 65ISR-abu and 65ISR-chico

The 65ISR-chico would be useful in micro-controllers that involve a lot of I/O, because it has low interrupt latency.
The advantage over the venerable i8032 is that an FPGA can be modified to support custom I/O.
Also, if there is a speed bottleneck, the FPGA processor can be upgraded with new instructions to fix the problem.
This works especially well when the code can be parallelized internally, rather than done sequentially.

The 65ISR-chico could be used as a coprocessor to do I/O in the background for a larger processor that has a main-program.
This speeds up the main-program because it gets interrupted less often. This isn't necessarily important though.
If all I/O is on the coprocessor then it is only slightly faster (because the 65ISR-chico doesn't save/restore context).
If there are two ISRs that have to be fast, they should be on separate processors so they don't delay each other.

The 65ISR-chico has to be programmed in assembly-language, and it doesn't have subroutines.
The 65ISR-abu provides subroutines and indirect memory-access through pointers, which are needed in high-level language.
The 65ISR-abu also can access 16MB of memory, which is useful for buffering entire files.

Arcade games would benefit from having a lot of memory. They may have multiple large graphics files to work with.
Some machines have multiple games. They could benefit from using binary overlays.
The 65ISR-abu does have the RNC instruction for games, which could boost the speed somewhat.
The Super Nintendo used the 65c816 processor, but the 65ISR-abu is a better design in many ways.
The Super Nintendo is obsolete anyway --- afaik, game machines use 32-bit processors now.

CNC machines would benefit from having a lot of memory. A large file of preprocessed data can be uploaded to the 65ISR-abu.
The idea with preprocessing is to avoid doing calculations while objects are moving --- because the objects have momentum!

The major bottleneck in many programs is the 16x16 multiplication. The 65ISR-abu only has an 8x8 multiply.
The 65ISR-abu does have the ASW and INC instructions that speed up the summation of the partial products. This is pretty fast.
The 65ISR-abu has a 16/8 division that can be useful --- dividing by 10 is needed for converting numbers into decimal strings.

TCP/IP needs a lot of memory. Even a small website with minimal graphics can use a lot of memory.
It is a good idea to have factory machines connected to the internet so the factory owner can monitor them from home.
Also, there may be micro-controllers in remote locations, such as used for gate-access, that need to be monitored.

Smart-phones are insecure because the NSA has backdoors. Fax machines are easier to build and can be made secure.
I can foresee copy centers throughout America offering secure fax-machine usage with a public-key cryptography system.
This should make the NSA snoops miserable --- although not as miserable as they deserve to be.

The eZ80 processor was designed specifically to support TCP/IP --- or any other application that needs a lot of memory.
The eZ80 was described as: "a poor man's ARM" --- the eZ80 became obsolete though --- the ARM went down in price.
At this time, the STM8 and MSP430 are in the market niche that the eZ80 used to occupy.
The 65ISR is arguably a better design than any of these processors --- certainly a different design, anyway.

For the most part, the 65ISR is a hobby project for me. I'm not expecting to take over the micro-controller world.
I think the 65ISR has a higher fun-quotient than the MSP430, which is a warmed-over PDP11 (16 registers instead of 8).
The 65ISR might appeal to programmers who have nostalgia for the 65c02 --- the 65c02 was just a lot cooler than the Z80!