 Introducing the 65m32 

65m32: Stupid or neat?
Stupid 0% [ 0 ]
Neat 67% [ 2 ]
Undecided 0% [ 0 ]
65m32? 33% [ 1 ]
Total votes : 3


Joined: Tue Dec 31, 2013 2:01 am
Posts: 76
Hello all. Can I ask for a bit of your time? Via Garth's encouragement, I would like to share a project of mine that is still incomplete, but has been in the making for decades (mostly in my head). By sharing my journey with you all and asking for your input, I am hopeful that I will be able to find the inspiration to allow me to get this thing fully specified, simulated, and implemented on an FPGA. An assembler, BIOS, and eForth-based operating system are to be developed as soon as the simulator compiles and runs successfully on a PC host. Garth W., Jeff L., Ed S. and (especially) Dieter M. have been invaluable in coaxing me toward these goals, and I couldn't have come this far without their generosity. However, I am stretching my spare time and expertise to their limits, and have decided that the best chance to reach the finish line will be to start a thread here and seek more help. I am a bit stubborn, and I don't want to lose control of the project, but I need to light a fire under my butt or it will continue to languish in spec doc purgatory.

Here are a few search links to provide some background information; the oldest posts from 2013 are a bit outdated, but the general idea of what I'm trying to do should still be apparent: ... mit=Search!searc ... Csort:date


In a nutshell, the 65m32 is a 32-bit, word-addressed microprocessor, inspired by the 6502, 6809, pdp-11, and Nova. Its assembly language looks mostly like a cross between 6502 and 6809 assembly, but with a few twists, described below.


The 65m32 is a 32-bit microprocessor. The address bus, data bus, internal registers, ALU, and (with few exceptions) the instructions themselves are all 32 bits wide. I/O is memory-mapped. Each 32-bit instruction contains several bit fields which, altogether, define a flexible range of possibilities. Using a subset of these possibilities, the 65m32 can easily approximate the behaviors of the long-established but less powerful 8-bit machines from the 65xx and 68xx families. Although this property can facilitate comprehension and source translation from one to the other, compatibility is not 100%, as will be explained later. The focus of this document is on the new 32-bit machine. Points relating to the 65xx and 68xx families are subordinate, and will be mentioned as they arise.

The first implementation of the 65m32 has a simplified programming model, in which a user program has full access rights to all 16 of the 32-bit registers, and full access to all 4 GW of physical memory. Future upgrades may include a supervisor/user mode mechanism, and allow a more complex framework for multiprocessing, virtual memory, and inter-process messaging and protection, through additional processor status bits, additional processor registers, and memory management hardware. These upgrades will be designed to provide this extra functionality without requiring any major modifications to well-behaved software written for the original model.

There are ten basic 65m32 registers, and six other advanced registers, to be described later.

Register a is the system main accumulator, and is the 32-bit equivalent of the 6502's 8-bit register of the same name. If any non-trivial arithmetic and/or logical operations are to be performed, register a is the appropriate conduit for these operations. Logical instructions that modify a affect ^NZ , and arithmetic instructions that modify a affect the c register and ^NVZC , in a manner quite similar to the 65xx family. The BCD arithmetic mode is selected by setting ^D , and affects operations involving a , but in a more complete manner than the 65xx family.

Registers x and y are functionally equivalent, and are for general purpose storage and pointer arithmetic. Loads to x and y affect ^NZ .

Register z is reset to zero at the beginning of every instruction fetch, and provides the useful constant 0 (and, in certain cases, +/- 1 ). A load to z affects ^NZ , making it the equivalent of the 68xx's tst instruction.

Registers b and u are functionally equivalent, and are for general purpose storage and pointer arithmetic, just like x and y . The only difference is that loads or adds to b and u do not affect any condition flags, making them more useful as effective-address and stack-frame pointers than x or y .

Register s is the system stack pointer, and its use is implied in any operation that includes a push or a pull, unless specifically stated otherwise. A push operation pre-decrements s before storing, and a pull operation post-increments s after loading. Loads or adds to register s do not affect any condition flags.

Register n is the system instruction pointer, and it is automatically incremented after every instruction fetch. Loads or adds to register n do not affect any condition flags.

Register p is special-purpose, and contains processor status information, including, but not limited to, the information contained in its 6502 counterpart of the same name. All load, logic and arithmetic instructions that alter the contents of registers a , x , y and z cause two or more of the condition flag bits to be updated automatically, in a manner quite similar to the 65xx family. In order to prevent the condition flags from being changed by effective-address calculations, stack operations and branches, all load and add instructions that alter the contents of registers b , u , s , and n have no condition flag effects. Bits 4 and 5 are currently unused by the ALU, but are modifiable by the programmer, who may use them however he or she wishes. Bit 2 and bits 8 to 31 are reserved for future expansion and capabilities … although they may be viewed and modified in the initial version of the hardware, there is a high likelihood that such behavior will cause problems in future versions.

Register c is a full-width 'carry' register, and is used to hold the high-order bits of an arithmetic result which cannot fit in register a . It is potentially a source and a target for the rol , ror , rot , inc , dec , adc , sbc , cdc , mul , and div instructions, and a target for the ash , lsh , asl , lsr , cmp , add , sub and cdd instructions, but only when register a is either specified or implied. The ^C flag in p is similar to the register c , but is only a single bit, and is used when memory cells or registers other than a are modified by certain instructions. ^C may be updated without affecting register c , but every load or update to register c will cause ^C to be updated to reflect the zero or nonzero result in c .
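
As a rough illustration of the relationship between a , c and ^C , here is a small Python model of a plain add into a . The function name add_to_a and the exact rule (bits above bit 31 land in c , and ^C mirrors whether c is nonzero) are assumptions read from the paragraph above, not a statement of the final hardware behavior.

```python
MASK = 0xFFFFFFFF  # one 32-bit word

def add_to_a(a, operand):
    """Model `add` into register a (assumed semantics, see lead-in).

    Returns (new_a, new_c, flag_c): the low 32 bits stay in a, anything
    that does not fit goes to c, and ^C reflects zero/nonzero c.
    """
    total = (a & MASK) + (operand & MASK)
    new_a = total & MASK
    new_c = total >> 32          # high-order bits that cannot fit in a
    flag_c = new_c != 0          # ^C follows every update of c
    return new_a, new_c, flag_c
```

For example, adding 1 to $ffffffff under this model leaves 0 in a , 1 in c , and ^C set.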

The 32-bit operation word is divided into several fields:

The 7-bit op field specifies the type of operation that will be performed, like load, add, store, shift, compare, etc. Here is the 65m32 op-code matrix. Many of the mnemonics will be instantly recognizable to 65xx and 68xx programmers. Several will look a bit foreign, but a complete understanding can be delayed until after the operand structure has been introduced.

0x: ill ??? stq str stk stm stt stw
1x: ??? ??? ??? ??? ??? ??? wai rti
2x: ??? ??? ldq ldr ldk ldm ldt ldw
3x: ??? ??? ??? ??? ??? ??? ??? ???
4x: bic brk rot dbn ??? stp ??? stc
5x: ora adc lsh sbc orp fad orc fsb
6x: and cdc cdd sub anp ldp anc ldc
7x: eor mul ash div eop fml eoc fdv
8x: sta stx sty stz stb stu sts stn
9x: sla slx sly trb slb slu sls sln
Ax: lda ldx ldy tst ldb ldu lds ldn
Bx: pda pdx pdy pul pdb pdu psh pdn
Cx: cmp cpx cpy bit cpb cpu cps cpn
Dx: exa exx exy tsb exb exu exs exn
Ex: add adx ady byt adb adu ads adn
Fx: asl asr lsl lsr rol ror dec inc

Operations shown in the above matrix are preliminary, and may be moved, modified, or deleted without prior notice. The operation fields marked ??? are reserved for future expansion. The top four rows of the matrix may become privileged-mode operations in future revisions of the 65m32, but are executable by any program for now.

The 3-bit operand register field specifies which register will participate in the operand calculation. Any one of the basic registers, including a and n , can be chosen (except c and p ).

The 16-bit embedded numeric field specifies the numeric portion of the operand. In most (but not all) of the basic instructions, the numeric portion is added to the value contained in the operand register to form the actual operand (the operand register is not modified by this particular action). The embedded numeric field is only 16 bits wide, but is promoted to 32 bits (by duplicating bit 15 into bits 16 .. 31) before being fed to the operand adder.
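
The promotion and the operand adder can be sketched in a few lines of Python. The names sign_extend16 and operand_adder are made up for this illustration; only the rule itself (copy bit 15 into bits 16 .. 31, then add to the operand register) comes from the text above.

```python
MASK = 0xFFFFFFFF  # one 32-bit word

def sign_extend16(field):
    """Promote a 16-bit embedded numeric to 32 bits by copying bit 15 upward."""
    field &= 0xFFFF
    if field & 0x8000:
        field |= 0xFFFF0000
    return field

def operand_adder(reg_value, field):
    """Add the promoted numeric to the operand register value (register unchanged)."""
    return (reg_value + sign_extend16(field)) & MASK
```

With this, a field of $f000 becomes $fffff000 (that is, -4096), and adding a field of $ffff to a register holding 5 gives 4.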

The 2-bit operand mode field specifies the mode in which the operand will be used by the operation. There is a literal mode, in which the operand is used at 'face value' (immediate mode in 6502-speak), and there are three direct modes, in which the value of the operand is used as an effective address into memory (direct-page and absolute modes in 6502-speak). All basic 65m32 operations use an operand, so the rrr , mm and iiiiiiiiiiiiiiii fields in the machine instruction relate to the operand field in the assembly language as shown:

0: #i,a #i,x #i,y #i,z #i,b #i,u #i,s #i,n
1:  i,a  i,x  i,y  i,z  i,b  i,u  i,s  i,n
2:  i,a+ i,x+ i,y+ i,z+ i,b+ i,u+ i,s+ i,n+
3:  i,-a i,-x i,-y i,-z i,-b i,-u i,-s i,-n

i indicates the position of the signed 16-bit integer embedded numeric, which may be a constant or a label. In the assembly language, a missing embedded numeric defaults to the value 0 , and a missing operand register defaults to the register ,z . Operand registers are always directly preceded by a comma, to distinguish them from labels, so lda #x and lda #,x are not equivalent and will assemble as different machine instructions ... the first would be assembled as lda #x,z (with x assumed to be a label) and the second would be assembled as lda #0,x .
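
To make the defaulting rules concrete, here is a hedged Python sketch of how an assembler might split an operand string into mode, numeric text and register. The function parse_operand and its return format are hypothetical, and it does no label lookup or range checking; it only applies the rules stated above (missing numeric is 0 , missing register is z , registers always follow a comma).

```python
REGISTERS = "axyzbusn"   # operand registers in field order 0..7

def parse_operand(text):
    """Split an operand like '#-1,y' or ',s+' into (mode, numeric, register).

    mode: 0 literal, 1 direct, 2 post-increment direct, 3 pre-decrement direct.
    numeric is returned as text because it may be a constant or a label.
    """
    text = text.replace(" ", "")
    literal = text.startswith("#")
    if literal:
        text = text[1:]
    numeric, _, reg = text.partition(",")
    numeric = numeric or "0"      # missing numeric defaults to 0
    reg = reg or "z"              # missing register defaults to z
    if literal:
        mode = 0
    elif reg.endswith("+"):
        mode, reg = 2, reg[:-1]
    elif reg.startswith("-"):
        mode, reg = 3, reg[1:]
    else:
        mode = 1
    if len(reg) != 1 or reg not in REGISTERS:
        raise ValueError("unknown operand register: " + reg)
    return mode, numeric, reg
```

Under this sketch, '#x' parses as (0, 'x', 'z') while '#,x' parses as (0, '0', 'x'), which is the distinction described above.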

# indicates the literal mode, and is conceptually a superset of the 65xx's immediate mode. This mode has some unique properties, discussed in the next section. Please note the difference in nomenclature here. Literal mode on the 65m32 is equivalent to immediate mode on the 65xx and 68xx devices.

+ indicates the post-increment direct mode. The value in the operand register is fed to the operand adder, then the register is incremented by 1 before the execution of the machine instruction.

- indicates the pre-decrement direct mode. The value in the operand register is decremented by 1 before being fed to the operand adder, and before the execution of the machine instruction.

If # , + , and - are missing, the simple direct mode is the default. In this mode, the operand register is not modified as part of the operand calculation.

On the 65m32, the three direct modes cause the machine operation to use the result from the operand adder as an effective address to memory, making them conceptually a superset of the 65xx's abs abs,x abs,y dp dp,x and dp,y addressing modes. The 65m32 has no built-in facility for the 65xx's (d) (abs) (d,x) (abs,x) (d),y and (d,s),y addressing modes (which specify an address of an address), but we will see later how these modes can be efficiently synthesized when needed, with far more flexibility and generality than their 65xx counterparts.

Please note the difference in nomenclature here as well. Direct mode on the 65m32 is equivalent to absolute or direct-page modes (indexed or non-indexed) on the 65xx and 68xx devices.
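
The four operand modes can be modeled for a read access with a short Python function. This is a sketch only: resolve_operand , the dict-based register file and the dict-based memory are stand-ins invented for the example, the numeric is assumed to be already sign-extended, and the special behavior of z is not modeled.

```python
MASK = 0xFFFFFFFF
LITERAL, DIRECT, POST_INC, PRE_DEC = 0, 1, 2, 3

def resolve_operand(regs, memory, mode, reg, numeric):
    """Return the operand value for a read, updating regs[reg] if the mode says so.

    regs:    dict of register name -> 32-bit value
    memory:  dict of word address -> 32-bit word (missing cells read as 0)
    numeric: the embedded numeric as a signed Python int
    """
    if mode == PRE_DEC:
        regs[reg] = (regs[reg] - 1) & MASK      # decrement before the adder
    adder = (regs[reg] + numeric) & MASK        # operand adder result
    if mode == POST_INC:
        regs[reg] = (regs[reg] + 1) & MASK      # increment after feeding the adder
    if mode == LITERAL:
        return adder                            # used at face value
    return memory.get(adder, 0)                 # used as an effective address
```

In literal mode the adder result itself is the operand, so #-1,y with y holding 5 yields 4 without touching y ; in the direct modes the same adder result selects a memory word.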

The 4-bit conditional field specifies the conditions by which the instruction will be executed or skipped, based on various different combinations of the status bits, like overflow, negative result, greater than or equal, etc. In the assembly language, a missing condition field is assumed to be always.

0: [ra]      (always)
1: [rn]      (never)
2: [hi]      Z ∨ /C == 0
3: [ls]      Z ∨ /C == 1
4: [cc] [lo] C == 0
5: [cs] [hs] C == 1
6: [ne]      Z == 0
7: [eq]      Z == 1
8: [vc]      V == 0
9: [vs]      V == 1
A: [pl]      N == 0
B: [mi]      N == 1
C: [ge]      N ⊕ V == 0
D: [lt]      N ⊕ V == 1
E: [gt]      Z ∨ (N ⊕ V) == 0
F: [le]      Z ∨ (N ⊕ V) == 1

If the condition is true, the instruction is executed in its entirety. If the condition is false, the instruction is skipped in its entirety. In all skipped instructions, the operation and any applicable memory accesses and index register side-effects are all canceled. Notice that the 65m32 allows the conditional execution of any basic instruction, not just those related to branching, which allows the elimination of many “hop” branches around short sections of conditionally-executed code, as long as any condition flag side-effects of the conditionally-executed instructions are consistent with that intent.
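
The condition table above translates directly into a lookup. The Python function below is an illustrative sketch (the name condition_met and the flag arguments are mine); each entry is the table row with the same number.

```python
def condition_met(cccc, n, v, z, c):
    """Evaluate the 4-bit condition field against the N, V, Z and C flags."""
    n, v, z, c = bool(n), bool(v), bool(z), bool(c)
    table = [
        True,                      # 0 [ra] always
        False,                     # 1 [rn] never
        not (z or not c),          # 2 [hi]  Z v /C == 0
        z or not c,                # 3 [ls]  Z v /C == 1
        not c,                     # 4 [cc] [lo]
        c,                         # 5 [cs] [hs]
        not z,                     # 6 [ne]
        z,                         # 7 [eq]
        not v,                     # 8 [vc]
        v,                         # 9 [vs]
        not n,                     # A [pl]
        n,                         # B [mi]
        n == v,                    # C [ge]  N xor V == 0
        n != v,                    # D [lt]  N xor V == 1
        not (z or (n != v)),       # E [gt]
        z or (n != v),             # F [le]
    ]
    return table[cccc & 0xF]
```

Note that each odd-numbered condition is simply the complement of the even-numbered one before it.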

So, how does all of this fit together to make the 65m32 a joy to program in assembly (at least for me)? I don't want to tease, but I also don't want to make this post too long, so I'll save some assembly code snippets (including a full explanation of the unique properties of literal mode) for my next post. In the meantime, I am open to questions, suggestions, and critiques.

Thank you all for your time.

Mike B.

Thu Aug 04, 2016 6:24 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 565
Thanks for posting your write-up Mike - looking forward to the next instalment.
I'm interested to see how you deal with indirect addressing, as that was historically a very important innovation.
Also interested to see how Z not being zero can help!

Thu Aug 04, 2016 9:43 am

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 76
Location: Huntsville, AL

Thanks for posting your write-up. I've been following your processor concept from the snippets of its details that you've revealed in some of your posts, and I have a better appreciation of its architecture from your post above.

Since you've specifically asked for comments, I'll provide one or two to get the conversation started. It appears to me that you have two competing concepts rolled into your processor: (1) speed; and (2) ease of programming. For speed, you've adopted a word-addressable architecture. It appears that you've relegated processing of data less than 32 bits in width to the programmer. This decision certainly simplifies the instruction fetch and the memory interface. In my opinion, this is a false economy. I think that you would be better off allowing data to be accessible in 8-, 16-, and 32-bit chunks, and providing instructions for manipulating memory in those standard data widths. You can keep the instruction at 32 bits, and enforce instruction alignment on 32-bit word boundaries for reasons of speed, ease of decoding, etc. On the data handling side, I think that providing a way to handle 8-, 16-, and 32-bit data chunks will simplify the programming in a manner that will be much appreciated.

I can see ways to handle 8-bit and 16-bit I/O devices on your processor's memory bus. However, I can also see the need to manipulate data smaller than 32 bits in memory. The programming problems associated with processing packed 8-bit or 16-bit character strings will lead, I suspect, to storing these types of data structures in an unpacked manner. I think that will prove to be very wasteful of memory even though you've provided as much as 16 GB in your architecture.

Although I like the architecture of the word addressable Nova computers, I think the byte addressable PDP-11 provided a better approach to the handling of strings. You've incorporated concepts from both of these architectures in your design. For improved ease of programming, I think that you should consider allowing your design to directly address and manipulate 8-bit and 16-bit data. The added complexity needed in the memory interface and in your ALU is minor compared to the simplification this capability will provide when dealing directly with packed 8-bit and 16-bit quantities.

Once again thanks for sharing the details of your 65m32 concept.

Michael A.

Thu Aug 04, 2016 11:01 am

Joined: Tue Dec 31, 2013 2:01 am
Posts: 76
Thank you Ed and Michael. Regarding Michael's comments (your participation is greatly appreciated BTW):
Speed would be nice, but is not an immediate goal ... simplicity is. If I could get 20 MIPS using current hobbyist-grade FPGA technology, I would be giddy with excitement. Ease of programming comes naturally to me, because I am the architect, but I most certainly don't want to be the only 65m32 assembly language programmer, so I have become mindful of quirks that may turn potential programmers off.

For example:
My original design didn't even allow direct random access to the full 4 GW in a single instruction; a long literal load to an index register (via 0,n+ addressing, a la pdp-11) had to precede the access, because a full-width address couldn't embed within the instruction word. Garth was silently vexed when I told him this, but Dieter came to my rescue with his clever (maybe even original) "magic value" in the embedded constant, which would trigger the hardware to discard the narrow constant and use the next word in the instruction stream for the full-width operand constant. This works for literal and direct modes, but adds a "wart" to the architecture (instructions are no longer always one word long). I reluctantly agreed that it was necessary.

The 65m32's inability to address individual bytes is definitely eccentric, but (as you said) not without precedent. I have struggled to come up with a clean way to do this, but the complexities which I unearthed in my attempts are more than I am capable of tackling. I have much respect for your opinion, but I confess that I will probably be using UTF-32 strings and pack/unpack subroutines for the narrower chars ... that is, unless those "inspiration pills" I ordered on-line finally get here and perform as advertised ;-)
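
For a sense of what those pack/unpack subroutines would do, here is a Python sketch of four 8-bit chars per 32-bit word. The names pack_chars and unpack_chars , the zero padding, and the choice to put the first char in the most significant byte are all assumptions made for the illustration, not a settled convention.

```python
def pack_chars(text):
    """Pack a bytes string into 32-bit words, four chars per word, zero padded.

    Assumption: the first char of each group goes in the most significant byte.
    """
    padded = text + b"\x00" * (-len(text) % 4)
    return [int.from_bytes(padded[i:i + 4], "big") for i in range(0, len(padded), 4)]

def unpack_chars(words, length):
    """Recover the original bytes string of the given length from packed words."""
    raw = b"".join(word.to_bytes(4, "big") for word in words)
    return raw[:length]
```

Packing b"FORTH" this way takes two words, $464f5254 and $48000000, versus five words as an unpacked UTF-32 string.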

I am exhausted from work today, but I will post part 2 of my overview here soon. This will explain some of the mysterious mnemonics like "cdc" and "sla", why there is no "jsr" in the matrix, the "magic value" implementation, and the unique properties of literal mode.

Keep the suggestions coming. I promise to sincerely consider all of them before stubbornly doing it my way! ;-)

Mike B.

Fri Aug 05, 2016 3:09 am

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 76
Location: Huntsville, AL
Great. Looking forward to your next installment. I too am a fan of the solution that you and Dieter have apparently included in your architecture to deal with long immediate and direct address values. I suspect that feature is partly driven by some of the features of the ARM architecture.

Fixed length instructions do have some benefits, particularly in deeply pipelined implementations. But your assertion that speed is not the only driver indicates to me that your implementation should consider variable length instructions. In this case then, some judicious judgement will be necessary so that the instruction complexity doesn't get out of hand as it did with the VAX-11 instruction set.

One additional 32-bit word in the instruction should probably be the limit. The impact on a pipelined implementation would be minimal, and I suspect very beneficial in the overall CPI. The inability to load long constants will be reflected in the need to implement instruction sequences to build those constants in registers. Disturbing the pipeline slightly to avoid the need to fetch and execute the several instructions needed to build long constants will improve performance far more than the disturbance it might introduce into a pipelined implementation of your architecture. Furthermore, I can't see that feature inhibiting the implementation of a pipelined architecture to support future speed improvements of your architecture.

An architecture that you may want to read about, and possibly borrow some ideas from, is the architecture of the PDP-6 and/or PDP-10. I would like very much to find the time to implement a processor based on this architecture.

Another architecture that I think you may like to explore is the architecture of the CDC 6600. It has a fairly unique way of addressing memory that you may find fits some of your ideas regarding your register usage.

Again, looking forward to your next installment.

Michael A.

Fri Aug 05, 2016 7:24 am

Joined: Tue Dec 31, 2013 2:01 am
Posts: 76
@Michael: Ah yes, the big iron! I time-shared on a CDC Cyber and a VAX 11/780 in my college years, but never got to program them in their native assembly languages. I ran a one-pass Pascal compiler and FSE (full-screen-editor) on the Cyber, and the Ultrix cc compiler and cc68 cross-compiler (along with vi) on the VAX. A curious circumstance: my assembly language course was IBM 360, but there was no 360 on campus, so we ran a 360 assembler/simulator on the Cyber! The VAX, pdp-10 and CDC 6600 are very impressive and personality-rich machines, but I can't get the hang of the assembly languages of any of them ... they are just too foreign for me to put forth the effort to learn in my middle-age. I learned 6502 assembly when I was a teenager, at the peak of my IQ ... it hadn't deteriorated from college partying yet.

Moving on to part 2:

“What about all the old familiar 65xx mnemonics not included in the matrix above, like jmp , bne , pla , tax , iny , clc , jsr , rts , …?” The answer lies in the 65m32's simple yet flexible operand modes, which allow these instructions (and many more) to be synthesized in a single basic 32-bit instruction.

Because the 65m32's literal mode has the useful property of a built-in operand register, it is able to synthesize the 65xx's txa , tax , tya , tay , tsx , txs , inx , iny , dex , dey instructions (and many more) with a simple load or store instruction. The closest analogs to this would be the lea instructions of the 68xx and x86 families. tax translates to ldx #0,a and dey translates to ldy #-1,y … variations on this theme are too numerous to list.

Here are a few translation examples using the 65m32's “indexed” literal operand mode, in which the operand is used at “face value”:
            65816                           65m32
:a9 ff 7f    lda  #32767        :a0307fff    lda  #32767
:a2 00 f0    ldx  #-4096        :a230f000    ldx  #-4096
:e8          inx                :a2100001    ldx  #1,x
:88          dey                :a420ffff    ldy  #-1,y
:8a          txa                :a0100000    lda  #,x
:f4 0b 70    pea  $700b         :b630700b    psh  #$700b
:d0 0d       bne  .+15          :ae76000e    ldn  [ne]#14,n
:10 ce       bpl  .-48          :ae7affcf    ldn  [pl]#-49,n
:4c 54 76    jmp  $7654         :ae307654    ldn  #$7654
:20 56 34    jsr  $3456         :be303456    pdn  #$3456

Note that all of the 65m32 examples above have literal mode operands, even though several of their 65816 counterparts use implied, relative, or absolute modes. Also, note that by treating the 65m32 instruction pointer n in the same manner as the other registers, a layer of indirection is apparently removed in the translation of the last four examples. This is due to the 65m32’s implementation of the instruction pointer … it employs n as just another register with regard to loading and storing. In other words, loading a literal value into n has the effect of jumping to the associated direct address.
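
The hex words in the table can be picked apart with a small Python decoder. The bit layout used here (7-bit op in bits 31..25, 2-bit mode in bits 24..23, 3-bit register in bits 22..20, 4-bit condition in bits 19..16, numeric in bits 15..0) is inferred from the example encodings in this thread rather than quoted from a spec, so treat it as an assumption; decode is a name made up for the sketch.

```python
REGISTERS = "axyzbusn"
CONDITIONS = ["ra", "rn", "hi", "ls", "cc", "cs", "ne", "eq",
              "vc", "vs", "pl", "mi", "ge", "lt", "gt", "le"]

def decode(word):
    """Split a 32-bit instruction word into its fields (inferred layout)."""
    op = (word >> 25) & 0x7F
    numeric = word & 0xFFFF
    if numeric & 0x8000:
        numeric -= 0x10000                   # show the numeric as a signed value
    return {
        "row": op >> 3,                      # row of the op-code matrix (0x0..0xF)
        "col": op & 0x7,                     # column of the op-code matrix (0..7)
        "mode": (word >> 23) & 0x3,          # 0 literal, 1 direct, 2 post-inc, 3 pre-dec
        "reg": REGISTERS[(word >> 20) & 0x7],
        "cond": CONDITIONS[(word >> 16) & 0xF],
        "numeric": numeric,
    }
```

For instance, decode(0xae76000e) gives row $A, column 7 ( ldn ), literal mode, register n , condition ne and numeric 14, matching ldn [ne]#14,n in the table.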

Here are some translation examples using the 65m32's “indexed” direct operand modes, in which the operand is used to form an effective address to memory, for read, write, or read-modify-write:
            65816                           65m32
:ac 9a 78    ldy  $789a         :a4b0789a    ldy  $789a
:99 54 76    sta  $7654,y       :80a07654    sta  $7654,y
:48          pha                :81e00000    sta  ,-s
:68          pla                :a1600000    lda  ,s+
:60          rts                :af600000    ldn  ,s+
:d4 43       pei  ($43)         :b6b00043    psh  $43
:6c 57 13    jmp  ($1357)       :aeb01357    ldn  $1357
:fc 89 67    jsr  ($6789,x)     :be906789    pdn  $6789,x

Note that by treating the instruction pointer n in the same manner as the other registers, a layer of indirection is once again apparently removed in the assembly language translation of the last three examples from above. In other words, loading a direct value from a memory cell into n has the effect of jumping to the associated indirect address. It is my intention to provide many of the more familiar 65xx/68xx-style mnemonics as aliases in the 65m32 assembler, to improve readability and ease-of-transition for those users:
Assembler alias         Native 65m32 instruction
    tax                     ldx  #,a
    rts                     ldn  ,s+
    pha                     sta  ,-s
    jsr  label              pdn  #label
    bpl  label              ldn  [pl]#label+~.,n
    tst  label,x            ldz  label,x
    jmp  label,y            ldn  #label,y
    jmp  (label)            ldn  label
    asl  {no operand}       asl  #1,a
    dey                     ldy  #-1,y
    pea  label              psh  #label
    pei  (label)            psh  label
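
Here is a hedged Python sketch of how such aliases might expand and assemble. The field layout is inferred from the hex encodings shown in this thread, the op numbers are simply row*8 + column from the preliminary matrix, and ALIASES and encode are hypothetical names covering only a handful of operand-free aliases.

```python
REGISTERS = "axyzbusn"

# 7-bit op numbers: matrix row * 8 + column (preliminary matrix, may change).
OPCODES = {"sta": 0x40, "lda": 0x50, "ldx": 0x51, "ldy": 0x52, "ldn": 0x57}

# alias -> (native op, mode, operand register, embedded numeric)
ALIASES = {
    "tax": ("ldx", 0, "a", 0),     # ldx  #,a
    "dey": ("ldy", 0, "y", -1),    # ldy  #-1,y
    "rts": ("ldn", 2, "s", 0),     # ldn  ,s+
    "pha": ("sta", 3, "s", 0),     # sta  ,-s
}

def encode(op, mode, reg, numeric, cond=0):
    """Assemble one 32-bit word: op | mode | register | condition | numeric."""
    return ((OPCODES[op] << 25) | (mode << 23) | (REGISTERS.index(reg) << 20)
            | (cond << 16) | (numeric & 0xFFFF))
```

Under these assumptions encode(*ALIASES["rts"]) produces $af600000 and encode(*ALIASES["pha"]) produces $81e00000, the same words shown in the translation tables above.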

Since the 65m32's direct modes are comparable to the 6502/65816's direct-page and/or absolute addressing modes, an auxiliary pointer register technique must be used to synthesize the (d,x) (d) (d),y and (d,s),y addressing modes of those processors. There is no built-in equivalent for these modes on the 65m32. Fortunately, this pointer address set-up usually can be done once, outside of any tight loops, making the performance hit negligible:
            65816                           65m32
:41 8c       eor  ($8c,x)       :a890008c    ldb  $8c,x
                                :70c00000    eor  ,b

:11 fe       ora  ($fe),y       :a8b000fe    ldb  $fe
                                :e8200000    adb  #,y
                                :50c00000    ora  ,b

:73 03       adc  (3,s),y       :a8e00002    ldb  2,s
                                :e8200000    adb  #,y
                                :52c00000    adc  ,b

:a9 00 00    lda  #0            :a0300000    lda  #0
:a0 63 00    ldy  #99           :a4300064    ldy  #100
          loop:                 :a8e00002    ldb  2,s
:18          clc                :e8200000    adb  #,y
:73 03       adc  (3,s),y                 loop:
:88          dey                :e1c00000    add  ,-b
:10 fa       bpl  loop          :4620fffe    dbn  loop,y

Note that although one or two additional instructions are required to set up register b as the pointer, the checksum loop in the last example above is actually two instructions shorter (thanks to the add and dbn instructions).
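
The two loop shapes can be compared with a small Python model. It only models the order in which the table is walked and summed with 32-bit wraparound, not the 16-bit arithmetic of the 65816 version. It also assumes that dbn decrements its register and branches while the result is nonzero, which has not been spelled out yet in this thread; both function names are made up for the sketch.

```python
MASK = 0xFFFFFFFF

def checksum_count_down_index(table):
    """65816-style walk: y runs from len-1 down to 0, exiting when y goes negative."""
    a = 0
    y = len(table) - 1
    while y >= 0:                    # bpl loop
        a = (a + table[y]) & MASK    # clc / adc (3,s),y
        y -= 1                       # dey
    return a

def checksum_predecrement_pointer(memory, base, count):
    """65m32-style walk: b starts one past the end and is pre-decremented each pass."""
    a = 0
    b = base + count                 # ldb 2,s / adb #,y
    y = count
    while True:
        b -= 1                       # add ,-b  (pre-decrement, then read)
        a = (a + memory[b]) & MASK
        y -= 1                       # dbn loop,y  (assumed: loop while y != 0)
        if y == 0:
            break
    return a
```

Both functions visit the same cells in the same order (last to first), which is why the index can start at 100 in the pre-decrement version where the 65816 version starts at 99.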

Full-width numeric operands

The 65m32 is 32 bits all the way, but most instructions require an opcode and operand data to specify a literal value or an address, and it is generally impossible to fit a 32-bit operand and an op-code into 32 bits.

One way that the 65m32 gets around the problem is to automatically promote the 16-bit embedded numeric value contained in the instruction to 32 bits, by sign-extending it (duplicating bit 15 into bits 16 to 31) before adding it to the contents of the index register. But that only works most of the time, depending on what is being done with the operand. [-32767 ... 32767] is a respectable range that can be used for small increments, constants and offsets, but doesn't enable the 65m32's full potential.

Another way that the 65m32 avoids this problem, at least for large literals , is to allow the use of the direct mode operand 0,n+ to allow a literal value to be placed in-line, directly after the instruction. This method has been used by the pdp-11 with great success, but can only be used to provide the equivalent of extended-literal mode on the 65m32, because of its limited indirection capability. In the case of a conditionally-executed instruction of this type, it is important for the 65m32 programmer to be aware that activation of the conditional skip mechanism will cause the trailing literal to be treated as an instruction, because the auto-increment of n would be skipped as well. Although it is conceivable that this type of behavior could have an obscure use, it is more likely to be a programming error. If this type of behavior is not desirable, the following method should be more appropriate.
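
The hazard with a conditionally-executed 0,n+ operand is easy to see in a toy model. The Python sketch below is illustrative only: step_inline_literal is an invented name, memory is a plain dict of word addresses, and the instruction word itself is not decoded.

```python
def step_inline_literal(memory, n, condition_true):
    """Model one `lda 0,n+` located at word address n.

    Returns (loaded_value, next_n); loaded_value is None if the instruction
    was skipped by its condition field.
    """
    n += 1                       # n is incremented after the instruction fetch
    if not condition_true:
        # Skipped in its entirety, including the post-increment of n,
        # so the in-line literal at n is the next word to be fetched.
        return None, n
    value = memory[n]            # direct mode: n is the effective address
    return value, n + 1          # post-increment steps n over the literal
```

With the instruction at address 0 and the literal at address 1, an executed instruction continues at address 2, while a skipped one continues at address 1 and would try to execute the literal.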

The third (and most flexible) way that the 65m32 gets around the problem is by treating the embedded numeric value of -32768 in a special manner. When an instruction containing this value is fetched and decoded, it automatically triggers a second fetch, and allows in-lining a full 32-bit numeric literal or address immediately after the instruction. The -32768 “magic” value is then discarded, and the full-width constant is used in its place. Unlike the 0,n+ method, which cannot implement extended-direct, this method works equally well in literal and direct operand modes. When composing small (<64kW) programs, this extended-operand mode is typically only needed for large constants, like bit-masks and such, since the embedded 16-bit value provides plenty of reach for most relative branch targets, increments, initializations, etc. The assembler automatically recognizes the need to 'promote' a one-word instruction to a two-word instruction, and the 65m32's conditional execution mechanism 'knows' to skip the second word as well, if applicable. Some examples:
:a0a08000    lda  $23456789,y
:70308000    eor  #$ff000000
:ae908000    ldn  $edcba987,x   \ jmp  ($edcba987,x)
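
A hedged Python sketch of the fetch side of this mechanism follows. The names fetch_numeric and instruction_length are invented, memory is modeled as a list of 32-bit words, and the assumption that the numeric occupies the low 16 bits of the first word comes from the example encodings in this thread.

```python
MAGIC = 0x8000   # embedded numeric -32768 as a raw 16-bit field

def instruction_length(word):
    """Return 2 if the word carries the magic value (a second word follows), else 1."""
    return 2 if (word & 0xFFFF) == MAGIC else 1

def fetch_numeric(memory, n):
    """Fetch the instruction at word address n; return (numeric, next_n).

    numeric is the 32-bit value handed to the operand adder: either the
    sign-extended embedded field or the full-width word that follows.
    """
    word = memory[n]
    field = word & 0xFFFF
    if field == MAGIC:
        return memory[n + 1] & 0xFFFFFFFF, n + 2   # magic value discarded
    if field & 0x8000:
        field |= 0xFFFF0000                        # ordinary sign extension
    return field, n + 1
```

A conditional skip would advance n by instruction_length(word) so that the trailing constant is never fetched as an instruction.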

While translating a DTC Forth variant from 6xxx assembly to 65m32 assembly, I have so far only found a few occasions in hundreds of instructions where this extended-operand technique is necessary, and they were only needed because of the four-char-per-word dictionary name storage convention that I have implemented. For accessing any memory location in the 65m32's 4GW address space, this technique is only necessary if the target address falls outside of the range [-32767 .. 32767] surrounding any index register, including z and n . Otherwise, the basic single-word instruction suffices, making the range [-32767,z .. 32767,z] the 65m32's equivalent to the 6502's zero-page addressing mode, and the range [-32767,n .. 32767,n] the 65m32's equivalent of the 6xxx family’s PC-relative addressing mode.
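
The assembler's decision can be sketched as a one-line range test in Python. needs_extended and signed32 are hypothetical helper names; the only rule taken from the text is that an offset of -32767 .. 32767 from the chosen index register fits in one word, and -32768 is reserved as the magic value.

```python
MASK = 0xFFFFFFFF

def signed32(value):
    """Interpret a 32-bit word as a signed integer."""
    value &= MASK
    return value - (1 << 32) if value & 0x80000000 else value

def needs_extended(target, base=0):
    """True if target cannot be reached from base with the 16-bit embedded numeric.

    base is the current value of the operand register (0 for z, the
    already-incremented instruction pointer for n, and so on).
    """
    offset = signed32(target - base)
    return not (-32767 <= offset <= 32767)
```

So $7654 relative to z fits in one word, $23456789 relative to z does not, and $12345 fits if the chosen register already holds something nearby such as $12000.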

Part 3 ... coming soon!

Mike B.

Sat Aug 06, 2016 6:38 am

Joined: Tue Jan 15, 2013 5:43 am
Posts: 163
Really good to see this thread appear, Mike! :) What you've posted is well-written, too -- something that doesn't happen by accident. And I don't need to tell you I'm already a big fan of this project.

Regarding addressing: if addresses are expressed as 0 to 4 GByte (not GWord), I readily agree the HDL code will be daunting. That's if the thing is done "properly." By that I mean being able to freely access words, half-words and bytes from any address. But there's a half-way solution that's worth considering, and that is to express addresses in bytes... but with the restriction that you can't access any word or half-word that spans a word boundary. In other words, no mis-aligned accesses. This'll make the HDL quite a lot easier. (It also makes 65m32 programming slightly harder, but only compared to having full support for byte addressing. Compared with no support you're much better off.)
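
To illustrate the half-way scheme, here is a Python sketch that splits a byte address into a word index and a byte lane and rejects any access that would span a word boundary. The function name and the exact check are assumptions; a stricter reading of "no mis-aligned accesses" would instead require the address to be a multiple of the access size.

```python
def split_byte_address(addr, size):
    """Map a byte address to (word_index, byte_lane) for a 1-, 2- or 4-byte access.

    Raises ValueError if the access would cross a 32-bit word boundary.
    """
    if size not in (1, 2, 4):
        raise ValueError("size must be 1, 2 or 4 bytes")
    lane = addr & 3                    # position of the first byte within its word
    if lane + size > 4:
        raise ValueError("access spans a word boundary")
    return addr >> 2, lane
```

The memory interface then only ever touches one 32-bit word per access, and the lane selects which bytes of that word take part.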

Full support is something that could be added to a later revision, perhaps. But you won't have that option if version 1 has committed to word addressing. To be fair, word addressing does have the advantage of pushing the memory ceiling from 4 GB to 4 GW, but how much is that really worth compared to the other side of the tradeoff? Hard to say.

As for instruction encoding, you'll need to allow space in your opcode matrix for the byte- and half-word operations. Simple loads and stores are all you really need. But, if you have space, something like the 65xx's TSB and TRB would be nice, too.

barrym95838 wrote:
On the 65m32, the three direct modes cause the machine operation to use the result from the operand adder as an effective address to memory, making them conceptually a superset of the 65xx's abs abs,x abs,y dp dp,x and dp,y addressing modes.

I'm completely sold on the memory addressing modes -- very flexible, elegant & cool! But IMO there's something that's been neglected, and that is the ability to NOT address memory! :shock: :ugeek: :D

I'm saying I think you need more registers. Ideally your code would execute as quickly as it can be fetched. But, since there's no cache for code or for data, the two are forced to take turns accessing memory. More registers can alleviate that -- more so in some situations than others, of course, but we can agree there's a clear advantage.

Everything's a tradeoff. To define more registers we need a wider [rrr] field. That means using fewer bits somewhere else -- such as in the [cccc] field or the [iiiiiiiiiiiiiiii] field. Either of these could go on a diet and still retain most of its utility. But I vote for the [iiiiiiiiiiiiiiii] field. IMO the penalty of shrinking [iiiiiiiiiiiiiiii] to 15 bits is worth mentioning, but hardly major. What it gains us -- 8 scratchpad registers -- is far more significant, IMO.

(When AMD created x86-64 they boosted the register complement -- they saw benefit despite the fact caching is already a given. Without caching the more-registers argument is even more persuasive.)

best regards,


Sat Aug 06, 2016 5:44 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 309
Location: Canada
Thanks for the post Mike. I've been looking forward to learning more about 65m32.

As Dr. Jeff and Mike already mentioned, I have to third the motion for 8/16-bit data support. Maybe leave some room open for 64-bit operands as well. One of my cores had only 32-bit word support, like the 65m32, but when I went to write some software I found it was better to have at least byte loads and stores available. I then added an "orb" ("or byte") instruction and an "stb" ("store byte") instruction that worked only in the lowest 4GB instead of 4GW. In short, the core and software turned into an ugly monster with the oddball support for bytes, and I decided to rewrite it without the "orb" and "stb" instructions, using prefix bytes instead to indicate byte or half-word addressing.

The magic number doesn't need a full 16-bit decode. If you want to save some LUTs, you could just check the top nibble for $8. But it would mean the magic number would be used more often.
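
For illustration, the two decode options might be compared like this in C. This is a sketch resting on an assumption: the thread excerpt here does not state the magic number's value, so MAGIC_CONSTANT = $8000 is a guess based on the [-32767 .. 32767] range quoted earlier, and both function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* ASSUMPTION: the "magic number" that requests an extended operand is the
   16-bit constant $8000. The value is not confirmed by the text. */
#define MAGIC_CONSTANT 0x8000u

/* Full decode: exactly one constant is reserved. */
static bool is_magic_full(uint16_t imm)
{
    return imm == MAGIC_CONSTANT;
}

/* Cheaper decode: only the top nibble is compared, so all 4096 constants
   $8000..$8FFF are treated as magic and need an extended operand. */
static bool is_magic_nibble(uint16_t imm)
{
    return (imm >> 12) == 0x8u;
}
```

The nibble version shrinks the comparator, at the cost of pushing every constant in $8000..$8FFF into the two-word form.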

Could the assembler / compiler support alternate mnemonics for some of the instructions? I'm really fond of the JSR / JMP instructions as opposed to program counter loads.

More registers might help. But one thing I've noticed is that the block memory in an FPGA is just as fast as a processor core. So if running just in block memory it may not be as critical to have so many registers. It depends on target applications I guess.

Robert Finch

Sun Aug 07, 2016 4:30 am

Joined: Tue Jan 15, 2013 5:43 am
Posts: 163
re the "more registers?" question... and I'll try to be brief, 'cause there's other stuff to talk about. Also, I'm looking forward to Part 3 of Mike's doc! :)

TBH, the existing register complement would suit me alright, because my #1 project would be to write a Forth kernel. And, Mike, I know you've coded Forth primitives as a means to road-test the 65m32 instruction set. I'm sure those test cases worked out fine. But other folks have different goals, and that's where the extra registers become important.

Although I don't know much about compilers (other than Forth's), I've often read that the mainstream compilers like a CPU with lots of registers, for performance reasons (and maybe for code density too). Another example -- one where I do have first-hand knowledge -- is Interrupt Service Routines. Due to our 65xx/68xx backgrounds, it's easy for folks like us to take it for granted that every ISR begins by saving some registers and ends by restoring them. But I've done some MSP430 coding, and that really opened my eyes. No save/restore in the ISR -- very snappy! That's because I could afford to let the ISR have some reg's for its own exclusive use. And there are other examples, harder to describe, in which extra reg's result in a positive outcome.

robfinch wrote:
one thing I've noticed is that the block memory in an FPGA is just as fast as a processor core. So if running just in block memory it may not be as critical to have so many registers.
Rob, you probably do this differently but AFAIK Mike's design will handle code and data accesses sequentially. If that's the case then, even if block ram were used, data accesses delay code accesses and thus increase execution time. A larger register pool reduces the number of memory accesses required for data.

robfinch wrote:
It depends on target applications I guess.
Agreed. And I'm admitting Forth is an application that doesn't get much benefit from a larger register pool. But in that regard it's somewhat of an anomaly.

Evaluating tradeoffs means looking at BOTH sides of the equation. And we can resume this discussion at a later time. When we do, maybe someone can make the opposing argument -- namely, that it's really important the [iiiiiiiiiiiiiiii] field be 16-bit, not 15.

J :)


Sun Aug 07, 2016 4:27 pm

Joined: Tue Dec 31, 2013 2:01 am
Posts: 76
Wow! You guys really like your 8-bit bytes, don't you? :) If I had decided on a 36-bit or 40-bit design, would it have made any difference in your advice? Maybe I'm just hopelessly eccentric, but I never liked bytes that much ... they seemed too limiting. Computer programming and design has always been a hobby of mine, not a profession, so I confess that I just don't see the universal appeal of them, at least not enough to complicate the auto-increment and auto-decrement modes, lose 75% of my address space, use more op-code real estate, deal with alignment [and endianness] issues ... Garth hasn't chimed in yet, but I would venture a cautious guess that he wouldn't mind working almost exclusively with words in the embedded projects he is so good at designing. Of course, there are the issues of data density, translation, and compatibility with some existing paradigms, but I'm prepared to make an (almost) clean break ... I hope that I don't regret it, because you guys have certainly given me fair warning.

@Jeff: Yeah, we have been through this discussion before, and I value your point of view (as well as your superior experience). However, if I chose to wander from the 65xx "look and feel", I still would want to maintain a certain level of orthogonality, so I would need to expand my op-field as well as my reg-field, thereby stealing two bits from my embedded constant ... hey, 8-bit op, 4-bit reg, 4-bit condition, 2-bit address mode, 14-bit constant ... it sure does look tidy in hexadecimal! I'll have to meditate a bit further on that ... 16 registers are too much for my brain to juggle, but in the long run I don't want to be the only software composer, so concessions may be necessary.
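
Purely as a thought experiment, the 8/4/4/2/14 split could be packed and unpacked as in the C sketch below. The field order (op in the top byte, constant in the low 14 bits) is an assumption that simply follows the order in which the fields are listed above; the struct and function names are hypothetical, and nothing here reflects the current 65m32 encoding.

```c
#include <stdint.h>

/* Hypothetical layout, most significant bits first:
   op:8 | reg:4 | cond:4 | mode:2 | imm:14   (8 + 4 + 4 + 2 + 14 = 32) */
typedef struct {
    uint32_t op;    /* 0..255 */
    uint32_t reg;   /* 0..15  */
    uint32_t cond;  /* 0..15  */
    uint32_t mode;  /* 0..3   */
    int32_t  imm;   /* -8192..8191, stored as 14-bit two's complement */
} insn_fields;

/* Pack the fields into one 32-bit instruction word. */
static uint32_t insn_pack(insn_fields f)
{
    return ((f.op   & 0xFFu) << 24) |
           ((f.reg  & 0xFu)  << 20) |
           ((f.cond & 0xFu)  << 16) |
           ((f.mode & 0x3u)  << 14) |
           ((uint32_t)f.imm & 0x3FFFu);
}

/* Unpack an instruction word, sign-extending the 14-bit constant. */
static insn_fields insn_unpack(uint32_t w)
{
    insn_fields f;
    uint32_t raw = w & 0x3FFFu;
    f.op   = (w >> 24) & 0xFFu;
    f.reg  = (w >> 20) & 0xFu;
    f.cond = (w >> 16) & 0xFu;
    f.mode = (w >> 14) & 0x3u;
    f.imm  = (raw & 0x2000u) ? (int32_t)raw - 0x4000 : (int32_t)raw;
    return f;
}
```

The cost of the wider op and reg fields shows up in imm: the single-word constant range drops from roughly +/-32K to +/-8K.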

@Rob: Yes, I really like JSR, JMP, RTS, etc. I would make every effort to include them at the earliest possible stages of assembler development.

Well, soldiering on to part 3:

"dbn", "cdd", "cdc", "sla", "pda" ... I'm using a lot of three-letter combinations, but haven't stumbled into any rude ones yet, at least not in standard English. As I told Garth, I'm quite fond of uniform-length mnemonics, and the 65m32 is no exception. This fondness is not without its awkward moments, though. "cdc" is "complement and add with carry", "cdd" is "clear carry, complement and add", "dbn" is "decrement and branch if not zero", "sla" is "store and pull accumulator", and "pda" is "push and load accumulator".

"pda" and friends are a direct (and useful, IMO) outgrowth of the 65m32's treatment of n (the instruction pointer) as "just another register". "pdn" is a direct translation of "jsr", so it's a must-have, but how many times have you guys needed to borrow register a for a while in a section of code? If you want to live dangerously, you just use it and hope that the side-effects on the rest of the system are harmless. If you want to live carefully, you examine all possible entries and exits from your code section, and ensure that the side-effects are harmless. Or, you just push the register, load something into it, process that something, store that something somewhere, and pull the original value. "pda" and "sla" streamline the beginning and ending of that sequence for register a, and registers x, y, b, u, and n have their own machine instructions for the same type of benefit.

A 65m32 instruction execution consists of a read of the operation word, an optional read of the extended operand, an optional read of the direct operand (or a pull), and an optional store [er, write] of the direct operand (or a push). Many useful literal-mode instructions can do their work in a single memory cycle, but it is certainly likely that a complex arithmetic operation would require enough additional time to stall this "one memory access per machine cycle" pattern. My initial design will treat any operations complex enough to foul up this critical path with instruction emulation traps; "mul" and "div" are obvious candidates, but there certainly should be others ... I won't really know for sure until I choose a host FPGA.

I wanted to get into the unique properties of literal mode, but I'm feeling a bit loopy from a hard day of working my tired old body on my run-down old house. I will have to put it off until part 4, and apologize ahead of time if any of the above is utter nonsense. It's definitely bed-time for Mr. Barry ...

Mike B.

Mon Aug 08, 2016 6:31 am

Joined: Tue Dec 31, 2013 2:01 am
Posts: 76
Okay, let's dig in to part 4. But first, an answer to Ed's question about non-zero values of register z.
    stz  label
    stz  #,b

... good ways to store zero or "false" into a memory location or register without affecting any condition codes. But, what if we want to store a 1 or a -1 into a memory location without dirtying a register?
    stz  label,z+
    stz  label+1,-z

... since the index register is updated before the store, and z is allowed to change during the execution of an instruction, the first will store a 1 in (label), and the second will store a -1 in (label). Register z is reset to zero during the next instruction fetch. And we can store any value into a register with something like
    stz  #12345,x

... this is an unusual form of literal addressing, in which the numeric part of the operand is added to the source register instead of the operand register. The result is "stored" into the operand register (x in this example) without affecting any condition codes. Here's a more complete explanation of "split literal" mode.

Irregular Forms of Some Basic Instructions

The operand mode field (especially the literal mode) may be treated in an irregular manner by certain instructions. Let's start with some examples of 'normal' behavior. In the following examples, the condition flag side-effects are determined by the instruction, not the operand:
:a8400003       ldb  #3,b       \ b = b + 3;
:a0000003       lda  #3,a       \ updateNZ(a = a + 3);
:e8400003       adb  #3,b       \ b = b + b + 3;
:52000003       adc  #3,a       \ updateNVZC(c:a = a + a + 3 + c);
:aa600003       ldu  #3,s       \ u = s + 3;
:a2700003       ldx  #3,n       \ updateNZ(x = n + 3);
:d4200003       exa  #3,y       \ temp = y + 3; y = a; updateNZ(a = temp);
:72000003       mul  #3,a       \ updateNZC(c:a = a * (a + 3) + c);
                                \   ( mul and div are unsigned operations )
:7650fff7       div  #-9,u      \ updateNVZC(c:a / (u - 9));
                                \   ( c = remainder , a = quotient )

The above examples illustrate the 'normal' behavior of the literal operand mode. The operand adder adds the contents of the operand register to the embedded constant, and the instruction uses this sum to modify the destination register. But, what should a write or a read-modify-write instruction do with a literal value such as this? In most microprocessors, this mode of addressing is undocumented or forbidden, but the 65m32 makes use of it, by simply modifying the behavior of such instructions:
:a0100003       sta  #3,x       \ This is equivalent to ldx  #3,a but
                                \   with no flag side-effects

Above is the first example of irregularity in the operand calculation, and is called 'split-literal'. The 'normal' behavior for the operand adder would be to calculate (3 + x) , but (3 + a) is calculated instead in this instance. This behavior is utilized by the literal modes of the sta , stx , sty , stz , stb , stu , sts , stn , sla , slx , sly , slb , slu , sls and sln instructions.
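
The difference between the 'normal' and split-literal forms can be modeled in a few lines of C. This is only a sketch of the register arithmetic described above: the regs struct and the function names are hypothetical, only three registers are modeled, and flag updates are left out entirely.

```c
#include <stdint.h>

/* Tiny register file holding just the registers used in the examples.
   z is modeled as a field that is expected to hold zero. */
typedef struct { uint32_t a, x, z; } regs;

/* 'Normal' literal mode, e.g. lda #k,r: the operand adder computes
   (operand register + k); the caller stores the sum in the destination. */
static uint32_t literal_normal(uint32_t operand_reg, int32_t k)
{
    return operand_reg + (uint32_t)k;
}

/* Split-literal store, sta #k,x: the adder uses the SOURCE register a,
   and the sum lands in the operand register x. */
static void sta_split_literal(regs *r, int32_t k)
{
    r->x = r->a + (uint32_t)k;
}

/* stz #k,x: z reads as zero, so x simply receives the constant k. */
static void stz_split_literal(regs *r, int32_t k)
{
    r->x = r->z + (uint32_t)k;
}
```

For example, with a = 10, sta #3,x in this model leaves x = 13 and a unchanged, which matches the "equivalent to ldx #3,a" description.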

The read-modify-write group: inc , dec , tsb , trb , rol , ror , asl , asr , lsl , lsr ... these should all be reasonably familiar to 65xx and 68xx programmers, and they act exactly as expected in the direct modes, where the operand specifies a memory address:
:fd100bb8       dec  3000,x+    \ updateNZ(--*(3000 + x++));
:d74001c3       tsb  451,b+     \ updateNZ(*(451 + b)); *(451 + b++) |= a;
:f0b000c0       asl  192        \ updateNZC(*(192) <<= 1);

BUT ... what about when the operand is literal mode? Since both the embedded constant and the operand register are present, even for the literal mode, it is natural to do a read-modify-write on the register rather than the memory location, and to use the embedded constant to alter the behavior of the operation. This mode is called 'split-literal', since the two operand components rrr and iiiiiiiiiiiiiiii are used independently, rather than being added together in the operand adder:
:fe100bb8       inc  #3000,x    \ updateNZ(x += 3000);
:fe10f448       inc  #-3000,x   \ updateNZC(x += 3000);
:fc400000       dec  #0,b       \ same as tst  #,b
:fc40fffd       dec  #-3,b      \ updateNZC(b -= 3);
:fc400003       dec  #3,b       \ updateNZ(b -= 3);

So, in the cases of inc and dec , a positive numeric constant excludes ^C , and a negative one includes ^C , but does NOT change the direction of the operation. Why not just get rid of dec and allow an inc with a negative constant to replace it? If inc was allowed to decrement as well, and rol to rotate right as well, then there would be no way to dec or ror a memory cell using the direct modes, because the numeric constant would already be used up as an offset in the effective address calculation! It is important to note that the use of ^C makes it easier to detect unsigned "wrap-around" when the increment or decrement constant is greater than one.
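
Here is a hedged C model of the inc/dec behavior just described. The names are hypothetical and the N and Z flags are omitted. Two points are assumptions rather than stated facts: that inc reports an ordinary unsigned carry-out, and that dec uses the 65xx-style "carry set means no borrow" convention (which is what the DIGIT and next examples elsewhere in this thread appear to rely on).

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint32_t value; bool carry; } reg_result;

/* Magnitude of the embedded constant; its sign only selects the ^C update. */
static uint32_t magnitude(int32_t k)
{
    return (k < 0) ? (uint32_t)(-(int64_t)k) : (uint32_t)k;
}

/* inc #k,r: always adds |k|. A negative k also updates carry.
   ASSUMPTION: carry = unsigned carry-out (wrap past 2^32). */
static reg_result inc_split(uint32_t r, int32_t k, bool carry_in)
{
    uint32_t step = magnitude(k);
    reg_result out = { r + step, carry_in };
    if (k < 0)
        out.carry = out.value < r;
    return out;
}

/* dec #k,r: always subtracts |k|. A negative k also updates carry.
   ASSUMPTION: carry set means "no borrow", as on the 65xx. */
static reg_result dec_split(uint32_t r, int32_t k, bool carry_in)
{
    uint32_t step = magnitude(k);
    reg_result out = { r - step, carry_in };
    if (k < 0)
        out.carry = r >= step;
    return out;
}
```

In this model the sign of the constant never flips the direction; it only decides whether unsigned wrap-around is reported through the carry.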

A numeric constant of 0 for all of these instructions results in the equivalent of a tst # instruction, which affects ^NZ , but nothing else:
:f8400001       rol  #1,b       \ 32-bit rotate left, b by 1 bit
:f8100000       rol  #0,x       \ same effect as tst  #,x
:f820fffe       rol  #-2,y      \ 33-bit rotate left, ^C:y by 2 bits
:f6000002       lsr  #2,a       \ updateNZ(a >>= 2);
:f600fffe       lsr  #-2,a      \ updateNZC(c:a >>= 2);
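
The two rotate flavors in the listing above can be sketched in C as follows. This is an illustrative model only: the function names are hypothetical, N and Z updates are omitted, and it assumes that the positive-constant form leaves ^C untouched.

```c
#include <stdbool.h>
#include <stdint.h>

/* rol #n,r with n > 0: plain 32-bit rotate left, carry not involved. */
static uint32_t rol32(uint32_t v, unsigned n)
{
    n &= 31u;
    return n ? (v << n) | (v >> (32u - n)) : v;
}

/* rol #-n,r: 33-bit rotate left of the ring ^C:r, one bit at a time.
   The old carry enters at bit 0 and bit 31 becomes the new carry. */
static uint32_t rol33(uint32_t v, bool *carry, unsigned n)
{
    while (n--) {
        bool out = (v >> 31) & 1u;
        v = (v << 1) | (*carry ? 1u : 0u);
        *carry = out;
    }
    return v;
}
```

A rotate count of 0 returns the value unchanged in both functions, which lines up with the "same effect as tst" remark for rol #0,x.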

In the cases of numeric constants of 0 , and in other cases as well, there is significant redundancy in the instruction set, where dozens (or in some cases, many hundreds) of different instruction bit-patterns do exactly the same thing. This situation is not very desirable when there are 8-bit instructions, but is terribly difficult to avoid for 32-bit instructions without making the instruction decode process much more complex. With billions of possible individual instruction encodings, there is plenty of room for this redundancy, and it even provides a method by which a compiler or assembler (or human coder) can produce machine code with an "individual" quality or "signature" without affecting performance or code-density.

Are things starting to get strange yet? More strangeness coming up soon, in part 5 ... I'll be sharing some longer code snippets.

Mike B.

Thu Aug 11, 2016 7:02 am

Joined: Tue Dec 31, 2013 2:01 am
Posts: 76
You guys ready for part 5? I hope so ...

Let's try coding some useful subroutines, like a memory move (actually a copy). If the ranges overlap, we need to have two versions; one that copies "up" and one that copies "down", to prevent interference during the copy. Here's a 6502 version, courtesy of Bruce Clark:
;------------ NMOS 6502: 75 bytes ------------
; Move memory down
; FROM = source start address
; TO = destination start address
; SIZE = number of bytes to move
MOVEDOWN:
        LDY #0
        LDX SIZEH
        BEQ MD2
MD1:    LDA (FROM),Y  ; move a page at a time
        STA (TO),Y
        INY
        BNE MD1
        INC FROM+1
        INC TO+1
        DEX
        BNE MD1
MD2:    LDX SIZEL
        BEQ MD4
MD3:    LDA (FROM),Y  ; move remaining bytes
        STA (TO),Y
        INY
        DEX
        BNE MD3
MD4:    RTS
; Move memory up
MOVEUP: LDX SIZEH    ; the last byte must be
        CLC          ;   moved first
        TXA          ; start at the final pages
        ADC FROM+1   ;   of FROM and TO
        STA FROM+1
        CLC
        TXA
        ADC TO+1
        STA TO+1
        INX          ; allows the use of BNE
        LDY SIZEL    ;   after the DEX below
        BEQ MU3
        DEY          ; move bytes on the last
        BEQ MU2      ;   page first
MU1:    LDA (FROM),Y
        STA (TO),Y
        DEY
        BNE MU1
MU2:    LDA (FROM),Y ; handle Y = 0 separately
        STA (TO),Y
MU3:    DEY
        DEC FROM+1   ; move next page (if any)
        DEC TO+1
        DEX
        BNE MU1
        RTS

... and here's the same thing (actually better, because it chooses the correct direction by itself) ...

\----------- 65m32: 11 words -----------
\ Move memory up/down (1 W to 4 GW)
\ x = source start address
\ y = destination start address
\ b = number of words to move
\ (4 GW are moved if b == 0 !!)
:c4100000 move:   cpy  #,x
:a1140000 mvdown: lda  [cc],x+
:81240000         sta  [cc],y+
:4644fffd         dbn  [cc]mvdown,b
:afe40000         rts  [cc]
:e2400000         adx  #,b
:e4400000         ady  #,b
:a1900000 moveup: lda  ,-x
:81a00000         sta  ,-y
:4640fffd         dbn  moveup,b
:afe00000         rts

The compare instruction at the top is used to decide the direction, and the [cc] additions to mvdown save a "bcs" instruction by sliding straight down to moveup if the carry is set, at a net cost of a couple of cycles.
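
For readers who think in C, the direction choice made by the cpy at the top can be sketched like this. It is a model, not a translation: move_words is a hypothetical name, memory is a plain array indexed by word, and unlike the 65m32 routine a count of 0 moves nothing here instead of 4 GW.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy n words inside mem[] from index src to index dst, picking the
   direction so that overlapping ranges are never clobbered. */
static void move_words(uint32_t *mem, size_t src, size_t dst, size_t n)
{
    if (dst < src) {
        /* Destination is below the source: ascending copy,
           like the mvdown loop with x+ and y+. */
        while (n--)
            mem[dst++] = mem[src++];
    } else {
        /* Destination is at or above the source: start one past the end
           and pre-decrement, like the moveup loop with -x and -y. */
        src += n;
        dst += n;
        while (n--)
            mem[--dst] = mem[--src];
    }
}
```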

Here's a typical Sieve of Eratosthenes, first in C:

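#define FALSE 0
int flags[8192 + 1];      /* assumed declaration; indices 0..limit are used */

int main () {
 int limit = 8192,
     iter = 100,
     count, y, k, prime;  /* assumed declarations */
 do {
  count = 0;
  y = limit;
  do {
   flags[--y] = -1;
  } while (y != 0);
  do {
   if (flags[y]) {
    count++;              /* matches INC COUNT / inb in the assembly versions */
    prime = y + y + 3;
    k = y + prime;
    while (k <= limit) {
     flags[k] = FALSE;
     k += prime;
    }
   }
  } while
      (y++ != limit);
 } while
      (--iter != 0);
 return 0;
}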

... then in 65c816 assembly, courtesy of Eyes and Lichty:
0000          ERATOS START
0000          LIMIT  GEQU 8192
0000          ITER   GEQU $80
0000          COUNT  GEQU $82
0000          .K     GEQU $84
0000          PRIME  GEQU $86
0000          FLAGS  GEQU $4000
0000 18              CLC
0001 FB              XCE
0002 C2 30           REP #$30
0004          LONGI  ON
0004 A9 64 00        LDA #100
0007 85 80           STA ITER
0009 64 82    AGAIN  STZ COUNT
000B A0 00 20        LDY #LIMIT
000E A9 FF FF        LDA #$FFFF
0011 88       INIT   DEY
0012 88              DEY
0013 99 00 40        STA FLAGS,Y
0016 D0 F9           BNE INIT
0018 B9 FF 3F MAIN   LDA FLAGS-1,Y
001B 10 1E           BPL SKIP
001D E6 82           INC COUNT
001F 98              TYA
0020 0A              ASL A
0021 1A              INC A
0022 1A              INC A
0023 1A              INC A
0024 85 86           STA PRIME
0026 98              TYA
0027 18              CLC
0028 65 86           ADC PRIME
002A C9 01 20 TOP    CMP #LIMIT+1
002D B0 0C           BCS SKIP
002F AA              TAX
0030 E2 20           SEP #$20
0032 9E 00 40        STZ FLAGS,X
0035 C2 21           REP #$21
0037 65 86           ADC PRIME
0039 80 EF           BRA TOP
003B C8       SKIP   INY
003C C0 01 20        CPY #LIMIT+1
003F D0 D7           BNE MAIN
0041 C6 80           DEC ITER
0043 D0 C4           BNE AGAIN
0045 38              SEC
0046 FB              XCE
0047 60              RTS
0048                 END

... then in 65m32 assembly, courtesy of yours truly:

00000000          eratos:
                  limit   .eq 8192
                  ;iter is register u
                  ;count is register b
                  ;k is register a
                  ;prime is register x
                  flags   .eq $4000

00000000:54060064         ldu #100
00000001:50060000 again:  ldb #0
00000002:48062000         ldy #limit
00000003:4007ffff         lda #-1
00000004:a1043fff init:   sta flags-1,y
00000005:8c05fffe         dbn init,y
00000006:4d044000 main:   tst flags,y
00000007:5cae000a         bpl skip
00000008:50080001         inb
00000009:40040000         tya
0000000a:10000003         add #3,a
0000000b:44000000         tax
0000000c:10040000         add #,y
0000000d:80062001 top:    cmp #limit+1
0000000e:5c5e0003         bcs skip
0000000f:ad004000         stz flags,a
00000010:10020000         add #,x
00000011:5c0ffffb         bra top
00000012:48040001 skip:   iny
00000013:48062001         cpy #limit+1
00000014:5c6ffff1         bne main
00000015:8c000000         dbn again,u
00000016:5e0c0000         rts
00000017                  .en

Note: the op-code map changed since I wrote this, so the hex is wrong, but the word count is correct.

Part of my 65m32 journey has included a renewed interest in Forth and its fascinating properties. The 65m32 architecture lends itself well to Forth, and it's no coincidence ... I tailored the machine instruction set to favor a DTC setup, with TOS in register a, although STC and ITC can be economically implemented as well.

Here are some eForth primitives. a is TOS, s is PSP, u is RSP, and y is IP:
$NEXT MACRO
    jmp  (,y+)          \    LODSW              \ load next word into WP (AX)
                        \    JMP AX             \ jump to the word thru WP
ENDM                                            \ IP (SI) points to next word

doLIST ( a -- )         \ Run address list in a colon word.
    sly  ,-u            \    XCHG BP,SP         \ exchange pointers
    $NEXT               \    PUSH SI            \ push return stack
                        \    XCHG BP,SP         \ restore the pointers
                        \    POP SI             \ new list address
                        \    $NEXT

CODE EXIT               \ Terminate a colon definition.
    ldy  ,u+            \    XCHG BP,SP         \ exchange pointers
    $NEXT               \    POP SI             \ pop return stack
                        \    XCHG BP,SP         \ restore the pointers
                        \    $NEXT

CODE EXECUTE ( ca -- )  \ Execute the word at ca.
    sla  #,n            \    POP BX
                        \    JMP BX             \ jump to the code address

CODE doLIT ( -- w )     \ Push inline literal on data stack.
    pda  ,y+            \    LODSW              \ get the literal compiled in-line
    $NEXT               \    PUSH AX            \ push literal on the stack
                        \    $NEXT              \ execute next word after literal

CODE next ( -- )        \ Decrement index and exit loop if < 0
    ldb  ,u+            \    SUB WORD PTR [BP],1 \ decrement the index
    dec  #-1,b          \    JC  NEXT1          \ ?decrement below 0
    stb [cs],-u         \    MOV SI,0[SI]       \ no, continue loop
    ldy [cs],y          \    $NEXT
    iny [cc]            \NEXT1:ADD BP,2         \ yes, pop the index
    $NEXT               \    ADD SI,2           \ exit loop
                        \    $NEXT

CODE ?branch ( f -- )   \ Branch if flag is zero.
    sla  #,b            \    POP BX             \ pop flag
    tst  #,b            \    OR  BX,BX          \ ?flag=0
    iny [ne]            \    JZ  BRAN1          \ yes, so branch
    ldy [eq],y          \    ADD SI,2           \ point IP to next cell
    $NEXT               \    $NEXT
                        \BRAN1:MOV SI,0[SI]     \ IP:=(IP), jump to new address
                        \    $NEXT

CODE branch ( -- )      \ Branch to an inline address.
    ldy  ,y             \    MOV SI,0[SI]       \ jump to new address
    $NEXT               \    $NEXT              \ unconditionally

CODE ! ( w a -- )       \ Pop the data stack to memory.
    sla  #,b            \    POP BX             \ get address from tos
    sla  ,b             \    POP 0[BX]          \ store data to that address
    $NEXT               \    $NEXT

CODE @ ( a -- w )       \ Push memory location to data stack.
    lda  ,a             \    POP BX             \ get address
    $NEXT               \    PUSH 0[BX]         \ fetch data
                        \    $NEXT

CODE C! ( c b -- )      \ Pop data stack to byte memory.
    sla  #,b            \    POP BX             \ get address
    sla  ,b             \    POP AX             \ get data in a cell
    $NEXT               \    MOV 0[BX],AL       \ store one byte
                        \    $NEXT

CODE C@ ( b -- c )      \ Push byte memory content on data stack.
    lda  ,a             \    POP BX             \ get address
    $NEXT               \    XOR AX,AX          \ AX=0 zero the hi byte
                        \    MOV AL,0[BX]       \ get low byte
                        \    PUSH AX            \ push on stack
                        \    $NEXT

CODE RP@ ( -- a )       \ Push current RP to data stack.
    pda  #,u            \    PUSH BP            \ copy address to return stack
    $NEXT               \    $NEXT              \ pointer register BP

CODE RP! ( a -- )       \ Set the return stack pointer.
    sla  #,u            \    POP BP             \ copy (BP) to tos
    $NEXT               \    $NEXT

CODE R> ( -- w )        \ Pop return stack to data stack.
    pda  ,u+            \    PUSH 0[BP]         \ copy w to data stack
    $NEXT               \    ADD BP,2           \ adjust RP for popping
                        \    $NEXT

CODE R@ ( -- w )        \ Copy top of return stack to data stack.
    pda  ,u             \    PUSH 0[BP]         \ copy w to data stack
    $NEXT               \    $NEXT

CODE >R ( w -- )        \ Push data stack to return stack.
    sla  ,-u            \    SUB BP,2           \ adjust RP for pushing
    $NEXT               \    POP 0[BP]          \ push w to return stack
                        \    $NEXT

CODE DROP ( w -- )      \ Discard top stack item.
    pla                 \    ADD SP,2           \ adjust SP to pop
    $NEXT               \    $NEXT

CODE DUP ( w -- w w )   \ Duplicate the top stack item.
    pha                 \    MOV BX,SP          \ use BX to index the stack
    $NEXT               \    PUSH 0[BX]
                        \    $NEXT

CODE SWAP ( w1 w2 -- w2 w1 ) \ Exchange top two stack items.
    exa  ,s             \    POP BX             \ get w2
    $NEXT               \    POP AX             \ get w1
                        \    PUSH BX            \ push w2
                        \    PUSH AX            \ push w1
                        \    $NEXT

CODE OVER ( w1 w2 -- w1 w2 w1 ) \ Copy second stack item to top.
    pda  1,s            \    MOV BX,SP          \ use BX to index the stack
    $NEXT               \    PUSH 2[BX]         \ get w1 and push on stack
                        \    $NEXT

CODE SP@ ( -- a )       \ Push the current data stack pointer.
    pda  #,s            \    MOV BX,SP          \ use BX to index the stack
    $NEXT               \    PUSH BX            \ push SP back
                        \    $NEXT

CODE AND ( w w -- w )   \ Bitwise AND.
    and  ,s+            \    POP BX
    $NEXT               \    POP AX
                        \    AND BX,AX
                        \    PUSH BX
                        \    $NEXT

CODE OR ( w w -- w )    \ Bitwise inclusive OR.
    ora  ,s+            \    POP BX
    $NEXT               \    POP AX
                        \    OR  BX,AX
                        \    PUSH BX
                        \    $NEXT

CODE XOR ( w w -- w )   \ Bitwise exclusive OR.
    eor  ,s+            \    POP BX
    $NEXT               \    POP AX
                        \    XOR BX,AX
                        \    PUSH BX
                        \    $NEXT

CODE UM* ( u1 u2 -- ud ) \ unsigned 32x32->64 multiply
    ldc  #0
    mul  ,s
    sta  ,s             \ low result in NOS
    stc  #,a            \ high result in TOS

CODE UM/MOD ( ud u1 -- u2 u3 ) \ unsigned 64/32->32
    ldc  ,s+            \ c is upper half of ud
    sla  #,x            \ a is lower half of ud, divisor u1 to x
    div  #,x            \ quotient u3 goes to TOS (register a)
    stc  ,-s            \ remainder u2 goes to NOS

A lot of DTC primitives translate to a single 65m32 machine instruction plus a single machine instruction NEXT. This implies that a competent STC version should be about twice as fast, but I am not ready to expend another time-slice of my limited attention span to pursue this yet. When it comes to Forth, it seems that the 65m32 closely resembles the 6809 in instruction count. I have a lot of respect for the 6809, and the 65m32's register names and operand structure reflect this, but my first love is the 6502, and the mnemonics reflect that.
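
As a rough picture of what NEXT, doLIT, branch and ?branch do to the instruction pointer, here is a toy inner interpreter in C. It is a simplification under stated assumptions: small integer tokens stand in for the code addresses that a real DTC thread holds, the data stack is a plain array rather than TOS-in-register-a, there are no bounds checks, and all names (run_thread, OP_...) are hypothetical.

```c
#include <stdint.h>

/* Tokens stand in for code addresses; a real DTC thread stores addresses. */
enum { OP_LIT, OP_DUP, OP_AND, OP_BRANCH, OP_QBRANCH, OP_HALT };

/* Runs a thread and returns the top of the data stack at OP_HALT.
   `ip` plays the role of register y in the listing above. */
static int32_t run_thread(const int32_t *thread)
{
    int32_t stack[32];
    int sp = 0;                      /* number of items on the stack */
    uint32_t ip = 0;

    for (;;) {
        int32_t op = thread[ip++];   /* NEXT: fetch token, advance ip */
        switch (op) {
        case OP_LIT:                 /* doLIT: push the inline cell */
            stack[sp++] = thread[ip++];
            break;
        case OP_DUP:
            stack[sp] = stack[sp - 1];
            sp++;
            break;
        case OP_AND:
            sp--;
            stack[sp - 1] &= stack[sp];
            break;
        case OP_BRANCH:              /* branch: ip := inline target */
            ip = (uint32_t)thread[ip];
            break;
        case OP_QBRANCH:             /* ?branch: branch if flag is zero */
            ip = stack[--sp] ? ip + 1 : (uint32_t)thread[ip];
            break;
        case OP_HALT:
            return sp > 0 ? stack[sp - 1] : 0;
        default:
            return 0;                /* unknown token: stop */
        }
    }
}
```

Each case body is the "one instruction" of the primitive, and the fetch at the top of the loop is the shared NEXT, which is why so many of the primitives above come out as one instruction plus $NEXT.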

Is anyone thoroughly confused yet? I'm doing my best, but my thought processes can get a bit disorganized, and I can't always tell whether or not I'm making any sense until someone breaks me out of my trance.

My next installment will be a call for discussion of some architectural details that I haven't finalized. I am just a hobbyist, so advanced programming techniques aren't among my stronger qualities. I don't want to cripple the 65m32 with some foolish assumptions, so I'll be counting on you guys to advise me.

Mike B.

Sat Aug 13, 2016 2:19 am

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 76
Location: Huntsville, AL

Great set of demo programs. Clearly shows the expressive power of the architecture.

I'd like to suggest that you provide a comment regarding the example source code you included with your assembler code for common FORTH words; it appears to me to be 80x86 assembler, but I don't see where you've explicitly defined the processor.

Michael A.

Sat Aug 13, 2016 7:58 pm

Joined: Tue Dec 31, 2013 2:01 am
Posts: 76
Thanks, Michael. The example (commented out) eForth source is x86, and came from this excellent .pdf by Dr. Ting:

How about another code sample brain teaser?
CODE DIGIT ( char base -- digit true | false )
    ldb  ,s+        \ nip potential digit into b from NOS
    dec  #-'0',b    \ clear ^C and set ^N if b < '0'
    tst  [pl]#-10,b \ clear ^N if b > '9'
    adb  [pl]#-7    \ UTF-32 correction for b > '9'
    cpb  [pl]#10    \ clear ^C if '9' < b < 'A'
    cmp  [cs]#1,b   \ check for base > digit
    stb  [cs],-s    \ if valid digit, tuck it back into NOS
    stz  #,a        \ init TOS to false
    dec  [cs]#1,a   \ change to true if valid digit

The task is to convert a potential digit from UTF-32 to binary if and only if it lies in the interval ['0'..'9', 'A'..'Z'] and is less than base, for values of base in [2..36].
The ldb ,s+ nips the UTF-32 char from NOS into b.
The dec #-'0',b makes the UTF-32 to binary adjustment for [0..9] and causes any char below '0' to fall all the way through to the stz #,a by setting ^N and clearing ^C.
The tst [pl]#-10,b causes any digit in [0..9] to fall through to the cmp [cs]#1,b by setting ^N.
The adb [pl]#-7 adjusts any digit above 9 down by seven without affecting any flags.
The cpb [pl]#10 invalidates the range above 9 but below A by clearing ^C and falling through to the stz #,a.
The cmp [cs]#1,b invalidates any digit >= base by clearing ^C and falling through to the stz #,a.
The stb [cs],-s tucks the translated digit back into NOS, but only if it has passed all tests for validity.
The stz #,a and dec [cs]#1,a place the appropriate flag in TOS.

This makes very careful use of the side-effects of conditionally executed instructions, and could be considered needlessly tricky, but I believe that it's the shortest and most efficient way to do the task (for the 65m32, not necessarily for the human composing, debugging, and/or maintaining this little gem). :)
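
For comparison, the same task written the boring way in C. This is a reference sketch of the specification stated above (accept '0'..'9' and 'A'..'Z', reject anything not less than base), not a translation of the flag tricks; the name to_digit and the out-parameter convention are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Converts UTF-32 code point `ch` to its digit value for `base` (2..36).
   Returns true and writes *digit on success; returns false and leaves
   *digit untouched otherwise. */
static bool to_digit(uint32_t ch, uint32_t base, uint32_t *digit)
{
    uint32_t d;

    if (ch >= '0' && ch <= '9')
        d = ch - '0';
    else if (ch >= 'A' && ch <= 'Z')
        d = ch - 'A' + 10u;
    else
        return false;          /* below '0', between '9' and 'A', or above 'Z' */

    if (d >= base)
        return false;          /* digit not valid in this base */

    *digit = d;
    return true;
}
```

A table of inputs run through both versions would be one way to check the conditional-execution version once the simulator is working.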

Mike B.

Sun Aug 14, 2016 5:44 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 565
Thanks for the code examples - it's a crucial part of a new architecture, to see what the software would look like, and not always made explicit.

As I've said elsewhere, I'm fully in favour of a word-based architecture. Bear in mind that the people who will speak up (on anything) are the people who disagree!

However, it's worth perhaps spelling out what the shifting and masking would look like if anyone felt compelled to deal with packed data - such as you might get from network or storage, depending on what your I/O looks like. It might be that shift-by-8 starts to look like a very useful operation.
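
As a starting point for that discussion, the shifting and masking might look like this in C. It is a sketch with hypothetical helper names, and the lane numbering (lane 0 is the least significant byte) is an assumption, since the thread has not settled on an endianness for packed data.

```c
#include <stdint.h>

/* Extract byte `lane` (0 = least significant) from a 32-bit word. */
static uint32_t get_byte(uint32_t word, unsigned lane)
{
    return (word >> (8u * (lane & 3u))) & 0xFFu;
}

/* Return `word` with byte `lane` replaced by the low 8 bits of `value`. */
static uint32_t set_byte(uint32_t word, unsigned lane, uint32_t value)
{
    unsigned shift = 8u * (lane & 3u);
    uint32_t mask = (uint32_t)0xFFu << shift;
    return (word & ~mask) | ((value & 0xFFu) << shift);
}
```

Every access is a shift by a multiple of 8 plus a mask, which is where a fast shift-by-8 (or a multi-bit shift, as in the lsr #2,a example earlier) would pay off.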

Sun Aug 14, 2016 7:40 am