 Introducing the 65m32 

65m32: Stupid or neat?
Stupid 0%  0%  [ 0 ]
Neat 80%  80%  [ 4 ]
Undecided 0%  0%  [ 0 ]
65m32? 20%  20%  [ 1 ]
Total votes : 5


Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1201
With permission, I'm pasting here comments from ttlworks (a member over on 6502.org but not here) - he won't necessarily have time to follow the discussion here, but this gives us all a chance to see and respond to his thoughts:

Quote:
Nice to see the 65m32 project back to life again.


Meanwhile, I took a look at your 65m32 text at anycpu.org.
Literals now seem to be 16 bits instead of 17.

I remember that trying to "steal" one of those 17 bits from you was pretty hard some years ago.


Concept looks nice so far.
Of course, people would like the ability to use byte or 16-bit word data...
but it looks like you won't have any more free bits in the instruction longword to decode this.
(Byte access would be more important than 16-bit words.)

A possible workaround would be making the memory 4 GBytes instead of 4 GLongwords,
then having "special cases" of LDA and STA.
Since register 'A' holds 4 bytes (or two 16-bit words), this would require:
4*LDA byte, 4*STA byte, 2*LDA word, 2*STA word.
// note that you would have to increment X by 1, 2, or 4 for X+ etc.

That's 12 instructions in total, leaving you seven '???' slots in the opcode map.

Of course, some guys would want the byte or word loaded into 'A' sign-extended, too. 

32-bit longwords would have to be aligned in memory, of course; same thing for 16-bit words.

Another idea for encoding 8-bit or 16-bit transfers would be taking bits away from the literal,
but I'm not sure that's a good idea.
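The byte-in-longword special casing described above could look like this in C. This is only a sketch: the big-endian byte-lane order and the function names are my assumptions, not part of any 65m32 spec.

```c
#include <stdint.h>

/* Hypothetical sketch: a byte-addressed view over 32-bit longword
 * memory, as in the proposed LDA/STA byte special cases.  Byte lanes
 * are taken big-endian to match the packed-string examples later in
 * the thread. */
static uint32_t lw_mem[4] = { 0x05424344u, 0x45460000u, 0, 0 };

static uint32_t lda_byte(uint32_t byte_addr)
{
    uint32_t word = lw_mem[byte_addr >> 2];          /* which longword */
    unsigned shift = (3 - (byte_addr & 3)) * 8;      /* big-endian lane */
    return (word >> shift) & 0xFFu;                  /* zero-extended */
}

static void sta_byte(uint32_t byte_addr, uint32_t a)
{
    unsigned shift = (3 - (byte_addr & 3)) * 8;
    uint32_t mask = 0xFFu << shift;
    uint32_t *w = &lw_mem[byte_addr >> 2];
    *w = (*w & ~mask) | ((a & 0xFFu) << shift);      /* read-modify-write */
}
```

Note that `sta_byte` is a read-modify-write on the longword, which hints at why byte stores cost more than longword stores in a longword-organized memory.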

;---

Looking back at the last revision of the opcode map, 'mod' seems to be missing.

I can't remember exactly... are those 'mul'/'div' instructions signed, unsigned, or both?
My last suggestion on this was:
since you happen to have a decimal flag in the status register,
why not have a signed/unsigned flag for 'mul'/'div'/'mod', too?


Mon Aug 15, 2016 6:13 am

Joined: Tue Dec 31, 2013 2:01 am
Posts: 98
Location: Sacramento, CA, United States
BigEd wrote:
... it's worth perhaps spelling out what the shifting and masking would look like if anyone felt compelled to deal with packed data - such as you might get from network or storage, depending on what your I/O looks like. It might be that shift-by-8 starts to look like a very useful operation.


I knew that I could whip it out when needed, but I hadn't tried until now (untested, naturally):
Code:
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
\ pack a counted string of 32-bit chars @ x into a counted
\   string of 8-bit chars @ y; any 32-bit chars > $FF are
\   truncated, strings longer than 255 chars overflow the
\   8 bits allotted to the length field, and the packed
\   strings are stored "big-endian" style, presumably to
\   simplify string-to-string comparisons:
\  00000005 00000B42 00000C43 00000D44 00000E45 00000F46
\   gets packed to
\  05424344 45460000
\   and
\  00000000
\   gets packed to
\  00000000
pack:
    phy             \ save the registers we're going to
    phx             \   stomp (except for a and c, which
    phu             \   are for optional return values)
    pdb  ,x         \ register b is the char count
pack2:
    ldu  #4         \ register u: buffer pack counter
pack3:
    tst  #,b        \ check the string count
    stz [mi]#,a     \ if done, load padding 00s for the end
    stb [pl]#-1,b   \ else decrement the string count and
    lda [pl],x+     \   fetch fresh unpacked char from *x++
    lsl  #24,a      \ stage the char to "left" edge of a,
    rot  #8         \   and pack it into "right" edge of c
    dbn  pack3,u    \ always pack four chars at a time
    stc  ,y+        \ store packed word buffer into *y++
    tst  #,b        \ end of string?
    bpl  pack2      \   no: pack another word
    plb             \   yes: restore all of the registers
    plu             \     we stomped (except a and c, the
    plx             \     optional return values)
    ply             \     ...
    rts             \     and return to caller

\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
\ unpack a counted string of 8-bit chars @ x into a counted
\   string of 32-bit chars @ y:
\  05424344 45460000
\   gets unpacked to
\  00000005 00000042 00000043 00000044 00000045 00000046
\   and
\  00000000
\   gets unpacked to
\  00000000
unpack:
    phy             \ save the registers we're going to
    phx             \   stomp (except for a and c, which
    phu             \   are for optional return values)
    pdb  ,x         \ register b is the char count
    lsr  #24,b      \ shift the count to "right" edge of b
    inb             \ increment to include count field
unpack2:
    ldc  ,x+        \ fetch a fresh packed word from *x++
    ldu  #4         \ register u: buffer unpack counter
unpack3:
    rot  #8         \ unpack the next char from the buffer
    and  #$ff       \   into the "right" edge of a
    dec  #1,b       \ decrement and check the string count
    sta [pl],y+     \ store unpacked word into *y++
    dbn [pl]unpack3,u \ unpack (up to) four chars at a time
    bpl  unpack2    \ keep unpacking until count < 0
    plb             \ restore all of the registers
    plu             \   we stomped (except a and c, the
    plx             \   optional return values)
    ply             \   ...
    rts             \   and return to caller
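For anyone without a 65m32 handy, here is a C model of the same packing and unpacking logic, with semantics inferred from the comments above (big-endian lanes, chars truncated to 8 bits, the count field packed along with the data):

```c
#include <stdint.h>
#include <stddef.h>

/* Pack a counted string of 32-bit chars (count in cell 0) into
 * big-endian bytes inside 32-bit longwords, mirroring pack: above.
 * Chars > $FF are truncated; the tail is padded with 00s. */
static void pack_c(const uint32_t *x, uint32_t *y)
{
    uint32_t total = x[0] + 1;                       /* count field packs too */
    uint32_t buf = 0;
    size_t out = 0;
    for (uint32_t i = 0; i < total || (i & 3); i++) {
        uint32_t ch = (i < total) ? (x[i] & 0xFFu) : 0;   /* pad with 00 */
        buf = (buf << 8) | ch;                       /* shift into the buffer */
        if ((i & 3) == 3)
            y[out++] = buf;                          /* flush every 4 chars */
    }
}

/* Inverse of pack_c, mirroring unpack: above; returns the char count. */
static uint32_t unpack_c(const uint32_t *x, uint32_t *y)
{
    uint32_t count = x[0] >> 24;                     /* count in the top byte */
    for (uint32_t i = 0; i < count + 1; i++)
        y[i] = (x[i >> 2] >> ((3 - (i & 3)) * 8)) & 0xFFu;
    return count;
}
```

This reproduces the worked example in the comments: `00000005 00000B42 ...` packs to `05424344 45460000`, and unpacking that yields `00000005 00000042 ... 00000046`.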

I have been very busy this week with a hundred unrelated personal tasks, but I wanted to let you know that I'll try to take some time to prepare my next post soon. In it, I will be examining the under-explored territory of 65m32 interrupt and exception management, and I will need some help from you guys in hammering out some of the low-level details.

Thanks for watching,

Mike B.


Sun Aug 21, 2016 6:34 am

Joined: Tue Dec 31, 2013 2:01 am
Posts: 98
Location: Sacramento, CA, United States
Okay, I apologize for not preparing this as well as I would have liked ... things are still a bit hectic around here.

Interrupts and exceptions:

I don't have a lot of experience with these, so I'm just going to throw out some ideas and let you guys evaluate them. The initial incarnation of the 65m32 will not have a separate supervisor mode, but I don't want to cripple a future version by doing something stupid early on. Please tell me if you see any potential problems here.

Reset hardware: Set all registers except a to $ffffffff (including n, the instruction pointer) and proceed. a can be loaded with a hardware version number if desired.
Reset software: memory cell $ffffffff needs to hold a short jump to ROM, a long jump with the target address in $00000000, or a "nop" followed by valid machine code in $00000000. The Reset routine should set up registers k, m, q, r, s and p, in roughly that order, then initiate a cold-start sequence of some sort. It is generally assumed that "negative" addresses are ROM and "positive" addresses are RAM, but this isn't etched in stone.

k holds the address of the BRK ISR.
m holds the address of the NMI ISR.
q holds the address of the IRQ1 ISR.
r holds the address of the IRQ2 ISR.
s is the system stack pointer.
p is the processor status register. The upper 24 bits are reserved for interrupt disable bits and future expansion, but are all accessible for now. The same goes for the vector registers detailed above.

An exception is triggered by pulling an external input (/RES /NMI /IRQ1 /IRQ2) low, or by executing a BRK or ILL instruction. All external exceptions except reset wait until the current instruction has completed, then (if enabled) push registers n and p on the system stack before loading n with the vector contained in the appropriate register. IRQ1 and IRQ2 disable themselves before the ISR is entered. Reset doesn't try to push anything. BRK and ILL also push the computed and/or fetched value of their operand, after pushing n and p. WAI pushes n and p, loads p with the operand value, and waits for an external interrupt. RTI pulls p and n, and cannot be interrupted until it does both.
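A minimal C sketch of the entry sequence just described, using the register names from the post. The memory-array stack, the IRQ1-disable bit position, and pushing n before p (consistent with "RTI pulls p and n") are my assumptions:

```c
#include <stdint.h>

/* Registers named per the post: n = instruction pointer, p = status,
 * s = system stack pointer, q = IRQ1 vector register. */
typedef struct { uint32_t n, p, s, q; } Cpu;

#define P_I1 (1u << 8)   /* hypothetical IRQ1-disable bit in the upper 24 */

static uint32_t mem[16];

/* Taken between instructions: push n then p, disable IRQ1, vector. */
static void take_irq1(Cpu *c)
{
    mem[--c->s] = c->n;      /* push return address */
    mem[--c->s] = c->p;      /* push status */
    c->p |= P_I1;            /* IRQ1 disables itself before the ISR */
    c->n = c->q;             /* vector held in register q */
}

static void rti(Cpu *c)      /* RTI pulls p and n, uninterruptibly */
{
    c->p = mem[c->s++];
    c->n = mem[c->s++];
}
```

A round trip through `take_irq1` and `rti` should restore n and p exactly, with the disable bit dropped because p was pushed before it was set.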

Sounds simple enough, right? Well, I eventually want to have privileged instructions and registers (including the upper 24 bits of p), and I don't want to set any awkward precedents here.

Some unanswered questions:
1) Some complex arithmetic instructions will need several to dozens of machine cycles to complete in the initial design. Should these "TRAP" to their own instruction handler routines?
2) 65m32 instructions can be one or two words long. This could cause future memory management issues if, for example, an instruction straddles a paging boundary. Is it too soon to think about stuff like this?
3) I know there was something else I was going to ask, but it's late and my brain is on autopilot right now. Can you guys think of anything that I missed or completely botched?

Thanks for watching,

Mike B.


Thu Sep 01, 2016 7:52 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1201
About long-running instructions: ARM provides two modes, a conventional mode where instructions will finish before the interrupt, and a fast-interrupt mode where instructions are abandoned and will be restarted after the RTI. (It does not offer what the later 68000 did, which is to save internal machine state on interrupt such that a long-running instruction can be interrupted part way through and will be resumed after the RTI.)

And this might help with the question of faulting on a two-word instruction: using the restart paradigm will solve this. There's some penalty in that the interrupted instruction has done some cycles of work which will be redone, but I'd say it's worth paying. It does mean you have to design all your instructions to allow rerun. (Or make the unrerunnable ones uninterruptible.)


Thu Sep 01, 2016 8:28 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 901
Location: Canada
Quote:
2) 65m32 instructions can be one or two words long. This could cause future memory management issues if, for example, an instruction straddles a paging boundary. Is it too soon to think about stuff like this?

One means of avoiding problems is having the assembler output single word NOP instructions close to the paging boundary until the boundary is crossed. This assumes that the code isn't going to be moved around other than in page boundary sizes.
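Robert's NOP-padding idea might look like this in an assembler's emit path. This is a sketch only: the page size, the NOP encoding, and the function shape are all assumed.

```c
#include <stdint.h>

#define PAGE 0x1000u          /* assumed page size, in words */
#define NOP  0xEAEAEAEAu      /* placeholder single-word NOP encoding */

/* Emit an instruction of len words (1 or 2), first padding with NOPs
 * so that a two-word instruction never straddles a page boundary. */
static uint32_t emit(uint32_t *image, uint32_t pc,
                     const uint32_t *words, unsigned len)
{
    while ((pc / PAGE) != ((pc + len - 1) / PAGE))
        image[pc++] = NOP;                 /* pad up to the boundary */
    for (unsigned i = 0; i < len; i++)
        image[pc++] = words[i];
    return pc;                             /* next free location */
}
```

As Robert notes, this only works if code moves around in whole-page units, so the padding stays aligned to real boundaries.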

_________________
Robert Finch http://www.finitron.ca


Fri Sep 02, 2016 5:32 am

Joined: Tue Dec 31, 2013 2:01 am
Posts: 98
Location: Sacramento, CA, United States
I remembered my fourth question:

4) Should I provide a mechanism for externally-vectored interrupts?

I am thinking that I could just add a privileged interrupt base register and another external request line. A peripheral wishing to access the vectored interrupt capability would have to provide an ID on the data bus during the acknowledgement phase which would be used as an offset into a vector table to which that register is pointing.
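In C terms, the acknowledge phase described might boil down to a single table lookup (register and table names are assumed, not proposed syntax):

```c
#include <stdint.h>

static uint32_t mem[64];

/* Sketch of the acknowledge phase: the peripheral drives an ID on the
 * data bus; the CPU uses it as an offset into a vector table addressed
 * by a privileged interrupt base register ('ib' is my name for it). */
static uint32_t vectored_target(uint32_t ib, uint32_t id)
{
    return mem[ib + id];     /* becomes the new value of n */
}
```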

Regarding the restarting of an "aborted" instruction, I suppose that a proper treatment can wait until after I have version 1.0 up and running. I don't wish to postpone further development by getting caught up in minutiae ... that has already become a frustrating trend in my little journey.

If no one has any further objections to my current plan, I will:
1) Head back to my simulator source (~75% finished in C) and get it to compile.
2) Modify a cross-assembler to fit my needs (probably SBASM Version 3).
3) Whip up a small "native" machine language monitor like WOZmon, in assembly.
4) Compose a simple (but hopefully expandable) "BIOS", also in assembly.
5) Install eForth.
6) Optimize eForth for size (first) and speed (second).
7) Learn enough Verilog to configure an inexpensive FPGA board as a 65m32 emulator.
8) Learn how to use eForth to do pretty much anything else, including porting other interpreters and/or compilers.

I will post my simulator source here when I get it to compile cleanly.

Thanks for watching,

Mike B.


Thu Sep 29, 2016 5:08 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1201
Just thinking aloud: I suppose vectored interrupts are useful when
- peripherals are plugged in with an accompanying ROM with the drivers on
- there are more sources of interrupt than interrupt lines and you need to handle more than one of them in a very time-critical fashion

Of course, I'm used to the 6502, with only two interrupts, where the interrupt handler would need to know about the peripherals and check each one in some order, if there are more than two of them. It looks like ARM provides up to 32 interrupt sources (including faults and SVC calls and some reserved).

I'm waffling about this because my inclination would be to make the simplest effective architecture, and I'm not sure I would go down the route of externally supplied vectors. Which is not to say they are a bad idea!

I like the plan though - simulator, assembler, monitor, and so on.


Thu Sep 29, 2016 8:35 am

Joined: Tue Dec 11, 2012 8:03 am
Posts: 265
Location: California
barrym95838 wrote:
I remembered my fourth question:

4) Should I provide a mechanism for externally-vectored interrupts?

I am thinking that I could just add a privileged interrupt base register and another external request line. A peripheral wishing to access the vectored interrupt capability would have to provide an ID on the data bus during the acknowledgement phase which would be used as an offset into a vector table to which that register is pointing.

Would the vector table be arranged somehow in order of priority, such that higher-priority interrupts can interrupt the servicing of lower-priority ones, but not vice-versa?

I'm always using interrupts, but very few of all the possible sources are ever active at once; so my prioritizing, although on the 6502/816 (which requires polling), is a matter of polling the sources in order of priority, and never wasting time polling sources that are not enabled. When an interrupt is enabled and its service routine is installed, it goes in with an order of priority, so the highest priority gets polled first, and if that's the one, its service is more immediate, not having wasted time polling the others. I only arranged for a list of eight, and I've never used nearly that many. If a source is disabled, its place in the priority queue is deleted and lower-priority ones get scooted up to fill in the space.

This is all done in software though, not hardware. To have it in hardware would be nice, although again, in my uses, very few of all the possible interrupt sources are enabled at once.
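Garth's software scheme can be sketched in C like this. The structure and function names are hypothetical; `pending()` and `service()` stand in for real device polling and handler code.

```c
#include <stddef.h>

/* Enabled sources are kept in a list ordered by priority; the handler
 * polls them in order and services the first active one, so disabled
 * sources cost nothing. */
typedef struct {
    int (*pending)(void);   /* is this source asserting an interrupt? */
    void (*service)(void);  /* its installed handler */
} Source;

static Source queue[8];
static size_t nsources = 0;

static void install(Source s)            /* append = lowest priority */
{
    queue[nsources++] = s;
}

static void uninstall(size_t i)          /* scoot lower-priority ones up */
{
    for (; i + 1 < nsources; i++)
        queue[i] = queue[i + 1];
    nsources--;
}

static void irq_handler(void)            /* poll in priority order */
{
    for (size_t i = 0; i < nsources; i++)
        if (queue[i].pending()) {
            queue[i].service();
            return;
        }
}
```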

_________________
http://WilsonMinesCo.com/ lots of 6502 resources


Thu Sep 29, 2016 12:27 pm

Joined: Tue Dec 31, 2013 2:01 am
Posts: 98
Location: Sacramento, CA, United States
Hey, long time no see! I am still trying to get my simulator to compile, but I have run into a tough design decision, and I hope that you all can help me decide how to proceed. It's a bit complex, and I'm trying to sneak it in during my day job, so we'll see how well I can present it.

The 65m32 has several different machine code execution scenarios, and some of the more complex ones need some clarification. What I need is advice on which path would be the most efficient, because it seems to me that anything I choose is possible at the expense of efficiency, but I would like to take care of this early to avoid future headaches. Confused yet?

The fetch, decode, execute loop may or may not be pipelined, but let's initially assume that the pipeline is either very shallow or non-existent.

Here are some simple examples first, so I can (try to) introduce the problem properly:

lda #,b \ R
sta #,u \ R
ror #1,x \ R

Above, the R is the op-code read, and nothing else is necessary to complete the execution, since these are simple literal mode instructions.

lda #123456,b \ RR
sta #123456,u \ RR
ror #123456,x \ RR


Above, the first R is the op-code read, and the second R is the extended operand read. They are also literal mode, so nothing else has to happen to complete the execution. The huge literal for the ror instruction is a bit silly, but still legal.

lda 5678,b \ RR
sta 5678,u \ RW
ror 5678,x \ RRW


Above, the initial R is the op-code read, and the action requires a read, write, or read-modify-write, respectively. If the operand were changed to something like 123456, then it wouldn't fit inside the 32-bit op-code, so the above instructions would become RRR, RRW, and RRRW, respectively (thanks, Dieter).
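One way to keep these cases straight in a simulator is a tiny classifier (my own sketch, not 65m32 machinery) that derives the bus sequence from the operand and access type:

```c
/* Derive the memory-access sequence for an instruction from whether
 * the operand needs an extension word and whether the operation reads,
 * writes, or read-modify-writes data memory (literal mode touches no
 * data memory at all).  Writes a string like "RRW" into buf. */
static const char *bus_pattern(char *buf, int ext_word,
                               int reads_mem, int writes_mem)
{
    char *p = buf;
    *p++ = 'R';                    /* op-code fetch */
    if (ext_word)   *p++ = 'R';    /* extended operand fetch */
    if (reads_mem)  *p++ = 'R';    /* data read */
    if (writes_mem) *p++ = 'W';    /* data write (kept last) */
    *p = '\0';
    return buf;
}
```

Running the examples from the post through it: `lda #,b` gives "R", `lda #123456,b` gives "RR", `sta 5678,u` gives "RW", `ror 5678,x` gives "RRW", and `ror 123456,x` gives "RRRW".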

Okay, are we still on the same page? I hope so, because here's where it starts to get tricky.

jsr 1234,z \ RW
jsr (1234,z) \ RRW

jsr 1234,z translates to the native 'm32 instruction pdn #1234,z so the W is the pushing of register n (the instruction pointer) to *--s; the subroutine address is contained in the op-code.
jsr (1234,z) translates to the native 'm32 instruction pdn 1234,z so the second R is the read from *(1234+z) before the W to *--s. It could also be RWR, but a little voice inside my head tells me that this should be avoided, for reasons I'll try to explain below. But keep the RWR possibility in mind for now.

sla 1234,x \ RRW

This is "Store-pulL", the "opposite" of "Push-loaD" above, and it pops the new value of a from *s++ before writing the old value of a to *(1234+x). My overwhelming urge is to always put the Write at the end, but I don't know if it is a false instinct, or has some engineering merit. It doesn't really become a programming issue until you do something tricky, like:

jsr (,s+) \ RRW, equivalent to pdn ,s+
or
sln ,-s \ RRW

This is tricky but legal code, and I need to get a grip on what this instruction will do before finalizing my simulator. Let's break it down and see what happens. First, jsr (,s+):

1) Fetch the op-code, note that it has a short operand, so no extended operand fetch.
2) Feed the inherent 0 and the contents of s to the operand adder, which calculates ea = (0+s) and increments s.
3) Read the new operand from memory at *ea, and hang on to it somewhere close, like 'temp'.
4) Push the current value of register n on the stack ... *(--s) = n
5) Transfer the value of 'temp' to register n, and proceed to the next instruction fetch at the new address.

This is how I envision the mechanism, but I don't know if I'm painting myself into a corner regarding speed and/or complexity and/or pipelining and/or memory management by doing it as RRW instead of RWR. If it is RRW, then it exchanges the value in the instruction pointer with the value stored on the top of the system stack. If it's RWR, then it's a slow no-op.

Also, I am assuming that the value of s in step 4 has already been incremented by step 2, but this is a slightly different consideration than the RWR issue.
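To make the two orderings concrete, here is a small C model of jsr (,s+) with the effective address landing on TOS (memory layout and step numbering per the post; everything else is my assumption). The RRW order exchanges n with the top of stack, while RWR reads back the word it just pushed:

```c
#include <stdint.h>

static uint32_t mem[8];

typedef struct { uint32_t n, s; } Cpu;

/* jsr (,s+): ea = (0 + s), s post-increments (step 2), and the push
 * in step 4 uses the already-incremented s. */
static void jsr_ind_spp(Cpu *c, int rwr)
{
    uint32_t ea = c->s;
    c->s += 1;
    if (!rwr) {                   /* RRW: read, then push, then jump */
        uint32_t temp = mem[ea];  /* R: fetch target from TOS */
        mem[--c->s] = c->n;       /* W: push n (s back onto old TOS) */
        c->n = temp;              /* exchange complete */
    } else {                      /* RWR: push first, then read */
        mem[--c->s] = c->n;       /* W clobbers the target word */
        c->n = mem[ea];           /* R returns old n: a slow no-op jump */
    }
}
```

Running both orders confirms the post's claim: RRW swaps n with the value on the top of the system stack, and RWR leaves n unchanged (though the old TOS word is overwritten either way).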

Next, sln ,-s ... well, I ran out of time here, but I'll field any questions you may have in a timely manner.

Can any of you provide any insights regarding my question about whether or not I should proceed as presented? Am I getting bogged down in unnecessary trivia? Will I ever manage to power through this project and get it hosted on an FPGA in my lifetime?

As always, TIA.

Mike B.


Wed Jan 04, 2017 7:57 pm

Joined: Tue Jan 15, 2013 5:43 am
Posts: 180
Nice to hear about your progress, Mike!

Quote:
I have run into a tough design decision [...] What I need is advice on which path would be the most efficient, because it seems to me that anything I choose is possible at the expense of efficiency
Am I missing something? Why would you care about efficiency of the simulator? IMO that is getting bogged down in unnecessary trivia. It's the hardware version you want to be efficient. The emulator is just a thinking aid to get you there.

Thoughts on simulation: I'm hardly an HDL expert, but I know Verilog and VHDL have stuff under the hood that provides the illusion that all your statements are executing simultaneously. On actual hardware all the LUTs/macrocells *do* execute simultaneously, but a simulation on a host PC does some fancy footwork to make the illusion accurate, even in the face of logic which may very inconveniently second-guess itself before (eventually) settling into a stable state. In hardware you can largely ignore that, and ideally your simulator should take it in stride, too -- as if all your statements execute simultaneously.

I'd like to hear some opinions better informed than mine. It seems to me that, if you're careful, C (that's what you're using, right?) might be adequate to simulate a non-pipelined processor, but as the complexity goes up, so does the chance that you'll want the illusion of simultaneous execution. You can code your way around these problems, but obviously you have to get it right or else the simulation won't do what the hardware does.
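The standard C trick for the illusion Jeff describes is a two-phase update: compute the entire next state from the current state, then commit it in one step, much like non-blocking assignment in an HDL. Swapping two registers "in one clock" is the classic test case; a naive sequential update would lose one of the values.

```c
#include <stdint.h>

typedef struct { uint32_t a, b; } State;

/* One clock edge: every right-hand side reads the OLD state, and the
 * whole new state is committed at once, as with a <= b; b <= a; in
 * Verilog non-blocking assignment. */
static State clock_edge(State cur)
{
    State next;
    next.a = cur.b;
    next.b = cur.a;
    return next;
}
```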

Quote:
My overwhelming urge is to always put the Write at the end, but I don't know if it is a false instinct, or has some engineering merit. It doesn't really become a programming issue until you do something tricky, like:

My overwhelming urge would be to avoid unnecessary complexity wherever possible, and the least complex policy is to unload the register before putting something new in. I didn't absorb your "something tricky" example, but I guess you have to weigh the value of having that capability. The added complexity in this case is probably manageable; IDK. But my suggestion is to forget instinct, and just evaluate the tradeoffs.

cheers,
Jeff

_________________
http://LaughtonElectronics.com


Wed Jan 04, 2017 11:52 pm

Joined: Tue Dec 31, 2013 2:01 am
Posts: 98
Location: Sacramento, CA, United States
Dr Jefyll wrote:
... Why would you care about efficiency of the simulator? IMO that is getting bogged down in unnecessary trivia. It's the hardware version you want to be efficient. The emulator is just a thinking aid to get you there.

[ ... snip ...]

You can code your way around these problems, but obviously you have to get it right or else the simulation won't do what the hardware does.

[ ... snip ...]

... the least complex policy is to unload the register before putting something new in. I didn't absorb your "something tricky" example, but I guess you have to weigh the value of having that capability.

Thanks, Jeff. I want my simulator to accurately model hardware that has not yet been implemented, so I'm looking for a clue as to how an efficient hardware implementation would likely do it, to prevent excessive revisions to the simulator. My example involved a hypothetical instruction that uses the same effective address (in this case TOS) for two purposes [e.g. source and destination] in the same instruction. If it were a rotate-memory-cell instruction, it seems obvious that you would read, modify, then write, and I like that idea. I just want to carry it a bit further with my Push-loaD and Store-pulL instructions for the "corner case" of the effective address of the load or store pointing to TOS (thereby "interfering" with the push or pull). I would prefer not to make this undefined behavior. "Always write to RAM last" makes the little engineer in my head happy, but I don't know if he's leading me astray, and I was hoping for some external opinions.

I guess that the nearest 6502 analogy would be to put a JSR instruction in the stack area near TOS and execute it. Does the address field of the JSR instruction get overwritten by the pushed return address before it can be used for the jump?

Mike B.


Thu Jan 05, 2017 2:17 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1201
It's an interesting puzzle, which is the "better" instruction set design, one which writes earlier or later in the sequence of memory accesses. I don't have an answer! But I am reminded of the 6502 and the difference in stack content between JSR/RTS and BRK/RTI which has caught out a few programmers and emulator writers. In the case of the 6502, the designers chose according to simplicity of implementation, because low cost (small area) was their highest priority.

In your case, you have the question of what the programmer's model is (easy to learn and use) versus what the implementation is (easy to implement and test) - a difficult tradeoff. I think I'd go for ease of implementation, because if that's too hard, you'll never be programming the device! But it's a close call - there might be some really useful aspect of choosing the other way.


Thu Jan 05, 2017 9:08 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1201
Responding to Jeff's points about simulation: in making a commercial CPU, there are often two or even three levels of emulation, above the HDL and the simulation we do on that. You need
- an instruction set simulator, to explore and explain the architecture as abstractly as possible. Needs to be simple.
- an architectural simulator, with some modelling of pipeline delays, interlocks, mediated access to limited resources, to explore performance of microarchitectural choices. Memory accesses may be in approximately the right order.
- a cycle accurate simulator, for higher performance simulation than HDL offers. Memory accesses will be in exact order, pipeline stalls and flushes will be exact.
(All the above would generally be in C++ or C. Just possibly the highest-level might be in something else. For a hobby project, python or go or maybe java might be good choices. Or JavaScript!)

When it comes to HDL we often see at least
- two-level simulation for performance, sometimes using FPGAs
- four-level simulation to determine that uninitialised or undefined state is not problematic, that tristate busses are correctly used (at most one driver, no reliance on stored charge.) (The four levels are 0, 1, Z and X. There are also other modes, like nine levels, but I'm not aware of them being used.)

We use static timing analysis rather than timed simulations, almost everywhere, so we deal always at the level of clock cycles not picoseconds.

We probably do some timed simulations to explore the crossing of clock cycle boundaries, and will also do some very limited circuit simulations to explore the design of the logic library, the clock generation and distribution, the pad drivers, PLL, temperature sensor, memories and other analogue parts.

(I'm approximating, of course, because I was always one step removed from the real work and because it's been a few years.)


Thu Jan 05, 2017 9:16 am

Joined: Tue Jan 15, 2013 5:43 am
Posts: 180
Thanks, Ed. Quite a lot to think about. Let me play some of that back to you.

  • an instruction set simulator. Only good for trying out code examples. Doesn't model the hardware in any way. The instructions Just Work -- purely by hand-waving! :)

  • an architectural simulator. Models someone's notion of what the pipeline delays etc will supposedly be, in order to appraise performance.

  • a cycle accurate simulator. Does NOT employ the potentially unreliable assumptions of humans. Models the actual hardware at the gate level. Hugely resource-intensive!

Is that a fair summary? And can anyone comment regarding whether it's safe for Mike to go with the first option if his implementation is not pipelined?

-- Jeff

_________________
http://LaughtonElectronics.com


Sun Jan 08, 2017 4:50 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1201
Thanks Jeff - I think that's pretty close to the mark. I have dim recollections of people struggling to get memory accesses into the right order, so there must be subtleties of what degree of fidelity each model is taken to. It's also true in my experience that not all models are kept fully up to date and working - as the project progresses, they have served their purpose.

I would just note that "trying out code examples" is very important: how dense is the code, how well fitted to the expected usage, how readily targeted by a compiler, are all crucial. If doubling the register file knocks 10% off the speed, is that good or bad? You have to make some estimates of the implications of things, or the architecture won't be buildable. It's also notable that deeply expert machine-level coding of some kernels might get 10-20% performance improvement, where increasing the clock speed by so much might be nearly impossible or very expensive. So I've seen small teams of expert coders doing that, tracking and influencing the architecture.

In our own world, I would think a python or javascript simulator (emulator) even if not quite cycle accurate could be a great thing to have before descending into HDL. Of course, opinions will differ!


Sun Jan 08, 2017 6:58 pm