 rj16 - a homebrew 16-bit cpu 

Joined: Sun Dec 20, 2020 1:54 pm
Posts: 74
Philosophically, it makes sense to have two separate spaces: allowing only particular instructions to write into the code-space is better and safer, especially if you restrict those instructions to "super-user" privilege.

The IJVM enforces this concept even better: the separation goes all the way down to a read-only section, so "constants" really are things that cannot be modified by user applications(1), because they are stored in code-space and hence their modification is restricted to the special ldcd/stcd instructions.

It's called the "constant pool" in IJVM terminology, but basically it's what you have to do with the AVR8 (Arduino) when you need LUTs and tables. Although the AVR8 doesn't have any privilege separation, there are special instructions to access the flash.

(1) User applications run instructions without super-user privilege; their load/store can only access the data-space, not the code-space.
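The separation described above can be sketched as a toy simulation. The ldcd/stcd names come from the post; the Machine class and everything else here are illustrative assumptions, not the IJVM's actual mechanism:

```python
# Sketch: separate code-space and data-space, where only privileged
# ldcd/stcd-style accesses may touch code-space. Purely illustrative.

class ProtectionFault(Exception):
    pass

class Machine:
    def __init__(self):
        self.code = [0] * 256   # code-space (holds the constant pool)
        self.data = [0] * 256   # data-space
        self.super_user = False

    # ordinary load/store: data-space only
    def ld(self, addr):
        return self.data[addr]

    def st(self, addr, value):
        self.data[addr] = value

    # ldcd/stcd: the only way to reach code-space, and only when privileged
    def ldcd(self, addr):
        if not self.super_user:
            raise ProtectionFault("ldcd requires super-user privilege")
        return self.code[addr]

    def stcd(self, addr, value):
        if not self.super_user:
            raise ProtectionFault("stcd requires super-user privilege")
        self.code[addr] = value

m = Machine()
m.super_user = True
m.stcd(0, 42)          # kernel plants a constant in the pool
m.super_user = False
m.st(0, 7)             # user code may touch data-space freely
try:
    m.stcd(0, 0)       # ...but not the constant pool
except ProtectionFault as e:
    print(e)           # prints: stcd requires super-user privilege
```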


Last edited by DiTBho on Sat Apr 10, 2021 8:00 pm, edited 2 times in total.



Fri Apr 09, 2021 9:31 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
I think 'ST r #' would be the best opcode to modify for a Harvard machine.
A modified 6502 memory cycle could let you use a single bus if you are willing to latch the MAR, if you go with a TTL design. Phase 1: output the data address, read/write program memory. Phase 2: output the program address, read/write data.
Can the 6502 be modified for a Harvard architecture with an FPGA version of the CPU?
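One way to read the two-phase scheme (my interpretation of the timing, with illustrative names) is that the latched MAR lets a single physical bus serve program memory in one phase and data memory in the other:

```python
# Toy model of a time-multiplexed single bus: one phase drives the bus
# with the program address, the other with the latched data address (MAR).
# Purely illustrative; no claim about real 6502 cycle timing.

prog_mem = {0x10: "LDA $80"}
data_mem = {0x80: 5}

def cycle(pc, mar):
    bus = pc                      # program address on the bus
    instr = prog_mem.get(bus)     # read program memory
    bus = mar                     # latched data address on the bus
    operand = data_mem.get(bus)   # read (or write) data memory
    return instr, operand

print(cycle(0x10, 0x80))   # ('LDA $80', 5)
```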


Sat Apr 10, 2021 7:37 pm

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
Quote:
Can the 6502 be modified for a Harvard
architecture with a FPGA version of the CPU?
I would venture that virtually any instruction set architecture can be implemented as a variation of the Harvard architecture. I implemented my 6502 soft-core model as a Harvard architecture in the design phase, but collapsed the instruction and data busses into a single bus in the implementation phase.

In my opinion, the 6502 as a Harvard architecture does not have as much of an advantage as might otherwise be perceived. Given the CISC style of its instruction set, especially the indirect addressing modes, the instruction bus is fairly underutilized after the instruction's operands are fetched. On the other hand, a typical RISC-style instruction set, with its single-word instructions, might easily gain a significant performance improvement, and likely ease the implementation of pipelining, with a Harvard architecture.

All that said, the modern von Neumann processor with separate level 1 instruction and data caches is essentially a Harvard architecture processor. The problem of loading and modifying the instruction memory is the single greatest issue in a Harvard architecture. The modern processor's memory management unit addresses this problem rather well without having to rely on a lot of specialized instructions to perform the rather common task of loading and reading instruction memory.

Specialized instructions like those of the 8051 family to read tables in instruction memory do not generally provide the flexibility that many programmers desire / require. Most of my 8051 implementations, including the ones that used Flash / EPROM processors, generally collapsed the Harvard architecture separation when external memory was accessed. This made loading and executing larger programs much easier, although it did reduce the system's security that is inherent in the Harvard architecture vis-a-vis the von Neumann architecture.

For instructions to access instruction and data memory across protection / privilege boundaries, look at the PDP-11 architecture: the prototypical CISC architecture.

_________________
Michael A.


Sat Apr 10, 2021 9:07 pm

Joined: Sun Dec 20, 2020 1:54 pm
Posts: 74
MichaelM wrote:
8051


I would consider the 80C390 and 80C400, but they use the common old "/PSEN & RD/WR" trick.

MichaelM wrote:
The modern processor's memory management unit addresses this problem rather well without having to rely on a lot of specialized instructions to perform the rather common task of loading and reading instruction memory.


A lot? Only two instructions are required: load data from the code area, and store data into the code area, usually 16-bit or 32-bit aligned. You don't need complex addressing, just the basic set.
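As a sketch of how minimal that pair of instructions can be (the function names, word size, and little-endian choice are illustrative assumptions):

```python
# Sketch: the two code-space access operations the post describes --
# an aligned load and an aligned store -- with nothing fancier.

WORD = 2  # 16-bit aligned

class AlignmentFault(Exception):
    pass

code_space = bytearray(64)

def load_from_code(addr):
    if addr % WORD:
        raise AlignmentFault(f"unaligned code-space load at {addr:#x}")
    return int.from_bytes(code_space[addr:addr + WORD], "little")

def store_to_code(addr, value):
    if addr % WORD:
        raise AlignmentFault(f"unaligned code-space store at {addr:#x}")
    code_space[addr:addr + WORD] = value.to_bytes(WORD, "little")

store_to_code(4, 0xBEEF)
print(hex(load_from_code(4)))   # 0xbeef
```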


Sat Apr 10, 2021 9:22 pm

Joined: Sun Dec 20, 2020 1:54 pm
Posts: 74
MichaelM wrote:
For instructions to access instruction and data memory across protection / privilege boundaries, look at the PDP 11 architecture: the prototypical CISC architecture.


Can anyone tell me more about this? I don't know the PDP-11; I have programmed the m68k, and from the 68000 to the CPU32 (683xx) I don't like how this stuff is managed.

In my EVS (68020) board the separation is imposed by a lot of trap system calls plus an external MMU.

Basically, the board boots in super-user mode, loads the user application from the serial line, sets the SP, changes the operating mode to user mode, and starts executing the user code, which cannot read/write anything in the code area: accessing those addresses triggers an interrupt from the MMU, which terminates the user application with "error, program terminated because it attempted to access the code-space".

If something in user-space needs to read/write something in the code area, it has to write its request into a "request buffer" and then invoke a system call (m68k "trap #"), so the CPU switches into super-user mode, the kernel serves the request, and then it switches back to user mode.
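The request-buffer / trap round trip described above might look roughly like this toy model (all names and the buffer layout are illustrative, not the actual EVS board code):

```python
# Sketch of the trap-based flow: user code fills a request buffer,
# issues a trap, the kernel serves it in super-user mode and returns.

code_space = {0x100: 0xCAFE}
request_buffer = {}

def trap_syscall():
    # CPU switches to super-user mode; the kernel serves the request,
    # then switches back to user mode on return
    if request_buffer.get("op") == "read_code":
        request_buffer["result"] = code_space[request_buffer["addr"]]

# user-space side: it cannot touch code_space directly, so it asks
request_buffer.update(op="read_code", addr=0x100)
trap_syscall()
print(hex(request_buffer["result"]))   # 0xcafe
```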

This solution works, but I have actually recently found an exploit that turns off the MMU protection ... :o


Sat Apr 10, 2021 9:39 pm

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
A good summary of the PDP-11 instruction set can be found here. A more complete description of the architecture and its instruction set can be found on bitsavers.org.

_________________
Michael A.


Sun Apr 11, 2021 3:58 am

Joined: Sun Dec 20, 2020 1:54 pm
Posts: 74
OT, thanks! This seems to be happening just in time to give me the opportunity to test my latest program: a web file downloader!

Code:
OrangeCube pdp11 # echo "https://pages.cpsc.ucalgary.ca/~dsb/PDP11" > url
OrangeCube pdp11 # myNET-get-files-from-url-v7 url
list_preparing, getting /index.html ... success
preparing /list ... done
downloading [/Family.html] ... success
downloading [/InsSumm.html] ... success
downloading [/InsForm.html] ... success
downloading [/SingOp.html] ... success
downloading [/DoubOp.html] ... success
downloading [/Control.html] ... success
downloading [/DoOpReg.html] ... success
downloading [/Float.html] ... success
downloading [/AddrModes.html] ... success
downloading [/AddrSum.html] ... success
downloading [/PicModes.html] ... success
downloading [/MemMgt.html] ... success


Code:
OrangeCube pdp11 # echo "http://www.bitsavers.org/pdf/dec/pdp11/" > url
OrangeCube pdp11 # myNET-get-files-from-url-v7 url
list_preparing, getting /index.html ... success
preparing /list ... done
/pdf/dec/ is ignored
preparing folder /1103 ... success
list_preparing, getting /1103/index.html ... success
preparing /1103/list ... done
/pdf/dec/pdp11/ is ignored
downloading [/1103/1103_Schematics.pdf] ... success
downloading [/1103/EK_KUV11_TM_LSI11_WCS.pdf] ... success
downloading [/1103/EK_LSI11_TM_002.pdf] ... success
...


The files are downloaded; now I just need to find the time to read and study them :D


Sun Apr 11, 2021 11:05 am

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 123
So, I think I am going to revert a few videos worth of content.

When trying to debug an issue that didn't appear in simulation but happened very reliably on an FPGA, I tried reverting to the latest youtube episode (well, the one I am editing now, so the next one) and lo and behold it worked there.

What I did was convert both memories to block RAM, and simply inverted the clock to both of them. I had to add a register to the fetch that I think ended up pipelining it (as it now has a branch delay slot). But it all runs at around 29 MHz on my slow up5k FPGA.

So I am strongly tempted to go back to this point in time and make these changes so I can run it on an FPGA and keep testing it there periodically so I can make sure it doesn't break with each incremental change.

I have also since figured out a much better way to introduce microcode. And I have figured out a bunch of simplifications to the instruction set to make microcode easier to implement. So I won't be reverting my learnings and it wasn't a waste of time.

Also, thanks for the discussion on Harvard vs von Neumann. I think I will stick with Harvard. For some reason I thought I needed to take two cycles to read/write data memory but it turns out all I needed to do was invert the clock for the data memory, and I can make load/store single cycle. So Harvard will be much simpler after all.

That's another thing, the very next video after this one is where I turn memory ops into two cycles and I now realise I don't need to do that. At least not yet anyway.

But anyway, I am editing a video now and will upload it hopefully today.


Sun Apr 11, 2021 2:41 pm

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 123
Okay, next video is up. I implement the ability to stall the processor which lets me easily single step the processor on an FPGA. I had the idea that this could allow the processor to more easily support multi-cycle instructions but I am not sure that's a good idea now. It's still useful for single stepping though.

[024] Stall! https://youtu.be/fRRwy9xFTSk

So, since I think I am going to redo my cache of videos and thus I am going to record the next one in the future, I guess this gives an opportunity to do microcode better and get some input from you all.

What I plan to do is pipeline just the fetch (and maybe decode) sections of the processor. Register fetch, ALU and memory access will not be pipelined so I don't have to deal with data hazards. Though data hazards aren't too bad to deal with in a 3 stage pipeline if I move decode and register fetch into its own stage. I definitely don't want to have any more stages than that though.

So then the question is, if fetch is pipelined, what does that look like in the microcode? Usually there's a "fetch" micro op, but that's not required. I guess it's just a "reset micro program counter and enable IR register latching" control signal right?

The other question is, what's the safest way to annul a pre-fetched instruction on a branch taken? I put a multiplexor after the instruction register that would inject a nop, but I think that's not safe, since that will change the micro-instruction instantly when that comes high. I guess I could register that signal so it happens on the next cycle? I guess that's the same as putting the multiplexor before the instruction register?

Also... do you think it's possible to put the pipeline registers after the decode stage rather than before it? The microprogram address would be fed directly from memory in that case, though I suppose a multiplexor could select between directly from memory and a registered value. But I am thinking it would make more stable control signals, and better distribute the work between stages.

Anyway, I really want to pipeline at least the fetch because it essentially doubles the speed of the processor and the only complexity is trying to get rid of the branch delay slot, which I think is easy? What do you think?
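On the annul question: registering the flush signal is indeed equivalent to muxing a NOP in ahead of the instruction register, since both delay the injection by one cycle. A toy two-stage model (illustrative only, not the rj16 implementation) shows the prefetched instruction being squashed:

```python
# Sketch: annulling a prefetched instruction on a taken branch by
# registering the flush signal. Toy 2-stage fetch/execute pipeline.

program = ["add", "beq taken", "sub (prefetched)", "...", "mul (target)"]
NOP = "nop"

pc, ir, flush = 0, NOP, False
trace = []
for _ in range(5):
    # execute stage: a registered flush turns this cycle's IR into a NOP
    trace.append(NOP if flush else ir)
    taken = (not flush) and ir.startswith("beq")  # pretend the branch is taken
    next_ir = program[pc] if pc < len(program) else NOP  # fetch stage
    flush = taken                # register the annul for the NEXT cycle
    pc = 4 if taken else pc + 1  # redirect fetch on a taken branch
    ir = next_ir

print(trace)   # ['nop', 'add', 'beq taken', 'nop', 'mul (target)']
```

An unregistered mux after the IR would change the micro-instruction mid-cycle, which is why the registered version seems the safer of the two.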


Sun Apr 11, 2021 4:30 pm

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
rj45 wrote:
So then the question is, if fetch is pipelined, what does that look like in the microcode? Usually there's a "fetch" micro op, but that's not required. I guess it's just a "reset micro program counter and enable IR register latching" control signal right?
Your statement regarding pipelining fetch is probably right. I tend to think about it in reverse: the execute stage is pipelined with the fetch / decode stage. IOW, I keep the machine running in sync with the instruction and operand fetch cycles, and then, when all operands are ready, I execute the ALU operation required by the instruction. The destination register is also written in the execute stage.

When I initially got started with my core, I was thinking that I was writing for the current cycle. When I adjusted my thinking to writing the microcode in the current cycle for the next cycle, everything started working as I expected. This means that there is a pipeline in the microprogram / microcode. Therefore, I had to preload the ALU execute register and the IR with a NOP value. I chose to represent the NOP condition of all of my functions with all zeroes, so that only left loading the IR with the NOP instruction code on RESET.

Finally, I set up my address generator to generate the address of the next instruction, and I simultaneously decode the instruction as I capture the opcode in the IR. Since I seldom use the contents of the IR, especially in the execute stage, I really don't need an IR, but I kept it in my design for completeness. It has come in handy a time or two when I wanted to use the current IR value to control some function later.

Since my core only overlaps the fetch and execute stages, my core is only partially pipelined. However, that partial pipelining of a 6502 provides about a 40% improvement in execution speed for the same clock speed versus a standard 6502. Another consequence of the partial pipelining scheme that I use with my 6502 core is that a few instructions like CLI/SEI are not interruptable. IOW, I don't allow these instructions to vector to an interrupt if the IRQ / NMI signal is asserted when the fetch of the next instruction is made. In this way, rather than using two clock cycles in their implementation to ensure the P register is properly set, these instructions complete fully during the following instruction fetch cycle and this thereby allows their overlapped execution cycle to properly set the P register.

Pipelining the microcode is natural because of the synchronous implementation of the microprogram controller and the RAM / ROM holding the microprogram (in an FPGA block RAM). Once I made that initial adjustment to my thinking, the development of the microprogram was fairly easy and straightforward. Single cycle implementation of the microprogram controller and the microprogram memory is possible only if the microprogram memory is asynchronous. Such an implementation, although possible in an FPGA with the microprogram in the RAM LUTs, is not as natural as using the synchronous block RAMs. You can get the effect of single cycle operation by clocking the microprogram controller on the rising edge and the microprogram memory on the falling edge. This will stress the timing of your FPGA, but it is a feasible approach, and it reduces the mental gymnastics that are necessary to track two pipeline levels when both the controller and the memory are clocked on the same edge.
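The one-cycle-ahead effect described above falls out of the synchronous ROM read: the address presented in the current cycle yields the control word for the next cycle, which is why the output register must reset to a NOP. A toy model (all names illustrative):

```python
# Sketch: a synchronous microprogram ROM puts the microcode one cycle
# "ahead" -- the address presented this cycle produces the control word
# used NEXT cycle, so the output register resets to a NOP control word.

NOP_CTRL = "nop"
micro_rom = {0: "fetch", 1: "decode", 2: "execute"}

upc = 0              # microprogram counter
ctrl_reg = NOP_CTRL  # ROM output register: preloaded with NOP at reset
issued = []
for _ in range(4):
    issued.append(ctrl_reg)                   # this cycle's control word
    ctrl_reg = micro_rom.get(upc, NOP_CTRL)   # registered (synchronous) read
    upc += 1

print(issued)   # ['nop', 'fetch', 'decode', 'execute']
```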

_________________
Michael A.


Sun Apr 11, 2021 6:33 pm

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 123
Hmmm.... I don't want to build something that's going to hurt my brain to work on. There's something to be said for rigging up a ton of blinkenlights to all the control signals and knowing exactly what the computer is doing at each step. A pipelined processor tends to do some brain melty things, I found I needed to produce vcd files and study them in gtkwave for a good long time to try to figure out what's wrong when it went crazy. I would like to avoid that.

So if anyone feels like I am setting myself up for brain melt with pipelining just the fetch and maybe decode stages, then please say so now :-)

Michael, I am hoping I won't have so much microcode that it needs to be stored in block RAM. I am hoping it can be synthesized as logic (LUTs) instead. But I feel like in order to do that, the microcode must be kept as small as possible. Anyway, hopefully by not having to use block RAM I don't have to worry about that cycle delay. BUT if I do use block RAM, it's good to know I just need to think one step into the future. I am not sure how that will work with almost all instructions taking a single cycle but if I get there I will think about it.

One thing that I am curious about is: I stumbled upon a document in bitsavers about the IBM Clipper (was it IBM? I forget. And actually the ZPU works this way too). Anyway, what they did was hardwire a small core set of instructions and made those instructions the microcode instructions. But the key thing was, they were not just microcode instructions -- if the computer encountered one of those core instructions it would execute it directly via the hardwired logic. The only instructions that weren't executed that way were more complex instructions, in which case it would jump to a microcode routine to handle them.

The microcode "environment" is different from the normal program environment -- there's a different register set, and some of the registers are hardwired to things like operands and other instruction fields, and you would have special access to registers like maybe the MAR and PC. But to implement a two cycle memory op, you could:

Code:
    add MAR, rs, imm  ; MAR <- rs + imm operands
    move rd, MD       ; rd operand <- MD (memory data)


I find that idea very interesting. I am not sure what the core set of instructions would be, but I suspect just a move, add, branches, and a compare would mostly do. Probably very close to what I currently have implemented already.

Do you all think I should explore this idea? Or just leave it?


Mon Apr 12, 2021 12:24 pm

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
Could the ALU primitives form a recursive definition of a high-level language? 1st level: micro-operations. 2nd level: simple memory and I/O operations. 3rd level: complex operations like floating point, virtual memory, bit-mapped graphics. 4th level: device-dependent language.


Mon Apr 12, 2021 5:51 pm

Joined: Sun Dec 20, 2020 1:54 pm
Posts: 74
I think you can have a look at the IJVM; it is fully documented and comes with a microcode assembler, so it may inspire you :D


Mon Apr 12, 2021 10:40 pm

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
rj45 wrote:
I am hoping it can be synthesized as logic (LUTs) instead.
This is very possible, but I recommend against synthesizing the microprogram into the LUTs based on a reduction of a Sum-of-Products representation like I wrote about here. Instead, you can easily synthesize a ROM into LUTs by any number of techniques, but the one that I used for the MiniCPU 8-bit processor challenge may be better.

I think it was Intergraph, a company from Huntsville, AL, that developed the Clipper. A number of the processors were incorporated into accelerator boards (Microway comes to mind) that were often used like BBC second processors in PC/ATs to increase the processing power. Here in Huntsville, many small companies used these boards to lower the cost of running simulations on DEC 11/780 VAXes.

As for using the Clipper approach to microprogramming, it sounds like it was an interesting concept, but I am not up to speed on the Clipper architecture. I think that a straightforward discrete-logic or microprogrammed instruction controller would be better for you.

Pick the approach that appeals to you the best that is within your skill set to use. Looking forward to more of your posts on the rj16 processor.

_________________
Michael A.


Tue Apr 13, 2021 2:17 am
Profile

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
For small PROMs I like 22V10s and a table lookup. The last PAL I needed had logic that would add 0, +3, or -3 to a hex digit and do other logic operations. I split this into parts: Part #1 generates a test PAL and the logic equations; Part #2 takes the logic equations and re-edits them for the final version. The PAL here does decimal adjustment for excess-3 addition and BCD/binary conversion.
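For reference, the standard excess-3 adjustment rule such a PAL implements can be modeled like this (a sketch of the rule itself, not of the attached PAL equations):

```python
# Sketch of excess-3 decimal adjustment: after a binary add of two
# excess-3 digits, add 3 on a digit carry, subtract 3 otherwise.

def xs3_add_digit(a, b, cin=0):
    """a, b are decimal digits; work in excess-3 like the hardware would."""
    raw = (a + 3) + (b + 3) + cin      # binary add of the XS-3 codes
    carry = raw >> 4                   # digit carry out of bit 3
    nibble = raw & 0xF
    adjusted = nibble + 3 if carry else nibble - 3   # the +3/-3 fixup
    return carry, adjusted             # adjusted is again excess-3

for a, b in [(4, 5), (7, 8)]:
    c, d = xs3_add_digit(a, b)
    print(a, "+", b, "->", "carry", c, "digit", d - 3)
# 4 + 5 -> carry 0 digit 9
# 7 + 8 -> carry 1 digit 5
```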


Attachments:
File comment: PAL using the lookup table
paladj.txt [4.16 KiB]
Downloaded 81 times
File comment: table lookup PAL
pldtab.txt [1.07 KiB]
Downloaded 72 times
Tue Apr 13, 2021 4:43 am