 [ 30 posts ]  Go to page 1, 2  Next
 Possible 16-bit design, need guidance on pipelines & fetches 

Joined: Sun Jul 05, 2020 9:00 pm
Posts: 15
I'd like to build a CPU in FPGA. I've thought about making a modified Gigatron, but part of me wants to start from scratch. But I could use pointers on how to go forward.

Here is what I've considered so far. It should be 16-bit with 20-24 address bits, preferably Von Neumann or modified Harvard. Instructions with no operands or an 8-bit operand should take a single cycle. I'd like to be able to handle at least 16-bit operands, though 24-bit should be just as fast, assuming a 16-bit memory bus.

I'm using the CMOD A7-35T, and it comes with an 8-bit, 512K, 10ns SRAM. It has a total of 52 GPIO lines, counting the PMOD connector. So if I want 16-bit SRAM, I'd need to use GPIO lines to add a 2nd one. In a way, that would be better than addressing 1 separately, since having 2 8-bit buses avoids alignment issues and can allow reading from and writing to 2 different addresses. So if the high byte becomes the low one, then the low byte can access the next address. That also avoids corrupting half of a word during byte writes, or having to read an entire word before writing to it to avoid corruption.
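The two-bank idea above can be sketched in a few lines. This is only an illustration under assumptions that are not in the post: bank 0 holds the even byte addresses, bank 1 the odd ones, each bank has its own address lines, and words are little-endian. The names `bank_addresses` and `TwoBankMemory` are made up for the example.

```python
# Sketch: unaligned 16-bit access across two independent 8-bit SRAM banks.
# Assumptions (not from the post): bank 0 = even byte addresses,
# bank 1 = odd byte addresses, little-endian words.

def bank_addresses(byte_addr):
    """Return (bank0_row, bank1_row, low_byte_bank) for a 16-bit access."""
    low_bank = byte_addr & 1      # which bank supplies the low byte
    row = byte_addr >> 1          # row inside the bank holding the low byte
    if low_bank == 0:
        return row, row, 0        # aligned: both banks use the same row
    return row + 1, row, 1        # unaligned: bank 0 reads the next row


class TwoBankMemory:
    def __init__(self, rows):
        self.banks = [bytearray(rows), bytearray(rows)]

    def write_byte(self, byte_addr, value):
        # Only one bank is touched, so no read-modify-write of a whole word.
        self.banks[byte_addr & 1][byte_addr >> 1] = value & 0xFF

    def read_word(self, byte_addr):
        row0, row1, low_bank = bank_addresses(byte_addr)
        b0, b1 = self.banks[0][row0], self.banks[1][row1]
        low, high = (b0, b1) if low_bank == 0 else (b1, b0)
        return low | (high << 8)
```

Because each bank gets its own row number, an odd-addressed word still comes back in a single access; the cost is the extra set of address pins.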

The above architecture description sounds like it would cause a problem, but I have a solution. Now, the issue is that the boot address will likely be high. There would not be enough lines to have a 16-bit memory bus to drive an external ROM. So the 16-bit port can give up 8 of its lines to access memory past 512K words, or to access memory on the board, and I can likely spare a line to signal a change in role. If I were designing a board with the FPGA directly on it, then such a hack would not be needed, since all the memory lines would be on the motherboard.

Now here is where I am hung up in my mind. I don't know how to do pipelining and fixing things to where random accesses would be allowed. Some make memory accesses by software a pipeline stage. It would be nice to avoid that if possible or to make it an optional stage somehow. Maybe adding block instructions would help in that large moves between memory or between memory and a port could be done. That way, fetches can be halted until the instruction is finished, thus allowing single-cycle transfers during a block operation. It does seem that using odd and even memory banks with independent address lines could provide some advantage since if one does just low byte or high byte transfers, parts of future fetches can be done on the unused lines.

Also, I don't know how to handle branches. I know in many RISC designs, there are delay slots. That means the next instruction is already in the pipeline and cannot abort during a branch. I'd like to figure out how to do queueing and pipelining before I start since that would affect instruction speed. While building a Harvard design would be simpler here, I'd rather go with Von Neumann.

So let's say I have a fetch register with enough bits for the maximum instruction size (with the control unit ignoring what is unused). Let's say that a given instruction is 3 bytes long. So being a 16-bit machine, that will take 2 cycles to fill the fetch register (along with the beginning of the next instruction, which I don't want to discard/ignore). So there needs to be a way to sort what comes in, maybe partially decode it to get the length. There would likely need to be a queue of sorts or multiple layers of fetch registers.
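The queue described above can be modelled roughly like this. It is a behavioural sketch, not a hardware design: the bus delivers two bytes per cycle into a byte queue, and an instruction issues only once all of its bytes have arrived. The function name `run_prefetch` and the `length_of` callback (a stand-in for the partial decoder) are assumptions for the example.

```python
from collections import deque

def run_prefetch(memory, length_of, max_cycles=100):
    """Simulate a byte queue fed 2 bytes per cycle by a 16-bit bus.

    length_of(first_byte) -> total instruction length in bytes
    (hypothetical partial decoder).
    Returns a list of (cycle, instruction_bytes) in issue order.
    """
    queue = deque()
    fetch_ptr = 0
    issued = []
    for cycle in range(max_cycles):
        # Fetch side: one 16-bit word (two bytes) per cycle while memory remains.
        for _ in range(2):
            if fetch_ptr < len(memory):
                queue.append(memory[fetch_ptr])
                fetch_ptr += 1
        # Issue side: at most one instruction per cycle, only when complete.
        progressed = False
        if queue and len(queue) >= length_of(queue[0]):
            need = length_of(queue[0])
            issued.append((cycle, bytes(queue.popleft() for _ in range(need))))
            progressed = True
        # Nothing left to fetch and nothing could issue: stop.
        if fetch_ptr >= len(memory) and not progressed:
            break
    return issued
```

In this model a 3-byte instruction needs two fetch cycles, and the fourth byte that arrived with it stays in the queue as the start of the next instruction instead of being thrown away.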

I'd even be willing to entertain having a fetch system that runs faster than execution speed. Maybe have at least 2 fetch timeslots per execution slot and selectively allow one of those slots to be used for random accesses by the code.

Any thoughts, ideas, or clarifications?

Last edited by Sugarplum on Tue Aug 03, 2021 9:03 pm, edited 2 times in total.

Mon Feb 22, 2021 12:13 pm

Joined: Sun Dec 20, 2020 1:54 pm
Posts: 73
I don't know, but I don't personally like complex designs. Pipelining CPU-stages is rather tricky. It's difficult and very time-consuming. I won't do it. But it's up to you.

Mon Feb 22, 2021 12:29 pm

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 205
Location: Huntsville, AL

It appears that you've given some thought to your project to build in a CMOD-A7 prototyping module. It's a good module, and I am also considering it for building a number of projects before moving on to build the final product.

I agree that initially pipelining your processor is a tall order. My recommendation is to work up to the processor that you desire by starting your project strictly within the FPGA. In other words, the Artix-7 FPGA that you selected has 200kB of internal dual-ported RAM. You can configure that RAM in various widths: 4, 8, 16, ..., 72, or wider. Get your instruction set defined and working in a small environment, and then worry about pipelining. A number of projects on this forum and on others provide extensive descriptions of developing their processors in this manner.

This iterative approach allows you to concentrate first on the most important architectural feature of a processor: its instruction set. Once that is well developed, and supported with some tools, then the move toward higher performance through pipelining is a natural progression.

Michael A.

Mon Feb 22, 2021 1:04 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1632
I'd agree: start simple. A CPU which can only run a NOP is a start.

A simple machine will take several ticks to go through several execution states for each instruction. Branches should not be any special difficulty here, I think. Stay away from external memory and pipelining and every other unnecessary feature - get something working first.

Mon Feb 22, 2021 5:00 pm

Joined: Wed Nov 20, 2019 12:56 pm
Posts: 71
I agree about getting something working first - but adding pipelining to an existing design is very difficult. Realistically, you'll start over when you get to that point, and redesign based on what you've learned.

Pipelining isn't all *that* scary, provided you make sure you've fully grasped the concept of inserting bubbles where necessary, the concept of hazards to detect where bubbles are needed, and pipeline flushing when control flow changes. (Delay slots are one solution to that problem.)

Also make good use of simulation - GHDL and Verilator are both useful. (It *is* possible to debug a design entirely by running it on an FPGA and using SignalTap or ChipScope, but it's not recommended!)

Mon Feb 22, 2021 8:57 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1632
I think the crucial thing is to realise that the project you're doing right now isn't going to be your last project - and you are going to learn from it. There's no need to put into it every idea you have, and in fact because you'll be wiser for the next project, it's better for that to be the bigger one.

Mon Feb 22, 2021 9:03 pm

Joined: Sun Jul 05, 2020 9:00 pm
Posts: 15
Thanks for the attempts to help. Now if you can show me the deep end of the pool. And yes, it could be the last project I build, if I get to this one. I don't have the time for hand-holding. I won't get into the suspected cardiac symptoms I've been having. So I am after strong details, as techy and geeky as possible. It is important that I never start a project unless I can see it to the end in my mind. I consider it foolish to get started until I have finished it in my mind. I ask that others respect that if possible.

The reason to build the project is to use external RAM. Of course, I could start with 8-16 and use the 512K SRAM, but for what I'd want to do, I'd wire in a 2nd one.

For me, the stuff I mentioned is what I want to work on first. That will dictate my motherboard design and instruction set. It would be nice to be able to do as much as possible within a single cycle. Here is what I am thinking. I need to fill some sort of queue since instructions will be variable-length, and things will need to be sorted into the fetch register(s). One way to help with that is to give the opcodes 2 "instruction family" bits, whether 0-3 bytes as arguments.
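The "instruction family" bits could be decoded along these lines. The sketch assumes an 8-bit opcode whose top two bits (7 and 6) give the operand byte count; the post does not say where the bits would live, so that placement and the names `decode_length` and `split_instructions` are purely illustrative.

```python
def decode_length(opcode):
    """Total instruction length in bytes.

    Assumption: bits 7..6 of the opcode hold the operand byte count (0-3).
    """
    operand_bytes = (opcode >> 6) & 0b11
    return 1 + operand_bytes


def split_instructions(stream):
    """Cut a byte stream into instructions using only the family bits."""
    out, i = [], 0
    while i < len(stream):
        n = decode_length(stream[i])
        out.append(stream[i:i + n])
        i += n
    return out
```

The point of putting the length in fixed bits is that the fetch logic can find instruction boundaries without a full decode.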

I'm still trying to think about how to deal with jumps, and how to clear any pipeline or queue when they are encountered, and not deal with spare slots.

Due to the Von Neumann handicap, I'd probably want to include some block memory instructions. For instance, that could be useful if I wanted to bit-bang pixels to the port. If nothing else, do that like x86 and have a counter register for such instructions. Thus fetching is stalled and single-cycle transfers can be done. While I'd prefer using DMA or some other means to do video, being able to do a block copy to the port would make bit-banging possible.

I've also considered that in addition to using RAM, use some of the BRAM as an additional memory pool with its own bus. So it could work essentially as a Harvard machine when using 0-1 operand bytes and using the internal memory.

For some things, I'd just skimp on it, such as how I code. I saw the Menlo Gigatron source, and I can't, for the life of me, figure out why an RC Adder was coded. The original Gigatron uses 2 "fast" carry adders. Then I read in the notes about "a problem with the blue." Well, the blue is on the high nybble. So it seems that might be cleared up if the Verilog one were to use a single adder using Verilog primitives rather than XOR and AND on single bits. So for a lot of things, it is faster or more efficient to use the primitives. So I won't be overthinking the ALU. I'd likely go for a distributed ALU since Verilog makes that rather easy.

Tue Feb 23, 2021 2:31 am

Joined: Sun Dec 20, 2020 1:54 pm
Posts: 73
If I were you, I would clone a well-tested RISC-V pipelined core, study it alongside a well-written book, and master it so you can design your home-made board with a decent C toolchain and program something useful.

Sugarplum wrote:
The reason to build the project is to use external RAM. Of course, I could start with 8-16 and use the 512K SRAM, but for what I'd want to do, I'd wire in a 2nd one.

  • Asynchronous Static Parallel RAMs, ASP-RAM
    you need to design and to fine-tune a finite state machine that handles all the bus signals and timing for read and write cycles, which may take n-clock cycles to complete; "n" depends on the features of the chosen RAM, compared to the FPGA's clock used to manage the FPGA's external bus
  • Synchronous Static Parallel RAMs, SSP-RAM
    if the RAM is as fast as the FPGA, you only need to design and to handle the bus signals, since synchronous RAM always requires 1 clock cycle to complete a read/write (it's like BRAM, but external).

If the IO operation cannot be guaranteed to complete within 1 cycle, a pipelined CPU needs to stall the pipeline until completion, and with super-scalar pipelined CPUs things get even worse since you also have to introduce special instructions (EIEIO, ...) to stall multiple pipelines in order to avoid "out-of-order" issues and hazards.
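As a rough illustration of the "n" mentioned for asynchronous RAM, the cycle count can be estimated from the RAM access time and the bus clock. This is a back-of-the-envelope sketch; `margin_ns` is a placeholder for pin, board, and setup delays that would have to come from the datasheets, and both function names are invented for the example.

```python
import math

def wait_cycles(access_ns, clock_mhz, margin_ns=0.0):
    """Clock cycles the bus state machine holds a read of an async SRAM."""
    period_ns = 1000.0 / clock_mhz
    return max(1, math.ceil((access_ns + margin_ns) / period_ns))


def stall_cycles(access_ns, clock_mhz, margin_ns=0.0):
    """Extra cycles a pipeline expecting 1-cycle memory would have to stall."""
    return wait_cycles(access_ns, clock_mhz, margin_ns) - 1
```

For a 10ns part, a 100 MHz bus with no margin gives n = 1, while 5 ns of assumed margin pushes it to n = 2 and one stall cycle per access.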

Tue Feb 23, 2021 10:43 am

Joined: Sun Jul 05, 2020 9:00 pm
Posts: 15
@DiTBho -- Thank you.

I'm one of those who can learn just as much from thinking and discussion as others can from doing. If it's something I know nothing about and have no frame of reference for, I file it away.

I had the pleasure of getting to know the Myers brothers several decades back. Aurie was a former IBMer, and while Charlie hadn't done as much in his life due to his cerebral palsy, both were very bright, and they both had extensive radio experience. They both told numerous exciting stories. Sadly, they were both up in years and had to go to a nursing home, where they subsequently passed away. They told of when robberies got quite bad in their area, yet nobody messed with them. They had unknowingly repaired CB radios over the years for the Outlaws (a branch of the Hell's Angels biker gang), maybe the Mafia, and likely also some Klansmen. Plus, they had civic ties. So they had friends on all sides of the law and the criminals had too much loyalty to mess with them.

Charlie told of when he worked at the remote site for the local radio station as the station engineer. That's mostly a boring job since usually there was nothing to do but take readings every half-hour. So he'd bring a book and an alarm clock. One day, he was nearly asleep when some popping (arcing) and clicking (relays) got his attention. He didn't know what happened, managed to reset the transmitter, and wrote in the FCC logs that the transmitter was off the air due to an unknown reason. As he rested at home, a hunch came to mind. He returned with plastic tongs and a can of air freshener. His hunch was right; it stank in there. He removed the dead rat from the transmitter and amended his notations.

He also told of how the laws used to hold the station engineer responsible for content instead of the station manager as they do now. One day, someone was on a call-in talk show and used the worst language possible, so he pulled the plate voltage until the cursing tirade was over and wrote it was due to foul language.

Anyway, I asked how to access more RAM than a CPU was capable of addressing, and Charlie kept talking about memory paging. I didn't know enough to understand what he meant, but as I've studied how to make a CPU, I now know what he was talking about. So nobody should fear talking over my head on this, since even if I don't understand it, it is not wasted.

Yes, it would be helpful to study some sort of finished CPU project. Let's start out a little simpler. The Gigatron has a 2-stage pipeline. The PC/IP addresses the ROM and loads it into the IR and DR registers. Then the execution unit executes from those registers. On jumps, there is a delay slot and the instruction after the branch always executes. When you have just one delay slot, it is rather easy to code around it and embrace it. You could move the instruction that is before the branch to after it or move the first instruction of the target address there. Okay, so how would one remove this delay slot if one wanted to? How would one insert a bubble into the path? Could that be as simple as looking for a jump during a fetch and then delaying the PC by one cycle? Wait, would that cause relative branches to jump twice and go to the wrong address? Maybe adding a 2nd set of registers could help, but then wouldn't you still have a delay slot while things would be running 2 cycles behind?
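One way to picture the difference between a delay slot and a bubble is a toy two-stage fetch/execute model. This is not the Gigatron's actual logic, just a sketch under stated assumptions: each fetched instruction carries its own address in the pipeline register, and a relative jump is computed from that carried address rather than from the live PC, which is one way to avoid the "jumps twice" worry. The instruction tuples and the `run` function are invented for the example.

```python
def run(program, squash, max_cycles=50):
    """Two-stage fetch/execute model.

    program: list of ("nop",), ("jmp", offset) relative to the jump's own
             address, or ("halt",).
    squash=False: the instruction fetched behind a jump executes (delay slot).
    squash=True:  that instruction is replaced by a bubble (None).
    Returns (executed_addresses, cycles_used).
    """
    pc = 0
    fetched = None        # pipeline register: (address, instruction) or bubble
    executed = []
    cycles = 0
    for cycles in range(1, max_cycles + 1):
        # Fetch stage reads the current PC (bubble past the end of the program).
        incoming = (pc, program[pc]) if pc < len(program) else None
        next_pc = pc + 1
        # Execute stage works on what was fetched in the previous cycle.
        if fetched is not None:
            addr, instr = fetched
            executed.append(addr)
            if instr[0] == "halt":
                break
            if instr[0] == "jmp":
                next_pc = addr + instr[1]   # uses the address carried with the jump
                if squash:
                    incoming = None         # bubble: kill the wrongly fetched instruction
        fetched, pc = incoming, next_pc
    return executed, cycles
```

With the program `nop; jmp +3; nop; nop; halt`, both modes take the same number of cycles in this model. The delay-slot version does useful work in the cycle after the jump, while the squashing version spends that cycle on a bubble.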

Tue Feb 23, 2021 3:05 pm

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 123
You asked for an epic rant, so here's a super rough epic rant for you. Hopefully it's mostly coherent, I don't really have time to try to make it more concise.

Okay, rj16, the third CPU I built, was pipelined. I built it in Go but the principles should still apply. It was a classic 5 stage pipeline.

I built it first as a single cycle machine, then pipelined it. It took me a couple weeks to work out all the bugs, but it was a good exercise. Pro-tip: the bugs you will face are race conditions in time, make sure, however you build it, you can see good traces of the inner signals over time. I built a VCD trace output to the Go simulator in order to figure out what was going on, and that sped up the debugging process at least 20 fold. I should have done that from the very start.
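For anyone who wants the same kind of trace from a home-grown simulator, a minimal VCD writer is short. The sketch below is in Python rather than Go and is not the rj16 code; the scope name `cpu`, the function name `write_vcd`, and the one-sample-per-time-step layout are assumptions for the example. It only handles unsigned integer values and fewer than about 90 signals.

```python
def write_vcd(signals, samples, timescale="1ns"):
    """Render samples as VCD text.

    signals: list of (name, width_in_bits)
    samples: list of dicts name -> int, one dict per time step
    """
    # VCD identifier codes are printable ASCII characters starting at '!'.
    ids = {name: chr(ord("!") + i) for i, (name, _) in enumerate(signals)}
    lines = [f"$timescale {timescale} $end", "$scope module cpu $end"]
    for name, width in signals:
        lines.append(f"$var wire {width} {ids[name]} {name} $end")
    lines += ["$upscope $end", "$enddefinitions $end"]
    last = {}
    for t, sample in enumerate(samples):
        changes = []
        for name, width in signals:
            value = sample[name]
            if last.get(name) != value:      # VCD records changes only
                last[name] = value
                if width == 1:
                    changes.append(f"{value}{ids[name]}")
                else:
                    changes.append(f"b{value:b} {ids[name]}")
        if changes:
            lines.append(f"#{t}")
            lines.extend(changes)
    return "\n".join(lines) + "\n"
```

The returned text can be written to a `.vcd` file and opened in a waveform viewer such as GTKWave.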

For the external memory, I would recommend against doubling up the RAM to make a wider bus -- the main reason being the RAM that's close to your FPGA is going to run a LOT faster than RAM that is far away over external pins. Running PCB traces and 0.1" pins at really high clock rates is an art, and one you probably want to avoid if possible. 20 MHz should be fine.... maybe even 30 MHz. Beyond that and you get resonant frequencies that cause all kinds of havoc unless you handle the reflections in the lines, and have paired ground paths and other arcane things that are totally beyond me.

This is where the BRAM inside the FPGA is awesome. The FPGA synthesizer software will handle making the wires as short as possible to run it crazy fast. It might be better, then, to have a cache inside the FPGA, and use the external RAM with a DMA circuit to sync it with the internal cache and page it in and out. Then the fact that it's 8 bits wide is a non-issue; it doesn't need to run as fast as the CPU. Most software has good "cache locality" -- it tends to access the same small region of memory over and over again. So having a blazing fast cache more than makes up for a narrow pipe to main memory. All modern CPUs work this way: main memory is at least 100x slower than the CPU, and if it isn't, it usually has a really high latency.
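The locality argument can be checked with a toy model. The sketch below uses a direct-mapped cache, which is just one simple organisation chosen for illustration; the post does not say what kind of cache to use, and the class name and sizes are made up.

```python
class DirectMappedCache:
    """Toy direct-mapped cache that only tracks hits and misses."""

    def __init__(self, lines, line_bytes):
        self.lines, self.line_bytes = lines, line_bytes
        self.tags = [None] * lines
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        block = addr // self.line_bytes
        index, tag = block % self.lines, block // self.lines
        if self.tags[index] == tag:
            self.hits += 1
            return True
        # Miss: this is where refill logic would pull the line from external RAM.
        self.tags[index] = tag
        self.misses += 1
        return False
```

A loop that walks the same 64 bytes ten times through a 4-line, 16-byte-per-line cache misses only on the first touch of each line, so the slow external RAM is visited 4 times for 640 accesses.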

As for pipelining: The simplest thing you can do is strictly follow a RISC architecture and have only "single cycle" instructions and avoid state machines and multi-cycle instructions. This means that every pipeline stage either stalls or completes its work each cycle, and there are registers between stages to take the results of the previous stage. This means you need a Harvard architecture, but you can get that by having a separate instruction and data cache that share the same underlying memory. And then it's a Von Neumann architecture to the programmer; the Harvard-ness is hidden. This also means your instruction cache has to be the same width as your instructions so you can fetch in a single cycle. It also means you need fixed-width instructions. This is the beast most modern CPUs hide under the covers: there's a decoder circuit that splits the more complex instructions down into a series of micro-ops that are all single-cycle and fixed-width, and the core of the beast is a super simple deeply pipelined RISC processor that has the shortest critical paths they can get.

But even given that, hazards will cause the pipeline to stall. For example, you can't use the value read from memory in the very next clock cycle because the value won't be available from memory until the end of the cycle, but the ALU takes a whole cycle to propagate the carry signal through the adder (or hardware multiplier). So the execute unit needs to be able to signal a stall, if it detects the value being used was read in the memory unit.
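The stall condition described above can be written as a small check. The sketch assumes the textbook five-stage arrangement where a load's result is not available to the instruction immediately behind it, and where forwarding already covers ALU-to-ALU dependencies. The instruction dictionaries with `op`, `rd`, and `srcs` keys are a made-up format for the example.

```python
def needs_stall(in_execute, in_decode):
    """True when decode must wait a cycle for a load that is still in execute."""
    return (
        in_execute is not None
        and in_execute["op"] == "load"
        and in_execute["rd"] in in_decode["srcs"]
    )


def schedule(program):
    """Return the issue order, with None marking each inserted bubble."""
    out = []
    prev = None
    for instr in program:
        if needs_stall(prev, instr):
            out.append(None)
        out.append(instr)
        prev = instr
    return out
```

An assembler or compiler can often avoid the bubble by moving an unrelated instruction between the load and its first use.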

You'll also get a very similar stall with branches, but it's usually 2 or 3 cycles instead of just one. The fetch unit has fetched an instruction or two by the time it's known that a branch will be taken. You could discard any fetched values until the branch resolves, which is what I did. Or you could keep fetching and make sure the fetched instructions can do no harm (which is what my latest processor will do), and you can flush them all if the branch is taken. This way if the branch isn't taken, it's no cost. To get better than that, you need to do branch prediction. You might be able to implement a very crude branch predictor. That's a bit beyond me though. But one thing you can do is calculate the branch as early as possible, in the decode stage. It's more hardware, but it means fewer wasted cycles on branches, which are annoyingly common in software.
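For a sense of what a crude branch predictor might look like, here is the common textbook scheme of 2-bit saturating counters indexed by the branch address. It is not something described in this thread, only a sketch of one well-known approach; the table size and names are arbitrary.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters: 0-1 predict not taken, 2-3 taken."""

    def __init__(self, entries=16):
        self.counters = [1] * entries   # start at "weakly not taken"

    def predict(self, pc):
        return self.counters[pc % len(self.counters)] >= 2

    def update(self, pc, taken):
        i = pc % len(self.counters)
        c = self.counters[i]
        self.counters[i] = min(3, c + 1) if taken else max(0, c - 1)


def mispredictions(predictor, trace):
    """Count wrong guesses over a trace of (branch_pc, was_taken) pairs."""
    wrong = 0
    for pc, taken in trace:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong
```

For a loop branch that is taken nine times and then falls through, this scheme guesses wrong twice on the first pass through the loop and once on later passes, since a single not-taken outcome does not flip a strongly-taken counter.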

Now that's the simplest way to build a pipelined CPU. A different way is to build in state machines so that instructions don't always have to be single cycle; they can stall the pipeline as they do multiple cycles of calculation. The ARM1 worked this way. It only had 3 pipeline stages, and memory access and branches took multiple cycles. In a way, that processor was a hybrid CISC-RISC machine. The core ALU instructions were RISC, but the memory and branch instructions were more CISC-like. The advantage of fewer pipeline stages is that fewer hazards arise, and fewer hazards mean less debugging time. But fewer pipeline stages also means a slower processor.

Another consideration is with data forwarding. This is done to avoid stalling when the result is available, it's just in a pipeline register instead of in the register file. So you can avoid waiting for the register to be written, and instead just forward the value. See the CPU74 thread on this forum, the last few pages have a discussion of a clever way of removing the writeback stage that I am keen to try. But for every stage in the pipeline, you need to handle more data forwarding situations and have more logic for it.
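The forwarding choice boils down to a priority mux in front of each ALU input. The sketch below assumes the classic five-stage naming (EX/MEM and MEM/WB pipeline registers), which is not necessarily how CPU74 or rj16 are organised; the dictionaries and the `read_operand` name are hypothetical.

```python
def read_operand(reg, regfile, ex_mem, mem_wb):
    """Pick the newest value of reg.

    Priority: the EX/MEM result (youngest), then MEM/WB, else the register file.
    ex_mem / mem_wb are {"rd": reg_number, "value": int}, or None for a bubble.
    """
    for stage in (ex_mem, mem_wb):
        if stage is not None and stage["rd"] == reg:
            return stage["value"]
    return regfile[reg]
```

Each extra pipeline stage that can hold a pending result adds another entry to that priority list, which is the growth in logic the post is warning about.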

Anyway, rant over. I hope it was helpful. If you want to dive into the deep end, the CPU74 thread is epic awesome, well worth the read. I have a few gems in the rj16 thread too, mainly around my discovery of a really easy way to get a full GCC toolchain that I plan to take advantage of eventually.

Wed Feb 24, 2021 12:14 am

Joined: Sun Dec 20, 2020 1:54 pm
Posts: 73
rj45 wrote:
It took me a couple weeks to work out all the bugs

  • Classic approach, fully MIPS32R2-spec pipelined (first pipelined design, 2015), from draft HDL to 100% passed test-cases, took two years, 7-8 hours per weekend, two people on the project. It has mul, div, cop0 for exceptions, and a crude branch predictor, later replaced with a smart branch predictor, which, out of those two years, took two months to design, implement, test and properly verify.
  • Classic approach, minimal RISC-V but pipelined (2020), from draft HDL to 100% passed test-cases, took 5 months, 6-8 hours per weekend, two people on the project. Only the minimal instruction set is implemented, but it has mul and div, cop0 for exceptions, and a very smart branch predictor.
  • Classic approach, minimal RISC-V multi-cycle (2021), from draft HDL to 100% passed test-cases, took 3 weeks, 5 hours per weekend, one person on the project. It has mul, div, cop0 for exceptions. I made it smaller and simpler in order to better fit my hobby with small FPGAs.
  • Modern approach, Scala/Chisel can speed up things 1:200 and it's what I experienced with RISC-V SoCs at work with my colleagues who are pros in that field (I am not), but they say there is a steep learning curve with Scala/Chisel before you can use it.

Wed Feb 24, 2021 1:17 am

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 123
DiTBho wrote:
rj45 wrote:
It took me a couple weeks to work out all the bugs

took two years, 7-8 hours per weekend

Hey, when I say 2 weeks, I had been working on that project a lot more than a few hours on the weekend, for several months. It was designed from the very beginning to be pipelined, so I didn't do anything that would make that hard. It was also just a simulator, and it was by no means complete. But it was initially a single-cycle machine, and the "upgrade" from that to a pipelined processor only took a couple weeks of hair pulling. But again, it was a simulator written in Go, and it was not a gate-level simulator, it was pretty high level.

Anyway, the point of what I said was to give sugarplum some meat to chew on, and that was the meat I had to give. I, personally, would prefer not to discourage him with lengthy timelines. I am looking forward to his build stories. I have really enjoyed reading others' journeys here on this forum.

Wed Feb 24, 2021 1:08 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1632
Agreed - it's great to hear all the build stories, whatever tradeoffs someone made, or whichever trajectory they took.

Wed Feb 24, 2021 3:07 pm

Joined: Sun Jul 05, 2020 9:00 pm
Posts: 15
I still want to deal with multi-length instructions with at least 0-3 operand bytes, and I'd probably want state machines. If nothing else, use transparent state machines, where things can take multiple cycles without affecting program flow, such as an instruction to send x bytes to the port from its own BRAM pool. For instance, I've considered modifying the Gigatron so that it has some 16-bit loads and stores from RAM, where an instruction would be allowed to take 2 cycles. It would then be up to the coder, or at least the assembler, to make the next instruction either a useful non-RAM instruction or a NOP to avoid hazards.

Him. LOL!

Wed Feb 24, 2021 7:56 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1632
(Yeah, a bit of a naughty gender assumption there - hopefully we can all learn from that too!)

Wed Feb 24, 2021 8:02 pm