Last visit was: Sun Aug 01, 2021 5:21 am
It is currently Sun Aug 01, 2021 5:21 am



 [ 305 posts ]  Go to page Previous  1 ... 17, 18, 19, 20, 21  Next
 74xx based CPU (yet another) 
Author Message

Joined: Mon Oct 07, 2019 2:41 am
Posts: 255
I am sure I read that xv6 had 386 version a few months back, and that version
was based from a PDP 11 unix. It is not finding a OS, it finding someting small
in todays age. (256K or less) A link for a FAT file system can be found here. Add a shell program and you have
O/S.
http://elm-chan.org/fsw/ff/00index_e.html
Sadly I don't have link for a C compiler. If I did I would use it myself.


Sun Dec 06, 2020 5:44 pm
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
I have put some grey fills and colours in the Data Path diagram to graphically show the paths of several instructions.

First drawing represents the normal register-register alu instructions.
As well as a register plus offset memory store

Attachment:
CPU74DataPathA.png


Next, is the datapath for the call instruction:

Attachment:
CPU74DataPathB.png


And finally, the return instruction:

Attachment:
CPU74DataPathC.png


The return from subroutine instruction is particularly interesting because it makes use of the wait cycle -required for the pipeline- to increment the Stack Pointer. So, although it looks that return might take one more cycle than the call instruction, in reality both take the same.

The Store instruction is intrinsically a 2 cycle instruction. The instruction decoder splits it into two cycles as it has always been the case since the early processor implementations. On the first cycle, address calculation takes place, on the second cycle the actual memory update is performed. This means that the ALU remains iddle while the memory is being updated, and this is ultimately a consequence of the relatively short pipeline.

A deeper pipeline would improve that. For example, on typical 5 stage pipelines, all load/stores can potentially be performed in one single cycle provided they do not create hazards. In such case, during the current load/store cycle, the ALU can be used to process the following instruction. This also requires an additional port on the register file to capture the register being stored to memory.

It is tempting because there's at least the very common case of the register save/restore code inserted at the prologue/epilogue of C functions that happens all the time, and that would greatly benefit from requiring exactly half the number of cycles. But as said on my previous post, it remains to be decided whether going that far is worth the extra complication.


You do not have the required permissions to view the files attached to this post.


Last edited by joanlluch on Mon Dec 07, 2020 11:00 am, edited 1 time in total.



Sun Dec 06, 2020 9:05 pm

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 122
Thanks for the link to xv6. I was not aware of that project and so far the document has been a very enjoyable read.

It is possible to fit an OS into a 16-bit computer with only 16 bit pointers, and there's various ways to allow the OS to have more than 64 KB of RAM.

But well, there is a certain charm to a nice retro feeling basic. That was my first programming language when I was a kid. Very likely that's the first thing I will do. I think I saw a project where someone ported a retro basic to C, that would be cool.


Sun Dec 06, 2020 11:16 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1443
Location: Canada
Quote:
but to be honest I have to learn everything about operating systems at their core, so I'm really very far from being able to use that. I also believe that only 16 bit addressing space is not enough for anything relatively serious, such as a proper operating system with virtual memory and memory protection, so maybe I may just implement my own basic interpreter and that would be it.
I have found TinyBasic first published in Dr. Jobb's Journal (May 1976) IIRC. to be a good starting point. It has been modified to run on several different architectures.

Quote:
I have put some grey fills and colours in the Data Path diagram to graphically show the paths of several instructions.
Wow! You do good diagrams.

_________________
Robert Finch http://www.finitron.ca


Mon Dec 07, 2020 3:56 am WWW

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 122
So I started a thread on my CPU.

I am a little bit intimidated by those diagrams :-) But it's really interesting and makes what's happening so clear. Makes me wonder about trying to incorporate diagrams like those into a front panel somehow, where LEDs light up when the computer uses each path. Hmmmmmmm....


Mon Dec 07, 2020 12:40 pm
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
I took some time to think about what would/could be needed for a 4 or 5 stage pipeline, and this is what I came up with:

Attachment:
CPU74Diagram.png


This shows that as we increase the number of pipeline stages, we require more and more hardware.

The first one is the basic, 2 stages one that I have been using so far, with just minor changes that help seeing the evolution to more stages.

The second one is the already tested and working 3 stage pipeline model. The differences with the basic one have been posted already. As a remainder, the changes are the incorporation of registered inputs to the ALU, so the execution stage becomes independent of the decoding stage, and the addition of a set of multiplexers at the outputs of registers to manage data hazards.

The third drawing is notably untested but it should reflect what's required for a 4 stage pipeline. I call it 4 stages, not 5 stages, because I omit the write-back stage latches that are normally placed in the computer architecture bibliography. The reason why the 5th stage latches can be omitted is because I solve hazards at the end of the cycle previous to the affected one, not at the beginning of the affected cycle. The difference in the data path diagram is that I have the hazard multiplexers placed before the pipeline registers, not after them. I'm unsure why all the documents on the 5 stage MIPS pipeline insist in placing the hazard multiplexers at the beginning of the stages, I think the approach found in books is counterproductive from the point of view of performance and I assume it is just like this to make it academically easier. Furthermore, I think the flow of data is identical if we take into account that in the typical MIPS model, registers are updated at the beginning of the cycle, and read at the end, so the net effect of the last latch in the MIPS model is null (Although I would be happy to be contradicted if someone can explain that to me)

Anyway, the stages being considered are:

Code:
FETCH  - DECODE  - EXECUTE1 - EXECUTE2&WB


The clock frequency remains the same (30-32 MHz), but the improvements come from reducing the number of consumed cycles per instruction. Particularly loads and stores.

This what I found to be required:

* The extra stage causes an additional level of hazard possible, so there's a second row of data forwarding multiplexers just before the alu input registers, which are placed in a priority location with respect to the ones already existing in the 3 stage pipeline model.

* An additional read port in the register file is required to capture the register that might be stored to memory. I named this the D_BUS. For memory writes, the pipeline works by computing the address through the ALU inputs at the same time that the data to be written is available in the D_REG. Let's assume the instruction
Code:
st.w r3, [r1, r2]
. The decoder places r1, r2 in the alu input registers, it also places r3 in the d_reg register. On the next stage (exec), the effective address is computed through the alu, stored in MAR, and the data in D_REG is moved to D. (note that D is now a register, depicted in green, not just a memory data input). During the next stage, the memory is actually written, while the ALU is completely free to process the following instruction in the pipe. So, this means that writes to memory consumes effectively just one cycle.

* Loads from memory are performed in a similar way. Lets consider the instruction
Code:
ld.w [r1, r2], r3
. The decoder places r1, r2 in the alu inputs. The next exec stage computes the effective address and places it in MAR. During the next cycle the memory is read and stored in the destination register, r3. The alu during that cycle is free to process the next instruction in the pipe and store the result in ALU_Q (or MAR again if it a load or store). So effectively, load from memory consumes one cycle.

In the case of memory loads, there's the possibility of a hazard appearing that can not be solved by forwarding. For example if the register being read from memory is used in the next instruction. In this case the control circuitry must insert a bubble in the pipeline to allow time for the register value to clear. In this particular case a load will still consume 2 cycles.

* All normal instructions will still use ]one cycle. For example let's consider
Code:
add r1, r2, r3
The decoder will place r1, r2 in the alu inputs. During the next cycle (exec) the addition will be performed and the result stored in ALU_Q register. The only purpose of the ALU_Q register is to keep alu write-back in sync with load memory write-back by consuming the same exact number of stages. This does not delay execution because it's just a matter of instruction latency and the hazard multiplexers already resolve any read-after-write data conflicts. On the following cycle, the ALU_Q register will be moved to the destination register, r3.

* Branches are dealt with in an identical way than the 3 staged model, so they take 3 cycles (this can be improved too, and I will discuse this on a latter post). This is possible despite the additional execution stage because we already know whether a branch will be taken of not by the end of the first execution stage. So by the time we'll update the PC we only have two instructions in the pipeline.

* The 4 stage pipeline also requires a change in the ALU. On the previous models I used the Shift and Extend circuitry of the memory load/store module to perform register based 8 bit shifts and Sign/Zero extensions. The trick was to bypass the memory and use the already available circuits around it. This is no longer possible because we can find the situation where a register Sign/Zero extension or shift8 instruction (sext, zext, shr8,...) goes just after a memory load/store. By the time the instruction would begin the first execution cycle, the required module would be in use by the second execution cycle of the load/store. To make this work, the mentioned instructions must now be processed through the alu like any other register based instructions. Therefore, the alu must be extended to incorporate sext, zext, shr8... functionality in a way non-dependent of load/stores.

Now the results (excluding for now branch improvements that I will disclose later), assuming that load/stores may use 1.1 cycles on average, might be the following:

* 2 stage pipeline at 16 MHz: 1/(0.12*2 + 0.05*3 + 0,3*2 + 0.53) = 0.66 instructions per clock pulse -> 10.5 Million instructions/ second
* 3 stage pipeline at 30 MHz: 1/(0.12*3 + 0.05*4 + 0.3*2 + 0.53) = 0.59 instructions per clock pulse -> 17.8 Million instructions/ second
* 4 stage pipeline at 30 MHz: 1/(0.12*3 + 0.05*4 + 0.3*1.1 + 0.53) = 0.69 instructions per clock pulse -> 20.7 Million instructions/ second

3 pipes versus 2 pipes :17.8 / 15.5 = 68.6 % faster
4 pipes versus 2 pipes: 20.7 / 10.5 = 96.6 % faster

improvement from 3 pipes to 4 pipes: 20.7 / 17.8 = 16.6 % faster


You do not have the required permissions to view the files attached to this post.


Mon Dec 07, 2020 8:33 pm

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 122
That is very interesting. Thanks again for sparking some really serious thinking.

Keep in mind, I believe the threshold for a human to feel that something is faster is 20%. Below that, while it is faster, it's really hard to tell the difference without a stopwatch or something. That might help you decide if 16.6% is worth the extra effort in building it.

As for the writeback stage, without getting into too much detail, it's common to build register files with synchronous RAM in an FPGA. This means you need a clock pulse (usually the inverted cpu clock) for the RAM to do a write and to register the read address and perform the read. So that means you need to have the value you want to write early in the cycle, hence needing the writeback stage. Further, the read value is only available very late in the cycle, so it would be very tight to add muxes after the register file, and often then you would put all the muxes at the beginning of the execute stage instead.

But after much staring at your diagrams, I can't think of a reason it wouldn't work to move the forwarding logic into muxes in the decode stage and write the result into the registers at the end of the cycle when it's available. This should especially work if you're using actual registers instead of RAM for the register file.

I need to think some more and maybe draw some timing diagrams, but it might even be possible to do that with synchronous RAM in an FPGA. You could read in the first half of the cycle and write in the second half, and forwarding would handle the case of reading a stale value. Now I really want to see if I can eliminate the writeback stage from rj16 :-)


Tue Dec 08, 2020 1:06 pm
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
Hi Rj45,

Oh, I think what you explain about the need of the write-back stage for RAM based registers makes sense. Thanks for that. I have no previous experience on hardware designs and I only knew about actual 74xx based registers requiring just a clock pulse to update. So in my mind it made sense that the updates would took place at the clock pulse that ends the execute stage, and no further stages are needed, so that's the way I implemented it.

About placing the forwarding muxes at the end or at the beginning of the stages, I made the following considerations:

- Muxes are implemented with 74CBT analog switches. One interesting property of these devices is that they take some time to select input/outputs, around 4 or 5 ns depeding on the particular reference, but once selected, the propagation delay from inputs to outputs is as low as 0.25 ns. This means that it is a lot more effective (read faster) to place them at the end of the cycle because there's plenty of time for selection, and the actual signal is ready on the outputs almost instantly after it is computed on the same stage.

- On the other hand, placing the muxes at the beginning of a stage implies that we need some time (in particular those 4 or 5 ns) to get them ready after the select control signal is available, which needs to be added to the total propagation delay of the stage.

So unless I miss something, it looks to me that in the above scenario, it is faster to have them at the end of the stages than at the beginning.


Tue Dec 08, 2020 2:09 pm
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
Continuing with the subject of possible optimisations, the next one that can be reasonably done is trying to revert branch cycles back to 2 (as opposed to 3 which are required for the pipelined versions).

This requires speculatively computing branch addresses during the decode stage, and decide whether a branch should be taken or not as early as by the end of the same decode stage. Updating the PC with the destination address at this time would result in only one wait cycle as it was the case for the original 2 staged pipeline.

This is exactly what the, 3-stage pipelined, ARM Cortex-M processors do to accelerate branch execution, without resorting to branch prediction, and it's also applicable to the CPU74 architecture. Not all kind of branches can benefit from this, but the majority will.

The expected gains can be estimated as follows. Again I start from the already known figures of the 2 pipe and 3 pipe processors

* 2 stage pipeline at 16 MHz: 1/(0.12*2 + 0.05*3 + 0.30*2 + 0.53) = 0.66 instructions per clock pulse -> 10.5 Million instructions/ second
* 3 stage pipeline at 30 MHz: 1/(0.12*3 + 0.05*4 + 0.30*2 + 0.53) = 0.59 instructions per clock pulse -> 17.8 Million instructions/ second
* 4 stage pipeline at 30 MHz: 1/(0.12*3 + 0.05*4 + 0.30*1.2 + 0.53) = 0.69 instructions per clock pulse -> 20.7 Million instructions/ second

As a remainder:
- 0.12 is the estimated proportion of taken branches
- 0.05 is the estimated proportion of call and return instructions
- 0.30 is the estimated proportion of load/stores
- 0.53 is all remaining instructions

These figures are multiplied by the number of cycles consumed by these kind of instructions and the clock frequency to get the final Mega instructions/second figure

In this case, the proposed branch optimisation would reduce all call/return instructions to 3 cycles, instead of 4. Non conditional branches would use 2 cycles instead of 3 cycles, and I estimate that about 95% of the remaining conditional branches will benefit from the 1 cycle reduction too. So I think it's safe to consider that all taken branches will use 2 cycles. This result in the following figures for the 3 and 4 pipe models

* 3 stage pipeline at 30 MHz: 1/(0.12*2 + 0.05*3 + 0.30*2 + 0.53) = 0.66 instructions per clock pulse -> 19.7 Million instructions/ second
* 4 stage pipeline at 30 MHz: 1/(0.12*2 + 0.05*3 + 0.30*1.2 + 0.53) = 0.78 instructions per clock pulse -> 23.4 Million instructions/ second

So the gains with respect to the 2 pipes version would be as folows:

3 pipes: 19.7 / 10.5 = 87.5% faster
4 pipes: 23.4 / 10.5 = 122.6% faster (that's 2.2 times faster than the basic 2 stage pipelined, 16 MHz version)

But, what does this involve in practice? :

* A fast adder has to be placed in the decoding stage to speculatively compute PC+immediate destination addresses. Non-conditional jumps are just detected on the same stage and the computed destination address is used to update the PC right away. The execution stage does not even fire for this instruction. The same applies for the 'call' instruction. In this case, the second instruction cycle is resolved in the decode stage in the way described by the jump instruction, except that the adder does not even need to be used because the call destination address is absolute.

* For conditional jumps, the T status register flag must the known by the end of the jump decode stage. If the T flag was already computed in anticipation by an early instruction (generally a 'cmp' instruction) this is not a problem. The worse case is when the 'cmp' instruction is just preceding the 'branch' instruction, a rather common occurrence...

The ARM Cortex-M seems to resolve this by implementing a Read After Write hazard mechanism around the Status Register flags so that the status condition can be forwarded to the branch decode stage, directly from the execution stage of the preceding compare instruction.

The case of the CPU74 is slightly more difficult because due to excessive propagation delays of condition codes computation, they are currently computed half cycle later than the execution stage of the instruction generating them. This works in the general case because, by-design, status flags are never needed until almost the end of any execution cycle, so by when they are required they are already available from the first cycle half. Unfortunately, such half-cycle delay prevents them to be forwarded to the decode stage of a following instruction on time.

The workaround to that, is to speculatively compute the 'T' flag as soon as possible for the instructions where this is possible. At least for these instructions the status register forwarding would be possible. Fortunately, in practice this covers the majority of cases. The 'Carry', "Overflow" and "Sign" flags are not the problem because they are available even sooner than the ALU result. The problem is the "Zero" flag, which is currently computed in a general way by comparing alu result with zero, and this surprisingly takes a lot of time. However, it turns out that for the 'cmp' and 'sub' instructions the zero flag can be determined in advance by xorting the ALU inputs: if both inputs are identical the Zero flag will set, so there's no need to compare the alu result with zero latter on. Therefore, at least for these two instructions, cmp, sub, the full set of flags can be available by the end of the execution stage, and therefore they can be forwarded to the concurrent decode stage of a conditional jump in the pipeline, thus reducing jump processing by one clock cycle (!).


Tue Dec 08, 2020 2:34 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1627
Nice! BTW, I much prefer the formulation 2.2x faster, compared to the formulation 122% faster. For me, it works also for smaller improvements, like 1.1x or 1.05x faster.


Tue Dec 08, 2020 4:51 pm
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
BigEd wrote:
Nice! BTW, I much prefer the formulation 2.2x faster, compared to the formulation 122% faster. For me, it works also for smaller improvements, like 1.1x or 1.05x faster.
I agree with this, I don't really know why I started putting it in % figures, it's weird. So this is the summary using the now desired convention.

(0) 2 stage pipeline at 16 MHz: 0.66 instructions per clock pulse -> 10.5 Million instructions/ second
(A) 3 stage pipeline at 30 MHz: 0.59 instructions per clock pulse -> 17.8 Million instructions/ second
(B) 4 stage pipeline at 30 MHz: 0.69 instructions per clock pulse -> 20.7 Million instructions/ second
(C) 3 stage pipeline at 30 MHz with optimised branches: 0.66 instructions per clock pulse -> 19.7 Million instructions/ second
(D) 4 stage pipeline at 30 MHz with optimised branches: 0.78 instructions per clock pulse -> 23.4 Million instructions/ second

improvements relative to (0):

(A) -> 1.7x faster
(B) -> 2.0x faster
(C) -> 1.9x faster
(D) -> 2.2x faster


Tue Dec 08, 2020 7:00 pm

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 122
Very interesting. Those fast analog muxes sound amazing. Alas, FPGAs I don't think have that sort of thing as far as I know -- at least not the cheap ones, anyway. Even so, I am going to do some serious thinking about this. No writeback stage is a good simplification and well worth the extra thought.

One thing to factor in is that the branch optimization will have a greater impact on small tight loops like memcpy for example, which would be a pretty common operation if, say, you're trying to do a lot of graphics.


Tue Dec 08, 2020 7:19 pm

Joined: Fri Nov 29, 2019 2:09 am
Posts: 18
Hi Joan

First, congratulations on the design (and your presentation of it). It looks great!

Thanks for your note. I was about to email you back but I thought I might just post here a couple of comments that pertain to the discussion above. Here they are:

  • As far as I can tell, you can put hazard muxes before or after pipeline stages. Like you, I prefer them before the pipeline stage to get a clean start on the new cycle. But things can vary from design to design, and it may be advantageous to put them after. It’s all about the relative tpd of pipeline stages and balancing them. You want them equal as far as is possible, and that may influence where delays are best tolerated.
  • The WriteBack stages serves a useful purpose that is unrelated to the position of hazard muxes. That is to provide buffering so critical CPU elements are not exposed to undue capacitance and external influences — this is particularly true of the external data bus. The system data bus is an uncontrolled environment, and you don’t know what kind of mayhem may be going on there (i.e. various memories, glue logic, peripherals, sensors and goodness knows what else, all may be present and loading down the data bus). For that reason, it may be advisable to capture the data coming in from the external data bus at the boundary, and use a separate pipeline stage from there. That will allow maximum time for external systems to respond, and also gives you better control over delays internal to the CPU. You can then publish a clean and well-defined setup time spec for systems designers who may want use your CPU in a given environment.

As always, hope that’s helpful.

Cheers,
Drass


Wed Dec 09, 2020 2:48 pm
User avatar

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
Hi Drass,

Drass wrote:
The WriteBack stages serves a useful purpose that is unrelated to the position of hazard muxes.

Thank you very much for your input. I understand the purpose of the writeback stage in the way you describe it and this makes sense from this point of view.

I am confused by the typical drawing of the MIPS pipeline in relation with the position of the hazard muxes. I found the following drawing on the internet that is similar to the ones that appear in computer architecture books:

[MIPS five stage pipeline]
Attachment:
obrazek.png

[A]
On the above figure, the hazard muxes are placed at the beginning of the execute stage. Hazards are solved by capturing data either from (1) the Memory Stage or from (2) the WriteBack stage. In the first case the data comes from one stage later (but the previous instruction) and in the second case it comes from two stages later. Data in the WriteBack stage is written to registers, and that happens at the beginning of the same cycle that will transfer registers to the execute stage.

[B]
Now, imagine that we change the placement of the hazard muxes and we place them just at the left side of the execute stage column, i.e at the end of the decode stage. This would be similar to the drawing that I proposed for my CPU74 pipeline. Again, in case of hazard, we need to capture data from one stage later (previous instruction). since the latches are at the end of the decode stage, in this case we would take (1) from the Execute stage (as oposed to from the memory stage). Also, we would take (2) from two stages later, i.e, from the memory stage (as oposed to the write back stage). Therefore, now we have a situation where the very last column of latches looks unnecessary. In this scenario, data available by the end of the Memory Stage is directly moved to the register file at the clock pulse that signals the end of the cycle, thus new data is available on the register file just at the beginning of the following cycle, exactly the same outcome than scenario [A]. But now we have one less column of latches, so one less stage!. It looks to me that just by placing the hazard muxes at the end of stages, and using the clock edge to transfer results to registers, we can remove the very last stage. So maybe if we use normal registers, we don't really need that last stage? But if we want that stage for the reasons that you point out, then we should place the hazard muxes at the beginning of the stages?

rj45 wrote:
As for the writeback stage, without getting into too much detail, it's common to build register files with synchronous RAM in an FPGA. This means you need a clock pulse (usually the inverted cpu clock) for the RAM to do a write and to register the read address and perform the read. So that means you need to have the value you want to write early in the cycle, hence needing the writeback stage. Further, the read value is only available very late in the cycle, so it would be very tight to add muxes after the register file, and often then you would put all the muxes at the beginning of the execute stage instead.

Now, after some thought and rereading this, I think that this provides another good point about why we might need to place a write back stage.


You do not have the required permissions to view the files attached to this post.


Wed Dec 09, 2020 4:38 pm

Joined: Fri Nov 29, 2019 2:09 am
Posts: 18
Maybe I’m missing something but I don’t think the position of the hazards muxes has any bearing on whether you require a WB stage or not, or vice versa. It seems to me that you can place the muxes at either the beginning of the execute stage or at the end of the decode stage, for example, and this works whether you have a WB stage or not. Each approach simply affords a different tradeoff in terms of how delay is distributed between pipeline stages, that’s all. Conversely, you may decide to have a WB stage, either in order to provide buffering, or more time for the write to the registers, or more time for external devices to respond — all are valid reasons independently or in combination — and still choose to have hazard muxes before or after pipeline registers in each case (and indeed you can make a different decision in this regard for one stage or another). But maybe I’m not seeing the problem.


Thu Dec 10, 2020 1:21 am
 [ 305 posts ]  Go to page Previous  1 ... 17, 18, 19, 20, 21  Next

Who is online

Users browsing this forum: CCBot and 0 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Jump to:  
Powered by phpBB® Forum Software © phpBB Group
Designed by ST Software