


 [ 305 posts ]  Go to page Previous  1 ... 13, 14, 15, 16, 17, 18, 19 ... 21  Next
 74xx based CPU (yet another) 

Joined: Mon Oct 07, 2019 1:26 pm
Posts: 43
Hi Joan, did you also see the 4bit TTL ALU as proposed here:

https://hackaday.io/project/160506-4-bit-ttl-alu

It looks a lot like your design.

7 simple TTL's per nibble, and no programmable devices needed....


Mon Sep 14, 2020 8:31 pm

Joined: Mon Oct 07, 2019 2:41 am
Posts: 256
Can a zero output be added to the 1st carry lookahead unit
depending on how you compare things.
Z add/sub undefined
Z Pn=0 for other logic


Mon Sep 14, 2020 9:23 pm

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
roelh wrote:
Hi Joan, did you also see the 4bit TTL ALU as proposed here:

https://hackaday.io/project/160506-4-bit-ttl-alu

It looks a lot like your design.

7 simple TTL's per nibble, and no programmable devices needed....

Hi Roelh, thanks for your comment. I recall having looked at your "square inch" ALU, but I think you are comparing apples with oranges in this case. I am proposing a 16-bit, fully carry-look-ahead ALU, which is conceptually similar to four 74181 chips glued to one 74182, with even more functions. Yes, I'm using 20+ chips for that, but I am not aware of any other solution if I want maximum performance (unless I open the can of fast analog switches and avoid look-ahead circuitry altogether, as in relay-based ALUs).

For a 4-bit ALU like the one you proposed, I would rather use four 74153 multiplexers connected to a single 74283 adder. That's only 5 ICs and it has more functions. If your goal was fitting it all in a 25 cm square PCB, then this would seem to me the best/easiest approach. Of course, if you count the total number of gates, the 74283 alone may have more than your current design, but that chip features full carry look-ahead at all stages including the output, just like the 74181 and my ALU, so there's a reason for that. Also, cascading two such 4-bit ALUs by just connecting the carry signals would still be quite fast. If that's not enough, for an 8-bit ALU you can even use a speculative carry on the higher 4 bits (I think it's called a carry-select adder), which only requires an additional 74283 and a 2-to-1 multiplexer. That's even faster with very little overhead.
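The speculative-carry arrangement mentioned above, where the upper nibble is added twice and a multiplexer picks the right result, can be sketched behaviourally as follows. This is only a minimal model: `add4` stands in for a 74283, and the function names are made up for illustration.

```python
def add4(a, b, cin):
    """Model of one 4-bit adder with carry in/out (the role a 74283 plays)."""
    total = (a & 0xF) + (b & 0xF) + (cin & 1)
    return total & 0xF, total >> 4


def add8_speculative(a, b, cin=0):
    """8-bit add where the high nibble is computed for both possible carries."""
    lo_sum, lo_carry = add4(a, b, cin)
    # These two high-nibble adders work in parallel with the low one.
    hi0, c0 = add4(a >> 4, b >> 4, 0)   # speculative result for carry-in = 0
    hi1, c1 = add4(a >> 4, b >> 4, 1)   # speculative result for carry-in = 1
    # The 2:1 multiplexer picks the right result once lo_carry is known.
    hi_sum, cout = (hi1, c1) if lo_carry else (hi0, c0)
    return (hi_sum << 4) | lo_sum, cout
```

In hardware the gain comes from the high nibble no longer waiting for the low nibble's carry; only the multiplexer delay is added after it.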


Last edited by joanlluch on Mon Sep 14, 2020 10:13 pm, edited 2 times in total.



Mon Sep 14, 2020 9:56 pm

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
oldben wrote:
Can a zero output be added to the 1st carry lookahead unit
depending on how you compare things.
Z add/sub undefined
Z Pn=0 for other logic

I'm not sure I understand that. Could you please elaborate? Are you referring to some kind of circuitry to replace the Z flag generator at the output?


Mon Sep 14, 2020 10:00 pm

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
I have now the instruction decoder modelled in Logisim, this is how it looks:
Attachment:
InstDecoder.png


As shown some time ago, the instruction set is not very difficult to decode. This is the opcodes summary:
Attachment:
Screen Shot 2020-09-17 at 19.29.32.png


To decode, one possible approach is to feed the 9 most significant bits of the instruction register to ROMs with a 9-bit address in order to output the control lines. Assuming 32 control lines (yet to be determined), this would require 2^9 = 512 addresses, so a total ROM capacity of at least 512 addresses x 32 bits = 16 Kbit. Although this is quite a common approach for many homebrew processors, I feel it is an abuse of what we can do today that was not possible in the mid 80's or so, so I'm going for a slightly different route.

The instruction set was designed to use a mix of direct decoding and microcode, which means that the decoding ROM can be reduced. In order to do this, the instruction opcodes are pre-decoded with a simple mechanism and converted into shorter 7 bit opcodes, which are then fed to the decoding ROM requiring only 2^7 = 128 raw ROM addresses, or 128 addresses x 32 bits = 4 K bits (assuming 32 control signals).
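As a quick check of the figures above, the ROM capacity for both approaches can be computed directly. The helper name is made up for illustration, and the 32 control lines are the assumption already stated in the post.

```python
def rom_bits(address_bits, control_lines=32):
    """Total capacity of a decode ROM with one control word per address."""
    return (1 << address_bits) * control_lines


direct = rom_bits(9)    # raw 9-bit opcodes
compact = rom_bits(7)   # pre-decoded 7-bit opcodes
print(direct // 1024, "Kbit vs", compact // 1024, "Kbit")
```

Each address bit removed by the pre-decoder halves the ROM, so going from 9 to 7 bits shrinks it by a factor of four.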

Instructions requiring more than one execution step are processed in a rather unusual way. Instead of relying on a cycle counter, a linked list is created in the decode ROM that points to the microinstruction step that must be executed next, until all the steps are complete. I found this very flexible because there's no limit on the number of steps that an instruction can have (if needed), and it helps save ROM space, especially in this case where only a few instructions require more than one execution step.
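A minimal sketch of the linked-list idea, assuming each decode ROM entry carries a control word plus the address of the next step. The addresses and control names below are invented for illustration and do not come from the actual CPU74 tables.

```python
# Hypothetical decode ROM: address -> (control word name, next address).
# A next address of None marks the last step of an instruction.
DECODE_ROM = {
    0x05: ("alu_add", None),            # single-step instruction
    0x21: ("calc_address", 0x60),       # first step of a two-step instruction
    0x60: ("mem_to_register", None),    # linked second step
}


def run_instruction(opcode7):
    """Follow the linked list of steps and return the control words issued."""
    steps = []
    address = opcode7
    while address is not None:
        control, address = DECODE_ROM[address]
        steps.append(control)
    return steps
```

Because each entry names its own successor, no cycle counter is needed and an instruction can chain as many steps as required.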

In the schematic, the pre-decoding circuit is made around a couple of ingeniously connected 74257 multiplexers, which select one of three 74541 tri-state buffers depending on the instruction type. The relevant 5-bit opcode info is given two prefix bits, "00", "01" or "10", before feeding the decoding ROM, in this case several 16V8B PLAs. The "11" prefix is reserved for additional execution steps in the linked list, and the 7474 D flip-flop stores whether we are executing a second/additional step of an instruction.
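The prefixing scheme can be sketched as a small function. This assumes the two prefix bits sit above the 5 opcode bits in the 7-bit address; the instruction type names are placeholders, since the post only says there are three types.

```python
# Hypothetical names for the three instruction types; the post only states
# that they receive the prefixes 00, 01 and 10.
TYPE_PREFIX = {"type0": 0b00, "type1": 0b01, "type2": 0b10}
EXTRA_STEP_PREFIX = 0b11  # reserved for additional execution steps


def compact_opcode(instr_type, opcode5):
    """Build the 7-bit decode address for the first step of an instruction."""
    return (TYPE_PREFIX[instr_type] << 5) | (opcode5 & 0x1F)


def extra_step_address(step_index):
    """Build the 7-bit decode address of a linked follow-up step."""
    return (EXTRA_STEP_PREFIX << 5) | (step_index & 0x1F)
```

For example, `compact_opcode("type1", 0b00011)` gives `0b0100011`, while every follow-up step lands in the upper quarter of the address space.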

There are a couple of optimisation opportunities left aside in this design, but I will leave them for a later stage, when I have a clearer view of the entire processor as well as the critical paths and theoretical max clock frequency. The wires left open on the right of the schematic are still waiting to be assigned to the required control lines. I expect I will not need more than the ones shown.




Thu Sep 17, 2020 9:50 pm

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
I just updated the GitHub repo with .png files of the Logisim based schematics. The relevant directory can be found here:

https://github.com/John-Lluch/CPU74/tree/master/Docs/LogisimDocs

The ALU is now fully functional, covering all the relevant instructions including right shifts, conditional moves, and of course all the logical and arithmetic functions. The register file, with its eight general purpose registers, has been added as well.

The zero/sign extend ('zext', 'sext') instructions are not covered in the ALU, but that's because they will be implemented in a separate module along the data path, which will also cover 'sign/zero extend loads'.


Thu Sep 24, 2020 10:00 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1632
Great - thanks for the update.


Thu Sep 24, 2020 10:06 pm

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
I made some relatively minor changes to the ALU and implemented the Status Register, which I will discuss at a later stage. For now, I am interested in the Program Counter design. As straightforward as it may seem, it turns out not to be.

My goal is still 16 MHz, which means a cycle time of at most 62.5 ns, or ideally around 50 ns if I can, so that with some luck I can attempt to overclock it to 20 MHz. Unfortunately, there are no EEPROMs faster than 45 ns that I am aware of. I am looking at the AT27C512R-45PU for this design, but suggestions are of course welcome. Since this is a Harvard architecture, the processor reads instructions directly from EEPROM program memory on every cycle, and it seems there's not a lot of time left to do so.

The overall strategy on a Harvard processor is that the next instruction is fetched at the same time as the current instruction is decoded/executed. On any given execution cycle, the PC always points to the next instruction, which is being fetched from program memory as the current instruction executes. At the end of the current cycle, the next instruction is clocked into the instruction register for execution during the next cycle. On the fetch side, we need to increment the PC, read an instruction from program memory, and clock the instruction opcode into the instruction register.

Since memory access takes almost all the available time, we need to have the PC incremented early in the cycle. In other words, the new value must already be clocked into the PC at the onset of the cycle for this to work on time. For this, the incremented PC value is computed in parallel with the memory fetch, and it is used or discarded depending on whether the next instruction must be executed or the processor is taking a jump. The cycle then starts with the PC already loaded with either the incremented value or the branch address if the branch is taken (in the latter case, of course, one wait cycle will be inserted to allow the instruction register to update with the instruction at the branch address).
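The overlap of fetch and execute, and the wait cycle after a taken jump, can be illustrated with a toy cycle-level model. This is a sketch under simplifying assumptions: instructions are tuples, jumps take an absolute target, and the names are invented for the example.

```python
def run(program, max_cycles=20):
    """program: list of ("nop",), ("jmp", target) or ("halt",) tuples."""
    pc = 0               # address of the instruction being fetched
    ir = ("wait",)       # instruction register; nothing to execute yet
    trace = []
    for _ in range(max_cycles):
        # The fetch at PC overlaps with the execution of the IR contents.
        fetched = program[pc] if pc < len(program) else ("wait",)
        trace.append(ir[0])
        if ir[0] == "halt":
            break
        if ir[0] == "jmp":
            pc = ir[1]       # branch address clocked into the PC
            ir = ("wait",)   # the word fetched after the jump is discarded
        else:
            pc += 1          # incremented value clocked into the PC
            ir = fetched
    return trace


demo = [("nop",), ("jmp", 3), ("nop",), ("halt",)]
print(run(demo))
```

In the trace, the `nop` placed right after the jump never executes; a `wait` cycle appears in its place while the instruction at the destination is fetched.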

As per the CPU74 ISA, we also need a way to load data from program memory to be able to perform the 'movw_pr' opcode, (load from program memory instruction). This is mainly used to move program constant data to data memory as part of the setup code before jumping to user code. The PMAR register (Program Memory Address Register), is used for that purpose without altering the contents of the PC.

So the resulting schematic as represented in the Logisim simulator is this:

Attachment:
RegPC_Color.png


It consists of:

- The PC register, at the top left, made of two 74AC273 8-bit D flip-flops, allowing Reset to zero for processor initialisation. Reset is active low.
- The PMAR register, at the bottom left, made of two 74AC574 8-bit D flip-flops.
- The incrementer, in the centre, made up of four 74AC283 4-bit adders arranged in a ripple-carry configuration, as there's enough time for it.
- A series of four 4-bit, 2:1 fast analog switches (74CBT3257) to select either the incrementer or, in case of a taken branch, the input to be clocked into the PC.
- A couple of fast 8-bit, 1:1 analog switches (74CBT3345) to select the output from the PC or the PMAR register.

- The transparent latches on top prevent phantom triggers on the register clock inputs due to write signals timing.
- The single d flip flop on the WR_PMAR signal enables the PMAR output for an entire cycle, so that a random program memory address can be read by the execution side of the processor during that cycle

WR_PC (low active) selects the PC to be updated, either from the incrementer or the IN bus, depending on BR. The contents of PC appear on PC_OUT
WR_PMAR (low active) selects PMAR to update from the input bus, and provides its contents to PC_OUT
WR_PMAR and WR_PC should not be selected simultaneously, but they can be left both unselected for example to prevent the PC from incrementing if a multiple cycle instruction is in execution
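The behaviour of the control signals described above can be summarised in a small register-level model. It is only a sketch: the class and attribute names are invented, signal timing is ignored, and the incrementer is assumed to add one to the current PC value.

```python
class ProgramCounterUnit:
    """Behavioural model of the PC/PMAR block; control inputs are active low."""

    def __init__(self):
        self.pc = 0                  # cleared by Reset
        self.pmar = 0
        self.pmar_selected = False   # state of the flip-flop on WR_PMAR

    def clock(self, wr_pc_n, wr_pmar_n, br, in_bus):
        if not wr_pc_n and not wr_pmar_n:
            raise ValueError("WR_PC and WR_PMAR must not be selected together")
        if not wr_pc_n:
            # BR chooses between the IN bus and the incrementer output.
            self.pc = (in_bus if br else self.pc + 1) & 0xFFFF
        if not wr_pmar_n:
            self.pmar = in_bus & 0xFFFF
        # PMAR drives PC_OUT for the whole cycle after it is written.
        self.pmar_selected = not wr_pmar_n

    @property
    def pc_out(self):
        return self.pmar if self.pmar_selected else self.pc
```

Leaving both write signals high simply holds the PC, which is what a multi-cycle instruction needs.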

* The estimated typical propagation delay for the four-adder chain is 42 ns, plus the 8 ns delay from clock pulse to Qn for the 74AC273 register, which makes 50 ns to get the incremented value ready for the next cycle (enough). The 74CBT3257 switches can be ignored because they will be set up much earlier than that, and their data propagation delay is negligible.

* The worst-case delay from the clock pulse to the output occurs when going back to PC operation from PMAR. The 74AC273 takes 8 ns from clock to output, but the combined delay from the WR_PMAR latch plus the 74CBT3345 setup is 6 ns + 5 ns = 11 ns, which is what counts in this case. The program memory is directly connected to the output and its access time is 45 ns, which adds up to 56 ns to get the memory contents ready for clocking into the Instruction Register. That's still within the 16 MHz goal (56 ns < 62.5 ns), so we are fine for now.
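The timing budget above can be written down as plain arithmetic, using only the delay figures quoted in this post. The variable and function names are illustrative.

```python
# PC increment path: 74AC273 clock-to-Q, then the four-adder ripple chain.
increment_ns = 8 + 42        # 50 ns

# Fetch path when returning from PMAR to PC operation:
# WR_PMAR latch + 74CBT3345 setup + program memory access time.
fetch_ns = 6 + 5 + 45        # 56 ns


def slack_ns(path_ns, clock_mhz):
    """Time left in one clock cycle after the given path has settled."""
    return 1000 / clock_mhz - path_ns


for mhz in (16, 20):
    print(mhz, "MHz:", slack_ns(increment_ns, mhz), slack_ns(fetch_ns, mhz))
```

With these numbers the fetch path has 6.5 ns to spare at 16 MHz but would miss a 50 ns cycle by 6 ns, so a 20 MHz overclock would appear to depend on faster program memory.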

So this is the circuit that I was able to come up with. It works just fine on the simulator and it does everything I want it to do. But of course, any suggestions for improvements or simplifications are welcome.

Joan




Tue Sep 29, 2020 11:30 am

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
To put things in perspective, I decided to update the processor general architecture drawing to be cleaner and hopefully clearer:

Attachment:
CPU74Diagram10.png


I think that, presented this way, it is much easier to figure out which paths data should follow for instruction execution.

The processor has 3 main buses, the A_BUS, the B_BUS, and the Q_BUS, and two individually addressed memories, Program Memory and Data Memory, as per the Harvard design.

- The A_BUS moves data from Registers to the Left ALU Input, or to the Data Memory input.

- The B_BUS moves Immediate constants, or prefixed instruction data, or Registers to the Right ALU Input

- The Q_BUS moves data from the ALU Output, or from Memory, to Registers, including the PC, or to Memory Address Registers (both Data and Program)

* All ALU instructions involving only Registers or Immediate constants take one clock cycle. The data path involves sending the register or constant values to the Left and Right ALU inputs. The ALU result is stored in the destination Register.

For example the instruction ADD R0, R1, R2 will put R0 to the A_BUS, R1, to the B_BUS, will configure the ALU for addition, and will store the result in R2

It is a Load/Store architecture, which means that memory can only be written from or read into registers. In other words, memory reads or writes will never use the ALU on the same cycle. However, memory address computation can involve the ALU.

* Memory Loads involve two clock cycles. On the first cycle the effective address is calculated and clocked into the MAR register. On the second cycle data is loaded from memory and stored into a Register.

For example the instruction LD.W [R0, 3], R1 will (1st cycle) put R0 to A_BUS, constant 3 to B_BUS, the ALU will be configured for addition, and the ALU result will be stored in the MAR. Then (2nd cycle) memory will be put to Q_BUS and stored in R1

* Memory Writes involve two clock cycles. Again, on the first cycle the effective address is calculated and clocked to the MAR register. On the second cycle the memory contents is updated with the content of the Register.

For example the instruction ST.W R1, [R0, 3] will (1st cycle) put R0 to A_BUS, constant 3 to B_BUS, the ALU will be configured for addition, and the ALU result will be stored in the MAR. Then (2nd cycle) R1 will be put on A_BUS and memory will be written

* Taken jumps involve two clock cycles. On the first cycle the PC is updated with the jump destination. The second cycle is just a wait cycle to allow the IR to update with the destination instruction.

For example the instruction JMP &Label will (1st cycle) move PC to the A_BUS, the jump offset is moved to the B_BUS, the ALU is configured for addition, and the Q_BUS is stored to the PC. A wait (2nd) cycle is necessary to prevent the instruction just after the JMP from being executed. After the wait cycle, the IR will correctly contain the first instruction at the destination address.

* Call and Return instructions are the only ones taking 3 cycles. This is because they involve a load/store of the PC on memory, and a SP increment/decrement. For the Call instruction, the SP is first decremented, then the PC stored in the memory address pointed by SP, and the PC updated with the destination address. In this case the PC store to memory and the PC update from an immediate absolute address can share the same execution cycle because the buses involved do not overlap. For the return instruction, the process is reversed. The PC is loaded from the address pointed to by the SP, then the SP is incremented.
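The per-cycle bus usage described in the examples above can be captured in a small table, one dictionary per clock cycle. This is only a restatement of the text in data form: the key names are invented, and Call and Return appear only as cycle counts because their exact steps are not spelled out cycle by cycle.

```python
# One dict per clock cycle, following the worked examples in the text.
MICRO_STEPS = {
    "ADD R0, R1, R2": [
        {"A_BUS": "R0", "B_BUS": "R1", "ALU": "add", "Q_BUS to": "R2"},
    ],
    "LD.W [R0, 3], R1": [
        {"A_BUS": "R0", "B_BUS": "imm 3", "ALU": "add", "Q_BUS to": "MAR"},
        {"Q_BUS from": "memory", "Q_BUS to": "R1"},
    ],
    "ST.W R1, [R0, 3]": [
        {"A_BUS": "R0", "B_BUS": "imm 3", "ALU": "add", "Q_BUS to": "MAR"},
        {"A_BUS": "R1", "memory": "write"},
    ],
    "JMP &Label": [
        {"A_BUS": "PC", "B_BUS": "jump offset", "ALU": "add", "Q_BUS to": "PC"},
        {"wait": "IR reloads from the destination"},
    ],
}

# Call and Return are given only as cycle counts in the text.
OTHER_CYCLE_COUNTS = {"CALL": 3, "RET": 3}


def cycles(instruction):
    """Number of clock cycles an instruction takes."""
    if instruction in MICRO_STEPS:
        return len(MICRO_STEPS[instruction])
    return OTHER_CYCLE_COUNTS[instruction]


def run_time_ns(instructions, clock_mhz=16):
    """Total execution time of a sequence at the given clock frequency."""
    return sum(cycles(i) for i in instructions) * 1000 / clock_mhz
```

For instance, an ADD followed by a load and a call takes 1 + 2 + 3 = 6 cycles, or 375 ns at 16 MHz.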




Wed Sep 30, 2020 2:43 pm

Joined: Mon Oct 07, 2019 2:41 am
Posts: 256
A) What do you use to draw the diagrams?
B) Without high speed/high density static RAM, RISC would never work.
Easy to use (through-hole) parts only run at ~150 ns.
A CISC design has an extra cycle or two to generate an effective address.
This time slot can be used for DMA or DRAM refresh or video displays.
RISCs are fast if you don't have to share memory and have ample registers.
You do, smart design there. Ben.


Wed Sep 30, 2020 4:36 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1482
Location: Canada
Nice looking diagram, makes it a lot clearer how things are working.

_________________
Robert Finch http://www.finitron.ca


Thu Oct 01, 2020 2:58 am WWW

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
Hi Ben

oldben wrote:
A) What do you use to draw the diagrams?
B) Without high speed/high density static RAM, RISC would never work.
Easy to use (through-hole) parts only run at ~150 ns.
A CISC design has an extra cycle or two to generate an effective address.
This time slot can be used for DMA or DRAM refresh or video displays.
RISCs are fast if you don't have to share memory and have ample registers.
You do, smart design there. Ben.

For this diagram I used an app called "draw.io". It provides a full set of object diagrams in many industrial fields, such as electric symbols, logic gates, electronics, flow charts, and more. The user interface is not the best in the world, and it suffers from some annoying drawing glitches, but it's good enough for the task. For tables I have simply used a spreadsheet.

I have not fully decided which RAM I will use, but it will be SRAM, that's for sure, so I don't have to worry about refreshes. I think one viable option is the available 256 Kbit (32K x 8 bit) chips. It seems that a 20 ns access time is pretty standard on these chips, although I would probably do just fine with 35 ns or 45 ns. This is a filtered search on Mouser returning some available options: https://www.mouser.es/Semiconductors/Memory-ICs/SRAM/_/N-4bzpt?P=1yzay1oZ1yyxb3oZ1yzmm18Z1z0w7wf

And this is the datasheet of the one that had more stock at the time of my search:

https://www.mouser.es/datasheet/2/698/REN_71256SA_DST_20200629-1711442.pdf

These chips are 32K bytes, with 8-bit data buses and 15-bit address buses, which is ideal for my processor because I can use 2 of them and access the low/high byte of a word by playing with the low address bit. Oh, and they are available in a PDIP package if required.
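One way to picture the two-chip arrangement is the model below, where the low address bit selects the chip and the remaining 15 bits address a byte inside it. The little-endian byte order and the class name are assumptions made for the example, not something stated in the post.

```python
class WordMemory:
    """Two 32K x 8 SRAMs forming a 16-bit wide, byte-addressable memory."""

    def __init__(self):
        self.low = bytearray(32 * 1024)    # even byte addresses (assumed)
        self.high = bytearray(32 * 1024)   # odd byte addresses (assumed)

    def _chip(self, addr):
        # The low address bit picks the chip; the rest indexes within it.
        return (self.high if addr & 1 else self.low), (addr >> 1) & 0x7FFF

    def write_byte(self, addr, value):
        chip, index = self._chip(addr)
        chip[index] = value & 0xFF

    def read_byte(self, addr):
        chip, index = self._chip(addr)
        return chip[index]

    def write_word(self, addr, value):
        """Aligned 16-bit write: both chips are accessed at the same index."""
        index = (addr >> 1) & 0x7FFF
        self.low[index] = value & 0xFF
        self.high[index] = (value >> 8) & 0xFF

    def read_word(self, addr):
        index = (addr >> 1) & 0x7FFF
        return self.low[index] | (self.high[index] << 8)
```

A word access enables both chips at once, while a byte access enables only the chip chosen by the low address bit.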

As for your mention of a video display, at this time I'm more interested in the processor itself. So I will probably live with simply attaching a small Arduino board to provide a connection to the world in the form of a remote terminal.


Thu Oct 01, 2020 5:57 pm

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
As shown earlier, instruction decoding involves a 'predecoder' step that compacts instruction opcodes for further decoding. It also extracts immediate constant fields from instructions having them. Compacted instructions are then fed to a series of ATF16V8 devices which together generate the control signals for the processor.

While working on that, I realised a problem that I did not anticipate. I have put several PLAs on the decoder which were supposed to convert compacted 7-bit instruction opcodes into control lines, but I paid little attention to the PLA contents because that was supposed to be straightforward. After all, it's just a matter of entering a table with all the opcodes and their control lines. But I have now realised that, according to the ATF16V8 spec, I can only allocate 8 product terms to each sum term. Unfortunately, it seems that I need more. After creating a draft table of inputs->outputs (instruction_opcode -> control_lines), it turns out that some of the resulting logic expressions have more than 8 product terms. For example, this one that outputs the second bit of the PS signal for the ALU:

Code:
P2 = /M[1] ⋅ /M[0] ⋅ /I[4] ⋅ /I[2] + /I[4] ⋅ /I[3] + /M[0] ⋅ /I[4] ⋅ /I[1] ⋅ I[0] + /I[3] ⋅ I[1] + I[4] ⋅ /I[2] ⋅ /I[1] + M[0] ⋅ I[2] ⋅ /I[0] + M[0] ⋅ I[2] ⋅ I[1] + M[0] ⋅ I[4] + M[1] ⋅ /I[4] ⋅ /I[1] + M[1] ⋅ /I[3] + M[1] ⋅ /I[4] ⋅ I[0] + M[1] ⋅ I[2] ⋅ I[1]


Preferably, I want to generate the PS and GS signals for the ALU directly from the instruction decoder, but this bit of the PS alone needs no less than 12 product terms, which vastly surpasses the capacity of an ATF16V8...
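The product term count can be checked mechanically. The sketch below transcribes the expression above with `M1`, `I4`, etc. standing for `M[1]`, `I[4]`, `*` for AND and `/` for NOT; the helper names are made up, and the limit of 8 is the figure quoted in this post.

```python
P2 = (
    "/M1*/M0*/I4*/I2 + /I4*/I3 + /M0*/I4*/I1*I0 + /I3*I1 + "
    "I4*/I2*/I1 + M0*I2*/I0 + M0*I2*I1 + M0*I4 + "
    "M1*/I4*/I1 + M1*/I3 + M1*/I4*I0 + M1*I2*I1"
)
MAX_PRODUCT_TERMS = 8   # per output, as quoted above for the ATF16V8


def product_terms(expr):
    """Split a sum-of-products expression into its product terms."""
    return [term.strip() for term in expr.split("+")]


def evaluate(expr, inputs):
    """Evaluate the expression; inputs maps names such as 'M1' to 0 or 1."""
    def literal(name):
        return 1 - inputs[name[1:]] if name.startswith("/") else inputs[name]

    return int(any(all(literal(lit) for lit in term.split("*"))
                   for term in product_terms(expr)))


terms = product_terms(P2)
print(len(terms), "terms, fits:", len(terms) <= MAX_PRODUCT_TERMS)
```

The same helper can be run over every output column of a draft truth table to see which signals exceed the limit before committing to an encoding.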

I solved this issue in several ways. Some of them involving significant refactoring of the instruction set. I updated my changes here:

https://github.com/John-Lluch/CPU74/blob/master/Docs/CPU74InstrSetV10.pdf

This mostly involves the refactoring of many instruction encodings in ways that reduce the overall product term pressure on the decoder. For efficiency reasons I also decided to split the generated PS signals into two sets, which speculatively correspond to the two possible values of the 'T' flag. This way I have a very efficient way to configure the ALU depending on the instruction, including conditional instructions, without incurring any overhead.

The table for the two sets of PS signals look like this:

https://github.com/John-Lluch/CPU74/blob/master/Simulator/LogisimSupport/DecoderRomTruthTableV10_1.txt

The first set, labeled PF on that file, is the PS set for the "False" condition code of the Status Register 'T' flag. The second set, labeled PT on that file, is the PS set for the "True" condition of the 'T' flag. On the ALU, I use the following circuit:

https://github.com/John-Lluch/CPU74/blob/master/Docs/LogisimDocs/ALUv10.png

* The PS0f..PS3f, PS0t..PS3t signals on the top left come directly from the instruction decoder PLAs. The Analog Switch at the top centre chooses one or the other PS set depending on the 'T' flag. The 'T' flag from the Status Register is available before the PS signals, so this means that the ALU receives the PS signals with virtually no delay once they are on the outputs of the decoder.

Most instructions have identical sets of PS signals, but the strategy here is to have different PS sets for conditional instructions that might require a different configuration of the ALU depending on the 'T' flag. For example, the 'SEL' (select) instruction will configure the ALU to load BUS_A, or to load BUS_B depending on 'T', with zero overhead on all instructions.

* The GS signals are also provided directly from the Instruction Decoder to the ALU with no delays.

* The Carry input (CS) of the ALU can be configured in several ways depending on what we need to do. The decoder provides 3 bits, CS0, CS1, CS3, indicating the required input carry value. In this case we have the setup delay of the 74CBT3251 before the actual value enters the ALU Core, but that should not affect ALU performance because the carry is not immediately needed by the ALU Core (here for reference: https://github.com/John-Lluch/CPU74/blob/master/Docs/LogisimDocs/ALUCore.png)
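The selection between the two speculative PS sets can be modelled as a lookup followed by a 2:1 choice on the 'T' flag. The PS codes and the direction of the SEL choice below are placeholders invented for the example; the real values are in the truth table linked above.

```python
# Hypothetical 4-bit PS codes, for illustration only.
PASS_A, PASS_B, ADD = 0b0001, 0b0010, 0b0100

# Decoder output per instruction: (PF set for T = 0, PT set for T = 1).
DECODER = {
    "ADD": (ADD, ADD),          # unconditional: both sets are identical
    "SEL": (PASS_B, PASS_A),    # conditional: the sets differ (assumed order)
}


def alu_ps(instruction, t_flag):
    """Return the PS code that reaches the ALU for the current 'T' flag."""
    pf, pt = DECODER[instruction]
    return pt if t_flag else pf   # role of the analog switch
```

Since both sets are produced at the same time, the flag only steers a switch that is already set up and no extra decode step is needed.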

Also, as a matter of reference. This is how the new instruction decoder and immediates decoder look:

https://github.com/John-Lluch/CPU74/blob/master/Docs/LogisimDocs/InstDecoderv10.png
https://github.com/John-Lluch/CPU74/blob/master/Docs/LogisimDocs/ImmDecoderv10.png

So I need to update my assembler and software simulator with the modified opcodes, and perform some testing before I can go ahead with the Logisim hardware simulator. That will probably take me several days before I post again, but I think it's worth the effort.


Wed Oct 07, 2020 7:37 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1632
All this extra work seems to me like a good thing, in some sense: if you had flat memories for decoding, you could be arbitrary in your encodings, but the limited capability of the PALs is forcing you to find more efficient encodings. And that seems good to me, because in conventional circumstances, the implementation will indeed constrain the encodings - whether it's a diode matrix, or a matrix of ferrite cores, or a collection of TTL gates, or transistors on an integrated circuit.


Wed Oct 07, 2020 10:27 am

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
BigEd wrote:
All this extra work seems to me like a good thing, in some sense: if you had flat memories for decoding, you could be arbitrary in your encodings, but the limited capability of the PALs is forcing you to find more efficient encodings. And that seems good to me, because in conventional circumstances, the implementation will indeed constrain the encodings - whether it's a diode matrix, or a matrix of ferrite cores, or a collection of TTL gates, or transistors on an integrated circuit.
That's exactly the idea. In fact, from the beginning I rejected the idea of fitting very large ROMs just for the decoding. That's a rather popular solution in many homebrew processors; it works and it's easy, but I regard the use of these big, inexpensive ROMs as an abuse of what's available in current times.


Wed Oct 07, 2020 10:41 am