 74xx based CPU (yet another) 
Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
oldben wrote:
16 bit access could be done as macro for the few cases you need to fake a structure
from a byte array like a disk directory structure.
#define read16(x) ((unsigned char)*(x) | ((unsigned char)*((x)+1) << 8))
Ben.

Hi Ben, that's a good idea. Actually it occurs to me that I can add 16 bit load/store 'pseudo' instructions to the compiler, so that the compiler can do its thing as if these instructions actually existed, and then transform them into byte load/stores and shifts/ors during machine instruction selection. I think this should benefit from target independent compiler optimisations, and would still be transparent for users. At least for 16 bit array elements it can be done like this. In the other cases, namely scalar variables and structs, I think it's still better to just align everything to 32 bits and perform all non-byte load/stores with 32 bit load/store instructions.
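To make the byte-wise lowering discussed above concrete, here is a minimal sketch of a 16 bit access built only from byte accesses, a shift and an OR. The names `load16` and `store16` are made up for this illustration, and the little-endian byte order (low byte first, as in Ben's macro) is an assumption, not something taken from the CPU74 compiler.

```python
def load16(mem, addr):
    """16 bit load expressed as two byte loads, a shift and an OR.
    Assumes little-endian order (low byte at the lower address)."""
    lo = mem[addr]        # first byte load
    hi = mem[addr + 1]    # second byte load
    return lo | (hi << 8)


def store16(mem, addr, value):
    """16 bit store expressed as two byte stores."""
    mem[addr] = value & 0xFF
    mem[addr + 1] = (value >> 8) & 0xFF


# A fake 4-byte "directory entry" with a 16 bit field at an odd address.
directory_entry = bytearray(4)
store16(directory_entry, 1, 0xBEEF)
value = load16(directory_entry, 1)
print(hex(value))
```

Because every access is a plain byte access, this works at any address, which is the point of doing it this way for packed structures such as a disk directory.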


Mon Nov 09, 2020 3:45 pm

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
I have now been able to run the "8 queens" code example in the simulator, with the improved instruction semantics. The example and previous tests were discussed around this thread on the forums:

http://anycpu.org/forum/viewtopic.php?f=8&t=447&p=5409&hilit=queens#p5409

As a reminder, these are the results that I posted at the time, with the total number of executed instructions and cycles required to find all the possible solutions to the queens problem.
Code:
Executed instruction count: 1321973
Total cycle count: 1762834
Elapsed time: 2.46086 seconds

Now, I got the following result:
Code:
Executed instruction count: 1296252
Total cycle count: 1733090
Elapsed simulation time: 2.70064 seconds
Calculated execution time at 1MHz : 1.73309 seconds
Calculated execution time at 8MHz : 0.216636 seconds
Calculated execution time at 16MHz: 0.108318 seconds

So it looks like the simulator is slightly slower, because the split instruction bitfields are harder to decode in software, but the actual CPU74 code is slightly faster as it takes fewer instructions and fewer cycles. Taking into account the number of cycles, it is faster by 1.7 %. I would have expected a bit better, but that's more than nothing. Code size is also reduced.
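For anyone who wants to check the figures, the numbers above can be reproduced with a few lines of arithmetic. This is only a sketch that uses the counts from the two simulator logs; the variable and function names are made up for the example.

```python
# Counts taken from the two simulator logs quoted above.
old_instructions, old_cycles = 1321973, 1762834
new_instructions, new_cycles = 1296252, 1733090

# Relative reduction, in percent.
cycle_gain = (old_cycles - new_cycles) / old_cycles * 100
instruction_gain = (old_instructions - new_instructions) / old_instructions * 100


def execution_time(cycles, mhz):
    """Seconds needed to run `cycles` clock cycles at `mhz` MHz."""
    return cycles / (mhz * 1e6)


print(f"cycles reduced by {cycle_gain:.1f} %")
print(f"instructions reduced by {instruction_gain:.1f} %")
for mhz in (1, 8, 16):
    print(f"{mhz:2d} MHz: {execution_time(new_cycles, mhz):.6f} s")
```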


Mon Nov 09, 2020 4:07 pm

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
I went back to the Logisim simulator model and updated it with the new instruction semantics described in my previous posts.

This essentially involved inserting an "SHL1" circuit in the BUS_B path to the ALU, and adding a control signal to activate it. As commented earlier, this does not add any meaningful delay because the data path is only affected by the 0.25 ns max propagation delay of the 74CBT3257 switches.

All the updated Logisim drawings are available from here https://github.com/John-Lluch/CPU74/tree/master/Docs/LogisimDocsV12

As the critical path still remains on the "Fetch" stage, I also replaced the "RegPC" circuit to make it faster.

The problem with this module is that program memory must be accessed not only with the PC as the address register, but also independently. The ISA provides the "load from program memory" instruction to do so. However, when data is read from program memory, the PC must keep its value because of course program execution must eventually continue from where it was.

The old circuit based on 74AC161 incrementers was capable of that, https://github.com/John-Lluch/CPU74/blob/master/Docs/LogisimDocsV10/RegPC.png and it worked in the simulator, but it is not as fast as it could be because it selects either "PMAR" or "PCMem" with a 74AC74 flip-flop that is clocked at the beginning of the cycle. This adds the delay of the 74AC74 to the delay of the 74CBT3345 or the 74AC574, which, added to the 45 ns of the memory, gives a critical path of exactly 62.5 ns, that is, a top clock frequency of exactly 16 MHz. It's just enough for my goal, but since I was able to improve the Decode-Execute path to a better figure, I wanted to attempt that for the Fetch stage too.

So the new circuit uses an explicit incrementer (made as a carry skip adder around 74AC283 adders), with a 74AC273 register for the PC. Now, the PMAR is always connected to the memory address inputs and is updated every single cycle. Normally, both the PC and the PMAR are updated simultaneously at the clock edge, with either 'PC+1' or the INPUT. This is fast because program memory receives the new address as soon as the PMAR is updated. In the case of memory read cycles, the PC is simply not given any clock pulse. The circuit needs more components, but it is faster and the control circuitry is also simpler. This is the direct link https://github.com/John-Lluch/CPU74/blob/master/Docs/LogisimDocsV12/RegPC.png
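The behaviour described above can be sketched as a tiny behavioural model. This is not derived from the Logisim files: the class name, the signal names (`load_input`, `mem_read`) and the assumption that a memory read cycle simply puts the INPUT on the PMAR while holding the PC are all simplifications made for this illustration.

```python
class RegPC:
    """Behavioural sketch of the PC/PMAR pair (hypothetical model)."""

    def __init__(self, reset_address=0):
        self.pc = reset_address
        self.pmar = reset_address   # always drives the program memory address

    def clock(self, load_input=False, mem_read=False, input_value=0):
        # Select the next address: INPUT for jumps and memory reads, PC+1 otherwise.
        if load_input or mem_read:
            next_address = input_value & 0xFFFF
        else:
            next_address = (self.pc + 1) & 0xFFFF
        self.pmar = next_address    # the PMAR is updated on every cycle
        if not mem_read:            # the PC gets no clock pulse on memory reads
            self.pc = next_address


reg = RegPC(0x0100)
reg.clock()                                    # normal fetch: both become 0x0101
reg.clock(mem_read=True, input_value=0x2000)   # data read: PMAR moves, PC is held
held = (reg.pc, reg.pmar)
reg.clock()                                    # execution resumes from PC+1
resumed = (reg.pc, reg.pmar)
print([hex(v) for v in held], [hex(v) for v in resumed])
```

The point of the sketch is only that holding the PC requires nothing more than gating its clock, while the PMAR follows the selected address every cycle.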

Dieter (ttlworks on the 6502 forum) has generously drawn a block diagram showing both the old and new circuits (new circuit is on top), which is much easier to understand than my crude logisim model files (thanks for that, Dieter!):

Attachment: RegPC1.png


Now the critical paths of the Fetch and the Decode-Execute stages are very balanced, with a top clock frequency well above my 16 MHz goal, as shown in the updated Timing Chart diagram https://github.com/John-Lluch/CPU74/blob/master/Docs/TimingChartV12.png

Joan

[In the following days/weeks, I will work on making the Logisim simulation actually run code. Provided there are no major bugs in the model, this will involve creating the PLA arrays for the instruction decoder, and a lot of testing]


Fri Nov 13, 2020 6:37 pm

Joined: Mon Oct 07, 2019 2:41 am
Posts: 593
I suspect the fastest version would not have PC, but next instruction field in the opcode, like some of the very early machines. {OP}{DATA}{NEXT}. Ben.


Fri Nov 13, 2020 9:53 pm

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
oldben wrote:
I suspect the fastest version would not have PC, but next instruction field in the opcode, like some of the very early machines. {OP}{DATA}{NEXT}. Ben.

Well, I see a problem with that approach which is the instruction encoding length. On a 16 bit address machine this implies that every instruction requires 16 additional bits just to store the next instruction address. Given that most instructions are executed in memory sequence, I don't really see the advantage of that. This also complicates conditional branches I suspect...


Tue Nov 24, 2020 10:55 pm

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
At some point I realised that I would need a functional test suite for the CPU74 architecture. This is something I've been delaying, but I finally put some work into it. With the logisim model almost ready for testing, the test suite will come in handy to debug any issues.

I have it now half-finished and posted here https://github.com/John-Lluch/CPU74/tree/master/Test-Suite. I got my inspiration in part from the Klaus Dormann 6502 suite, but I am writing it as a 'c' source file instead. The tests are however essentially written in assembly, so there are a lot of 'asm' statements embedded in the 'c' code main structure (the .s file is just the output of the compiler).


Tue Nov 24, 2020 11:09 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1783
Good move! Things always take a leap forward when you have a test suite - and once you've got it, it's not too hard to extend it.


Wed Nov 25, 2020 9:01 am

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
I completed the CPU74 Functional Test Suite, and successfully ran it both in the software simulator and the logisim logic model.
For reference this is the test suite output with the 'log' option enabled
Code:
/Users/joan/Documents-Local/Relay/CPU74/Simulator/DerivedData/Simulator/Build/Products/Release/c74-sim
preMovCmp
   ...pass
preTestCall
   ...pass
preShortBranch
   ...pass
branchAddress
   ...pass
callAddress
   ...pass
branchCondtion
   ...pass
branchCondtion32
   ...pass
prefixEdge
   ...pass
byteShiftsAndExtensions
   ...pass
bitShifts
   ...pass
stackFrame
   ...pass
loadStoreOffset
   ...pass
loadStoreIndex
   ...pass
loadStoreAddress
   ...pass
selectAndSet
   ...pass
addSubNegTest
   ...pass
addSubTest32
   ...pass
andOrXorNotTest
   ...pass
Executed instruction count: 16349
Total cycle count: 20783
Elapsed simulation time: 0.117987 seconds
Calculated execution time at 1MHz : 0.020783 seconds
Calculated execution time at 8MHz : 0.00259787 seconds
Calculated execution time at 16MHz: 0.00129894 seconds
Program ended with exit code: 0

The test suite source code is pushed here:

https://github.com/John-Lluch/CPU74/blob/master/Test-Suite/TestUnits.c

Tests cover all available instructions and addressing modes, including edge cases and a number of use case scenarios, in a 2K long machine program. However, due to the relative slowness of the logisim model, the tests do not go as far as iterating over all the possible operand values of any given instruction. The tests can however be updated in the future to cover all possible values. For example, the 16 bit arithmetic test can be updated to iterate over all possible pairs of 16 bit operands, which would represent 4 thousand million additions. At the effective speed of the real processor that would still be a totally reasonable wait.
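As a rough back-of-the-envelope check of the "reasonable wait" claim, the sketch below counts the operand pairs and turns them into a run time. The figure of 10 clock cycles per tested addition (loop overhead and result check included) is purely a guess made for this example, not a measured value, and 16 MHz is simply the clock goal mentioned earlier in the thread.

```python
OPERAND_BITS = 16
CLOCK_HZ = 16_000_000         # the 16 MHz goal mentioned earlier in the thread
CYCLES_PER_ITERATION = 10     # assumption: add + compare + loop overhead

# Every combination of two 16 bit operands.
pairs = (2 ** OPERAND_BITS) ** 2

seconds = pairs * CYCLES_PER_ITERATION / CLOCK_HZ
print(f"{pairs} additions, about {seconds / 60:.0f} minutes under these assumptions")
```

Changing `CYCLES_PER_ITERATION` scales the estimate linearly, so even a several times slower loop stays within a few hours on real hardware.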

The microinstruction decoding table that runs on the logisim model looks like this:

https://github.com/John-Lluch/CPU74/blob/master/Simulator/LogisimSupport/DecoderRomTruthTableV10_Full.txt

I have also pushed the logisim model to the github repo, so it can be found under the "Logisim" folder

EDIT: I thought it could be interesting to give a visual idea of how the logisim model looks while running the test suite. So I recorded a quick video (a straight recording of my computer screen with my phone :D), and posted it to youtube.

https://youtu.be/eN_2K4hNeW8

(sorry for the low video quality)


Tue Dec 01, 2020 11:37 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 593
Testing a design is really two parts. The first part is checking that the logical blocks work correctly.
That is where simulation is useful. The second part is when you have a fault in the hardware
and you need a process to go from what works to what is not working. That is something that
takes a lot of creative thinking, with simple test programs. That is why the front panel was part
of a machine until the late 1970's. I found having one is good for catching stupid mistakes
with a design that goes through a lot of revisions. Having spent the wee hours of the morning
debugging floating point I/O routines, with the HALT instruction I got to see why the thing was not working. A typo in a temp variable was the problem: it saved to temp1 but later loaded temp2.
Ben. Now off the computer and to bed.
PS: Really fancy front panels could do all kinds of things, like setting breakpoints in a running
program. Having a single-step software trap is also useful.


Tue Dec 01, 2020 1:38 pm

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
Hi Ben.

You are making good points. In the model I already have a clock circuitry that allows it to run step by step, and two clock frequencies that can be switched on the fly without glitches. This is the clock switching circuit in the logisim model:

Attachment: Clock.png


The actual input 'clock' signals are represented as small square waves on the left. The clock signal that goes to the processor is the 'CLK-PH' signal on the right side. There's a "slow/fast" switch, a "run" switch, and a "tick" button. The "halt" input is connected to a control signal provided by the 'halt' instruction, so it stops the clock output in sync with it.

I suppose that on the real thing (in case it is ever made) I will also have some way to pick up the current program address and instruction, probably by means of a small Arduino board polling the buses, so I can debug things while running it step by step or at a very slow frequency.


Wed Dec 02, 2020 9:50 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 593
Having a larger TTL design with a front panel, halt just clears the run flip/flop.
The microcode switches from decoding the IR register to decoding the front panel
inputs. The clock is always running. The advantage is that I can display the
registers when halted or in real time. The SWR is defined as an IO device, which
saves me from having more connections to and from the ALU.
As it is, a 100 pin connector (.125" pitch) is just ample for the mother
board.


Wed Dec 02, 2020 9:26 pm

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
I decided to push this thing a bit more and implemented a 3-stage pipelined version of the processor (instead of 2 stages). This is something that I had in mind for some time, and with the functional test suite in place, it looked like the right time to attempt it.

The pipeline consists of the following:

Attachment: Pipeline.png


It's a relatively classic implementation of a 3 stage pipeline.

- Load/stores still take 2 cycles. During the first cycle the address is calculated and stored in an internal register. During the second cycle the actual write or read from memory is performed.

- Taken branches now use 3 cycles (instead of 2). This is because the pipeline now has a two cycle execution latency: by the time the branch is determined to be taken, there are already 2 following instructions being processed.

- Subroutine calls and returns are also affected by the two cycle latency. They now take 4 cycles instead of 3 cycles.

- Read after Write (RAW) data hazards can appear, for the general purpose Registers and the SP: they are detected and solved in the usual way.

- The decoder PLA is identical, except that some control signals are registered to be used one cycle later

The Logisim Model is implemented and working, it fully passes the Test Suite, and the simulation already feels 50% faster. The differences are the following:

- After decoding, control signals related to ALU and write back, are registered to be used on the following cycle

- The ALU has registered inputs, so that the simultaneous decoding of the following instruction does not interfere execution of the current one.

- The register file has hazard detection circuitry. It simply forwards the value on the alu output bus to the alu inputs if a register collision is detected.
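The forwarding rule in the last point can be sketched in a few lines. This is only an illustration of the general technique: the function name `read_operand`, the 8-entry register file and the single pending write are assumptions for the example, not a description of the actual Logisim circuit.

```python
def read_operand(registers, source, pending_dest, alu_out):
    """Return the value for an ALU input.

    If the previous instruction is about to write the register we want to read
    (a read-after-write collision), take the value from the ALU output bus
    instead of the not yet updated register file.
    """
    if pending_dest is not None and source == pending_dest:
        return alu_out            # forwarded value
    return registers[source]      # normal register file read


registers = [0] * 8
registers[1] = 5

# Instruction 1: r2 = r1 + r1. The result is on the ALU output bus,
# but has not been written back to the register file yet.
alu_out = registers[1] + registers[1]
pending_dest = 2

# Instruction 2 reads r2 and r1 while that write is still pending.
forwarded = read_operand(registers, 2, pending_dest, alu_out)
plain = read_operand(registers, 1, pending_dest, alu_out)
print(forwarded, plain, registers[2])
```

Without the comparison against `pending_dest`, the second instruction would read the stale value still held in the register file.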

The critical path analysis shows that it now runs at 30 MHz. This is the timing diagram:

https://github.com/John-Lluch/CPU74/blob/master/Docs/TimingChart-P.png

Compared with the previous 2 stage pipeline version the following performance enhancements apply

- Normal instructions (1 cycle) : 30 MHz / 16 MHz -> 87.5% faster
- Load/Stores (2 cycles) : 30 MHz / 16 MHz -> 87.5% faster
- Taken Branches (3 cycles vs 2 cycles) : (30/16) * (2/3) -> 25% faster
- Subroutine calls and returns (4 cycles vs 3 cycles) : -> (30/16) * (3/4) -> 40 % faster

Considering that on average 20% of instructions are branches and 60% of them are taken (so 12% of all instructions are taken branches), that 5% of instructions are call/returns, and that 30% are load/stores, which leaves 53% for single cycle instructions, that results in the following:

* 2 stage pipeline at 16 MHz: 1/(0.12*2 + 0.05*3 + 0.3*2 + 0.53) = 0.66 instructions per clock pulse -> 10.5 Million instructions/ second
* 3 stage pipeline at 30 MHz: 1/(0.12*3 + 0.05*4 + 0.3*2 + 0.53) = 0.59 instructions per clock pulse -> 17.8 Million instructions/ second

for an overall speed improvement of: 17.75 / 10.53 = 1.686, that is, 68.6 % faster
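The estimate above can be reproduced with a short script. It only uses the instruction mix and cycle counts stated in this post; the dictionary keys and the function name are made up for the example, and the overall speed-up is taken as the ratio of the 3 stage throughput to the 2 stage throughput.

```python
# Instruction mix stated above (fractions of all executed instructions).
mix = {
    "taken_branch": 0.20 * 0.60,   # 20% branches, 60% of them taken
    "call_return": 0.05,
    "load_store": 0.30,
}
mix["single_cycle"] = 1.0 - sum(mix.values())   # everything else

cycles_2_stage = {"taken_branch": 2, "call_return": 3, "load_store": 2, "single_cycle": 1}
cycles_3_stage = {"taken_branch": 3, "call_return": 4, "load_store": 2, "single_cycle": 1}


def mips(cycles, clock_mhz):
    """Million instructions per second for a given cycle table and clock."""
    cycles_per_instruction = sum(mix[kind] * cycles[kind] for kind in mix)
    return clock_mhz / cycles_per_instruction


mips_2 = mips(cycles_2_stage, 16)
mips_3 = mips(cycles_3_stage, 30)
speedup_percent = (mips_3 / mips_2 - 1) * 100

print(f"2 stage: {mips_2:.1f} MIPS, 3 stage: {mips_3:.1f} MIPS")
print(f"overall: {speedup_percent:.1f} % faster")
```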

This is very good news considering that the implementation differences are really minimal. The top logisim circuit looks like this after it completed all the tests:

https://raw.githubusercontent.com/John-Lluch/CPU74/master/Docs/LogisimDocs-P/Main.png

That's pretty good and I'm really pleased with it !
That said, there's still some room for further improvement, which I will disclose in another post. The extra gains in this case would not be that spectacular, but not minor either. Unfortunately, they would come with a non-negligible amount of complexity, so it remains to be seen if they will be worth the effort. What I have now is probably a good balance between performance and complexity, and it is already MUCH better than my initial goals, so it may be a good candidate for the final design after all.


Sat Dec 05, 2020 3:25 pm

Joined: Sat Nov 28, 2020 4:18 pm
Posts: 123
So, I have been watching this thread for over a year now. Thank you so much for sharing your ideas and your design process!

I have been inspired by your work to also build a 16-bit CPU. My journey has been a bit different, but perhaps I will share some of that in a separate thread. But your realizations have become my realizations as well, and have sparked lots of good thought and research. So thank you! (BTW I have not copied your design, just been heavily influenced by it, I hope that is okay.)

So, it's awesome to see you implementing a pipeline, and that speed boost is impressive. I also implemented a pipeline but I went with a 5 stage design that I am now regretting. I wish I had gone with a 3 stage like you from the beginning, but the amount of rework would be I think too high. I guess it's easier to add pipeline stages than to remove them. Anyway, with two extra stages there's a lot more forwarding required, and I fear there may be too many pipeline registers to make implementing in TTL practical. But we will see.

One thing I am now thinking about is OS support. Things like virtual memory, memory protection, supervisor mode, etc. Do you have any thoughts about how that would look with your CPU?


Sun Dec 06, 2020 2:23 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1783
Welcome rj45! It'll be interesting to hear about your inventions.

Joan: thanks for sharing the results from repipelining your machine. Very educational I think, and a good result too, to see just a few dents in cycle count but a healthy improvement in cycle time.


Sun Dec 06, 2020 3:59 pm

Joined: Fri Mar 22, 2019 8:03 am
Posts: 328
Location: Girona-Catalonia
Hi rj45

Thanks for your comments.

It's ok to get inspiration from this, if that's useful in some way. I got my inspiration from several sources too. I started by studying the 16 bit TI MSP430 processor, and looked at the 8 bit AVR processors. From the latter, I got the idea of the 'carry' instructions, enabling all data widths to be processed (including comparisons) by a smaller width ALU. The idea of prefixed immediate values was given to me by someone in this forum, and later on I found that the same kind of thing was used by the RISC-V "compressed" instruction set, although with a different name. While implementing the compiler I realised the importance of conditional instructions, other than branches, and that made me look at the ARM, and the ARM Thumb. That helped to connect the missing dots. For the hardware implementation I'm getting strong influence from the work of 6502.org forum member Drass (particularly his 20 MHz C74-6502). He also helped me to understand basic hardware concepts, as I was a total noob when I started this.
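As a small illustration of the 'carry' idea mentioned above, this is how a 32 bit addition can be built from two passes through a 16 bit adder, the second one consuming the carry of the first. It is a generic sketch of the technique, not CPU74 code: the function names `add16` and `add32` are made up for the example.

```python
MASK16 = 0xFFFF


def add16(a, b, carry_in=0):
    """One pass through a 16 bit adder: returns (result, carry_out)."""
    total = a + b + carry_in
    return total & MASK16, total >> 16


def add32(a, b):
    """32 bit addition using only the 16 bit adder.

    The low halves are added first (plain 'add'), then the high halves are
    added together with the carry from the low half ('add with carry').
    """
    low, carry = add16(a & MASK16, b & MASK16)
    high, carry = add16((a >> 16) & MASK16, (b >> 16) & MASK16, carry)
    return (high << 16) | low, carry


result, carry_out = add32(0x0001FFFF, 0x00000001)
print(hex(result), carry_out)
```

The same chaining works for any multiple of the ALU width, and a subtract-with-borrow variant covers wide comparisons in the same way.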

About OS support, I'm still undecided. When I started this I just wanted to run "space invaders" and "basic", so no OS was required, really. But at this time I am quite confused on what to do. I have looked at this
https://pdos.csail.mit.edu/6.828/2020/xv6/book-riscv-rev1.pdf
https://github.com/mit-pdos/xv6-riscv,
but to be honest I still have to learn everything about operating systems at their core, so I'm really very far from being able to use that. I also believe that a 16 bit address space is not enough for anything relatively serious, such as a proper operating system with virtual memory and memory protection, so I may just implement my own BASIC interpreter and that would be it.


Sun Dec 06, 2020 5:11 pm