Last visit was: Fri Nov 01, 2024 12:09 am
It is currently Fri Nov 01, 2024 12:09 am



 [ 91 posts ]  Go to page Previous  1 ... 3, 4, 5, 6, 7  Next
 8 bit CPU challenge 
Author Message

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1803
MichaelM wrote:
Well that was fun. Still need an assembler, but given the small number of opcodes, I was able to hand assemble and test my first Fibonacci series computation routine for the MiniCPU. From initialization to initialization is 810 cycles. Initialization of the workspace pointer, XP, the loop count (23), and the first two Fibonacci numbers (F0 = 0, and F1 = 1) is 24 cycles (17 instructions). Total number of instructions is 28 in the Fibonacci loop thus the CPI of program is approximately ((810 - 24) / 23) / 28 = 1.22 cycles per instruction.

Pretty pleased with the overall result. The compact encoding used for instructions, the workspace relative storage, and the simple load/store architecture appears to be efficient enough for the calculation of the first 23 Fibonacci numbers (after initialization of the recurrence).

It is a good result: that's a good CPI you're getting. I see the great majority of your instructions are single byte, not needing a prefix or nfix, so that's a good statement about the encoding. If your instruction count ends up on the high side, I suppose that's a statement about the expressiveness of the instruction set.

Revaldinho and Hoglet have been exploring Fibonacci programs with similar structure, to see what we can learn about the different architectures and different microarchitectures: we have the vanilla OPC5 with 8 instructions, the load-store variant with 16 instructions, and a more clock-cycle efficient flavour of the 8 instruction variant which uses more slices. And of course we're making a few coding errors and learning a few coding tricks at the same time.

You can see our sources here:
https://github.com/revaldinho/opc/blob/ ... 5/robfib.s
https://github.com/revaldinho/opc/blob/ ... s/robfib.s

The results, as a snapshot, looked like this:

Code:
Machine     Codesize    Cycles    LUTs     Slices
BFLY16      34 bytes    588       489      123
OPC5        56 bytes    519       225       68 (first effort)
OPC5-xp5    56 bytes    385       257       78 (first effort)
OPC5LS      52 bytes    473       299       91 (first effort)

OPC5        42 bytes    437       225       68 (recoded to suit the machine better)
OPC5-xp5    46 bytes    304       257       78 (recoded)
OPC5LS      42 bytes    437       299       91 (recoded)

OPC5LS      42 bytes    260       299       91 (slightly different approach)
OPC5-xp5   ?42 bytes    203       257       78 (slightly different approach)


The "slightly different approach" here is hoglet's idea, to check for termination using the main addition, instead of a separate comparison, and also to unroll the loop a bit which allows better register use by ping-ponging between two:
Code:
        ORG 0x0000

fib:
        mov     r4, r0, results
        mov     r5, r0, fibEnd
        mov     r6, r0, fibLoop
        mov     r10, r0, 1

        mov     r1, r0    # r1 = 0
        mov     r2, r10   # r2 = 1
fibLoop:
        add     r1, r2
        c.mov   pc, r5     # r5 = fibEnd
        sto     r1, r4     # r4 = results
        add     r4, r10    # r10 = 1
        add     r2, r1
        c.mov   pc, r5     # r5 = fibEnd
        sto     r2, r4     # r4 = results
        add     r4, r10    # r10 = 1
        mov     pc, r6     # r6 = fibLoop

fibEnd:
        halt    r0, r0, 0x999

        ORG 0x100
results:


I think it's interesting that we're exploring freedoms in
- instruction set architecture (more powerful opcodes)
- microarchitecture (pipelining, removal of dead cycles)
- coding optimisations (choosing best instruction sequences)
- algorithm (different loop termination strategies)

Coming next, I think, is an xp version of the ls machine, because the 16 instruction variation does seem to be a win, and the xp microarchitecture is not too costly in slice count. Also we've added a sign bit flag and two corresponding predicates, which makes some things a little easier to code and faster to run.


Mon Jun 12, 2017 2:47 pm

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
Those are interesting results. I particularly like the way that you guys unrolled the loop, and used the predicated move instruction to terminate the loop at an odd number.

The transputer-inspired encoding has some benefits in the sense that the MiniCPU's 8-bit ALU (isn't the OPC5-series cheating a bit with an internal 16-bit ALU :D ) requires loading and unloading each half of the sum of the Fibonacci series recurrence equation. The encoding provides a very compact representation of the program in a very competitive total of 45 bytes, but the number of operations required is higher than the OPC5 and BFLY16 are achieving because of number of load/store operations required by its 8-bit ALU for a 16-bit operation. There is also an obvious architectural limitation: the indirect mechanism to access the right operand register, B. That architectural limitation reduces the MiniCPU's computational efficiency even for 8-bit operations. The compact instruction encoding makes it less noticeable, but it has an impact on the program I provided above.

I am particularly intrigued by the predicated instruction feature included in the OPC5. I don't do much low level programming work with the ARM, nor have I needed to examine the assembler output of its C compiler. Therefore, I was not really aware of the technique and its applicability until Rob Finch started including predicated instructions on one of his projects.

All that being said, can the c.mov PC, r5 instruction used above not be aliased as bcs PC, r5? In reality, I think all valid combinations of the predicates to the OPC5 move instruction should be counted as separate instructions; the special purpose terminal registers of the instruction may also need to be counted as distinct instructions.

_________________
Michael A.


Tue Jun 13, 2017 12:27 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1803
You're right, on several counts! The OPC5s have an unfair advantage on the Fib task through being 16 bit machines. We're only arguably qualified for the 8 bit challenge, in that we expect finally to implement a shim to interface to an 8 bit memory system. Rather like the COSMAC, we're seeing what we can do within the implementation constraints, without fully embracing the architectural constraint.

And yes, the conditional moves to the PC could be aliased as branch instructions, and that would be convenient and an improvement. When we drop the one-page-source limitation on the assembler, no doubt we'll do it. Note though that other instructions can also be predicated, and that can be an advantage. At least, I think it can. (But I'm not sure I'd count these variations as separate instructions - they are more mnemonics for a programmer to learn, but still the machine is not made more complex. I'd still call it a 16 instruction machine.)

Where an 8-bit machine could shine, compared to a 16-bit machine, is in handling 8-bit data. The likes of string searches, for example. We don't have example code for that. We do now have a byte swap instruction and support in the assembler for packed strings. I expect it will be a bit messy to deal with packed strings, but a bit memory-inefficient to use non-packed strings.


Tue Jun 13, 2017 12:41 pm

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
Got a C-based simulator, one that does not include handling of Reset or Interrupts, going this weekend. Will update the github repository for the MiniCPU with this code in the next day or so. Found a couple of minor bugs in the code for emulating the BNC/BPL and LDL/LDN/LDY instructions which were easy to fix. The simulator successfully executes the fibonacci test program published above. That program does not use all of the instructions in the MiniCPU instruction set, but many of those instructions are single cycle ALU operations that can be easily verified.

Also over the weekend extended the ALU to 16-bits and modified the microprogram to support double reads and writes. In addition, added registers for all three flags not just the C flag, and added a registered V flag to support the single signed comparison, BGE, previously discussed. In the process of adjusting the fibonacci series test program for this 16-bit implementation of the MiniCPU to determine the cycle count. It should reduce the cycle count dramatically given all of the loading and unloading required to compute 16-bit extended precision numbers using an 8-bit ALU.

The ALU's architecture is not changing, so there's still the limitation that the B register, the right operand for any two operand ALU operations, must still be loaded through the A register, the left operand and destination of two operand ALU registers. Even by increasing the register and ALU operations from 8 to 16 bits, the slice count increases only marginally from 80-84 Spartan 6 slices to 105-107 Spartan 6 slices. That implies that there may be room to make the ALU a 3 register stack like the transputer with automatic pushing and popping of the registers. Additional instructions would be needed, however, to provide better access to the registers. Without the additional instructions, the slice count of a three 16-bit register stack version of the ALU is around 125-127 Spartan 6 slices. This means that this solution would be at the limit for the challenge, and although, it uses a 16-bit address bus and an 8-bit data bus, it would not have direct support for 8-bit data.

_________________
Michael A.


Mon Jun 19, 2017 2:06 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1803
The lure of 16 bits! Interesting update, thanks.


Mon Jun 19, 2017 3:10 pm

Joined: Fri Jun 23, 2017 12:31 pm
Posts: 1
MichaerlM wrote:
Got a C-based simulator, one that does not include handling of Reset or Interrupts, going this weekend. Will update the github repository for the MiniCPU with this code in the next day or so. Found a couple of minor bugs in the code for emulating the BNC/BPL and LDL/LDN/LDY instructions which were easy to fix. The simulator successfully executes the fibonacci test program published above. That program does not use all of the instructions in the MiniCPU instruction set, but many of those instructions are single cycle ALU operations that can be easily verified.


How many fibbonaci numbers does it go up to?


Last edited by Bateman on Thu Oct 28, 2021 1:55 pm, edited 3 times in total.



Thu Jun 29, 2017 8:58 am

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
The current test is for the first 25 Fibonacci numbers: 0, 1, 1, 2, 3, 5, 8, ....

_________________
Michael A.


Thu Jun 29, 2017 11:29 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1803
How are you getting on with your CPU project Michael? In particular I'm interested to know what test programs or projects you might have managed to run. Are you still in the world of hand-assembly?


Sat Jul 01, 2017 2:34 pm

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL

I don't have an assembler yet. I can see my way forward on how to create an assembler using AWK and Lua. I may use this weekend and the holiday next week to begin work on the assembler. In the meantime, I have been working on creating a model of the CPU core in C. I think that it is complete, and I've pushed an update to a separate repository I created on github to hold the models of the CPU in various languages:MiniCPU-Src.

At the present time, both the C model and the iSim testbench for the CPU execute the same Fibonacci series program correctly. There are a number of instructions of the MiniCPU that the Fibonacci series program does not use, so those portions of the model cannot be claimed to be tested and verified at this time: LDY/STY/SWP, BPL, JSR/RTS, ROL/ASL/ROR/ASR, CPL/AND/ORL/XOR. (Other than JSR/RTS, the unused instructions are fairly simple and can generally be verified by inspection of the RTL/C, but I've learned that if it is not explicitly tested, then it must be considered to be wrong.)

My intent/objective for the C model is for its console output data stream to be used, when captured to a text file, as a test vector file for a self-checking Verilog testbench for the MiniCPU RTL model. At this time, I have not checked that my output from the C model can be used in the manner that I intend. As part of the effort to get the output of the C model in a useful form, I will be adding a simple disassembly of the program memory to the output. An example output file can be found in the root/head directory of the repository; the output does not yet include the disassembly of the instructions.

The C model and the test program handle the defined reset mechanism. I think that I stretch the reset signal 1 cycle in the RTL model, and that effect is not currently modeled in the C model. The current C model does yet support fielding interrupts from the main program. It will accept a RST signal, and generate the vector pull cycles for that event, but I've not yet included the feature to generate interrupt requests from the main program to the CPU model.


_________________
Michael A.


Sat Jul 01, 2017 4:59 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1803
Thanks for the update. A self-checking testbench is a good idea. I'm not too surprised you haven't yet modelled interrupts. Presumably it would be relatively easy to add subroutine calls into the Fib program? It's true that we haven't yet written (and probably won't write) an actual instruction-testing suite for the OPC-5 which tries to exercise most of the common behaviour of the machine. But we do know from 6502 land that when Klaus Dormann did that, it very much helped several HDL cores and emulators.


Sat Jul 01, 2017 5:04 pm

Joined: Tue Dec 31, 2013 2:01 am
Posts: 116
Location: Sacramento, CA, United States
MichaelM wrote:

I don't have an assembler yet.

What's with the huge font, Michael? Did you wind up with the same eye affliction that BDD had/has?

Mike B.


Sun Jul 02, 2017 7:28 pm

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
Mike:

I am having a bit of an issue with one eye, but it was mostly due to losing the video card for the laptop that I've used for the past 8-10 years. I am bringing up a new laptop with roughly twice the screen resolution. I was warned that the additional resolution might make it hard to see some pages, but I like the smoothness to the displayed text that the additional resolution provides. I've found that smooth edges reduces eye strain for me. So I generally use large display fonts for most editors such as MS Word, Ultra Edit, Xilinx ISE Editor, etc.

I normally zoom most pages, but when preparing the response to Ed's question, I just tried increasing the font size instead. In the response prep window, the increased font size was not as big as it appears after posting the response.

It does make it easy to read, :) but it does seem as if I was shouting the response back. Good thing I did not also use all caps. :D

_________________
Michael A.


Sun Jul 02, 2017 10:43 pm

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
I managed to convert my functioning 8-bit CPU model into a 16-bit CPU with an 8-bit bus. I've also added some code to make measurements of some performance parameters. Interestingly, 1000 loops to calculate 25 Fibonacci numbers using the 8-bit implementation produces the following results:
Code:
Iterations            :   1000
Instruction Count     : 620002
Data Access Count     : 190000
Read Access Count     : 116000
Write Access Count    :  74000
Clocks per Instruction:  1.306

The same test using the 16-bit version of the CPU produces the following results:
Code:
Iterations            :   1000
Instruction Count     : 500002
Data Access Count     : 238000
Read Access Count     : 140000
Write Access Count    :  98000
Clocks per Instruction:  1.476

These are interesting results that, in some ways, support the idea that processors like the 68008 and 8088 operated at a significant handicap because of their 8-bit data buses. The number of data memory access went up, which I attribute to fact that 16-bit CPU utilizes a 16-bit loop counter. Since the loop counter is really an 8-bit quantity, there are many unnecessary data memory read/write cycles to access a byte that has no information. Adding instructions to the 16-bit CPU to support 8-bit data operations would alleviate this inefficiency. I suspect that adding those instructions will blow out the processor's 128 slice limit. In RTL, the 16-bit CPU uses about 110 slices compared to the 80 slices of the 8-bit CPU.

In the final analysis, the 16-bit CPU with an 8-bit data bus does not appreciably improve the performance for the MiniCPU architecture. The improvement is only about 9% in total number of cycles, and 11 bytes of program length.

_________________
Michael A.


Sat Jul 08, 2017 12:31 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1803
Interesting finding. You did get a gain, but not so much as you might like.

(I've posted an update on our adventures in another thread. We're still in the intersection of the one page challenge and the 128 slice challenge. We have now got a story for using 8 bit memory, which of course gives us a performance hit compared to the 16 bit story, and can run it in a real implementation. We have an interrupt too - will probably need to exceed the one page limit to add a second nestable interrupt.)


Thu Jul 13, 2017 12:22 pm

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
BigEd:

It was a bit disappointing. I had expected better performance due to the computation of the Fibonacci number in a single cycle. However, after tabulating the results, and giving it some additional thought, I realized I had not taken my analysis of the change far enough.

It does set the stage for additional explorations using pre-fetch queues and/or internal caches. I read your report in the OPC thread with interest this morning because your cache selections match the ones I was considering applying to the MiniCPU. I had concluded that adding cache to the core when running with internal memory probably would not improve performance too much unless the path from cache to the processor core increased the bus width to at least 16 bits. An external 8-bit data bus would provide a performance bottleneck that cache may alleviate somewhat, but your results indicate that, for even these simple processors, larger caches are needed to provide reasonable performance.

_________________
Michael A.


Thu Jul 13, 2017 8:06 pm
 [ 91 posts ]  Go to page Previous  1 ... 3, 4, 5, 6, 7  Next

Who is online

Users browsing this forum: No registered users and 0 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Jump to:  
Powered by phpBB® Forum Software © phpBB Group
Designed by ST Software