


 [ 91 posts ]  Go to page Previous  1, 2, 3, 4, 5, 6, 7  Next
 8 bit CPU challenge 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2153
Location: Canada
Quote:
MAP calculates the area required by the netlist to be 133 Spartan 6 slices

The discrepancy may be due to the tools' ability to share resources and optimize away redundancy when an entire project is built as opposed to just the cpu. But it seems like a lot for a small cpu. The tools do say that implemented designs are often smaller than synthesized ones in terms of slice counts.
Are you sure that all the signals are connected when built into a project ? I have sometimes forgotten to connect a clock line or a bus enable line, and suddenly the resource count goes way down as stuff gets unexpectedly optimized out.

_________________
Robert Finch http://www.finitron.ca


Mon May 15, 2017 1:25 am WWW

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
robfinch wrote:
Are you sure that all the signals are connected when built into a project ?
This is the same file that I've tested extensively in an FPGA. I did make a change in the port list, but the resources reported were the same before and after that change. As you suggest, I will check whether something has otherwise been trimmed.

_________________
Michael A.


Mon May 15, 2017 3:40 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1796
"slices" might not be the best measure of a design's complexity, as a slice might only be partially used. It's also true that logic of any complexity has many possible implementations, which will take different amounts of resources. We can ask 'does a design fit within this constraint' or we can ask 'how big is this implementation' but we can't really ask 'how big is this design'.

I used Design Explorer to run off 7 implementations with different tactics, on another design, and got this:
Code:
135.851 MHz in 128 occupied slices
129.366 MHz in 146
132.714 MHz in 130
141.945 MHz in 128
125.298 MHz in 134
115.754 MHz in 124
132.996 MHz in 120

If I'd got any of these from a single run, I might have concluded the design was 146 and not yet small enough, or that it was 120 and I had room for some additional feature. But it's more useful to conclude that this design has an implementation in 146 slices and another implementation in 120.
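To make the spread concrete, here is a small sketch that summarises the seven runs listed above. The figures are copied from that listing; the variable names are only for illustration.

```python
# Figures copied from the seven Design Explorer runs quoted above:
# (maximum clock in MHz, occupied slices).
runs = [
    (135.851, 128),
    (129.366, 146),
    (132.714, 130),
    (141.945, 128),
    (125.298, 134),
    (115.754, 124),
    (132.996, 120),
]

slices = [s for _, s in runs]
smallest, largest = min(slices), max(slices)
spread_pct = 100.0 * (largest - smallest) / smallest

fastest = max(runs)                           # tuples compare by MHz first
smallest_run = min(runs, key=lambda r: r[1])  # fewest occupied slices

print(f"slices: {smallest}..{largest} (spread {spread_pct:.1f}% of the smallest)")
print(f"fastest run:  {fastest[0]} MHz in {fastest[1]} slices")
print(f"smallest run: {smallest_run[0]} MHz in {smallest_run[1]} slices")
```

The same design lands anywhere from 120 to 146 slices, which is a wide enough range to straddle a limit such as 128.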

The Xilinx tools give us a report at synthesis time, and another report after place and route. These differ.

Here's a bit of cpu.syr from a recent synthesis of Arlet's 6502 core:
Code:
Slice Logic Utilization:
 Number of Slice Registers:             157  out of   4800     3% 
 Number of Slice LUTs:                  426  out of   2400    17% 
    Number used as Logic:               418  out of   2400    17% 
    Number used as Memory:                8  out of   1200     0% 
       Number used as RAM:                8

Slice Logic Distribution:
 Number of LUT Flip Flop pairs used:    455
   Number with an unused Flip Flop:     298  out of    455    65% 
   Number with an unused LUT:            29  out of    455     6% 
   Number of fully used LUT-FF pairs:   128  out of    455    28% 
   Number of unique control sets:        18

And here's a bit of cpu.par from the place and route stage:
Code:
Slice Logic Utilization:
  Number of Slice Registers:                   157 out of   4,800    3%
    Number used as Flip Flops:                 157
    Number used as Latches:                      0
    Number used as Latch-thrus:                  0
    Number used as AND/OR logics:                0
  Number of Slice LUTs:                        372 out of   2,400   15%
    Number used as logic:                      368 out of   2,400   15%
      Number using O6 output only:             313
      Number using O5 output only:               0
      Number using O5 and O6:                   55
      Number used as ROM:                        0
    Number used as Memory:                       4 out of   1,200    1%
      Number used as Dual Port RAM:              0
      Number used as Single Port RAM:            4
        Number using O6 output only:             0
        Number using O5 output only:             0
        Number using O5 and O6:                  4
      Number used as Shift Register:             0

Slice Logic Distribution:
  Number of occupied Slices:                   120 out of     600   20%
  Number of LUT Flip Flop pairs used:          379
    Number with an unused Flip Flop:           235 out of     379   62%
    Number with an unused LUT:                   7 out of     379    1%
    Number of fully used LUT-FF pairs:         137 out of     379   36%
    Number of slice register sites lost
      to control set restrictions:               0 out of   4,800    0%


The slice register count of 157 happens to be the same in both phases of implementation, but in other designs the two figures differ, so this too is not a solid measure of the design.
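For anyone who wants to compare the two reports mechanically, a rough sketch follows. It only handles lines of the form "Number of ...: value" as they appear in the excerpts above; the real cpu.syr and cpu.par files contain much more, and the helper name counts is made up for this example.

```python
import re

# Lines copied from the two report excerpts quoted above
# (synthesis first, then place and route).
SYR = """\
 Number of Slice Registers:             157  out of   4800     3%
 Number of Slice LUTs:                  426  out of   2400    17%
 Number of LUT Flip Flop pairs used:    455
"""
PAR = """\
  Number of Slice Registers:                   157 out of   4,800    3%
  Number of Slice LUTs:                        372 out of   2,400   15%
  Number of LUT Flip Flop pairs used:          379
"""

def counts(report):
    """Map each 'Number ...' label to the first integer that follows it."""
    found = {}
    for line in report.splitlines():
        m = re.match(r"\s*(Number [^:]+):\s*([\d,]+)", line)
        if m:
            found[m.group(1)] = int(m.group(2).replace(",", ""))
    return found

syr, par = counts(SYR), counts(PAR)
for label in syr:
    print(f"{label}: {syr[label]} -> {par[label]} ({par[label] - syr[label]:+d})")
```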


Mon May 15, 2017 9:42 am

Joined: Sat Aug 22, 2015 6:26 am
Posts: 40
BigEd wrote:
"slices" might not be the best measure of a design's complexity, as a slice might only be partially used.

I'd be happy to use a better measure if there is one. The other obvious candidate is the LUT count. Does that seem more stable ?


Mon May 15, 2017 9:49 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1796
I think slice LUTs should be a better bet - it's a finer granularity - but to my slight surprise the number of LUTs changed (in this case reduced) when we went to P&R:
Number of Slice LUTs: 426 out of 2400 17%
Number of Slice LUTs: 372 out of 2,400 15%
Note that 120 slices contain 480 LUTs, but it's no great surprise that we don't always use all the LUTs in every slice, and we won't do until we've very closely packed the FPGA.
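The arithmetic behind that remark, as a quick sketch. The four-LUTs-per-slice figure is the one implied by "120 slices, 480 LUTs"; the other numbers come from the cpu.par excerpt earlier in the thread.

```python
import math

LUTS_PER_SLICE = 4        # implied by 120 slices holding 480 LUTs
occupied_slices = 120     # "Number of occupied Slices" in cpu.par
luts_used = 372           # "Number of Slice LUTs" in cpu.par

capacity = occupied_slices * LUTS_PER_SLICE
fill = luts_used / capacity
avg_per_slice = luts_used / occupied_slices
# Lower bound on slices if every slice could be filled completely.
ideal_slices = math.ceil(luts_used / LUTS_PER_SLICE)

print(f"{luts_used} of {capacity} LUT sites used ({fill:.1%})")
print(f"average {avg_per_slice:.1f} LUTs per occupied slice")
print(f"perfect packing would need at least {ideal_slices} slices")
```

The gap between that lower bound and the 120 slices actually occupied is one reason slice counts move around from run to run.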

In my view it's more interesting to see what sorts of things people come up with, against a slightly fuzzy target, than to see precise counts. It's not as if we'll have 100 contenders and a tight finish between half a dozen, with a big prize at stake! (Unless you forgot to mention the prize...)


Mon May 15, 2017 9:58 am

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
I have completed the implementation of the processor core I've been working on for the 8-bit challenge. It certainly has been fun. A block diagram of the processor core is provided in the following image. (Edit: Architecture Diagram below was replaced 30 May 2017, MAM)
Attachment:
MiniCPU-Architecture.JPG

The processor core is not yet fully tested. The core requires between 71 and 75 Spartan 6 slices. This includes a distributed memory 64 x 42 microprogram ROM. The core provides a 16-bit instruction pointer (IP), a 16-bit operand register (KI), a 16-bit workspace pointer (XP), a dual register 16-bit pointer stack (YP and YS), an 8-bit ALU with a left operand register (A) and a right operand register (B), and a single bit register for the carries (C). In the target FPGA, the Xilinx tools report that the maximum operating frequency of the core is 104 MHz+.
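As a rough feel for how much of that slice budget the microprogram ROM might take, here is a back-of-envelope sketch. It assumes each 6-input LUT holds one 64 x 1 column of the ROM and that a slice holds four LUTs; the post does not say how the tools actually mapped the ROM, so treat the result as an estimate only.

```python
import math

ROM_WORDS, ROM_WIDTH = 64, 42   # "64 x 42 microprogram ROM" from the post
BITS_PER_LUT = 64               # assumption: one 6-input LUT = one 64 x 1 ROM column
LUTS_PER_SLICE = 4              # assumption: four LUTs per Spartan 6 slice

rom_bits = ROM_WORDS * ROM_WIDTH
rom_luts = math.ceil(rom_bits / BITS_PER_LUT)
rom_slices = math.ceil(rom_luts / LUTS_PER_SLICE)

print(f"{rom_bits} bits -> about {rom_luts} LUTs -> at least {rom_slices} slices")
for total in (71, 75):          # slice range quoted in the post
    print(f"  roughly {rom_slices / total:.0%} of a {total}-slice core")
```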

The MiniCPU project is up on github. The README file provides more details on the instruction set and timing. The following image shows the summary results for targeting the core to a Xilinx XC6SLX4-3TQG144 FPGA.
Attachment:
MiniCPU-XC6SLX4-Summary.JPG

The following image shows the placement of the core in clock region X0Y0, which only contains 84 logic slices.
Attachment:
MiniCPU-XC6SLX4.JPG



_________________
Michael A.


Last edited by MichaelM on Wed May 31, 2017 12:47 am, edited 1 time in total.



Tue May 30, 2017 3:07 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1796
Very nice! A little bit transputery in the sort of nibble-wise instruction stream. I like the flexibility of that.

As the on-chip registers are all 16 bit and described as pointers, does that mean a loop counter would probably be in memory? Have you written any little example programs yet? I see it would be hand-assembly at this point, so perhaps too early to ask.


Tue May 30, 2017 8:20 am

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
BigEd wrote:
Very nice! A little bit transputery in the sort of nibble-wise instruction stream. I like the flexibility of that.

As the on-chip registers are all 16 bit and described as pointers, does that mean a loop counter would probably be in memory? Have you written any little example programs yet? I see it would be hand-assembly at this point, so perhaps too early to ask.

I am kind of partial to the transputer. It is a processor I really wanted to apply to a project in the late 80s and early 90s, but its radical departure from existing architectures led to a significant amount of resistance to its application to non-parallel computing. I have an Alta Technology Inmos Transputer Development System with a few TRAMs installed somewhere in the house, but no longer have a way to run the development software; not even sure where the TDS floppies may be located.

Both the MiniCPU for the 8-bit challenge and the MiniCPU-S (16-bit serial processor) use the transputer's instruction encoding approach. That approach allows variable length instruction representations with very compact representations for the most often used instructions.
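For readers who have not met the transputer's encoding, a minimal sketch of the general scheme follows. Each byte carries a 4-bit function code and a 4-bit operand nibble, and prefix instructions build up longer operands in an operand register. The function codes used here (pfix = 2, nfix = 6, ldc = 4) are the transputer's own; the MiniCPU's actual opcode assignments are in its README and are not reproduced here, so this is not a model of the MiniCPU itself.

```python
PFIX, NFIX = 0x2, 0x6   # transputer prefix function codes
LDC = 0x4               # transputer "load constant", used only as an example

def encode(function, operand):
    """Return the byte sequence for one instruction with the given operand."""
    if 0 <= operand < 16:
        return [(function << 4) | operand]
    if operand >= 16:
        return encode(PFIX, operand >> 4) + [(function << 4) | (operand & 0xF)]
    # Negative operands start with a negative prefix.
    return encode(NFIX, (~operand) >> 4) + [(function << 4) | (operand & 0xF)]

def decode(code):
    """Yield (function, operand) pairs from a byte stream."""
    oreg = 0                          # operand register
    for byte in code:
        function, data = byte >> 4, byte & 0xF
        oreg |= data
        if function == PFIX:
            oreg <<= 4                # make room for the next nibble
        elif function == NFIX:
            oreg = (~oreg) << 4       # complement, then make room
        else:
            yield function, oreg      # a real instruction consumes the operand
            oreg = 0

print([hex(b) for b in encode(LDC, 5)])       # one byte for a small operand
print([hex(b) for b in encode(LDC, 0x123)])   # two prefixes plus the instruction
print(list(decode(encode(LDC, -257))))
```

Operands from 0 to 15 cost a single byte, which is where the compactness for common cases comes from; every further nibble costs one prefix byte.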

Only the registers in the PCU module, IP, MAR, XP, YP, and YS, are 16 bits in width. The ALU A and B registers are 8 bits in width, the ALU C register is 1 bit in width. The address bus is 16 bits in width, and the data I/O bus, although separated into data in and data out busses, is 8 bits in width per the challenge rules. (I will update the processor architecture figure later today to more clearly define the register widths.)

Other than some simple tests of the instruction set (using the yet to be completed simulator in C), I've not written any complex programs using the instruction set. One obvious limitation, set by the rules of the challenge, is that all memory and I/O transactions take place over an 8-bit interface. Thus, loading and storing the pointer registers in the PCU requires multiple memory cycles. The same limitation applies to the pushing and popping of the return address (IP) from the external stack.
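To illustrate why a 16-bit return address costs two bus cycles each way over an 8-bit interface, here is a small behavioural sketch. The byte order, the descending stack direction, and the function names are all assumptions made for the example; they are not taken from the MiniCPU microprogram.

```python
def push16(memory, sp, value):
    """Push a 16-bit value as two 8-bit writes; returns (new_sp, bus_cycles)."""
    sp = (sp - 1) & 0xFFFF
    memory[sp] = (value >> 8) & 0xFF      # high byte first (assumed order)
    sp = (sp - 1) & 0xFFFF
    memory[sp] = value & 0xFF             # then low byte
    return sp, 2

def pop16(memory, sp):
    """Pop a 16-bit value as two 8-bit reads; returns (value, new_sp, bus_cycles)."""
    lo = memory[sp]
    sp = (sp + 1) & 0xFFFF
    hi = memory[sp]
    sp = (sp + 1) & 0xFFFF
    return (hi << 8) | lo, sp, 2

memory = bytearray(0x10000)               # 64 KiB address space
sp, cycles_push = push16(memory, 0x0200, 0xBEEF)
value, sp_after, cycles_pop = pop16(memory, sp)
print(hex(value), hex(sp_after), cycles_push + cycles_pop)
```

A call and its matching return therefore spend four memory cycles on the return address alone, before any instruction fetches are counted.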

An obvious performance enhancement would be to include an on-chip (in processor core) return stack. I have developed such a module to enhance the subroutine nesting depth of the P16C5s processor core, but I have not applied it to the MiniCPU because I did not have an accurate slice count estimate. That module, included in the project's github repository, has an estimated slice count that would allow its addition to the MiniCPU without violating the 128 slice limit imposed by the challenge rules.

Using an on-chip return stack would allow the subroutine call and return process to be implemented as single cycle operations (assuming no prefix instructions are needed for the offset). Adding such a capability, with a 64 x 16 LIFO, would naturally lead to questions about threaded code support for the return stack and/or one of the other two pointer registers: XP or YP.
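A behavioural sketch of such a 64 x 16 LIFO is shown below. It is not the module from the github repository; the class name, the method names, and the choice to raise an error on overflow or underflow (rather than wrap, as hardware might) are assumptions made for illustration.

```python
class ReturnStack:
    """Behavioural model of a 64-entry, 16-bit wide return-address LIFO."""
    DEPTH = 64
    MASK = 0xFFFF

    def __init__(self):
        self.cells = [0] * self.DEPTH
        self.count = 0

    def call(self, return_ip, target):
        """Save the return address and hand back the new IP in one step."""
        if self.count == self.DEPTH:
            raise OverflowError("return stack full")   # assumed behaviour
        self.cells[self.count] = return_ip & self.MASK
        self.count += 1
        return target & self.MASK

    def ret(self):
        """Recover the most recently saved return address."""
        if self.count == 0:
            raise IndexError("return stack empty")     # assumed behaviour
        self.count -= 1
        return self.cells[self.count]

stack = ReturnStack()
ip = stack.call(return_ip=0x1003, target=0x2000)   # outer call
ip = stack.call(return_ip=0x2010, target=0x3000)   # nested call
print(hex(stack.ret()), hex(stack.ret()), stack.count)
```

Because neither operation touches the external 8-bit bus, each can complete in one step, which is the attraction described above.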

Before considering these additional modifications, which as you have pointed out in your One Page Computer thread entries lead to ongoing refinements of the processor core, I want to get a simulator going, then an assembler, and finally the Mak Pascal compiler. With a few exceptions, I think the single byte access that the core's instruction set provides to 16 local workspace and non-local (global) memory locations should prove convenient in the implementation of some complex algorithms. I don't see any limitations yet except my current choice for IP-relative addressing of subroutines and no support of indirect addressing. It is a simple microprogram change to change the IP-relative addressing mode for subroutines to absolute addressing. Adding indirect addressing is also not too difficult, but adding more instructions to support indirect addressing may lead to a violation of the challenge rules.

Edit: Changed Microway to Alta Technology. Found my Transputer Development Card from Alta Technology: a PC/AT 16-bit ISA card with 10 Inmos Transputer TRAM slots, 2 dual slot T800D 32-bit Floating Point Transputer TRAMs, and 1 T222 16-bit Integer Transputer to interface to the ISA bus. Microway was another PC/AT accelerator card vendor from the late 80s and 90s.

_________________
Michael A.


Last edited by MichaelM on Sat Jun 03, 2017 10:55 pm, edited 2 times in total.



Tue May 30, 2017 1:20 pm

Joined: Sat Aug 22, 2015 6:26 am
Posts: 40
The rules may not be clear enough on this point, but the idea is that the CPU should be a "general purpose" device, at least as far as the 6502 can be considered general purpose. Small return stacks, especially if not fully user accessible, would, I think, compromise that a bit. Any design should at least be capable of passing variables on the stack, adding extra local variables, and supporting multi-tasking (where every task has its own stack). It should also allow monitor programs to be written that can dump the stack, or even single-step through the code (perhaps with some hardware support).

Imagine that your CPU was available in 1975, at the same price as the 6502. Would Wozniak have chosen it for the Apple 1 ?


Tue May 30, 2017 1:49 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1796
Hmm, it wasn't unheard of to have a hardware return stack in a microprocessor. It's not what we're used to now, but I wouldn't go so far as to exclude it. Be too restrictive, and you only allow the micros we actually got.


Tue May 30, 2017 2:07 pm

Joined: Sat Aug 22, 2015 6:26 am
Posts: 40
True, I don't want to be too restrictive. On the other hand, the CPU shouldn't be too restrictive either. Only having a 64 entry deep stack, for instance, may be too limiting for some applications.

It doesn't have to be exactly the same, but it should at least be competitive, and allow similar kinds of applications and hacking, preferably more, not less.


Tue May 30, 2017 2:36 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1796
It would be easy to exceed a 64 level stack, if you wanted to, but I'd argue it would also be easy to stay within a much smaller one. If that's what was on offer, the software would have stayed within the limit.

Garth often mentions that the 6502 stack is not as deeply used as one might suspect. I just ran a Basic timing benchmark, which exercises a lot of the Beeb's Basic interpreter, and was mildly surprised to see that it used as much as 136 bytes of stack. (That will be including IRQ handler usage, probably.)

For interest:
Quote:
Intel's 4004, the first microprocessor, had an on-chip hardware stack a mere three levels deep. When a year later the 8008 came out, it, also had a hardware stack, boosted to a depth of seven words. Since neither had push and pop instructions the stack contained only return addresses from CALL instructions. No, interrupts did not necessarily eat stack space. An interrupt cycle initiated a nearly-normal fetch. It was up to external hardware to jam an instruction onto the bus. Though most designers jammed an RST, a single-byte CALL, more than a few used alternatives like JMP.

With these tiny stacks developers constantly worried about deep call trees, for there were no tools to trace program execution. But programs were small. After getting burned once one learned to sketch simple execution path diagrams to track call depth.

- http://www.ganssle.com/rants/stackmanagement.htm


Tue May 30, 2017 3:11 pm

Joined: Sat Aug 22, 2015 6:26 am
Posts: 40
Another consideration is the future upgradability of the core. If you start with a small on-chip stack, can you move it off-chip later without breaking backwards compatibility?


Tue May 30, 2017 3:25 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1796
They are all interesting design questions, but my main point is not to add rules to the challenge.


Tue May 30, 2017 3:39 pm

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
The point of the "on-chip" return stack which I was suggesting as a possible performance enhancement is to speed the pushing and popping of subroutine addresses. Its proposed depth and width (64 x 16 bits, or 128 bytes) is essentially half of the page 1 stack of the 6502. The proposed call stack should be particularly attractive for FORTH VMs, for which the generally recommended call stack size for 16-bit FORTHs is 64 cells, which is equal to my proposed call stack depth.

I don't think I made any mention of moving the workspace into the "on-chip" return stack. This leaves that workspace pointer to manage any local variables and parameters in the external workspace, which is then only limited by the amount of data memory connected to the core. Also notice in the architecture diagram provided above that it is possible to implement the memory system in a Harvard configuration with separate (isolated) program and data memories. It is also quite easy and inexpensive in terms of area to put in a call stack limit warning trap to handle situations where the call stack is in danger of overflowing.
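One way such a call stack limit warning could work is a simple depth counter compared against a threshold. The sketch below is only an illustration of that idea: the class name, the default margin of four entries, and the flag-style interface are all made up for the example and are not taken from the MiniCPU sources.

```python
class CallDepthMonitor:
    """Tracks call nesting and flags when the stack is close to full."""

    def __init__(self, depth=64, margin=4):
        self.depth = depth      # total entries in the call stack
        self.margin = margin    # hypothetical: warn this many entries early
        self.level = 0          # current nesting level

    @property
    def warning(self):
        """True when fewer than `margin` free entries remain."""
        return self.level > self.depth - self.margin

    def on_call(self):
        self.level += 1
        return self.warning     # a core could raise a trap on this flag

    def on_return(self):
        self.level -= 1
        return self.warning

monitor = CallDepthMonitor()
flags = [monitor.on_call() for _ in range(62)]
print(flags.index(True) + 1)    # nesting level at which the warning first fires
```

In hardware this amounts to a small counter and a comparator, which is consistent with the claim that the trap would be inexpensive in area.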

Finally, there were many processors in the era in which the 6502 was developed that used on-chip call stacks. The PIC16C5x family, with its two level on-chip call stack, is an example of a processor from that era; it originated at General Instrument before Microchip was spun out and further developed the product line. If the challenge rules are such that the call stack needs to be able to be examined, then that may pose an area constraint that may make an "on-chip" call stack less attractive, but it would be possible to add that capability to the peripheral set of the final solution rather than within the core itself.

_________________
Michael A.


Tue May 30, 2017 6:44 pm