Reply to topic  [ 91 posts ]  Go to page Previous  1, 2, 3, 4, 5, 6, 7  Next
 8 bit CPU challenge 

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1202
Arlet wrote:
BigEd wrote:
I notice the redoubtable Eric Smith has a CDP1802 (COSMAC) core in VHDL - that's a contemporary (mid-70s) offering, has lots of registers and on-board DMA. Would be interesting to see how compact it could be.

Can you see if it meets the 128-slice requirement? It would be interesting to compare it to other designs, even if you didn't write it yourself.

I finally dug out a working installation of ISE - it's 12.4.
Did a default ("balanced") build of Eric's COSMAC core, unconstrained, no SoC around it, target xc6slx4-3-tqg144: it comes up at 129 MHz and uses 68 occupied slices. 9 LUTs are reported as used for distributed RAM.


Fri May 12, 2017 6:39 pm

Joined: Sat Aug 22, 2015 6:26 am
Posts: 40
BigEd wrote:
I finally dug out a working installation of ISE - it's 12.4.
Did a default ("balanced") build of Eric's COSMAC core, unconstrained, no SoC around it, target xc6slx4-3-tqg144: it comes up at 129 MHz and uses 68 occupied slices. 9 LUTs are reported as used for distributed RAM.

Slice count is pretty good. You could almost double it, and still stay under the limit.

Clock frequency is good too, but offset by the high number of clock cycles per instruction.


Fri May 12, 2017 6:57 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1202
You'd think so, but Eric's microarchitecture uses many fewer clocks than the original, AFAICT:
Quote:
The core executes each machine cycle in one clock cycle, whereas the CDP1802 required eight clock cycles per machine cycle.
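
To put a number on that quote, here is a minimal sketch of the arithmetic. It only uses the 129 MHz synthesis estimate reported above and the two clocks-per-machine-cycle figures from the readme; the function name is made up for illustration, and it says nothing about how many machine cycles each instruction needs.

```python
def machine_cycles_per_second(clock_hz, clocks_per_machine_cycle):
    """Machine-cycle rate for a given clock and clocks-per-machine-cycle ratio."""
    return clock_hz / clocks_per_machine_cycle

CLOCK_HZ = 129e6  # unconstrained synthesis estimate quoted in the thread

# Eric's core: one clock per machine cycle (per the readme quote).
core_rate = machine_cycles_per_second(CLOCK_HZ, 1)

# Hypothetical comparison: the same clock with the original CDP1802 ratio of 8.
original_ratio_rate = machine_cycles_per_second(CLOCK_HZ, 8)

print(f"1 clock/machine cycle:  {core_rate / 1e6:.3f} M machine cycles/s")
print(f"8 clocks/machine cycle: {original_ratio_rate / 1e6:.3f} M machine cycles/s")
```

So at the same clock, the one-clock-per-machine-cycle microarchitecture gets eight times the machine-cycle rate, which is why the cycle counts of the original part do not carry over.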


Fri May 12, 2017 6:59 pm

Joined: Sat Aug 22, 2015 6:26 am
Posts: 40
You're right. I skipped the readme, and looked into the VHDL, which I barely understand :)

However, reading the introduction, I noticed it says "In a Xilinx XC7A100T-1FGG484 FPGA, the core can run with a 62.5 MHz clock", which is a big difference from your 129 MHz on an older device.


Fri May 12, 2017 7:03 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1202
Yes, that is an interesting difference... I was running an unconstrained synth, so possibly the post-P&R results are much worse, and possibly a populated SoC with memory and peripherals brings it down further - it might be that there are long paths outside the CPU.

Edit: I eventually got a P&R of the SoC design (an Elf) which took nearly 8 hours and ended up with about 25MHz performance. It was 2501 occupied slices on a 6slx75 ($125) - as noted, this is just because the tiny fast CPU core is talking to a 64kByte distributed RAM, where we'd normally see block RAM used. Oddly, looking at the VHDL, it is a clocked RAM - so there must just be some nicety about the way it is expressed which prevents it being implemented as block RAM.
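
A back-of-envelope check of why the distributed RAM dominates that slice count. The per-LUT and per-slice figures are assumptions about the Spartan-6 fabric (a LUT6 used as a 64x1 RAM, four LUTs per slice), not numbers from this thread, and the estimate ignores address decoding, output multiplexing and the fact that only some slices can hold RAM.

```python
RAM_BYTES = 64 * 1024   # 64 KByte system RAM, as in the Elf SoC build
BITS_PER_LUT = 64       # assumption: one LUT6 used as a 64x1 distributed RAM
LUTS_PER_SLICE = 4      # assumption: four LUTs per Spartan-6 slice

def distributed_ram_cost(ram_bytes, bits_per_lut=BITS_PER_LUT,
                         luts_per_slice=LUTS_PER_SLICE):
    """Return (LUTs, slices) needed for storage alone, rounding up."""
    bits = ram_bytes * 8
    luts = -(-bits // bits_per_lut)        # ceiling division
    slices = -(-luts // luts_per_slice)    # ceiling division
    return luts, slices

luts, slices = distributed_ram_cost(RAM_BYTES)
print(f"storage alone: {luts} LUTs, {slices} slices")
```

Under those assumptions the storage alone comes to 8192 LUTs, or 2048 slices, which is in the same range as the 2501 occupied slices reported.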


Fri May 12, 2017 7:07 pm

Joined: Sat Aug 22, 2015 6:26 am
Posts: 40
Hmm... any idea of extending the architecture immediately runs into the problem that the opcode space is already well used.


Fri May 12, 2017 7:25 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1202
Well, there are two options there - the 9-bit byte, or the 16-bit extension!

I think that speed discrepancy might be because the system RAM is async, so when building the full SoC project, there's an enormous amount of distributed RAM which demands a huge FPGA and is a big P&R challenge. Reworking this core to use block RAM would be a big step forward. Or, if the core is simply being double-clocked, that would explain a 2x speed penalty.
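
For anyone unsure why async versus clocked reads matter here: block RAM has a registered read, so data appears one clock edge after the address is presented, while distributed RAM can be read combinationally. Below is a behavioural sketch of the two styles in Python, not the VHDL from Eric's project; the class and method names are made up for illustration.

```python
class AsyncReadRam:
    """Combinational read: data follows the address with no clock edge."""
    def __init__(self, size):
        self.mem = [0] * size

    def write(self, addr, value):
        self.mem[addr] = value & 0xFF

    def read(self, addr):
        return self.mem[addr]

class SyncReadRam:
    """Registered read: the output only changes on a clock edge."""
    def __init__(self, size):
        self.mem = [0] * size
        self.dout = 0  # output register

    def clock(self, addr, write_enable=False, din=0):
        # Read-first behaviour: the old contents are captured before the write.
        self.dout = self.mem[addr]
        if write_enable:
            self.mem[addr] = din & 0xFF
        return self.dout

a = AsyncReadRam(16)
a.write(3, 0xAB)
print(hex(a.read(3)))        # available immediately

s = SyncReadRam(16)
print(hex(s.clock(3, write_enable=True, din=0xAB)))  # old contents
print(hex(s.clock(3)))       # new contents, one edge later
```

A core written around the first style has to be reworked to tolerate the extra edge of latency before its memory can live in block RAM.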


Fri May 12, 2017 7:34 pm

Joined: Sat Aug 22, 2015 6:26 am
Posts: 40
The requirement is still for an 8 bit data bus, so any 16 bit extensions need to be done through multi-byte opcodes.


Fri May 12, 2017 8:30 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1202
You can still build a machine with a 16-bit byte; it just has to take two memory cycles for each access. It might or might not work out well, but you do get a wide machine with plenty of opcode bits, plenty of room for register fields, and reduced complexity from not needing variable-length instructions.
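
A minimal sketch of what "two memory cycles for each access" means on an 8 bit data bus. The low-byte-first ordering and the `read_word` helper are assumptions made for illustration; nothing in the thread fixes a byte order.

```python
def read_word(memory, word_addr):
    """Fetch one 16-bit word over an 8-bit bus.

    Returns (word, bus_cycles). Byte order is low byte first, which is
    an assumption of this sketch.
    """
    lo = memory[2 * word_addr]        # first 8-bit bus cycle
    hi = memory[2 * word_addr + 1]    # second 8-bit bus cycle
    return (hi << 8) | lo, 2

memory = bytes([0x34, 0x12, 0xCD, 0xAB])
word, cycles = read_word(memory, 1)
print(hex(word), cycles)
```

Every instruction fetch and every data access pays those two bus cycles, which is the cost being traded against the simpler fixed-width decoder.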


Fri May 12, 2017 8:33 pm

Joined: Sat Aug 22, 2015 6:26 am
Posts: 40
True, but the design also needs to be competitive, and that's going to be a bit harder with 16 bit wide everything.


Fri May 12, 2017 8:36 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1202
Maybe. Maybe not! With lots of registers, perhaps memory bandwidth is slightly less important.


Fri May 12, 2017 8:38 pm

Joined: Sat Aug 22, 2015 6:26 am
Posts: 40
I think memory bandwidth is still important. If you require twice the bandwidth to manipulate registers, the CPU will require twice the instruction cache to do the same thing.

One important consideration is the upgrade path. When you move to a 256 or 1024 slice version, you can easily afford mixed opcode lengths, so it would be a shame if simple and common instructions such as RTS were stuck on 16 bits.
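
The RTS point can be sketched as a rounding problem: an instruction that needs N bytes of information has to be padded up to a whole number of instruction units. The helper below is hypothetical and the byte counts are just examples (1 byte is what a 6502-style RTS needs); it is not a measurement of any real instruction mix.

```python
def encoded_size(natural_bytes, unit_bytes):
    """Bytes occupied once an instruction is padded to whole units."""
    units = -(-natural_bytes // unit_bytes)  # ceiling division
    return units * unit_bytes

for natural in (1, 2, 3):
    print(f"needs {natural} byte(s): "
          f"{encoded_size(natural, 1)} with 8-bit units, "
          f"{encoded_size(natural, 2)} with 16-bit units")
```

Odd-length instructions are the ones that lose out with a 16 bit unit, and single-byte instructions like RTS lose the most in relative terms.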


Sat May 13, 2017 5:00 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1202
Yes, it seems likely that a byte-sized variable length instruction would work well as resources scale up. Nonetheless, we are looking at a wider machine. It's interesting to explore the possibilities. (As things scale up even more, of course a 16 bit path to memory becomes a likelihood. You may know that RISC-V defines a compressed instruction set which is a mix of 16 and 32 bit encodings. So, byte-sized variable length is merely a stopping off point, a local minimum.)
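
For reference, this is roughly how a RISC-V fetch unit tells the two sizes apart: the low bits of the first 16-bit parcel encode the length. A sketch only, with a made-up function name, covering just the 16 and 32 bit cases and treating anything longer as unsupported.

```python
def rv_instruction_length(first_halfword):
    """Length in bytes of a RISC-V instruction, from its first 16-bit parcel."""
    if first_halfword & 0b11 != 0b11:
        return 2   # compressed (16-bit) encoding
    if first_halfword & 0b11100 != 0b11100:
        return 4   # standard 32-bit encoding
    raise ValueError("encoding longer than 32 bits, not handled in this sketch")

print(rv_instruction_length(0x0001))  # c.nop
print(rv_instruction_length(0x0013))  # low half of addi x0, x0, 0
```

The same trick would work at byte granularity: reserve a few bits of the first byte to say how many more follow.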


Sat May 13, 2017 5:52 am

Joined: Sat Aug 22, 2015 6:26 am
Posts: 40
I agree, it's interesting to explore the various options. Perhaps we should also do a 16 bit challenge, where 16 bits is the smallest unit in the instruction stream. Based on current CPU designs, it seems that a well designed 16 bit unit is close to optimal.

For this topic, I do want to focus on an 8 bit bus/instruction unit.


Sat May 13, 2017 6:52 am

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 160
Location: Huntsville, AL
Arlet wrote:
As a result of a thread on 6502.org, I would like to propose a challenge.

The challenge is to create a 6502-era CPU, using an FPGA, using a roughly similar amount of resources as were available to the 6502 designers. The CPU needs to have similar capabilities to the 6502: 16 bit address bus, 8 bit data bus, 2 interrupts, reset, RDY. To make design easier, the data bus may be split into separate in/out buses. Instead of an NMI, you can make a higher priority maskable IRQ. It should interface to either a block RAM, or an external async SRAM. It doesn't need to be 6502 compatible, but you should be able to port typical 6502 programs to it.

Maximum area is 128 slices on a Spartan 6 (XC6SLX4), which is about what my NMOS 6502 core requires. Use of block RAMs or DSP blocks is not permitted inside the CPU, but these resources may be used outside the CPU to build a complete working system.

The goal is to make something as powerful as possible that could theoretically have existed as a 40 pin DIP in the 70's, hopefully better than the 6502 itself. One of the goals is to keep room for future improvement, so filling up the opcode space is not encouraged.

Edit: changed limit from 120 to 128 slices.


I too am working on this challenge. In the meantime, I have measured the area required to implement my P16C5x core on a Spartan 6 as required by the challenge: the core requires 105-107 slices when targeted to a XC6SLX4-3TQ144 FPGA. When just the core is placed and routed, a timing constraint of 10 ns is easily satisfied on the specified target FPGA.

Interestingly, MAP calculates the area required by the netlist to be 133 Spartan 6 slices; this is when the core is synthesized, mapped, placed and routed as an independent module. When the core is integrated into the M16C5x demonstration project, MAP reports the module utilization for the same module as 79 Spartan 6 slices. In the full project, the best timing constraint that can be easily satisfied is 12 ns, instead of the 10 ns achieved for the processor core alone.

I haven't had time to investigate the differences in the reported module utilization factors and the reported slice requirements of the PARed design. Does anyone have an explanation for the differences in the reported areas of the core?
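
To line the figures above up against the challenge, here is a small sketch that checks each reported slice count against the 128 slice limit and converts the timing constraints to clock frequencies. It only restates numbers from this post (using 107 as the upper end of the 105-107 range); the helper names are made up, and it does not explain why MAP and the placed-and-routed result disagree.

```python
SLICE_LIMIT = 128  # maximum area from the challenge statement

def fits(slices, limit=SLICE_LIMIT):
    """True if a reported slice count is within the challenge limit."""
    return slices <= limit

def period_ns_to_mhz(period_ns):
    """Convert a clock period constraint in ns to a frequency in MHz."""
    return 1000.0 / period_ns

reports = {
    "placed and routed, core alone": 107,
    "MAP estimate, core alone": 133,
    "MAP module utilization, full project": 79,
}
for label, slices in reports.items():
    verdict = "within" if fits(slices) else "over"
    print(f"{label}: {slices} slices, {verdict} the limit")

for period_ns in (10, 12):
    print(f"{period_ns} ns constraint = {period_ns_to_mhz(period_ns):.1f} MHz")
```

Which of the three slice figures counts for the challenge makes the difference between passing and failing, so the question of which report to trust matters.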

_________________
Michael A.


Sun May 14, 2017 2:25 pm