 Thor Core / FT64 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2232
Location: Canada
I figured I'd mention this project I've been working on, on and off. It's for a cpu core called Thor with impressive characteristics.
I figured it'd never fit in the FPGA but when I tried synthesizing it:

1. Slice Logic
--------------

+----------------------------+-------+-------+-----------+-------+
| Site Type | Used | Fixed | Available | Util% |
+----------------------------+-------+-------+-----------+-------+
| Slice LUTs* | 59415 | 0 | 63400 | 93.71 |
| LUT as Logic | 59115 | 0 | 63400 | 93.24 |
| LUT as Memory | 300 | 0 | 19000 | 1.58 |
| LUT as Distributed RAM | 300 | 0 | | |
| LUT as Shift Register | 0 | 0 | | |
| Slice Registers | 9541 | 0 | 126800 | 7.52 |
| Register as Flip Flop | 9541 | 0 | 126800 | 7.52 |
| Register as Latch | 0 | 0 | 126800 | 0.00 |
| F7 Muxes | 2210 | 0 | 31700 | 6.97 |
| F8 Muxes | 315 | 0 | 15850 | 1.99 |
+----------------------------+-------+-------+-----------+-------+

Unless I missed something it looks like it'll just barely fit.
I had estimated it to be 2-3 times too large. So now I'm all giddy.

The roughly 4,000 LUTs left over are just enough for a text display controller and a keyboard interface.

Thor Features:
2-way superscalar, 64 bit, 8 entry ROB
variable length instructions, instruction predication.
... and more

_________________
Robert Finch http://www.finitron.ca


Last edited by robfinch on Mon Jan 15, 2018 4:29 am, edited 1 time in total.



Sat Nov 28, 2015 4:55 am WWW

Joined: Tue Dec 11, 2012 8:03 am
Posts: 285
Location: California
With [code] and [/code] around it to preserve your spacing:

Code:
+----------------------------+-------+-------+-----------+-------+
|          Site Type         |  Used | Fixed | Available | Util% |
+----------------------------+-------+-------+-----------+-------+
| Slice LUTs*                | 59415 |     0 |     63400 | 93.71 |
|   LUT as Logic             | 59115 |     0 |     63400 | 93.24 |
|   LUT as Memory            |   300 |     0 |     19000 |  1.58 |
|     LUT as Distributed RAM |   300 |     0 |           |       |
|     LUT as Shift Register  |     0 |     0 |           |       |
| Slice Registers            |  9541 |     0 |    126800 |  7.52 |
|   Register as Flip Flop    |  9541 |     0 |    126800 |  7.52 |
|   Register as Latch        |     0 |     0 |    126800 |  0.00 |
| F7 Muxes                   |  2210 |     0 |     31700 |  6.97 |
| F8 Muxes                   |   315 |     0 |     15850 |  1.99 |
+----------------------------+-------+-------+-----------+-------+


My apologies-- I don't know enough about the content to comment on it.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources


Sat Nov 28, 2015 5:32 am WWW

Joined: Tue Jan 15, 2013 5:43 am
Posts: 189
robfinch wrote:
Thor Features:
2-way superscalar, 64 bit, 8 entry ROB
variable length instructions, instruction predication.
... and more

I'm hardly an expert but some of this terminology seems like that used to describe modern CISC implementations (I guess I mean x86).

ROB is re-order buffer, right? So your CISC instructions are broken down into micro-ops, which are subject to Out-of-Order execution? And I gather there are two execution units. Are they both general-purpose or is each specialized for certain tasks?

-- Jeff

_________________
http://LaughtonElectronics.com


Sat Nov 28, 2015 1:33 pm WWW

Joined: Tue Dec 31, 2013 2:01 am
Posts: 116
Location: Sacramento, CA, United States
robfinch wrote:
I figured I'd mention this project I've been working on, on and off. It's for a cpu core called Thor with impressive characteristics ...

What does the assembly language look like, Rob? Can you post a brief sample?

Mike B.


Sun Nov 29, 2015 12:51 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2232
Location: Canada
Quote:
ROB is re-order buffer, right? So your CISC instructions are broken down into micro-ops, which are subject to Out-of-Order execution? And I gather there are two execution units. Are they both general-purpose or is each specialized for certain tasks?

Yes, ROB is re-order buffer. It's Tomasulo's algorithm. It can do things out of order. Thor is very RISC-like, although the instructions aren't all the same length, in order to conserve memory. But they aren't broken down into micro-ops; the instructions are executed directly. There are two ALUs and a memory unit. Both ALUs are identical to keep things simple. It's a bit of a waste of transistors, as that means there are two barrel shifters, two multipliers, and duplicated logic for other rarely used instructions.

A brief trivial example currently being simulated (clearscreen).
Code:
                                            VIDBUF   EQU      0xFFD00000
                                            
                                               code
                                               org      $FFFF8000
                                            
                                            cold_start:
FFFF8000 01 9D 20 00                                  ldis    zs,#0
FFFF8004 01 A1 F1 0A 00                               bsr      ClearScreen
FFFF8009 01 40 C2 10 00                               add      r1,r2,r3
                                            
                                            ClearScreen:
FFFF800E 50 00 D0 FF 00 01 6F 01 00                   ldi      r1,#VIDBUF
FFFF8017 01 6F 02 08                                  ldi      r2,#' '
FFFF801B 20 0F 01 6F C3 FF                            ldi      r3,#4095
                                            .0001:
FFFF8021 01 92 81 00 00                               sh      r2,zs:[r1]
FFFF8026 01 47 01 01                                  addui   r1,r1,#4
FFFF802A 01 47 C3 FF                                  addui   r3,r3,#-1
FFFF802E 01 00 03                                     tst      p0,r3
FFFF8031 06 3F ED                               p0.ge   br      .0001
FFFF8034 11                                           rts

It gets to the cold_start location via a jump at the reset address (not shown).
p0.ge is the predicate for the instruction; predicate register p0 is set by the test instruction before it.
Every instruction has a predicate byte. '01' is the default predicate which is always true.
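
To make the predicate mechanics concrete, here is a rough Python model of the ClearScreen loop above. It is only a sketch under assumptions: it assumes tst compares r3 against zero, that the ge condition means "greater than or equal to zero", and it advances one list cell per store instead of modelling byte addresses. The function name and the list used as a video buffer are made up for illustration.

```python
def clear_screen_model(vidbuf, fill=ord(' '), count=4095):
    """Model of the ClearScreen loop; returns the number of cells written."""
    r1 = 0        # stands in for the address held in r1 (an index here)
    r2 = fill     # value to store
    r3 = count    # loop counter loaded by "ldi r3,#4095"
    while True:
        vidbuf[r1] = r2     # sh    r2,zs:[r1]
        r1 += 1             # addui r1,r1,#4   (one cell per step in this model)
        r3 -= 1             # addui r3,r3,#-1
        p0_ge = r3 >= 0     # tst   p0,r3      (assumed: compare against zero)
        if not p0_ge:       # p0.ge br .0001   (branch only while predicate holds)
            break
    return r1
```

With the counter starting at 4095 and the branch taken while r3 is still non-negative, this model writes 4096 cells.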

_________________
Robert Finch http://www.finitron.ca


Sun Nov 29, 2015 9:21 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2232
Location: Canada
All bleary-eyed and tired, I've spent some time testing out string stores. String instructions can be great for performance when they're available.
I'm hoping to get around to including them in the core. Right now there's just a string set. There's opcode room for more. Next will likely be string moves.
The core (at 32 bits) seems to fit into a xc7a100 with enough room left over for other things.

The following loads a loop count register with the number of half-words to write, then uses a string store instruction to rapidly fill the bytes.
I got confused at one point debugging because the RTS instruction was getting executed before the STSH instruction was finished.

Code:
ClearScreen:
      ldi      r1,#TEXTSCR
      ldi      r2,#' '|0x0007FC00;
      ldis   lc,#4096
      stsh   r2,hs:[r1]
      rts

_________________
Robert Finch http://www.finitron.ca


Mon Nov 30, 2015 10:04 am WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1808
I'm interested to know what you will do about interrupting these kinds of long-running iterative instructions. Restart, or resume, or wait until they finish?


Mon Nov 30, 2015 1:07 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2232
Location: Canada
Quote:
I'm interested to know what you will do about interrupting these kinds of long-running iterative instructions. Restart, or resume, or wait until they finish?

Hmm, I have to look at this again. It's a good question.

They are interruptible. The core resumes where it left off after the interrupt. The string instruction checks whether there is a pending interrupt and, if so, claims it's finished even though it isn't really. The state is in the loop counter and a general purpose register (GPR). As long as the interrupt routine either saves and restores these registers or does not use them, it should be okay. The core doesn't allow other memory operations to take place until after the string operation is complete. There is only a single memory unit, and it's busy with the string instruction, so other memory instructions don't issue to it. So there are no changes to memory to worry about repeating. The core stores the address of the string instruction in a flag variable called string_pc. If string_pc is non-zero when an interrupt is processed, then the core returns to that pc and begins executing the instruction again.
Interrupts are tricky to get working. I spent a number of late nights on them a few years ago running simulations. Interrupts in Thor are precise. When an external interrupt occurs the instruction queue is filled with INT instructions until one finally commits. The extra INT instructions are ignored after the first commit. There's actually quite a delay before interrupt processing is started (it may be 8 or 10 cycles, or more if a divide is being done).
I have to make sure the queue is invalidated after the string instruction when an interrupt occurs. I think interrupts need more work yet.
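
The resume scheme described above can be sketched in Python. This is a behavioural toy, not the Thor RTL: the class name, the `interrupt_pending` callback, and the assumption that the loop counter holds the number of elements still to be written are all hypothetical. Only string_pc, the loop counter and the address register come from the description.

```python
class StringStoreModel:
    """Toy model of an interruptible string store (stsh-like)."""

    def __init__(self, memory):
        self.mem = memory
        self.lc = 0          # loop counter: elements still to write (assumed)
        self.r1 = 0          # address register (an index in this model)
        self.string_pc = 0   # non-zero while a string op is unfinished

    def stsh(self, pc, value, interrupt_pending):
        """Store `value` until lc reaches zero or an interrupt is pending.

        Returns True if the operation completed, False if it stopped early.
        All progress lives in lc and r1, so re-executing the instruction
        at string_pc simply continues where it left off.
        """
        while self.lc > 0:
            if interrupt_pending():
                self.string_pc = pc   # remember which instruction to re-run
                return False
            self.mem[self.r1] = value
            self.r1 += 1
            self.lc -= 1
        self.string_pc = 0
        return True
```

In this model an interrupt handler that leaves lc and r1 alone (or saves and restores them) can return to string_pc and the fill finishes without repeating any store.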

_________________
Robert Finch http://www.finitron.ca


Mon Nov 30, 2015 11:53 pm WWW

Joined: Thu Oct 08, 2015 11:57 pm
Posts: 74
Interesting. How fast can it go MHZ wise before something breaks? I'm designing a GPU for my game console (it's two topics below this one), and I'm going to use the XC7A200T (around twice as large in resources, but also twice as expensive). It's only going to have 8 little baby cores, the shader assembly only consisting of a small handful of instructions, and I'm going to combine it with a Northbridge thingy to connect to a Pentium III. I want to be able to push the entire chip to 400 MHz. This has shown me that I'm not too far out there with my design, and that it can be done. So in a way, I thank you. Keep up the great work!


Tue Dec 01, 2015 12:06 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2232
Location: Canada
Quote:
Interesting. How fast can it go MHZ wise before something breaks?

I've tried a trivial example at 37.5 MHz which seemed to work. The design isn't floorplanned manually. I have the development environment set up so that I can do a fast development cycle. The tools warn me that there's a combinational loop somewhere, so the timing might not be reliable. I've stared at the code for hours and can't find any loops.

I've found it's difficult to get much past 50MHz for anything because of the number of cascaded LUTs in my designs. Every time a LUT is cascaded without being registered in between, it adds a propagation delay. With complex logic (like a superscalar processor) the number of cascaded LUTs really slows it down. If one tries to get a high clock rate then the design must be really simple. If you look at the processors at Opencores.org, a lot of them run in the 40 to 50 MHz range, including pipelined designs. I usually end up with an entire system running at something more like 25MHz. Adding more registers into the design may allow a higher clock frequency, but then one is into using multiple clock cycles to get anything done and performance isn't actually a whole lot better. Check out the clock rates of the vendors' CPU cores (MicroBlaze, Nios). I think they run at something like 100MHz in an inexpensive FPGA.
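
As a back-of-the-envelope illustration of why cascaded LUTs cap the clock rate, the following Python sketch treats the clock period as the number of logic levels times a per-level delay, plus a fixed overhead. The per-level delay and overhead are placeholder numbers, not measured values for any particular part, and the function name is invented for this example.

```python
def fmax_mhz(levels, ns_per_level, overhead_ns=0.0):
    """Estimate the maximum clock in MHz for a path of cascaded LUTs.

    levels:       number of LUTs between two registers
    ns_per_level: assumed LUT-plus-routing delay per level, in ns
    overhead_ns:  assumed clock-to-out plus setup time, in ns
    """
    period_ns = levels * ns_per_level + overhead_ns
    return 1000.0 / period_ns


# Illustrative numbers only: at 1 ns per level, 20 levels give a 20 ns
# period, and doubling the depth to 40 levels halves the clock rate.
shallow = fmax_mhz(20, 1.0)
deep = fmax_mhz(40, 1.0)
```

Under these made-up delays the 20-level path works out to 50 MHz and the 40-level path to 25 MHz, which is the kind of scaling described above.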

Executing one or more instructions per clock, even at 25MHz I'd bet the core would beat an older 486 for performance.

_________________
Robert Finch http://www.finitron.ca


Tue Dec 01, 2015 8:00 am WWW

Joined: Thu Oct 08, 2015 11:57 pm
Posts: 74
robfinch wrote:
The design isn't floorplanned manually. I have the development environment settings so that I can do a fast development cycle.


Hmmm, I'm going to design my GPU by "wiring" the logic gates, and then manually routing everything. I know it's going to be very difficult and take a long time, but I've tried Verilog and VHDL, and to be honest I hate them. I perform much better programming with logic diagrams. I was told it was possible to clone a Pentium 4 on an FPGA and have it run at 400 MHz, so if I route everything correctly that may be a speed within reach for me.

robfinch wrote:
Executing one or more instructions per clock, even at 25MHz I'd bet the core would beat an older 486 for performance.


That's fast for a home made CPU. I've always wanted to work with a 486, but I can't seem to find a working one around here. It's been awhile since I last looked, so one might have shown up on eBay or something.


Wed Dec 02, 2015 12:46 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2232
Location: Canada
More string instructions were added to the core. It now supports memory moves and string compares. These features are untested at the moment. The Thor guide has also been updated.

Quote:
Hmmm, I'm going to design my GPU by "wiring" the logic gates, and then manually routing everything. I know it's going to be very difficult and take a long time, but I've tried Verilog and VHDL, and to be honest I hate them. I perform much better programming with logic diagrams. I was told it was possible to clone a Pentium 4 on an FPGA and have it run at 400 MHz, so if I route everything correctly that may be a speed within reach for me.

I'll be looking forward to following along with the project. I've read through some of your other posts but I don't comment very often. You have an interesting project in the works. Please keep it up.

Quote:
I was told it was possible to clone a Pentium 4 on an FPGA and have it run at 400 MHz,

The clock rate might be 400MHz, but that doesn't mean the performance is that good. It takes a 486 several cycles to execute each instruction. I doubt they have it pipelined for single cycle execution of instructions at that clock rate. I've got timing for some simpler cores at 130MHz but they're really devoid of features. With a small RISC type processor it might be possible to break 100MHz in a really cheap FPGA. Then performance can be gained by going multi-core.
Some of the more expensive FPGAs are a lot faster. If you can afford to buy a part that's two or three times faster, then of course the performance is going to be a lot better. I've heard of the Apollo project, where they have a 68060 compatible core running at over 200MHz, I think. Apollo's a superscalar core.

_________________
Robert Finch http://www.finitron.ca


Wed Dec 02, 2015 2:16 am WWW

Joined: Tue Jan 15, 2013 5:43 am
Posts: 189
robfinch wrote:
The clock rate might be 400MHz, but that doesn't mean the performance is that good. It takes a 486 several cycles to execute each instruction. I doubt they have it pipelined for single cycle execution of instructions at that clock rate.
To be clear, an FPGA 486 may operate quite differently from an actual 486. But the latter is capable of completing one instruction per clock. That -- and the on-chip cache -- are the main features that set it apart from a '386.

The Wikipedia article explains in more detail. "Tightly coupled pipelining allows the 486 to complete a simple instruction like ALU reg,reg or ALU reg,im every clock cycle (even though the latency was several cycles)."

_________________
http://LaughtonElectronics.com


Wed Dec 02, 2015 3:02 pm WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2232
Location: Canada
Quote:
But the latter is capable of completing one instruction per clock. That -- and the on-chip cache -- are the main features that set it apart from a '386.

My bad. I must have been thinking of the '386. I looked it up a while ago and was going by memory.

Still working on the core. Fixed a number of bugs, including one with immediate prefix instructions. I had coded it too simply. The prefix instruction can't be done until the following instruction is enqueued. I was thinking that test wasn't needed because the prefix would remain in the queue long enough to be picked up by the next instruction. While testing I ran into a case where that wasn't so, so I had to fix it.

I'm trying to get a keyboard driver working on the FPGA. Right now all the core does is initialize a few registers and clear the screen.

_________________
Robert Finch http://www.finitron.ca


Wed Dec 02, 2015 6:19 pm WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2232
Location: Canada
This morning's trouble was with the loop instruction. The loop counter was updating every other time through the loop. How?
After some head scratching I finally realized that there was no feed-through bypassing on special purpose registers, so the instruction was picking up a stale value.
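
A small Python sketch of the stale-value problem, assuming a simplified picture in which results sit in a pending list before they are committed to the register file. The function and data layout are invented for illustration and do not reflect how the core is actually coded.

```python
def read_register(name, regfile, pending, bypass=True):
    """Return the value an issuing instruction would see for `name`.

    regfile: dict of committed register values
    pending: list of (name, value) results computed but not yet
             committed, oldest first
    bypass:  when True, the newest matching pending result is forwarded
    """
    if bypass:
        for reg, value in reversed(pending):
            if reg == name:
                return value
    return regfile[name]


# The loop counter was decremented to 9, but the result has not been
# committed yet, so the register file still holds 10.
regfile = {"lc": 10}
pending = [("lc", 9)]
stale = read_register("lc", regfile, pending, bypass=False)
fresh = read_register("lc", regfile, pending, bypass=True)
```

Without the bypass the reader sees the old committed value; with it, the in-flight result is picked up instead.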

Today I made the ALU's asymmetrical as an option. ALU#1 doesn't support all instructions in order to reduce the size of the core.
Some instructions that are supported are rarely if ever used. The list includes population and zero/one counting, divide, BCD math, bit-field operations and a few others. Making the instructions optional made the issue logic slightly more complicated. But it reduced the size of the 32 bit core by about 2,000 LUTs. Performance probably isn't impacted very much by removing the instructions.

The core can actually be working on up to five instructions at the same time: two ALU, one memory, one FP and a branch. It typically fetches three or four instructions at once (128 bits worth). But it only queues two at a time and only updates two registers at a time. It can, however, finish three instructions at the same time if the third instruction doesn't update a register.

_________________
Robert Finch http://www.finitron.ca


Fri Dec 04, 2015 12:57 am WWW