 rfPhoenix 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
rfPhoenix is a GPGPU approach. It is a barrel processor with 16 threads.
32-bit data path
32 GPRs
40-bit instructions
16 vector lanes, 64 vector registers
some out-of-order operation, with a 12-entry reorder buffer
A thread can occupy only one reorder buffer entry at a time.

_________________
Robert Finch http://www.finitron.ca


Fri Aug 26, 2022 4:12 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Hmm, that last one surprised me. I have to admit, I have no detailed idea how a reorder buffer works or how to build one. But, with 16 threads and 12 reorder buffers... why not 16? Presumably because you don't always need one per thread? But with 16 and 16 you'd be able to say that this buffer is for this thread, whereas with 12 shared between 16 you have some kind of dynamic mapping... I'll admit that my musings might make no sense in the light of how these things actually work!


Fri Aug 26, 2022 5:48 am

Joined: Sun Mar 27, 2022 12:11 am
Posts: 40
I have a soft spot for barrel processors. I've been thinking about building a general purpose barrel processor, but was having some second thoughts.

I'm also curious about how the OoO side will work.


Sat Aug 27, 2022 12:21 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
Hmm, that last one surprised me. I have to admit, I have no detailed idea how a reorder buffer works or how to build one. But, with 16 threads and 12 reorder buffers... why not 16? Presumably because you don't always need one per thread? But with 16 and 16 you'd be able to say that this buffer is for this thread, whereas with 12 shared between 16 you have some kind of dynamic mapping... I'll admit that my musings might make no sense in the light of how these things actually work!
It is because the buffers are dynamically mapped in the interest of performance. If there is an L1 I$ miss then no buffer is allocated until the next thread that does not miss. You have a good point though; I will need to think about this some more. As it is only out-of-order between threads, I think it could work the way you suggested. The only thing I can think of is that it might be desirable to hold multiple instructions for a single thread in the buffer: they cannot execute because of register dependencies, but they could at least have been fetched and decoded.

Quote:
I'm also curious about how the OoO side will work.
The instruction the thread is working on is stored in a buffer and the thread is marked busy until the instruction completes. While the thread is busy, other threads may be running. It is possible for thread 1 to be fetched and decoded before thread 2 yet have thread 2 complete first, because thread 2 is a one-cycle operation while thread 1 is running a multiply.


The instruction cache turned out to be a bottleneck at 60 MHz timing. It was clocked on the negative edge of the clock to get output in the same clock cycle as the IP was applied; however, this allows only ½ clock to fetch the line. Since it is less critical for a barrel processor that the instruction cache output be available within one clock cycle, the instruction fetch is now set up to use three clock cycles, pipelined so that a new thread accesses the I$ every clock cycle. It takes one cycle to read the tag and two more cycles to read the cache line. The I$ memory uses an additional output buffer to improve the fmax. The tricky part is ensuring that the I$ hit signal, instruction output, IP, and thread id all stay lined up.
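
In rough behavioral terms it looks like the sketch below (Python as a model, not the actual hardware; the line size, set count, and names are assumptions). The point is that the thread id, IP, and hit flag ride down the pipe together with the request, so everything is still lined up when the line emerges three cycles later.

Code:
from collections import deque

class ICachePipe:
    DEPTH = 3  # one cycle for the tag read, two for the cache line

    def __init__(self):
        # Each stage register carries (thread_id, ip, hit, line) as a unit,
        # so the metadata can never get out of step with the data.
        self.stages = deque([None] * self.DEPTH)

    def clock(self, thread_id, ip, tags, lines):
        """One clock edge: accept a new request, retire the oldest one."""
        index = (ip >> 6) & 0xFF             # assumed: 64-byte lines, 256 sets
        hit = tags.get(index) == (ip >> 14)  # tag compare
        line = lines.get(index) if hit else None
        done = self.stages.popleft()         # request issued three cycles ago
        self.stages.append((thread_id, ip, hit, line))
        return done                          # (thread_id, ip, hit, line) or None

Three empty results come out before the first real one, matching the three-cycle latency.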

Decided to violate the ‘all instructions are 40 bits’ rule and provide an immediate postfix instruction which is only 24 bits in size. The basic idea is to improve code density. It is easy to detect a postfix instruction and increment by eight bytes instead of by five. Instructions needed to be byte-aligned anyway, so the extra format does not affect things much.
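
Detecting it amounts to peeking at the next opcode slot; a sketch (the postfix opcode value and encoding below are made up for illustration):

Code:
INSN_LEN = 5            # a 40-bit instruction is five bytes
POSTFIX_LEN = 3         # the 24-bit immediate postfix is three bytes
POSTFIX_OPCODE = 0x3F   # hypothetical opcode for the postfix

def next_ip(ip, mem):
    """Step past one instruction, swallowing a postfix if one follows."""
    if (mem[ip + INSN_LEN] & 0x7F) == POSTFIX_OPCODE:
        return ip + INSN_LEN + POSTFIX_LEN   # increment by eight
    return ip + INSN_LEN                     # increment by five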

Put a switch in the decoder to disable writes to r0 after the first time it is written. This allows r0 to be updated once, presumably with the value zero. Having start-up code write zero to r0 is more hardware-efficient and eliminates a mux in the register read path.
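
As a model it is just a lock-out flag (a sketch; in the real core the flag lives in the decoder, not the register file):

Code:
class RegFile:
    def __init__(self, n=32):
        self.regs = [0] * n
        self.r0_locked = False       # set after the first write to r0

    def write(self, rd, value):
        if rd == 0:
            if self.r0_locked:
                return               # later writes to r0 are silently dropped
            self.r0_locked = True
        self.regs[rd] = value

Start-up code writes zero once, and from then on r0 reads as zero with no special-case mux in the read path.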

Pipelined the FMA unit so that it has a latency of seven cycles, but can start a new operation every clock.

Currently the system misses 60 MHz timing by about 2 ns, which means it should run at 50 MHz, but that’s not good enough :)

_________________
Robert Finch http://www.finitron.ca


Sat Aug 27, 2022 3:44 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
56.3 MHz gives a nice 800x600 VGA display.


Sat Aug 27, 2022 3:58 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Had to switch the data cache line size to 256 bits from 512 bits. The issue is that two cache lines, even and odd, are returned in a buffer, and the fifo component does not support fifos more than 1024 bits wide; two cache lines plus other information would make the fifo too wide. (I will have to look at this issue again, as it does not allow for vector loads and stores.)
rfPhoenix has a slow memory system. It takes nine clock cycles to process a request in the BIU; however, everything is pipelined so new requests can start every clock cycle. Three cycles perform the virtual-to-physical address translation, three more access the data cache, one aligns the data, and one each is spent getting into and out of the memory pipe.
Added onto the nine cycles are the cycles required to perform the memory access itself.

Thinking about having a ‘translate virtual address’ instruction to perform virtual-to-physical address translations, with the translation then done in code. It increases the instruction count, but performance might be better, as it removes the address translation from the memory pipeline and shortens it. For loops, a base address could be translated once and subsequent memory references made relative to the translated base, removing the address translation from all the subsequent references.
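
Roughly this pattern, as a behavioral sketch (the page size is an assumption, and it presumes the buffer does not cross a page boundary):

Code:
PAGE_BITS = 13   # assumed 8 KiB pages

def translate(tlb, vaddr):
    """What the hypothetical 'translate virtual address' op would compute."""
    mask = (1 << PAGE_BITS) - 1
    return (tlb[vaddr >> PAGE_BITS] << PAGE_BITS) | (vaddr & mask)

def sum_words(mem, tlb, vbase, count):
    pbase = translate(tlb, vbase)     # translated once, before the loop
    total = 0
    for i in range(count):
        total += mem[pbase + 4 * i]   # physically addressed: no TLB stage
    return total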

Quote:
56.3 MHz gives a nice 800x600 vga display.
84.5 Hz refresh rate? 800x600 is a 40 MHz clock IIRC.
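
Quick arithmetic with the standard VESA 800x600@60 totals (1056x628 including blanking):

Code:
H_TOTAL, V_TOTAL = 1056, 628          # 800x600 active plus blanking
print(40.0e6 / (H_TOTAL * V_TOTAL))   # ~60.3 Hz at the standard 40 MHz clock
print(56.3e6 / (H_TOTAL * V_TOTAL))   # ~84.9 Hz at 56.3 MHz

So mid-80s Hz sounds about right, give or take the exact blanking intervals used.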

_________________
Robert Finch http://www.finitron.ca


Sun Aug 28, 2022 3:53 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
As Big Ed suggested, the core was switched from dynamic buffering to using fixed buffers for each thread. This resulted in considerable savings of hardware due to the elimination of multiplexors. The default number of threads was also reduced to four; fewer threads means higher performance for single-thread operation. The number of threads is a configuration parameter, as is the number of vector lanes.

I did a brief experiment with using instruction fifos instead of just a one-instruction buffer. I am not sure the fifos would do anything except add to the latency of operations, and it was starting to get complex.

The core is currently about 50k LUTs but it is bound to get larger as more functionality is added.

Decided to add a bunch of floating-point functions to the core. They are all pipelined to the FMA’s latency of seven cycles. Several of the functions have large lookup tables, so I am not sure yet whether they will be supported, as the table size is multiplied by the number of lanes.

Did a considerable update to the documentation. I was able to cut and paste from prior docs.

_________________
Robert Finch http://www.finitron.ca


Mon Aug 29, 2022 3:51 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
Could you also base priority on the ring in the operating-system level? 0 message passing, 1 kernel, 2 kernel I/O, 3 network, 4 user, 5 daemon, 6 debug.


Mon Aug 29, 2022 4:33 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
Could you also base priority on the ring in the operating-system level? 0 message passing, 1 kernel, 2 kernel I/O, 3 network, 4 user, 5 daemon, 6 debug.
Priority could be based on the operating mode (user, supervisor, machine...), but I think it will be left up to software. There is a privilege level the machine is running at, which could be checked against the privilege level of memory pages.

Added interrupt and exception processing to the core.

Hit upon storing the exception return address in a vector register. The exception return address needs a stack to allow nested interrupts, and a vector register has enough room for one. On an exception, a vector slide takes place to make room for the exception address; on return from interrupt, another slide operation restores the previous vector. This reuses existing hardware to manage the stack, as opposed to adding a stack of special-function registers. Some portion of the register set was going to be used to store exception instruction pointers anyway.
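
As a pure behavioral sketch of the slides (the 16 lanes come from the default configuration; the names are made up):

Code:
LANES = 16

def take_exception(stack, return_ip):
    """Slide every lane up one slot and put the return address in lane 0."""
    return [return_ip] + stack[:LANES - 1]

def return_from_exception(stack):
    """Pop lane 0; the remaining lanes slide back down."""
    return stack[0], stack[1:] + [0]

stack = [0] * LANES
stack = take_exception(stack, 0x1000)      # first interrupt
stack = take_exception(stack, 0x2000)      # nested interrupt
ip, stack = return_from_exception(stack)   # ip == 0x2000, back in the first handler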

Spent some time running things in simulation. It is too soon to expect good results, but I managed to get sim to run up to the point where an I$ line is loaded. Loading the I$ goes through the same memory pipeline that data requests go through: the data cache is checked for the data, the result ignored, and the memory request for an I$ load sent to the memory state machine. The way things are working, it may be possible to use the data cache as a second-level cache for the I$.

_________________
Robert Finch http://www.finitron.ca


Tue Aug 30, 2022 5:44 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Needed to put code in to suppress requests for the same cache line multiple times. It is possible that different threads will request the same cache line at about the same time; this occurs because the line has not yet been updated when the second request comes along.
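
The suppression itself is small; a sketch of the idea (the structure is assumed, essentially a table of outstanding line fills):

Code:
outstanding = set()   # line addresses with a fill already in flight

def request_line(line_addr, send_fill):
    if line_addr in outstanding:
        return                 # another thread already asked; just wait
    outstanding.add(line_addr)
    send_fill(line_addr)       # only the first miss goes out to memory

def fill_complete(line_addr):
    outstanding.discard(line_addr)   # line is in the cache; retries now hit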

There is currently a minimum of five threads required, because that is the length of the instruction fetch pipeline plus one. That makes it simple to increment the instruction pointer: it can be guaranteed the pointer will not be used again before the increment happens.

There were six threads running at one point and I noticed that two of them never executed. It turns out the simple scheduler, which just picks the first thread ready to go, is too simple: one of the first four threads was always ready, so the last two never got selected. A round-robin scheduler was used instead to fix this issue.
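
The only difference is where the search starts; a minimal sketch:

Code:
def pick_thread(ready, last):
    """Round-robin: search starting one past last cycle's winner."""
    n = len(ready)
    for i in range(1, n + 1):
        t = (last + i) % n
        if ready[t]:
            return t
    return None   # nothing ready this cycle

# Fixed priority kept picking thread 0 or 1; round-robin rotates through all six.
ready = [True, True, False, False, True, True]
assert pick_thread(ready, last=1) == 4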

_________________
Robert Finch http://www.finitron.ca


Wed Aug 31, 2022 5:18 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Eliminated three stages from the memory pipeline by virtually indexing the D$; it has now been reduced to six clocks from nine. D$ access now takes place at the same time as the virtual-to-physical address translation (the TLB access). Also modified the TLB access to use a vector register to load or store TLB data. More documentation updates.
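
The trick, sketched below with assumed field sizes, is that the set index comes from address bits that do not change under translation, so the cache arrays can be read while the TLB works and only the final tag compare needs the physical address (with the usual caveat about index bits above the page offset causing aliases):

Code:
INDEX_BITS, LINE_BITS = 8, 6   # assumed: 256 sets of 64-byte lines

def vipt_lookup(vaddr, tlb_translate, tags):
    index = (vaddr >> LINE_BITS) & ((1 << INDEX_BITS) - 1)    # virtual bits
    paddr = tlb_translate(vaddr)          # runs in parallel in hardware
    hit = tags[index] == (paddr >> (INDEX_BITS + LINE_BITS))  # physical tag
    return hit, index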

_________________
Robert Finch http://www.finitron.ca


Thu Sep 01, 2022 5:04 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Completely rewrote a large portion of the core to use instruction fifos. It now has a scoreboard for tracking register dependencies and can have more than one instruction waiting in the fifo. Rollback for the memory pipeline was a bit of a monster, as there are multiple fifos to roll back. Branches and rollback are not tested or debugged at all yet; still working on getting simple instructions to execute properly.
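
The scoreboard itself is conceptually tiny; a sketch (per thread, since the threads interleave but do not share registers):

Code:
class Scoreboard:
    """One busy bit per register; issue only when all operands are free."""
    def __init__(self, nregs=32):
        self.busy = [False] * nregs

    def can_issue(self, srcs, dst):
        return not any(self.busy[r] for r in (*srcs, dst))

    def issue(self, dst):
        self.busy[dst] = True    # result pending until writeback

    def writeback(self, dst):
        self.busy[dst] = False   # dependents may now issue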

The simple scheduling logic that used loops did not work, so I had to rewrite it. It took several iterations to get things just right; I ended up expanding out the loops. It does not run the pipeline properly yet. I noticed one of the threads skipping over an instruction, so more work remains.

_________________
Robert Finch http://www.finitron.ca


Sat Sep 03, 2022 4:34 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
After a lot of muddling with pipeline signals the core now runs the first few instructions correctly for all threads.
The pipeline looks like the following:
[Attachment: rfPhoenix Pipeline.png (rfPhoenix pipeline diagram)]


Note it is fairly long: a simple instruction takes about 10 clocks to go through the pipeline. But it should be reasonably fast; shooting for 50 MHz operation.
The pipeline is similar to the Nyuzi pipeline by Jeff Bush.

_________________
Robert Finch http://www.finitron.ca


Mon Sep 05, 2022 5:45 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Interesting - thanks for the pointer! Here's the Related Projects and Research page on the Nyuzi wiki. And here's the microarchitecture page.


Mon Sep 05, 2022 7:11 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 585
Could an addition be broken into three stages?
First stage: XOR and carry lookahead for an increment. Second stage: carry lookahead for addition, skipped for an increment.
Third stage: write back the sum bit or not-sum bit. I suspect routing delays could eat up 1/3 of that 20 ns cycle time.
Most of the time one just adds or compares small constants, -127 to 128. Could this knowledge give a fast addition
for special cases and a slower addition for general-purpose use?
Ben.


Mon Sep 05, 2022 8:36 am