


 Thor Core / FT64 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Made the core more configurable again. It can now be run as a scalar OoO core by reducing the width of the fetch / queue / execute paths to a single instruction. I’ve got about 2/3 of the code done for a 3-way version.
Well, I tried the new version of the core as a scalar OoO processor, and it turns out to be faster than the superscalar version: 4952 instructions executed versus 3719, making it about 33% faster. I’m confused and mystified by this result. I ran the test twice just to be sure. I expected the superscalar version to execute more instructions. I wonder if it has to do with some form of thrashing; I probably fat-fingered something when updating the core.

_________________
Robert Finch http://www.finitron.ca


Tue Sep 25, 2018 3:32 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Figured out one problem with the superscalar version. The data cache was being loaded twice with the same line on a miss: reading forward through memory caused both memory channels to miss at the same time, and both misses were serviced before retesting for a hit. A fix was added so that if a memory channel is already active and the miss address is in the same cache line, the hit status is retested instead of starting a second load. With the fix the data cache only loads once, which upped the number of instructions executed to 4116. Still slower than the scalar version, though, so there must be something else happening as well. Sim output is a challenge to read.
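The retest logic can be sketched in Python (a conceptual model only; the function names and the assumed 64-byte line size are mine, not from the core):

```python
LINE_BITS = 6  # assumed 64-byte cache lines

def cache_line(addr):
    """Index of the cache line containing addr."""
    return addr >> LINE_BITS

def needs_fill(miss_addr, active_fills):
    """Return True only if no in-flight fill already covers this line.

    active_fills: addresses of fills currently being serviced.
    A second channel missing on the same line should retest for a
    hit instead of starting a duplicate load.
    """
    return all(cache_line(miss_addr) != cache_line(a) for a in active_fills)

# Two channels reading forward through memory miss on the same line:
assert needs_fill(0x1000, [])            # first miss: start the fill
assert not needs_fill(0x1008, [0x1000])  # same 64-byte line: retest, don't reload
assert needs_fill(0x1040, [0x1000])      # next line: a genuine second miss
```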

The branch predictor was broken. When I adopted 16-bit instructions I forgot to include address bit 1 in the predictor, which previously used address bits 2 and above to index its tables. This caused perpetual branch misses in the superscalar configuration.
I also didn’t notice that when three instructions can commit, the third could be a branch. The branch predictor was looking only at the first two commits.
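The aliasing that results from dropping bit 1 can be illustrated with a small Python sketch (the table size and names are hypothetical):

```python
TABLE_BITS = 10  # assumed 1024-entry prediction table

def predictor_index(pc, halfword_aligned):
    """Index into the branch prediction table.

    With 32-bit-aligned instructions, bits [2+] of the address suffice.
    Once 16-bit instructions are allowed, bit 1 must be included or two
    adjacent branches at pc and pc+2 alias to the same table entry.
    """
    low = 1 if halfword_aligned else 2
    return (pc >> low) & ((1 << TABLE_BITS) - 1)

# Two back-to-back compressed branches:
assert predictor_index(0x100, False) == predictor_index(0x102, False)  # alias!
assert predictor_index(0x100, True) != predictor_index(0x102, True)    # distinct
```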

There was a problem with the compressed-instruction expander for the addi instruction with negative values: sign extension of the immediate didn’t expand out to enough bits, making the number look positive.
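The fix is ordinary sign extension of the immediate field; a Python sketch (the 6-bit field width is an assumed example, not necessarily the actual encoding):

```python
def sign_extend(value, bits):
    """Sign-extend a `bits`-wide field to a full-width integer."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

# A compressed addi with a 6-bit immediate field holding -1 (0b111111):
assert sign_extend(0b111111, 6) == -1   # correct expansion
assert 0b111111 == 63                   # without extension it reads as +63
```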

With these fixes the superscalar version is now faster than the scalar one: 5718 instructions executed versus 5098.



Wed Sep 26, 2018 3:46 am

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
Cool. Does that improvement in performance justify the additional complexity, in your opinion?

I have questioned the benefits of some of the additional complexity that provides what I consider small incremental performance increases. It seems to me that simpler may be better. That's not to say that your efforts to extract performance from your core with the structures you've described here are without merit. I myself spend an inordinate amount of time on small incremental improvements to my cores. For me, and I suspect for you, efforts along these lines are more of a learning exercise and a challenge I've set out for myself.

I certainly appreciate your blog on this core. It has led me to investigate many of the features you've been incorporating. Between this blog and the One-Page Computer blog, I've expanded my knowledge of modern processor architecture. It's good to see how the features you've described on this blog relate to performance, design and simulation effort, etc.

_________________
Michael A.


Wed Sep 26, 2018 10:17 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
Does that improvement in performance justify the additional complexity, in your opinion?

For me, I don't think it's worth the extra complexity. It's only 12% faster cycle-wise; a simpler core with a higher fmax would probably be faster overall. The scalar version is interesting because it can still make use of out-of-order execution and the queuing of instructions to improve performance. I can see, however, that with additional enhancements the extra performance might become worth it. A larger queue would help, and so would three- or four-way processing. In sim the queue sometimes empties of instructions, and then the processor just sits and waits. Part of the problem with fmax is the implementation in an FPGA; I think FPGAs don't do random logic very well. I got a "design is too congested" error and had to remove part of the design to get it to build.
I decided to run the test system with the scalar version of the core because it's about half the size.

Spent time today working on a sprite controller circuit. I’ve managed to get the system built through to a bit-stream that can be loaded into the FPGA, and found out it doesn’t work in the FPGA. I dump the address bus and data bus to an OLED display and they look right for a start-up address and data, so there is a clock getting through and the core is being reset, but it doesn’t execute the first line of code, which is to display $AA on the LEDs. Sprites are not appearing on the screen either. Just dummied out the sprite color, bitmap, and positions to constant values to see if anything displays. It takes the system about an hour to build.



Fri Sep 28, 2018 3:27 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Something we found with the (tiny) OPC projects was that the fmax result from synthesis was relatively variable. That makes it easy to see if you've hit some target speed, but difficult to see whether a change was beneficial or not. With the Xilinx tools, one option is SmartExplorer, which can do a number of runs and report the best (by default I think it's 8 runs), and it can use multiple cores to make progress in parallel. Of course, if you have more runs than cores it slows down your synthesis throughput, which can be quite a disadvantage.


Fri Sep 28, 2018 7:38 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Is SmartExplorer a free tool? I don't see it on my machine. I wouldn't mind giving it a try.

Figured out one issue with sprite display. Using $urandom in synthesis resulted in registers being set to zero; I had assumed it would pick a random constant. Zero is the code for transparent, so no sprites displayed. Fixing this caused sprites to be displayed at the expected locations on screen, but in the wrong color: black instead of red. The problem appears to be at the output of the color lookup RAM. I hardcoded the color red at that point to ensure the pipeline is working, and the display worked. It looks like the palette RAM contents are zero for some reason, even though they are being set with an initial begin statement.
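The transparency rule can be modelled in a few lines of Python (the palette contents and names are invented for illustration; only the "index zero means transparent" behaviour comes from the post):

```python
# Hypothetical sketch of the sprite pixel path: a palette-indexed pixel
# where index 0 means "transparent" (show the background instead).
TRANSPARENT = 0
palette = {0: 0x000000, 1: 0xFF0000}  # assumed: entry 1 is red

def sprite_pixel(index, background):
    """Resolve one output pixel from a sprite's palette index."""
    return background if index == TRANSPARENT else palette[index]

# If synthesis zeroes the sprite registers (the $urandom surprise),
# every index is 0 and no sprite ever shows:
assert sprite_pixel(0, 0x123456) == 0x123456   # transparent: background wins
assert sprite_pixel(1, 0x123456) == 0xFF0000   # red sprite pixel
```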
There is one display glitch in the sprite display: for some reason the last scan-line of the tenth (and only the tenth) sprite is chopped off. This has me stumped at the moment.
I’m no further along at getting the CPU to work. It works beautifully in sim, and I ran sim on the whole system. I tried registering the data input on the notion that perhaps it would fix a timing problem. It's gotta be close.



Sat Sep 29, 2018 4:05 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Yes, SmartXplorer is part of ISE; it's an option in one of the menus. But maybe you're on the newer Vivado? It seems they don't do the parallel exploration any more; you have to make multiple projects with different options. (I remember back in the 90s with Synopsys we used to script up multiple tactics to find the best one for each design: Flatten, Structure, Map, in various combinations and permutations, with Effort varied too.)

See
https://www.xilinx.com/support/document ... plorer.htm
https://www.xilinx.com/support/document ... tegies.htm


Sat Sep 29, 2018 8:21 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
But maybe you're on the newer Vivado? It seems they don't do the parallel exploration any more, you have to make multiple projects with different options.
Yes, I'm on Vivado. I may not have all the possible tools installed. I just downloaded and installed the default. I remember there were some checkboxes left unchecked.



Sat Sep 29, 2018 3:53 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
On the FPGA the boot-rom chip select appears to be frozen active.

In sim a case was found where the memory load state machine locked up because a memory load was stomped on by a branch miss. The first issue to solve was that the memory state machine didn’t see the stomp, and when the load finished it marked done an instruction it no longer owned. That slot held a newly queued instruction, so incorrect results were generated because the instruction was marked done too soon. With that fixed, the remaining issue was that the load state machine didn’t transition properly. A few fixes later, it seems to work.
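One way to model the ownership problem is with sequence tags on queue slots (a Python sketch of the idea; the actual core's mechanism may differ):

```python
# Each queue slot gets a sequence tag; the memory state machine records
# the tag of the instruction it is servicing. On a branch miss the slot
# may be stomped and reused, bumping the tag, so the state machine must
# re-check ownership before marking anything done.

class Slot:
    def __init__(self):
        self.tag = 0
        self.done = False

def finish_load(slot, owner_tag):
    """Mark the slot done only if it still belongs to this load."""
    if slot.tag == owner_tag:
        slot.done = True
    # else: the load was stomped; the slot now holds a new instruction

s = Slot()
owner = s.tag          # memory state machine starts servicing the load
s.tag += 1             # branch miss stomps the slot; a new insn queues here
finish_load(s, owner)
assert not s.done      # the newly queued instruction is NOT marked done early
```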

An I/O bridge component was added to the system. The system has gotten complex enough that it needs one. The video cores now hang off the I/O bridge rather than directly off the CPU. Without the bridge, routing ran for 13 hours and then bitstream generation failed with routing problems. With the bridge, route time dropped to 40 minutes and bitstream generation succeeded.

I modified the L1 icache load code of the CPU core slightly, and now it seems to get to the point of initializing the bitmap controller in the FPGA. It seems to be executing at least 8-10 instructions.



Sun Sep 30, 2018 2:54 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Interesting improvement in synth with the bridge. Is that just an organisational change, or does it have performance implications at all?


Sun Sep 30, 2018 7:29 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Using an I/O bridge adds at least one clock cycle to I/O accesses because the values are registered in the bridge, so it slows down I/O performance in terms of clock cycles. However, I’ve found that in a larger system a bridge helps keep the fmax where it should be. It seems to make it easier for the router to find paths between components because the path lengths are cut roughly in half, and it turns the connections from a flat set into more of a tree. For example, rather than having to route from one master to twenty slaves, the router only has to route from the master to four slave bridges, and then from each bridge to one of five peripherals. Sometimes the slaves break the connections down internally too, so it really is a tree.
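The fan-out arithmetic behind the example, as a quick Python sanity check (numbers taken from the paragraph above):

```python
# Flat bus: one master fans out to every slave directly.
slaves = 20
flat_master_fanout = slaves               # 20 long paths leave the master

# Bridged bus: master -> 4 bridges -> 5 peripherals each.
bridges, per_bridge = 4, 5
assert bridges * per_bridge == slaves     # same peripherals are reachable
tree_master_fanout = bridges              # only 4 paths leave the master

assert tree_master_fanout < flat_master_fanout
assert max(tree_master_fanout, per_bridge) == 5  # worst local fan-out is 5
```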



Mon Oct 01, 2018 6:07 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Possibly found a bug in synthesis: an initial begin statement failed to set a register value properly, causing sprites to appear all black. Left a message on the Xilinx community forum.
The FT64 system-on-a-chip is growing. Today I worked on an audio controller rather than FT64 itself. I added a test mode to the controller that allows recording and playing back audio without CPU intervention, so parts of the core can be tested without a working CPU.
I also started reading up on GPUs, more specifically the NVIDIA PTX instruction set.



Tue Oct 02, 2018 5:59 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Went through the cores and added registering on the input signals. I’m trying to reduce the congestion of the system; it’s taking inordinately long to synthesize and implement.

Spent some time sketching out a GPU for the system. It's approximately 24,000 LUTs and has 12 32-bit RISC processors, each controlling a part of the display, as in the following diagram:
Attachment: GPU.png (FT64v5 GPU, 27.46 KiB)



Thu Oct 04, 2018 4:03 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I tried changing two things slightly today: the write buffer and the register clock. The register clock was lowered from a 4x to a 2x clock, as this still works in simulation; I thought perhaps values were not being written to the register file at the 4x rate. The write buffer pointer was changed to a registered signal rather than combinational logic, and the write buffer was made available outside of the idle state. Changing these two things made no difference to the operation in the FPGA. Still a challenge to get working. According to the LED display the CPU gets past the first write to the LEDs, but nothing shows on them, so I conclude the store isn’t working. That could hold up the CPU after several cycles. The first issue to fix, then, is to get the store working.

While waiting for the system to build, which takes about 3 hours, I managed to put together more of the GPU. Each GPU core now executes a 32-bit subset of the FT64 ISA, plus a couple of custom commands for graphics processing: alpha blending, point transformations (rotate / scale) according to a matrix, and fixed-point multiplies and divides. Divide is interesting because there are four dividers that the GPU cycles through. The divide instruction immediately returns a handle to the divider, which may be used later to get the result; this allows processing to continue while a divide is being performed. The divide operation takes about 66 clock cycles, so having four dividers available allows some overlap in the processing. A GPU core is about 6,700 LCs, less than half the size of the ORSoC graphics accelerator (16,000 LCs). Of course, performance is likely a lot lower, but there are 12 cores (about 80,000 LCs in total) to do processing with.
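The handle-based divide can be modelled as a small Python sketch (the class and method names are hypothetical; the four units and roughly 66-cycle latency are from the post):

```python
# Issuing a divide returns a handle immediately; the result is collected
# later, letting other work overlap the ~66-cycle divide latency. Four
# divider units are cycled through round-robin.
DIVIDE_LATENCY = 66
NUM_DIVIDERS = 4

class DividerPool:
    def __init__(self):
        self.units = [None] * NUM_DIVIDERS  # (ready_cycle, quotient) or None
        self.next_unit = 0

    def issue(self, a, b, now):
        """Start a divide and immediately return a handle (the unit index)."""
        h = self.next_unit
        self.next_unit = (self.next_unit + 1) % NUM_DIVIDERS
        self.units[h] = (now + DIVIDE_LATENCY, a // b)
        return h

    def result(self, handle, now):
        """Collect the result; None if the divide is still in flight."""
        ready, q = self.units[handle]
        return q if now >= ready else None

pool = DividerPool()
h = pool.issue(100, 7, now=0)
assert pool.result(h, now=10) is None   # still busy: keep processing
assert pool.result(h, now=66) == 14     # done after 66 cycles
```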



Sat Oct 06, 2018 3:41 am

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
I am assuming that your GPU is using floating point (FP). FP division is slower than the other three FP operations: addition, subtraction, and multiplication. Out of curiosity, are these operations / functions pipelined in your GPU design? What is driving the high number of clock cycles of your division operation? Do you have a comparison to the number of clocks used by division in other GPUs?



Sat Oct 06, 2018 4:19 pm