 Thor Core / FT64 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 729
Location: Canada
Quote:
I am assuming that your GPU is using floating point (FP).
Nope. It just uses fixed point. I modelled some of it on the ORSoC graphics accelerator. I needed to keep the core size down. The divider is a multi-cycle non-restoring divider that processes 1 bit per clock cycle. While there are faster dividers around, the size of this divider is minimal. One thought was to use a pipelined divider, but then each stage would be almost the same size as a non-pipelined divider. I just guessed that four non-pipelined dividers would still be a lot smaller than a fully pipelined one (64 stages!). The ORSoC graphics accelerator can do two divides per clock using a pipelined divider and queueing of results. I was going to use that core, but only three of them would fit in the FPGA.
I have not looked up how fast GPU division is. This is a work in progress and is likely to change, as it already has. I should maybe try the FPU; it's not that large (2,500 LCs for single precision, IIRC). I may be better off going with fewer, more powerful processors. The current processor has a CPI of 4 or more, but runs at the dot frequency.
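For readers unfamiliar with it, the one-bit-per-clock non-restoring scheme mentioned above can be modelled in software roughly as follows. This is a minimal Python sketch and not the GPU's actual RTL: the function name `nonrestoring_divide` and the `width` parameter are assumptions made for illustration, and each loop iteration stands in for one clock cycle.

```python
def nonrestoring_divide(dividend, divisor, width=32):
    """Unsigned non-restoring division, one quotient bit per iteration.

    Hypothetical software model; returns (quotient, remainder).
    """
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    rem = 0   # signed partial remainder
    quo = 0
    for i in range(width - 1, -1, -1):
        bit = (dividend >> i) & 1
        # Shift in the next dividend bit, then subtract or add the divisor
        # depending on the sign of the previous partial remainder.
        if rem >= 0:
            rem = 2 * rem + bit - divisor
        else:
            rem = 2 * rem + bit + divisor
        # The quotient bit is 1 when the new partial remainder is non-negative.
        quo = (quo << 1) | (1 if rem >= 0 else 0)
    # Single correction step at the end instead of restoring every cycle.
    if rem < 0:
        rem += divisor
    return quo, rem
```

With `width=64` the loop runs 64 times, so a 64-bit divide costs on the order of 64 cycles plus setup, which is the latency the small size is traded against.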

_________________
Robert Finch http://www.finitron.ca


Sat Oct 06, 2018 9:57 pm

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 143
Location: Huntsville, AL
Over the past year or so, I developed a fixed-point division function based on the Goldschmidt square root and reciprocal square root algorithm. It generates two results, and I use the square of the inverse (reciprocal) square root to provide the reciprocal of the divisor. The algorithm I used converges in 6 (total) iterations, and uses a Booth multiplier, left/right arithmetic shifter, and summer.

All of these components were already part of the ALU, so all I needed was the sequencer logic to use the standard components of the ALU to compute the square root, reciprocal of the square root, and the reciprocal of the divisor. If I recall correctly, the algorithm, using a 4-bits per cycle Booth multiplier, comes out slightly slower than your 66 clock cycles. I did save a considerable amount of logic by not implementing a dedicated divider function in the ALU since there is only one division and square root per update cycle, but there are considerably more multiplies and accumulations per cycle.

Perhaps the Goldschmidt reciprocal algorithm may be something for you to consider.
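To make the idea concrete, here is a rough floating-point model of a Goldschmidt square root / reciprocal square root iteration, with the reciprocal obtained by squaring the reciprocal square root. It is only a sketch under assumptions: the function name, the constant seed of 0.6, and the normalization to [1, 4) are illustrative choices, not a description of the fixed-point sequencer discussed above.

```python
def goldschmidt_sqrt_recip(s, iterations=6):
    """Return (sqrt(s), 1/sqrt(s), 1/s) for s > 0 using Goldschmidt iteration."""
    if s <= 0:
        raise ValueError("s must be positive")
    # Normalize s = m * 4**k with m in [1, 4); in hardware this is a shift.
    m, k = float(s), 0
    while m >= 4.0:
        m /= 4.0
        k += 1
    while m < 1.0:
        m *= 4.0
        k -= 1
    y = 0.6            # crude seed for 1/sqrt(m); a lookup table would be better
    x = m * y          # converges to sqrt(m)
    h = y / 2.0        # converges to 1 / (2 * sqrt(m))
    for _ in range(iterations):
        r = 0.5 - x * h    # residual, shrinks roughly quadratically
        x += x * r         # multiply-accumulate
        h += h * r         # multiply-accumulate
    root = x * 2.0 ** k              # undo the normalization
    rsqrt = 2.0 * h / 2.0 ** k
    return root, rsqrt, rsqrt * rsqrt  # 1/s is the square of 1/sqrt(s)
```

Each iteration needs only multiplies, adds, and shifts, which is why the approach can reuse an existing multiplier, shifter, and adder instead of a dedicated divider.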

_________________
Michael A.


Last edited by MichaelM on Sun Oct 07, 2018 6:34 pm, edited 1 time in total.



Sat Oct 06, 2018 11:43 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 729
Location: Canada
I’ve been playing around with a Goldschmidt divider and I like the results. It took me a bit to figure out how to get it to converge fast. It requires the use of a shifter, as Michael hinted. The only issue I see with using it is with large numbers of bits, for example 128-bit floats. It would require about 100 multiplier slices and adders to implement, and that’s got to be slow for single-cycle operation. The fmax may suffer. Still, I think I'll include it in the FPU and GPU. Is it patented still? Waiting six or eight cycles for a divide isn't too bad. Combined with a handle trick, divides could be almost single cycle in some circumstances.
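As an illustration of why the shifter matters for convergence, here is a small fixed-point model of Goldschmidt division. It is a sketch under assumptions, not the divider described in this post: the Q32 format, the function name, and the restriction to positive operands below 2^32 are all choices made for the example. The shift normalizes the divisor into [0.5, 1), after which the error roughly squares on every iteration.

```python
FRAC = 32            # assumed number of fraction bits
ONE = 1 << FRAC      # fixed-point representation of 1.0

def goldschmidt_divide(n, d, iterations=6):
    """Approximate n / d for integers 0 <= n, 0 < d < 2**32; returns a float."""
    if d <= 0 or n < 0:
        raise ValueError("expects n >= 0 and d > 0")
    # Normalize with a shift so the divisor lies in [0.5, 1.0).
    shift = d.bit_length()
    d_fx = (d << FRAC) >> shift
    n_fx = (n << FRAC) >> shift      # same scaling keeps the ratio unchanged
    for _ in range(iterations):
        f = 2 * ONE - d_fx           # correction factor F = 2 - D
        n_fx = (n_fx * f) >> FRAC    # N <- N * F
        d_fx = (d_fx * f) >> FRAC    # D <- D * F, converges towards 1.0
    return n_fx / ONE                # N converges towards n / d
```

Starting from a divisor error of at most 0.5, five iterations already bring the error below 2^-32, which fits with a latency of six to eight cycles when each iteration takes one cycle.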

I have the following code snippet from line drawing which hides some of the divide latency. With Goldschmidt it'd be completely hidden.
Code:
      fxdiv   r9,r8,r7      ; dy/dx
      fxdiv   r12,r7,r8      ; dx/dy
      lh      r16,TARGET_X0[r20]
      lh      r17,TARGET_Y0[r20]
      lh      r18,TARGET_X1[r20]
      lh      r19,TARGET_Y1[r20]
      divwait   r9,r9
      divwait   r12,r12

_________________
Robert Finch http://www.finitron.ca


Sun Oct 07, 2018 5:08 am

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 143
Location: Huntsville, AL
Quote:
Is it patented still?
US patent was granted in 1998. I think 14 years is the term of the protections. So if those facts / assumptions are correct, there should be no concern regarding the patent for you.

_________________
Michael A.


Sun Oct 07, 2018 6:42 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 729
Location: Canada
I built the test system using Goldschmidt dividers in the GPU and pow. The system size was about 500,000 LC's. I wasn't expecting that. The chip is only 200,000 LC's :) I was wondering why the divider wasn't available as an IP core in the vendor's catalog.

After experimenting with the Goldschmidt divider, it’s being shelved because it takes too many resources. Not that the divider itself takes too many, but the way I implemented it does. There are undoubtedly smaller implementations possible. Rather than share the multiplier and shifter with other instructions in the GPU, the GS divider has its own set. The overhead of a sequencing machine and registered multiplexing of values would partially defeat the performance benefit of the divider. Synthesis must have implemented the multipliers with LUTs because the divider by itself was 32,000 LCs. Far too large. One issue is that the divider requires triple-width registers and two double-width multipliers.

I'm trying to get an indication that the GPU is working, or rather at least starting up. The first thing the GPU software does is set the display pixel area to the color red. At the moment the display shows random colors, so the software's not working, but the hardware pixel randomizer must be.

_________________
Robert Finch http://www.finitron.ca


Tue Oct 09, 2018 2:11 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 729
Location: Canada
Had to reduce the number of cores in the GPU to eight from twelve. The screen is now managed by a 4x2 grid instead of a 4x3 grid. Each core is responsible for more pixels and it’ll slow things down, but otherwise the design just wouldn’t route. There were problems with the chip being too full and failing to route all signals.
The fixed point multiply was added to the base core. It's the same as a regular multiply and shares the same multiplier. All it does is take a different selection of bits from the multiplier output.
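A small model of that bit selection, as a sketch only: the 16.16 format, the 32-bit width, unsigned operands, and all names below are assumptions for illustration, since the post does not state the actual fixed-point format.

```python
WIDTH = 32                 # assumed register width
FRAC_BITS = 16             # assumed 16.16 fixed-point format
MASK = (1 << WIDTH) - 1

def full_product(a, b):
    """The shared multiplier: a double-width product of two registers."""
    return a * b

def int_mul(a, b):
    """Regular multiply: take the low WIDTH bits of the product."""
    return full_product(a, b) & MASK

def fx_mul(a, b):
    """Fixed-point multiply: same product, bits [47:16] selected instead."""
    return (full_product(a, b) >> FRAC_BITS) & MASK

def to_fx(value):
    """Convert a non-negative number to 16.16 fixed point."""
    return int(value * (1 << FRAC_BITS))
```

For example, `fx_mul(to_fx(1.5), to_fx(2.25))` yields `to_fx(3.375)`, while `int_mul(6, 7)` still yields 42 from the same multiplier.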
The workstation needs to be rebooted again, something’s wrong with sim.

_________________
Robert Finch http://www.finitron.ca


Wed Oct 10, 2018 5:22 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1030
I'm wondering about having the 8 GPU cores service the needs of a larger number of smaller tiles: say 24 tiles. Different tiles will need different amounts of work in each frame. And the tiles at the top of the frame should be serviced before those at the bottom. A kind of chasing the beam but at a tile granularity, and with an 8-way parallel processor. Does that make any sense?
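One way to picture the idea is a shared work list in raster order, with each core taking the next tile as soon as it is free. The following is only a toy simulation with made-up names and abstract per-tile costs, not something the GPU implements.

```python
import heapq

def schedule_tiles(tile_costs, num_cores=8):
    """Assign tiles (listed top-left first) to whichever core frees up first.

    Returns a list of (tile_index, core, start_time, finish_time).
    """
    cores = [(0, c) for c in range(num_cores)]   # (time the core is free, core id)
    heapq.heapify(cores)
    plan = []
    for tile, cost in enumerate(tile_costs):     # raster order: top rows first
        free_at, core = heapq.heappop(cores)
        finish = free_at + cost
        plan.append((tile, core, free_at, finish))
        heapq.heappush(cores, (finish, core))
    return plan
```

With 24 tiles where one tile costs 5 units and the rest cost 1, this finishes at time 5, whereas a fixed three-tiles-per-core split would leave the core holding the expensive tile finishing at time 7.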


Wed Oct 10, 2018 5:41 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 729
Location: Canada
What you say makes some sense to me. I added an instruction to detect the scan position and allow processing accordingly for beam racing code. I also want the GPU to reset every so often (based on a frame counter). I’m thinking of the Amiga’s Copper co-processor with a more modern twist.
The GPU needs a lot of work yet. I finally got the system to build with an eight-core GPU. I reorganized the GPU a little bit. Each core now processes a 128x160 target area. This is easily broken up into smaller tiles as BigEd suggests (e.g. a 4x4 grid of 32x40 pixels). Making the targets 128 pixels wide avoids a multiplier when calculating the address. This gives a target map of 512x320 and the screen resolution is 400x300. So, there is some extra room in the map to allow smooth scrolling. The amount of block RAM limits the size of the target area. Currently 320k is allocated for the display.
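The address calculation this enables can be sketched as a shift and an OR. The names below are hypothetical, and the two bytes per pixel are inferred from the 16-bit pixel stores used in the test programs in this thread.

```python
TARGET_WIDTH = 128       # pixels per row of one core's target area
TARGET_HEIGHT = 160      # rows per target area
BYTES_PER_PIXEL = 2      # 16-bit pixels

def pixel_offset(x, y):
    """Byte offset of pixel (x, y) inside a 128-wide target, no multiplier."""
    assert 0 <= x < TARGET_WIDTH and 0 <= y < TARGET_HEIGHT
    # y * 128 is y << 7, and x fits in the low 7 bits, so OR replaces add.
    return ((y << 7) | x) << 1

def target_bytes():
    """Memory needed for one core's target area."""
    return (TARGET_HEIGHT << 7) << 1
```

Eight such areas come to 8 x 40,960 = 327,680 bytes, which matches the 320k allocated for the display.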

Reset uses the scan counter on the ‘b’ side of the ram to initialize the display to random pixels using a LFSR. This initialization lasts for 60 frames (approx. 1s). Normally the ‘b’ side is read-only. GPU ram connection:
Code:
FT_GPURam ulr1
(
  .clka(clk_i),
  .ena(cyc & cs_ram),
  .wea({4{we_o}} & sel_o),
  .addra(adr_o[15:2]),
  .dina(dat_o),
  .douta(ram_dat),
  .clkb(dot_clk_i),
  .enb(1'b1),
  .web({2{por}}),   // active only for 60 frames after reset
  .addrb(scan_adr_i),
  .dinb(lfsr_o),
  .doutb(scan_dat_o)
);

This part of the code is working, so some things must be wired up correctly.
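For reference, a software model of an LFSR producing the kind of pseudo-random pixel stream described above might look like this. The 16-bit width, the tap positions (16, 14, 13, 11, a commonly used maximal-length choice), the seed, and the function names are all assumptions; the actual `lfsr_o` generator is not shown in this thread.

```python
def lfsr16_step(state):
    """Advance a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) by one step."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)

def random_pixels(count, seed=0xACE1):
    """Generate `count` 16-bit pseudo-random pixel values; seed must be non-zero."""
    state = seed
    pixels = []
    for _ in range(count):
        state = lfsr16_step(state)
        pixels.append(state)
    return pixels
```

Calling `random_pixels(20480)` fills one 128x160 target area with noise, which is roughly what the display should show during the first 60 frames after reset.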

The reset code the GPU core executes is supposed to clear the display to the color red. This is a very simple test program to ensure the GPU core is working. It’s just an infinite loop setting the screen.
Code:
      code
      org      $FFFC0100
start:
      ldi      r1,#$0F00   ; red in ZRGB4444
      ldi      r2,#20480   ; number of pixels (128x160)
      ldi      r3,#SCREEN
.0001:
      sc      r1,[r3]
      add      r3,r3,#2
      sub      r2,r2,#1
      bne      r2,r0,.0001
      jmp      start

This code seems to work, but there is a black band through the middle of the display. It looks like the scan addressing is off a bit.
Now for a slightly more difficult test program - calling a subroutine to clear the screen.

_________________
Robert Finch http://www.finitron.ca


Thu Oct 11, 2018 3:10 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 729
Location: Canada
Found a bug in FT64v5 when the tail pointers are being reset because of a branch miss. The tail pointer may or may not have been set correctly. This bug was found by inspection. It did not appear to affect the operation of the processor, but eventually there would be a circumstance that would cause it to fail.
Some more of the issue logic was re-written as loops. Closer to the goal of having the number of queue entries fully parameterized.

A slightly more complex program to be executed on the GPU didn't work (yet). This time a subroutine is called and the color is set based on which core is executing the code. The screen appeared black; the suspect is the load of the color value failing. An error in the load logic was found.
Code:
      code
      org      $FFFC0100
start:
      call   ColorScreen
      jmp      start

ColorScreen:
      csrrd   r1,#1,r0
      or      r2,r1,r0      ; r2 = r1
      and      r2,r2,#$10   ; get row
      and      r1,r1,#$03   ; mask column
      shr      r2,r2,#2
      or      r1,r1,r2      ; combine row,column
      shl      r1,r1,#1      ; convert to index
      lc      r1,table3[r1]   ; select color
      ldi      r2,#20480      ; number of pixels
      ldi      r3,#SCREEN
.0001:
      sc      r1,[r3]
      add      r3,r3,#2
      dbnz   r2,.0001
      ret

table3:
      dc      $0F00,$00F0,$000F,$0F0F
      dc      $0F0F,$00FF,$0FFF,$0000
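The index calculation in ColorScreen can be checked with a small software model. This is a sketch that assumes, based only on the masks used above, that the core number read by csrrd holds the column in bits 1:0 and the row in bit 4; the function name is hypothetical.

```python
# Same values as table3 above (ZRGB4444 colors).
TABLE3 = [0x0F00, 0x00F0, 0x000F, 0x0F0F,
          0x0F0F, 0x00FF, 0x0FFF, 0x0000]

def core_color(core_num):
    """Mirror the row/column masking done in ColorScreen."""
    row = (core_num & 0x10) >> 2      # row bit 4 moved down to bit 2
    col = core_num & 0x03             # column in bits 1:0
    index = row | col                 # 0..7 for a 4x2 grid
    # The assembly shifts the index left by 1 because each entry is 2 bytes;
    # a Python list is indexed by entry, so no shift is needed here.
    return TABLE3[index]
```

Under that assumption, core $00 gets $0F00 (red) and core $13 gets $0000. Note that entries 3 and 4 of the table hold the same value, so cores $03 and $10 would show the same color.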


_________________
Robert Finch http://www.finitron.ca


Sat Oct 13, 2018 4:10 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 729
Location: Canada
Fixed a bug in FPP. FPP would find files only in exactly the specified directory if the filename was specified with quotes. It now looks in additional include directories if not found where specified.
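The lookup order described above could be sketched like this. It is a hypothetical illustration in Python, not FPP's code: the function name and parameters are made up, and the "specified directory" is modelled as `current_dir`.

```python
from pathlib import Path

def resolve_include(name, current_dir, include_dirs):
    """Find a quoted include: specified directory first, then include dirs.

    Returns the first matching Path, or None if the file is not found.
    """
    for directory in [current_dir, *include_dirs]:
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    return None
```

Before the fix, the behaviour corresponded to checking only `current_dir` and giving up.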
I tried compiling part of the standard C library and uncovered several bugs in the compiler which are now fixed. There are still a few bugs left to go before the library will compile successfully.
In the CC64 compiler I had set up the identifier “null” to be equal to the value zero. Well, in the C library they decided to use a variable called “null” which was a non-zero pointer. I ended up just commenting out the code in the compiler that sets “null” to zero.
The compiler also had to be modified to support initialization of non-primitive types in unions.

_________________
Robert Finch http://www.finitron.ca


Sun Oct 14, 2018 3:58 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 729
Location: Canada
Got about 75 files from the standard C library to compile without errors after a long debug. Found another bug in FPP where a macro failed to expand after a quote was detected in the input. I figure if I can get the complete standard C library to compile, that'll be a stepping stone to other software. There is some pretty nasty code for compilers in the library.
One thought borrowed from another person on the web is to try and get DOOM running on the system.

_________________
Robert Finch http://www.finitron.ca


Mon Oct 15, 2018 3:01 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 729
Location: Canada
After a marathon debug session, the C standard library compiles! There are 300+ files that compile cleanly. Whether or not the compiled code works is another story.
I found a feature missing from FPP. The C preprocessor ‘#’ to stringize tokens was missing from FPP. This is rarely used, and I hadn’t bothered to add it. Well it’s there now and works well enough to compile the library.
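For anyone who hasn't met it, the ‘#’ operator turns a macro argument into a string literal. A toy model of that expansion step is sketched below; it is not FPP's implementation, the function name is made up, and it ignores ‘##’ token pasting and does not protect string literals already present in the macro body.

```python
import re

def expand_macro(params, body, args):
    """Expand a function-like macro body, handling the '#' stringize operator."""
    bindings = dict(zip(params, args))

    def replace(match):
        stringized, plain = match.group(1), match.group(2)
        if stringized is not None:
            if stringized not in bindings:
                return match.group(0)          # '#' before a non-parameter: leave as-is
            text = bindings[stringized]
            # Escape backslashes and quotes so the result is a valid literal.
            text = text.replace("\\", "\\\\").replace('"', '\\"')
            return '"' + text + '"'
        return bindings.get(plain, plain)      # ordinary parameter substitution

    # A lone '#' followed by a name, or a plain identifier.
    pattern = r'(?<!#)#(?!#)\s*(\w+)|\b(\w+)\b'
    return re.sub(pattern, replace, body)
```

For instance, with a macro `show(x)` defined as `puts(#x)`, the call `show(count + 1)` expands to `puts("count + 1")`.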

_________________
Robert Finch http://www.finitron.ca


Tue Oct 16, 2018 10:26 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1030
Another big milestone!


Tue Oct 16, 2018 4:57 pm

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 143
Location: Huntsville, AL
Rob:

I'm trying to follow along, but for the life of me, I can't translate FPP. What does the acronym refer to in your posts above?

_________________
Michael A.


Thu Oct 18, 2018 12:16 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 729
Location: Canada
Optimized code generation for binary operations where both sides of the operation have the same expression tree. For instance, “y = g * g” now compiles to use the same temporary for g twice rather than evaluating the expression g twice.
Code:
;          double y = g * g;
            lf.d        $fp4,[$r11]
            fmul.d      $fp3,$fp4,$fp4
            mov         $fp11,$fp3

The above used to do two separate loads of the variable ‘g’ and assign the value to two separate temporary registers.
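The shape of that optimization can be shown with a tiny code generator. This is a sketch with assumed names and a made-up three-address output format, not CC64's code: expression trees are nested tuples, and when both operands of a binary node are structurally equal the right side reuses the left side's temporary.

```python
import itertools

def gen(expr, code, temps):
    """Emit code for expr and return the name of the temporary holding it."""
    if isinstance(expr, str):                  # leaf: a variable, needs a load
        reg = f"t{next(temps)}"
        code.append(f"load {reg}, {expr}")
        return reg
    op, left, right = expr
    lreg = gen(left, code, temps)
    # Identical subtrees: evaluate once and use the same temporary twice.
    rreg = lreg if left == right else gen(right, code, temps)
    dest = f"t{next(temps)}"
    code.append(f"{op} {dest}, {lreg}, {rreg}")
    return dest

def compile_expr(expr):
    """Return the list of emitted instructions for an expression tree."""
    code = []
    gen(expr, code, itertools.count())
    return code
```

`compile_expr(("mul", "g", "g"))` emits one load and one multiply, while `("mul", "g", "h")` still emits two loads. The reuse is only safe when the repeated expression has no side effects, for example not for two calls to the same function.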
Found a problem in the peephole optimizer where a floating-point instruction would be removed if it had the same register code as an integer instruction for the target register. The code needed to distinguish between the register classes.
The names of functions declared as static had to have the lexical unit name prepended to avoid name conflicts. This is something I knew had to be done one day to deal with larger program sets.
Several issues with the conditional operator were resolved.
The code now compiles and assembles. Numerous code generation issues were fixed. Next thing to work on will be the system emulator. It would be cool to have some real software running in at least the emulator.

_________________
Robert Finch http://www.finitron.ca


Thu Oct 18, 2018 4:34 am