robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2215 Location: Canada
|
Quote: I am assuming that your GPU is using floating point (FP).
Nope. It just uses fixed point. I modelled some of it on the orsoc graphics accelerator. I needed to keep the core size down. The divider is a multi-cycle non-restoring divider that processes 1 bit per clock cycle. While there are faster dividers around, the size of this one is minimal. One thought was to use a pipelined divider, but then each stage would be almost the same size as a non-pipelined divider. I just guessed that four non-pipelined dividers would still be a lot smaller than a fully pipelined one (64 stages!). The orsoc graphics accelerator can do two divides per clock using a pipelined divider and queueing of results. I was going to use that core, but only three of them would fit in the FPGA. I have not looked up how fast GPU division is. This is a work in progress and likely to change, as it already has. I should maybe try the FPU; it's not that large (2,500 LCs for single precision, IIRC). I may be better off going with fewer, more powerful processors. The current processor has a CPI of 4 or more, but runs at the dot frequency.
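For readers unfamiliar with the one-bit-per-clock non-restoring scheme mentioned above, here is a minimal software sketch. It is not the actual divider from the GPU core; the function name, the unsigned-only operands, and the default 64-bit width are assumptions made for illustration. Each pass of the loop stands in for one clock cycle.

```python
def nonrestoring_divide(dividend, divisor, bits=64):
    """Unsigned non-restoring division, one quotient bit per loop pass.

    Hypothetical model: each loop pass corresponds to one clock cycle.
    Returns (quotient, remainder).
    """
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    if not 0 <= dividend < (1 << bits) or not 0 < divisor < (1 << bits):
        raise ValueError("operands must fit in `bits` unsigned bits")

    mask = (1 << bits) - 1
    rem = 0          # partial remainder; allowed to go negative between steps
    quo = dividend   # dividend shifts out the top, quotient bits shift in below

    for _ in range(bits):
        was_negative = rem < 0
        # Shift the (rem, quo) pair left by one bit.
        rem = 2 * rem + ((quo >> (bits - 1)) & 1)
        quo = (quo << 1) & mask
        # Add or subtract depending on the previous sign; no restore step.
        rem = rem + divisor if was_negative else rem - divisor
        if rem >= 0:
            quo |= 1
    # Single correction at the end if the remainder finished negative.
    if rem < 0:
        rem += divisor
    return quo, rem
```

The appeal for a small core is that each step needs only one adder/subtractor and a shift, at the cost of one cycle per result bit.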
_________________Robert Finch http://www.finitron.ca
|
Sat Oct 06, 2018 9:57 pm |
|
|
MichaelM
Joined: Wed Apr 24, 2013 9:40 pm Posts: 213 Location: Huntsville, AL
|
Over the past year or so, I developed a fixed-point division function based on the Goldschmidt square root and reciprocal square root algorithm. It generates two results, and I use the square of the inverse (reciprocal) square root to provide the reciprocal of the divisor. The algorithm I used converges in 6 (total) iterations, and uses a Booth multiplier, left/right arithmetic shifter, and summer.
All of these components were already part of the ALU, so all I needed was the sequencer logic to use the standard components of the ALU to compute the square root, reciprocal of the square root, and the reciprocal of the divisor. If I recall correctly, the algorithm, using a 4-bits-per-cycle Booth multiplier, comes out slightly slower than your 66 clock cycles. I did save a considerable amount of logic by not implementing a dedicated divider function in the ALU, since there is only one division and square root per update cycle, but there are considerably more multiplies and accumulations per cycle.
Perhaps the Goldschmidt reciprocal algorithm may be something for you to consider.
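As a rough illustration of the idea, the sketch below shows the textbook Goldschmidt division iteration in fixed point. Note that this is the plain division form, not the square-root / reciprocal-square-root variant described above, and the Q32 format, the function names, and the pre-scaling by a shift are all assumptions made for the example. The default of six iterations simply echoes the figure quoted in the post.

```python
FRAC_BITS = 32            # assumed fixed-point fraction width (illustrative)
ONE = 1 << FRAC_BITS      # fixed-point representation of 1.0


def goldschmidt_divide(n, d, iterations=6):
    """Approximate n / d for integers n >= 0, d > 0.

    Returns the quotient as a fixed-point integer with FRAC_BITS fraction bits.
    """
    if n < 0 or d <= 0:
        raise ValueError("sketch handles n >= 0 and d > 0 only")
    # Pre-scale with a shifter so the denominator lies in [0.5, 1).
    shift = d.bit_length()
    num = (n << FRAC_BITS) >> shift
    den = (d << FRAC_BITS) >> shift
    for _ in range(iterations):
        factor = 2 * ONE - den               # F = 2 - D
        num = (num * factor) >> FRAC_BITS    # N <- N * F
        den = (den * factor) >> FRAC_BITS    # D <- D * F, converging to 1.0
    return num


def fixed_to_float(q):
    """Convert the fixed-point result back to a Python float for inspection."""
    return q / ONE
```

Only a multiplier, a shifter, and a subtractor are needed per step, which matches the observation that the existing ALU components can be reused under a sequencer.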
_________________ Michael A.
Last edited by MichaelM on Sun Oct 07, 2018 6:34 pm, edited 1 time in total.
|
Sat Oct 06, 2018 11:43 pm |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2215 Location: Canada
|
I’ve been playing around with a Goldschmidt divider and I like the results. It took me a bit to figure out how to get it to converge fast. It requires the use of a shifter, as Michael hinted. The only issue I see in using it is with large numbers of bits, for example 128-bit floats. It would require about 100 multiplier slices and adders to implement, and that’s got to be slow for single-cycle operation. The fmax may suffer. Still, I think I'll include it in the FPU unit and GPU. Is it patented still? Waiting six or eight cycles for a divide isn't too bad. Combined with a handle trick, divides could be almost single cycle in some circumstances. I have the following code snippet from line drawing which hides some of the divide latency. With Goldschmidt it'd be completely hidden.
Code:
    fxdiv   r9,r8,r7            ; dy/dx
    fxdiv   r12,r7,r8           ; dx/dy
    lh      r16,TARGET_X0[r20]
    lh      r17,TARGET_Y0[r20]
    lh      r18,TARGET_X1[r20]
    lh      r19,TARGET_Y1[r20]
    divwait r9,r9
    divwait r12,r12
_________________Robert Finch http://www.finitron.ca
|
Sun Oct 07, 2018 5:08 am |
|
|
MichaelM
Joined: Wed Apr 24, 2013 9:40 pm Posts: 213 Location: Huntsville, AL
|
Quote: Is it patented still?
US patent was granted in 1998. I think 14 years is the term of the protections. So if those facts / assumptions are correct, there should be no concern regarding the patent for you.
_________________ Michael A.
|
Sun Oct 07, 2018 6:42 pm |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2215 Location: Canada
|
I built the test system using Goldschmidt dividers in the GPU and pow. The system size was about 500,000 LCs. I wasn't expecting that. The chip is only 200,000 LCs. I was wondering why the divider wasn't available as an IP core in the vendor's catalog. After experimenting with the Goldschmidt divider, it’s being shelved because it takes too many resources. Not that the divider itself takes too many, but the way I implemented it does. There are undoubtedly smaller implementations possible. Rather than share the multiplier and shifter with other instructions in the GPU, the GS divider has its own set. The overhead of a sequencing machine and registered multiplexers of values would partially defeat the performance benefit of the divider. Synthesis must have implemented the multipliers with LUTs, because the divider by itself was 32,000 LCs. Far too large. One issue is that the divider requires triple-width registers and two double-width multipliers. I'm trying to get an indication that the GPU is working, or rather at least starting up. The first thing the GPU software does is set the display pixel area to the color red. At the moment the display shows random colors, so the software's not working, but the hardware pixel randomizer must be.
_________________Robert Finch http://www.finitron.ca
|
Tue Oct 09, 2018 2:11 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2215 Location: Canada
|
Had to reduce the number of cores in the GPU to eight from twelve. The screen is now managed by a 4x2 grid instead of a 4x3 grid. Each core is responsible for more pixels and it’ll slow things down, but otherwise the design just wouldn’t route. There were problems with the chip being too full and failing to route all signals. The fixed-point multiply was added to the base core. It's the same as a regular multiply and shares the same multiplier. All it does is take a different selection of bits from the multiplier output. The workstation needs to be rebooted again; something’s wrong with sim.
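To illustrate how a fixed-point multiply can share the integer multiplier and differ only in which product bits it keeps, here is a small sketch. The 32-bit register width and the 16.16 fixed-point format are assumptions for the example, not details taken from the core.

```python
WIDTH = 32                 # assumed register width (illustrative)
FRAC = 16                  # assumed 16.16 fixed-point format (illustrative)
MASK = (1 << WIDTH) - 1


def full_product(a, b):
    """The shared multiplier: produces the full double-width product."""
    return a * b


def int_mul(a, b):
    """Regular multiply: keep the low WIDTH bits of the product."""
    return full_product(a, b) & MASK


def fx_mul(a, b):
    """Fixed-point multiply: same product, but select bits starting at FRAC."""
    return (full_product(a, b) >> FRAC) & MASK


def to_fx(x):
    """Helper for the example: convert a number to 16.16 fixed point."""
    return int(round(x * (1 << FRAC)))
```

For instance, `fx_mul(to_fx(1.5), to_fx(2.25))` gives the same bit pattern as `to_fx(3.375)`, while `int_mul` on the same inputs would return the low 32 bits of the raw product.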
_________________Robert Finch http://www.finitron.ca
|
Wed Oct 10, 2018 5:22 am |
|
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1806
|
I'm wondering about having the 8 GPU cores service the needs of a larger number of smaller tiles: say 24 tiles. Different tiles will need different amounts of work in each frame. And the tiles at the top of the frame should be serviced before those at the bottom. A kind of chasing the beam but at a tile granularity, and with an 8-way parallel processor. Does that make any sense?
|
Wed Oct 10, 2018 5:41 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2215 Location: Canada
|
What you say makes some sense to me. I added an instruction to detect the scan position and allow processing accordingly for beam racing code. I also want the GPU to reset every so often (based on a frame counter). I’m thinking of the Amiga’s Copper co-processor with a more modern twist. The GPU needs a lot of work yet. I finally got the system to build with an eight-core GPU. I reorganized the GPU a little bit. Each core now processes a 128x160 target area. This is easily broken up into smaller tiles as BigEd suggests (e.g. a 4x4 grid of 32x40 pixels). Making the targets 128 pixels wide avoids a multiplier when calculating the address. This gives a target map of 512x320, and the screen resolution is 400x300, so there is some extra room in the map to allow smooth scrolling. The amount of block RAM limits the size of the target area. Currently 320k is allocated for the display. Reset uses the scan counter on the ‘b’ side of the RAM to initialize the display to random pixels using an LFSR. This initialization lasts for 60 frames (approx. 1s). Normally the ‘b’ side is read-only. GPU RAM connection:
Code:
FT_GPURam ulr1 (
    .clka(clk_i),
    .ena(cyc & cs_ram),
    .wea({4{we_o}} & sel_o),
    .addra(adr_o[15:2]),
    .dina(dat_o),
    .douta(ram_dat),
    .clkb(dot_clk_i),
    .enb(1'b1),
    .web({2{por}}),         // active only for 60 frames after reset
    .addrb(scan_adr_i),
    .dinb(lfsr_o),
    .doutb(scan_dat_o)
);
This part of the code is working, so some things must be wired up correctly. The reset code the GPU core executes is supposed to clear the display to the color red. This is a very simple test program to ensure the GPU core is working. It’s just an infinite loop setting the screen.
Code:
        code
        org     $FFFC0100
start:
        ldi     r1,#$0F00       ; red in ZRGB4444
        ldi     r2,#20480       ; number of pixels (128x160)
        ldi     r3,#SCREEN
.0001:
        sc      r1,[r3]
        add     r3,r3,#2
        sub     r2,r2,#1
        bne     r2,r0,.0001
        jmp     start
This code seems to work, but there is a black band through the middle of the display. It looks like the scan addressing is off a bit. Now for a slightly more difficult test program - calling a subroutine to clear the screen.
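The remark that a 128-pixel-wide target avoids a multiplier in the address calculation can be sketched as follows. This is only an illustration: the function name and the `base` parameter are hypothetical, and the 2-byte pixel size is inferred from the 16-bit ZRGB4444 format and the `add r3,r3,#2` step in the test program.

```python
TARGET_WIDTH = 128
TARGET_HEIGHT = 160
WIDTH_SHIFT = 7        # log2(128): the row stride becomes a shift
PIXEL_SHIFT = 1        # assumed 2 bytes per pixel (16-bit ZRGB4444)


def pixel_address(base, x, y):
    """Byte address of pixel (x, y) in one core's target area.

    Because the width is a power of two, y * 128 + x reduces to
    (y << 7) | x, so no multiplier is needed.
    """
    if not (0 <= x < TARGET_WIDTH and 0 <= y < TARGET_HEIGHT):
        raise ValueError("pixel outside the 128x160 target area")
    return base + (((y << WIDTH_SHIFT) | x) << PIXEL_SHIFT)
```

With a width that is not a power of two, the `y << WIDTH_SHIFT` term would have to become a true multiply or a sum of shifted terms.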
_________________Robert Finch http://www.finitron.ca
|
Thu Oct 11, 2018 3:10 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2215 Location: Canada
|
Found a bug in FT64v5 when the tail pointers are being reset because of a branch miss. The tail pointer may or may not have been set correctly. This bug was found by inspection. It did not appear to affect the operation of the processor, but eventually there would be a circumstance that would cause it to fail. Some more of the issue logic was rewritten as loops. This is closer to the goal of having the number of queue entries fully parameterized. A slightly more complex program to be executed on the GPU didn't work (yet). This time a subroutine is called and the color is set based on which core is executing the code. The screen appeared black; the suspect is the load of the color value failing. An error in the load logic was found.
Code:
        code
        org     $FFFC0100
start:
        call    ColorScreen
        jmp     start

ColorScreen:
        csrrd   r1,#1,r0
        or      r2,r1,r0        ; r2 = r1
        and     r2,r2,#$10      ; get row
        and     r1,r1,#$03      ; mask column
        shr     r2,r2,#2
        or      r1,r1,r2        ; combine row,column
        shl     r1,r1,#1        ; convert to index
        lc      r1,table3[r1]   ; select color
        ldi     r2,#20480       ; number of pixels
        ldi     r3,#SCREEN
.0001:
        sc      r1,[r3]
        add     r3,r3,#2
        dbnz    r2,.0001
        ret

table3:
        dc      $0F00,$00F0,$000F,$0F0F
        dc      $0F0F,$00FF,$0FFF,$0000
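The row/column packing in ColorScreen can be modelled in a few lines. This sketch assumes, based only on the masks in the assembly, that the value read from CSR #1 carries the core's row in bit 4 and its column in bits 0 to 1; the function name is hypothetical. The table values are copied from table3.

```python
# Colors copied from table3 in the assembly (ZRGB4444).
TABLE3 = [0x0F00, 0x00F0, 0x000F, 0x0F0F,
          0x0F0F, 0x00FF, 0x0FFF, 0x0000]


def color_for_core(core_id):
    """Pick a color for a core in the 4x2 grid.

    Assumed layout of core_id: row in bit 4, column in bits 0-1.
    """
    row_bit = (core_id & 0x10) >> 2   # move the row bit down to bit 2
    column = core_id & 0x03           # two column bits
    index = row_bit | column          # 3-bit index, 0..7
    # The assembly shifts this index left by one to form a byte offset into
    # the 16-bit table; a Python list is indexed by entry, so no shift here.
    return TABLE3[index]
```

For example, a core reporting row 1, column 3 (`0x13`) selects the last entry, `0x0000`.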
_________________Robert Finch http://www.finitron.ca
|
Sat Oct 13, 2018 4:10 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2215 Location: Canada
|
Fixed a bug in FPP. FPP would find files only in exactly the specified directory if the filename was specified with quotes. It now looks in additional include directories if the file is not found where specified. I tried compiling part of the standard C library and uncovered several bugs in the compiler, which are now fixed. There are still a few bugs left to go before the library will compile successfully. In the CC64 compiler I set up the identifier “null” to be equal to the value zero. Well, in the C library they decided to use a variable called “null” which was a non-zero pointer. I ended up just commenting out the code in the compiler that sets the “null” variable to zero. The compiler also had to be modified to support initialization of non-primitive types in unions.
_________________Robert Finch http://www.finitron.ca
|
Sun Oct 14, 2018 3:58 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2215 Location: Canada
|
Got about 75 files from the standard C library to compile without errors after a long debug. Found another bug in FPP where a macro failed to expand after a quote was detected in the input. I figure if I can get the complete standard C library to compile, that'll be a stepping stone to other software. There is some pretty nasty code for compilers in the library. One thought borrowed from another person on the web is to try and get DOOM running on the system.
_________________Robert Finch http://www.finitron.ca
|
Mon Oct 15, 2018 3:01 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2215 Location: Canada
|
After a marathon debug session, the C standard library compiles! There are 300+ files that compile cleanly. Whether or not the compiled code works is another story. I found a feature missing from FPP: the C preprocessor ‘#’ operator to stringize tokens. This is rarely used, and I hadn’t bothered to add it. Well, it’s there now and works well enough to compile the library.
_________________Robert Finch http://www.finitron.ca
|
Tue Oct 16, 2018 10:26 am |
|
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1806
|
Another big milestone!
|
Tue Oct 16, 2018 4:57 pm |
|
|
MichaelM
Joined: Wed Apr 24, 2013 9:40 pm Posts: 213 Location: Huntsville, AL
|
Rob:
I'm trying to follow along, but for the life of me, I can't translate FPP. What does the acronym refer to in your posts above?
_________________ Michael A.
|
Thu Oct 18, 2018 12:16 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2215 Location: Canada
|
Optimized code generation for binary operations where both sides of the binary operation have the same expression tree. For instance, “x = g * g” now compiles to using the same temporary for g twice rather than evaluating the expression g twice.
Code:
    ; double y = g * g;
    lf.d    $fp4,[$r11]
    fmul.d  $fp3,$fp4,$fp4
    mov     $fp11,$fp3
The above used to do two separate loads of the variable ‘g’ and assign the value to two separate temporary registers. Found a problem in the peephole optimizer where a floating-point instruction would be removed if it had the same register code as an integer instruction for the target register. The code needed to distinguish between the register classes. The names of functions declared as static had to have the lexical unit name prepended to avoid name conflicts. This is something I knew had to be done one day to deal with larger program sets. Several issues with the conditional operator were resolved. The code now compiles and assembles. Numerous code generation issues were fixed. Next thing to work on will be the system emulator. It would be cool to have some real software running in at least the emulator.
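The same-expression-tree optimization described at the start of this post can be illustrated with a toy code generator. This is not how CC64 is implemented; the tuple-based tree, the pseudo-instruction strings, and the function names are all made up for the example, and it assumes the operand expressions have no side effects.

```python
import itertools


def gen(expr, out, temps):
    """Emit pseudo-instructions for expr and return the temporary holding it.

    expr is either a variable name (str) or a tuple (op, left, right).
    """
    if isinstance(expr, str):
        temp = f"t{next(temps)}"
        out.append(f"load {temp},{expr}")
        return temp
    op, left, right = expr
    left_temp = gen(left, out, temps)
    # If both operand trees are structurally identical, reuse the temporary
    # instead of evaluating the same expression a second time.
    right_temp = left_temp if left == right else gen(right, out, temps)
    temp = f"t{next(temps)}"
    out.append(f"{op} {temp},{left_temp},{right_temp}")
    return temp


def compile_expr(expr):
    """Return the list of pseudo-instructions for one expression tree."""
    out = []
    gen(expr, out, itertools.count())
    return out
```

With this sketch, `compile_expr(("mul", "g", "g"))` emits a single load followed by the multiply, whereas `("mul", "g", "h")` still emits two loads.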
_________________Robert Finch http://www.finitron.ca
|
Thu Oct 18, 2018 4:34 am |
|