 DSD7 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
2016/11/20

The compiler gave me grief today: it couldn’t match function calls to prototypes properly because of char promotions. Character constants in function calls were automatically being promoted to integers, but in the prototype of the called function the parameters were declared as chars. Since an ‘int’ doesn’t match a ‘char’, the compiler thought a different function was being called and didn’t output the correct parameter passing. The compiler no longer promotes chars to ints automatically, which means that passing an integer to a function taking a char parameter now requires a typecast. I should maybe make it a compiler option to treat chars and ints as equals; I believe that’s effectively the default behaviour in ‘C’.
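For what it’s worth, the mismatch is easy to picture in plain C (the function names here are made up for illustration; standard C converts arguments to the prototyped parameter type, so the cast below sketches the new DSD7 compiler behaviour rather than ISO C rules):

```c
/* Sketch of the mismatch: the call site promotes the char argument to
 * int, which no longer matches this prototype's char parameter, so the
 * old compiler behaviour looked like a call to a different function. */
static int take_char(char c)
{
    return c + 1;       /* trivially use the parameter */
}

static int call_with_int(void)
{
    int n = 65;                  /* an int argument... */
    return take_char((char)n);   /* ...now needs an explicit cast to
                                    match the char prototype */
}
```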

Simulation crashed when a 128MB memory was defined for main memory.

I found write cycles no longer worked once the MMU was enabled: the sense of the write-protect bit from the MMU was inverted.

Compressed instructions caused me trouble. I ran into a case where a compressed instruction caused the high-order bits of the previous instruction to be zeroed out. I’m not sure why this happens, so I’ve just disabled the compressed instructions for now; it’s too much work to debug.

_________________
Robert Finch http://www.finitron.ca


Mon Nov 21, 2016 10:31 pm
2016/11/21
Started working on Dark Star * Dragon Eight, planned to be superscalar.
Spent most of the past day documenting.
Spent some time modifying peripherals (PIC, TextController) to work with DSD7, and started putting together a system-on-chip to allow testing on an FPGA board.



Wed Nov 23, 2016 5:03 am
Started working on incorporating floating point into DSD7, updating the floating point for 128-bit quad-precision operations. Most of the FP code is already parameterized to support 32- or 64-bit operations. I started working on the FP unit in 2006, and it still isn’t thoroughly tested!

The FP unit isn’t pipelined to work in an overlapped fashion and doesn’t have result-forwarding muxes, so the results of an FP operation aren’t available immediately. I’ve estimated this slows the core down by about 30%. Since most FP operations take multiple clock cycles anyway, waiting a couple of extra cycles for a result doesn’t have as big an impact as it would for an integer instruction. Since the core is only 32-bit, the FP unit has its own set of registers. There’s no way to transfer data between a general-purpose register and an FP register other than saving and restoring values through memory; loads and stores directly to the FP registers are supported.



Thu Nov 24, 2016 2:41 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Wow - an FPU - could you give a rough estimate of how big an FPU turns out to be?

Do you manage divide or square root?


Thu Nov 24, 2016 5:27 am
Quote:
Wow - an FPU - could you give a rough estimate of how big an FPU turns out to be?

Do you manage divide or square root?

The FPU is about 5,854 LUTs (9,367 LCs) according to synthesis.
Divide is supported; square root isn’t. Divide takes around 130 clock cycles.

2016/11/23
The FP normalizer didn’t work properly: an intermediate value needed to be left-aligned by appending trailing zeros so that the leading-zero count would work properly.
Getting the divider to work properly was a challenge. Originally I coded a radix-8 Booth divider, but for some reason or other it didn’t quite work. I recoded it as a simple radix-2 divider and got that working.
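The radix-2 approach can be modelled in a few lines of C; this is just a software sketch of the shift-and-subtract loop, not the actual Verilog:

```c
#include <stdint.h>

/* Software model of a radix-2 restoring divider: one quotient bit per
 * iteration, mirroring what the hardware does one clock at a time. */
static void div_radix2(uint64_t dividend, uint64_t divisor,
                       uint64_t *quotient, uint64_t *remainder)
{
    uint64_t q = 0, r = 0;
    for (int i = 63; i >= 0; i--) {
        r = (r << 1) | ((dividend >> i) & 1);  /* shift in next dividend bit */
        q <<= 1;
        if (r >= divisor) {                    /* trial subtract succeeds */
            r -= divisor;
            q |= 1;                            /* set the quotient bit */
        }
    }
    *quotient = q;
    *remainder = r;
}
```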
After several bit fiddling fixes the FP unit seems to work. I need more test software now.
The FP unit is probably what will limit the clock frequency of the design: there’s a 112x112-bit multiply in the FP multiplier that isn’t pipelined.
I ran the following test program and dumped the results in the simulator.
Code:
      ; 10.0 + 10.0
      ldi      r1,#$0
      sw      r1,0
      sw      r1,2
      sw      r1,4
      ldi      r1,#%01000000_00000010_01000000_00000000
      sw      r1,6
      lf.q      fp0,0
      lf.q      fp1,0
      fadd.q   fp2,fp0,fp1   // 10.0+10.0
      fmul.q   fp3,fp0,fp1   // 10.0*10.0
      fsub.q   fp4,fp0,fp1   // 10.0-10.0
      fadd.q   fp5,fp3,fp3   // 100.0+100.0
      fdiv.q   fp6,fp3,fp1   // 100.0/10.0
      fdiv.q   fp5,fp0,fp1   // 10.0/10.0

The FP status and control register is readable only via the CSR instructions. To manipulate the control-flag settings, one of the FP control instructions has to be used.

I modified DSD7’s instruction set to include branch-on-bit-set/clear instructions, to make FP branching more efficient. FP compare returns, in the target register, a status code with a number of bit fields.
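In C terms the compare-then-branch-on-bit pattern might look like the sketch below; the bit positions are invented for illustration, not DSD7’s actual status-code layout:

```c
#include <stdint.h>

/* Hypothetical status-code layout returned by FP compare; the real
 * DSD7 bit assignments may differ. */
enum { FCMP_EQ = 1u << 0, FCMP_LT = 1u << 1,
       FCMP_GT = 1u << 2, FCMP_UNORD = 1u << 3 };

static uint32_t fcmp(double a, double b)
{
    if (a != a || b != b) return FCMP_UNORD;  /* NaN operand: unordered */
    if (a < b)  return FCMP_LT;
    if (a > b)  return FCMP_GT;
    return FCMP_EQ;
}

/* A branch-on-bit-set instruction then tests a single bit of the
 * status word instead of needing a second compare: */
static int is_less(double a, double b)
{
    return (fcmp(a, b) & FCMP_LT) != 0;
}
```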



Thu Nov 24, 2016 8:54 pm
Thanks! I'm thinking of FPU as a common case of a more general idea: the point accelerator.


Thu Nov 24, 2016 8:59 pm
Quote:
Thanks! I'm thinking of FPU as a common case of a more general idea: the point accelerator.

Okay, what is a point accelerator? Does it have to do with graphics?

The divide was off by a bit. The next problem was that fractional values didn’t work properly; an extra bit was required in the remainder calculation to fix this.
I’ve now got to write a 128-bit floating point software emulation.
The compiler had to be altered to treat float constants in a fashion similar to string constants: the constants are output to a literal pool, which can then be read with a load instruction. The problem is that there are no immediate-mode floating point operations, as that would require the ability to encode 128-bit constants in the instruction stream.
The compiler’s ability to convert values to 128-bit constants is probably limited; the encoding of pi was off after about the seventh digit. For fractional numbers the compiler converts ASCII to binary using a cascade of multiplies by 1/10, which is inexact.
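The inexactness of that cascade is easy to demonstrate with a naive converter in C (a sketch of the general technique, not the compiler’s actual code); since 0.1 has no exact binary representation, each fractional step contributes a little error:

```c
/* Naive ASCII-to-float conversion in the style described: accumulate
 * the digits as an integer, then scale the fractional part by repeated
 * multiplies by 0.1. Because 0.1 is inexact in binary, the error grows
 * with the number of fractional digits. */
static double naive_atof(const char *s)
{
    double value = 0.0, scale = 1.0;
    int frac = 0;
    for (; *s; s++) {
        if (*s == '.') { frac = 1; continue; }
        value = value * 10.0 + (*s - '0');
        if (frac) scale *= 0.1;   /* inexact: 0.1 is not exact in binary */
    }
    return value * scale;
}
```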
I’ve used the double type as a temporary placeholder for the quad precision type.
Code:
   code
   align   8
public code _main:
            push    xlr
            push    bp
            mov     bp,sp
            sub     sp,sp,#_mainSTKSIZE_
            lf.q    fp3,FPTest_0
            sf.q    fp3,-8[bp]
            lf.q    fp3,FPTest_1
            sf.q    fp3,-16[bp]
            lf.q    fp3,FPTest_2
            sf.q    fp3,-8[bp]
            lf.q    fp3,FPTest_3
            sf.q    fp3,-24[bp]
            lf.q    fp3,FPTest_4
            sf.q    fp3,-8[bp]
            lf.q    fp4,-8[bp]
            lf.q    fp5,-16[bp]
            fadd.q   fp3,fp4,fp5
            sf.q    fp3,-24[bp]
FPTest_8:
            mov     sp,bp
            pop     bp
            pop     xlr
            ret     #2
endpublic



_mainSTKSIZE_ EQU 24

   rodata
   align   16
   align   8
FPTest_4:   ; quad (3.141593653589793238….)
   dw   55A06406,EB18FD85,B5381469,4000921F
FPTest_3:   ; quad (200.5)
   dw   00000000,00000000,00000000,40069100
FPTest_2:   ; quad (0.5)
   dw   FFFFFFFF,FFFFFFFF,FFFFFFFF,3FFDFFFF
FPTest_1:   ; quad (20.0)
   dw   00000000,00000000,00000000,40034000
FPTest_0:   ; quad (10.0)
   dw   00000000,00000000,00000000,40024000
   align   1
;   global   _main



Sat Nov 26, 2016 7:39 am
Ah, by "point accelerator" I mean some kind of hardware dongle or coprocessor or execution unit which performs some application-specific task quickly. Because any common CPU is Turing-complete, it can do anything - eventually. But sometimes it's useful to have specific hardware support, for things like
- CRC
- integer multiply
- multiply-accumulate
- bit counting
- bit reversal
- find first bit set
- bit field extraction
- floating point
- SIMD processing
- array multiplication
- graphics pipeline
- ECC
- Crypto and hashes
- Viterbi
and so on. As mentioned recently over on 6502.org, an accelerator for Conway's Game of Life could also provide a huge speedup for very modest hardware cost. That's extremely application-specific!

Any general-purpose lookup table could possibly be viewed in the same light - it might be dollar-efficient but may not be transistor-efficient, so everything depends on how you're implementing things.


Sat Nov 26, 2016 9:39 am
On the point accelerator: I’ve heard some of the planned newer processors are going to incorporate programmable logic (FPGA blocks) close to the processor. There’s also the Zynq chip, with an ARM core and FPGA logic on the same device.
Would the point accelerator be for an embedded controller? Most of the things listed are already present in high-powered processing cores.

2016/11/26
Figured out an inaccuracy in the Float128 class (the software 128-bit floating point, which took about half a day to write): the carry bits between words were being lost. The word values needed to be typecast to 64-bit integers. The FP emulation now gets the correct digits of pi out to at least 128 bits. The software float type maintains a 256-bit mantissa to allow for multiply results.
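The lost-carry bug is a classic of multi-word arithmetic: in C, adding two 32-bit words produces a 32-bit result, so the carry out of bit 31 vanishes unless an operand is widened first. A minimal sketch (not the Float128 code itself):

```c
#include <stdint.h>

/* Add two multi-word big integers (little-endian word order).
 * Widening each word to 64 bits before adding preserves the carry;
 * without the cast the carry out of bit 31 would be lost. */
static void bigadd(const uint32_t *a, const uint32_t *b,
                   uint32_t *sum, int nwords)
{
    uint64_t carry = 0;
    for (int i = 0; i < nwords; i++) {
        uint64_t t = (uint64_t)a[i] + b[i] + carry;  /* the crucial cast */
        sum[i] = (uint32_t)t;    /* low 32 bits of the word sum */
        carry = t >> 32;         /* carry into the next word */
    }
}
```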
I changed the way the compiler scans numbers to be more accurate.
In the FPU core, multiply and divide didn’t have logic to propagate NaNs.
The 128-bit FPU is now about 6,500 LUTs with the additional NaN logic and bug fixes. It uses about 55 DSP slices as well.
Just as a trial to see how difficult it would be, I added a 32-bit FPU in addition to the 128-bit unit. I don’t have the time to debug it all, so I’m going to stick with 128 bits.



Sun Nov 27, 2016 9:26 am
2016/11/27
Modified the float compare operation to return a combined less-than-or-equal status in a bitfield. This reduces the number of branches required for a less-than-or-equal test.
Made float register fp63 available as a CSR. This allows an integer register to set and read fp63, which alleviates the problem of having to move an FP register to and from memory. Since a mux was required on the FP register read path anyway, I decided to use a four-to-one mux and also forward results. fp0 now reads as +0.0.
Round mode was not being read from the instruction.

The compiler didn’t account for the size of a float (quad) when removing parameters under the Pascal calling convention.
The compiler was modified to recognize the suffix ‘S’, ‘D’, ‘T’, or ‘Q’ on a floating point constant to indicate the precision required. So “3.14Q” is a quad-precision constant while “3.14D” is only double precision; no suffix defaults to quad precision. This is similar to the ‘U’ or ‘L’ suffix on an integer constant. Also, a predefined variable called __floatmax was defined to return the maximum number representable as a float. I changed the usage of doubles representing quads to floats representing quads. Supporting multiple floating point precisions in the compiler is a pita, so it currently only supports what’s supported in hardware: the quad-precision float. I think the ‘C’ standard sets minimum requirements for a float but doesn’t say you can’t use more.



Mon Nov 28, 2016 9:36 pm
2016/11/28
Added the increment/decrement-memory instruction to the instruction set; the compiler fairly often outputs code that can use this for loop increments. I added this instruction to FISA64 and Thor but had left it out here originally, thinking that would reduce the size of the core. However, with FP added, the size increase for INC is minimal (synthesis actually reduced the size of the core).
Mulling over the idea of float branches that work like the integer branches, in order to eliminate the float compare instruction from the instruction stream. But there would be only nine bits available for a displacement.
The compiler wasn’t dereferencing the value in a register properly; this happened only with optimization turned on.
The push operation didn’t update the stack pointer.
Register-to-register operations didn’t increment the PC correctly.
The multi-cycle detect logic was missing a default case. This caused the generation of a latch, and the core thought some operations were multi-cycle that shouldn’t be.
The assembler assembled an immediate load as a memory load, causing problems during simulation.
CSR instructions weren’t updating the register file.



Tue Nov 29, 2016 4:49 pm

Joined: Tue Dec 31, 2013 2:01 am
Posts: 116
Location: Sacramento, CA, United States
I'm sure that I'm not the first to try this, and I certainly won't be the first to finish it, but have you considered a float encoding that allows integer comparison and negation to work equally well on floats?

32-bit example:

$00000000 = 0.0
$80000000 = NaN
$40000000 = +1.0
$C0000000 = -1.0
$7FFFFFFF = largest positive float
$00000001 = smallest positive float
$80000001 = most negative float
$FFFFFFFF = least negative float

You can decide which bit position to separate your exponent from your significand, and even include a hidden .1 bit in your design, if you choose to treat 0.0 and NaN as special cases.

Mike B.


Wed Nov 30, 2016 2:04 am
Quote:
but have you considered a float encoding that allows integer comparison and negation to work equally well on floats?

I hadn't thought of that, but I'm convinced the IEEE representation is the way to go. If you have to read or write a binary float value from another app, it's bound to be IEEE, and there are lots of tools on the web for converting IEEE values. I've experimented a bit with the float format originally used in the Apple computer.
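Incidentally, standard IEEE bit patterns can also be compared as plain integers after a small transform; a C sketch of that well-known trick for single precision (NaNs excluded):

```c
#include <stdint.h>
#include <string.h>

/* Map an IEEE-754 single's bit pattern to a key whose unsigned integer
 * ordering matches the float ordering. Negative floats are stored
 * sign-magnitude (bigger magnitude = smaller value), so flip all their
 * bits; for positives just set the sign bit to lift them above all
 * negatives. */
static uint32_t float_key(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);   /* type-pun without aliasing issues */
    return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}
```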

Indexed loads and stores weren’t fully coded yet.
The simulator runs out to at least 72,000 ns now.
Put together an SoC for DSD7 in preparation for running in real hardware. I’ve reused almost the same SoC for several projects now, so I can put it together fairly quickly.
First test in the FPGA: nothing happened, blank screen.
Several hours later, after studying the software code thoroughly, I realized I hadn’t wired up the video outputs from the display controller in the SoC.
Fixed a typo in the SoC code.
The clearscreen point has been reached!
Can you spot the difference? It took me a while to find: there was a branch to label “.j1” but the label was actually just “j1”. The assembler assembled the code as a branch back to self when it couldn’t find the label. Of course it spit out an error message, but who reads those?
Old code stored a byte to the LEDs for diagnostics, but the assembler doesn’t support byte operations; it got confused and output the wrong address.

The optimizer optimizes away named constants when the name is unknown, because they are given the value zero: the optimizer says “aha, there’s a zero, let’s substitute r0 for it.” A quick hack changed the value to -1 for undefined constants.
For a flow transfer, the target register of the following instruction wasn’t being invalidated when it should have been.
Results forwarding forwarded a result multiple times in a loop during a multi-cycle operation, causing the value to decrement more than it was supposed to. Results forwarding is now active only on the last cycle before the pipeline advances.



Thu Dec 01, 2016 3:08 am
2016/11/30
Struggling with problems getting the core to run in the FPGA. It runs in the simulator as far as I’ve coded, but not in the FPGA. According to the status LEDs it clears the screen, then dies; the blank screen doesn’t show properly, however. I should mention this is running primarily ‘C’ code. I’d like to get printf() working.
In the memsetW() routine I had coded a “BNE” as the loop test; when I changed the “BNE” to a branch-less-than (“BLTU”), the program got further. This was on a hunch that there was a bit error while looping: the loop increment counts up evenly by two, and a BNE test could fail to terminate if the index ever became odd.
I increased the size of the memory paging tables. They were only using 6 block RAMs and there were a bunch remaining, so I doubled the size of the tables.
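The BNE hazard in C terms: with `off != limit` as the loop test, an index that ever steps past the limit loops forever, while an unsigned `<` still terminates. A sketch of a memsetW-style routine (not the actual code) using the safe form:

```c
#include <stddef.h>

/* memsetW-style fill that walks a byte offset by 2, as in the original
 * routine (nbytes assumed even). With `off != nbytes` as the loop test
 * (the BNE form), an offset that became odd would step past the limit
 * and never compare equal; `off < nbytes` (the BLTU form) terminates
 * even if the increment misbehaves. */
static void memsetw(unsigned char *dst, unsigned short val, size_t nbytes)
{
    for (size_t off = 0; off < nbytes; off += 2) {  /* BLTU-style test */
        dst[off]     = (unsigned char)(val & 0xFF); /* low byte */
        dst[off + 1] = (unsigned char)(val >> 8);   /* high byte */
    }
}
```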



Fri Dec 02, 2016 1:08 am
2016/12/01
Discovered the tools don’t handle blocking assignments correctly, at least not during simulation. Either that or I’m misunderstanding how they’re supposed to work: a blocking assignment under a clock was treated like a non-blocking assignment. On a hunch that it might affect synthesis as well, I moved all the combinational logic out from under the clock into a separate always block. Not as convenient, but it works in sim.
Discovered that multiply- and divide-immediate results were not being output to the result bus.



Sat Dec 03, 2016 11:15 am