 Thor Core / FT64 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Having fun with the FP sin / cosine module. It returns the same results as the $sin() / $cos() functions in Verilog for the first pi radians. After that it does not use the correct sign for values, and the values appear to be off. For the first pi radians values are within about 4 ulp of Verilog’s built-in functions. I attribute the difference to the use of more than 53 bits of precision in the new modules.

I spent a lot of time fiddling with signs and may just have to specify that the input argument must be between 0 and pi. IDK why it is not working, as the cordic core seems to be working fine. Part of the issue is the conversion of the input argument, which is an FP number, into a fixed-point value in the range 0 to 1. It is necessary to divide by two pi to scale things, and the FP number needs to be denormalized, with leading zeros inserted where necessary.
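The scaling step described above can be sketched in software. This is only a minimal illustration, not the actual module: it assumes the angle is divided by two pi to give a fixed-point fraction of a turn, that the top two bits of the fraction select the quadrant, and that a first-quadrant rotation-mode CORDIC does the rest. FRAC_BITS, ITERATIONS and all function names are hypothetical.

```python
import math

FRAC_BITS = 32      # hypothetical width of the fixed-point phase
ITERATIONS = 32     # hypothetical number of CORDIC iterations

# Precompute arctan(2^-i) and the CORDIC gain correction.
ATAN_TABLE = [math.atan(2.0 ** -i) for i in range(ITERATIONS)]
GAIN = 1.0
for i in range(ITERATIONS):
    GAIN /= math.sqrt(1.0 + 2.0 ** (-2 * i))


def to_phase(angle):
    """Scale an angle in radians to an unsigned fixed-point fraction of a turn."""
    turns = angle / (2.0 * math.pi)      # divide by two pi
    turns -= math.floor(turns)           # keep only the fractional part
    return int(turns * (1 << FRAC_BITS)) & ((1 << FRAC_BITS) - 1)


def cordic_first_quadrant(theta):
    """Rotation-mode CORDIC for theta in [0, pi/2]; returns (cos, sin)."""
    x, y, z = GAIN, 0.0, theta
    for i in range(ITERATIONS):
        d = 1.0 if z >= 0.0 else -1.0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ATAN_TABLE[i]
    return x, y


def sin_cos(angle):
    """Return (sin, cos) of angle, fixing up signs by quadrant."""
    phase = to_phase(angle)
    quadrant = phase >> (FRAC_BITS - 2)                 # top two bits
    offset = phase & ((1 << (FRAC_BITS - 2)) - 1)       # position within the quadrant
    theta = offset / (1 << FRAC_BITS) * 2.0 * math.pi   # back to radians, [0, pi/2)
    c, s = cordic_first_quadrant(theta)
    if quadrant == 0:
        return s, c
    if quadrant == 1:
        return c, -s
    if quadrant == 2:
        return -s, -c
    return -c, s
```

Handling the sign in one place, from the quadrant bits, keeps the CORDIC itself restricted to the range where it converges.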

_________________
Robert Finch http://www.finitron.ca


Sat Jun 03, 2023 2:23 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Now working out to 2 * pi!
It needs some more bits calculated at the low end, but the accuracy is already much greater than single precision.

_________________
Robert Finch http://www.finitron.ca


Sat Jun 03, 2023 3:19 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Added an FPU and micro-code access. The FPU is based around an FMA module and the sin/cosine module. It is not fully implemented yet, but I wanted to get some idea of the size. The ALU / FPU turns out to be one of the smaller components in the design.

Micro-code is treated as a branch to the micro-code address. It is interesting because every other micro-code instruction fetched is forced to be a NOP. The first instruction fetched is the micro-code, then the second is set to a NOP. This allows the micro-code table to be single-ported and avoids issues updating the PC. Since most of the micro-code is currently loads and stores, the extra NOPs should not affect performance very much.
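A rough model of that fetch pattern, assuming fetch slots alternate between a real micro-code instruction and a forced NOP. The table contents and the name fetch_microcode are hypothetical and only illustrate why a single read port is enough.

```python
NOP = "NOP"


def fetch_microcode(table, start, count):
    """Return `count` fetch slots starting at micro-code address `start`.

    Even slots carry a real micro-code instruction; odd slots are forced to
    NOP, so the table is read at most once every other slot.
    """
    stream = []
    addr = start
    for slot in range(count):
        if slot % 2 == 0:
            stream.append(table[addr])   # the only read of the table
            addr += 1
        else:
            stream.append(NOP)           # no table access in this slot
    return stream


# Hypothetical three-entry routine of loads and stores.
print(fetch_microcode(["LOAD", "STORE", "LOAD"], 0, 6))
```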

_________________
Robert Finch http://www.finitron.ca


Sun Jun 04, 2023 8:35 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Removed the FDIV function from the FPU and micro-coded it instead. It is about the same number of clock cycles to micro-code it, but it is less hardware. It also has the advantage of being interruptible. Will be micro-coding the square root function too at some point.

Got side-tracked into thinking I could increase the accuracy of sine / cosine by calculating using long doubles. It turns out that long doubles are not supported in MSVC; it treats them as doubles, so there was no accuracy to be gained.

Also spent some time trying to track down why the vendor’s cordic core was outputting a constant in the third quadrant.

_________________
Robert Finch http://www.finitron.ca


Mon Jun 05, 2023 3:48 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Ugly. I should have provided a precision field in the instruction encoding for float operations. My original thought was to perform all operations at quad precision and provide instructions only to switch precisions to and from quad. But now that I see how costly some operations are at quad precision, they are being implemented at double precision instead. There are enough opcodes left in the instruction set to support different precisions, except that they are spaced out and would not be grouped together. Looks like I am shuffling opcodes again. I see I can make a horizontal row in the table, which would let the two LSBs of the opcode represent precision.
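A small sketch of what putting the precision in the two LSBs of the opcode could look like. The opcode values, the operation names and the order of the precision codes are all hypothetical; the post does not give the actual assignment.

```python
PRECISIONS = ("half", "single", "double", "quad")   # hypothetical assignment

# Hypothetical base opcodes, each aligned to a multiple of four so the two
# low bits are free to carry the precision.
FLOAT_OPS = {0x40: "FMA", 0x44: "FADD", 0x48: "FMUL"}


def decode_float_opcode(opcode):
    """Split an opcode into (operation name, precision name)."""
    base = opcode & ~0x3
    precision = PRECISIONS[opcode & 0x3]
    return FLOAT_OPS[base], precision


def encode_float_opcode(name, precision):
    """Combine an operation and a precision back into one opcode."""
    base = next(code for code, op in FLOAT_OPS.items() if op == name)
    return base | PRECISIONS.index(precision)


print(decode_float_opcode(0x46))
print(hex(encode_float_opcode("FMUL", "quad")))
```

With a layout like this the decoder can pick the operation from the upper bits and the precision from the lower two without a lookup table per precision.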

Only quad precision decimal-float ops will be provided though. If one is choosing to use decimal-float it is likely because of the decimal precision and obviously not performance.

Shuffling Thor opcodes again because of the lack of a precision field for float operations.

Ran some more simulations, getting a little farther all the time. Going to try it in the FPGA to see if the LEDs light up like they should.

Different precisions for reciprocal estimate and reciprocal square root estimate have been micro-coded.

Testing the reciprocal estimate, I was surprised to find that for denormal results only, they were way off, sometimes by as much as 70%. Other results are usually off by only a fraction of a percentage point, for instance 0.28% or less. I managed to fix the denormal numbers, so they are now off by only a fraction of a percent too. The estimate is good to eight bits in only two clock cycles.
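For context, an estimate good to about eight bits is commonly used as the seed for Newton-Raphson refinement, where each step roughly doubles the number of correct bits. The sketch below is an assumption about how such an estimate might be used, not a description of the Thor micro-code; reciprocal_estimate is a software stand-in for the hardware estimate and only handles positive, normal inputs.

```python
import math


def reciprocal_estimate(d):
    """Roughly eight-bit estimate of 1/d for positive, normal d.

    A stand-in for the hardware estimate, not the author's method.
    """
    m, e = math.frexp(d)                 # d = m * 2**e, 0.5 <= m < 1
    r = round((1.0 / m) * 256) / 256     # quantise 1/m to a step of 1/256
    return math.ldexp(r, -e)


def refine(d, x, steps):
    """Newton-Raphson: x <- x * (2 - d * x), repeated `steps` times."""
    for _ in range(steps):
        x = x * (2.0 - d * x)
    return x


def relative_error(d, x):
    """Relative error of x as an approximation of 1/d."""
    return abs(x * d - 1.0)


d = 3.0
x0 = reciprocal_estimate(d)
for steps in range(4):
    print(steps, relative_error(d, refine(d, x0, steps)))
```

Each refinement step is two multiplies and a subtract, which maps naturally onto an FMA unit.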

_________________
Robert Finch http://www.finitron.ca


Thu Jun 08, 2023 3:00 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 596
Do the modern floating point standards support 48 and 72 bit floating point?


Thu Jun 08, 2023 7:17 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
Do the modern floating point standards support 48 and 72 bit floating point?
I believe they do, but it is somewhat unusual to have 48-bit or 72-bit floating point, and de facto standards do not really support them. There may be issues converting to and from the usual 32/64/128-bit floats. I think the IEEE standard is carefully worded. IIRC a single-precision number is at least 32 bits, but it could be more, so that would include a 48-bit float. I think the same is true for double precision: it is at least 64 bits, so that could be 72 or 80. I suspect most modern languages use 16/32/64/128-bit floats though. If a float var is called a double, I do not think it can be counted on to have more than 64 bits. Software may be accurate only to 64 bits even if hardware supports more.

_________________
Robert Finch http://www.finitron.ca


Fri Jun 09, 2023 4:11 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Switched the color output of the system from 40-bit to 32-bit RGB10-10-10. RGB10-10-10 is a standard format.

Added in a return stack buffer, RSB, which is tricky to do. It must keep track of whether calls and returns were committed or stomped on. At the fetch stage a call pushes the return address on an internal stack, and a return pops a return address from an internal stack. At the commit stage a call that turned out to be not executed causes the stack to be popped to keep it in sync with returns. Both fetch and commit stages are involved to handle the case where a return occurs before a call commits. If the called function is short enough this might happen. The goal is to get it to work most of the time.
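A much simplified model of the fetch/commit interaction described above. It assumes the stomped call is still the top entry when its commit-stage outcome is known, which is exactly the case that can fail when a return is fetched before the call commits. The class name, method names and depth are hypothetical.

```python
class ReturnStackBuffer:
    def __init__(self, depth=16):           # depth is a hypothetical parameter
        self.depth = depth
        self.stack = []

    def fetch_call(self, return_address):
        """Fetch stage: a call pushes its return address."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)                # drop the oldest entry when full
        self.stack.append(return_address)

    def fetch_return(self):
        """Fetch stage: a return pops a predicted target (None if empty)."""
        return self.stack.pop() if self.stack else None

    def commit_call(self, executed):
        """Commit stage: a call that was not executed is popped again."""
        if not executed and self.stack:
            self.stack.pop()


rsb = ReturnStackBuffer()
rsb.fetch_call(0x100)            # real call
rsb.fetch_call(0x200)            # call on a mispredicted path
rsb.commit_call(executed=False)  # stomped: undo its push
print(hex(rsb.fetch_return()))   # predicts the return to 0x100
```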

Wondering why simulation is hanging on a store to the text display. I forgot I changed the text display address, so the store was happening to an invalid address. The bus error exception is not wired up yet, so the core just hung.

With the address fixed, the store still hangs because it is trying to store to the incorrect address. I have not yet figured out how the core is arriving at the store address. It looks correct in the simulation dump up to the point it is spit out to the system.

_________________
Robert Finch http://www.finitron.ca


Fri Jun 09, 2023 4:13 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
The store turned out to be storing to the correct address, just for an instruction that should not have been executed.

Stuck on a pipeline bug for a while that turned out to be an issue in the data cache controller instead. It looked like store instructions were not being processed properly. The data cache controller must break up 512-bit cache line requests into 128-bit units for bus transactions. Every request can result in four bus transactions. But the bus transactions are not showing up in the way they should be.
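The line-to-bus split can be pictured as follows. This is only a sketch: it treats the 512-bit line as an integer and assumes the lowest-addressed 128 bits go out first, which the post does not state. The function names are hypothetical.

```python
LINE_BITS = 512
BUS_BITS = 128


def split_line(address, line):
    """Split a 512-bit line (as an int) into four (address, 128-bit data) beats."""
    mask = (1 << BUS_BITS) - 1
    beats = []
    for i in range(LINE_BITS // BUS_BITS):
        data = (line >> (i * BUS_BITS)) & mask
        beats.append((address + i * BUS_BITS // 8, data))   # 16 bytes per beat
    return beats


def join_line(beats):
    """Reassemble the line from its beats, in the same order."""
    line = 0
    for i, (_, data) in enumerate(beats):
        line |= data << (i * BUS_BITS)
    return line


line = int.from_bytes(bytes(range(64)), "little")
for addr, data in split_line(0x1000, line):
    print(hex(addr), hex(data))
```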

_________________
Robert Finch http://www.finitron.ca


Sat Jun 10, 2023 5:54 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Modified the i-cache to remove the latency. This helps with the pipelining. Not sure what effect it will have on the timing though. With the existing pipeline, the extra latency in the cache added a clock cycle and the fetch of two instructions that end up wasted when a branch occurs. Branch prediction was delayed a clock cycle, lowering performance. It may be possible to have the same performance at a lower clock rate with a simpler pipeline.

Sometimes memory operations are being repeated. I am sure this is not a master pipeline issue, but an issue with the memory pipeline or data cache controller. It was occasionally missing memory operations and I “fixed” it. Now it occasionally repeats operations. It is supposed to repeat operations only if the bus is not available or if there is an error or retry response.

While it is running in simulation it does not run in the FPGA. Time to add the logic analyzer in to see what is going on.

_________________
Robert Finch http://www.finitron.ca


Sun Jun 11, 2023 11:30 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Got the duplicate or missing memory operations issue resolved.

Still no luck running in the FPGA. According to the logic analyzer, none of the bus requests to update the LEDs or screen display are making it through to the system from the CPU. The CPU is successfully reading the instruction memory though, and executing instructions, but it hangs waiting for a response to IO. It at least seems to come out of reset properly and load the i-cache.

_________________
Robert Finch http://www.finitron.ca


Tue Jun 13, 2023 5:37 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Well, it's running in sim but really slowly: 153 clock cycles to execute a nine-instruction loop in Fibonacci. That includes two stores, a load and six other operations. The load and stores are taking 40 clocks each! I have not figured out why so many clocks yet. Looking at the simulation waveforms, the round-trip time between a request and a response is only eight clock cycles. Something in a memory controller somewhere is not working properly. Subtracting out the 120 memory clock cycles, it is still 33 clocks for six instructions, or about 5.5 clocks per instruction. That is an IPC of only 0.18. Some work to do yet.
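The arithmetic behind those figures, written out using only the numbers quoted in the post:

```python
loop_clocks = 153            # clocks for one pass of the nine-instruction loop
memory_ops = 3               # two stores and a load
clocks_per_memory_op = 40
other_instructions = 6

remaining = loop_clocks - memory_ops * clocks_per_memory_op   # 153 - 120 = 33
cpi = remaining / other_instructions                          # 33 / 6 = 5.5
ipc = 1 / cpi                                                 # about 0.18

print(remaining, cpi, round(ipc, 2))
```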

I modified the commit logic so that it can commit three instructions per clock if the third instruction does not have a target register. I also did a trial run of the core using 128-bit bundles with three 40-bit instructions in the bundle. The idea was to replace the byte aligner with a 40-bit aligner, hopefully decreasing the amount of logic and improving timing. It turns out to add complexity in the assembler and in the FPGA logic, as multiplies by multiples of five were then required (by 120 to select a bundle from the cache line and by 40 to select the word). So it could work: the shifter has fewer logic levels, but that is offset by more logic to determine the shift amount. It also complicates the PC increment and makes PIC code more complex. It is left as a config option now.

_________________
Robert Finch http://www.finitron.ca


Wed Jun 14, 2023 3:55 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I cleaned up the code some in the data cache controller, which is what accesses the system bus. The count is down to a much more respectable 63 clocks for the nine-instruction loop now. There was a random wait in the bus access that added 6 to 9 clocks to every bus access, so I reduced that to 0 to 3 additional clocks. The bus access timing is varied a bit to try to avoid collisions when multiple cores are accessing the bus. If all the cores had exactly the same timing it would virtually guarantee a collision. I counted the clocks for a store operation and it is now 31. Since there are three memory ops, it looks like all the time is taken by the memory ops and the other instructions are hidden. Average IPC including memory ops = 0.14.
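A toy model of why the access timing is varied. It assumes every core wants the bus at the same moment and adds a random 0 to max_extra clocks to each request; a round counts as a collision when two or more cores arrive first together. The function name, the seed and the two-core setup are hypothetical and say nothing about the real arbitration.

```python
import random


def count_collisions(rounds, cores, max_extra, seed=0):
    """Count rounds in which two or more cores reach the bus together."""
    rng = random.Random(seed)
    collisions = 0
    for _ in range(rounds):
        delays = [rng.randint(0, max_extra) for _ in range(cores)]
        first = min(delays)
        if delays.count(first) > 1:
            collisions += 1
    return collisions


print(count_collisions(1000, cores=2, max_extra=0))   # identical timing: always collides
print(count_collisions(1000, cores=2, max_extra=3))   # varied timing: far fewer
```

The 9/63 = 0.14 IPC figure in the post is the whole loop divided by the whole clock count, so the extra wait clocks show up directly in it.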

_________________
Robert Finch http://www.finitron.ca


Wed Jun 14, 2023 6:13 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Milestone: first micro-code run. The reset micro-routine was run that fetches the stack pointer and the program counter from the last two hexi-bytes of memory. One minor hiccup was failing to increment the PC address to the next instruction after micro-code is run. This led to a loop executing the ENTER micro-code. Another hiccup was setting the macro-instruction to ‘done’ automatically when it is queued as there is no functional unit associated with macro-instructions. The macro instruction gets inserted into the instruction queue even though it is just a placeholder for the micro-code routine.

Broke the TLB up into more, smaller modules. Simulation was having trouble with proper signal timing. Even though things are pipelined with structures, sometimes signals are off by one clock cycle. This has led me to think about modifying the TLB so that addresses are translated more independently, with empty space between the translations.

_________________
Robert Finch http://www.finitron.ca


Thu Jun 15, 2023 9:20 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Modified the TLB to use FIFOs for the channel inputs. Originally it was just a multiplexor with a feedback signal indicating busy. What was really needed was FIFOs.
Had to test the TLB state to ensure it was ready to run before trying to perform translations. The FIFOs require a couple of clock cycles for reset, and loading the TLB RAM with default info also requires about 128 clocks.
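A behavioural sketch of that arrangement: one FIFO per input channel and a ready flag that stays low until reset and RAM initialisation are done. It assumes the two delays run back to back, uses the clock counts quoted in the post, and invents the channel count, the class name and the fixed-priority selection.

```python
from collections import deque


class TlbFrontEnd:
    FIFO_RESET_CLOCKS = 2      # "a couple of clock cycles" in the post
    RAM_INIT_CLOCKS = 128      # "about 128 clocks" in the post

    def __init__(self, channels=2):          # channel count is hypothetical
        self.fifos = [deque() for _ in range(channels)]
        self.clock = 0

    @property
    def ready(self):
        return self.clock >= self.FIFO_RESET_CLOCKS + self.RAM_INIT_CLOCKS

    def tick(self):
        self.clock += 1

    def request(self, channel, vaddr):
        """Queue a request; refused until reset and RAM init are finished."""
        if not self.ready:
            return False
        self.fifos[channel].append(vaddr)
        return True

    def next_request(self):
        """Take the oldest request from the lowest-numbered non-empty channel."""
        for channel, fifo in enumerate(self.fifos):
            if fifo:
                return channel, fifo.popleft()
        return None


tlb = TlbFrontEnd()
print(tlb.request(0, 0x1000))   # False: still initialising
for _ in range(130):
    tlb.tick()
print(tlb.request(0, 0x1000))   # True: accepted once ready
```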

_________________
Robert Finch http://www.finitron.ca


Fri Jun 16, 2023 8:32 am