 rfPhoenix 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
Could an addition be broken into three stages?

I think it could be. I have been thinking about adding more pipelines with different latencies. Several of the float instructions require only two or three clocks, but they are lumped in with the FMA. I did not worry too much about it because these are infrequently used operations. It would be a good place to add an adder broken into a couple of stages.
Right now, the FMA multiplier is on the critical timing path. So, I am trying out adding registers at its input, which adds a cycle of latency.

_________________
Robert Finch http://www.finitron.ca


Mon Sep 05, 2022 5:10 pm
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Making sure there were regs at both the input and output of the multiplier did the trick. It added a cycle of latency, but the multiplier is fast enough now to meet 50 MHz timing. I think it would be really tough to meet 60 MHz timing. The core may end up running at the display dot clock rate of 40 MHz for simplicity.

Had to re-write the register file to use generated block RAMs rather than having the tools infer them, which they could not do efficiently enough.

Changed the latency of the data cache and instruction cache fetches. Reduced it by one by removing a reg at the output. Normally the reg is in place to boost the fmax. But since the caches are not on the critical timing path, I figured I would try removing the regs.

The core seems to run simple instructions okay, but when it encountered a branch it began running the wrong instructions. It did successfully branch to the target address. I think this has to do with instructions not being removed from the branch shadow. So, I put some code in to help ensure all instructions after the branch in the pipeline are disabled.

With four threads and sixteen lanes of execution the core currently uses about 91% of the FPGA, and some functionality is still left out. The number of lanes may need to be reduced.

_________________
Robert Finch http://www.finitron.ca


Wed Sep 07, 2022 6:08 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Issues with a branch that has a cache miss at the target address and a cache miss for instructions in the branch shadow at the same time. I think I got these resolved. It is tricky because the core will run ahead on a cache miss pretending there was no miss. The hit/miss indicator is not available until a cycle after attempting to fetch an instruction. This means the IP has incremented a couple of times even though it missed, and it must be backed up to the miss address. At the same time the IP which is incrementing may have advanced into a new cache line. This can cause requests to load a new cache line for both the next sequential instruction and the target of a branch instruction.

The core is skipping instructions occasionally and I have yet to figure out why. I look in the fifo for thread one and see instructions progressing: fffd0000, fffd0005, fffd000a, fffd0014. Note the difference in address between the last two instructions is 10 instead of 5. This has me a bit mystified as there is only one place the IP is incremented, and it is always incremented by five. I checked the other thread fifos and they are not missing instructions, nor do they have extra instructions. I conclude that somehow the instruction is not being placed in the fifo. Writing the fifo is dead simple. It always writes to the fifo corresponding to the thread of the instruction just fetched. The thread to select for instruction fetch comes from a thread select module. The same thread selected for the fetch is selected to update the thread’s IP.
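A check like this can be automated rather than done by eye in the dumps. The following is only a sketch in Python, not part of the core; `find_gaps` and the trace list are hypothetical, and it assumes the third address above was meant to read fffd000a.

```python
# Hypothetical helper: scan the fetch addresses from one thread's fifo dump
# and report any step that is not the expected instruction size.
INSTR_BYTES = 5  # from the post: the IP is always incremented by five

def find_gaps(addresses, step=INSTR_BYTES):
    """Return (prev, curr, delta) for each consecutive pair whose delta != step."""
    gaps = []
    for prev, curr in zip(addresses, addresses[1:]):
        delta = curr - prev
        if delta != step:
            gaps.append((prev, curr, delta))
    return gaps

# Addresses quoted in the post for thread one.
trace = [0xFFFD0000, 0xFFFD0005, 0xFFFD000A, 0xFFFD0014]
for prev, curr, delta in find_gaps(trace):
    print(f"{prev:08x} -> {curr:08x}: step {delta}, expected {INSTR_BYTES}")
```

Running the same scan over every thread's dump would show quickly whether the missing instruction ended up in another fifo or was never written at all.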

I have been scrolling around in debug dumps.

_________________
Robert Finch http://www.finitron.ca


Thu Sep 08, 2022 6:19 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Had the Rb register of a conditional branch instruction treated as a target register by the scoreboard. This wreaked havoc on the core’s ability to issue instructions after the branch. Effectively the register was marked invalid and never marked valid again. So, instructions stopped issuing.

Added debug capability to the core. In theory it can raise an exception when an address matches one of the debug address registers.

Address generation for load / store operations is now performed in a separate clock cycle. Previously the address generation was just inline code. It needed to be broken out to allow the debug registers to be matched against generated addresses.
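The failure mode can be reproduced with a toy model. This is a sketch only: the `Scoreboard` class, the 64-register count and the register numbers are all made up for illustration and are not the core's actual scoreboard.

```python
class Scoreboard:
    """Toy scoreboard with one valid bit per register (hypothetical model)."""

    def __init__(self, nregs=64):  # register count is an assumption
        self.valid = [True] * nregs

    def can_issue(self, sources):
        # An instruction may issue only when all its source registers are valid.
        return all(self.valid[r] for r in sources)

    def issue(self, sources, target=None):
        if not self.can_issue(sources):
            return False
        if target is not None:
            self.valid[target] = False  # pending until writeback
        return True

    def writeback(self, target):
        self.valid[target] = True

# Buggy decode: the branch's Rb (r2 here) is recorded as a target.
sb = Scoreboard()
sb.issue(sources=[1, 2], target=2)
stalled = not sb.can_issue([2, 3])  # the branch never writes back r2

# Fixed decode: a conditional branch has no target register.
sb = Scoreboard()
sb.issue(sources=[1, 2], target=None)
ok = sb.can_issue([2, 3])
```

In the buggy case nothing ever calls `writeback(2)`, so every later instruction that reads r2 waits forever.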

Added a bunch of 16-bit float operations to the float module library.

_________________
Robert Finch http://www.finitron.ca


Fri Sep 09, 2022 3:58 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Modified the bus interface unit to support unaligned vector loads and stores. An unaligned vector op can require up to five bus cycles since the external bus is only 128 bits wide. Previously I had coded for two bus cycles to support unaligned loads and stores of non-vector data. Each of these bus cycles had its own set of states. This has been reduced to a single set of states which is called recursively. Each call shifts the data and select signals over according to the bus width.
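The count of five follows from the sizes involved. The sketch below is a Python model, not the bus interface unit itself; `bus_cycles` is a hypothetical name, and it assumes a vector of sixteen 32-bit lanes (64 bytes) on the 16-byte bus.

```python
BUS_BYTES = 16  # 128-bit external bus, from the post

def bus_cycles(addr, nbytes, bus_bytes=BUS_BYTES):
    """Split an access into per-cycle (aligned_address, byte_select) pairs."""
    cycles = []
    end = addr + nbytes
    beat = addr - (addr % bus_bytes)            # first aligned bus address
    while beat < end:
        lo = max(addr, beat) - beat             # first selected byte in this beat
        hi = min(end, beat + bus_bytes) - beat  # one past the last selected byte
        sel = ((1 << hi) - 1) & ~((1 << lo) - 1)
        cycles.append((beat, sel))
        beat += bus_bytes                       # shift over by one bus width
    return cycles

VECTOR_BYTES = 16 * 4  # assumed: sixteen lanes of 32 bits

aligned = bus_cycles(0x1000, VECTOR_BYTES)    # four full-width cycles
unaligned = bus_cycles(0x1004, VECTOR_BYTES)  # partial first and last cycles
for address, select in unaligned:
    print(f"{address:08x} sel={select:04x}")
```

Each pass through the loop plays the role of one call of the shared state sequence: the same steps run again with the address and select signals moved along by the bus width.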

The MemoryRequest and MemoryResponse types were about 80% identical. So, they were combined into a new type called MemoryArg_t.

Got rid of the independent privilege level stack and made the privilege level part of the status register. A new type was added to ease manipulation of status register bits. The status register is now 32 bits wide and has its own stack.

Found the core would sometimes write an invalid instruction to a fifo. The thread valid indicator was not being used to filter instructions.

_________________
Robert Finch http://www.finitron.ca


Sat Sep 10, 2022 1:30 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Altered the usage of the ‘m’ bit in instructions. It is now an additional register specification bit so 16 mask registers are supported instead of 8. Previously it indicated whether or not to use a mask register for a vector instruction. But the same thing can be done by reserving a vector mask register to contain all ones, and using that register if an unmasked operation is desired. To sum up, it makes more effective use of the bits in the instruction.

Missed timing for 50 MHz by 30 ps on one signal, after an implementation run that took six hours. The signal was the reset for a fifo. There is not much that can be done to improve the timing except to try an asynchronous reset instead of a synchronous one. I also put in an `ifdef IS_SIM to disable resetting the fifo contents for anything but simulation. For synthesis the contents of the fifo are not important, but for simulation it makes dumps easier to read if things have been reset.

Added address masking capability to the debug address match. Bits that are set in the mask register are excluded from the comparison, so they become ‘don’t care’ bits. It is now possible to match a range of addresses such as FFDxxxxxh.
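The equivalence is easy to see in a small model. This is a Python sketch with made-up names; it assumes masked-off lanes keep the old destination value, which the post does not state.

```python
def masked_add(a, b, dest, mask):
    """Per-lane add; lanes whose mask bit is 0 keep the old dest value (assumed)."""
    return [x + y if (mask >> i) & 1 else d
            for i, (x, y, d) in enumerate(zip(a, b, dest))]

LANES = 4                    # small lane count, just for illustration
ALL_ONES = (1 << LANES) - 1  # contents of the reserved all-ones mask register

a, b, dest = [1, 2, 3, 4], [10, 20, 30, 40], [0, 0, 0, 0]
unmasked = masked_add(a, b, dest, ALL_ONES)  # behaves as an unmasked add
partial = masked_add(a, b, dest, 0b0101)     # only lanes 0 and 2 update
```

With the all-ones register selected, every lane updates, so no separate "unmasked" encoding is needed.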
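A masked compare of this kind can be sketched as below. It is a Python model only; `address_match` is a hypothetical name, and the 32-bit address width and the polarity (a set mask bit means don't care) are assumptions based on the description above.

```python
ADDR_BITS = 32                   # assumed address width
ADDR_MASK = (1 << ADDR_BITS) - 1

def address_match(addr, match, mask):
    """True when addr equals match on every bit that is NOT set in mask."""
    differing = (addr ^ match) & ADDR_MASK  # bits where the two addresses differ
    return (differing & ~mask) == 0         # ignore the don't-care bits

# Match anything of the form FFDxxxxxh: the low 20 bits are don't care.
MATCH = 0xFFD00000
MASK = 0x000FFFFF

hit = address_match(0xFFD12345, MATCH, MASK)
miss = address_match(0xFFE00000, MATCH, MASK)
```

A mask of zero reduces this to an exact address match.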

Fixed up the instruction trace fifo. It should be close to working now. This required adding PEEKQ, POPQ, and RESETQ instructions. Instruction tracing should be enabled for only a single thread at a time as there is no way to tell which thread the trace address is for.

Suppressed writing instruction postfixes to the instruction fifo to remove them from the instruction stream. Did not have the almost_full signal for the instruction fifo set low enough, which sometimes caused instructions in the fifo to be overwritten.

_________________
Robert Finch http://www.finitron.ca


Tue Sep 13, 2022 4:58 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
More heavily pipelined the core. The execute buffer was acting as a bottleneck requiring several cycles. It has now been pipelined into several stages. According to the latest stat, which could be in error, the instructions per clock is up to 0.822 for a single thread. Quite a bit better than previously measured.

Considering adding branch prediction.

_________________
Robert Finch http://www.finitron.ca


Wed Sep 14, 2022 2:36 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Managed to eliminate a register read port by muxing two of the register specifiers. I had used a very simple approach to begin with of having a register read port for every possible read position in the instruction. But the branch and store instructions were the only instructions using the Rt field as a read port, so that was modified to use the Rc read port.
All the additional pipelining increased the size of the core. Decided to reduce the number of vector lanes to eight to compensate. There is lots of room in the FPGA now.

Modified the instruction cache to use even, odd line pairs to handle instructions crossing cache lines. This makes the instruction cache work the same way as the data cache and gives a consistent cache line size, which should make it easier to interface to an L2 cache.

Fetching data through the data cache is not working yet.

Found a bug in the instruction fifo. If a read was done when the fifo was empty the read pointer was advanced. This caused the empty signal to go false and reading of stale data to begin. Fortunately, this was easy to fix.
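The effect can be shown with a small software model. This is a Python sketch, not the fifo's HDL; the `Fifo` class and its `guard_empty_read` switch are hypothetical, and full detection is left out to keep it short.

```python
class Fifo:
    """Toy fifo with free-running read/write pointers (hypothetical model)."""

    def __init__(self, depth=8, guard_empty_read=True):
        self.mem = [None] * depth
        self.depth = depth
        self.rd = 0
        self.wr = 0
        self.guard = guard_empty_read

    @property
    def empty(self):
        return self.rd == self.wr

    def write(self, value):
        self.mem[self.wr % self.depth] = value
        self.wr += 1

    def read(self):
        if self.empty and self.guard:
            return None  # fixed behaviour: the read pointer does not move
        value = self.mem[self.rd % self.depth]
        self.rd += 1     # buggy path reaches here even when empty
        return value

buggy = Fifo(guard_empty_read=False)
buggy.read()                      # read while empty advances the pointer
buggy_looks_empty = buggy.empty   # now false, so stale entries get read

fixed = Fifo()
fixed_value = fixed.read()        # ignored, pointer stays put
fixed_looks_empty = fixed.empty
```

The fix amounts to qualifying the read pointer increment with the empty flag.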

_________________
Robert Finch http://www.finitron.ca


Thu Sep 15, 2022 5:27 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Added fifos between the end of the execution pipelines and the writeback stage. The issue to resolve was writeback conflicts due to multiple results entering writeback at the same time. To determine the order of updates a tag was added to pipeline entries. Whichever tag is the lowest gets updated first.
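The selection rule can be sketched as follows. This is a Python model with made-up names and tag values; it ignores tag wrap-around, which a real implementation would have to handle.

```python
from collections import deque

def select_writeback(heads):
    """heads: one (tag, result) tuple or None per pipeline output fifo.
    Returns the index of the fifo whose head has the lowest tag, or None."""
    candidates = [(entry[0], i) for i, entry in enumerate(heads) if entry is not None]
    if not candidates:
        return None
    return min(candidates)[1]

# Three hypothetical pipelines that finished out of order.
fifos = [
    deque([(3, "fma")]),
    deque([(1, "alu"), (4, "alu2")]),
    deque([(2, "load")]),
]

order = []
while any(fifos):  # an empty deque is falsy
    heads = [f[0] if f else None for f in fifos]
    winner = select_writeback(heads)
    order.append(fifos[winner].popleft())
```

Draining this way retires one result per step in tag order, whatever order the pipelines finished in.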

Features have been added and bits shaved in the opcode space to support them. The float-immediate mode instructions all share the same opcodes for different precisions. The precision is determined from the number of postfix immediates tacked onto the instruction.

Spent some time modifying the assembler to allow for instructions not being able to cross cache lines. It can now optionally output NOP instructions to place instructions so that an instruction and its postfixes do not cross a cache line. This is not really an issue for the current core however, as instructions and postfixes are allowed to cross.

_________________
Robert Finch http://www.finitron.ca


Sat Sep 17, 2022 3:47 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Just for kicks added 128-bit floating point support to the core. Some 128-bit integer ops, ADD, SUB, AND, OR, XOR, CMP are supported as well. 128-bit processing spans four 32-bit lanes. Vector registers are used to manipulate 128-bit values. Added an instruction ‘remask’ to modify the vector mask to account for differences in number of lanes between 32 and 128 bits. For instance, to store 128-bit vectors each mask bit must be expanded into four bits, since the store operation only stores 32-bit values and expects a mask for 32-bit lanes.
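The expansion described for the store case can be sketched like this. It is a Python illustration only; `expand_mask` is a hypothetical name and the real ‘remask’ instruction may do more than this.

```python
def expand_mask(mask, lanes, factor=4):
    """Expand each of the low `lanes` bits of mask into `factor` adjacent bits."""
    out = 0
    for i in range(lanes):
        if (mask >> i) & 1:
            out |= ((1 << factor) - 1) << (i * factor)
    return out

# Four 128-bit elements with elements 0 and 2 enabled become
# sixteen 32-bit lanes with lanes 0-3 and 8-11 enabled.
wide = expand_mask(0b0101, lanes=4)
print(f"{wide:016b}")
```

Going the other way, from a 32-bit lane mask to a 128-bit element mask, would need a policy for partly enabled groups, which is not covered here.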

Added 16-bit ops as well. There is more support for 16-bit ops than 128-bit ops.

I was going to implement 80-bit floating-point but thought better of it. 80-bit floats could get by with just two postfix instructions for constants. Decided to implement 128-bit instead. With 128-bit floats the immediate constant would require four postfix instructions to encode. That would make an instruction potentially 200 bits long (the 40-bit instruction plus four 40-bit postfixes). So, only two postfixes are supported, which allows constant formation up to 80 bits. To get a 128-bit constant the value must be loaded into a register using a three-instruction sequence.

Issues with the scoreboard tonight. Instructions were being issued before they were supposed to. The scoreboard is used to delay issue until registers for the instruction are available.

Changing the pipeline a little bit. There are parallel pipelines for each thread for part of the core. There was logic to select which pipeline to advance at each stage of the pipeline. This has been reduced to selecting at the start of the pipeline then allowing the remaining stages of the pipeline to run free.

_________________
Robert Finch http://www.finitron.ca


Mon Sep 19, 2022 4:28 am
Profile WWW

Joined: Mon Oct 07, 2019 2:41 am
Posts: 592
Do you plan to have densely packed decimal operations, or is that only with a special licence from IBM?


Mon Sep 19, 2022 4:34 am
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
Do you plan to have densely packed decimal operations, or is that only with a special licence from IBM?
No. rfPhoenix will not support decimal floating-point.
Although decimal float has some nice features it would be difficult to pipeline into the core.

Spent the last day or two developing mpmc10, version 10 of the multi-port memory controller. Version 10 of the controller uses the AXI4 bus protocol, and has a fifo with better pipelining than version 9. It also uses round-robin selection instead of fixed priority for channels.
The cache may also be updated by write cycles as opposed to simply invalidated. The cache is four-way associative and 32kB in size with six read ports.
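For reference, round-robin channel selection can be modelled in a few lines. This is a generic Python sketch, not the mpmc10 arbiter; the function name and the four-channel request pattern are made up.

```python
def round_robin(requests, last):
    """requests: one bool per channel; last: channel granted most recently.
    Returns the next requesting channel after `last`, or None if none request."""
    n = len(requests)
    for offset in range(1, n + 1):
        channel = (last + offset) % n
        if requests[channel]:
            return channel
    return None

# Channels 0, 2 and 3 request continuously.
requests = [True, False, True, True]
grants = []
last = 0
for _ in range(6):
    last = round_robin(requests, last)
    grants.append(last)
```

Under fixed priority channel 0 would win every time here, whereas the rotating start point gives each requesting channel a turn.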

Had to alter the cache line size for everything in the system to be 32B instead of 64B. The 64B size was consuming too many block RAMs.

_________________
Robert Finch http://www.finitron.ca


Thu Sep 22, 2022 3:23 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Discovered there were more block RAMs available than I thought so I made the cache deeper. The synthesizer could not handle a wider cache. It croaked trying to synthesize a 2k+ bit wide cache.

After some more experimentation, a lot of trial runs, about the best result is a 64kB system cache. It uses a handful more block RAM than I like. I was expecting 90 to 100 block RAMs and it uses 130.

Created a Wishbone version of mpmc10.

_________________
Robert Finch http://www.finitron.ca


Fri Sep 23, 2022 4:14 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
The day was spent getting the multi-port memory controller, mpmc10_wb, implemented and the system-on-chip and several peripherals updated to use it. The new version is larger than the old one, 15,000 LUTs versus 6,000. Started using structure variables to encapsulate buses. It is so much easier to pass around a structure var than having to type out all the signals.

_________________
Robert Finch http://www.finitron.ca


Sat Sep 24, 2022 4:29 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Converting the SoC to use structure variables for the Wishbone bus.
Modified the video components to support 40-bit 4-12-12-12 zrgb color. Previous support was 4-8-8-8. Four bits are dedicated to the video plane; the remaining 36 bits allow for 12 bits per color component. Of course, some components use fewer bits to conserve memory, but the colors are expanded out to 40 bits by padding with zeros. For instance, sprites are RGB555 but are mapped to ZRGB 4,12,12,12.
At 40 bpp three pixels fit into every 128 bits. The frame buffer fetches at least 128 bits at a time. The frame buffer can deal with colors from 1 to 12 bits per color component. At the lowest color depth six bits are used per pixel, meaning 21 pixels get fetched at once.
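The pixel counts and the color expansion can be checked with a short sketch. This is Python for illustration only; the function names are hypothetical, and the field order (z in the top four bits, then r, g, b) and padding on the low-order side are assumptions not stated in the post.

```python
FETCH_BITS = 128  # frame buffer fetch width, from the post

def pixels_per_fetch(bits_per_pixel, fetch_bits=FETCH_BITS):
    """Whole pixels that fit in one fetch; leftover bits are unused."""
    return fetch_bits // bits_per_pixel

def expand_rgb555(pixel, plane):
    """Map an RGB555 pixel to 40-bit ZRGB 4-12-12-12, zero padding each component."""
    r = (pixel >> 10) & 0x1F
    g = (pixel >> 5) & 0x1F
    b = pixel & 0x1F
    pad = 12 - 5  # assumed: pad zeros on the low-order side
    return ((plane & 0xF) << 36) | (r << pad << 24) | (g << pad << 12) | (b << pad)

print(pixels_per_fetch(40))  # 40 bpp
print(pixels_per_fetch(6))   # lowest color depth
print(f"{expand_rgb555(0x7FFF, plane=0xA):010x}")
```

At 40 bpp three pixels use 120 of the 128 bits, and at six bits per pixel 21 pixels use 126.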
Lots of mods to lots of components.

_________________
Robert Finch http://www.finitron.ca


Mon Sep 26, 2022 4:01 am
Profile WWW