


 [ 261 posts ]  Go to page Previous  1 ... 14, 15, 16, 17, 18
 Qupls (Q+) 
Author Message

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2447
Location: Canada
Re-wrote the functions in the memory scheduler as separate modules. The memory scheduler keeps getting elided from the design, and I have not figured out why yet. I had hoped that breaking it up into smaller modules would help isolate the issue.

Finally got to the first simulation. The memory scheduler shows up in simulation. The simulator does not cut it out.

_________________
Robert Finch http://www.finitron.ca


Tue Dec 30, 2025 6:32 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2447
Location: Canada
Finally pulled the load / store queue out of the Qupls4 mainline into its own module. Dunno if I did it “the right way” but I bolted a command interface onto the LSQ. It can process up to 10 commands in parallel. I am not sure that is enough parallel commands. There could be several branches wanting to invalidate the LSQ while ROB entries that are done also want to invalidate LSQ entries, all in the same clock cycle.
Commands are: invalidate, enqueue, set address, set data, increment address.
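
A minimal behavioural sketch of such a command interface, in Python rather than the HDL the core is written in. Only the five command names and the limit of 10 commands per clock come from the post. The `Command` fields, the `Entry` layout, the queue depth, and the rule that commands apply in list order are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Optional

MAX_CMDS_PER_CYCLE = 10  # from the post; everything else below is assumed


class Cmd(Enum):
    INVALIDATE = auto()
    ENQUEUE = auto()
    SET_ADDRESS = auto()
    SET_DATA = auto()
    INCREMENT_ADDRESS = auto()


@dataclass
class Entry:
    valid: bool = False
    address: Optional[int] = None
    data: Optional[int] = None


@dataclass
class Command:
    op: Cmd
    index: int      # hypothetical: which LSQ slot the command targets
    value: int = 0  # hypothetical: address, data, or increment amount


class LoadStoreQueue:
    def __init__(self, depth: int = 8) -> None:
        self.entries = [Entry() for _ in range(depth)]

    def clock(self, commands: List[Command]) -> List[Command]:
        """Apply up to MAX_CMDS_PER_CYCLE commands; return the ones left over."""
        accepted = commands[:MAX_CMDS_PER_CYCLE]
        deferred = commands[MAX_CMDS_PER_CYCLE:]
        for c in accepted:  # assumed priority: list order
            e = self.entries[c.index]
            if c.op is Cmd.ENQUEUE:
                self.entries[c.index] = Entry(valid=True)
            elif c.op is Cmd.INVALIDATE:
                e.valid = False
            elif c.op is Cmd.SET_ADDRESS:
                e.address = c.value
            elif c.op is Cmd.SET_DATA:
                e.data = c.value
            elif c.op is Cmd.INCREMENT_ADDRESS:
                e.address = (e.address or 0) + c.value
        return deferred


# Twelve invalidates in one clock: two of them have to wait for the next cycle.
lsq = LoadStoreQueue(depth=4)
burst = [Command(Cmd.INVALIDATE, i % 4) for i in range(12)]
leftover = lsq.clock(burst)
```

The `leftover` list is the case the post worries about: when branches and retiring ROB entries together produce more than 10 commands in a cycle, something has to stall or be queued.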

_________________
Robert Finch http://www.finitron.ca


Wed Dec 31, 2025 7:20 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2447
Location: Canada
Tonight’s quandary: getting read data to the reservation stations at reasonable speed.

Multiplexing register tags from the issue queues in the reservation stations onto a four-wide bus for register read requests came to 289 logic levels. I had intended to register the outputs but forgot. Registering them moved the timing-critical path elsewhere, but it looks like a few more registers are still required.
I am guessing so many logic levels are required because the multiplexers are built out of cascaded LUTs. Discrete logic could probably do better.

I am hoping to get 40 MHz out of the core, which should give it roughly the same performance as (or better than) an 80 MHz in-order design.

I have not figured out why some modules are being removed from the design by synthesis. But I have found what seem to be minor flaws causing some modules to be removed. Most of the design is present now.
The 6551 UART was being eliminated, but I found that the state machine was advancing too quickly, not allowing output registers to be set, so they were always at zero when the state changed. The tools picked up on the fact and simply removed the component from the design. This was the result of changes made to support two different bus protocols.

Found out the read port select logic was way too slow (291 logic levels). The logic dynamically selects ports for reading. It was packing the port selects into the minimum number of read ports, taking only active ports into account, which takes a ton of multiplexers. Now it is coded differently as shown in the diagram below.
Attachment:
Qupls4_read_port_selector.png
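
For readers who cannot see the diagram, here is a rough Python model of the older "packing" approach described above, not of the new selector. It is one plausible reading of the description: the function name, the lowest-index-first priority, and the stall list are assumptions. Only the four read ports come from the thread.

```python
from typing import List, Optional, Tuple

NUM_READ_PORTS = 4  # the four-wide read bus mentioned in the thread


def pack_requests(
    valid: List[bool], tags: List[int]
) -> Tuple[List[Optional[int]], List[int]]:
    """Assign the lowest-numbered active requesters to read ports 0..3.

    Returns (port_tags, stalled): port_tags[p] is the register tag driven
    on port p (None if the port is idle), and stalled lists the indexes of
    requesters that did not get a port this cycle.
    """
    port_tags: List[Optional[int]] = [None] * NUM_READ_PORTS
    stalled: List[int] = []
    used = 0  # running count of active requesters seen so far
    for i, (v, tag) in enumerate(zip(valid, tags)):
        if not v:
            continue
        if used < NUM_READ_PORTS:
            port_tags[used] = tag
            used += 1
        else:
            stalled.append(i)
    return port_tags, stalled


# Seven requesters, five active: four get ports, requester 6 stalls.
ports, stalled = pack_requests(
    [False, True, False, True, True, True, True],
    [10, 11, 12, 13, 14, 15, 16],
)
```

In this model the port a requester lands on depends on how many earlier requesters are active, so in hardware every port needs a multiplexer across all requesters plus a prefix count to steer it. That is consistent with the logic depth reported above.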

After a few minor adjustments the timing is up to 37 MHz. It may need to run under 40 MHz as I cannot see a way to improve the timing. The critical path is now in instruction dispatch, which basically copies values from a pipeline register into another pipeline register feeding the reservation stations.



_________________
Robert Finch http://www.finitron.ca


Thu Jan 01, 2026 2:31 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1867
wow, that's quite the logic depth! what's the depth down to now?


Thu Jan 01, 2026 2:13 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2447
Location: Canada
Quote:
wow, that's quite the logic depth! what's the depth down to now?
I think I have got the depth down to around 100 logic levels now. I am not sure how the tools calculate logic level depth. I am assuming a lookup table counts as multiple levels of logic. I seem to recall hearing that a superscalar design is somewhere around 20 logic levels. The FPGA implementation sometimes needs to cascade logic where, I suspect, a custom discrete logic design would not.

The tools' timing report is telling me it should work up to 46 MHz now, which is good as 40 MHz is desired (it is the video dot clock rate, which is handy).

Got the Qupls4 Arpl compiler working. It at least generates code that looks like it should work.

There was a nasty issue with a delete in a doubly-linked list that did not work properly. I have yet to figure out why. The delete causes push and only push instructions to be removed from the code. The delete works just fine around other instructions. I think it may be some sort of weird memory dependency having to do with pointer aliasing. I am just guessing. It is “fixed” at the moment by not doing a delete, and instead putting a special NOP opcode in the place of the deleted instruction. When the output routine sees this it just does not output anything. The effect is that the output code is right, but there is an extra linkage in the code list.
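
The workaround can be sketched as follows, in Python rather than the compiler's own language. The class names, the `OP_DELETED` marker, and the sample instructions are hypothetical. The sketch only shows the two strategies side by side and does not try to explain the original bug.

```python
from typing import Iterator, List, Optional

OP_DELETED = "<deleted>"  # hypothetical stand-in for the special NOP opcode


class Insn:
    def __init__(self, opcode: str, operands: str = "") -> None:
        self.opcode = opcode
        self.operands = operands
        self.prev: Optional["Insn"] = None
        self.next: Optional["Insn"] = None


class CodeList:
    def __init__(self) -> None:
        self.head: Optional[Insn] = None
        self.tail: Optional[Insn] = None

    def append(self, opcode: str, operands: str = "") -> Insn:
        node = Insn(opcode, operands)
        node.prev = self.tail
        if self.tail is not None:
            self.tail.next = node
        else:
            self.head = node
        self.tail = node
        return node

    def unlink(self, node: Insn) -> None:
        """Conventional delete: patch both neighbours, then head/tail."""
        if node.prev is not None:
            node.prev.next = node.next
        else:
            self.head = node.next
        if node.next is not None:
            node.next.prev = node.prev
        else:
            self.tail = node.prev
        node.prev = node.next = None

    def mark_deleted(self, node: Insn) -> None:
        """Workaround: leave the node linked, overwrite its opcode."""
        node.opcode = OP_DELETED

    def __iter__(self) -> Iterator[Insn]:
        node = self.head
        while node is not None:
            yield node
            node = node.next

    def emit(self) -> List[str]:
        """Output routine: skip tombstones, print everything else."""
        return [
            f"{n.opcode} {n.operands}".strip()
            for n in self
            if n.opcode != OP_DELETED
        ]


code = CodeList()
p1 = code.append("push", "r1")
code.append("add", "r3,r1,r2")
p2 = code.append("push", "r2")
code.append("ret")
code.mark_deleted(p1)
code.mark_deleted(p2)
listing = code.emit()  # the two pushes are gone from the output only
```

After `mark_deleted` the list still holds four nodes while `emit()` produces two lines, which is the "extra linkage" described above.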

Did some work on the assembler too. The assembler will progress along more slowly than the compiler.

All this work while waiting for synthesis.

Scrapped the store immediate instruction. It had too much overlap with ordinary stores with constant postfixes applied. The only difference is that store immediate would allow a four bit constant field in the instruction to be stored. This would be handy for storing zeros for instance. But the same thing can be done with a postfix instruction, except that it takes up more room in the program.

_________________
Robert Finch http://www.finitron.ca


Fri Jan 02, 2026 4:32 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2447
Location: Canada
Learned a new trick the other day reading comp.arch newsposts. Finally got around to implementing it. Using the move instruction as a renamer command. MOVE does not need to do anything other than assign the source register tag to the destination register. There is no need for processing on a move instruction. No new tag is assigned. MOVE is subsequently treated as a NOP. NOPs get removed from the pipeline by the dispatcher.
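
A small Python model of the idea, assuming a simple map-table renamer with a free list. The `Renamer` class, the register counts, and the tuple it returns are illustrative and not taken from the Qupls4 sources.

```python
from typing import List, Tuple


class Renamer:
    def __init__(self, num_arch_regs: int = 64, num_phys_regs: int = 128) -> None:
        # Assumed reset state: architectural register i maps to tag i.
        self.map: List[int] = list(range(num_arch_regs))
        self.free: List[int] = list(range(num_arch_regs, num_phys_regs))

    def rename(self, opcode: str, dst: int, srcs: List[int]) -> Tuple[str, int, List[int]]:
        """Return (opcode to dispatch, destination tag, source tags)."""
        src_tags = [self.map[s] for s in srcs]
        if opcode == "move":
            # MOVE: point dst at the source's tag; no new tag is allocated
            # and the instruction continues down the pipeline as a NOP.
            self.map[dst] = src_tags[0]
            return ("nop", src_tags[0], src_tags)
        new_tag = self.free.pop(0)
        self.map[dst] = new_tag
        return (opcode, new_tag, src_tags)


r = Renamer(num_arch_regs=8, num_phys_regs=16)
add_out = r.rename("add", 1, [2, 3])   # r1 gets a fresh tag
move_out = r.rename("move", 4, [1])    # r4 now shares r1's tag
sub_out = r.rename("sub", 5, [4, 1])   # both sources read the same tag
```

One thing the model leaves out: once two architectural registers share a tag, that tag cannot go back on the free list until both mappings have been overwritten, so a real renamer needs some way to track sharing (reference counts are a common choice).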

Freed up two opcodes used for IP relative addressing. Turns out they were not needed as the IP can be specified as register #62 in ordinary load / store instructions. Changed the code for a zero register to register #63. This makes the r0 register completely general-purpose, so it works the same as any other GPR.
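
The register numbering can be illustrated with a short Python sketch. The register numbers 62 and 63 come from the post. The helper names, the flat `gprs` list, and the choice of which IP value is presented to the address adder are assumptions.

```python
from typing import List

REG_IP = 62    # from the post: the IP is addressable as register 62
REG_ZERO = 63  # from the post: register 63 is the zero register


def read_reg(gprs: List[int], n: int, ip: int) -> int:
    """Read a register specifier; r0..r61 are ordinary GPRs."""
    if n == REG_ZERO:
        return 0
    if n == REG_IP:
        return ip  # assumed: whatever IP value the hardware supplies
    return gprs[n]


def effective_address(gprs: List[int], base: int, displacement: int, ip: int) -> int:
    """Base + displacement address for an ordinary load / store."""
    return read_reg(gprs, base, ip) + displacement


gprs = [0] * 62
gprs[0] = 0x1000  # r0 holds a real value, like any other GPR
ea_r0 = effective_address(gprs, 0, 8, ip=0x400)            # 0x1008
ea_ip = effective_address(gprs, REG_IP, 0x20, ip=0x400)    # IP relative
ea_abs = effective_address(gprs, REG_ZERO, 0x20, ip=0x400) # absolute
```

With the IP reachable as a base register, the same load / store encoding covers register-relative, IP-relative, and absolute addressing, which is why the dedicated opcodes became redundant.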

Sacrificed a bit of branch displacement, repurposed to indicate increment or decrement of the tested register. It only works for specific branches: iblt, ible, ibltu, ibleu, dbne and dbnez. 22 bits is still plenty of displacement bits for conditional branches.

Did some work on Qupls version 5. Comparing:

Qupls4 (48-bit inst.)
Fibonacci: 21 instructions, 126 bytes
Serial driver: 260 instructions, 1622 bytes
Xmodem: 177 instructions, 1100 bytes

Qupls5 (32-bit inst.)
Fibonacci: 22 instructions, 100 bytes
Serial driver: 277 instructions, 1224 bytes
Xmodem: 186 instructions, 832 bytes

While Qupls4 instructions are wider by 50%, the code density is only about 34% worse. The difference is made up by the lower instruction count: Qupls4 uses about 3% fewer instructions, meaning code may execute slightly faster.
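
The per-program ratios can be recomputed from the figures listed above with a few lines of Python. The function and variable names are made up for the example. Taken over these three programs alone, the numbers come out at roughly 26% to 33% more bytes and roughly 4% to 6% fewer instructions for Qupls4, in the same neighbourhood as the percentages quoted, which may have been worked out on a different basis.

```python
from typing import Dict, Tuple

# (instructions, bytes) for Qupls4 and Qupls5, copied from the post.
PROGRAMS: Dict[str, Tuple[Tuple[int, int], Tuple[int, int]]] = {
    "Fibonacci": ((21, 126), (22, 100)),
    "Serial driver": ((260, 1622), (277, 1224)),
    "Xmodem": ((177, 1100), (186, 832)),
}


def compare(programs) -> Dict[str, Tuple[float, float]]:
    """Per program: (% more bytes in Qupls4, % fewer instructions in Qupls4)."""
    rows = {}
    for name, ((q4_insns, q4_bytes), (q5_insns, q5_bytes)) in programs.items():
        more_bytes = 100.0 * (q4_bytes - q5_bytes) / q5_bytes
        fewer_insns = 100.0 * (q5_insns - q4_insns) / q5_insns
        rows[name] = (more_bytes, fewer_insns)
    return rows


results = compare(PROGRAMS)
for name, (more_bytes, fewer_insns) in results.items():
    print(f"{name}: {more_bytes:.1f}% more bytes, {fewer_insns:.1f}% fewer instructions")
```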

_________________
Robert Finch http://www.finitron.ca


Mon Jan 05, 2026 1:02 am