Tonight’s quandary: getting read data to the reservation stations at reasonable speed.
Multiplexing register tags from the issue queues in the reservation stations onto a four-wide bus for register read requests took 289 logic levels. I had intended to register the outputs but forgot. Registering the outputs moved the critical timing path elsewhere, but it looks like a few more registers are still required.
I am guessing that so many logic levels are required because the multiplexers are built out of cascaded LUTs. Discrete logic could probably do better.
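A back-of-the-envelope sketch of why cascading matters, assuming a simplified model that is not taken from the actual netlist: a linear chain of 2:1 muxes adds one logic level per extra input, while a balanced tree of 4:1 muxes (one per LUT is an assumption about the device) grows only logarithmically. The function names are hypothetical.

```python
import math

def chain_levels(n_inputs):
    """Logic levels when inputs are selected by 2:1 muxes cascaded in series."""
    # Each input after the first adds one more mux on the path.
    return max(n_inputs - 1, 0)

def tree_levels(n_inputs, fan_in=4):
    """Logic levels for a balanced mux tree with fan_in inputs per stage (assumed 4:1 per LUT)."""
    levels = 0
    remaining = n_inputs
    while remaining > 1:
        remaining = math.ceil(remaining / fan_in)  # one tree stage reduces the input count
        levels += 1
    return levels

if __name__ == "__main__":
    for n in (4, 16, 64):
        print(n, chain_levels(n), tree_levels(n))
```

Under this model, 64 inputs cost 63 levels as a chain but 3 as a tree, so a count in the hundreds would be consistent with long serial chains rather than balanced trees.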
I am hoping to get 40 MHz out of the core, which should give it roughly the same performance as (or better than) an 80 MHz in-order design.
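The comparison boils down to throughput = clock rate x instructions per clock (IPC). A minimal sketch of the break-even arithmetic follows. The in-order IPC of 0.5 is a made-up figure for illustration only, and the function name is hypothetical.

```python
def breakeven_ipc(f_ooo_mhz, f_inorder_mhz, ipc_inorder):
    """IPC the out-of-order core must sustain to match the in-order core's throughput."""
    # Throughput in millions of instructions per second is f_mhz * ipc.
    return f_inorder_mhz * ipc_inorder / f_ooo_mhz

if __name__ == "__main__":
    # Hypothetical: an in-order core averaging 0.5 IPC at 80 MHz.
    print(breakeven_ipc(40, 80, 0.5))
```

Whatever the in-order IPC actually is, the 40 MHz core has to average twice that IPC to break even.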
I have not fully figured out why synthesis is removing some modules from the design, but I have found what seem to be minor flaws that account for some of the removals. Most of the design is present now.
The 6551 UART was being eliminated. I found that the state machine was advancing too quickly and not allowing the output registers to be set, so they were always zero when the state changed. The tools picked up on this and simply removed the component from the design. The bug was the result of changes made to support two different bus protocols.
Found out the read port select logic was way too slow (291 logic levels). The logic dynamically selects ports for reading. It was packing the port selects into the minimum number of read ports, considering only active ports, which took a ton of multiplexers. Now it is coded differently, as shown in the diagram below.
Attachment:
Qupls4_read_port_selector.png
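For reference, the earlier packing behavior described above can be modeled roughly as follows. This is a behavioral sketch only, not the actual RTL and not the new scheme in the diagram. The function name, the request format, and the port count of four are assumptions for illustration.

```python
def pack_requests(requests, num_ports=4):
    """Pack active read requests into the lowest-numbered read ports.

    requests: list of (active, tag) pairs, one per requester (hypothetical format).
    Returns a list of num_ports entries, each (source_index, tag) or None if unused.
    """
    ports = [None] * num_ports
    next_port = 0
    for src, (active, tag) in enumerate(requests):
        # Inactive requesters are skipped, and requests beyond the port count are dropped.
        if active and next_port < num_ports:
            ports[next_port] = (src, tag)
            next_port += 1
    return ports

if __name__ == "__main__":
    reqs = [(False, 0), (True, 7), (True, 3), (False, 0), (True, 9)]
    print(pack_requests(reqs))
```

In hardware, which port a request lands on depends on how many earlier requests are active, so every port needs a wide multiplexer steered by a running count. That would fit the large number of logic levels seen.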
After a few minor adjustments the timing is up to 37 MHz. The core may need to run below 40 MHz, as I cannot see a way to improve the timing further. The critical path is now in instruction dispatch, which basically copies values from one pipeline register into another that feeds the reservation stations.
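To put the gap in perspective, the clock rates convert to periods as below. This is simple arithmetic only, and the helper names are hypothetical.

```python
def period_ns(f_mhz):
    """Clock period in nanoseconds for a frequency in MHz."""
    return 1000.0 / f_mhz

def shortfall_ns(achieved_mhz, target_mhz):
    """Delay that must be removed from the critical path to reach the target clock."""
    return period_ns(achieved_mhz) - period_ns(target_mhz)

if __name__ == "__main__":
    print(round(period_ns(37), 2))         # period at 37 MHz
    print(round(period_ns(40), 2))         # period at 40 MHz
    print(round(shortfall_ns(37, 40), 2))  # delay to shave off the critical path
```

Going from 37 MHz to 40 MHz means trimming roughly 2 ns from a path of about 27 ns.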