| Author |
Message |
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2458 Location: Canada
|
Quote: With a project your size how do you check for the stupid little errors?

I run something called lint on the code. It checks for all kinds of mistakes. Then I also run synthesis to get the size of things. I spot a lot of the little errors while editing the code. There are bound to be errors yet undetected. I work on things incrementally, and have ported code from other projects that was working. I try not to do too much at once without running some sort of test. Doing diffs helps too. I have lots of experience by now and that helps. I tend to avoid common mistakes.

*****************************

Re-did micro-op loading into the decode stream. The number of micro-ops allowed per instruction was increased to 48 from eight. That is enough micro-ops to cover more complex sequences like float divide and reciprocal. The synthesized size is just slightly less than it was before, while at the same time allowing many more micro-ops per instruction.

I have been busy reducing the size of the core. A whopping 60k LUTs has been removed, without changing functionality. I got the 23,000 LUT ROB down to about 18,000.

Did a lot of work on the micro-op (instruction) decoder, converting about a dozen modules for Qupls4.

Got register r0 working as a general-purpose register now, except when it is used in an address calculation.
Rbase = r0 bypasses to 0
Rindex = r0 bypasses to 0
Bypassing r0 for both base and index allows absolute addressing mode. Otherwise r0 is a general-purpose reg. Register fields can be used as small or large constants for most instructions, so there is not much need to bypass r0 to 0 anywhere else.

Added a LOADI instruction so that loading a constant into a register is possible while r0 is non-zero. An ADD instruction was being used before to load constants.

Added two opcodes to allow instruction pointer relative addressing for loads and stores.

Updated some of the documentation again.
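A minimal sketch of the r0 bypass rule for address calculation, written in Python rather than the core's HDL. The register count, the 64-bit width, and the `scale` and `disp` parameters are assumptions made for illustration; only the rule that r0 reads as zero in the base and index fields comes from the post.

```python
MASK64 = (1 << 64) - 1  # assumed 64-bit address arithmetic


def effective_address(regs, rbase, rindex, disp=0, scale=1):
    """Return base + index*scale + disp, treating r0 as zero in both fields."""
    base = 0 if rbase == 0 else regs[rbase]      # Rbase = r0 bypasses to 0
    index = 0 if rindex == 0 else regs[rindex]   # Rindex = r0 bypasses to 0
    return (base + index * scale + disp) & MASK64


regs = [0] * 32      # hypothetical register file
regs[0] = 0x1234     # r0 holds a real value as a general-purpose register
regs[5] = 0x8000
regs[6] = 4

# Both fields name r0, so only the displacement remains: absolute addressing.
absolute = effective_address(regs, 0, 0, disp=0x4000)
# Ordinary base + scaled index + displacement.
indexed = effective_address(regs, 5, 6, disp=0x10, scale=8)
```

With both fields set to r0 the value stored in r0 is ignored, which is what makes an absolute address expressible even though r0 is otherwise writable.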
_________________Robert Finch http://www.finitron.ca
|
| Sun Dec 07, 2025 2:55 am |
|
 |
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1868
|
> I have been busy reducing the size of the core. A whopping 60k LUTs has been removed, without changing functionality
Very good!
|
| Sun Dec 07, 2025 1:13 pm |
|
 |
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2458 Location: Canada
|
Reduced the size of the core again, due to work on the instruction dispatcher.
Figured out what was causing the tools to take a long time to synthesize. It was in the instruction dispatcher. There were way, way too many muxes. It worked previously because data structure sizes were smaller, and there were fewer functional units. I managed to re-write things in a better fashion and now the instruction dispatch is much smaller. There are some limitations, however.
Previously there was no limitation on instruction dispatch other than a max of four per clock cycle. Now, there are restrictions on which functional units can be issued instructions in the same clock cycle. For instance, both a multiply and a divide cannot be issued in the same clock cycle as they are sharing a dispatch slot now. But I managed to increase the number of dispatch slots to five.
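A rough Python sketch of what sharing a dispatch slot means. The unit names and the unit-to-slot mapping are hypothetical, apart from multiply and divide sharing one slot and there being five slots in total; the real dispatcher may select instructions differently.

```python
# Hypothetical mapping of functional units onto five dispatch slots.
# Only the mul/div sharing is taken from the post.
SLOT_OF_UNIT = {"alu": 0, "mem": 1, "mul": 2, "div": 2, "fpu": 3, "branch": 4}


def dispatch_cycle(ready):
    """Issue at most one instruction per slot; return (issued, deferred)."""
    taken = {}      # slot number -> instruction issued this cycle
    deferred = []   # instructions that lost their slot and wait a cycle
    for insn in ready:
        name, unit = insn
        slot = SLOT_OF_UNIT[unit]
        if slot in taken:
            deferred.append(insn)
        else:
            taken[slot] = insn
    return list(taken.values()), deferred


issued, deferred = dispatch_cycle([("i0", "mul"), ("i1", "div"), ("i2", "alu")])
```

Here the multiply and the ALU op issue together, while the divide waits for the next cycle because its slot is already occupied.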
_________________Robert Finch http://www.finitron.ca
|
| Tue Dec 09, 2025 6:16 am |
|
 |
|
oldben
Joined: Mon Oct 07, 2019 2:41 am Posts: 894
|
Does scaling where you multiply and then divide work ok? tax 7.25 % pennys = (pennys * 725) / 100
|
| Tue Dec 09, 2025 9:04 am |
|
 |
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2458 Location: Canada
|
Quote: Does scaling where you multiply and then divide work ok? tax 7.25 % pennys = (pennys * 725) / 100

In theory it should work. It will still do all the instructions eventually, but it may not be in the first cycle that they are encountered. If a divide and a multiply are ready to dispatch at the same time, it will choose to queue the divide first, as that can take many clock cycles.

The CPU does take dependencies into consideration when going to execute instructions. I think in your example it will do the multiply first, then the divide afterwards, because the division depends on the result of the multiply. It might queue the instructions in a different order, but they do not execute until all the operands are valid. Since the divide depends on the multiply result, it won't execute until later.

The core is too large to fit in the FPGA even in its smallest configuration. It is close to fitting but I am not going to try and shoehorn it in. I may end up writing a software emulator for the core. There is an even larger core with 512-bit registers being worked on.

Started working on getting the thing running in simulation.
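The operand-valid rule can be shown with a toy Python model. This is only a sketch: the queue format, the temporary names `t` and `scaled`, and the one-result-per-cycle timing are assumptions, not the core's actual scheduler.

```python
import operator

OPS = {"mul": operator.mul, "div": operator.floordiv}


def execute(queue, regs):
    """Run queued instructions only once all their source operands are valid."""
    regs = dict(regs)   # registers holding a valid value
    pending = list(queue)
    order = []          # order in which instructions actually execute
    while pending:
        # An operand is valid if it is a constant or an already written register.
        ready = [i for i in pending
                 if all(isinstance(s, int) or s in regs for s in i[2:])]
        if not ready:
            raise RuntimeError("an operand never becomes valid")
        for insn in ready:
            op, dst, a, b = insn
            va = a if isinstance(a, int) else regs[a]
            vb = b if isinstance(b, int) else regs[b]
            regs[dst] = OPS[op](va, vb)
            order.append(op)
            pending.remove(insn)
    return regs, order


# The divide is queued first, but it needs t, which only the multiply produces.
queue = [("div", "scaled", "t", 100), ("mul", "t", "pennys", 725)]
result, order = execute(queue, {"pennys": 1000})
```

Even with the divide sitting first in the queue, the multiply executes first and the divide follows once `t` is valid.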
_________________Robert Finch http://www.finitron.ca
|
| Wed Dec 10, 2025 6:13 am |
|
 |
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2458 Location: Canada
|
It looks like the stripped-down version of the core just might fit (90% full by last estimate). No room for anything else but perhaps a UART. Keyboard controller and text display may also be possible. No room for a DRAM controller though. Will have to use block RAM for memory so no OS.
Altered how MSI interrupts are handled so that 32-bit slave devices could cause MSI interrupts. Previously a lot of information was passed on the response data bus to identify the interrupting device requiring a 64-bit bus. Now there is just a 10-bit index into a device address table that identifies the device. This does limit the number of interrupting devices to 1024 devices per interrupt controller. The system can support 62 QMSI controllers so there is a limit of 62k devices in a system.
The 10-bit index is expanded out to a 45-bit device address using an address table in the QMSI controller. Only the upper 32 bits of the address are recorded in the table since devices are aligned on memory page boundaries (the thirteen LSBs are zero).
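The arithmetic can be checked with a short Python sketch. The function and table names are hypothetical; the bit widths (10-bit index, 32-bit table entries, 13 zero LSBs, 62 controllers) are the ones given above.

```python
INDEX_BITS = 10    # index returned on the response bus
PAGE_SHIFT = 13    # devices sit on page boundaries, so 13 LSBs are zero
ENTRY_BITS = 32    # upper address bits kept in the table
CONTROLLERS = 62   # QMSI controllers supported by the system


def store_device(table, index, address):
    """Record a page-aligned device address, keeping only its upper 32 bits."""
    assert address % (1 << PAGE_SHIFT) == 0, "device must be page aligned"
    entry = address >> PAGE_SHIFT
    assert entry < (1 << ENTRY_BITS), "address wider than 45 bits"
    table[index & ((1 << INDEX_BITS) - 1)] = entry


def device_address(table, index):
    """Expand a 10-bit index back to the full 45-bit device address."""
    return table[index & ((1 << INDEX_BITS) - 1)] << PAGE_SHIFT


table = [0] * (1 << INDEX_BITS)        # 1024 entries per controller
store_device(table, 3, 0x1FFFFFFFE000)  # highest page-aligned 45-bit address
max_devices = CONTROLLERS * len(table)  # 62 * 1024
```

The 32 stored bits plus the 13 implied zero bits give the 45-bit address, and 62 controllers with 1024 entries each give 63,488 devices, which is the 62k figure.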
_________________Robert Finch http://www.finitron.ca
|
| Thu Dec 11, 2025 9:20 am |
|
 |
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2458 Location: Canada
|
Worked on multi-threading and branch streams.
Turned Qupls4 into a multi-threaded machine, supporting four threads. It chooses a thread to run based on a weight register. Higher values in the register increase the chance of the thread running. But computing which instructions get stomped on due to a branch miss has been challenging. A dependency matrix is computed (easy), but using it is another story. The best I have got so far is about 40,000 LUTs to handle 16 levels of branches. If only 4 levels of branches are supported the count goes down to about 13,000 LUTs. Might be doable.
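One way to picture weighted thread selection is the Python sketch below. It assumes a simple proportional random pick; the post does not say how the hardware turns the weight register into a choice, so the selection method, the fallback for all-zero weights, and the example weights are all assumptions.

```python
import random


def pick_thread(weights, rng):
    """Pick a thread index with probability proportional to its weight."""
    if sum(weights) == 0:
        return 0  # assumed fallback when every weight is zero
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def histogram(weights, picks, seed):
    """Count how often each thread is chosen over a number of picks."""
    rng = random.Random(seed)
    counts = [0] * len(weights)
    for _ in range(picks):
        counts[pick_thread(weights, rng)] += 1
    return counts


# Four threads; thread 0 has the largest weight, thread 3 is parked at zero.
counts = histogram([8, 2, 1, 0], picks=4000, seed=1)
```

A thread with a larger weight is chosen more often, and a weight of zero keeps a thread from being selected at all under this scheme.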
Tried to research this topic on the web, read one or two interesting articles.
_________________Robert Finch http://www.finitron.ca
|
| Sat Dec 13, 2025 6:11 am |
|
 |
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2458 Location: Canada
|
Fixed a lot of minor bugs to try and get synthesis to work better. There are literally thousands of warnings. The unoptimized size is about 330,000 LUTs. Optimization is not working correctly: due to bugs in the code, almost everything gets optimized out.
Changed the way checkpoints are freed up. Previously a checkpoint was freed when a branch resolved. The pipeline was being searched for checkpoints coming after the branch that could be freed. The state machine was only partially implemented.
Now the checkpoint is freed simply if it is not in use anywhere in the pipeline, regardless of branch state. The difference is that it is easier to free multiple checkpoints in the same clock cycle, and the state machine that frees checkpoints is simpler. A benefit is that orphaned checkpoints (if there was a hardware glitch) get freed up. The drawback is that a checkpoint may not be freed as soon as possible.
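The new freeing rule amounts to a set difference, sketched here in Python. The entry layout (a `valid` flag and a `checkpoint` field) and the function name are hypothetical stand-ins for the pipeline registers and re-order buffer entries.

```python
def free_unused_checkpoints(allocated, pipeline_entries):
    """Split allocated checkpoints into (still_allocated, freed)."""
    # A checkpoint is in use if any valid pipeline entry refers to it.
    in_use = {e["checkpoint"] for e in pipeline_entries if e["valid"]}
    freed = allocated - in_use   # may free several in one cycle
    return allocated & in_use, freed


pipeline = [
    {"valid": True, "checkpoint": 2},
    {"valid": True, "checkpoint": 5},
    {"valid": False, "checkpoint": 7},  # invalid entry does not hold a checkpoint
]
# Checkpoint 9 is an orphan: allocated but referenced by nothing.
still_allocated, freed = free_unused_checkpoints({2, 5, 7, 9}, pipeline)
```

Checkpoints 7 and 9 are both released in the same pass, including the orphan, without looking at branch state.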
_________________Robert Finch http://www.finitron.ca
|
| Mon Dec 15, 2025 5:13 am |
|
 |
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2458 Location: Canada
|
I stripped the Qupls4 core right down to a bare minimum configuration to see how far off it is from fitting: about 330k LUTs / 530k LCs. Then I decided to see what a large configuration of the core was like, and reconfigured it, adding FP and vector support and a couple of other things. The resulting size was 290k LUTs / 470k LCs, which is 40k LUTs smaller. Now I am scratching my head trying to figure this out. Those are unoptimized sizes. Area-optimized sizes should be a lot smaller.
The optimizer optimizes just about everything out, resulting in a size of 2k LUTs, which is obviously incorrect. Usually that means a major signal like clock or reset is messed up, but I have been unable to figure out yet what is amiss. Some sort of terminal signal is missing, and the optimizer is working backwards from it, eliminating sources. So, it just shows a whole bunch of unconnected signals for what is present.
_________________Robert Finch http://www.finitron.ca
|
| Tue Dec 16, 2025 5:11 am |
|
 |
|
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1868
|
The bigger config is smaller? That's odd. Do let us know when you find what's happening there.
|
| Wed Dec 17, 2025 8:53 am |
|
 |
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2458 Location: Canada
|
Quote: The bigger config is smaller? That's odd. Do let us know when you find what's happening there.

It is really strange. The only thing I can think of, besides some sort of weird coding issue, is that it must have optimized differently. The fact that it is a good chunk smaller means that if it can be made to optimize properly, the size may reduce considerably (meaning it may fit).

Fixed a whole bunch of ‘no field’ errors, changing 79 files. TG for text search and replace on files. All the changes came about as a result of modifying the micro-op structure. The structure was modified to remove a level of indexing ‘.’ There is now only one format of micro-op. There were literally thousands of changes, but it is just a big refactoring.

Did some work on the instruction dispatcher. Modified things to support the REXT (extended register selection prefix). REXT is needed for some floating-point ops. Went through several iterations of modifying the dispatcher to keep the size reasonable. Discovered the 'foreach' loop in SystemVerilog, which makes the code a little cleaner.
_________________Robert Finch http://www.finitron.ca
|
| Thu Dec 18, 2025 6:29 am |
|
 |
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2458 Location: Canada
|
Ran synthesis with the option to not flatten the hierarchy so that missing pieces could be identified. Found out the instruction dispatcher was missing; there were a couple of signals necessary to its use that were not connected. Got the optimized size up to 95k LUTs now. The latest unoptimized size reading was about 400k LUTs.
Found yet another signal that got messed up that might affect things having to do with the memory stage.
Switched constant handling back to an earlier method. Using constant postfix instructions now for large constants. I think it will be easier to manage software-wise in the assembler. Previously, the location of the constant had to be encoded in the instruction. Now there is just a postfix to include which identifies which register is overridden.
_________________Robert Finch http://www.finitron.ca
|
| Sat Dec 20, 2025 1:22 pm |
|
 |
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2458 Location: Canada
|
Many bug fixes from synthesis generation.
Checkpoint index valid bit was not being set when the checkpoint index was set. This led dispatch to ignore all fetch groups causing it to be elided from the design during synthesis.
Valid bits on the pipeline registers were not being set, which also caused modules to be elided.
Found an issue with the DRAM ‘more’ module that detects when to process more data for an unaligned access. The state input was using a bit vector that was too small. It should have been using the state type instead of a bit vector to be safe. I changed the number of states a while ago but did not update the bit vector.
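A small Python illustration of why a hand-sized bit vector goes stale when states are added. The state counts used here (8 growing to 9) are made-up numbers; the post does not give the real ones.

```python
def state_bits(num_states):
    """Bits needed to encode states numbered 0 .. num_states-1."""
    return max(1, (num_states - 1).bit_length())


def truncate(state, width):
    """What a too-narrow port sees: the upper bits are silently dropped."""
    return state & ((1 << width) - 1)


old_width = state_bits(8)   # vector sized when there were 8 states
new_width = state_bits(9)   # one more state needs an extra bit
seen = truncate(8, old_width)  # state 8 passed through the old-width port
```

With the old width, state 8 arrives looking like state 0, which is the kind of aliasing that using the state type for the port avoids.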
Found another issue in the DRAM module: data input/output was shifting by bits for alignment when it should have been shifting by bytes.
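The difference between the two shifts is a factor of eight in the shift amount, as this Python sketch shows. The function names and the example offset are illustrative only.

```python
def align_by_bytes(data, byte_offset):
    """Correct alignment: move data up by whole bytes (8 bits per byte)."""
    return data << (byte_offset * 8)


def align_by_bits(data, byte_offset):
    """The buggy form: the byte offset is used directly as a bit count."""
    return data << byte_offset


right = align_by_bytes(0xAB, 3)   # byte lands in lane 3
wrong = align_by_bits(0xAB, 3)    # byte is smeared across lanes 0 and 1
```

The correct shift places 0xAB cleanly in the fourth byte lane, while the bit shift leaves it straddling two lanes.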
After a few fixes it looks like most of the design is present in the optimized synthesis version, and it is only about 105k LUTs which will easily fit. It is still missing a few pieces like integer divide but the major pieces are there.
I have been reviewing tons of schematics; there are over 100 pages generated.
_________________Robert Finch http://www.finitron.ca
|
| Mon Dec 22, 2025 6:27 am |
|
 |
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2458 Location: Canada
|
Worked on a config utility for Qupls4. It looks like this:

Attachment: Qupls4_config_screen.png

I have been unable to get simulation to work. It reports that packages are not found, even though they are there. I think it may have to do with the size of the project. The toolset won’t allow full hierarchy browsing, saying ‘too many files’. So, I am shelving it for a bit. I really need to rebuild the project with fewer files.

Worked on Bigfoot’s native mode. Bigfoot is a VLIW CPU. I worked out a beautiful set of templates for a 256-bit bundle containing five instructions. It took me hours. Then I realized I forgot to include the predicate bits in the instructions, so they then needed to be a bit wider. So, the bundle now contains four 60-bit instructions and a 16-bit template code. The instructions are so wide because there are 256 registers specifiable. I hope to have it loosely modelled after the Itanium, with the register windows and rotating registers. It will likely have 96-bit floating-point. I really need those 24 digits of precision.
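The bundle arithmetic (four 60-bit instructions plus a 16-bit template is 256 bits) can be sketched in Python. The field order, with the template in the low 16 bits, is an assumption; the post does not say where the template sits in the bundle.

```python
INSN_BITS = 60
TEMPLATE_BITS = 16
SLOTS = 4
BUNDLE_BITS = SLOTS * INSN_BITS + TEMPLATE_BITS  # 4 * 60 + 16


def pack_bundle(template, insns):
    """Pack a template and four instructions into one bundle integer."""
    assert len(insns) == SLOTS
    assert 0 <= template < (1 << TEMPLATE_BITS)
    bundle = template  # assumed: template occupies the low 16 bits
    for i, insn in enumerate(insns):
        assert 0 <= insn < (1 << INSN_BITS)
        bundle |= insn << (TEMPLATE_BITS + INSN_BITS * i)
    return bundle


def unpack_bundle(bundle):
    """Split a bundle back into (template, [insn0..insn3])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    mask = (1 << INSN_BITS) - 1
    insns = [(bundle >> (TEMPLATE_BITS + INSN_BITS * i)) & mask
             for i in range(SLOTS)]
    return template, insns


insns = [0x123456789ABCDEF, 1, (1 << INSN_BITS) - 1, 0]
bundle = pack_bundle(0xBEEF, insns)
```

Packing and unpacking round-trips, and the largest possible bundle still fits in 256 bits.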
_________________Robert Finch http://www.finitron.ca
|
| Wed Dec 24, 2025 5:31 am |
|
 |
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2458 Location: Canada
|
Reduced the number of bits to represent the instruction pointer in each bank of the pipeline register. Only a five-bit offset is stored in the bank, the remaining bits of the IP are stored in the header of the pipeline register, as they are in common for all banks. This shaves about 120+ bits off the register size for each entry (about 2000 total). It does mean that the IP needs to be rebuilt from the pieces in several places.
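Rebuilding the IP from the two pieces might look like the Python sketch below. It assumes the instructions in one pipeline register all fall in the same 32-byte window, so the header bits are simply the IP with the low five bits removed; the actual split in the core may differ.

```python
OFFSET_BITS = 5  # per-bank offset stored in each bank


def split_ip(ip):
    """Split an IP into (common upper bits, five-bit bank offset)."""
    return ip >> OFFSET_BITS, ip & ((1 << OFFSET_BITS) - 1)


def rebuild_ip(common, offset):
    """Recombine the header's common bits with a bank's five-bit offset."""
    return (common << OFFSET_BITS) | offset


# Four instructions from one fetch group (assumed to share upper bits).
ips = [0x10040, 0x10046, 0x1004C, 0x10052]
common = split_ip(ips[0])[0]            # stored once in the header
offsets = [split_ip(ip)[1] for ip in ips]  # stored per bank
rebuilt = [rebuild_ip(common, off) for off in offsets]
```

The header holds the shared bits once and each bank holds only five, at the cost of a recombine step wherever the full IP is needed.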
Found a bug by examining schematics. It is the first time in ages that I have found a bug that way. There was only a single instance of a find-first-one component being generated in a generate block and there should have been four. There was a signal that needed to be an array that was not. All the inputs were the same for each FFO, so the optimizer said “aha, I can just replicate the outputs without creating separate FFOs”. Then things got optimized further, resulting in a lot of missing logic.
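For reference, a find-first-one can be modelled in a few lines of Python. This sketch searches from the least significant bit, which is an assumption; the component in the core may search from the other end. The four-element input list stands in for the signal that needed to be an array.

```python
def find_first_one(x):
    """Index of the least significant set bit of x, or -1 if x is zero."""
    if x == 0:
        return -1
    return (x & -x).bit_length() - 1  # isolate the lowest set bit


# Correct: each of the four FFO instances gets its own element of the array.
inputs = [0b1000, 0b0110, 0b0001, 0b0000]
per_instance = [find_first_one(v) for v in inputs]

# The bug: one shared signal feeds all four, so every output is identical
# and a single instance is enough to produce them all.
shared = [find_first_one(inputs[0]) for _ in range(4)]
```

With a shared input all four results are the same, which is exactly what lets an optimizer collapse four instances into one.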
The re-order entry structure for the re-order buffer refers to another structure called pipeline_reg_t that represents the state of the pipeline. I should have made them both the same structure… I should also have made use of an operand structure to represent operands… Not changing things now. Live and learn.
Found a bug in the register file. Output values were being set to zero at the wrong point. This was a result of changing the register file from a 4wNr to an MwNr file. (The number of write ports is now configurable).
_________________Robert Finch http://www.finitron.ca
|
| Sun Dec 28, 2025 4:41 am |
|