Author |
Message |
oldben
Joined: Mon Oct 07, 2019 2:41 am Posts: 772
|
Quote: Synthesis reports 73,000 LUTs for Stark. This is considerably smaller than Qupls, which I think was over 100,000 LUTs for the same configuration: 2 ALUs, 1 FPU, 1 MEM, and 1 branch unit. Going to try using up some of the difference for a larger ROB.
I just finished my CPU. Let me count the LUTs; it might take a while if I need to use my toes. 303 LUTs for a simple 18-bit CPU split over 3 CPLDs. 1985 tech vs 2025 tech. Not sure when 128-macrocell CPLDs came out. I sent off the ALU PCBs with all the changes made since December 2024.
This sure shows a real big change in tech over the years. Ben.
|
Wed Apr 23, 2025 3:20 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2317 Location: Canada
|
Now up to 92,000 LUTs with a 32-entry re-order buffer. It took seven hours to synthesize. I just cannot design anything with fewer than 1000 LUTs anymore; the logic puzzle is not captivating enough. Moved the rename logic out to an asynchronous process operating on the ROB.
Quote: 303 LUT's for a simple 18 bit cpu
303 LUTs for a CPU is amazing. I think the 6502 is somewhere around 600 LUTs. It is amazing how many transistors a modern CPU may use, and how much can be done with just a few transistors.
_________________Robert Finch http://www.finitron.ca
|
Thu Apr 24, 2025 3:45 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2317 Location: Canada
|
Migrating the machine to a micro-op based design. There are just too many ports per ISA instruction to handle directly. So, the solution is to break up the ISA instructions into micro-ops. A simple micro-op decoder was made. It decodes each ISA instruction into one to eight micro-ops at the decode stage. With four instructions processed at decode, up to 32 micro-ops could be produced. These are buffered in a shift register. Decode then consumes four micro-ops at a time from the head of the shift register. When all the micro-ops are used up, a new set of instructions is fetched and decoded. Many instructions need only a single micro-op, so in many cases the machine is processing four ISA instructions at a time. However, a complex instruction may take more than four micro-ops to process.
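The buffering scheme described above can be sketched as a small behavioural model in Python. This is only an illustration, not the Stark RTL: the expansion table `UOP_COUNT` and its mnemonics are made up, and the model refills and consumes in the same step, which the real pipeline may not do.

```python
from collections import deque

# Hypothetical expansion table: mnemonic -> number of micro-ops (1 to 8).
# The real decoder's table is not given in the post.
UOP_COUNT = {"ADD": 1, "LOAD": 1, "PUSH": 2, "ENTER": 6}

DECODE_WIDTH = 4  # ISA instructions decoded per group, micro-ops consumed per cycle


def expand(instr):
    """Break one ISA instruction into micro-ops named instr.0, instr.1, ..."""
    return [f"{instr}.{i}" for i in range(UOP_COUNT[instr])]


def run(program):
    """Return the list of micro-ops consumed in each cycle."""
    pc = 0
    buffer = deque()  # models the micro-op shift register
    cycles = []
    while pc < len(program) or buffer:
        if not buffer:  # refill only once every micro-op has been used up
            for instr in program[pc:pc + DECODE_WIDTH]:
                buffer.extend(expand(instr))
            pc += DECODE_WIDTH
        take = min(DECODE_WIDTH, len(buffer))
        cycles.append([buffer.popleft() for _ in range(take)])
    return cycles
```

With four single micro-op instructions, `run` finishes in one cycle. A group such as `["ADD", "ENTER", "PUSH", "ADD"]` expands to ten micro-ops and takes three cycles (4, 4, 2), which is the slowdown for complex instructions mentioned above.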
_________________Robert Finch http://www.finitron.ca
|
Fri Apr 25, 2025 7:45 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2317 Location: Canada
|
Put in a small fix to suppress stomp logic when the destination of a branch is found in the reorder buffer.
I think I have got the single-step mode logic restored.
Another hoop to jump through: making modifiers work when in single-stepping mode.
More work on micro-ops, and back-tracking on all the ALU result ports.
Interrupts that occur in the middle of an instruction's micro-op stream are now deferred to the start of the next instruction.
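A minimal Python sketch of the interrupt deferral rule, assuming the micro-op stream is numbered from zero and an instruction boundary is the index of an instruction's first micro-op. The function name and arguments are illustrative only.

```python
def interrupt_taken_at(uop_counts, irq_at):
    """Return the micro-op index at which a pending interrupt is taken.

    uop_counts: micro-ops per ISA instruction, in program order.
    irq_at:     index in the flat micro-op stream where the interrupt arrives.
    The interrupt is held until the first instruction boundary at or after irq_at.
    """
    boundary = 0
    for count in uop_counts:
        if boundary >= irq_at:
            return boundary
        boundary += count
    return boundary  # taken after the last instruction completes
```

For instructions of 1, 3, and 2 micro-ops, an interrupt arriving at micro-op 2 (inside the second instruction) is taken at micro-op 4, the start of the third instruction.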
_________________Robert Finch http://www.finitron.ca
|
Sat Apr 26, 2025 4:17 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2317 Location: Canada
|
A busy day. A lot of minor changes to improve synthesis; the result is a larger core, but hopefully closer to working. Also some more major changes.
* Got rid of processing for the fifth instruction in the in-order pipeline stages. Processing is now limited to four instructions. The fifth instruction was for postfixes, which have been removed from the design.
* Moved the backout flag out to a separate module.
* Moved the restore flag generation out to a separate module.
* Moved copy destination flags logic out to a separate module.
* Created a module for register validation in the reservation stations. This is instead of using a task. Synthesis warned about the assignments using a task. I was not sure if it would work or not, so I made sure by creating a module instead.
* Moved inline code for the dram done signal to a separate module. There were two copies of the code in the mainline, one for each dram port. There is now only a single copy to maintain.
Found an alternate way to implement sync and flow control dependencies. If there is a sync instruction, the instructions following it should not issue. Similarly, if there is a flow control op, the memory store instructions following it should not issue. This was previously implemented by searching the ROB for preceding sync or flow control instructions. It is now done by recording the ROB entry of the sync or flow control op at enqueue time. When the sync or flow control op commits, the dependency is cleared from the dependent instructions. I am not sure it is any better. The idea was to try to reduce the amount of logic.
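The enqueue-time scheme can be modelled roughly as below. This is a Python sketch under several assumptions: the class and field names are invented, the ROB is a plain growing list rather than a circular buffer, and "branch" stands in for any flow control op.

```python
class Rob:
    """Toy ROB that records sync / flow-control dependencies at enqueue time."""

    def __init__(self):
        self.entries = []
        self.last_sync = None  # ROB index of the newest uncommitted sync
        self.last_fc = None    # ROB index of the newest uncommitted flow control op

    def enqueue(self, kind):
        entry = {
            "kind": kind,
            "sync_dep": self.last_sync,  # everything waits on a sync
            "fc_dep": self.last_fc if kind == "store" else None,  # only stores wait
        }
        self.entries.append(entry)
        idx = len(self.entries) - 1
        if kind == "sync":
            self.last_sync = idx
        elif kind == "branch":
            self.last_fc = idx
        return idx

    def can_issue(self, idx):
        e = self.entries[idx]
        return e["sync_dep"] is None and e["fc_dep"] is None

    def commit(self, idx):
        # Broadcast the committing index and clear matching dependencies.
        for e in self.entries:
            if e["sync_dep"] == idx:
                e["sync_dep"] = None
            if e["fc_dep"] == idx:
                e["fc_dep"] = None
        if self.last_sync == idx:
            self.last_sync = None
        if self.last_fc == idx:
            self.last_fc = None
```

The search over older ROB entries at issue time is replaced by one stored index per entry plus a compare at commit. Whether that is smaller in hardware depends on the ROB size, which fits the doubt expressed above.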
_________________Robert Finch http://www.finitron.ca
|
Sun Apr 27, 2025 4:15 am |
|
 |
oldben
Joined: Mon Oct 07, 2019 2:41 am Posts: 772
|
Do you have any error correcting on memory?
|
Sun Apr 27, 2025 4:25 pm |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2317 Location: Canada
|
Quote: Do you have any error correcting on memory?
Nope.
Broke the ALU / FPU up into more components with different latencies to get better performance. Most of the components could handle a new instruction every clock cycle, but were limited by a ‘done’ signal for longer latency components. For instance, integer multiply takes three clocks, but can start a new multiply every clock cycle. The previous configuration stalled the integer ALU for three clocks while the multiply completed. It had to because it was in the same pipeline as other integer operations. Now there is no stall.
Made up a nice PowerPoint slide set for the in-order pipeline. Makes it easier to see what I am doing. Needs a lot of changes in Stark.sv.
Attachment: StarkCPU_execute_stage.png
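To put rough numbers on the multiply example in this post, here is a small Python model comparing a shared ALU gated by a 'done' signal with separate, fully pipelined units. The add latency of one clock is an assumption, and the model ignores data dependencies and result-bus conflicts.

```python
# Multiply latency of 3 is from the post; add latency of 1 is assumed.
LATENCY = {"add": 1, "mul": 3}


def finish_shared(ops):
    """One ALU gated by 'done': the next op starts only when the previous one finishes."""
    return sum(LATENCY[op] for op in ops)


def finish_split(ops):
    """Separate pipelined units: one op starts per clock and finishes after its latency."""
    return max((start + LATENCY[op] for start, op in enumerate(ops)), default=0)
```

For the stream `["mul", "mul", "add", "mul"]` the shared ALU needs 10 clocks, while the split units finish in 6.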
_________________Robert Finch http://www.finitron.ca
|
Tue Apr 29, 2025 2:53 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2317 Location: Canada
|
Major re-write day. Got rid of the scheduler component, replaced with an instruction dispatcher and scheduling in the reservation stations. The result is a little larger, but should be better performance.
Also made the reservation station generic so the same station can be instantiated for different functional units. Added a three-entry queue to the station. This makes the station quite a bit larger, so it is an option. Each station can request a register file read for up to four registers per clock cycle. With 11 stations this is 44 reads. I tried building the read port selector as a 44:16 mux and it was quite large. So, I changed it to a 64:16 mux and the result was 33% smaller. I guess the input count not being a power of two made it harder for the synthesizer to optimize. It is a 64:16 mux now with 20 slots unused.
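The slot arithmetic above, and one possible selection policy, can be sketched in Python. The fixed-priority grant (lowest slot first) is an assumption; the post does not say how the 16 ports are arbitrated.

```python
STATIONS = 11
READS_PER_STATION = 4
READ_PORTS = 16


def padded_slots(n):
    """Round n up to the next power of two."""
    return 1 << (n - 1).bit_length()


def select_reads(requests, ports=READ_PORTS):
    """Grant up to `ports` register reads.

    requests: one entry per request slot, a register number or None if idle.
    Returns (slot, register) pairs, lowest slot first (assumed fixed priority).
    """
    pad = padded_slots(len(requests)) - len(requests)
    slots = requests + [None] * pad  # unused slots never request a read
    active = [(slot, reg) for slot, reg in enumerate(slots) if reg is not None]
    return active[:ports]
```

Here `padded_slots(STATIONS * READS_PER_STATION)` gives 64, leaving 20 slots tied off, which matches the numbers in the post.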
A big addition was the reservation_station_entry_t structure. Reservation entries are now passed around instead of individual signals. It makes the code a little cleaner.
The organization of the CPU is now such that there are parallel pipelines for execution units, which may have different latencies. Some of the stalls were eliminated.
The core was around 110,000 LUTs, but recent changes likely made it significantly larger. I am guessing 150,000 LUTs. The core is synthesizing at the moment (it takes about 3 hours).
_________________Robert Finch http://www.finitron.ca
|
Wed Apr 30, 2025 4:28 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2317 Location: Canada
|
Spent time fixing hundreds of minor bugs in preparation for simulation. Also got working on micro-ops. Here is a diagram of how they fit into the pipeline.
Attachment: StarkCPU_decode_stageA.png
Attachment: StarkCPU_decode_stageB.png
_________________Robert Finch http://www.finitron.ca
|
Thu May 01, 2025 5:24 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2317 Location: Canada
|
Worked on getting the core to simulate and synthesize properly. Synthesis insists the core is only about 40k LUTs, but I know it should be well over 100k LUTs. It is trimming out logic for a reason yet to be determined. Added the floating-point control and status register. Added FMA instruction support and documented it. Modified the result queues to output an almost-full status.
_________________Robert Finch http://www.finitron.ca
|
Fri May 02, 2025 3:05 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2317 Location: Canada
|
Modified the micro-op translator to convert FMUL, FADD, and FSUB into FMA instructions. Added two bits to the micro-op structure to support a third source register extension field. A few instructions have a third source register, such as CHK, CMPSWAP, FMA, and FMS. Also added an exception bit to the micro-op structure.
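The arithmetic identities behind folding FMUL, FADD, and FSUB into FMA can be sketched in Python. This is not the Stark translator: `to_fma` works on values rather than register fields, how the core supplies the constants is not stated in the post, and `fma` here is an unfused stand-in that rounds after the multiply.

```python
def fma(a, b, c):
    """Stand-in for the hardware unit: computes a * b + c (not fused here)."""
    return a * b + c


def to_fma(op, a, b):
    """Rewrite a two-operand FP op as the (a, b, c) operands of an FMA."""
    if op == "FMUL":
        return (a, b, -0.0)  # adding -0.0 preserves the sign of a zero product
    if op == "FADD":
        return (a, 1.0, b)   # a * 1.0 is exact
    if op == "FSUB":
        return (a, 1.0, -b)  # could equally map onto FMS
    raise ValueError(f"not a foldable op: {op}")
```

For example, `fma(*to_fma("FSUB", 5.0, 7.5))` gives `-2.5`. The `-0.0` addend for FMUL matters only when the product is itself a zero: adding `+0.0` would turn a `-0.0` product into `+0.0` under round-to-nearest.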
Added stomp logic to the reservation stations, result queues and meta_ interfaces to function units. Reduced the size of result queues to 12 entries to keep the size under control as they are made more complex due to inclusion of stomp logic.
Got the core size up to about 94k LUTs. Thinking that is still not quite the right size, I tried synthesizing the core using a run-time optimized approach, which skips some optimizations. The result was 347.4k LUTs in size. It looks like everything was included. So, I have run synthesis again, this time optimized for area. After six hours it is still running, but close to being done.
_________________Robert Finch http://www.finitron.ca
|
Sat May 03, 2025 2:54 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2317 Location: Canada
|
There are a few muxes in the design (sample):
Code:
512 Input 2 Bit Muxes := 16
2 Input 1 Bit Muxes := 108366
5 Input 1 Bit Muxes := 136
Seems to fit them onto the chip though.
_________________Robert Finch http://www.finitron.ca
|
Sat May 03, 2025 3:06 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2317 Location: Canada
|
It looks like the CPU is around 100k LUTs (160k LCs) when synthesized optimized for area.
Worked on putting together a co-processor for the StarkCPU. Decided to use RISC-V, then found out after synthesizing it that it was too large: almost 7000 LUTs. Looked at the resource utilization and the divider logic was consuming 5000 LUTs. So, I removed it, thinking it would reduce the size by about 5000 LUTs. The result: synthesis reports the size as about 5700 LUTs. Still too large for the intended use. I was not able to figure out why the size was so large. I want a CPU that is around 1000 LUTs for the control.
So, I redid the co-processor as a scaled-down 32-bit version of the Stark CPU. Result is about 850 LUTs, a usable size. It has its own program memory and scratchpad RAM, and a bus for external memory access. The address bus is limited to 16 bits. Internal program space is limited to 12 kB. Auto condition-recording to CR0 is not supported. Only a handful of instructions are supported, just enough for its intended purpose: ADD, CMP, AND, OR, XOR, SLL, SRL, LOAD, STORE, conditional branches, subroutine calls and returns. It should be able to read/write parts of the larger CPU and interface to a serial port to run a monitor. It is not very fast, taking a minimum of five clock cycles per instruction.
_________________Robert Finch http://www.finitron.ca
|
Sun May 04, 2025 6:27 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2317 Location: Canada
|
Separated the StarkCPU code out from the Qupls code in the Vivado project, and made it its own project. With them combined there were about 500 files involved. Separating them reduced the number of files to fewer than 400. It takes time to process all the files for simulation and synthesis.
Simulation crashes with a stack trace so I have not been able to simulate anything yet. I tried running a newer version of the toolset, but it looks like it just got stuck in an infinite loop processing the same files over and over again, instead of crashing. I cancelled it after about 10 mins.
Running into namespace conflicts and having to prefix names with the namespace.
Vivado seems to have an annoying property that it will automatically import packages even when they are not specified as part of the translation unit. I have two packages in the project that use the same names for structures and constants. Even though I specify the import that is desired, both packages are imported causing name conflicts.
_________________Robert Finch http://www.finitron.ca
|
Wed May 07, 2025 7:07 am |
|