 Qupls (Q+) 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Qupls (Q+) is a new CPU project, started Nov 16th, 2023. A good portion of the Thor core is being rewritten, hence the new name for the project.

Qupls takes the Thor2025 ISA and simplifies it. Predication and masks are gone. Instructions can vary in length, but most are 32 bits. With the simplified ISA, fewer register ports are required, meaning more can be made available for parallelism. The goal is an at least two-way, and possibly three- or four-way, superscalar out-of-order processor.

The ROB operates differently than it did in Thor. It no longer stores values; values are maintained by the register file.

_________________
Robert Finch http://www.finitron.ca


Sun Nov 19, 2023 4:00 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1783
Should be interesting! I know the number of register ports is an important concern in commercial CPU designs, because I remember a comment or two about it at work.


Sun Nov 19, 2023 7:48 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
Should be interesting! I know the number of register ports is an important concern in commercial CPU designs, because I remember a comment or two about it at work.

Yes, there are conflicting requirements for the register file. It should be as small as possible for speed, yet it needs to service the functional units adequately.

The PRF (physical register file), RAT (register alias table) and register renaming are all features of commercial designs. I think most of the patents have expired by now, meaning no worries for an open-source homebrew CPU using these features.
The Qupls register file is large, 4w18r (four write and eighteen read ports), consuming 72 block RAMs and 7600 LUTs for a live value table (LVT). The Thor register file was 3w10r but could only handle queuing two instructions at a time. The Qupls register file provides ports for six instructions.

The PRF is so large to accommodate a full set of three read ports for six different functional units. Although there are six functional units only four of them write to the register file, hence four write ports. Each functional unit has its own dedicated ports on the register file, contrasted with Thor where values are read from and written to the re-order buffer. Removing the value management from the ROB reduces the ROB size and associated logic considerably. Forwarding values between units does not require being able to read a value from any ROB entry. There are fewer multiplexors involved. Less is better.
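
For readers unfamiliar with the live value table approach, here is a rough Python model of it, a sketch and not the Qupls RTL. It assumes the common construction in which each write port owns a bank that is replicated once per read port, with the LVT recording which write port last wrote each register. The class and method names are made up for illustration. Under that assumption the block RAM count is write ports times read ports, which agrees with the 4 x 18 = 72 figure above.

```python
class LvtRegFile:
    """Behavioral model of an LVT-based multi-ported register file."""

    def __init__(self, nregs=192, nwrite=4, nread=18):
        self.nwrite, self.nread = nwrite, nread
        # One bank per write port. In hardware each bank would be replicated
        # once per read port, so the block RAM count is nwrite * nread.
        self.banks = [[0] * nregs for _ in range(nwrite)]
        # Live value table: which write port holds the newest value.
        self.lvt = [0] * nregs

    def bram_count(self):
        return self.nwrite * self.nread

    def write(self, port, reg, value):
        self.banks[port][reg] = value
        self.lvt[reg] = port

    def read(self, reg):
        # Every read port does the same thing: select the live bank.
        return self.banks[self.lvt[reg]][reg]


rf = LvtRegFile()
rf.write(0, 5, 111)
rf.write(3, 5, 222)   # a later write through another port wins
print(rf.bram_count(), rf.read(5))
```

The LVT itself needs as many write and read ports as the whole file, which is why it is built from LUTs rather than block RAM.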

Qupls retains Thor’s unified register file with 64 entries. The 64 architectural registers are supported with 192 physical registers via register renaming. Because a block RAM is used for each port, there could be up to 512 registers supported. I do not know what to do with the 320 unused registers. One thought is to have two register banks, possibly switching for exception processing.

Currently I am studying branch checkpoint logic (GC) in association with the register alias table (RAT). I got some pointers on logic associated with physical register files from a poster on comp.arch. I wrote a component to manage which registers are allocated from the PRF (physical register file). It uses the trick of having the pick function look at available registers in groups of 48: there are four 48-register pick functions rather than one choosing from all 192 registers. This reduces the size of the pick function and improves performance. It is possible to select from a reduced number of registers; as long as the ROB has fewer than 48 entries, there will always be registers available in the group of 48, because registers will be freed up before they run out.

The typically suggested way of building the register renamer is to use FIFOs. I think it is more practical to use a bitmap of available registers and find-first-one (FFO) functions.
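
The bitmap plus find-first-one idea can be sketched in Python as below. This is only a software model under some assumptions: the `FreeList` class and its method names are hypothetical, one group is assumed to serve one instruction slot, and details such as reserving physical register 0 are ignored.

```python
GROUPS, GROUP_SIZE = 4, 48   # 4 x 48 = 192 physical registers


class FreeList:
    def __init__(self):
        # One 48-bit availability bitmap per group; a 1 bit means free.
        self.avail = [(1 << GROUP_SIZE) - 1 for _ in range(GROUPS)]

    @staticmethod
    def ffo(bits):
        """Index of the lowest set bit, or -1 if no bit is set."""
        return (bits & -bits).bit_length() - 1

    def alloc(self, group):
        """Pick a free register from one group (assumed: one group per slot)."""
        idx = self.ffo(self.avail[group])
        if idx < 0:
            return None                      # group exhausted
        self.avail[group] &= ~(1 << idx)
        return group * GROUP_SIZE + idx

    def free(self, preg):
        group, idx = divmod(preg, GROUP_SIZE)
        self.avail[group] |= 1 << idx


fl = FreeList()
picks = [fl.alloc(g) for g in range(GROUPS)]   # four picks in "one cycle"
print(picks)
```

Each pick only has to search 48 bits, and the four picks are independent of each other, which is the point of splitting the 192 registers into groups.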

_________________
Robert Finch http://www.finitron.ca


Mon Nov 20, 2023 3:07 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Some work on the front end of the CPU. It is turning into a 4-wide machine.

The instruction cache is inherited from Thor, being used almost verbatim. It is designed to return an entire cache line and handle oddball instruction lengths so it may be used in different projects.

In theory the PC increments only once four fetched instructions are queued. Queuing may take more than one clock, but ideally all instructions queue in a single clock. As instructions queue, a flag is set indicating they queued. Once the flag status is 4’b1111 (all queued), the PC moves to the next PC. The next PC comes from the early branch predictor, the branch-target-buffer. The process is made tricky by pipelining and the desire to fetch four instructions per clock.

The branch-target-buffer works in blocks. The BTB works in an approximate fashion, the best it can do since it does not know what instructions are executing. They have not been decoded yet. BTB entries are updated if there is a taken branch in the four-instruction group of the branch being executed. The entire group of instruction addresses is stored in the table under the address of the first instruction of the group. The current PC is used to index into the groups stored in the table. If any instruction in the group stored takes a branch, then that branch address is used as the next PC. Otherwise, the next PC is simply the current PC incremented by the sum of the lengths of the four instructions fetched at the current PC.
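
A simplified Python sketch of the next-PC selection just described. It is a model under assumptions, not the actual BTB: it stores only one target per group rather than the whole group of instruction addresses, and the `GroupBTB` name, the 512-entry size and the indexing function are all made up for illustration.

```python
class GroupBTB:
    """Branch target buffer indexed by the address of a fetch group."""

    def __init__(self, entries=512):     # table size is an assumption
        self.entries = entries
        self.table = {}                  # index -> (group_pc, target)

    def _index(self, pc):
        return pc % self.entries

    def update(self, group_pc, target):
        # Called when a branch in the group starting at group_pc is taken.
        self.table[self._index(group_pc)] = (group_pc, target)

    def next_pc(self, pc, lengths):
        hit = self.table.get(self._index(pc))
        if hit is not None and hit[0] == pc:
            return hit[1]                # a branch in this group was taken
        return pc + sum(lengths)         # step past all four instructions


btb = GroupBTB()
print(hex(btb.next_pc(0x1000, [4, 4, 5, 1])))
btb.update(0x1000, 0x2000)
print(hex(btb.next_pc(0x1000, [4, 4, 5, 1])))
```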

The instruction length is decoded from the opcode. There are currently six different instruction lengths: 1 byte for the NOP opcode; 5 bytes for branches, float, and indexed memory operations; 4, 6, 10, or 18 bytes for postfix instructions; and 4 bytes for the remaining instructions. The decoder is very small, occupying only 7 LUTs.

Instructions are extracted from the cache line based on the length of previous instructions. Four instructions are extracted per cycle.
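
As an illustration of extracting four variable-length instructions from a cache line, here is a short Python sketch. The opcode numbers are invented, since the real Qupls encodings are not given here, and the opcode is assumed to sit in the low seven bits of the first byte. Instructions that cross the end of the cache line are ignored.

```python
# Hypothetical opcode numbers; only the lengths come from the description.
LENGTH_BY_OPCODE = {
    0x7F: 1,                      # NOP
    0x20: 5, 0x21: 5, 0x22: 5,    # branch, float, indexed memory
    0x7C: 6, 0x7D: 10, 0x7E: 18,  # longer postfix forms
}
DEFAULT_LENGTH = 4                # everything else, including 4-byte postfixes


def insn_length(first_byte):
    """Length in bytes, decoded from the (assumed) seven-bit opcode field."""
    return LENGTH_BY_OPCODE.get(first_byte & 0x7F, DEFAULT_LENGTH)


def extract_four(line, offset):
    """Return four (offset, length) pairs plus the offset of the next group."""
    group = []
    for _ in range(4):
        length = insn_length(line[offset])
        group.append((offset, length))
        offset += length              # each start depends on earlier lengths
    return group, offset


line = bytes([0x7F, 0x20, 0, 0, 0, 0, 0x01, 0, 0, 0, 0x7C, 0, 0, 0, 0, 0]) + bytes(48)
print(extract_four(line, 0))
```

The loop makes the serial dependency visible: the start of each instruction is known only after the lengths of the ones before it, which is what makes a four-wide fetch of variable-length instructions awkward in hardware.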

The current implementation of the CPU may not be able to execute back-to-back dependent instructions in consecutive clock cycles, which would give the maximum performance. Instead, back-to-back dependent operations need to wait until values percolate through the register file. This costs about 10% in performance according to what I have read, but the smaller, simpler core may operate at a higher fmax, so overall performance may not really be reduced. I may need to add bypassing logic at a later date.

_________________
Robert Finch http://www.finitron.ca


Wed Nov 22, 2023 3:04 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Worked on the mainline of Q+ today: enqueue and commit. Commit seems to be easy. It just frees up the target register used by the instructions and advances the head pointer. If the ROB entry has been marked done, then the register file will have already been updated. So, there is little to do in commit. This is unlike the Thor ROB, where commit updates the register file. One complication is that commit commits either four instructions or no instructions at once. The ROB is currently organized as 16 entries with 4 slots for each entry. Each slot accommodates an instruction. Instructions are then enqueued and committed in groups of four. The head and tail pointers of the ROB are in terms of groups of four instructions.

When instructions are enqueued, instructions following a predicted taken branch are stomped on. This leaves holes in the ROB with slots that are ignored. This is in contrast with Thor, which packs the entries as they are being queued. Q+ has many more ROB entries than Thor.
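
The group organization can be modeled in a few lines of Python. This is a behavioral sketch with invented names (`GroupedROB`, `taken_slot`), not the actual Verilog; the only things taken from the description are the 16 x 4 layout, the stomping of slots after a taken branch, and the all-or-nothing commit.

```python
from collections import deque


class GroupedROB:
    def __init__(self, groups=16):
        self.capacity = groups
        self.groups = deque()     # the head of the ROB is groups[0]

    def enqueue(self, insns, taken_slot=None):
        """Queue four instructions; stomp those after a taken branch."""
        if len(self.groups) == self.capacity:
            return False
        group = []
        for slot, insn in enumerate(insns):
            stomped = taken_slot is not None and slot > taken_slot
            # A stomped slot is a hole: invalid, and trivially done.
            group.append({"insn": insn, "valid": not stomped, "done": stomped})
        self.groups.append(group)
        return True

    def mark_done(self, group_idx, slot):
        self.groups[group_idx][slot]["done"] = True

    def commit(self):
        """Commit the whole head group or nothing at all."""
        if not self.groups or not all(s["done"] for s in self.groups[0]):
            return []
        return [s["insn"] for s in self.groups.popleft() if s["valid"]]


rob = GroupedROB()
rob.enqueue(["add", "beq", "sub", "or"], taken_slot=1)
rob.mark_done(0, 0)
print(rob.commit())      # the branch is not done yet, so nothing commits
rob.mark_done(0, 1)
print(rob.commit())      # the two stomped slots are skipped
```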

_________________
Robert Finch http://www.finitron.ca


Thu Nov 23, 2023 4:12 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Worked on the TLB and hardware page table walker (PTW) today. The TLB is vastly different from the Thor2024 TLB. It has only 128 entries, which are two-way associative. It is shared between instruction and data. Thor’s TLB was much larger, 1024 entries, six-way associative, as it was built out of block RAM and intended to be shared between multiple CPU cores. The Qupls TLB attempts to do two address translations per clock cycle and stores misses in a miss queue, so translations do not block. The miss queue is transferred to the PTW, where the misses are looked up and updated translations are sent back to the TLB.
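
A minimal Python model of a non-blocking, two-way TLB with a miss queue follows. It is a sketch under assumptions: the 4 KiB page size, the round-robin replacement and all names are invented for the example; only the 128-entry two-way organization, the two lookups per cycle and the miss queue come from the description above.

```python
from collections import deque

PAGE_BITS = 12          # assumed page size; not stated in the post
SETS, WAYS = 64, 2      # 64 sets x 2 ways = 128 entries


class Tlb:
    def __init__(self):
        self.sets = [[None] * WAYS for _ in range(SETS)]   # (vpn, ppn) or None
        self.victim = [0] * SETS                           # round-robin (assumed)
        self.miss_queue = deque()

    def translate(self, vaddr):
        vpn, offset = vaddr >> PAGE_BITS, vaddr & ((1 << PAGE_BITS) - 1)
        for entry in self.sets[vpn % SETS]:
            if entry is not None and entry[0] == vpn:
                return (entry[1] << PAGE_BITS) | offset
        if vpn not in self.miss_queue:
            self.miss_queue.append(vpn)     # hand off to the PTW, do not block
        return None

    def translate_pair(self, va0, va1):
        """Two lookups in the same cycle."""
        return self.translate(va0), self.translate(va1)

    def fill(self, vpn, ppn):
        """Called with the PTW's answer for a queued miss."""
        s = vpn % SETS
        self.sets[s][self.victim[s]] = (vpn, ppn)
        self.victim[s] ^= 1
        if vpn in self.miss_queue:
            self.miss_queue.remove(vpn)


tlb = Tlb()
print(tlb.translate_pair(0x1234, 0x5678), list(tlb.miss_queue))
tlb.fill(1, 0x80)
print(hex(tlb.translate(0x1234)), list(tlb.miss_queue))
```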

Experimented with having root pointers for each address space stored in the PTW.

Also started working on the load and store queues.

_________________
Robert Finch http://www.finitron.ca


Sat Nov 25, 2023 4:27 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
More work on the page table walker. Modified the page table base register. Previously the base register contained control bits plus a page number of the page table. Now it just contains an address plus three bits indicating the level of table to begin at. The other control bits have been moved to another register. The switch from a page number to an eight-byte aligned address makes it easier to manage in software. It also allows reaching into the top-level page directory, allowing the top-level page directory to be shared between tasks. This is commonly done by an OS.
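
One way such a base register could be packed is sketched below in Python. The placement of the three level bits in the low bits of the eight-byte aligned address is an assumption made for the example (those bits are otherwise always zero); the post does not say where the level field actually lives, and the function names are hypothetical.

```python
LEVEL_MASK = 0x7   # three bits, so up to eight levels


def make_ptbr(table_addr, start_level):
    """Combine an eight-byte aligned table address with a starting level."""
    if table_addr & LEVEL_MASK:
        raise ValueError("page table address must be eight-byte aligned")
    if not 0 <= start_level <= LEVEL_MASK:
        raise ValueError("level must fit in three bits")
    return table_addr | start_level


def split_ptbr(ptbr):
    """Recover (table address, starting level) from the register value."""
    return ptbr & ~LEVEL_MASK, ptbr & LEVEL_MASK


ptbr = make_ptbr(0x12345678, 2)
addr, level = split_ptbr(ptbr)
print(hex(ptbr), hex(addr), level)
```

Because the address is eight-byte aligned rather than page aligned, it can point at an individual entry inside a table, not just at the start of a page.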

The page table walker was partially modified to allow walking a page table with any number of levels, up to eight. Currently it is coded for a 32-bit virtual address, but it is set up to easily support more address bits.
The PTW has an eight-entry miss queue to store TLB misses. It can have updates for multiple misses in progress at the same time. It is designed to work in an asynchronous fashion unless a page fault occurs. A page fault causes the PTW to grind to a halt until the page fault is cleared.

The PTW might service misses in any order, for instance:
Miss1 level 2
Miss2 level n
Miss3 level m
Miss1 level 1
Miss1 level 0
Miss2 level n-1
Etc.
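
The interleaving can be imitated with a small Python generator. This is only a sketch: it assumes a simple round-robin over the queued misses, whereas the real order would depend on when memory responses arrive, and the `walk_order` name is made up.

```python
from collections import deque


def walk_order(misses):
    """misses: list of (name, start_level). Yields (name, level) steps."""
    queue = deque(misses[:8])                 # eight-entry miss queue
    while queue:
        name, level = queue.popleft()
        yield name, level                     # one table lookup at this level
        if level > 0:
            queue.append((name, level - 1))   # requeue for the next level down


for step in walk_order([("Miss1", 2), ("Miss2", 1)]):
    print(step)
```

Each miss walks down to level 0 at its own pace while the others make progress in between.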

_________________
Robert Finch http://www.finitron.ca


Sun Nov 26, 2023 4:35 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
More work on the page table walker. Added an active region component to the TLB. The active region provides defaults for access rights and cacheability for TLB entries. These access rights may be overridden if specified in the TLB entry.

Updated the mainline code. Added an arbiter for external bus access. There are currently four modules requiring access to the external bus: the page table walker, the instruction cache, the first data port, and a second data port. They are prioritized for access in the given order.
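
The fixed-priority scheme amounts to the following, shown as a Python sketch. The requester names are placeholders chosen for the example; only the priority order is taken from the description.

```python
# Highest priority first: page table walker, instruction cache, data ports.
PRIORITY = ["ptw", "icache", "dport0", "dport1"]


def arbitrate(requests):
    """requests: set of requester names. Returns the winner, or None."""
    for name in PRIORITY:
        if name in requests:
            return name
    return None


print(arbitrate({"dport1", "icache"}))
```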

Added reservation stations for the address generators.

The code is looking a bit better. Closer to being able to get an approximate size. But not ready for synthesis yet.

_________________
Robert Finch http://www.finitron.ca


Mon Nov 27, 2023 4:09 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Added short forms of 64-bit load and store instructions. The short form fits into 24 bits but only has a five-bit displacement. Also added a quick immediate ADD instruction that adds a five-bit constant; the instruction also fits into 24 bits. Modified the R1 type instructions so they all fit into 24 bits.

Modified the register file, removing a read port by noting that only ALU #0 has three source operands. ALU #1 only needs two operands. The three-operand instructions (MUX, CMOVxx, MIN3, MAX3, PTRDIF) are rarely used, so they are supported only on ALU #0.

Worked on the load/store queue.

Addresses are generated by agen then fed to the TLB to look up the physical address. Not sure where to put the results though. I do not want to place them in the load / store queue right away as there may be other load / store instructions that come before in program order that have not been processed yet. I want the loads and stores to happen in program order.
My current thought is to place the addresses in the ROB and schedule memory instructions like other instructions.

_________________
Robert Finch http://www.finitron.ca


Tue Nov 28, 2023 4:03 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Had to add a TLB valid signal to the icache controller. The mainline icache address now feeds through the TLB. The TLB is triple ported, with an I$ port and two D$ ports. The caches were basically taken from Thor2024, where they were working, and plopped into Qupls. There should be fewer issues getting them to work than there were with Thor.

Added some logic for memory access. Almost scrapped the load / store queue in favor of doing things through the ROB, but eventually decided to keep it.

Did some work on the ALU.

Mostly porting code from Thor2024.

_________________
Robert Finch http://www.finitron.ca


Wed Nov 29, 2023 5:21 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Some work on the commit logic and committing CSR values and handling exceptions. Commit will now only commit up to the first CSR instruction or exception; previously it always committed four instructions. There is only a single CSR read / write port so accessing a CSR tends to serialize the CPU operation. The value to load into the CSR is handled differently than general purpose register updates. The value is passed in the re-order buffer to the commit stage. There is no register renaming for the CSR registers.

The mainline code is looking a lot more complete, but flow control code still needs to be added.

_________________
Robert Finch http://www.finitron.ca


Thu Nov 30, 2023 3:11 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Some work on the RAT and register banks. As a first try, two banks of 64 registers will be supported. This may turn out to be too large. The first eight registers will be shared between banks to allow passing data between operating modes. The RAT had to be modified so that a single architectural register could map to one of two physical registers. Register banks are selected based on the operating mode or possibly the interrupt level of the CPU.
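
The sharing of the first eight registers could be expressed as an index calculation like the one below. This Python sketch is only one possible arrangement: the `rat_index` function and the row layout are assumptions, and it leaves eight rows of the second bank unused for simplicity.

```python
NREGS, SHARED = 64, 8


def rat_index(bank, areg):
    """Row of the RAT used for architectural register areg in a given bank."""
    if not (0 <= areg < NREGS and bank in (0, 1)):
        raise ValueError("bad bank or register number")
    if areg < SHARED:
        return areg                 # r0..r7 are common to both banks
    return bank * NREGS + areg      # everything else is per bank


print(rat_index(0, 3), rat_index(1, 3))     # same row: shared register
print(rat_index(0, 10), rat_index(1, 10))   # different rows: banked register
```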

Tonight’s quandary is how to get the store data to the load / store queue. It needs to come from the register file. The selected register for the source data is in the ROB. The register data is not immediately available when the entry is queued in the LSQ.

_________________
Robert Finch http://www.finitron.ca


Fri Dec 01, 2023 2:30 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Forgot to copy the current checkpoint to a new one on checkpoint allocate during a branch queue. Copying the current checkpoint turned out to be easy to do. Fixing this increased the core size a little bit.

Got a first trial synthesis done. It looks like the core is about 95,000 LUTs in size. But something is not synthesizing properly yet. Only 36 BRAMs are reported as being used for the register file and it should be 44 BRAMs. A couple of the ports must be incorrectly connected. IIRC the core is somewhat larger than the Thor core, but it is four-wide instead of two-wide. Qupls should still fit in the FPGA. Still to do is adding branch prediction.

_________________
Robert Finch http://www.finitron.ca


Sat Dec 02, 2023 3:14 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
It is looking like Q+ is just a little large for the FPGA. Got another trial synthesis run done and a stripped-down version is about 100k LUTs. It includes only 1 ALU, 1 FPU, 1 MEM. There are 135k LUTs available, but I do not want to use more than about 80k LUTs for the CPU.

Switched back to variable length instructions. Added a length decoding stage in the front end of the CPU. The length decode is broken out over two clock cycles; hopefully it is fast enough.

Changed how vector instructions are encoded. They are encoded with a VEC postfix instruction appended onto a scalar instruction to convert it to a vector instruction. This keeps the scalar instruction lengths shorter than they would have to be if the vector info were encoded in the same instruction. The VEC postfix encodes a mask register and a three-bit format field indicating which registers are vector and which are scalar.

Modified the postfix immediate instructions to use separate opcodes for each length of postfix. Using up opcodes to provide for several different lengths simplifies determining the instruction length. The length of an instruction can then be determined by looking at only the first seven bits of the instruction. Making the modification removed a level of LUTs from the length decoder which will make it faster.

_________________
Robert Finch http://www.finitron.ca


Sun Dec 03, 2023 2:57 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Had to add code to invalidate ROB entries on a branch miss, and also to track the branch miss state. Branch miss processing takes several clock cycles.

Added complexity to the done state. It is now two bits as some instructions can issue to two functional units and the instruction is not done until it is done on both units. These instructions include jump, branch to subroutine, and return instructions. They need to execute on both the ALU and FCU. The scheduler can now also schedule the same instruction on more than one unit.
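
A tiny Python sketch of a two-bit done field follows. The bit assignment and the rule that a single-unit instruction sets both bits at once are assumptions made for the example, as are the names; the post only says that the field is two bits and that both units must finish.

```python
DONE_ALU, DONE_FCU = 0b01, 0b10   # assumed bit assignment


def finish(done, unit_bit, dual):
    """Return the new two-bit done field after one unit reports completion."""
    # A dual-issue instruction sets only its own unit's bit; an ordinary
    # instruction is assumed to set both bits in one go.
    return (done | unit_bit) if dual else 0b11


def is_done(done):
    return done == 0b11


d = finish(0b00, DONE_ALU, dual=True)
print(is_done(d))                       # the FCU half is still outstanding
d = finish(d, DONE_FCU, dual=True)
print(is_done(d))
```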

Added an interface to the micro-code. Micro-code instructions get stuffed into the instruction stream when a macro-instruction is detected. The PC is frozen and the micro-ip pointer takes over.

It is still too early to tell, but it is looking like the core will run at close to 60 MHz; at least that is the goal. Executing a maximum of 4 instructions per cycle, it should be close to 240 MIPS peak. A much more realistic estimate would be 50 MIPS, given costly branch misprediction and a lack of forwarding between units. All this assumes I have not made too many boo-boos.

Q+ may have register zero default to all ones when specified as a vector mask register. This amounts to bypassing the value to -1 instead of 0 in the register file for specific read ports.

_________________
Robert Finch http://www.finitron.ca


Mon Dec 04, 2023 5:06 am