robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2231 Location: Canada
|
Put in some time working on the cc64 compiler now, updating it for Q+ and starting with the Thor version. Added a __sync() intrinsic to allow the boot code to be written in the cc64 language. __sync() is also a fence function and accepts a constant value used to build the instruction. The constant determines what is synchronized with respect to what else. Currently the only value supported by hardware is 0xFFFF, which syncs everything.
_________________Robert Finch http://www.finitron.ca
|
Thu Dec 28, 2023 1:30 pm |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2231 Location: Canada
|
Made many, many changes over the last couple of days, mainly fixes to get things working. Changed to a fixed 40-bit instruction format. There were issues using instruction blocks with the linker, which can relocate code, so instructions are no longer in blocks. This cost two bits in branch displacements.
Changed the shift instructions to shift pair instructions. There is room for three register specs. This allows the shift instructions to perform rotates if the same register is specified for both registers of the pair. Explicit rotate instructions were removed from the instruction set.
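As an illustration of why a shift-pair instruction makes explicit rotates unnecessary, here is a minimal Python sketch of a 64-bit funnel shift. The function names and operand order are assumptions made for the example and are not taken from the Q+ encoding.

```python
MASK64 = (1 << 64) - 1

def shift_pair_left(hi: int, lo: int, amount: int) -> int:
    """Shift the 128-bit pair hi:lo left and return the upper 64 bits.

    Hypothetical model of a shift-pair instruction with three register
    specs: two sources forming the pair, plus the shift amount.
    """
    amount &= 63
    pair = ((hi & MASK64) << 64) | (lo & MASK64)
    return ((pair << amount) >> 64) & MASK64

def rotate_left(value: int, amount: int) -> int:
    """A rotate is a shift pair with the same register in both halves."""
    return shift_pair_left(value, value, amount)
```

When both halves of the pair hold the same value, the bits shifted out of the top are exactly the bits shifted in at the bottom, which is a rotate.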
Changed from using postfix immediates in Q+ to using instructions that can shift immediate values before use. This was to reduce the size of the core, since multiplexing trailing constants in the instruction stream consumed logic. The CPU is now a fixed-length instruction machine. Using independent instructions means more instructions may be packed into memory since there are no longer postfixes to accommodate in instruction blocks. This increases code density too.
The shifted immediate instructions shift in multiples of 20 bits, so only three instructions are needed to create a 64-bit constant in a register. There is provision for constants up to 128 bits to be built up using the shifted immediate instructions. Shifted immediate instructions are limited to ADD, AND, OR, and EOR.
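A rough Python sketch of building a 64-bit constant from three shifted immediates follows. The post does not give the immediate field width, so the sketch assumes the last chunk can carry the remaining 24 bits; the helper names are hypothetical, and OR is used as the combining operation.

```python
MASK64 = (1 << 64) - 1
SHIFT_STEP = 20  # shifts are multiples of 20 bits, per the post

def split_constant(value: int) -> list[tuple[int, int]]:
    """Split a 64-bit constant into (immediate, shift) pairs.

    Assumption: the top chunk holds bits 40..63 (24 bits), so three
    instructions are enough to cover all 64 bits.
    """
    value &= MASK64
    return [
        (value & 0xFFFFF, 0 * SHIFT_STEP),
        ((value >> 20) & 0xFFFFF, 1 * SHIFT_STEP),
        (value >> 40, 2 * SHIFT_STEP),
    ]

def build_constant(chunks: list[tuple[int, int]]) -> int:
    """Model a sequence of OR-with-shifted-immediate instructions."""
    reg = 0
    for imm, shift in chunks:
        reg |= imm << shift
    return reg & MASK64
```

For example, `build_constant(split_constant(0x123456789ABCDEF0))` reassembles the original value from three (immediate, shift) pairs.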
_________________Robert Finch http://www.finitron.ca
|
Sun Dec 31, 2023 9:20 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2231 Location: Canada
|
Had this thought of having a huge number of registers, e.g. 128, in the ISA instead of SIMD style registers, and then allowing the registers to be manipulated in groups. There would be 16 groups of eight registers. A group ADD instruction would then apply the add operation to all eight registers in a group. Started working on Q+ 2024 with a 128-register design. My next thought was why do that? Why not just use vector instructions? It is less expensive in encoding to support a smaller number of registers. So, the ISA spec has a smaller number of registers than what is actually in use. A huge number of registers are used, but they are hidden.
Think I found a low-cost way of implementing SIMD style registers in the Q+ implementation. Vector instructions are being implemented as micro-coded instructions. When a vector instruction is encountered the CPU fires off a sequence of eight operations with incrementing register numbers. The incrementing register numbers are the elements of the vector register. Only the high order bits of the register sequence are needed to specify the vector register number. The micro-code reuses the scalar instructions, but with a different set of registers. Up to four instructions may queue in a given clock cycle, so an eight-element vector register should queue in only two clocks. The operations are then scheduled to execute on available functional units. Currently only one ALU is enabled, so the vector instructions wait their turn at the ALU and execute sequentially. If there were more ALUs they could execute more in parallel. This allows the vector instruction set to be supported without using up a whole lot of resources for a vector ALU. There are 74 architectural registers in Q+ (64 visible in the programming model). Vector registers need eight registers for each vector register. (Eight 64-bit registers make a 512-bit vector register.) The ISA supports up to 64 vector registers, which is 512 registers for vectors. Given that about 3x as many registers are needed for renaming, that results in about 1536 registers needed. That’s a lot. So, my demo Q+ is only going to support 8 vector registers. That many can be fit into the available block RAM space.
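The expansion of one vector instruction into eight scalar operations, and the register-count arithmetic above, can be sketched in Python. The flat register numbering (vector register number times eight plus element index) and all names are assumptions for illustration.

```python
import math

ELEMENTS_PER_VECTOR = 8  # eight 64-bit elements per vector register

def expand_vector_op(opcode: str, vd: int, va: int, vb: int):
    """Expand a vector op into scalar micro-ops with incrementing registers.

    The vector register number supplies the high-order bits; the element
    index supplies the low three bits (assumed numbering).
    """
    return [
        (opcode,
         vd * ELEMENTS_PER_VECTOR + e,
         va * ELEMENTS_PER_VECTOR + e,
         vb * ELEMENTS_PER_VECTOR + e)
        for e in range(ELEMENTS_PER_VECTOR)
    ]

def clocks_to_queue(queue_width: int = 4) -> int:
    """Clocks needed to queue one vector op at queue_width ops per clock."""
    return math.ceil(ELEMENTS_PER_VECTOR / queue_width)

def rename_register_estimate(vector_regs: int, rename_factor: int = 3) -> int:
    """Rough physical register count: elements times the ~3x rename factor."""
    return vector_regs * ELEMENTS_PER_VECTOR * rename_factor
```

With these assumptions, `rename_register_estimate(64)` gives the 1536 figure mentioned above, and `rename_register_estimate(8)` gives 192 for the eight-vector-register demo configuration.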
There are issues ATM synthesizing the core. It seems to be stripping out a lot, about 50%, and the reason has not been determined yet. It is running in simulation, so there must be some difference.
_________________Robert Finch http://www.finitron.ca
|
Fri Jan 05, 2024 5:37 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2231 Location: Canada
|
Added: conditional load and store instructions. A load or store may optionally take place depending on the value of a predicate register.
Bugs: The ROB entry valid flag was not being tested for oddball instructions. This led to an exception being falsely processed.
Trade-offs in Q+: setting the number of vector elements in a vector register to eight 64-bit elements.
Trade-off: One issue is keeping the number of rename registers sane. Each vector element may be renamed. There are 64*8 or 512 registers in the design then. This is better than some odd number of registers. Renaming based on smaller vector elements would require too many rename registers.
Trade-off: making the general-purpose registers part of the vector register file. They are aliased with vector registers zero to seven. The number of general-purpose vector registers is reduced to 55 from 64. One additional vector register is reserved to implement micro-code registers. Aliasing the registers allows vector instructions to be applied to the GPR file. Thus, it is possible to load or store multiple GPRs in blocks of eight registers using vector load and store instructions. The GPR register context can be pushed on the stack with the push vector instruction. It takes only eight instructions to push the entire GPR context.
Code:
push v0
push v1
push v2
push v3
push v4
push v5
push v6
push v7
Will push all 64 GPRs to the stack.
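A small Python sketch of the aliasing arithmetic: it assumes GPR n maps to element n mod 8 of vector register n div 8, which is consistent with the post but not stated in it. The names are hypothetical.

```python
TOTAL_VECTOR_REGS = 64
GPR_ALIASED_VECTOR_REGS = 8   # v0..v7 overlay the GPR file
MICROCODE_VECTOR_REGS = 1     # one vector register reserved for micro-code

def gpr_alias(gpr: int) -> tuple[int, int]:
    """Return (vector register, element) that a GPR is assumed to alias."""
    return divmod(gpr, 8)

def gprs_pushed(vector_regs) -> list[int]:
    """List the GPR numbers covered by pushing the given vector registers."""
    return [v * 8 + e for v in vector_regs for e in range(8)]

# Vector registers left over for general-purpose vector use.
general_purpose_vector_regs = (
    TOTAL_VECTOR_REGS - GPR_ALIASED_VECTOR_REGS - MICROCODE_VECTOR_REGS
)
```

Under this mapping, pushing v0 through v7 covers GPRs 0 through 63, and 64 - 8 - 1 leaves the 55 general-purpose vector registers mentioned above.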
_________________Robert Finch http://www.finitron.ca
|
Mon Jan 08, 2024 12:34 pm |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2231 Location: Canada
|
Got rid of the conditional load / stores in favor of using a generic predicate modifier instruction.
Came up with a cheesy register renamer. It is based on a circular shift register instead of a FIFO and makes use of FPGA SRLs. The difference in size is 50 LUTs versus 5000 LUTs. It works by rotating the shift register every time a register is needed. If the register that comes into view turns out not to be available then the shift register is rotated again and the machine stalls for a clock cycle. It may take several clocks to find an available register. The SRL could potentially be clocked at a much higher clock rate than the CPU, for example five times. That would allow it to skip over unavailable registers relatively quickly.
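The rotate-until-available behaviour can be modelled in a few lines of Python. This is a behavioural sketch only; the class and method names are made up, and the real design is an SRL-based hardware structure rather than a software queue.

```python
from collections import deque

class RotatingRenamer:
    """Behavioural model of a circular-shift-register renamer."""

    def __init__(self, num_regs: int):
        self.ring = deque(range(num_regs))   # physical register tags
        self.available = [True] * num_regs

    def allocate(self):
        """Rotate until an available register comes into view.

        Returns (register, stall_clocks); register is None if every
        register is in use after one full rotation.
        """
        stalls = 0
        for _ in range(len(self.ring)):
            self.ring.rotate(-1)             # one rotation per clock
            candidate = self.ring[0]
            if self.available[candidate]:
                self.available[candidate] = False
                return candidate, stalls
            stalls += 1                      # unavailable: stall a clock
        return None, stalls

    def free(self, reg: int) -> None:
        """Mark a physical register as available again."""
        self.available[reg] = True
```

Clocking the shift register faster than the CPU, as suggested above, would correspond to performing several rotations per CPU clock, which reduces the visible stall count.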
Not having much luck tonight. The CPU is erroneously using registers that are stomped on. The result is that the stack pointer is incorrectly updated, then values are written to the wrong location in memory. It happens because a branch instruction is encountered. The CPU is supposed to restore a checkpoint, which should restore register tags to valid values. But the dang thing does not work right.
_________________Robert Finch http://www.finitron.ca
|
Thu Jan 11, 2024 8:08 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2231 Location: Canada
|
Working on supporting quad precision ops today. Decided to take a break from debugging, only to add more to debug.
Since the CPU is a 64-bit machine with 64-bit registers, some means must be arrived at to perform 128-bit quad precision operations. The solution used is to perform the operation using register pairs. The pair of registers is specified by a combination of the quad precision instruction and an instruction modifier, QFEXT, dedicated to performing quad precision operations. The modifier supplies registers to hold the upper 64 bits of the quad precision value. The quad precision operation then borrows an ALU port to act as a venue to be able to store the quad precision value. A quad precision operation uses the ALU as a holding place to store values. The scheduler sees the quad precision modifier and schedules it for the ALU. The modifier does not complete its execution until the quad precision operation is complete. The scheduler schedules a quad precision operation as a pair of operations, one on an ALU used for passthrough, and one on the floating-point unit.
A couple of the details need to be worked out yet, like what happens when the QFEXT modifier is specified but a quad float operation is not performed. ATM the machine will hang waiting for the op; there should probably be some sort of check to ensure the hang does not happen.
_________________Robert Finch http://www.finitron.ca
|
Sat Jan 13, 2024 4:31 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2231 Location: Canada
|
Created a little C# app that generates several sequences of micro-code statements. I was looking at having to write hundreds of lines of micro-code most of which was similar statements, so I wrote the C# utility program instead.
_________________Robert Finch http://www.finitron.ca
|
Sat Jan 13, 2024 5:03 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2231 Location: Canada
|
Spent some time putting together a multi-precision ALU and FPU. The FPU supports 16, 32, 64, and 128-bit floating-point. Lower precisions are implemented SIMD style. Two times 64 bits, four times 32 bits or eight times 16 bits are processed by one FPU. The FPU with this support is about 50k LUTs. The ALU is a similar size, about 42k LUTs. Some cores needed to be added or updated to support the FPU.
Fewer registers in the ISA pending, to help reduce core size.
Gave some thought to using fewer, really wide registers, as the number of registers in use impacts the size of the core. Making the registers wider instead of having more of them may reduce the amount of support logic, although it does increase bus sizes in the core. Rather than add more functional units to support faster processing of vectors, it is better to make the functional units wider. It is not practical to have many functional units requiring write ports because the register file size is multiplied by the number of write ports. Keeping the number of write ports small (6 or fewer) limits the number of functional units. Two ALUs, two FPUs, and two memory ports would use all the write ports.
The checkpoint valid RAM supporting a full complement of 16 checkpoints is 63k LUTs in size. This occurs because the RAM must be implemented using FF’s and LUT multiplexers. If there were a more efficient way to implement the RAM that would be good. The current core limits the number of checkpoints to three to reduce the size to about 11k LUTs.
_________________Robert Finch http://www.finitron.ca
|
Sun Jan 14, 2024 8:53 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2231 Location: Canada
|
Modified the micro-code so it would work with a 1-way, 2-way or 4-way core. It may work with a 3-way core as well. The sequence of instructions was set to increment by one instead of in groups of four as previously coded.
Wrote yet another RAM module for checkpointing, checkpoint valid RAM #4. This time, if it is correct, it uses a lot fewer LUTs and instead uses 12 block RAMs. A snag has been run into during testing: there are signals which are defaulting to ‘X’ even though they are coded to default to zero. This looks like a simulation bug to me ATM. With the great reduction in LUT usage the core is significantly smaller, meaning more features can be added. It has been configured to support 16 vector registers, and the PRED modifier is enabled. Core size is still about 100k LUTs.
The core has been updated for 32 register support instead of 64 and the documentation is in the process of being updated.
_________________Robert Finch http://www.finitron.ca
|
Tue Jan 16, 2024 7:56 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2231 Location: Canada
|
Q+ now uses 2 ALUs by default, possible because of the LUTs freed up by going to 32 regs instead of 64 regs. There were some issues to work out with the second ALU, but all is good now.
With new daylight, the X’s of the previous night are gone. I have no idea why. My best guess is workstation RAM issues.
Found out the checkpoint RAM was too simple. It requires a lot more block RAM, 68 I think, but it is now capable of supporting up to 1024 physical registers, the full complement for the CPU.
Updated some of the compiler and assembler software for 32-register support.
_________________Robert Finch http://www.finitron.ca
|
Wed Jan 17, 2024 10:08 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2231 Location: Canada
|
The Q+ core runs a little bit better all the time. Switched the boot-up to running Fibonacci to try and work out some issues. It helps to vary the code being run from time to time. It hangs after about 11 micro-seconds, or about 200 instructions.
Latest bugfixes: Micro-code was being triggered regardless of the fact that the instruction was being stomped on. This led to erroneous micro-code execution.
More register bypassing was required for when the target register acts as a source. The previous target register from an immediately preceding instruction needed to be provided.
At reset the interrupt level was being set to seven. This caused the stack pointer to be invalid as the machine stack pointer is loaded at reset assuming no interrupt is present. To fix it the interrupt level was set to zero at reset.
The target register acting as a source for ALU #1 was not connected. This led to copies not working correctly when done on ALU #1.
Stores were executing before a previous conditional branch had resolved, but only if the store was at the front of the queue.
With bug-fixes the core is still about 100k LUTs when synthesized for area.
_________________Robert Finch http://www.finitron.ca
|
Fri Jan 19, 2024 4:56 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2231 Location: Canada
|
Not having much luck tonight. Tried to improve the operation of branches, which are not yet working correctly. After that I tried getting a vector instruction working.
Added bypassing on the checkpoint index for the instruction following a branch where the checkpoint is incremented.
The core seems to split up the sequence of micro-code used to implement a vector. It will queue the first four instructions correctly, then quit queuing because queue space has run out, which causes the remainder of the vector to not queue.
_________________Robert Finch http://www.finitron.ca
|
Sun Jan 21, 2024 7:16 am |
|
|
Findecanor
Joined: Fri Jan 19, 2024 1:06 pm Posts: 14
|
I downloaded and browsed through the ISA documentation from your web site. Some details are missing but I suppose it is a living document.
Having 40-bit large instructions and explicitly parallel bundles though, it reminds me of Itanium, but the Itanium did not always do as much in a single instruction. Designers of some other ISAs have instead optimised for small instructions to increase code density. It's a trade-off. I wonder, are there any standard corpus and tools available to measure instruction usage statistics?
robfinch wrote: Changed the shift instructions to shift pair instructions. There is room for three register specs. This allows the shift instructions to perform rotates if the same register is specified for both registers of the pair.
That's also how Aarch64 does `ror` with an immediate amount. I think I may have seen some other ISA do it that way too.
Something that I would like to see more in ISAs is arithmetic right shift with a rounding mode. Just discarding bits is equivalent to rounding down. Some DSP algorithms depend on rounding up, so you can see such instructions in some SIMD sets, but they are unusual in scalar. The unusual variety that I'd like to see more of would be rounding towards zero: because that is what most integer division instructions do. If the shifted value is negative then it should be rounded up. You will otherwise need multiple instructions to replace a signed DIV by a power of two with a right shift.
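The difference between a plain arithmetic right shift and one that rounds towards zero can be shown with a short Python sketch. The function names are illustrative; the bias-then-shift sequence is the usual multi-instruction replacement for a signed divide by a power of two.

```python
def asr_floor(value: int, amount: int) -> int:
    """Plain arithmetic right shift: discards bits, so it rounds down."""
    return value >> amount

def asr_toward_zero(value: int, amount: int) -> int:
    """Arithmetic right shift that rounds towards zero.

    Negative values get a bias of 2**amount - 1 added before shifting,
    which matches truncating signed division by 2**amount.
    """
    bias = (1 << amount) - 1 if value < 0 else 0
    return (value + bias) >> amount
```

For example, -7 shifted right by one gives -4 with the plain shift, but -3 with the rounding variant, which is what a truncating signed division by two produces.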
|
Tue Jan 23, 2024 5:37 pm |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2231 Location: Canada
|
Quote: I downloaded and browsed through the ISA documentation from your web site. Some details are missing but I suppose it is a living document.
Thank-you for your interest. Yes, it is a work-in-progress document, always a bit outdated. There is no official release yet.
Quote: Having 40-bit large instructions and explicitly parallel bundles though, it reminds me of Itanium, but the Itanium did not always do as much in a single instruction.
I scrapped the instruction bundling idea a while ago; I should really remove it from the docs. Instructions are independent and just sequential now, with no bundling or block header. I like the Itanium and have been thinking of trying to replicate it in an FPGA. It would be very challenging however to get software for it. The 41-bit instructions are a bit of a killer.
Quote: That's also how Aarch64 does `ror` with an immediate amount. I think I may have seen some other ISA do it that way too.
I think that is how the Itanium performed rotates too.
Quote: Something that I would like to see more in ISAs is arithmetic right shift with a rounding mode. Just discarding bits is equivalent to rounding down.
Seems like a good idea to me. There are extra bits available in the shift instruction so a rounding mode could be added.
******
Micro-code was being triggered for the SYS instruction when a cache miss came back with a line of all zeros. Cache miss status was added to the decode of the micro-code address to prevent this.
Having a lot of fun getting branches to work. All the pipeline signals have to be lined up just right. I am using a naming convention for pipeline signals.
_f for fetch stage outputs
_x for extract stage outputs
_d for decode stage outputs
_r for rename stage outputs
_q for enqueue stage outputs
If there was a cache miss at the fetch stage, it must percolate down to the subsequent stages as the pipeline is advanced. A fetch from micro-code is not considered to be a miss even if there is a cache miss.
Also, it is possible that the branch predictor correctly predicted an address. A correct branch prediction should not be stomped on after a branch that stomps on subsequent instructions. The same applies if the correct instructions just happen to be sitting in the pipeline. There seem to be a lot of corner cases and fixing one seems to break a different one.
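The way a fetch-stage cache miss percolates through the stages named by the _f, _x, _d, _r and _q suffixes can be sketched as a simple shift of status flags. This is a toy Python model with hypothetical names, not the actual RTL.

```python
STAGES = ["f", "x", "d", "r", "q"]  # fetch, extract, decode, rename, enqueue

def advance(miss_flags: list[bool], cache_miss: bool,
            from_microcode: bool) -> list[bool]:
    """Advance the pipeline one step and return the new miss flags.

    Each stage inherits the miss status of the stage before it. A fetch
    from micro-code is never treated as a miss, even on a cache miss.
    """
    miss_f = cache_miss and not from_microcode
    return [miss_f] + miss_flags[:-1]

def labelled(miss_flags: list[bool]) -> dict[str, bool]:
    """Name the flags using the stage-suffix convention (miss_f, miss_x, ...)."""
    return {f"miss_{s}": m for s, m in zip(STAGES, miss_flags)}
```

Starting from all flags clear, one advance with a cache miss sets only `miss_f`; each later advance moves that flag one stage further down until it falls off the end of the enqueue stage.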
_________________Robert Finch http://www.finitron.ca
|
Wed Jan 24, 2024 4:33 am |
|
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2231 Location: Canada
|
Still working on branches. Finally got the first sequence of vector instructions to execute. A simple load two vector registers with a constant and add them. No operation masking. 64-bit values only. Test program placed in the boot code:
Code:
start:
  ; set global pointers
  lda gp,_start_bss
  orm gp,_start_bss
  ldi v8,1
  ldi v9,2
  add v10,v8,v9
  bra fibonacci
  nop
  nop
  nop
  nop
A register dump reveals it worked!
Code:
v 8 -> 0: 0000000000000001 1: 0000000000000001 2: 0000000000000001 3: 0000000000000001 4: 0000000000000001 5: 0000000000000001 6: 0000000000000001 7: 0000000000000001 #
v 9 -> 0: 0000000000000002 1: 0000000000000002 2: 0000000000000002 3: 0000000000000002 4: 0000000000000002 5: 0000000000000002 6: 0000000000000002 7: 0000000000000002 #
v 10 -> 0: 0000000000000003 1: 0000000000000003 2: 0000000000000003 3: 0000000000000003 4: 0000000000000003 5: 0000000000000003 6: 0000000000000003 7: 0000000000000003 #
Still working on branches.
_________________Robert Finch http://www.finitron.ca
|
Thu Jan 25, 2024 11:13 am |
|