


 Qupls (Q+) 

Joined: Mon Oct 07, 2019 2:41 am
Posts: 593
Is there a way to hint whether a branch will be taken or not?
I/o-wait: ld a (io); beq a i/o-wait;
with a hint that the branch is often taken.
while(*ch==SPACE) ch++;
The blank stripper's branch is often taken, as the test is bne a continue;
I can't think of a case where the branch is usually not taken other than a switch loop.
loop:
if(case==*table #) goto *table jmp;
get next case; goto loop;


Tue Dec 05, 2023 12:46 am
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
Is there a way to hint whether a branch will be taken or not?
I/o-wait: ld a (io); beq a i/o-wait;
with a hint that the branch is often taken.
while(*ch==SPACE) ch++;
The blank stripper's branch is often taken, as the test is bne a continue;
I can't think of a case where the branch is usually not taken other than a switch loop.
loop:
if(case==*table #) goto *table jmp;
get next case; goto loop;

No, but there is branch prediction. Branch hints have generally fallen out of favor as more transistors have become available. They take up opcode bits and are typically not well used. A branch predictor will pick up on the fact that a branch is taken or not taken most of the time, and later iterations will be just as fast as if a hint had been specified. Currently planned for Qupls are a BTB predictor and a gselect branch predictor.
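To illustrate why a predictor makes hints unnecessary for loops like the ones in the question, here is a minimal sketch of a gselect-style predictor in Python. It is not the Qupls implementation: the class name, table sizes, and counter initialisation are all assumptions chosen for the example. The index is formed by concatenating low PC bits with a global history register, and each table entry is a 2-bit saturating counter.

```python
class GSelectPredictor:
    """Toy gselect predictor (hypothetical sizes, not the Qupls design)."""

    def __init__(self, pc_bits=4, hist_bits=4):
        self.pc_bits = pc_bits
        self.hist_bits = hist_bits
        self.history = 0  # global history shift register, 1 = taken
        # 2-bit saturating counters, initialised to weakly not-taken (1)
        self.counters = [1] * (1 << (pc_bits + hist_bits))

    def _index(self, pc):
        # gselect: concatenate low PC bits with the global history bits
        pc_part = pc & ((1 << self.pc_bits) - 1)
        return (pc_part << self.hist_bits) | self.history

    def predict(self, pc):
        # Counter values 2 and 3 mean "predict taken"
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        mask = (1 << self.hist_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


def run_loop(predictor, pc, iterations):
    """Simulate an always-taken loop branch; return the misprediction count."""
    misses = 0
    for _ in range(iterations):
        if not predictor.predict(pc):
            misses += 1
        predictor.update(pc, True)
    return misses
```

With these assumed sizes, an always-taken branch such as the I/O wait loop mispredicts only while the history register fills up, and every later iteration is predicted correctly without any hint bit in the opcode.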


***
Currently the low-order 12 bits of the PC are the micro-code instruction pointer. This was done so that the micro-code IP could be managed by OS code for interrupts and exceptions. But it looks like there is no need to do this, as the exception return address is stacked on an internal stack. Both the PC and the status register have internal stacks for exception processing. I think the micro-code IP could simply be made part of the status register.
The micro IR register needed to be stacked on exception as well, so the MC IP was made part of that register.

Added the REGS instruction modifier. The modifier causes the following load or store instruction to repeat using the registers specified in the register list bitmask for the source or target register. In theory it can also be applied to other instructions but that was not the intent. It is pretty much useless for other instructions, but a register list could be supplied to the MOV instruction to zero out multiple registers with a single instruction. Or possibly the ADDI instruction could be used to load a constant into multiple registers. I could put code in to disable REGS use with anything other than load and store ops, but why add extra hardware?
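As a rough illustration of what the REGS modifier does to the following load or store, here is a small Python sketch that expands one operation plus a register-list bitmask into the individual operations. The function name is hypothetical, and two details are assumptions rather than anything stated above: that bit i of the mask selects register i, and that each repeat advances the address by 8 bytes.

```python
def expand_regs(op, mask, base_reg, offset=0, step=8):
    """Expand one load/store plus a register-list bitmask into single ops.

    Assumptions (not from the post): bit i of `mask` selects register i,
    and each repeated operation advances the address by `step` bytes.
    """
    ops = []
    reg = 0
    while mask:
        if mask & 1:
            # One repeat of the load/store for this selected register
            ops.append(f"{op} r{reg},{offset}[r{base_reg}]")
            offset += step
        mask >>= 1
        reg += 1
    return ops
```

For example, `expand_regs("sto", 0b1010, 31)` yields one store for r1 and one for r3 at consecutive slots off r31.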

Put together the MPU component, which includes the CPU, PIT, and PIC. I am just working on the CPU for now, but am attempting to fake out proper usage so the entire core can be built through to implementation to get timing information.

Got a bug causing a large part of the logic to be elided. The instruction decoders are showing up as using too few LUTs (7) when it should be more like 400. They are being largely trimmed out of the design. The thing to do is check the inputs and outputs and ensure all connections are correct. I have yet to find the bug.

_________________
Robert Finch http://www.finitron.ca


Tue Dec 05, 2023 4:49 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Trying to run a simulation has revealed the first set of errors and omissions.

Had to put a reset signal in the instruction length decode stage. This stage propagates the cache line, which was all X’s without the reset. This caused simulation issues. The cache lines are now forced to all 1’s, which is the code for a NOP.

Put code in to initialize the last four entries in the TLB to point to the ROM area. This lets the CPU boot without having to process a TLB miss. A TLB miss cannot be properly processed at reset as the page table base register is cleared and there is no page table setup.

Modified the stomp logic that resets the queue tail pointers. It must not reset the pointers if nothing was stomped on, which is possible if the queue was empty after the branch instruction, no following instructions having queued yet.
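The tail-pointer rule can be sketched in Python as follows. This is only an illustration under assumptions: the queue is modelled as a circular list of entries with a `valid` flag, `tail` is the next free slot, and the function name and data layout are hypothetical rather than taken from the core.

```python
def stomp(queue, tail, branch_idx):
    """Invalidate entries younger than the branch in a circular queue.

    Returns the new tail. The tail is moved back to the first stomped
    entry, and is left alone when nothing was stomped (for example when
    no instruction had queued after the branch yet).
    """
    n = len(queue)
    first_stomped = None
    i = (branch_idx + 1) % n
    while i != tail:
        if queue[i]["valid"]:
            queue[i]["valid"] = False  # stomp the younger instruction
            if first_stomped is None:
                first_stomped = i
        i = (i + 1) % n
    return tail if first_stomped is None else first_stomped
```

With entries 0 to 3 valid and the tail at 4, a branch in slot 1 stomps slots 2 and 3 and moves the tail back to 2, while a branch in slot 3 stomps nothing and leaves the tail at 4.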

Instructions are fetched and the queue fills up to the point of the first load / store instruction. Load / stores are not executing properly yet, so the CPU stalls at that point in the queue.
It is good to see the ROB filling up with 32 entries. This is four times the size of the ROB for Thor.

The mem scheduler was issuing zombie load operations, that is, issuing loads when it should not have been. Another flag was needed to indicate when to perform memory ops.

A pipeline stall signal was not being fed to the register renamer causing the renamer to stream out a bunch of rename registers unnecessarily.

_________________
Robert Finch http://www.finitron.ca


Thu Dec 07, 2023 7:00 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Got the first handful of instructions executing in simulation. Now the assembler needs to be updated to correspond to Qupls instructions.

Got the core to fit after some finagling, about 95% full.
Cut the size of the RAT in half by supporting only a single register bank. Two banks were being supported because the block RAM has the capacity. ½ the block RAM is wasted now.
Found out that the register renamer was a larger implementation than it needed to be. It was re-written to use FIFOs instead of a bitmap. The result is about 3,000 LUTs; the bitmap version was about 13,000 LUTs.
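The difference between the two renamer styles can be sketched in Python. Both classes below are hypothetical illustrations, not the Qupls code, and the reserved physical register 0 is an assumption. The FIFO version only ever touches the head and tail of a queue, while the bitmap version needs a find-first-set over every bit on each allocation, which in hardware becomes a wide priority encoder. That may be one reason for the size difference, though the post does not say.

```python
from collections import deque


class FifoFreeList:
    """Free physical registers kept in a FIFO: allocate pops, release pushes."""

    def __init__(self, num_phys, reserved=1):
        # Assumption: the lowest `reserved` registers are never handed out
        self.free = deque(range(reserved, num_phys))

    def allocate(self):
        # None means no free register is available (the pipeline would stall)
        return self.free.popleft() if self.free else None

    def release(self, preg):
        self.free.append(preg)


class BitmapFreeList:
    """Free physical registers tracked as one bit per register."""

    def __init__(self, num_phys, reserved=1):
        self.bits = ((1 << num_phys) - 1) & ~((1 << reserved) - 1)

    def allocate(self):
        if self.bits == 0:
            return None
        # Find-first-set: isolate the lowest set bit and take its position
        preg = (self.bits & -self.bits).bit_length() - 1
        self.bits &= ~(1 << preg)
        return preg

    def release(self, preg):
        self.bits |= 1 << preg
```

Both hand out each free register exactly once. They differ in order after a release: the FIFO returns the released register last, the bitmap returns the lowest-numbered free register first.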

There are 40 logic levels in the signal on the critical timing path. It has to do with branches. Obviously, the number of logic levels needs to be reduced. The logic could be split up across multiple instructions. Separate compare and branch instructions are used by many architectures. Another approach is to insert pipeline registers into the logic, turning the branch operation into a multi-cycle operation. Having separate instructions would essentially turn the branch operation into a multi-cycle operation and it costs code density. Q+ will use multiple clocks for a mis-predicted branch and conserve code density. If predicted correctly, the branch latency will be hidden because other instructions can execute at the same time.

_________________
Robert Finch http://www.finitron.ca


Fri Dec 08, 2023 3:41 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Updated the assembler so it is now possible to assemble programs. It is not completely up-to-date yet, but works for simple programs.

The outputs of the scheduler were not registered, so it was combinational logic driving other logic. Adding a set of registers improved timing.

Made the reglist modifier an option rather than being fixed in the machine. It turns out the register list logic is on the critical timing path. Even with the signal pipelined, it slows down the CPU too much. Most of the time taken is due to routing. There are about 20 logic levels for the signal. It does work, but at a 40 MHz fmax, and I am trying to get to 66 MHz now.

_________________
Robert Finch http://www.finitron.ca


Sat Dec 09, 2023 4:28 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Removed a 1024-bit wide bypass multiplexor at the output of I$ RAM. Bypassing is not needed.

Added code to defer TLB update by the page walker until the instruction is at the commit stage. This is to mitigate Spectre attacks.

Found another signal that could be improved, this time with 34 logic levels. The output of the branch predictor was not registered leading to timing issues due to the increased number of logic levels in the path. It turns out the signal needed to be registered anyway for pipeline alignment.

Slow going, it takes about six hours to build the system to get a timing report.

Added a quick immediate load instruction, which is 24 bits long, allowing a target register to be loaded with an 11-bit immediate.

_________________
Robert Finch http://www.finitron.ca


Tue Dec 12, 2023 5:07 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Added a set of three-byte opcodes which are immediate-operate instructions with the immediate assumed to be zero unless overridden with a postfix instruction. This shaves a byte off the size of many instructions requiring postfixes. In combination with a 64-bit postfix, only 12 bytes are needed.

Forgot to include SUBFI, subtract from immediate, in the ALU.

Added a REGC postfix which adds register C and may complement or negate the register. This is needed for some instructions which could not encode three registers in the instruction.

_________________
Robert Finch http://www.finitron.ca


Wed Dec 13, 2023 5:08 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Converted the core to use 40-bit instruction parcels instead of instruction lengths that vary by byte. This is to alleviate timing issues with the length decoders. Also used an instruction block approach. The last instruction in the block must fit entirely in the block or it is moved to the next block. Using a block approach allows the length decoders to be at only 12 fixed positions in the block. Otherwise, 64 decoders would be required, and timing would not be met. The overhead of padding and block headers is only about 7.6% of the code size for the boot ROM.
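The block layout can be sketched with a little arithmetic in Python. The parcel size and the 12 decoder positions come from the description above. The 64-byte block size is an assumption inferred from the remark about 64 decoders, and the function names are hypothetical.

```python
PARCEL_BYTES = 5        # 40-bit parcel (from the post)
PARCELS_PER_BLOCK = 12  # fixed decoder positions (from the post)
BLOCK_BYTES = 64        # assumption: inferred from the "64 decoders" remark


def pack(lengths):
    """Place instructions (lengths given in parcels) into blocks.

    Returns a list of (block, slot) pairs. An instruction that would not
    fit entirely in the current block starts at slot 0 of the next block,
    leaving the remaining slots as padding.
    """
    placements = []
    block, slot = 0, 0
    for n in lengths:
        if n > PARCELS_PER_BLOCK:
            raise ValueError("instruction longer than a block")
        if slot + n > PARCELS_PER_BLOCK:
            block += 1
            slot = 0
        placements.append((block, slot))
        slot += n
    return placements


def non_parcel_bytes_per_block():
    """Bytes in a block not covered by the 12 parcels (header space, if any)."""
    return BLOCK_BYTES - PARCELS_PER_BLOCK * PARCEL_BYTES
```

Under the 64-byte assumption, 12 parcels occupy 60 bytes and 4 bytes remain per block, a fixed cost of 6.25%. The rest of the reported 7.6% would then presumably be padding from instructions moved to the next block.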

_________________
Robert Finch http://www.finitron.ca


Thu Dec 14, 2023 3:56 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Bootup ground to a halt in simulation because one of the first statements is a write to LEDs. LEDs are an I/O device and I/O for LEDs was not mapped into the address space yet. An initial page table needed to be set up. The TLB startup had to be modified to include the TLB in the initial mappings, the TLB being itself an I/O device. Storage space for the first page table also had to be supplied in the SoC.

Got confused about a bad branch displacement. The target address just coincidentally happened to be the same address as was being loaded into the stack pointer. Talk about coincidences. I thought there was an issue with the way register values were being passed around. It turned out the branch instruction was simply not encoded correctly by the assembler.

_________________
Robert Finch http://www.finitron.ca


Sun Dec 17, 2023 8:31 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Working on branches today.

Had valid / invalid markings in the RAT reversed. This led to a hang waiting for a register never marked valid.

Having challenges getting branch displacements calculated. It looks like the assembler has trouble evaluating huge 128-bit numbers. So, things had to be reduced to 64 bits.

_________________
Robert Finch http://www.finitron.ca


Mon Dec 18, 2023 8:49 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Just about went nuts trying to figure out why the first byte of an instruction was being nopped out. Spent hours debugging the hardware then finally figured out it was the binary generated by the assembler. There was a bad relocation type coded causing a relocation where there should not be one.

Q+ is working much better now. The current holdup is an issue with the data cache.

There are still pipeline issues to fix.

_________________
Robert Finch http://www.finitron.ca


Wed Dec 20, 2023 8:40 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Fixed a data cache issue: the transaction id was not being set on a bus retry response, so the requester did not know which transaction needed to be retried. This caused the bus to hang.

Somehow register zero is getting allocated by the renamer. In theory it should not be possible, so some displays were added to the renamer to track the bug down.

Included code to force the register specs to zero if there is a corresponding immediate value instead. This keeps the renamer from mapping the register.

Figured out why entries in the RAT were being zeroed out: the new checkpoint was not being updated properly from the old one.

Now the core hangs because of an orphaned register. It ends up waiting for a register value from a register that is no longer valid. The issue has been identified but not resolved yet.

_________________
Robert Finch http://www.finitron.ca


Thu Dec 21, 2023 6:30 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
The register valid bit was being associated with the architectural register and not the physical register. This caused things to sometimes work and sometimes not. This bit is handled in the RAT. Also updated the core to handle stomped-on target registers. They needed to be marked valid, otherwise the core would hang waiting for them. The stomped-on registers are not freed until the commit stage. In fact, all the register free-up is being done at the commit stage, which increases register usage. The registers could be freed during the execute stage, but that would create multiple places where registers are freed, requiring more write ports in the renamer. It is simpler to reclaim registers at commit.
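Why the valid bit has to follow the physical register can be shown with a small Python model. This is a sketch under assumptions, not the core's RAT: the class and method names are hypothetical, and checkpoints are left out. The key point is that two in-flight writers of the same architectural register get different physical registers, and a consumer must wait on the one it was renamed to.

```python
class RenameModel:
    """Toy rename table with valid bits indexed by physical register."""

    def __init__(self, num_arch, num_phys):
        self.rat = list(range(num_arch))   # architectural -> physical
        self.valid = [True] * num_phys     # indexed by PHYSICAL register
        self.free = list(range(num_arch, num_phys))

    def rename_dest(self, areg):
        """Map a destination to a new physical register; return (new, old)."""
        new = self.free.pop(0)
        old = self.rat[areg]
        self.rat[areg] = new
        self.valid[new] = False  # value not produced yet
        return new, old

    def writeback(self, preg):
        self.valid[preg] = True

    def stomp(self, preg):
        # A stomped target never gets a result, so mark it valid so that
        # nothing hangs waiting on it; it is still freed only at commit.
        self.valid[preg] = True

    def commit(self, old_preg):
        """Reclaim the previous mapping; the only place registers are freed."""
        self.free.append(old_preg)
```

If two instructions write r2 back to back and only the first has written back, `valid[rat[2]]` is still false, so a later reader of r2 keeps waiting. A single valid bit per architectural register would have been set by the first writeback and released that reader too early.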

Had to add some more NOPs in the reset routine so that the TLB had time to initialize. It was creating TLB miss exceptions triggering a page walk because the IP register was not set to the correct reset address yet.

The core works much better now. Currently it hangs accessing an address that is not available through the MMU yet. A table walk is triggered but does not work.

I have made a lot of fixes to the core and have not rebuilt it through to implementation. It may no longer fit.

Seeing an IPC during boot-up of about 0.5 or so, with 50% or more of clock cycles used for cache loads.

_________________
Robert Finch http://www.finitron.ca


Fri Dec 22, 2023 2:43 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Broke the 1.0 IPC barrier! Got up to about 1.75 instructions per clock! The stat is a bit misleading in terms of performance because most of the instructions executed were NOPs. The loop, clearing a section of memory, shown below, is a bit of a worst-case scenario. The branch is pushed off onto the next cache line from the remainder of the loop. So, it is ping-ponging between cache lines.

Code:
----- Stats -----
Clock ticks:                 2760 Instructions:          4829:         4748 IPC: 1.749638
I-Cache hit clocks:                 2646
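As a quick check of the figures in the stats dump, the IPC is simply the instruction count divided by the clock ticks. The short Python sketch below recomputes it. Treating "I-Cache hit clocks" as a fraction of all clock ticks is an assumption about what that counter means.

```python
clock_ticks = 2760
instructions = 4829
icache_hit_clocks = 2646

# Instructions per clock, as reported in the stats line
ipc = instructions / clock_ticks

# Assumption: the hit counter is comparable to the total tick count
hit_fraction = icache_hit_clocks / clock_ticks

print(f"IPC: {ipc:.6f}")
print(f"I-cache hit clocks: {hit_fraction:.1%}")
```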


Code:
02:000000000000001E 0401000000             62:    ldi   a1,0                              # number of entries to clear *8
                                           63: .clrpgtbl:
02:0000000000000023 53400000C03C0000       64:    sto   r0,pgtbl[a1]
02:000000000000002B F8FF3C00000000
02:0000000000000032 0441400000             65:    add   a1,a1,8
02:0000000000000037 FFFFFFFFFFFF3700       66:    bltu a1,8192*8,.clrpgtbl
02:000000000000003F 00285900FAFF3C00
02:0000000000000047 000100

_________________
Robert Finch http://www.finitron.ca


Sat Dec 23, 2023 7:54 am
Profile WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1783
robfinch wrote:
Seeing an IPC during boot-up of about 0.5 or so, with 50% or more of clock cycles used for cache loads.


I suppose that combination of facts is cause for optimism!

robfinch wrote:
Broke the 1.0 IPC barrier! Got up to about 1.75 instructions per clock! The stat is a bit misleading in terms of performance because most of the instructions executed were NOPs.


Congratulations!


Sat Dec 23, 2023 8:42 am
Profile