robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2231 Location: Canada
Upgrading to Thor version 2, I just realized I have a number of spots where blocking and non-blocking assignments are performed in the same always block. This is bad practice and is not guaranteed to work the way one would think. It's amazing the core worked as well as it did. It will be quite a bit of work to move the assignments into their own always blocks. An example from the clocked logic:
Code:
    exception_set = `FALSE;
    queued1 = `FALSE;
    queued2 = `FALSE;
    queued1v = `FALSE;
    queued2v = `FALSE;
    qstomp = `FALSE;
    if (branchmiss) begin // don't bother doing anything if there's been a branch miss
        reset_tail_pointers(0);
        seqnum <= 8'h00;
    end
    else begin
        case ({fetchbuf0_v, fetchbuf1_v})
        2'b00: ; // do nothing
        2'b01: enque1(tail0,1,0,1,vele,seqnum);
        2'b10: enque0(tail0,1,0,1,vele,seqnum);
        2'b11: begin
            enque0(tail0,1,1,1,vele,seqnum);
            if (allowq) begin
                enque1(tail1,2,0,0,vele+1,seqnum+8'd1);
            end
            validate_args();
        end
        endcase
        if (queued2)
            seqnum <= seqnum + 8'd2;
        else if (queued1)
            seqnum <= seqnum + 8'd1;
    `ifdef VECTOROPS
        // Once instruction is completely queued reset vector element count to zero.
        // Otherwise increment it according to the number of elements queued.
        if (queued1|queued2)
            vele <= 8'd0;
        else if (queued2v)
            vele <= vele + 2;
        else if (queued1v)
            vele <= vele + 1;
    `endif
    end
The tasks being called also contain a mix of blocking and non-blocking assignments.
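To make the hazard concrete, here is a minimal Python model (not Verilog, and not the Thor source) of how the two assignment kinds behave within one clock edge: a blocking `=` takes effect immediately, while a non-blocking `<=` is deferred until every statement in the block has run. The names `ClockedBlock`, `blocking`, `nonblocking` and `clock_edge` are invented for this sketch.

```python
class ClockedBlock:
    """Toy model of a single always @(posedge clk) block."""

    def __init__(self, **signals):
        self.sig = dict(signals)   # current signal values
        self.pending = {}          # non-blocking updates waiting for end of edge

    def blocking(self, name, value):
        # '=' : visible to the statements that follow in the same edge
        self.sig[name] = value

    def nonblocking(self, name, value):
        # '<=' : scheduled; readers in this edge still see the old value
        self.pending[name] = value

    def end_of_edge(self):
        # all deferred updates land together once the block has finished
        self.sig.update(self.pending)
        self.pending.clear()


def clock_edge(blk, fetch_valid):
    """Same shape as the posted code: flag written with '=', counter with '<='."""
    blk.blocking("queued1", False)
    if fetch_valid:
        blk.blocking("queued1", True)            # as if set inside enque0()
    if blk.sig["queued1"]:                       # sees the flag set just above
        blk.nonblocking("seqnum", blk.sig["seqnum"] + 1)
    seen_later_in_block = blk.sig["seqnum"]      # still the old seqnum
    blk.end_of_edge()
    return seen_later_in_block


blk = ClockedBlock(queued1=False, seqnum=0)
old = clock_edge(blk, fetch_valid=True)
print(old, blk.sig["seqnum"])
```

Inside one block the ordering is well defined, which is probably why the core worked. The trouble starts when a different always block reads a flag like `queued1` on the same clock edge: it may see the value from before or after the blocking write, depending on the order in which the simulator happens to run the blocks.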
_________________Robert Finch http://www.finitron.ca
Fri Apr 21, 2017 4:02 am

robfinch
2017/04/22 Previously I thought that I could make the Thor2 core significantly smaller than the Thor core by removing the requirement of having the target register as one of the operand registers. The target register was required as an operand for predicated operations. However, when I changed the instruction set to remove predication, I added conditional set and move instructions. Duh! It turns out that conditional set and move instructions also require the target register as an operand. So the core wouldn't be a whole heaping lot smaller. The only way to get rid of this is to remove the conditional set and move instructions.
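A small Python sketch of why a conditional move still needs the target register as an input, assuming a core where every queued instruction must produce a result for its target. The function names are invented for illustration and are not Thor2 opcodes.

```python
def cmov(cond, rs_value, rd_old_value):
    """Conditional move: a result is always written to rd, so when cond is
    false the instruction has to reproduce rd's previous value."""
    return rs_value if cond else rd_old_value


def mov(rs_value):
    """Unconditional move: rd's old value never matters."""
    return rs_value


regs = {"r1": 11, "r2": 22}
regs["r1"] = cmov(False, regs["r2"], regs["r1"])   # condition false: r1 keeps 11
kept = regs["r1"]
regs["r1"] = cmov(True, regs["r2"], regs["r1"])    # condition true: r1 becomes 22
moved = regs["r1"]
print(kept, moved)
```

The third argument of `cmov` is the extra read of the target register; `mov` gets by without it.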
Sun Apr 23, 2017 6:09 pm

robfinch
2017/04/23 Ported the Thor system to a new FPGA board. However, bitstream generation results in an error message about signals forming a combinational loop; the DRC (design rule check) fails. This is in the TLB, so I've started working on it again. I can't see where the problem is, so I've broken some of the logic up into smaller blocks. Hopefully that will make the problem more apparent.
Tue Apr 25, 2017 2:55 am

robfinch
I started a new superscalar core that would be a radical change to Thor for version 2, getting rid of the variable-length instructions and predication. Currently the new core has only 16 instructions. Once again it's a 64-bit extension of the RiSC16 core by Dr. Jacob. The first iteration of the core was 39,000 LUTs, much smaller than I figured a 64-bit superscalar would be, so I ambitiously added a whole bunch to it. After adding branch prediction, a bus interface unit, caches and register renaming, the core is about 81,600 LUTs. Yes, I finally figured out how to implement register renaming! I'm sure my solution is far from the best, but it looks like it could be working. I also had to rewrite the elegant asynchronous logic loops for the issue and branch miss logic into something synchronous that would work in an FPGA.
Tue Jul 04, 2017 4:14 am

BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1807
Wow! Will be interested to hear more about this as you progress.
Tue Jul 04, 2017 8:19 am

robfinch
The core doesn't seem to work correctly across branch misses. In my small test program register r1 is being set to zero, but only when a branch miss occurs. For some reason this happens after three iterations of the loop. It looks to me like the branch predictor isn't working correctly either, but that will be left to fix later (there's a minor performance cost for incorrect predictions, but otherwise the core should work even with a broken predictor).
Code:
    start:
    FFFC0010 002AAA9B     ldi   r1,#$AAAA5555
    FFFC0014 55550809
    start1:
    FFFC0018 4C026042     shr   r2,r1,#12
    FFFC001C 003FF71B     sh    r2,$FFDC0600
    FFFC0020 06001014
    FFFC0024 00010844     add   r1,r1,#1
    FFFC0028 FFEC0001     bra   start1
Wed Jul 05, 2017 10:33 am

robfinch
The problem didn't have to do with branch misses. The wrong register was sometimes being freed if two instructions queued at the same time. I've gotten the larger program below to run in simulation, at least up until the display memory is written. The test bench still needs to be modified to support the display memory. I'm trying to get an SoC using the core through implementation so that it can be run in an FPGA. The toolset keeps running into a combinational loop error. There isn't one as far as I can tell.
Code:
    start:
    FFFC0010 7FFCF809     ldi   r31,#$7FFC        ; set stack pointer
    FFFC0014 002AAA9B     ldi   r1,#$AAAA5555     ; pick some data to write
    FFFC0018 55550809
    FFFC001C 00001809     ldi   r3,#0
    start1:
    FFFC0020 4C026042     shr   r2,r1,#12
    FFFC0024 003FF71B     sh    r2,$FFDC0600      ; write to LEDs
    FFFC0028 06001014
    FFFC002C 00010844     add   r1,r1,#1
    FFFC0030 000118C4     add   r3,r3,#1
    FFFC0034 0000079B     cmp   r2,r3,#2000000    ; stop after a few cycles
    FFFC0038 848010C6
    FFFC003C FFE00881     bne   r2,start1
    FFFC0040 003FFF1B     jal   r29,clearTxtScreen
    start3:
    FFFC0044 004CE818
    FFFC0048 FFFC0001     bra   start3
    brkrout:
                          rti
    ;----------------------------------------------------------------------------
    ;----------------------------------------------------------------------------
    clearTxtScreen:
    FFFC004C 00242009     ldi   r4,#$0024
    FFFC0050 003FF71B     sh    r4,LEDS
    FFFC0054 06002014
    FFFC0058 003FF41B     ldi   r1,#$FFD00000     ; text screen address
    FFFC005C 00000809
    FFFC0060 09B01009     ldi   r2,#2480          ; number of chars 2480 (80x31)
    FFFC0064 000021DB     ldi   r3,#%000010000_111111111_0000100000
    FFFC0068 FC201809
    .cts1:
    FFFC006C 00001854     sh    r3,[r1]
    FFFC0070 00040844     add   r1,r1,#4
    FFFC0074 FFFF1084     sub   r2,r2,#1
    FFFC0078 FFF00881     bne   r2,.cts1
    FFFC007C 00000758     jal   [r29]
Fri Jul 07, 2017 8:02 am

robfinch
I've been experimenting with the core both with and without the register renaming I added, and it executes programs in the same length of time in both cases. I think there is something I'm not understanding about the power of the original design. The text mentions that the core does register renaming, but I didn't see any registers being renamed, so I assumed the core didn't implement it. However, it appears to resolve all data dependencies anyway. I suspect I added 30,000 LUTs to the design that don't do anything. In any case, I've made my addition optional with a parameter.
Sat Jul 08, 2017 4:24 am

robfinch
Access to data is quite slow because the data cache isn't single-cycle; it's pipelined. It takes three cycles of waiting before it's known whether there's a hit on the data cache, so external memory reads to fill the cache can't begin for at least three cycles. The kicker is that there must also be a wait of three cycles after the access so that the pipeline can flush; otherwise the next memory access might get confused by a false hit left over from the previous access.
6,000 LUTs were shaved off the size of the core by using a hand-coded register file rather than relying on the synthesizer to implement it. The register file is somewhat complex, needing six read ports and two write ports.
A branching unit was added to the core rather than having branches handled by the ALU. The code supporting flow control operations was moved from the ALU to the branching unit.
Mon Jul 10, 2017 5:18 am

BigEd
This really is a major machine. But it can't be right that register renaming gets you nowhere - unless, perhaps, your memory system is such that you can't dispatch enough instructions to make it worthwhile?? I'm not really clear on superpipelined machinery. I'm not even sure that's the right word!
Mon Jul 10, 2017 8:29 am

robfinch
In the original code registers are effectively renamed using a level of indirection and tracking the source of data for any given register. The register is effectively mapped to a data source. FT64 now uses the original mechanism. I'm pleased I got renaming to work through the register file as well.
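As a rough illustration of renaming by indirection, here is a Python sketch in which each architectural register carries a "source" tag naming the queue entry that will produce its next value. This is my reading of the idea, not the FT64 or RiSC16 source; `SourceTracker`, `queue` and `commit` are hypothetical names, and the sketch assumes in-order commit.

```python
class SourceTracker:
    """Each register maps either to its stored value or to a pending producer."""

    def __init__(self, nregs):
        self.value = [0] * nregs
        self.source = [None] * nregs   # None: the register file copy is current

    def queue(self, entry_id, rd, rs):
        """Queue an instruction that reads rs and writes rd.
        Returns ('value', v) if rs is ready, else ('wait', producer_entry)."""
        src = self.source[rs]          # read the mapping before rd is remapped
        operand = ("value", self.value[rs]) if src is None else ("wait", src)
        self.source[rd] = entry_id     # rd now refers to this queue entry
        return operand

    def commit(self, entry_id, rd, result):
        self.value[rd] = result
        if self.source[rd] == entry_id:    # only the newest producer clears the tag
            self.source[rd] = None


t = SourceTracker(4)
t.value[2] = 5
a = t.queue(0, rd=1, rs=2)   # entry 0 writes r1; r2 is ready
b = t.queue(1, rd=3, rs=1)   # entry 1 reads r1, so it waits on entry 0
c = t.queue(2, rd=1, rs=2)   # entry 2 targets r1 again without stalling
t.commit(0, 1, 50)
print(a, b, c, t.source[1])
```

Entry 2 can be queued while entry 0 is still outstanding because consumers follow the tag rather than the register name, which would explain why an extra renaming stage bought nothing in the experiment.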
I've been busy the last couple of days and added a lot to the core. It now has dual result busses that allow the use of push / pop / call / return instructions. It started out looking like a RISC machine; now it looks more like a CISC machine.
Improved cache load time by using a pipelined access. It's down to 7 clock cycles from 25. There's still at least one clock cycle that could be trimmed off somehow. Memory access is dismally slow with an MMU in place. The problem now is that data is expected back from memory on every clock cycle once ack is asserted. That should make the main memory interface interesting. It's running in sim with just a small scratchpad RAM and ROM.
The core hasn't been run in an FPGA yet. The toolset keeps complaining about timing loops, which I think are fictitious. The so-called timing loops are located in sequentially clocked logic, where I think a loop is impossible. I think it's not a source code problem but a build problem. The loop it finds varies from one build to the next as the source changes slightly.
Wed Jul 12, 2017 11:21 am

robfinch
Work on the C64 compiler for FT64 (Thor 2) has been started. But now that I've spent time on FT64, I'm not sure I like it well enough to continue working on it. I like DSD9 better with its 40-bit instructions and 64 registers, even though that probably isn't as efficient. One thing I miss is an integrated compare to immediate and branch instruction. There's something about having something oddball that seems to appeal to me. I may be working on the FT80 soon. It should be about 25% larger than 64 bits would be. That'd be about 104,000 LUTs. It just might fit on the FPGA.
Fri Jul 14, 2017 8:14 am

robfinch
I did some experimentation with an 80-bit version of the core and found out it's only about 10,000 LUTs larger. I then went back to see how I could make the 64-bit version more palatable. So the branches were totally redone, using up some of the excess branch displacement bits for other things. I added branch on bit set / bit clear, and branch on equal immediate. The other branches now compare two registers instead of comparing a single register to zero. The issue with the previous branch setup was that branches would often take two clock cycles because a compare had to be done beforehand. Many of the compares should now be eliminated. However, the drawback to the new setup is that there are only 11 bits for a branch displacement. I had to double up on the opcodes used for branches in order to get 11 displacement bits. Here is a switch statement implemented with the old branches:
Code:
    cmp   r4,r3,#8
    beq   r4,BIOSMain_14
    cmp   r4,r3,#2
    beq   r4,BIOSMain_15
    cmp   r4,r3,#1
    beq   r4,BIOSMain_16
Here is the same switch statement with the revised branches:
Code:
    beqi  r3,#8,BIOSMain_14
    beqi  r3,#2,BIOSMain_15
    beqi  r3,#1,BIOSMain_16
Note that it uses one less register and should execute at least twice as fast.
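A quick Python check of what the trade-off looks like in numbers, assuming the 11-bit displacement is a signed two's-complement field (the post doesn't say whether it counts bytes or instruction words, so the unit is left open). The instruction lists simply transcribe the two switch sequences above.

```python
def signed_range(bits):
    """Inclusive (min, max) of a two's-complement field of the given width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def in_range(displacement, bits):
    """True if the displacement fits in a signed field of that width."""
    lo, hi = signed_range(bits)
    return lo <= displacement <= hi


old_switch = ["cmp", "beq", "cmp", "beq", "cmp", "beq"]
new_switch = ["beqi", "beqi", "beqi"]

print(signed_range(11))                    # reach of an 11-bit branch
print(in_range(1500, 11))                  # a target 1500 units away does not fit
print(len(old_switch), len(new_switch))    # instructions per three-way switch
```

So a branch reaches 1024 units backward and 1023 forward under that assumption, and the three-way switch drops from six instructions to three.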
Sat Jul 15, 2017 5:52 am

barrym95838
Joined: Tue Dec 31, 2013 2:01 am Posts: 116 Location: Sacramento, CA, United States
robfinch wrote:
... One thing I miss is an integrated compare to immediate and branch instruction. There's something about having something oddball that seems to appeal to me ...

I totally know what you mean, Rob. For me, it's the integrated push-load and store-pull instructions for my 65m32. I can write a print immediate string subroutine in five instructions:
Code:
    \ Print the in-line NUL-terminated string to the console
    \ Must call with JSR, so the string address is stacked.
    \ Trashes: A
    primm:  exy ,s          \ stack y, load string address
    primm2: lda ,y+         \ load string char
            sly [eq]#,n     \ restore y and exit if NUL
            jsr charout     \ print char to console
            bra primm2      \ loop
Mike B.
Sun Jul 16, 2017 5:50 am

robfinch
Arrrgh! The compiler option to display mixed source code and generated code was turned on and... it toasted the peephole optimizations, because the source code is output as comments in between other instructions. Apparently the optimizer then thinks there are additional instructions between code it might otherwise have optimized. It's not a simple fix.
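For illustration only, here is a Python sketch of a peephole pass whose two-instruction window either looks past comment lines or is broken by them. It is not how the C64 compiler is structured (which is presumably why the real fix isn't simple); the `mov` mnemonic and the "mov a,b then mov b,a" pattern are hypothetical stand-ins for whatever the optimizer actually matches.

```python
def is_comment(line):
    return line.lstrip().startswith(";")


def parse_mov(line):
    """Return (dst, src) for a line of the form 'mov dst,src', else None."""
    parts = line.split()
    if len(parts) == 2 and parts[0] == "mov":
        ops = parts[1].split(",")
        if len(ops) == 2:
            return ops[0], ops[1]
    return None


def redundant_pair(first, second):
    """'mov a,b' directly followed by 'mov b,a' makes the second one useless."""
    a, b = parse_mov(first), parse_mov(second)
    return a is not None and b is not None and b == (a[1], a[0])


def peephole(lines, skip_comments=True):
    out = []
    last_instr = None            # index in out of the previous real instruction
    for line in lines:
        if is_comment(line):
            out.append(line)
            if not skip_comments:
                last_instr = None    # comment counted as code: window is broken
            continue
        prev = out[last_instr] if last_instr is not None else None
        if prev is not None and redundant_pair(prev, line):
            continue             # drop the redundant second move
        out.append(line)
        last_instr = len(out) - 1
    return out


code = ["mov r1,r2", "; x = y;", "mov r2,r1", "add r3,r1,r2"]
print(peephole(code, skip_comments=False))   # nothing removed
print(peephole(code, skip_comments=True))    # second mov removed
```

With `skip_comments=False` the interleaved source line hides the pattern, which matches the behavior described above.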
Store instructions hung the machine once branches were introduced, because stores wait until there are no previous flow control instructions in the queue. Unfortunately the branches would sit in the queue even though they were complete. So now, once an instruction commits to the machine's architectural state, it is turned into a NOP to unfreeze the following stores.
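The store hang and its fix can be mimicked with a toy Python queue. The list-based queue, the `Entry` class and the mnemonics in `FLOW_CONTROL` are assumptions made for the sketch; the real core uses a hardware instruction queue.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    op: str              # "branch", "store", "nop", ...
    done: bool = False


FLOW_CONTROL = {"branch", "bra", "jal"}   # hypothetical mnemonic set


def store_may_issue(queue, idx):
    """A store issues only if no older queue entry is a flow control op.
    Note that the check looks at the op alone, not at whether it is done."""
    return not any(e.op in FLOW_CONTROL for e in queue[:idx])


def commit(queue, idx):
    """On commit, overwrite the entry with a NOP so it stops blocking stores."""
    queue[idx] = Entry("nop", done=True)


q = [Entry("branch", done=True), Entry("store")]
before = store_may_issue(q, 1)   # the finished branch still blocks the store
commit(q, 0)
after = store_may_issue(q, 1)    # the NOP no longer counts as flow control
print(before, after)
```

Testing `done` in `store_may_issue` would be another way out, but rewriting the committed entry keeps the store check itself unchanged, which is the approach the post describes.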
Sun Jul 16, 2017 7:23 am