 Thor Core / FT64 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Upgrading to Thor version 2, I just realized I have a number of spots where blocking and non-blocking assignments are performed in the same always block. This is bad practice and is not guaranteed to work the way one would think. It's amazing the core worked as well as it did. It will be quite a bit of work to move the assignments to their own always blocks.

Example in the following clocked logic:
Code:
    exception_set = `FALSE;
    queued1 = `FALSE;
    queued2 = `FALSE;
    queued1v = `FALSE;
    queued2v = `FALSE;
    qstomp = `FALSE;
    if (branchmiss) begin // don't bother doing anything if there's been a branch miss
        reset_tail_pointers(0);
        seqnum <= 8'h00;
    end
    else begin
        case ({fetchbuf0_v, fetchbuf1_v})
        2'b00: ; // do nothing
        2'b01:  enque1(tail0,1,0,1,vele,seqnum);
        2'b10:  enque0(tail0,1,0,1,vele,seqnum);
        2'b11:  begin
                enque0(tail0,1,1,1,vele,seqnum);
                if (allowq) begin
                    enque1(tail1,2,0,0,vele+1,seqnum+8'd1);
                end
                validate_args();
                end
        endcase
        if (queued2)
          seqnum <= seqnum + 8'd2;
        else if (queued1)
          seqnum <= seqnum + 8'd1;
`ifdef VECTOROPS
        // Once instruction is completely queued reset vector element count to zero.
        // Otherwise increment it according to the number of elements queued.
        if (queued1|queued2)
          vele <= 8'd0;
        else if (queued2v)
          vele <= vele + 2;
        else if (queued1v)
          vele <= vele + 1;
`endif
    end


Tasks being called also have a mix of blocking and non-blocking assignments.
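
To make the hazard concrete, here is a rough behavioural model in Python (not Verilog, and not the core's actual code) of the two assignment kinds inside one clocked block. It assumes the usual simulator semantics: a blocking assignment (`=`) is visible to later statements immediately, while a non-blocking assignment (`<=`) is only scheduled and lands at the end of the time step. The names `clocked_block`, `regs`, and `pending` are invented for the illustration.

```python
def clocked_block(regs, fetch_valid):
    """Model one clock edge. regs holds the current register values."""
    pending = {}   # non-blocking (<=) updates, applied at the end of the step
    queued = 0     # blocking (=) temporary, visible to later statements at once

    for valid in fetch_valid:
        if valid:
            queued += 1          # stands in for the queued1/queued2 flags

    # Later statements already see the blocking temporary...
    if queued:
        pending["seqnum"] = (regs["seqnum"] + queued) & 0xFF   # seqnum <= ...

    # ...but any read of seqnum inside this block still sees the old value.
    seen_in_block = regs["seqnum"]

    regs.update(pending)         # end of time step: non-blocking updates land
    return seen_in_block


regs = {"seqnum": 5}
old = clocked_block(regs, [True, True])
print(old, regs["seqnum"])   # 5 7
```

Within a single block the mix is still predictable, as the model shows. The trouble usually starts when another always block reads one of the blocking temporaries on the same clock edge, because the order in which the blocks run is not defined.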

_________________
Robert Finch http://www.finitron.ca


Fri Apr 21, 2017 4:02 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
2017/04/22
Previously I thought that I could make the Thor2 core significantly smaller than the Thor core by removing the requirement of having the target register as one of the registers required for operands. The target register was required for predicated operations. However, when I changed the instruction set to remove predication, I added conditional set and move instructions. Duh! It turns out that conditional set and move instructions also require the target register as an operand. So the core wouldn’t be a whole heaping lot smaller. The only way to get rid of this requirement is to remove the conditional set and move instructions.
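
The reason is easy to see in a small model. The sketch below is plain Python rather than the core's logic, and `cmov` is a hypothetical helper: when the condition is false the instruction still has to produce a result, which is the old value of the target, so the target must be read like any other source.

```python
def cmov(regfile, rd, rs, cond_reg):
    """Conditional move: rd = rs if cond_reg != 0, otherwise rd keeps its value.

    Returns the set of registers that had to be read.
    """
    reads = {rs, cond_reg, rd}      # rd is needed for the 'condition false' case
    old_rd = regfile[rd]
    regfile[rd] = regfile[rs] if regfile[cond_reg] != 0 else old_rd
    return reads


regs = [0] * 32
regs[1], regs[2], regs[3] = 111, 222, 0
needed = cmov(regs, rd=1, rs=2, cond_reg=3)
print(regs[1], sorted(needed))   # 111 [1, 2, 3]
```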

_________________
Robert Finch http://www.finitron.ca


Sun Apr 23, 2017 6:09 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
2017/04/23
Ported the Thor system to a new FPGA board. However, bitstream generation results in an error message about signals forming a combinational loop; the DRC (design rule check) fails. This is in the TLB, so I’ve started working on it again. I can’t see where the problem is, so I’ve broken some of the logic up into smaller blocks. Hopefully that will make the problem more apparent.

_________________
Robert Finch http://www.finitron.ca


Tue Apr 25, 2017 2:55 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I started a new superscalar core that would be a radical change to Thor for version 2. It gets rid of the variable-length instructions and predication. Currently the new core has only 16 instructions.
Once again it’s a 64-bit extension of the RiSC16 core by Dr. Jacob. The first iteration of the core was 39,000 LUTs, much smaller than I figured a 64-bit superscalar would be. So I ambitiously added a whole bunch to it. After adding branch prediction, a bus interface unit, caches and register renaming, the core is about 81,600 LUTs. Yes, I finally figured out how to implement register renaming! I’m sure my solution is far from the best, but it looks like it could be working. I also had to rewrite the elegant asynchronous logic loops for the issue and branch miss logic into something synchronous that would work in an FPGA.

_________________
Robert Finch http://www.finitron.ca


Tue Jul 04, 2017 4:14 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Wow! Will be interested to hear more about this as you progress.


Tue Jul 04, 2017 8:19 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
The core doesn’t seem to work correctly across branch misses. In my small test program register r1 is being set to zero, but only when a branch miss occurs. For some reason this happens after three iterations of the loop. It looks to me like the branch predictor isn’t working correctly either, but that will be left to fix later (there’s a minor performance cost for incorrect predictions, but otherwise the core should work even with a broken predictor).
Code:
                           start:
FFFC0010 002AAA9B       ldi      r1,#$AAAA5555
FFFC0014 55550809
                           start1:
FFFC0018 4C026042       shr      r2,r1,#12
FFFC001C 003FF71B       sh      r2,$FFDC0600
FFFC0020 06001014
FFFC0024 00010844       add      r1,r1,#1
FFFC0028 FFEC0001       bra      start1

_________________
Robert Finch http://www.finitron.ca


Wed Jul 05, 2017 10:33 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
The problem didn’t have to do with branch misses. The wrong register was sometimes being freed if two instructions queued at the same time. I’ve gotten this larger program to run in simulation, at least up until the display memory is written. The test bench still needs to be modified to support the display memory. I’m trying to get an SoC using the core through implementation so that it can be run in an FPGA. The toolset keeps running into a combinational loop error. There isn’t one as far as I can tell.
Code:
                           start:
FFFC0010 7FFCF809       ldi      r31,#$7FFC      ; set stack pointer
FFFC0014 002AAA9B       ldi      r1,#$AAAA5555   ; pick some data to write
FFFC0018 55550809
FFFC001C 00001809       ldi      r3,#0
                           start1:
FFFC0020 4C026042       shr      r2,r1,#12
FFFC0024 003FF71B       sh      r2,$FFDC0600   ; write to LEDs
FFFC0028 06001014
FFFC002C 00010844       add      r1,r1,#1
FFFC0030 000118C4       add      r3,r3,#1
FFFC0034 0000079B       cmp      r2,r3,#2000000   ; stop after a few cycles
FFFC0038 848010C6
FFFC003C FFE00881       bne      r2,start1
FFFC0040 003FFF1B       jal      r29,clearTxtScreen
start3:
FFFC0044 004CE818
FFFC0048 FFFC0001       bra      start3
                           
                           brkrout:
                              rti
                           
                           ;----------------------------------------------------------------------------
                           ;----------------------------------------------------------------------------
                           clearTxtScreen:
FFFC004C 00242009          ldi      r4,#$0024
FFFC0050 003FF71B          sh      r4,LEDS
FFFC0054 06002014
FFFC0058 003FF41B          ldi      r1,#$FFD00000   ; text screen address
FFFC005C 00000809
FFFC0060 09B01009          ldi      r2,#2480      ; number of chars 2480 (80x31)
FFFC0064 000021DB          ldi      r3,#%000010000_111111111_0000100000
FFFC0068 FC201809
                           .cts1:
FFFC006C 00001854          sh      r3,[r1]
FFFC0070 00040844          add      r1,r1,#4
FFFC0074 FFFF1084          sub      r2,r2,#1
FFFC0078 FFF00881          bne      r2,.cts1
FFFC007C 00000758          jal      [r29]

_________________
Robert Finch http://www.finitron.ca


Fri Jul 07, 2017 8:02 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I’ve been experimenting with and without the register renaming I added, and the core executes programs in the same length of time in both cases. I think there is something I’m not understanding about the power of the original design. In the text it’s mentioned that the core does register renaming, but I didn’t see any registers being renamed, so I assumed the core didn’t implement it. However, it appears to resolve all data dependencies anyway. I suspect I added 30,000 LUTs to the design that don’t do anything. In any case I’ve made my addition optional with a parameter.

_________________
Robert Finch http://www.finitron.ca


Sat Jul 08, 2017 4:24 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Access to data is quite slow because the data cache isn’t single cycle; it’s pipelined. It takes three cycles of waiting before it’s known whether there’s a hit on the data cache, so external memory reads to fill the cache can’t begin for at least three cycles. The kicker is that there must also be a wait of three cycles after the access so that the pipeline can flush; otherwise the next memory access might get confused by a false hit left over from the previous access.

6,000 LUTs were shaved off the size of the core by hand coding the register file rather than relying on the synthesizer to implement it. The register file is somewhat complex, needing six read ports and two write ports.
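
For anyone wondering what a 6-read / 2-write register file involves, here is a behavioural sketch in Python. It is not the hand-coded Verilog. It models one commonly used FPGA approach, a bank per write port plus a "live value table" that remembers which bank holds the newest copy of each register; whether the core does it this way is an assumption. The class name `RegFile`, the 32-register default, and the rule that the higher-numbered write port wins a conflict are all made up for the example.

```python
class RegFile:
    """Behavioural model of a register file with 6 read and 2 write ports."""
    NUM_READ, NUM_WRITE = 6, 2

    def __init__(self, nregs=32):
        # One bank per write port; lvt[r] records which bank holds the newest r.
        self.banks = [[0] * nregs for _ in range(self.NUM_WRITE)]
        self.lvt = [0] * nregs

    def read(self, addrs):
        """Read up to six registers in one cycle."""
        assert len(addrs) <= self.NUM_READ
        return [self.banks[self.lvt[a]][a] for a in addrs]

    def write(self, writes):
        """writes: list of (addr, value), one entry per write port.

        If both ports target the same register, the higher port wins (assumed).
        """
        assert len(writes) <= self.NUM_WRITE
        for port, (addr, value) in enumerate(writes):
            self.banks[port][addr] = value
            self.lvt[addr] = port


rf = RegFile()
rf.write([(1, 10), (2, 20)])
print(rf.read([1, 2, 1, 2, 0, 1]))   # [10, 20, 10, 20, 0, 10]
```

In hardware each bank would typically also be replicated once per read port, since FPGA memories have a limited number of ports; the model skips that because a Python list can be read any number of times.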

A branching unit was added to the core rather than have branches handled by the ALU. The code supporting flow control operations was moved from the ALU to the branching unit.

_________________
Robert Finch http://www.finitron.ca


Mon Jul 10, 2017 5:18 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
This really is a major machine. But it can't be right that register renaming gets you nowhere - unless, perhaps, your memory system is such that you can't dispatch enough instructions to make it worthwhile?? I'm not really clear on superpipelined machinery. I'm not even sure that's the right word!


Mon Jul 10, 2017 8:29 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
In the original code registers are effectively renamed using a level of indirection and tracking the source of data for any given register. The register is effectively mapped to a data source. FT64 now uses the original mechanism. I'm pleased I got renaming to work through the register file as well.
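
A rough sketch of that kind of indirection, in Python and with invented names (`SourceTracker`, `enqueue`, `commit`): each architectural register is mapped either to its value in the register file or to the tag of the queue entry that will produce it. This is only an illustration of the general idea, not the original code.

```python
class SourceTracker:
    """Maps each architectural register to where its newest value comes from."""

    def __init__(self, nregs=32):
        self.regfile = [0] * nregs
        self.source = [None] * nregs   # None: value is in regfile; else a queue tag

    def enqueue(self, tag, rd, srcs):
        """Record where each operand will come from, then claim rd for this entry."""
        operands = [("tag", self.source[r]) if self.source[r] is not None
                    else ("val", self.regfile[r]) for r in srcs]
        self.source[rd] = tag
        return operands

    def commit(self, tag, rd, value):
        """Write the result back; release rd only if no newer writer claimed it."""
        self.regfile[rd] = value
        if self.source[rd] == tag:
            self.source[rd] = None


t = SourceTracker()
t.regfile[2] = 5
ops0 = t.enqueue(0, rd=1, srcs=[2])   # r2 is read as a value
ops1 = t.enqueue(1, rd=2, srcs=[1])   # r1 will come from entry 0
ops2 = t.enqueue(2, rd=1, srcs=[2])   # r2 will come from entry 1
t.commit(0, rd=1, value=100)
print(ops2, t.source[1])              # [('tag', 1)] 2
```

Because every consumer waits on a tag rather than on a register name, the second write to r1 cannot disturb entry 1, which is the effect renaming is meant to provide.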

I've been busy the last couple of days and added a lot to the core. It now has dual result busses that allow the use of push / pop / call / return instructions. It started out looking like a RISC machine; now it looks more like a CISC machine.

Improved cache load time by using a pipelined access. It's down to 7 clock cycles from 25. There’s still at least one clock cycle that could be trimmed off somehow. Memory access is dismally slow with an MMU in place. The catch now is that data is expected back from memory on every clock cycle once ack is asserted. That should make the main memory interface interesting. It's running in sim with just a small scratchpad RAM and ROM.

The core hasn't been run in an FPGA yet. The toolset keeps complaining about timing loops, which I think are fictitious. The so-called timing loops are located in sequentially clocked logic, where I think a loop is impossible. I think it's not a source code problem but a build problem. The loop it finds varies from one build to the next as the source changes slightly.

_________________
Robert Finch http://www.finitron.ca


Wed Jul 12, 2017 11:21 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Work on the C64 compiler for FT64 (Thor 2) has started. But now that I've spent time on FT64, I'm not sure I like it well enough to continue working on it. I like DSD9 better with its 40-bit instructions and 64 registers, even though that probably isn't as efficient. One thing I miss is an integrated compare to immediate and branch instruction. There's something about having something oddball that seems to appeal to me. I may be working on the FT80 soon. It should be about 25% larger than the 64-bit core would be. That'd be about 104,000 LUTs. It just might fit on the FPGA.

_________________
Robert Finch http://www.finitron.ca


Fri Jul 14, 2017 8:14 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I did some experimentation with an 80-bit version of the core and found out it’s only about 10,000 LUTs larger. I then went back to see how I could make the 64-bit version more palatable. So the branches were totally redone, using up some of the excess branch displacement bits for other things. I added branch on bit set / bit clear, and branch on equal immediate. The other branches now compare two registers instead of comparing a single register to zero. The issue with the previous branch setup was that branches would often take two clock cycles because a compare was done beforehand. Many of the compares should now be eliminated. However, the drawback to the new setup is that there are only 11 bits for a branch displacement. I had to double up on the opcodes used for branches in order to get 11 displacement bits.
Here is a switch statement implemented with the old branches:
Code:
            cmp     r4,r3,#8
            beq     r4,BIOSMain_14
            cmp     r4,r3,#2
            beq     r4,BIOSMain_15
            cmp     r4,r3,#1
            beq     r4,BIOSMain_16


Here is the same switch statement with the revised branches:
Code:
            beqi    r3,#8,BIOSMain_14
            beqi    r3,#2,BIOSMain_15
            beqi    r3,#1,BIOSMain_16
 

Note that it uses one fewer register and half as many instructions, so it should execute at least twice as fast.
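
For a feel of what an 11-bit displacement buys, here is a back-of-envelope calculation in Python. It assumes the displacement is a signed two's complement count of 4-byte instruction words, which matches the 32-bit words in the listings but is otherwise a guess about the encoding; `disp_range` is just a helper for the arithmetic.

```python
def disp_range(bits, insn_bytes=4):
    """Byte reach of a signed displacement counted in instructions (assumed encoding)."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return lo * insn_bytes, hi * insn_bytes


print(disp_range(11))   # (-4096, 4092)
```

Under those assumptions a branch reaches roughly 4 KB in either direction, which is plenty for a switch like the one above but could fall short in a large function.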

_________________
Robert Finch http://www.finitron.ca


Sat Jul 15, 2017 5:52 am

Joined: Tue Dec 31, 2013 2:01 am
Posts: 116
Location: Sacramento, CA, United States
robfinch wrote:
... One thing I miss is an integrated compare to immediate and branch instruction. There's something about having something oddball that seems to appeal to me ...

I totally know what you mean, Rob. For me, it's the integrated push-load and store-pull instructions for my 65m32. I can write a print immediate string subroutine in five instructions:
Code:
\ Print the in-line NUL-terminated string to the console
\ Must call with JSR, so the string address is stacked.
\ Trashes:  A
primm:
    exy  ,s             \ stack y, load string address
primm2:
    lda  ,y+            \ load string char
    sly  [eq]#,n        \ restore y and exit if NUL
    jsr  charout        \ print char to console
    bra  primm2         \ loop


Mike B.


Sun Jul 16, 2017 5:50 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Arrrgh! The compiler option to display mixed source code and generated code was turned on and...
it toasted the peephole optimizations, because the source code is output as comments in between the instructions. Apparently the optimizer then thinks there are additional instructions between code it might otherwise have optimized. It’s not a simple fix.
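
One possible direction, sketched in Python on a toy instruction list rather than the compiler's real data structures: have the pattern matcher look past comment lines when pairing instructions, and carry the comments through to the output. The fused pattern (`cmp` plus `beq` into `beqi`), the `;` comment marker, and the assumption that the temporary register is dead after the branch are all illustrative choices, and the real optimizer is surely more involved.

```python
import re

CMP = re.compile(r"cmp\s+(\w+),(\w+),(#\w+)$")
BEQ = re.compile(r"beq\s+(\w+),(\w+)$")


def is_comment(line):
    return line.strip().startswith(";")


def peephole(lines):
    """Fuse 'cmp t,a,#imm' + 'beq t,label' into 'beqi a,#imm,label'.

    Comment lines between the two instructions are skipped while matching
    and kept in the output. Assumes t is not used after the branch.
    """
    out = []
    i = 0
    while i < len(lines):
        m = CMP.match(lines[i].strip())
        if m:
            j = i + 1
            while j < len(lines) and is_comment(lines[j]):
                j += 1                      # look past interleaved source comments
            b = BEQ.match(lines[j].strip()) if j < len(lines) else None
            if b and b.group(1) == m.group(1):
                out.extend(lines[i + 1:j])  # keep the comments
                out.append(f"beqi {m.group(2)},{m.group(3)},{b.group(2)}")
                i = j + 1
                continue
        out.append(lines[i])
        i += 1
    return out


code = ["cmp r4,r3,#8", "; case 8:", "beq r4,BIOSMain_14", "add r1,r1,#1"]
print(peephole(code))
# ['; case 8:', 'beqi r3,#8,BIOSMain_14', 'add r1,r1,#1']
```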

Store instructions hung the machine once branches were introduced, because stores wait until there are no previous flow control instructions in the queue. Unfortunately the branches would sit in the queue even though they were complete. So now, once an instruction commits to the machine's architectural state, it is turned into a NOP to unfreeze the following stores.
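
The rule and the fix can be modelled in a few lines of Python. This is a toy model, not the core's issue logic: the queue is a plain list ordered oldest first, and the `FLOW` set is simply the branch and jump mnemonics that appear in the listings in this thread.

```python
FLOW = {"bra", "beq", "bne", "jal"}   # flow control mnemonics seen in the listings


def store_may_issue(queue, idx):
    """A store at queue[idx] may issue only if no older entry is flow control."""
    return not any(entry["op"] in FLOW for entry in queue[:idx])


def commit(queue, idx):
    """Per the fix above: a committed instruction becomes a NOP so it stops blocking."""
    queue[idx]["op"] = "nop"


queue = [{"op": "bne"}, {"op": "sh"}]
print(store_may_issue(queue, 1))   # False
commit(queue, 0)
print(store_may_issue(queue, 1))   # True
```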

_________________
Robert Finch http://www.finitron.ca


Sun Jul 16, 2017 7:23 am