 Thor Core / FT64 

Joined: Wed Nov 20, 2019 12:56 pm
Posts: 92
robfinch wrote:
The Thor2022 scheduler component is driving me crazy. It is now reporting as being over 100,000 LUTs in size, totally ridiculous and blowing the LUT budget, when included in the top module. If I synthesize the module by itself, it reports as being 51 LUTs in size, which I think is the proper size. So, I am experimenting to try and find out why the difference.


That smells like a RAM block not being inferred - if it works in isolation, does the full integrated design maybe end up doing something like feeding the output of one memory block directly into the address input of another?


Fri Aug 26, 2022 9:15 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
That smells like a RAM block not being inferred - if it works in isolation, does the full integrated design maybe end up doing something like feeding the output of one memory block directly into the address input of another?
It does work like that a little bit. One 8x3-bit RAM is used to address a second RAM. I figured it might get turned into an 8x8 matrix, but that is only about 3200 LUTs. I had a very complex scheduler and it worked out to about 20,000 LUTs in size, and I suspected the RAM access was being turned into a matrix. So I went ahead and really simplified the scheduler, and things went nuts. I am sure it is just something that I cannot see at the moment.

_________________
Robert Finch http://www.finitron.ca


Sat Aug 27, 2022 3:53 am

Joined: Wed Nov 20, 2019 12:56 pm
Posts: 92
robfinch wrote:
It does work like that a little bit. One 8x3-bit RAM is used to address a second RAM.


I'm going to hand-wave here, since I'm fuzzy on the details (and they no doubt vary between FPGAs anyway), but my guess would be that the tool wants to pack the address signal as a register either inside or immediately adjacent to the RAM block, and likewise for the output of the other RAM block. If the design considers those registers to be one and the same, then it can't satisfy both requirements simultaneously, and thus uses logic instead of RAM blocks. If so, adding an extra register between the two blocks should help, but obviously will cost you an extra cycle.

(I don't know which FPGA and toolchain you're using, but there was an update to Quartus 18.1 which fixed some RAM block corner cases.)
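
To put a number on that extra cycle, here is a rough cycle-level model in Python (not HDL, and not taken from the Thor2022 source). It assumes both RAMs are synchronous with a registered read output; the function name, the table contents and the 8-entry sizes are made up for illustration.

```python
def chained_read(index_ram, data_ram, addr, extra_register):
    """Read data_ram[index_ram[addr]] and count the clock cycles it takes.

    Assumption: each RAM registers its read data on the clock edge.
    """
    stage1 = None  # registered output of the index RAM
    mid = None     # optional extra pipeline register between the RAMs
    stage2 = None  # registered output of the data RAM
    cycles = 0
    while stage2 is None:
        cycles += 1
        # Work out every register's next value from the current state,
        # then update them all together, as a clock edge would.
        next_stage1 = index_ram[addr]
        if extra_register:
            next_mid = stage1
            data_addr = mid
        else:
            next_mid = None
            data_addr = stage1
        next_stage2 = data_ram[data_addr] if data_addr is not None else None
        stage1, mid, stage2 = next_stage1, next_mid, next_stage2
    return stage2, cycles


# Hypothetical contents: an 8x3-bit index RAM addressing an 8-entry data RAM.
index_ram = [3, 1, 4, 1, 5, 2, 6, 7]
data_ram = [10, 11, 12, 13, 14, 15, 16, 17]

direct = chained_read(index_ram, data_ram, 4, extra_register=False)
pipelined = chained_read(index_ram, data_ram, 4, extra_register=True)
```

Both calls return the same data; only the cycle count differs, which is the trade being suggested.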


Sat Aug 27, 2022 10:29 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I am back onto this project yet again with the 2023 version. It keeps a chunk of the original Thor, such as the 64 GPRs and vectors, but adds some new tricks like sign control on operands and immediate operand swapping.

Project Thor2023 underway.
- Fixed 40-bit instruction format, instruction postfix words for extended constants
- Predication via PRED modifier
- 64 GPRs, unified register file, 96-bit registers
- Sign control on operands, immediate operand swapping
- 88-bit extended precision binary floating point, optional 96-bit decimal float
- 32/64 bit addressing
- Block tagging of data in MMU
- 1 address mode, base plus scaled index plus displacement
- Vector operations
- Bit/Bit pair manipulation instructions

- Vector instructions always use vector mask register #0 unless overridden with a VMASK modifier
- Four operating modes: app, supervisor, hypervisor and machine
- 512 entry relocatable vector tables for each operating mode
- Load using stack canary register, GPR #54, checks the canary value and raises an exception if it differs
- 24-bit branch displacements
- BSR, PIC
- Loading / storing groups of five registers to / from cache line

_________________
Robert Finch http://www.finitron.ca


Sun Jan 01, 2023 4:50 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Decided to switch the spec to 96-bit triple precision from 88 bits. For the demo, some of the low order bits of the float may not be supported, to make best use of the DSP blocks. But it is best to keep things officially to a standard precision.

Got enough of the assembler working to assemble the Fibonacci program. The assembler needed to be coded to handle 96-bit integer values. Fortunately, the assembler had most of the logic in place already. It was just a matter of making use of it.

Code:
                                           13: start:
02:0000000000000000 0302000160             14:    CSRRD   r2,r0,0x3001   # get the thread number
02:0000000000000005 0802020F00             15:    AND      r2,r2,15            # 0 to 3
02:000000000000000A DC09820000             16:    BNZ      r2,stall         # Allow only thread 0 to work
                                           17:
02:000000000000000F 040200FD00             18:    LDI      r2,0xFD
02:0000000000000014 0402000100             19:    LDI      r2,0x01      # x = 1
02:0000000000000019 52000000001F00FC       20:    STT      r2,0xFFFC0000
02:0000000000000021 FF00
                                           21:
02:0000000000000023 0403001000             22:    LDI      r3,0x10      # compute the 16th Fibonacci number (0x10 = 16); change this value to compute another
02:0000000000000028 0201030024             23:    OR      r1,r3,r0   # transfer y register to accumulator
02:000000000000002D 040303FDFE             24:    ADD      r3,r3,-3   # handles the algorithm iteration counting
                                           25:
02:0000000000000032 0401000200             26:    LDI      r1,2      # a = 2
02:0000000000000037 52040000001F00FC       27:    STT      r1,0xFFFC0004      # stores a
02:000000000000003F FF00
                                           28:
                                           29: floop:
02:0000000000000041 50040000001F00FC       30:    LDT      r2,0xFFFC0004      # x = a
02:0000000000000049 FF00
02:000000000000004B 0201010210             31:    ADD      r1,r1,r2            # a += x
02:0000000000000050 52040000001F00FC       32:    STT      r1,0xFFFC0004      # stores a
02:0000000000000058 FF00
02:000000000000005A 52000000001F00FC       33:    STT      r2,0xFFFC0000      # stores x
02:0000000000000062 FF00
02:0000000000000064 040303FFFE             34:    ADD      r3,r3,-1            # y -= 1
02:0000000000000069 DC0DD8FFFF             35:   BNZ    r3,floop      # loops back to floop while r3 != 0 (y has not reached zero yet)
02:000000000000006E 9F00000000             36:   NOP
02:0000000000000073 9F00000000             37:   NOP
02:0000000000000078 9F00000000             38:   NOP
02:000000000000007D 9F00000000             39:   NOP
02:0000000000000082 9F00000000             40:   NOP
02:0000000000000087 9F00000000             41:    NOP 
                                           42: stall:
02:000000000000008C DC00000000             43:    BRA   stall
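
For anyone wanting to check simulation results against the listing, here is a small Python trace of the loop. It assumes the mnemonics mean what they appear to mean (LDT/STT as load/store to the given address, ADD with an immediate, BNZ as branch if the register is non-zero); the function names are made up. Read that way, the loop as listed adds `a` to itself, because r2 has just been loaded from the same location that r1 mirrors, so r1 doubles on each pass. The second function shows the recurrence the comments describe, where the add uses the previously stored x. For a bring-up test this may not matter, but the expected register values differ.

```python
X_ADDR = 0xFFFC0000  # where the listing stores x
A_ADDR = 0xFFFC0004  # where the listing stores a


def loop_as_listed(n):
    """Trace the listing literally (assumed semantics); requires n > 3."""
    if n <= 3:
        raise ValueError("the iteration counter n - 3 must be positive")
    mem = {}
    r2 = 1
    mem[X_ADDR] = r2          # STT r2,0xFFFC0000
    r3 = n - 3                # LDI r3,n ; ADD r3,r3,-3
    r1 = 2
    mem[A_ADDR] = r1          # STT r1,0xFFFC0004
    while True:
        r2 = mem[A_ADDR]      # LDT r2,0xFFFC0004   (x = a)
        r1 = r1 + r2          # ADD r1,r1,r2        (a += x, but x == a here)
        mem[A_ADDR] = r1      # STT r1,0xFFFC0004
        mem[X_ADDR] = r2      # STT r2,0xFFFC0000
        r3 -= 1               # ADD r3,r3,-1
        if r3 == 0:           # BNZ r3,floop
            break
    return r1


def loop_with_old_x(n):
    """The recurrence the comments describe: a += previously stored x."""
    x, a = 1, 2
    for _ in range(n - 3):
        x, a = a, a + x
    return a
```

With n = 16 (0x10), `loop_with_old_x` gives 987, the 16th Fibonacci number counting from F1 = F2 = 1, while `loop_as_listed` gives 16384.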

_________________
Robert Finch http://www.finitron.ca


Tue Jan 03, 2023 6:22 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Motoring along on Thor2023. Today put the bus interface unit into place, stolen from the rfPhoenix project. Coded up part of a state machine to run the core. Going to use a simple state machine driven approach as most of the clock cycles will be burned up accessing memory. An instruction cache is being used which is part of the BIU.

The current goal is having enough of the machine in place to simulate Fibonacci.

_________________
Robert Finch http://www.finitron.ca


Wed Jan 04, 2023 5:18 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
More work on coding the core and updating the spec document. It will be a while before the fun of debugging begins.

_________________
Robert Finch http://www.finitron.ca


Thu Jan 05, 2023 3:59 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Spent today mulling over the operation of atomic memory ops. Started coding support for them in the multi-port memory controller mpmc10. It is interesting that it is the memory controller that needs to be able to execute the atomic operations. The CPU more or less just passes the instruction through to the memory controller.

AMO ops supported are ADD, AND, OR, EOR, ASL, LSR, MIN, MAX and CMPXCHG.

Not planning on getting these working right away, but they are planned in.

_________________
Robert Finch http://www.finitron.ca


Fri Jan 06, 2023 3:47 am

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
Rob:

Given the long list of instructions to which the AMO attribute / behavior applies, might a dedicated AMO instruction be a more efficient way to flag the AMO behavior to the memory controller? Usage from a compiler may require a special syntax or structure, but the potential performance penalty of AMO would then not apply to common instructions such as ADD, EOR, etc.

_________________
Michael A.


Fri Jan 06, 2023 9:15 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
Given the long list of instructions to which the AMO attribute / behavior applies, might a dedicated AMO instruction be a more efficient way to flag the AMO behavior to the memory controller? Usage from a compiler may require a special syntax or structure, but the potential performance penalty of AMO would then not apply to common instructions such as ADD, EOR, etc.
Yes. There is a separate set of instructions, independent of the usual ADD, AND, etc., just for AMO operations. I did not mean to imply that regular instructions were passed to the memory controller. I am calling them AMADD, AMAND, AMOR, etc. I got lazy and left the 'AM' prefix off when listing them. Supporting them with the compiler could be interesting. Normal 'C' may use intrinsic functions, but I think I will make up a way to support them in my non-C, C-like compiler. It may be as simple as keywords like "amo_add".

Spent time today working on AMO, atomic memory operations. Realizing that the way to do them is at the coherence point. So, an opcode needs to be passed to the memory controller indicating the AMO to perform. The address and data are supplied by the CPU, but the operation actually takes place in the memory controller. The Wishbone bus I have been using does not support this mechanism, so I have added onto it. It needed an opcode field and another data field. The AMO operations increased the size of the memory controller by about 25%. That combined with the bus interface unit is quite large.
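
As a rough picture of that division of labour, here is a Python sketch of a memory controller that receives an opcode, an address and two data fields and performs the read-modify-write itself. It is not the mpmc10 code. The class and method names are invented, the 32-bit word width is assumed, names beyond AMADD / AMAND / AMOR are extrapolated from the 'AM' prefix, and returning the old memory value, signed MIN/MAX, and using the second data field for CMPXCHG are all assumptions.

```python
WIDTH = 32                      # assumed word width
MASK = (1 << WIDTH) - 1


def _signed(v):
    """Interpret a WIDTH-bit value as two's complement."""
    return v - (1 << WIDTH) if v & (1 << (WIDTH - 1)) else v


class MemoryControllerModel:
    """Toy model: the AMO is executed here, not in the CPU."""

    def __init__(self):
        self.mem = {}

    def amo(self, opcode, addr, data, data2=0):
        old = self.mem.get(addr, 0)
        if opcode == "AMADD":
            new = old + data
        elif opcode == "AMAND":
            new = old & data
        elif opcode == "AMOR":
            new = old | data
        elif opcode == "AMEOR":
            new = old ^ data
        elif opcode == "AMASL":
            new = old << (data & (WIDTH - 1))
        elif opcode == "AMLSR":
            new = old >> (data & (WIDTH - 1))
        elif opcode == "AMMIN":
            new = old if _signed(old) <= _signed(data) else data
        elif opcode == "AMMAX":
            new = old if _signed(old) >= _signed(data) else data
        elif opcode == "AMCMPXCHG":
            # data is the expected value, data2 the replacement (assumed use
            # of the extra data field).
            new = data2 if old == (data & MASK) else old
        else:
            raise ValueError(f"unknown AMO opcode: {opcode}")
        self.mem[addr] = new & MASK
        return old              # assumed: the CPU gets the prior value back


mc = MemoryControllerModel()
mc.mem[0x100] = 5
before_add = mc.amo("AMADD", 0x100, 3)                 # memory becomes 8
swapped = mc.amo("AMCMPXCHG", 0x100, 8, data2=42)      # matches, stores 42
not_swapped = mc.amo("AMCMPXCHG", 0x100, 8, data2=99)  # no match, unchanged
```

Because the whole read-modify-write happens in one place, no other port of the controller can slip in between the read and the write, which is the point of doing it at the coherence point.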

_________________
Robert Finch http://www.finitron.ca


Sat Jan 07, 2023 4:13 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Fixed the float compare module to include comparisons of infinities. The compare, while taking only a single clock cycle, must act like a subtract operation. If two infinities are being subtracted, then the result should be a NaN. If one of the operands is infinity, then it should be greater than the other, which is a non-infinity.
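
A small Python sketch of the compare-as-subtract idea, using host doubles rather than the Thor2023 float formats. The function name and the result strings are made up. It follows the behaviour described above, so two infinities of the same sign come out as unordered; note that an IEEE 754 comparison would report them as equal, so this is a model of the post, not of the standard.

```python
import math


def compare_as_subtract(a, b):
    """Classify a versus b by looking only at the sign of a - b."""
    diff = a - b                 # inf - inf yields NaN
    if math.isnan(diff):
        return "unordered"
    if diff > 0:
        return "gt"
    if diff < 0:
        return "lt"
    return "eq"


inf = float("inf")
results = [
    compare_as_subtract(inf, 1.0),    # infinity against a finite value
    compare_as_subtract(1.0, inf),
    compare_as_subtract(inf, inf),    # same-sign infinities
    compare_as_subtract(inf, -inf),   # opposite-sign infinities
    compare_as_subtract(2.0, 2.0),
]
```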

Got the basic machine coded with ADD, CMP, AND, OR, EOR, LOAD, STORE and branch instructions and predicates too. No complex ops yet but it should be enough to run the Fibonacci example.

With a simple state machine implementing the core, performance will not be the best, but hopefully the core will fit into the FPGA.

_________________
Robert Finch http://www.finitron.ca


Mon Jan 09, 2023 3:45 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Some work on a first simulation. Nothing works at the moment. But I was able to see the I$ loaded with instructions.

_________________
Robert Finch http://www.finitron.ca


Tue Jan 10, 2023 4:58 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Modifying the ISA to remove the rounding mode spec on some FP instructions. The available bits will be used for FP constants. Rounding specification will be available with an instruction modifier.

Added FP16 to FPx conversion routines, where x is 32, 64, 96 or 128. The conversion is probably fast enough to use inline as combinational logic, with the immediate input coming from the instruction. It is basically just a bit copy with a small adder needed for the exponent.
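
To show how little work the conversion is, here is a Python sketch for the FP32 case only, operating on raw bit patterns. The function name is made up and this is not the Thor2023 routine. For normal numbers it is the bit copy plus exponent re-bias described above; the subnormal branch adds a normalization step that the one-line description does not mention, and a hardware version might handle it differently.

```python
import struct


def fp16_to_fp32_bits(h):
    """Widen an IEEE 754 binary16 bit pattern to a binary32 bit pattern."""
    sign = (h >> 15) & 0x1
    exp = (h >> 10) & 0x1F
    frac = h & 0x3FF
    if exp == 0x1F:                    # infinity or NaN: all-ones exponent
        exp32 = 0xFF
    elif exp != 0:                     # normal: re-bias from 15 to 127
        exp32 = exp + (127 - 15)
    elif frac == 0:                    # signed zero
        exp32 = 0
    else:                              # subnormal: normalize first
        shift = 0
        while (frac & 0x400) == 0:
            frac <<= 1
            shift += 1
        frac &= 0x3FF                  # drop the now-implicit leading one
        exp32 = 127 - 15 + 1 - shift
    # The 10 fraction bits are copied into the top of the 23-bit field.
    return (sign << 31) | (exp32 << 23) | (frac << 13)


def half_bits(x):
    """Helper: host float to binary16 bits, via the struct 'e' format."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def single_bits(x):
    """Helper: host float to binary32 bits."""
    return struct.unpack("<I", struct.pack("<f", x))[0]
```

The wider targets follow the same pattern with a different bias constant and shift amount.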

_________________
Robert Finch http://www.finitron.ca


Wed Jan 11, 2023 6:42 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Spent some time reading up on the AM2901 bit-slice.

Modified the Thor2023 ISA to be more like the Thor2022 ISA. Rather than a vector register indicator on every register there is a single bit which indicates a vector instruction along with an additional bit for register spec B to indicate a vector register and the same for register spec C if present. This saves one bit in many instructions.

Have not been able to get simulation to execute instructions yet. There is an issue loading the I$. The instruction data is being fetched from memory but not loaded. There is a control glitch somewhere.

_________________
Robert Finch http://www.finitron.ca


Thu Jan 12, 2023 3:14 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Still unable to get the I$ loaded in simulation.

Accessing memory is slightly complicated as things are pipelined, and requests and responses are asynchronous. Responses can come back in a different order than they were requested. The BIU keeps a table of outstanding requests, and as responses come in it matches each incoming response to its request. It also keeps track of burst length for requests, and burst length is what seems not to match properly. The cache is updated only once all data in the burst is retrieved. The memory segment, CODE, must also match. The BIU has evolved over time and seemed to be working for the Phoenix project. It was just plugged into Thor2023 and does not appear to work. There is probably something not being initialized in the same manner.
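
The matching scheme described above can be sketched in Python roughly as follows. This is not the rfPhoenix BIU; the class names, the use of a transaction ID to pair responses with requests, and the per-beat index are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PendingRequest:
    segment: str
    address: int
    burst_len: int
    beats: dict = field(default_factory=dict)   # beat index -> data


class BiuModel:
    """Toy model of an outstanding-request table with burst tracking."""

    def __init__(self):
        self.pending = {}    # transaction ID -> PendingRequest
        self.icache = {}     # line address -> list of beat data
        self.next_tid = 0

    def request_line(self, address, burst_len, segment="CODE"):
        tid = self.next_tid
        self.next_tid += 1
        self.pending[tid] = PendingRequest(segment, address, burst_len)
        return tid

    def response(self, tid, beat, data, segment="CODE"):
        """Accept one beat; return False if it matches no open request."""
        req = self.pending.get(tid)
        if req is None or req.segment != segment or beat >= req.burst_len:
            return False
        req.beats[beat] = data
        # Update the cache only once the whole burst has arrived.
        if len(req.beats) == req.burst_len:
            self.icache[req.address] = [req.beats[i] for i in range(req.burst_len)]
            del self.pending[tid]
        return True


biu = BiuModel()
t0 = biu.request_line(0x0000, burst_len=2)
t1 = biu.request_line(0x0040, burst_len=2)
# Responses arrive interleaved and out of order.
biu.response(t1, 1, "B1")
biu.response(t0, 0, "A0")
biu.response(t1, 0, "B0")
line_b_done_first = 0x0040 in biu.icache and 0x0000 not in biu.icache
biu.response(t0, 1, "A1")
```

In a model like this, a burst length recorded at request time that disagrees with the number of beats actually returned leaves the entry pending forever and the cache line never loads, which is the same symptom as the one reported.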

_________________
Robert Finch http://www.finitron.ca


Wed Jan 18, 2023 4:48 am