 Thor Core / FT64 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Moved the instruction victim cache from the bus interface unit to the instruction cache component, and made the victim cache optional. Added logic to invalidate the victim cache on a snoop hit in the victim cache.

It was pointed out on comp.arch that fully associative comparators are not required for snooping. All that is needed is a comparator for each way of a set, since a given snoop address can only map to one set. Updating the cache to account for this reduced its size by 13%.
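For illustration, here is a rough C model of set-limited snoop checking. The real logic is of course in the cache RTL, and the geometry used below is made up:
Code:
/* Hypothetical C model: the snoop address selects exactly one set, so only
 * that set's ways need tag comparators, not one comparator per cache line.
 * The 128-set x 4-way geometry is illustrative only. */
#include <stdint.h>
#include <stdbool.h>
#define SETS 128
#define WAYS 4
#define LINE 64                                 /* bytes per cache line */
typedef struct { uint32_t tag; bool valid; } line_t;
static line_t cache[SETS][WAYS];
void snoop_invalidate(uint32_t paddr)
{
    uint32_t set = (paddr / LINE) % SETS;       /* one set selected by the address */
    uint32_t tag = (paddr / LINE) / SETS;
    for (int w = 0; w < WAYS; w++)              /* WAYS comparators, not SETS*WAYS */
        if (cache[set][w].valid && cache[set][w].tag == tag)
            cache[set][w].valid = false;        /* invalidate on snoop hit */
}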

_________________
Robert Finch http://www.finitron.ca


Wed Mar 08, 2023 6:17 am

Completely re-writing the bus interface unit. This has come from updating the cache modules and writing cache controllers. A lot of the logic that was in the bus interface unit is ending up in the cache controllers. Moving the TLB out to a shared component also results in lots of changes.

Sun Mar 12, 2023 4:05 am

Just pouring work into the bus interface unit (BIU). Whittled it down to 2,700 lines from 3,330; it should end up much smaller yet. The BIU previously included just about everything needed to interface to the outside bus, including the hardware table walkers. Some of the code is being moved outside the unit to allow for multiple cores.

Tue Mar 14, 2023 5:23 am

Working on the memory request queue, also called a load/store queue, today. The code is not terribly large, 500 LOC, but it generates a lot of logic because each queue entry is manipulated separately. The queue has parameters allowing it to perform store merging and load bypassing. With everything enabled the core takes about 66,500 LUTs, which is too large for the current project. The minimal version of the core is just 16,500 LUTs.

Store merging merges stores to the same cache line, resulting in a single store operation on the external bus. For example, storing a byte to offset zero of a cache line and then a wyde to offset 12 of the same line merges the store data together, and only a single store operation is performed. Any number of stores can be merged. If stores are performed byte by byte, up to eight bytes will be merged into a single cache line (limited by the queue depth of eight) and a single store operation will take place. This is significantly faster than performing the individual stores.

The core also features load bypassing: if a load address matches a queued address and the selected range of data is available in the queue, the load is satisfied from the queue instead of from external memory.

Both store merging and load bypassing respect the memory cache-ability of the operation. If the operation is non-cacheable because it is to I/O, then store merging and load bypassing are disabled.
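For illustration, a rough C model of the store-merging side only. This is a sketch, not the core's RTL; the queue layout and names are hypothetical:
Code:
/* Hypothetical model: a store to a cache line already queued is folded into
 * that entry's data and byte mask, so only one store goes out on the bus. */
#include <stdint.h>
#include <string.h>
#include <stdbool.h>
#define LINE_BYTES 64
#define QDEPTH      8
typedef struct {
    bool     valid;
    bool     cacheable;                 /* merging is disabled for I/O space   */
    uint64_t line_addr;                 /* which 64-byte line                  */
    uint8_t  data[LINE_BYTES];
    uint64_t bytemask;                  /* which bytes of the line are written */
} store_entry;
static store_entry queue[QDEPTH];
/* Returns true if the store (len <= 16 bytes) merged into an existing entry. */
bool merge_store(uint64_t addr, const uint8_t *val, unsigned len)
{
    uint64_t line = addr / LINE_BYTES;
    unsigned off  = addr % LINE_BYTES;
    for (int i = 0; i < QDEPTH; i++) {
        store_entry *e = &queue[i];
        if (!e->valid || !e->cacheable || e->line_addr != line)
            continue;
        memcpy(&e->data[off], val, len);               /* fold the data in  */
        e->bytemask |= ((1ull << len) - 1) << off;     /* mark merged bytes */
        return true;  /* e.g. a byte at offset 0 and a wyde at offset 12
                         become a single store operation */
    }
    return false;     /* no match: allocate a new queue entry (not shown) */
}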

Wed Mar 15, 2023 4:37 am

Moved the region table into the shared TLB. The PCI config space is now shared between the TLB and region table.

I finally figured out a lower-cost way to represent a super-page in the TLB: increase the associativity. Two ways are now dedicated to 16MB super-pages and four ways to 16kB pages. I chose to increase the number of ways rather than add another read port to the TLB because it is a lot less expensive: adding a port would quadruple the RAM requirements of the TLB, while adding two ways only increases them by 50%.

The TLB currently piles up all the 16MB pages in the first 256 entries of the TLB. This occurs because there are not enough distinguishing address bits incoming. 16MB pages have only eight significant address bits when the address bus is 32-bits.
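For illustration, a sketch of how the set and way selection could look. The 1024-set figure, the index bits, and the way numbering (0-3 normal, 4-5 super-page) are assumptions, not taken from the core:
Code:
/* Hypothetical set/way selection: 16kB pages index with bits above bit 13,
 * 16MB super-pages with bits above bit 23. With a 32-bit address a
 * super-page VPN has only 8 bits, so super-pages land in sets 0..255. */
#include <stdint.h>
#define TLB_SETS 1024u                      /* assumed set count */
typedef struct { unsigned set, first_way, last_way; } tlb_slot;
tlb_slot tlb_lookup_slot(uint32_t vaddr, int superpage)
{
    tlb_slot s;
    if (superpage) {                        /* 16MB page */
        s.set = (vaddr >> 24) % TLB_SETS;   /* only 8 significant bits */
        s.first_way = 4; s.last_way = 5;    /* two ways dedicated to super-pages */
    } else {                                /* 16kB page */
        s.set = (vaddr >> 14) % TLB_SETS;
        s.first_way = 0; s.last_way = 3;    /* four ways for normal pages */
    }
    return s;
}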

Mon Mar 20, 2023 10:43 pm

Expanded size of the LVL field in the PTE to four bits, to allow up to 16 levels of paging. With a smaller page size and larger virtual address range, more than eight levels may be required.

Wrote a TLB miss interrupt routine after researching some routines on the web, in part to get a feel for how well the ISA works. I think I am doing too much work in the TLB component; there are updates to pages required on a cache miss that should not really be done in the TLB.

The idea for the TLB miss routine is to have a separate routine for each entry level from the root pointer. This avoids having branches in the miss routine at the cost of some code replication.
Code:
; TLB miss handler
; Handles a 34-bit virtual address
; The TLB device needs to be permanently mapped into the system's address space
; since it is MMIO and uses the TLB.
;
;
tlb_miss_irq34:
   st96   a0,[sp]                              ; save working registers
   st96   a1,12[sp]
   st96   a2,24[sp]
   st96   a3,36[sp]
   st96   a4,48[sp]
   ld96   a0,TLB_TLBMISS_ADR            ; a0 = miss address
   ld96   a1,PTBR                              ; a1 = page table base
   clr      a1,a1,0,13                        ; clear 14 LSBs, address is page aligned
   extu   a2,a0,24,9                        ; get miss address bits 24 to 33, index into top level page table
   ld96   a3,[a1+a2*]                        ; get PTP from top level table
   bbc      a3,PTE_V,.noL1PTE               ; check that entry is valid
   bbc      a3,PTE_T,.L1superPage         ; check for 16MB superpage
   extu   a1,a3,PTE_PPNLO,22            ; get PTP pointer low bits
   extu   a4,a3,PTE_PPNHI,32            ; and high bits
   asl      a4,a4,22                           ; build into one variable
   or      a1,a1,a4
   asl      a1,a1,14                           ; convert PPN to table address
   extu   a2,a0,14,9                        ; get miss address bits 14 to 23
   ld96   a3,[a1+a2*]                        ; get MPP
   bbc      a3,PTE_V,.noL0PTE               ; check that entry is valid
   bbs      a3,PTE_T,.corrupt               ; should be a PTE, otherwise table corrupt
.L1superPage:
   st96   a3,TLB_TLBE_HOLD               ; store MPP in holding reg
   st16   r0,TLB_TLBE_TRIGGER            ; update TLB
   ld96   a0,[sp]                              ; restore working registers
   ld96   a1,12[sp]
   ld96   a2,24[sp]
   ld96   a3,36[sp]
   ld96   a4,48[sp]
   rti
   ; Here, memory was not mapped to support the access. So, the program must be
   ; trying to read or write a random address. Abort the program.
.noL1PTE:
.noL0PTE:
.corrupt:
   ldi      a0,ABORT_PROGRAM
   ldi      a1,ERR_TLBMISS
   syscall

Wed Mar 22, 2023 3:50 am

Coded high-performance semaphores to be shared between CPU cores. Any core may set or read the semaphores via a CSR register. One use of the semaphores is to obtain exclusive access to the TLB registers. I re-wrote the TLB miss handler to account for exclusive access.

Code:
; shared TLB miss handler
; Handles a 34-bit virtual address
; Slightly more complex than an unshared TLB as the TLB registers need to be
; protected via a semaphore. Updates must be restricted to one core at a time.
; The TLB device needs to be permanently mapped into the system's address space
; since it is MMIO and uses the TLB.
; The stack must be mapped into a global address space.
;
;
tlb_miss_irq34:
   st96 t0,[sp]                              ; save working registers
   st96 t1,12[sp]
   st96 t2,24[sp]
   st96 t3,36[sp]
   st96 t4,48[sp]
   st96 t5,60[sp]
   ld96 t0,TLB_MISS_ADR                  ; t0 = miss address, reading miss address clears interrupt
   csrrs   r0,3,M_IE                           ; enable interrupts
   csrrd   t1,r0,S_PTBR                     ; t1 = page table base
   clr   t1,t1,0,13                           ; clear 14 LSBs, address is page aligned
   extu t2,t0,24,9                           ; get miss address bits 24 to 33, index into top level page table
   ld96 t3,[t1+t2*]                        ; get PTP from top level table
   bbc   t3,PTE_V,.noL1PTE                  ; check that entry is valid
   extu t5,t3,PTE_T,0                     ; get PTE.T bit
   bbc   t3,PTE_T,.L1superPage            ; check for 16MB superpage
   extu t1,t3,PTE_PPN,63                  ; get PTP pointer
   asl   t1,t1,14                              ; convert PPN to table address
   extu t2,t0,14,9                           ; get miss address bits 14 to 23
   ld96 t3,[t1+t2*]                        ; get MPP
   bbc   t3,PTE_V,.noL0PTE                  ; check that entry is valid
   bbs   t3,PTE_T,.corrupt                  ; should be a PTE, otherwise table corrupt
.L1superPage:
   extu t1,t0,20,75                        ; VPN bits 6 to 83 = miss address bits 20 to 95
   csrrd   t2,r0,S_ASID                     ; add ASID to miss address
   asl   t2,t2,80
   or t1,t1,t2                                 ; t1 = VPN+ASID
   extu t2,t0,14,9                           ; t2 = address bits 14 to 23 = TLB entry number
   asl t2,t2,5                                 ; shift into position
   csrrd t4,0,S_LFSR                        ; choose a random way to replace
   pne t5,"TFFIIIII"                        ; way depends on page level
   and t4,t4,3                                 ; way 0 to 3   ; normal page
   and   t4,t4,1                                 ; way 0 or 1   ; superpage
   add   t4,t4,4                                 ; way 4 or 5   ; superpage
   or t2,t2,t4                                 ; bump out
   csrrc   r0,3,M_IE                           ; disable interrupts
.lock:
   csrwr t4,1,M_SEMA                        ; try and set semaphore
   csrrd t4,0,M_SEMA                        ; check and see if set, zero returned if set
   bbs   t4,0,.lock                           ; must have been clear
   st96 t3,TLB_PTE                           ; do quick stores to memory (t3 = PTE)
   st96 t1,TLB_VPN                           ; t1 = VPN+ASID
   st96 t2,TLB_CTRL                          ; t2 = entry number and way
   st8 r0,TLB_WTRIG                        ; trigger update
   csrrc r0,1,M_SEMA                        ; release semaphore
   csrrs   r0,3,M_IE                           ; enable interrupts
   ld96 t0,[sp]                              ; restore working registers
   ld96 t1,12[sp]
   ld96 t2,24[sp]
   ld96 t3,36[sp]
   ld96 t4,48[sp]
   ld96 t5,60[sp]
   rti
   ; Here, memory was not mapped to support the access. So, the program must be
   ; trying to read or write a random address. Abort the program.
.noL1PTE:
.noL0PTE:
.corrupt:
   ldi a0,ABORT_PROGRAM
   ldi   a1,ERR_TLBMISS
   syscall

Thu Mar 23, 2023 6:19 am

Worked on the hardware card table, HCT, and cache coherency tonight. The HCT is a telescopic memory that reflects, with progressively more detail, where pointer stores have occurred in memory.
The write barrier for pointer stores ends up looking like the following:
Code:
   ; Milli-code routine for garbage collect write barrier.
   ; This sequence is short enough to be used in-line.
   ; Three level card memory.
   ; a2 is a register pointing to the card table.
   ; STPTR will cause an update of the master card table, and hardware card table.
   ;
GCWriteBarrier:
STPTR      a0,[a1]          ; store the pointer value to memory at a1
LSR      t0,a1,#8      ; compute card address
ST8      r0,[a2+t0]      ; clear byte in card memory


The STPTR instruction updates two levels of the HCT automatically. The highest level is a single 32-bit word indicating which 16MB pages a STPTR happened in. The second-highest level is a 4kB memory, accessed as 1k x 32 bits, indicating which 16kB pages a STPTR happened in. It is then only necessary to scan the cards within a flagged 16kB page: there are only 64 cards to check, and likely only about 32 pointers to check in each card.
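For illustration, a rough C model of how a collector might walk the telescopic structure. This is a sketch only: the names are hypothetical, and a set bit is assumed to flag a dirty page in the two HCT bitmap levels, while a cleared byte marks a dirty card, matching the write barrier above.
Code:
/* Hypothetical walk of the telescopic card structure covering a 512MB heap
 * (32 x 16MB). Assumed: set bits flag dirty pages in the two HCT levels,
 * cleared card bytes flag dirty 256-byte cards (per the ST8 r0 above). */
#include <stdint.h>
#define CARD_SIZE     256u                      /* LSR #8 in the write barrier */
#define CARDS_PER_16K (16384u / CARD_SIZE)      /* 64 cards per 16kB page */
extern uint32_t hct_top;                        /* 1 bit per 16MB page */
extern uint32_t hct_mid[1024];                  /* 1 bit per 16kB page */
extern uint8_t  card_table[];                   /* 1 byte per 256-byte card */
extern void     scan_card(uintptr_t card_base); /* checks up to ~32 pointers */
void scan_dirty_cards(uintptr_t heap_base)
{
    for (unsigned big = 0; big < 32; big++) {
        if (!(hct_top & (1u << big)))
            continue;                           /* no STPTR in this 16MB page */
        for (unsigned small = 0; small < 1024; small++) {
            unsigned pg = big * 1024u + small;  /* 16kB page number */
            if (!(hct_mid[pg / 32] & (1u << (pg % 32))))
                continue;                       /* no STPTR in this 16kB page */
            uintptr_t page = heap_base + ((uintptr_t)pg << 14);
            for (unsigned c = 0; c < CARDS_PER_16K; c++) {
                uintptr_t card = page + c * CARD_SIZE;
                if (card_table[(card - heap_base) / CARD_SIZE] == 0)
                    scan_card(card);            /* cleared byte = dirty card */
            }
        }
    }
}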

Sun Mar 26, 2023 4:21 am

Yesterday: put some elbow grease into the MPU component. The MPU ties together several other components including the interval timers, interrupt controller, serial port, and shared TLB.

Putting together the system-on-chip using the MPU component. The system does not synthesize to the correct size; most of the system is being omitted from the build. Obviously something is amiss.

Have the group register load and store instructions partially implemented. The idea is that a group of registers is stored with a single instruction. The group, in this case five registers, occupies 480 bits of a 512-bit cache line. Registers are stored in groups of five beginning with r0 to r4, then r5 to r9, r10 to r14, etc. The entire register context can be saved with 13 store instructions instead of 64. These instructions could also be handy in function prologs and epilogs.
Setting up the register file for group access was a challenge. The register file is broken into five groups of 16 registers each, of which only the lowest 13 of the 16 are used, resulting in 65 available registers. There is a map involved to convert the six-bit register code into a three-bit group and a four-bit index.
The ABI should be set up to make the best use of the groups of five registers.
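For illustration, one register-code mapping consistent with the description, offered as an assumption only and not necessarily the actual map: consecutive codes go to different banks so that a group of five consecutive registers touches one register per bank and can be moved in a single group load/store.
Code:
/* Hypothetical code-to-(group,index) map: r0-r4 land in banks 0-4, so a
 * group of five consecutive registers reads one register from each bank. */
#include <stdint.h>
typedef struct {
    uint8_t group;                  /* 3-bit bank select, 0..4            */
    uint8_t index;                  /* 4-bit index within the bank, 0..12 */
} reg_sel;
static reg_sel map_regcode(uint8_t code)   /* register code */
{
    reg_sel sel;
    sel.group = code % 5;           /* r0, r5, r10, ... share bank 0, etc.   */
    sel.index = code / 5;           /* only 13 of the 16 slots per bank used */
    return sel;
}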

Tue Mar 28, 2023 2:46 am

Started working on the compiler, CC64, which is seriously out of date. Lots of strcpy() to change to strcpy_s() and so on.
Also had to update the preprocessor FPP64.

Thu Mar 30, 2023 3:33 am

Got a first pass at the compiler done beginning with the source code of an earlier version. I believe it compiles code close to correctly but there are some performance issues. For instance, the compiler is not using base plus scaled indexed addressing when it could be. This results in code like the following:
Code:
# if(flags[i]){
  sll      t0,s0,1
  lea      t1,_flags[gp]
  ldw      t7,[t0+t1]
  beqz     t7,.00035

This uses two extra instructions and two temporaries, when it could look like:
Code:
  ldw      t7,_flags[r0+s0*2]
  beqz     t7,.00035

The best place to fix this in the compiler is at the expression parsing stage, when it builds the expression nodes representing the indexing operation. It might also be possible to handle it with pattern matching in the peephole optimizer, but that would not free the temporaries for use elsewhere.

Fri Mar 31, 2023 4:01 am

Heavy duty work on the compiler today. Rewrote the processing of aggregate type initialization. It does not yet work as well as it used to, though I think the code is improved. It is tricky to do because types must be matched up between the variable and the initialization data. All the initialization data is grabbed at once and stored in expression trees, which should evaluate to constants. Having the data in trees creates an issue matching the data up with the variable's elements, so the expression trees are converted to linear lists in several places. It was necessary to add an 'order' number to the expression nodes recording the order in which they were encountered. Unfortunately, it does not quite work correctly in all cases yet; sometimes incorrect data is output for initialization.
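For illustration, a minimal sketch of the tree-to-list conversion; this is not CC64's actual code, and the node layout and names are hypothetical.
Code:
/* Leaf constants are collected from the initializer expression tree and then
 * sorted by the 'order' number assigned at parse time, so they can be
 * matched positionally against the aggregate's elements. */
#include <stdlib.h>
typedef struct enode {
    int order;                      /* sequence number assigned when parsed */
    long value;                     /* constant the subtree evaluates to    */
    struct enode *left, *right;
} enode;
static void flatten(enode *n, enode **list, int *count)
{
    if (!n) return;
    if (!n->left && !n->right)
        list[(*count)++] = n;       /* leaf: one initializer constant */
    flatten(n->left, list, count);
    flatten(n->right, list, count);
}
static int by_order(const void *a, const void *b)
{
    return (*(enode * const *)a)->order - (*(enode * const *)b)->order;
}
/* Fill 'list' with the constants in source order; returns the count. */
static int linearize(enode *root, enode **list)
{
    int count = 0;
    flatten(root, list, &count);
    qsort(list, (size_t)count, sizeof *list, by_order);
    return count;
}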

Got back a compiler test suite off the web. The test suite I had disappeared during the great hard drive crash of 2022.

Sun Apr 09, 2023 3:58 am

Added the ORF instruction. It operates the same way as the OR instruction except that it uses an immediate value encoded as a float. Half, single, double, and quad precisions are supported. The instruction can be used to load a floating-point immediate value into a register. As a single cycle operation it is faster than using FADD to load a value.

Converted immediate constants from 96 to 128 bits, dropping the whole 96-bit machine idea and just going with 128 bits. Changed the way the PFX2 instruction works: it is now issued twice in succession to provide 64 bits of constant information. This frees up a prefix.

Got the first try on a sequential machine coded, but there is a signal amiss. It does not synthesize correctly, and leaves out the data cache.

Sun Apr 16, 2023 5:35 am

Made page-relative branching an option and changed the default branching mechanism to simple relative addressing. The issue with page-relative branching is that it leaves the code more exposed to attacks, because it amounts to almost absolute addressing for the code. The core also now supports unaligned 64-byte accesses, which will hopefully make the compiler code easier to manage.

Sun Apr 23, 2023 8:08 am

Forgot to provide an ASID for instruction accesses. Fixing this did not fix the issue of code being elided during synthesis.

Decided to make vector instructions one byte wider than scalar ones, so they are 48-bit instructions. The extra byte specifies the vector mask register to use, with one bit left over. The issue was that the compiler would always spit out a vector mask modifier before every vector instruction and then rely on the peephole optimizer to merge the mask instructions together where possible. A vector mask instruction occupied five bytes of storage, so that approach is probably not any more storage efficient than just adding an extra byte to each instruction. The optimizer could not merge mask instructions across flow-control boundaries, and it was otherwise tricky to do properly.
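Back-of-envelope on the storage trade-off: the 48-bit encoding costs one extra byte per vector instruction, while a shared five-byte mask modifier covering k vector instructions costs a flat five bytes (5 + 5k bytes versus 6k), so the modifier only saves space once six or more instructions share the same mask, a threshold the optimizer would rarely reach given that it could not merge across flow-control boundaries.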

Getting rid of the decimal mode flag. The issue is that a mix of binary and decimal instructions needs to be available *at the same time*; this was discussed on a newsgroup. There is an extra bit available in some register-register operate instructions which will probably be used to indicate decimal mode.

Mon Apr 24, 2023 3:52 am