View unanswered posts | View active topics It is currently Thu Mar 28, 2024 3:03 pm



Reply to topic  [ 775 posts ]  Go to page Previous  1 ... 12, 13, 14, 15, 16, 17, 18 ... 52  Next
 Thor Core / FT64 
Author Message

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
That’s a good read.

I wanted to avoid ‘stop-the-world’ garbage collection by having it run periodically, controlled by the OS. It’s basically running a background thread since it’s interrupt driven. And it has a limited number of ticks that it runs in. I theory apps can still allocate memory while the garbage collector is running in the background, so they don’t stop. (Of course, it’s far from working yet).

They have got the GC latency down amazingly low according the article. But it seems a bit like they “cheated” by doubling the heap size. Write barriers seemed to be an issue. I’m wondering what write barriers are.

After reading up on garbage collection, I’ve gone on a crusade to implement shared memory. Shared heaps with garbage collection.
My goal is to ensure that the ISA has all the facilities to support the software. A goal at this point isn’t to get a complete full-blown OS working, but to sketch out requirements by trying to implement some things. It seems a lot of software can be supported by a simple computing core.

_________________
Robert Finch http://www.finitron.ca


Fri Aug 10, 2018 4:21 am
Profile WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
This was quite a quick and easy paper to read:
A Fast Write Barrier for Generational Garbage Collectors by Urs Hölzle
and this GC FAQ looks handy too.

(I'm sure you're right, to try to get some handle on what an OS or a HLL might need to do, or do efficiently, to guide your ISA design.)


Fri Aug 10, 2018 5:06 am
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Okay, having refreshed on some terminology, I’m reminded of computer graphics and variable resolutions and telescopes.

Seems to me a write barrier as described by Urs, is using a lower resolution memory to track updates to a higher resolution memory. We want to be able to look at writes at a lower resolution to gain performance in the GC. It seems to me that with the right hardware a single store operation could do both updating the object and performing a write barrier operation. If the memory range over which the store operation is taking place were aliased with another lower resolution memory, then on a write a bit could be set automatically in the low resolution memory.
For instance on my FPGA board there’s 512MB of memory. If the entire memory were aliased at a resolution divide by 128 it’d require 4 M bits (or 512kB) of alias ram. I think there’s actually enough block ram in the FPGA to do this. I’d use considerably less alias ram though.
A write to ram would also write to alias ram. A readback of the alias ram by the GC could be made a word at a time through a different address range.

_________________
Robert Finch http://www.finitron.ca


Fri Aug 10, 2018 6:23 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
One thing I noticed about Urs's paper is it shows using a "sll" (shift left logical) instruction in order to divide by 2 to the k. I think this should be a "srl" (a right shift) instead. It's computing an index into the card memory which should be a smaller address range.

_________________
Robert Finch http://www.finitron.ca


Fri Aug 10, 2018 2:47 pm
Profile WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
I wonder if there's an argument here for a RMW - a suitably sophisticated CPU could defer the modify and the write until the read is done, without holding up any subsequent instructions. (Perhaps it should be a 'SetBits' instruction.)


Fri Aug 10, 2018 4:41 pm
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I had a set of bitmap instructions added to one of the 65xx series cores that could do this. The address specified was a bit address, so the address range of the instruction was reduced.

Speaking of other instructions, I was thinking of adding a store pointer (sp) instruction. The store pointer instruction would cause an update of the card memory alias in addition to regular memory. This would be in lieu of having all stores update the card memory.

Just got a radical idea. Using a multi-level card map (telescoped memory system) rather than a single level. With progressively lower resolutions. This would allow the CG to scan even faster. It could scan at progressively higher resolution only as needed. This wouldn't be done without hardware aliased memories as it would make the number of instructions in the write barrier too large.

In the system there only 256MB of the 512MB are allowed to be aliased in order to reduce the block ram usage. The 256MB are managed by a 128kB dual ported block ram. The 128kB block ram is managed by a 512B block ram, which is then managed by a single bit indicator.

Code:
The GC would first check the single bit,
    If the bit is set
        Then scan the 64 word memory,
        If a bit is set,
            Then scan the corresponding 128kB memory,
            If a bit is set
           Then scan the actual corresponding memory page

It adds more levels of loops to the GC.

_________________
Robert Finch http://www.finitron.ca


Sat Aug 11, 2018 3:00 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I was just thinking about how to integrate a telescopic memory into system. The approach I’d like to use is to have the memory respond to virtual addresses (not physical ones) so that a portion of the memory could be mapped into the virtual address space of the apps.
In the card memory the addresses and data can be multiplexed across multiple clock cycles because the card memory is single cycle in nature and main memory takes many clock cycles to update. Writes always write through to main memory taking many cycles. So, there is time to perform three writes to card memory in the same timespan as a single write to main memory.

An ACB (application control block) contains an assortment of things associated with the app such as the command line, pointers to the start of heap memory, pointers to video buffers and a keyboard buffer. The kind of stuff one would find in zero-page memory.
The application control block (ACB) is stored in the first page of virtual memory at virtual address zero in the app’s memory space to make it easy for the OS to find. Memory management pages are 8kB in size. I now have the first 2kB of the ACB reserved for card memory. This allows writes to update the card memory to begin at offset zero. Following the ACB in the virtual address space is the text code segment. The text code segment may take up contiguous virtual memory pages. So, normally the lower addresses of the card memory would never be updated because the address corresponds to the ACB and the text code segment of the application. Assuming that pointers are not stored to the code segment (no self modifying code), this means the lower portion of the virtual address space never updates the lower addresses of the card memory.
This is good because it allows the first few words of the card memory to be reassigned use as a second level lower resolution card memory. So, the second level card memory overlays the first few words of the first level.
Right now, apps are limited to 8MB of virtual memory and 2kB card memory is enough to cover a 4MB range at a resolution of 256 bytes. The upper part of the app’s virtual memory space contains the apps virtual screen buffers. This memory does not need to be managed with card memory. So only the lower 4MB of memory space is covered.

At a high resolution of 256 bytes a pointer is located to within 32 words. That means the GC has to scan only 32 words to try an locate a pointer. The second level table resolves the pointer to addresses within a 65kB block.

Found an error in the lane selects of the original FT64. For character (16 bit) loads and stores only.

Added data type recording to the core. Every other word of memory records the data types in the following word. This is to gain some locality of reference for better cache performance. The memory system always writes 16 bytes to memory regardless of the which store instruction is used. So, by placing the data type in the upper eight bytes for the lower eight bytes, only a single write instruction is required to record both the data type and data value. The data type is recorded on a byte-by-byte basis. A single byte is used to represent the data type primitive. The low order three bits indicate which byte of the primitive is represented. The most significant bit of the data type byte indicates if it’s a pointer. The remaining four bit indicate the type 0 = byte, 1 = 16 bits, 2 = 32 bits, 3 = 64 bits. Other values are reserved for float types.

The reason to record the data type is to make pointer detection unambiguous. The author found that without this, it’s possible for a memory cell to be misidentified as a pointer. Although this shouldn’t happen in normal circumstances, there’s always the possibility of a maliciously written program taking advantage of it.

_________________
Robert Finch http://www.finitron.ca


Sun Aug 12, 2018 1:46 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I have the following code to detect if a value is a pointer:
Code:
// Detect a pointer:
// The lowest five bits must be zero
// The upper 41 bits must be zero (8MB memory range max)
// The magic number should be 'OBJ'

naked inline int IsObjPointer(register int *p)
{
   __asm {
      and      $r1,$r18,#$FFFFFFFFFF80001F
      bne      $r1,$r0,.notptr
      lw      $r1,24[$r18]
      cmp      $r1,$r1,#('O'<<16)|('B'<<8)|('J')
      bne      $r1,$r0,.notptr
      ldi      $r1,#1
      bra      .0001
.notptr:
      ldi      $r1,#0
.0001:
   }
}

So, I’m thinking “Yikes!” that’s a lot of code to detect a pointer, and it’s executed in an inner loop in the GC. The code branches so it might upset the processor’s queue. It would be nice if it could be replaced by a single instruction – isptr. Isptr returns a one in a register if the source register contains a valid pointer, otherwise it returns a zero.
It would work as follows:

Tgt = 0
If ((a & ptrmask)==0)
If (memory[offset+a]==”OBJECT”)
Tgt = 1

It would do a conditional load of memory. It would replace eight instructions with a single instruction with no branching. Probably close to eight times as fast. But, it’s a somewhat application specific instruction.

_________________
Robert Finch http://www.finitron.ca


Mon Aug 13, 2018 3:08 am
Profile WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
In a modern system, where the CPU vastly outpaces RAM, the load from the pointer address might be the most time-consuming part - instruction count not being everything!

The second most costly part could be the branches (although prediction works wonders these days) - finding a way not to branch would be a win. Using a branch to load either 0 or 1 could be avoided by computing a result instead. That would surely mean using a second register, but registers are cheap - and they are there to be used!

That said, there might well be room for one or two CISCy instructions.


Mon Aug 13, 2018 6:05 am
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Bought a book on garbage collection.

Given the amount of hardware in machines these days, I’m surprised there isn’t GC accelerator hardware. People spend money on physics accelerators and GPU’s.

Spurred on, I could get the original IsObjPointer function down to just four lines of code with no branch, but with the load always happening, or five lines of code with a branch and conditional load. The problem with the load always happening is it might exception because it’s trying to load from potentially anywhere in memory. This also wouldn’t be good if an IO device were struck. So, the version with the branch is probably the safest route because the address is guaranteed to be within the virtual memory range.

Anyways I decided to keep the ‘isptr’ instruction for now and drop the data type recording mentioned in a previous post. I have to admit it was late night when I decided to go that route. Data type recording was just some paranoia about guaranteeing there was no mistaking a pointer. Otherwise it’s not actually that useful (for the target HLL). It uses ½ of memory for too limited a value. Instead of recording all data types, I came up with a couple of schemes of recording if there was a pointer in an address but discarded them as either taking too much memory (byte per pointer) or being too slow (RMW for bit per pointer).

My latest quandary is getting a value from a register and reducing the number of registers that need to be scanned. The problem is it’s needed to get from any register in the machine including registers in different register sets that the machine supports. There is a ‘mov’ register instruction to do this, but it only accepts fixed register designations. I need the program to get the register values indirectly. (So, it can be done in a loop, I ain’t coding 2048 cases). I toyed with the idea of a ‘get register indirect’ instruction, but it’s very difficult to implement in the cpu. So instead some form of self-modifying code is needed. SMC is undesirable because the icache must be invalidated.
I have the following monster function which makes use of a rarely used ‘exec’ instruction I had hoped to eliminate from the ISA:
Code:
// Get specified register from specified register set by building
// the proper MOV instruction. The MOV instruction is executed from
// a code buffer.

naked int GetRegister(int regset, int regno)
{
   __asm {
      bbs      $r18,#5,.regset32
      // load r1 with move from register set instruction template
      ldi      $r1,#%100010_001_00000_00001_00000_00_000010
      bra      .0001
.regset32:
      ldi      $r1,#%011100_001_00000_00001_00000_00_000010
.0001:
      // Set the register set and register number fields according
      // to the desired register.
      and      $r18,$r18,#31
      and      $r19,$r19,#31
      shl      $r18,$r18,#18
      shl      $r19,$r19,#8
      or      $r1,$r1,$r18
      or      $r1,$r1,$r19
      // Load code buffer
      csrrw   $r0,#$80,$r1
      // Flush instruction pipeline to ensure the code buffer
      // update is visible for execution with the exec instruction
      add      $r0,$r0,#0
      add      $r0,$r0,#0
      add      $r0,$r0,#0
      add      $r0,$r0,#0
      add      $r0,$r0,#0
      add      $r0,$r0,#0
      add      $r0,$r0,#0
      add      $r0,$r0,#0
      exec   #0            // execute code buffer #0
   }
}

I could possibly get rid of the branches at the start of the code by re-arranging the opcodes.

One problem with the system garbage collector is it needs to scan a lot. A lot of registers and a lot of VM addresses. Scanning the VM addresses can be greatly reduced by the two-level card table. The GC has only to look at four words of information per app to see if pointer changes have occurred. So, if an app hasn’t run it’ll cycle past the app quickly. There are 2048 registers to check however. It’s tempting to use a card table for the registers too. A bit record of a pointer load operation in the register set could be handy. It’d reduce the amount of scanning required by 32 times (32 regs in the register set), allowing a register set scan to be skipped over. It’s a challenge because there are numerous ways a pointer value could be loaded into a register. I think a way to handle it is with a single instruction dedicated to updating the indicator for the register set. Otherwise it’s a mess of options in a number of instructions.

Something I thought would never be needed occurred. A two-up level return from an interrupt subroutine. The GC is interrupt based and begins running when a GC interrupt occurs. The GC stops running when it’s complete, or when a stop GC interrupt occurs. The stop GC interrupt stops the GC (interrupts the interrupt routine) from running anymore by recording the GC state and then performing a two-up level interrupt return. The next time the GC interrupt occurs it checks to see if there was stored state and resumes operation from the previous stop location if there was. Otherwise it starts running from the beginning again.

_________________
Robert Finch http://www.finitron.ca


Tue Aug 14, 2018 1:22 am
Profile WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Goodness, interrupting and cleanly terminating an ISR is a new one for me!

I'm reminded of some GC system - might have been in the Chrome browser or its Javascript engine - which would follow anything which looked like a pointer. It would have a few false-positives, following things which were not pointers, but this was felt to be a reasonable cost. You can't afford to miss any real pointers. It could have some leaks where memory is pinned by a false pointer. Certain "unlikely" data values could make this problem worse. It all got a lot better when they moved to a 64 bit world.

Possibly this is mentioned in this article:
http://jayconrod.com/posts/55/a-tour-of ... collection


Tue Aug 14, 2018 9:19 am
Profile

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Another good article. I’ve updated recently as my notions of garbage collection and memory management were out of date. I’m still thinking DOS and memory blocks.

It looks like tagged data may be the way to go. It's important to decide because it affects how the processing core interprets data.
I wonder if pointers to code are handled the same way as pointers to data ? If not then using two bits to indicate a pointer type might be better.


One scheme for storing a pointer flag is to modify the virtual address output by the core by multiplying it by 66/64. (a simple shift and an add). This creates inaccessible (without special instructions) “holes” in the memory every 32 words that can be used to store pointer type flags. At the same time the memory system remains looking like a contiguous array of words. Getting it to work with the cache requires some more thought, but it should provide some locality as pointer type flags are never more than 32 words away.

_________________
Robert Finch http://www.finitron.ca


Wed Aug 15, 2018 3:21 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Decided to dedicate a portion of the address space (1/64) to storing single bit pointer indicators. Rather than monkey with the VMA the memory space is located with a base register (CSR).
Came up with the ‘LPT’ (load pointer tag) instruction to calculate the index into the pointer tag memory for a given address. It takes the virtual memory address shifts it right by nine bits adds the tag memory base address, zeros the three LSB’s, then uses bits 3 to 8 of the VMA to select a bit from the indexed word. LPT replaces at least five or six instructions to do a load and a bitfield extract. Updating the pointer tag will have to be done with an RMW instruction. Possibly ‘SPT’ for set pointer tag.

Code:
   lea      $r3,disp[$rn]               ; getting tag for disp[Rn]
   shr      $r1,$r3,#9                  ; compute tag memory index
   csrrd   $r2,#PTMBASE,$r0            ; add in base address of tag mem
   add      $r1,$r1,$r2
   and      $r1,$r1,#$FFFFFFFFFFFFFFF8      ; mask off 3 LSB's
   lw      $r1,[$r1]                  ; load the tags
   shr      $r3,$r3,#3                  ; word index
   and      $r3,$r3,#63                  ; could probably eliminate this inst.
   shr      $r1,$r1,$r3                  ; extract bit corresponding to word
   and      $r1,$r1,#1                  ; mask off other bits shifted

_________________
Robert Finch http://www.finitron.ca


Thu Aug 16, 2018 3:15 am
Profile WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
There needs to be a pointer tag associated with each register. This is a single bit, effectively making registers 65 bits wide.
The ALU needs to calculate the pointer tag. Given arguments that are pointers sometimes the result is just a data value and not a pointer. For instance, the result of a compare instruction.
The internal busses need to be 65 bits wide.
There seems to be a lot of scanning involved to find the root objects. Pointers to the root object pointers need to be stored in an array. I wonder how big to make the array?

_________________
Robert Finch http://www.finitron.ca


Fri Aug 17, 2018 3:22 am
Profile WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
Even in the 32 bit world, people have squeezed a few bits out of their pointers. It does then lead to grief when those bits are later needed for addressing.

But in the 64 bit world, surely we can afford to take some bits from the top (and maybe the bottom) to have rich pointer types without going over 64?


Fri Aug 17, 2018 7:25 am
Profile
Display posts from previous:  Sort by  
Reply to topic   [ 775 posts ]  Go to page Previous  1 ... 12, 13, 14, 15, 16, 17, 18 ... 52  Next

Who is online

Users browsing this forum: No registered users and 11 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
cron
Powered by phpBB® Forum Software © phpBB Group
Designed by ST Software