 Thor Core / FT64 

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
I was going to suggest tagged memory a few days ago, but a lack of time prevented my posting. Here are two things that may or may not have already been considered in the use of tagged memory.

First, the Burroughs B5xxx series of processors pioneered the concept. They used tags to record the data type of each word in memory. One particular feature that stood out to me was that they were able to tag the data type of a word so that only one arithmetic instruction opcode was necessary to support integer or floating-point operations. In fact, mixed integer and floating-point operations were supported by a single opcode / instruction, and automatic conversions from one type to another were built into the processor / ALU.

Second, perhaps I've missed it, but I've not seen any mention of the new protected pointer types that the newer Intel architectures support. Apparently, they are not well supported by the compiler suites, but Intel has some new pointer types that operate similarly to the old '286/'386 segment register descriptors. In other words, the pointer is represented in memory by more than one word, and the additional words contain information that the processor can use to implement more protections for such things as uninitialized pointers, object reference counting, etc. I only glanced at / skimmed this material several years ago, but it might be pertinent to what you are attempting to do.

In addition, I saw the comment regarding the lack of a GC hardware accelerator. I think that you need to expand on this more. I think that a HW-accelerated GC should be the way to go, especially if you don't want the main core slowing down to perform this function on a regular basis. GC should be a background task handled by an asynchronous HW component of the core so that GC activities do not noticeably affect the processing of most tasks.

_________________
Michael A.


Fri Aug 17, 2018 1:00 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
First, the Burroughs B5xxx series of processors pioneered the concept. They used tags to record the data type of each word in memory. One particular feature that stood out to me was that they were able to tag the data type of a word so that only one arithmetic instruction opcode was necessary to support integer or floating-point operations. In fact, mixed integer and floating-point operations were supported by a single opcode / instruction, and automatic conversions from one type to another were built into the processor / ALU.

Definitely a plus for tagged data; it could save a lot of opcodes. But generally this job has been passed off to the compiler, which knows what data types to use.
Quote:
Second, perhaps I've missed it, but I've not seen any mention of the new protected pointer types that the newer Intel architectures support
I had not heard of those. More to research!

I'm backtracking on the tagging. The problem with a dedicated pointer tag memory is that every single store operation must be mirrored with an RMW operation to the tag memory. This cuts the performance of stores to about 1/3 of what it otherwise would be. It also doubles the code size required for stores.

Representing a pointer with a tag can be done in 64 bits; for sure there’s plenty of room in a 64-bit value. For my purposes I was thinking of setting the upper 24 bits to ‘PTR’ to indicate a pointer. That would allow 40-bit pointers to data, and the odds of a conflict with other numbers would be reduced to about 1 in 16 million. I suppose an exception could be generated if a value conflicted with a pointer. Another possibility is to use a floating-point NaN to indicate a pointer. This would prevent the FPU from digesting a pointer and would allow non-conflicting floating-point values in memory. Ya know, if the top 20 bits were set to 20’hFFF01, for instance, to represent a pointer, that would make it a NaN value and leave 44 bits for the pointer value. The only thing reduced is the range of negative numbers – there’s a hole in the range.
The ALU still needs to calculate the pointer tag, but internal busses would not need to be modified.
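
Spelling that scheme out in plain C (the helper names and masks below are just a sketch of the idea, not FT64 code):
Code:
#include <stdint.h>

#define PTR_TAG    0xFFF01ULL                    // top 20 bits marking a pointer
#define PTR_SHIFT  44
#define PTR_MASK   ((1ULL << PTR_SHIFT) - 1)     // low 44 bits hold the address

static inline uint64_t box_pointer(uint64_t addr)
{
   return (PTR_TAG << PTR_SHIFT) | (addr & PTR_MASK);
}

static inline int is_pointer(uint64_t word)
{
   return (word >> PTR_SHIFT) == PTR_TAG;        // just a shift and a compare
}

static inline uint64_t unbox_pointer(uint64_t word)
{
   return word & PTR_MASK;                       // recover the 44-bit address
}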

The IsPointer() function now looks like:
Code:
naked inline int IsPointer(register int *p)
{
   __asm {
      asr      $r1,$r18,#44                  ; shift right 44: the tag sign-extends into the low bits
      cmp      $r1,$r1,#$FFFFFFFFFFFFFF01    ; compare against the sign-extended $FFF01 tag
      not      $r1,$r1                       ; turn the compare result into a boolean
   }
}

The compiler’s broken: auto incrementing/decrementing a value in a bitfield doesn’t work. The problem is that when the field is dereferenced, the value is placed in a register. The auto increment logic then says “aha, the value’s in a register” and just outputs “add reg, reg, #1” instead of putting the value back in the bitfield. Fixed this one. It takes about a dozen instructions to manipulate the bitfield, so the field in question was changed to a byte to increase performance.
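
For reference, the failing case was an increment of a packed field, something like the following (plain C with made-up field widths). The correct code has to extract the field, add one, and insert it back into the containing word, which is where the dozen or so instructions come from:
Code:
struct Flags {
   unsigned count : 4;      // packed field: every access is an extract/insert
   unsigned other : 28;
};

void bump(struct Flags *f)
{
   f->count++;   // must read the word, update the field, and write it back;
                 // the bug emitted only the add on the extracted value
}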
Pointer comparisons were sometimes using signed compares. More code needed to be added whenever a pointer type was set, to change the compare status from signed to unsigned.

_________________
Robert Finch http://www.finitron.ca


Sat Aug 18, 2018 3:22 am

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
There's a discussion of these pointer extensions on Wikipedia. It looks like not everyone is convinced that these enhancements are well thought out or compatible with common C/C++ usage idioms. In fact, several competing software-based alternatives may be more effective and efficient.

_________________
Michael A.


Sat Aug 18, 2018 6:11 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Bounds Checking:
The Intel bounds checking registers look similar to a set of bounds checking registers I had in another CPU a few years ago. It could be I heard mention of them back then. IIRC the bounds registers worked with the CHK instruction. There’s still a set of bounds registers for the stack (the SBL and SBU CSRs) in FT64 which check all stack references to make sure they are in range. I have them effectively disabled by setting the bounds to the max range; otherwise they are additional state to be managed on a thread switch.

I think the idea in my case was to optimize the CHK instruction so that registers didn’t have to be loaded with bounds information while the program was running. Instead, the bounds would be set at program start. Most of the bounds to check are statically determined; for instance, is a pointer within the data segment? Hopefully this would reduce the need for registers in functions.
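
In C terms such a check boils down to a single compare against bounds that are fixed at program start, roughly like this (the names here are hypothetical):
Code:
#include <stdint.h>

static uintptr_t data_lo, data_hi;      // set once at program start

void set_data_bounds(void *lo, void *hi)
{
   data_lo = (uintptr_t)lo;
   data_hi = (uintptr_t)hi;
}

// What a CHK against the data segment amounts to: no per-function
// bounds registers needed, just a compare against the fixed limits.
int chk_data(const void *p)
{
   uintptr_t u = (uintptr_t)p;
   return u >= data_lo && u < data_hi;
}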

Hardware GC:
GC appears to be too complex to support entirely in hardware. It looks like it works in phases, for instance determining all roots before moving on to the next phase.
One item I spotted for hardware support is the marking queue. I’m not sure it would help very much though. It could be a hardware queue: a simple store to an address pushes an item, a load pops an item, and the queue status (empty, full) sits at another location. A hardware queue might be able to trim several instructions off the push/pop operations. This is another spot where a generic hardware queue could be used in the system. I mentioned before the use of hardware queueing by the OS to handle the ready / waiting lists.
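
Roughly what the software side of such a queue would look like, assuming a memory-mapped interface (the addresses and status bits below are made up for illustration):
Code:
#include <stdint.h>

#define GCQ_DATA    ((volatile uint64_t *)0xFFDC1000)   // hypothetical MMIO addresses
#define GCQ_STATUS  ((volatile uint64_t *)0xFFDC1008)
#define GCQ_EMPTY   (1ULL << 0)
#define GCQ_FULL    (1ULL << 1)

// One store pushes a mark-queue entry; returns 0 when the hardware queue
// is full so the caller can fall back to a software queue.
static inline int gcq_push(uint64_t item)
{
   if (*GCQ_STATUS & GCQ_FULL)
      return 0;
   *GCQ_DATA = item;
   return 1;
}

// One load pops an entry; returns 0 when the queue is empty.
static inline int gcq_pop(uint64_t *item)
{
   if (*GCQ_STATUS & GCQ_EMPTY)
      return 0;
   *item = *GCQ_DATA;
   return 1;
}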

Tonight, I’m contemplating adding to the CC64 language to provide compiler support for GC.
One issue is that the GC scans entire objects to try and detect pointers. If only there were a way to send a command to the GC to tell it to skip over words while scanning. I was thinking of having the compiler output a special constant field in objects (structures) that indicates what to skip over. The constant would be a second NaN value that specifies the number of words to skip over. The problem is this field would have to be set at runtime. If the field wasn’t properly set, the GC wouldn’t skip the following fields and hence might falsely identify some data as pointers. So code would still work, just with memory leaks.
Code:
typedef struct _tagObject align(64) {
   int magic;               // identifies a managed object (OBJ_MAGIC)
   int size;                // object size
   __gc_skip skip1 {
      __int32 typenum;
      __int32 id;
      __int8 state;         // WHITE, GREY, or BLACK
      __int8 scavangeCount;
      __int8 owningMap;
      __int8 pad1;
      __int32 pad2;
      unsigned int usedInMap;
   };
   struct _tagObject *forwardingAddress;   // set when the object is moved
   void (*finalizer)();                    // called before the object is reclaimed
} __object;
...
P = new __object;
P->skip1 = 3;   // Tell the GC to skip 3 words

Knowing what to skip over while GC’ing would help.

_________________
Robert Finch http://www.finitron.ca


Sun Aug 19, 2018 2:43 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Added a couple of instructions to the instruction set: ISNULL and ISPTR. ISNULL detects whether a register contains a null pointer, which in many cases is not a simple zero. ISPTR detects whether a register contains a pointer. Both instructions are at least two words more compact than the usual instruction mix.

Just trying to get the compiler to recognize when it can use a conditional move instruction as in:
Code:
   return (IsPointer(rg) ? rg : 0);

_________________
Robert Finch http://www.finitron.ca


Mon Aug 20, 2018 3:08 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Got the compiler to do a conditional move for hook operations. The conditional move instruction is used about ten times out of 22,000 instructions. Been thinking of dropping it.

The compiler now considers functions that only contain inline functions to be leaf functions.

Toying with the idea of adding a ‘findptr’ instruction. This instruction would scan a memory region until it finds a pointer value. It’s another function where it’d be nice to have dual result busses as it could return both the value found and an updated search pointer. It’ll have to settle for updating the search pointer. The value can be retrieved by doing a load at the search pointer.
The original Thor had a small set of string instructions. It’s tempting to add several string instructions to FT64.

An example using findptr:
Code:

static naked inline int FindPtr(register int *start, register int *end)
{
   __asm {
      findptr   $v0,$a0,$a1
   }
}

static void GetGlobalRootsInDataArea()
{
   int *ptr, *qtr;
   int *endptr;
   int nn;
   ACB *pACB;

   _SetDataLevel(OL_USER);
   pACB = (ACB *)0;
   pACB->gc_rootcnt = 0;
   ptr = pACB->pData;
   endptr = pACB->pData+(pACB->pDataSize >> 3);
   do {
      nn = FindPtr(ptr,endptr);
      qtr = (int *)ptr[nn];
      if (IsPointer(qtr)) {
         if (!IsNullPointer(qtr)) {
            if (*qtr==OBJ_MAGIC) {
               pACB->gc_roots[pACB->gc_rootcnt] = qtr;
               pACB->gc_rootcnt++;
            }
         }
      }
      ptr += nn + 1;
   } while (ptr <= endptr);
   _SetDataLevel(OL_MACHINE);
}

_________________
Robert Finch http://www.finitron.ca


Tue Aug 21, 2018 1:36 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
A command value for the garbage collector was defined. The NaN code $FFF02xxx… is now generated by the compiler; it indicates to the garbage collector the number of words to skip over while scanning. It’s used in the data segment as the following example shows. Several structures and array variables were declared.
Code:
typedef struct
{
   int a;
} A;

typedef struct
{
   char s[50];
   A a1[25];
   A a2;
   A a3;
   int b;
} B;

typedef struct
{
   int c;
   A *pa;
} C;

char c1;
__int8 i8;

A gblA;
B gblB = {
   "Hello world"
};

int iarray[100][20];

B sarray[100][30];

C cvar;

B s2array[100][30];

Code:
    data
   align   8
   bss
   align   2
public bss _c1:
   fill.b   2,0x00

endpublic
   align   2
public bss _i8:
   fill.b   1,0x00

endpublic
   data
   align   8
   fill.b   5,0x00
   bss
   align   8
   align   8
   dw   $FFF0200000000002; GC_skip
public bss _gblA:
   fill.b   8,0x00

endpublic
   data
   align   8
   align   8
   dw   $FFF0200000000029; GC_skip
public data _gblB:
   dc   72,101,108,108,111,32,119,111
   dc   114,108,100,0
   fill.b   76,0x00
   fill.b   228,0x00

endpublic
   bss
   align   8
   align   8
   dw   $FFF02000000007D0; GC_skip
public bss _iarray:
   fill.b   16000,0x00

endpublic
   align   8
   align   8
   dw   $FFF020000001E078; GC_skip
public bss _sarray:
   fill.b   984000,0x00

endpublic
   align   8
   align   8
                         ; GC_skip
public bss _cvar:
   fill.b   16,0x00

endpublic
   align   8
   align   8
   dw   $FFF020000001E078; GC_skip
public bss _s2array:
   fill.b   984000,0x00

endpublic
   rodata
   align   16
;   global   _i8
;   global   _cvar
;   global   _s2array
;   global   _gblA
;   global   _gblB
;   global   _iarray
;   global   _sarray
;   global   _c1
Notably, the variable cvar can’t be skipped over because it contains a pointer. Also note that a skip command is output only for structures and arrays. This may be improved in the future by grouping together primitive types to skip.
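
On the GC side, consuming those markers while scanning would look something like this in plain C (the tag layout follows the $FFF02 code above; exactly what the count covers is glossed over here):
Code:
#include <stdint.h>

#define PTR_TAG    0xFFF01ULL
#define SKIP_TAG   0xFFF02ULL
#define TAG_SHIFT  44
#define VAL_MASK   ((1ULL << TAG_SHIFT) - 1)

// Scan a region of words, skipping runs flagged by GC_skip markers and
// handing candidate pointers to the marker callback.
static void scan_words(const uint64_t *p, const uint64_t *end,
                       void (*mark)(uint64_t addr))
{
   while (p < end) {
      uint64_t tag = *p >> TAG_SHIFT;
      if (tag == SKIP_TAG) {
         p += (*p & VAL_MASK) + 1;   // step past the marker and the flagged words
      } else if (tag == PTR_TAG) {
         mark(*p & VAL_MASK);        // candidate pointer: queue it for marking
         p++;
      } else {
         p++;                        // plain data
      }
   }
}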

_________________
Robert Finch http://www.finitron.ca


Wed Aug 22, 2018 2:06 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Compiler:
Cleaned up some of the variable initializer code and added the capacity to initialize unions. Initializing a union looks at the constant value’s type and sees if there is a similar type in the union. If not, an error occurs.
Migrated some of the code from global functions to class members.
Found an error in the compiler: the CSE table for an inline function was being deleted after the first use. If the function is inline, the CSE table is no longer deleted -> memory leak.

_________________
Robert Finch http://www.finitron.ca


Thu Aug 23, 2018 2:48 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
More Compiler:
Modifications to the symbol class are in the works. The symbol class is used for all symbols the compiler can process. Symbols include local and global variables and functions and methods. An issue with the symbol class is that it holds fields for all the different types of symbols. There are a lot of fields related only to functions and methods. So a lot of space is wasted in the symbol table. It would be better if the additional function and method information were stored in a secondary class. There is only a single symbol table in the compiler. It supports up to 32,000 symbols in a lexical unit. There is unlikely to be anywhere near that number of functions / methods in a single unit (file).
Moved the additional function / method information out to another class referenced from the symbol class. Set a limit of 3,000 functions in a single file.
There was an error in the compiler where an indicator for a function pointer wasn’t being reset. This caused a dereferencing error sometimes. The error has been in the compiler a long time.
Managed to fix several other compiler bugs.

The assembler is being used to link all the source files together. It’s outputting thousands of lines of hex dumps without source code comments and I don’t know why yet. It’s supposed to be outputting a mixed listing.

_________________
Robert Finch http://www.finitron.ca


Fri Aug 24, 2018 3:02 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Played with the idea of adding string instructions to the ISA. The string instructions would be interruptible, but one would have to test for completion of the operation after the string instruction, and loop back if not completed. Or, disable interrupts while the string instruction is executing.
Code:
.cont:
   strlen   $v0,$a0            ; count characters starting at $a0
   lc       $v1,[$a0+$v0*2]    ; load the character the count stopped on
   bne      $v1,$r0,.cont      ; not the terminator? interrupted, so resume


Also re-thinking dual result busses (because it’s a cool idea), which would allow:
MOV2 – move two registers to two targets (can be used to implement XCHG); does two move instructions in a single instruction
POP (PUSH requires only one target)
LINK
UNLINK
DEMUX
DIVMOD – return both the calculated quotient and remainder (see the sketch after this list)
MUL – return both the high- and low-order products
Dual result busses could be limited to a single ALU, forcing instructions requiring them to be executed on that ALU. I previously removed dual busses from the design, but I had dual busses on both ALU’s.
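
DIVMOD is the clearest case; in C the two results can only be handed back packaged together, which is exactly the cost a second result bus avoids:
Code:
struct divmod_t { long q, r; };

// Quotient and remainder come out of the same divide; a dual result bus
// would let the hardware write both back from one instruction.
struct divmod_t divmod(long a, long b)
{
   struct divmod_t d = { a / b, a % b };
   return d;
}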

Increased the number of interrupt levels to 16 from 8. The plan is to use several levels exclusively for garbage collection. Since these levels can’t be used for anything else, more levels were required. Not all interrupt levels need be used.

Added code to the core to recognize cache preloads. A preload instruction is signified by the target register of a load being r0. For preloads errors are ignored.

_________________
Robert Finch http://www.finitron.ca


Thu Aug 30, 2018 3:14 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
I noticed tonight that there’s a fair number of shift-left instructions followed by an add operation. This occurs typically during the calculation of memory addresses. With 48-bit instructions available I got to thinking that these operations could be combined into a single 48-bit opcode. The instruction would have to take two cycles to get through the ALU rather than cascading both operations in a single cycle. So there’s little if any performance benefit to using a combined instruction. However, it does consume less memory, so there may be some benefit for the cache. Performing a shift operation and a second operation with the same instruction reminds me of an ARM processor.
A right shift followed by an ‘and’ operation can be used to do a bit field extract.
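
Both patterns written out in C, just to show what the fused forms would cover:
Code:
#include <stdint.h>

// Shift-left then add: the usual scaled-index address calculation.
uint64_t index_addr(uint64_t base, uint64_t index)
{
   return base + (index << 3);      // base + index * 8
}

// Shift-right then and: extract a 'width'-bit field starting at bit 'pos'.
uint64_t extract_field(uint64_t word, unsigned pos, unsigned width)
{
   return (word >> pos) & ((1ULL << width) - 1);
}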

_________________
Robert Finch http://www.finitron.ca


Fri Aug 31, 2018 3:20 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1780
They put rather a lot of chip area into their barrel shifter, so it must somehow have seemed worthwhile. It might be that it used up encoding bits which would otherwise be idle. As the relatively low instruction density of ARM is one of its defects, that might have been compelling. They particularly wanted a fast Basic interpreter, I think, and perhaps also a fast 6502 emulation, so that might have been a consideration.

The nice thing about putting a shifter into your machine, and your ISA, is that a less aggressive implementation can take more time, while a more aggressive implementation can use more resources for greater throughput. It gives you a degree of freedom.

Squashing two operations into one opcode maybe helps memory bandwidth, not just by lowering cache pressure but by freeing up cycles for write backs?


Fri Aug 31, 2018 6:29 am

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
Quote:
The instruction would have to take two cycles to get through the ALU rather than cascading both operations in a single cycle. So there’s little if any performance benefit to using a combined instruction.
I think that you are not crediting the reduction in memory bandwidth as a measure of improved performance. Somewhere recently I've read that "performance is measured in bandwidth". (Not a direct quote, but I think that it expresses a correct view regarding the source of performance. Reflecting further on the source, I think this perspective was expressed by Seymour Cray.) Reducing the bandwidth requirements for cache / memory allows bandwidth to be preserved / transferred to other critical areas such as your register interfaces. It may also be possible to conditionally execute the other instruction(s) contained in a packed instruction representation in any free pipelines that your architecture may implement. In other words, instead of serial dispatch of the packed instructions through a single pipeline, perhaps a simpler, parallel pipeline could be used to perform simple operations like register-register moves. A pipeline of this sort would support many common operations that are performed in common algorithms. The address generation sequence you describe could be one of those supported by such a configuration. In addition, many signal/image processing algorithms have a similar structure.

_________________
Michael A.


Fri Aug 31, 2018 11:40 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Backtracked and removed the pointer logic in the arithmetic operations. The only arithmetic operations allowed on pointers by the compiler are +, -, ++, and --, and there is no need for extra logic for those. Subtracting two pointers should naturally result in a non-pointer. The only thing to watch out for is overflow during an add causing the pointer status to change to non-pointer, so I added an ‘addp’ instruction to add to a pointer and detect overflow.
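
A sketch of what addp has to guard against, assuming the NaN-style tag in the top bits from earlier (purely illustrative; the real tag handling may differ):
Code:
#include <stdint.h>

#define TAG_SHIFT  44
#define ADDR_MASK  ((1ULL << TAG_SHIFT) - 1)

// Add an offset to a tagged pointer; fail if the add carries out of the
// low address field, which would corrupt the tag and turn the word into
// a non-pointer.
int addp(uint64_t tagged, uint64_t offset, uint64_t *out)
{
   uint64_t addr = (tagged & ADDR_MASK) + offset;
   if (addr & ~ADDR_MASK)
      return 0;                            // overflow detected
   *out = (tagged & ~ADDR_MASK) | addr;    // tag bits preserved
   return 1;
}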

I noticed the conditional move instruction had a lot of extra unused bits in it. It has to be a 48-bit instruction. So I dedicated a couple of bits to specifying other conditional operations. I added an immediate mode to the instruction and a conditional add operation. Conditional add might be useful for a multiply function. There are a couple more opcode slots available for other operations.

Added a bunch of floating-point set instructions.
I’m thinking of adding a new operator to the compiler, ‘??’ (double question). This operator tests whether two floating-point numbers are ordered with respect to each other (i.e., neither is a NaN). Otherwise there isn’t really a way to test for orderedness at a high level.
Code:
if (a ?? b)
    printf("Ordered");
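
In standard C the equivalent test is already spelled isunordered(); the ?? operator would just give it first-class syntax:
Code:
#include <math.h>
#include <stdio.h>

int main(void)
{
   printf("%d\n", !isunordered(1.0, 2.0));   // 1: both operands are ordered
   printf("%d\n", !isunordered(1.0, NAN));   // 0: a NaN is unordered
   return 0;
}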


Added run-time determined (rtd) operations. The operation to be performed is determined at run-time. I’ve previously called these mystery operations.
Code:
op = OP_PLUS;
c = __rtop(op, a, b);
op = OP_GT;
if (__rtop(op,a,b))
    printf("hi");


I've swung back to the pointer tags, using additional words of memory to represent the tag bits.
I'm now working on the data cache. Virtual addresses are multiplied by 33/32 to give an extra hidden word for every 32 words; this is about 3% overhead.
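
Say the hidden word sits after each group of 32 data words (that placement is just an assumption for this sketch); then the mapping works out to:
Code:
#include <stdint.h>

#define GROUP_WORDS 32

// Physical word index of a data word: one extra word is inserted per 32,
// which is the 33/32 scaling.
uint64_t phys_word(uint64_t vword)
{
   return vword + vword / GROUP_WORDS;
}

// Physical word index of the hidden tag word covering that data word.
uint64_t tag_word(uint64_t vword)
{
   return (vword / GROUP_WORDS) * (GROUP_WORDS + 1) + GROUP_WORDS;
}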

_________________
Robert Finch http://www.finitron.ca


Tue Sep 04, 2018 3:39 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2095
Location: Canada
Quote:
In other words, instead of serial instruction dispatch of the packed instructions through a single pipeline, perhaps a simpler, parallel pipeline could be used to perform simple operations like register-register moves. A pipeline of this sort would support many common operations that are performed in common algorithms.

I was thinking along those lines, but there's very little (none at the moment) room in the FPGA I'm using. It would be better as a parallel pipeline because then the throughput isn't being affected, just the latency.

Spent some time fixing up minor bugs and stubbing out some minor functionality in order to try synthesizing the RTL code. The code’s synthesizing right now; it could be a while.

_________________
Robert Finch http://www.finitron.ca


Wed Sep 05, 2018 4:54 am