 Thor Core / FT64 

Joined: Fri Mar 22, 2019 8:03 am
Posts: 164
Location: Girona-Catalonia
Hi Rob,
It looks like the availability of the new operators is a useful addition. I ran some tests on the LLVM compiler with your proposed expressions and others and, despite my earlier suggestion about always following the short-circuit rule, I found that maybe it's not that hard to optimise this for the general case. Looking at the generated code, the compiler seems to follow a relatively simple set of rules to decide when it is worth (or possible and safe) to break the short-circuit rule, so maybe you are still able to consider that.

The compiler seems to break the short-circuit rule when:

- Only simple scalar variables or constants are used in the right-hand side expression (for example integer or floating-point values), and the operations are performed on registers. Pointer arithmetic is explicitly excluded, even if it's just a single load.
- Only basic arithmetic and logical operations are performed on the right-hand side. Division is always explicitly excluded, even when the divisor is a non-zero constant. Function calls are excluded too, as are significantly long expressions.

In all other cases, branching code is created.

So in light of this, the compiler seems to only optimise away the branching if there is no memory access (in any sense) and no division on the right-hand side. Maybe there's a more complicated algorithm under the hood, and I certainly have not tested all possible scenarios, but it seems that by just checking the above in a conservative way you may be able to reduce branching significantly. After all, most conditional and logical expressions tend to be very simple, so the chances of them being optimised are good. Anyway, I leave this here as some information on what I found in the code generated by LLVM.


Sun Apr 21, 2019 9:04 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
I re-wrote the peephole optimizer, converting it to C++ code, and got the compiler almost back to a working state. The source code for one function, though, causes a dereferencing error. The error arises because the compiler infers a function definition for a function that it can’t find in its tables. The function has both a prototype and a definition, and for some reason, for just this one function, the compiler is treating them as separate function definitions, as if they were two different functions. With two entries in the function table, when the function is instanced the compiler gets confused, says the instance doesn’t match, and creates yet a third copy of the function in the function table (assuming it just returns an int). Unfortunately, the function signature does match how it’s instanced, so the compiler then just spits out a dereferencing error as if it couldn’t find the function.
It’s very complicated to try and explain. What’s driving me nuts is that things are working in many other cases, and it works fine in a previous version of the compiler. IIRC I recompiled the entire C standard library and got this error in just one spot. How? is what I wonder.

_________________
Robert Finch http://www.finitron.ca


Mon Apr 22, 2019 8:54 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
Well, the previous issue was hacked to work. The compiler’s declaration processing needs to be cleaned up some more, specifically the processing of parameter lists. The hack looks for identical function entries in the function table, and if that’s all it finds when trying to dereference the function, it accepts an entry from the table instead of creating a new one.

The compiler was also spitting out an error for an expression taking the address of a struct variable. Like the following:
Code:
;    const char point = (&_Locale)->decimal_point[0];
            lea         $t0,#__Locale
            lw          $t0,72[$t0]
            lc          $t1,[$t0]
            sc          $t1,-2[$fp]

Another hack was added: an ‘address of’ expression node type just for taking the address of struct types. This was supposedly already handled in the expression processing code, but I guess not.
There was one line of code affected by this.

_________________
Robert Finch http://www.finitron.ca


Tue Apr 23, 2019 3:23 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
The BNEI instruction has crept into the instruction set. There is already a BEQI instruction, so BNEI adds some symmetry. BEQI was added to handle switch statements with small constants. BNEI was initially thought to be used not that often, but it turns out to be used in ‘if’ statements like: if (*s == ‘c’) continue;
The branch is actually the opposite of the ‘==’ because we want to branch to the ‘else’ part when the condition fails. BNEI replaced most of the XOR-then-branch instruction pairs with a single operation.
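A sketch of the difference in this assembler's style (instruction sequences are my illustration, not actual compiler output):
Code:
```
; if (*s == 'c') continue;  -- XOR-then-branch form:
        lb      $t0,[$a0]        ; t0 = *s
        xor     $t0,$t0,#'c'     ; t0 == 0 only when *s == 'c'
        bne     $t0,$r0,skip     ; branch to 'else' when not equal
        bra     loop_top         ; the 'continue'
skip:

; with BNEI the compare and branch fuse into one instruction:
        lb      $t0,[$a0]
        bnei    $t0,#'c',skip2
        bra     loop_top
skip2:
```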

_________________
Robert Finch http://www.finitron.ca


Wed Apr 24, 2019 2:49 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
Made 40-bit jumps and calls a config option (JMP40). It takes extra hardware to support them and they are currently unused.
Added a set of logical branches to the ISA. These branches perform a logical operation (and, or, nand, nor, xor, xnor) and then branch based on the result. The xnor and xor branches are similar to BEQ and BNE except that they reduce the values to logical 1 or 0 before testing.
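A sketch of what a fused logical branch might buy (mnemonics here are assumed for illustration; the actual encodings are per the ISA):
Code:
```
; if (a == b && c == d) { ... }  -- separate logic op then branch:
        seq     $t0,$r1,$r2       ; t0 = (a == b)
        seq     $t1,$r3,$r4       ; t1 = (c == d)
        and     $t0,$t0,$t1
        beq     $t0,$r0,else_part

; with a logical branch, the AND and the branch fuse (sketch):
        seq     $t0,$r1,$r2
        seq     $t1,$r3,$r4
        bnand   $t0,$t1,else_part ; branch when !(t0 & t1)
```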

_________________
Robert Finch http://www.finitron.ca


Thu Apr 25, 2019 2:35 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
The opcodes for indexed loads were documented incorrectly in the text (they were correct in the opcode map). The volatile load half-word instruction wasn’t documented.

Added some compound operations to the ALU. They were planned in the ISA but not implemented. The compiler was spitting out sized shift operations, and those are available only as compound shifts. (A compound operation is two operations in the same instruction.) Compound operations need an extra cycle in the ALU and are 48-bit instructions, so they are both larger and slower than most other ops; most of the time it’s better to use two separate instructions. Some operations were not possible to encode in 32 bits, however (for example, sized shift operations). To make better use of a 48-bit instruction, a second opcode was allowed for. Compound operations are represented in source code by separating the opcodes with a ‘:’, like the following:
Code:
shl:add  r1,r2,r3,r4

which shifts r2 left by r3 and then adds r4 to the result.
In the case of the compiler output, a sized right shift was required, which could be written as:
Code:
shr:nop.c r1,r2,#5,r0   ; shift the low order 16-bits of r2 right five times

or
Code:
shr.c r1,r2,#5

which the assembler also recognizes.

Forgot to include code to convert a double-indexed address mode to a single-indexed address mode when substituting constant values for registers in operands. This caused the screen-scroll routine to fault on a bad address.
To generate branch-and and branch-or operations the compiler looks at the number of instructions generated; if the logical branch requires too many instructions, the compiler dumps the generated code and redoes the branch using regular short-circuit branches. Well, I put in a limit on the number of instructions but forgot to qualify it with a logical branch actually being generated. Result: all ‘if’ statements with more than 10 instructions had their generated output removed. I had a heck of a time tracking this bug down because most of the time 10 instructions is adequate to generate the ‘if’ expression.

At the moment, there are several places in the code generator where it takes the number of instructions into consideration. However, it doesn’t look at the type of the instructions. At some point this will be changed to take each instruction’s cost relative to other instructions into consideration. The Instruction class already has a cost field associated with it; it’s just a matter of a little more code.

Updating the peeplist to C++ caused inline assembly code generation to fail because each function now has its own peeplist. A copy of the inlined function’s peeplist needed to be inserted into the function being generated wherever the inline function is called.

_________________
Robert Finch http://www.finitron.ca


Fri Apr 26, 2019 3:18 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
Put some more work into optimizing index scaling. The index-scaling optimization looks for shift operations prior to a double-indexed operation and removes the shift by specifying an index scale where possible. Before the optimization the code looks like:

Code:
 ;    for (nn = 0; nn < count; nn++)
            mov         $r11,$r0
console_208:
            bge         $r11,$r12,console_209
;       scrn[nn] = scrn[nn+(int)j->VideoCols];
            shl         $t0,$r11,#3
            lc          $t3,4338[$r13]
            add         $t2,$r11,$t3
            shl         $t1,$t2,#3
            lw          $t2,[$r14+$t1]
            sw          $t2,[$r14+$t0]
            add         $r11,$r11,#1
            bra         console_208
console_209:


After optimization the code looks like:
Code:
 ;    for (nn = 0; nn < count; nn++)
            mov         $r11,$r0
console_208:
            bge         $r11,$r12,console_209
;       scrn[nn] = scrn[nn+(int)j->VideoCols];
            lc          $t3,4338[$r13]
            add         $t2,$r11,$t3
            lw          $t2,[$r14+$t2*8]
            sw          $t2,[$r14+$r11*8]
            add         $r11,$r11,#1
            bra         console_208

Two left-shift operations have been removed from the loop and replaced with scaling values in the indexed operations.
The code with no optimization looks like:
Code:
             ;              for (nn = 0; nn < count; nn++)
            sw          $r0,-16[$fp]
console_208:
            lw          $t0,-16[$fp]
            lw          $t1,-24[$fp]
            bge         $t0,$t1,console_209
            ;                 scrn[nn] = scrn[nn+(int)j->VideoCols];
            lw          $t1,-16[$fp]
            mulu        $t0,$t1,#8
            lw          $t1,-8[$fp]
            lw          $t4,-16[$fp]
            lw          $t5,-32[$fp]
            lc          $t5,4338[$t5]
            add         $t3,$t4,$t5
            mulu        $t2,$t3,#8
            lw          $t3,-8[$fp]
            lw          $t4,[$t3+$t2]
            sw          $t4,[$t1+$t0]
            lw          $t0,-16[$fp]
            add         $t0,$t0,#1
            sw          $t0,-16[$fp]
            bra         console_208
console_209:


The next optimization to work on is loop invariants. The “lc $t3,4338[$r13]” could be moved outside of the loop.
After loop invariant optimization:
Code:
 ;    for (nn = 0; nn < count; nn++)
            mov         $r11,$r0
            lc          $t3,4338[$r13]
console_208:
            bge         $r11,$r12,console_209
;       scrn[nn] = scrn[nn+(int)j->VideoCols];
            add         $t2,$r11,$t3
            lw          $t2,[$r14+$t2*8]
            sw          $t2,[$r14+$r11*8]
            add         $r11,$r11,#1
            bra         console_208
console_209:

The stack footprint was reduced. Functions previously allocated stack storage for temporary registers and register variables assuming the worst possible stack usage. This has been changed to allocate only what is actually used, based on register spill depth.

I continue to work on hardware issues. The core is hanging during an instruction cache load. Exactly why hasn't been determined yet.

_________________
Robert Finch http://www.finitron.ca


Sat Apr 27, 2019 2:16 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
It seems like I’ve been doing a lot of work on the compiler lately. It’s just that I’ve had lots of time between builds for the hardware :) I’ve been doing a lot of experimentation to try and figure out which signals are holding up the bus access. The signals for bus access (cyc, stb) are active and the rom chip select is active, but there’s no acknowledge signal coming back from the rom. However, it’s not a static problem, as the core boots (from rom) and then hangs part-way into the monitor program.

The compiler was outputting case tables twice for table-based switch statements. This consumed extra storage space and led to undefined/unused symbols.
The label for the default case also wasn’t output by the compiler, which caused unresolved references.

Added the ability to directly push an immediate constant on the stack. I got tired of seeing code like:
Code:
ldi  $a0,#123
push  $a0

Pushing a constant on the stack is a fairly common operation. There are worse ways to waste hardware, and we have the transistors. The assembler was encoding the 32-bit form of the push instruction incorrectly; however, that form was never used.
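For comparison, the before/after shape (a sketch; the actual encoding is per the new instruction form):
Code:
```
; before: two instructions and a scratch register
        ldi     $a0,#123
        push    $a0

; after: one instruction, no scratch register needed
        push    #123
```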

Added the loop inversion optimization; been meaning to do that for a while. Loop inversion moves the conditional test to the end of the loop, which eliminates a branch instruction per iteration.

Well, hardware-wise I found out the rom chip select wasn’t active. After placing the address bus under scrutiny, it was revealed that the address going to the rom is wrong: it’s $FF…FF0. The address coming out of the cpu is correct; the address coming out of the mpu isn’t. So, I’m guessing that the mmu is active for some reason when it shouldn’t be.

_________________
Robert Finch http://www.finitron.ca


Sun Apr 28, 2019 2:43 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
Labels for case statements were only being generated if there was code in the case statement. It took some searching to find where the label generation was being suppressed. This caused an unresolved label with table-based switch statements. The compiler was altered to always output the label for a case statement.

Changed the way sprite address generation is done. The issue was that the sprites had a rolling display. The address generators used to just accumulate the address, incrementing when triggered by a horizontal pulse. Now the address generators calculate the address based on the scan position relative to the sprite’s position, so if a trigger point is missed the calculated address will still be correct for the next access. It turns out not to be any more hardware.
I also added a config option to omit the lower resolution sprite modes in order to reduce the hardware footprint.
I went through all that work to modify the sprite controller and it didn’t fix things; it was just a wild guess at timing issues, which turned out to be wrong. The problem turned out to be in the multi-port memory controller (MPMC). The address test for cache access wasn’t taking into consideration the different addresses for each sprite. It just did a global compare; what was needed was a separate compare for each sprite. This added a bunch more registers and compare logic to the MPMC.

Data for the sprite images is being loaded from dram, which shows that the dram controller is working.

_________________
Robert Finch http://www.finitron.ca


Mon Apr 29, 2019 3:11 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
ToDo: fix aggregate assignments. They don’t quite work; close, but no cigar. I have a small test program for aggregate assignments. I got stuck for the longest time wondering why the address calculation was returning the same address for array elements x[2][0][0] and x[1][5][0]. Then I finally realized that the second dimension has only five elements, meaning the highest allowed value for that index is 4. So the code was correct after all: x[2][0][0] and x[1][5][0] do have the same address when x is dimensioned as x[3][5][6].

Code:
void TestArrayAssign4(int aa)
{
   int x[3][5][6];
   int y[7][2][10];
   int j = (int){15,20,25};
   int *k;
   
   x[2][0][0] = 21;
   x[1][4] = (int[6]){1,2,3,4,5,6};
   k = &x[2];
   x[2] = (int[5][6]){{10,2,1,0},{9,6,2},{8},{7},{6}};
   x = y;
}


You may have noticed the line ‘(int[5][6]){…}’. It’s a list typecast. Elements of the list are assigned types according to those identified in the cast. For constant values this is important because, unless the compiler is told what types of values it’s dealing with, it will assume byte integers (the shortest representation), for instance. The compiler just emits a call to a copy routine to handle the assignment.

One thing I don’t know how to do yet is remove the redundant constant data that gets generated.
TestArrayAssign_3 to _7 are redundant data. They get generated because of the order the constant list is processed in with nested constants. The only thing I can think of at the moment is to write a data peephole optimizer that looks for unreferenced labels and removes the associated data.

Code generated looks like the following:
Code:
   code
   align   2
;====================================================
; Basic Block 0
;====================================================
public code _TestArrayAssign4:
            sub         $sp,$sp,#32
            sw          $fp,[$sp]
            sw          $r0,8[$sp]
            mov         $fp,$sp
            sub         $sp,$sp,#1880
            sw          $r11,0[$sp]
            lea         $v0,-720[$fp]
            mov         $r11,$v0
;    int x[3][5][6];
            lea         $v0,TestArrayAssign_1
            lw          $v0,[$v0]
            sw          $v0,-1848[$fp]
;    x[2][0][0] = 21;
            ldi         $v0,#21
            sw          $v0,480[$r11]
;    x[1][4] = (int[6]){1,2,3,4,5,6};
            add         $v0,$r11,#432
            lea         $v1,TestArrayAssign_2
            mov         $a0,$v0
            mov         $a1,$v1
            ldi         $a2,#48
            call        __aacpy
;    k = &x[2];
            add         $v0,$r11,#480
            sw          $v0,-1856[$fp]
;    x[2] = (int[5][6]){{10,2,1,0},{9,6,2},{8},{7},{6}};
            add         $v0,$r11,#480
            lea         $v1,TestArrayAssign_8
            mov         $a0,$v0
            mov         $a1,$v1
            ldi         $a2,#240
            call        __aacpy
;    x = y;
            lea         $v0,-1840[$fp]
            mov         $a0,$r11
            mov         $a1,$v0
            ldi         $a2,#1120
            call        __aacpy
            lw          $r11,0[$sp]
            mov         $sp,$fp
            lw          $fp,[$sp]
            ret         #32
endpublic



   rodata
   align   16
   align   8
TestArrayAssign_1:
db   15
db   20
db   25
fill.b 5,0x00
TestArrayAssign_2:
dw   1
dw   2
dw   3
dw   4
dw   5
dw   6
TestArrayAssign_3:
dw   10
dw   2
dw   1
dw   0
fill.b 8,0x00
TestArrayAssign_4:
dw   9
dw   6
dw   2
fill.b 16,0x00
TestArrayAssign_5:
dw   8
fill.b 32,0x00
TestArrayAssign_6:
dw   7
fill.b 32,0x00
TestArrayAssign_7:
dw   6
fill.b 32,0x00
TestArrayAssign_8:
dw   10
dw   2
dw   1
dw   0
fill.b 8,0x00
dw   9
dw   6
dw   2
fill.b 16,0x00
dw   8
fill.b 32,0x00
dw   7
fill.b 32,0x00
dw   6
fill.b 32,0x00
fill.b 160,0x00
;   global   _TestArrayAssign4



_________________
Robert Finch http://www.finitron.ca


Tue Apr 30, 2019 6:05 am

Joined: Fri Mar 22, 2019 8:03 am
Posts: 164
Location: Girona-Catalonia
This is an interesting feature. As far as I know, aggregate assignment for arrays is not a standard "C" language feature, so I assume you have implemented it as a language extension, in a similar way to the '&&&' operator and its friends. In 'pure' C you can only assign array elements, or get pointers to arrays or parts of them, but not assign parts of them. For example, the code below is valid, but it only makes pointer assignments. In regular C, in order to copy array elements you must use 'memcpy' or write iterative code.
Code:
int x[3][5][6];

int (*n)[5][6];
n = x;  // pointer assignment

int (*m)[6];
m = x[1]; // pointer assignment
Higher-level languages tend to implement arrays as dynamically growing, ordered collections of objects. Most of the time they are still referred to by their pointer, but in many languages that is relatively transparent to the programmer. Almost invariably, arrays are created on the heap and never on the stack; this lets functions return references to them without issues. I have programmed in Objective-C, and in that language you can declare an array as mutable or non-mutable. Non-mutable arrays have a fixed length which is determined upon creation, and they are stored in a memory-optimal way. Non-mutable arrays can be referenced, subscripted, passed between functions, and copied, but not modified after they have been created. By contrast, a 'mutable' array has a dynamic length, and new elements can be appended, inserted, or deleted. In both cases there's an explicit language difference between assigning the array as its object reference and copying its contents to another array or container object.


Tue Apr 30, 2019 3:13 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
Quote:
This is an interesting feature. As far as I know,
I wasn't sure if it was going to be supported or not, but it's a pita to initialize some arrays an element at a time. I was also waiting for builds.

Now to test struct assigns, which are coded already.
Branch ‘and’ and ‘or’ opcodes were missing from the compiler. This resulted in them being omitted from the output, causing havoc with loops.
Changed the compiler’s default calling convention to ‘pascal’. The pascal calling convention is both shorter and faster. The usual ‘C’ calling convention is still available with the ‘__cdecl’ keyword, and can be globally enabled using a ‘using’ statement like: ‘using __cdecl’
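Roughly, the difference between the two conventions (a sketch, not actual FT64 output): under __cdecl the caller removes the arguments after every call, while under pascal the callee removes them once in its return, so each call site shrinks.
Code:
```
; __cdecl: caller cleanup, repeated at every call site
        push    #2
        push    #1
        call    _f
        add     $sp,$sp,#16       ; caller pops the arguments

; pascal: callee cleanup, done once in the return
_f:     ...
        ret     #16               ; callee pops its own arguments
```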
Made some makefiles to trim a few seconds off the build times.

_________________
Robert Finch http://www.finitron.ca


Wed May 01, 2019 2:42 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
Put some work into improving the assembler’s performance. It was taking about five minutes to process. I managed to trim that down to about three minutes using the performance optimization tool in MSVC. The following function, used to find symbols, was taking much of the time:

Code:
 int ncmp (char *m1, const void *m2)
{
  SYM *n2;
  n2 = (SYM *)m2;
   if (m1==NULL) return 1;
   if (n2->name==NULL) return -1;
  return (strcmp(m1, GetName(n2->name)));
}

The first thing to go was the null-pointer checks; in properly running code they aren’t needed. The next thing to go was the function call to GetName(), which was a simple index-to-pointer conversion. It was inlined by converting the data to a global variable (it used to be a class member). These removals improved performance quite a bit.

Code:
 int ncmp (char *m1, const void *m2)
{
  SYM *n2;
  n2 = (SYM *)m2;
  return (strcmp(m1, &textname[n2->name]));
}

Still not satisfied with the performance, I decided to scrap the ncmp() function altogether. It was invoked through a function pointer in the hash find function so that the hash find function could be generic.
There was a loop like this:
Code:
   for (count = 0; count < hi->size; count++)
   {
      rr = (*hi->IsEqualName)(name, &htbl[TableIndex * hi->width]);
      if (rr == 0)
         break;
      TableIndex = (TableIndex + hash.delta) % hi->size;
   }

which called the ncmp() function via a function pointer. Too slow! I put the comparison directly in the loop rather than going through a pointer, like this:
Code:
   for (count = 0; count < hi->size; count++)
   {
      rr = strcmp(name, &textname[((SYM*)&htbl[TableIndex * hi->width])->name]);
      if (rr == 0)
         break;
      TableIndex = (TableIndex + hash.delta) % hi->size;
   }

Performance improved substantially again. The cost was the use of a global variable and a loss of generality.

I also modified the interpreter portion to use a jump table for the most common opcodes rather than a giant switch statement. This also improved performance noticeably.

For several SIMD operations the order of the fields was reversed; for instance, results that should have gone into the least significant byte were going into the most significant byte.

I managed to extract the instruction cache controller from the mainline code and place it in its own module. The toolset was complaining about the clocking of registers and the i-cache controller was one spot on the hot list. Making it its own module allows the use of a separate clock buffer which will hopefully help.

_________________
Robert Finch http://www.finitron.ca


Thu May 02, 2019 3:00 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
Well, the extraction of the instruction cache controller didn’t help. The toolset still raises the same issue about registers that aren’t clocked by a root clock. I assume this is due to the size of the core and the toolset’s ability to replicate registers. I checked on the web and there seems to be a fair amount of confusion about how to handle this error. According to responses on the web it can be handled by adding ‘proper’ constraints; I haven’t figured out what constraints should be added yet.

The core is hung up repeatedly trying to load the data cache. For some reason, after a data cache load it isn’t registering as a hit, which causes the bus controller to try to load the data again. So, I added an additional wait state into the process. I also added a ‘safety’ counter which counts the number of times the data cache load is attempted and aborts after three tries. The data cache controller can form a loop accessing the data cache, so some form of guaranteed loop exit is probably a good idea. Why there is a loop: the cache controller has to go back and retry the hit test after data is loaded. It could be done without a loop; I’m thinking about doing that.

I finally decided to just assume that a data cache load is successful and got rid of the whole loop thing.

Shaved more cycles off of loads which are aborted. Checks were put in place in every load state for aborted loads: if the load is aborted, the bus immediately transitions to the IDLE state to free up the load channel for another load, and random data is forced into the data stream. Setting an exception code was also nullified for aborted loads.
I changed the data cache organization. It used to load a word at a time directly from memory; now it loads into an intermediate line buffer. When the line buffer is full, it is written in its entirety to the data cache. This allows a data-cache load to be aborted midstream without having to worry about a partially updated data cache state.

Currently the core hangs trying to load data from the invalid address $f8c30100. Where this address is coming from is a mystery.
Sprites have disappeared.
Tested the z-order capability for the graphics. The RGB output from each core is associated with an eight-bit z-order indicator, which is used to determine which core’s output is on top.

_________________
Robert Finch http://www.finitron.ca


Fri May 03, 2019 2:30 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 920
Location: Canada
I wrote a ROM checksum routine and added check-summing to the assembler. Lo and behold, when run it revealed ROM checksum errors.
I think I figured out where the mysterious address was coming from. Loading the instruction cache may sometimes have been loading data from a previous rom access, which could cause parts of two consecutive instructions to be mangled. The rom was using pipelined, single-cycle access; possibly due to bus skew, it wasn’t working reliably. I changed it so it’s no longer pipelined, but the clock frequency to the rom has been increased to compensate a little. I was able to track this down because it happened in simulation without pipelined access active: the cpu was seeing the ack signal too soon, causing invalid data to be loaded.

The peephole optimizer was removing too much code in some small non-leaf functions. The call instruction was being missed as one that modifies the stack pointer.

There was an extra load, store and branch at the end of virtually every function in order to support the try/catch mechanism. I found a way to eliminate them using an additional register and double-indexed addressing.

_________________
Robert Finch http://www.finitron.ca


Sat May 04, 2019 4:54 am