AnyCPU - View topic - 74xx based CPU (yet another)

74xx based CPU (yet another)

Author	Message
joanlluch Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia	Re: 74xx based CPU (yet another)

Ok, so here are further improvements to the compilation of the source code from my previous post. The compiler now generates this:

CPU74 Code:
callConvertA: ; @callConvertA
; %bb.0: ; %entry
    sub SP, 4, SP
    mov &.LcallConvertA.a, r0
    ld.sb [r0, 2], r1
    st.b r1, [SP, 2]
    ld.w [r0, 0], r0
    st.w r0, [SP, 0]
    mov SP, r0
    call &convertA
    add SP, 4, SP
    ret

It's better because the compiler now takes into account that the first two bytes can be loaded/stored with a word instruction, so the whole operation can be done with just two load/store pairs instead of three.

I finally realised that what I regarded as a "bug" in LLVM was in reality an "omission" in Clang (the front end). It turned out that if I specify an explicit alignment for the struct involved, the back end takes that into account to generate better code. Unfortunately, there's no way (without modifying the Clang source code) to tell the compiler a default alignment for structs. The 'DataLayout' string

Code:
"e-m:e-p:16:16-i1:8-i8:8-i16:16-i32:16-i64:16-f32:16-f64:16-a:8:16-n16-S16"

does not work properly for 'aggregate' types. When trying to increase the ABI 'aggregate' alignment to 16 (i.e. setting it to a:16:16), the compiler just keeps crashing with assertions unless the i8 type is also aligned to 16 bits. However, I want structs and arrays to get 16 bit alignment so that bulk assignments and memcpy can be optimised, while 'char' (the i8 type) stays 8 bit aligned.

For all major targets (say x86, ARM, and so on) the processor is assumed to be able to read and write words at misaligned addresses. Their respective backends just emit wide load/stores in all cases regardless of alignment. In the vast majority of cases this is still optimal because global data usually gets aligned as per the 'Preferred' data layout setting. In all these cases, the 'aggregate' ABI alignment is set to 8, and the Preferred alignment is set to the native processor word.

The MSP430 does not support misaligned accesses, so it suffers from the default implementation of the compiler: single byte accesses coupled with shift/or code are generated to avoid any (potentially) misaligned access. I regard that as less than optimal, and I want to avoid it for my architecture. My processor will not support unaligned word accesses, and since I can't just assume that data will be aligned, I must explicitly assert it.

After many attempts to get the backend to generate better code, and many hours trying several things, I finally created a 'module pass' in my LLVM backend implementation that runs before any instruction selection begins. The pass visits all global objects and explicitly assigns them the desired alignment. This seems to have done the trick and leads to the better assembly code that opened this post.

Joan

Last edited by joanlluch on Mon May 20, 2019 5:21 pm, edited 1 time in total.
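To make the effect of the alignment concrete, here is a minimal C sketch. It is not the source from the earlier post (which is not shown on this page): the 3-byte struct `A3` and the helper `copy_a3` are hypothetical names. It copies the struct the way the improved assembly does, with one 16 bit access for bytes 0-1 and one 8 bit access for byte 2, which is only legitimate on a machine without misaligned word accesses if both addresses are known to be 2-byte aligned.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 3-byte aggregate, explicitly given 2-byte alignment
   (the kind of per-struct alignment the post says helps the back end). */
typedef struct {
    _Alignas(2) uint8_t bytes[3];
} A3;

/* Copy with one word access plus one byte access: two load/store pairs
   instead of three byte-sized pairs. */
static void copy_a3(A3 *dst, const A3 *src)
{
    /* The word access is only safe when both addresses are even. */
    assert(((uintptr_t)dst & 1u) == 0 && ((uintptr_t)src & 1u) == 0);

    uint16_t w;
    memcpy(&w, src->bytes, sizeof w);   /* models ld.w [r0, 0], r0 */
    memcpy(dst->bytes, &w, sizeof w);   /* models st.w r0, [SP, 0] */
    dst->bytes[2] = src->bytes[2];      /* models the byte load/store at offset 2 */
}
```

Without the alignment guarantee, a compiler targeting a machine like this has to fall back to three separate byte copies.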
Mon May 20, 2019 2:10 pm

joanlluch Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia	Re: 74xx based CPU (yet another)

Going back to the instruction set, I have now defined the policy for extending immediate fields into full word values. This was already reflected in the instruction set summary that I recently posted, but I hadn't mentioned it. So this is what I decided and why:

1) All immediate offsets in load/store instructions are zero-extended positive values. This affects the load/store with register plus immediate 6 bit offset, and the SP relative load/store (8 bit offset). The affected instructions are the ones in the T6 and T8 patterns. I found that negative offsets are hardly used at all. Positive offsets are used to access structure data members (from the base pointer or address), or constant array indexes. On the stack, positive offsets are used both to access local variables and arguments. By allowing only positive offsets on the load/store instructions I have increased the range towards ascending memory addresses. Negative offsets can still be computed by loading the offset in a register and using a register-register indexed load/store.

2) All immediate add/sub instructions use positive zero-extended values. This affects the add/sub register with immediate offset and the add/sub offset to SP (8 bits long in both cases). Since I have both 'add' and 'sub' it's not a problem that both take positive values. This is a very common practice in most processors.

3) Move immediate instructions use sign-extended values. This enables loading small positive or negative values (from -128 to +127) into registers, which is a common practice and also a feature of many processors.

4) Compare immediate instructions use sign-extended values. As with moves, this enables comparison with small positive or negative values.

5) Immediate offset fields for branches or calls are all sign-extended, so branches/calls can jump to previous or following addresses.

Instruction decoder

The above requires changes in the decoder circuit that I presented before. Until now, all immediate values were meant to be sign-extended, so the offset decoder just needed to know the bit range of the instruction's immediate field. There are only 4 possible immediate bit ranges: 5-0, 7-0, 9-0 and 11-0; so only 2 wires were required from the main instruction decoder to the offset decoder. Now, the 7-0 range may be zero or sign extended, so one more wire is required. I have not yet updated the circuit but I will do so when more things get defined.

Joan
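The two extension policies listed above can be sketched in C as follows. This is only an illustration of zero versus sign extension of an n-bit field to a 16 bit word; the function names `zext_imm` and `sext_imm` are made up here and do not come from the CPU74 documents.

```c
#include <assert.h>
#include <stdint.h>

/* Mask covering the low `bits` bits of a 16 bit word. */
static uint16_t field_mask(unsigned bits)
{
    assert(bits >= 1 && bits <= 16);
    return (bits == 16) ? 0xFFFFu : (uint16_t)((1u << bits) - 1u);
}

/* Zero-extend: used for load/store offsets and add/sub immediates. */
static uint16_t zext_imm(uint16_t field, unsigned bits)
{
    return (uint16_t)(field & field_mask(bits));
}

/* Sign-extend: used for mov/cmp immediates and branch/call offsets. */
static int16_t sext_imm(uint16_t field, unsigned bits)
{
    uint16_t v = (uint16_t)(field & field_mask(bits));
    uint16_t sign = (uint16_t)(1u << (bits - 1));
    /* Flipping the sign bit and subtracting it back propagates the sign. */
    return (int16_t)((v ^ sign) - sign);
}
```

For an 8 bit field, the same bit pattern 0xFF therefore reads as 255 in an `add` immediate and as -1 in a `mov` immediate, which is why the 7-0 range needs the extra decoder wire.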
Mon May 20, 2019 5:10 pm

robfinch Joined: Sat Feb 02, 2013 9:40 am Posts: 2095 Location: Canada	Re: 74xx based CPU (yet another)

Quote:
I found that negative offsets are hardly used at all.

That's true, but there are cases where negative offsets are used. I've found -1 is used to index just before the start of an array. My crappy compiler generates negative offsets from the frame pointer to reference data.

One way to skew the values so that there are lots more positive values than negative ones is to use multiple bits for the sign. For instance, with triple sign bit extension, the top three bits being '111' indicates a negative number; with a 6-bit offset that gives 56 positive values and eight negative ones.

_________________
Robert Finch http://www.finitron.ca
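A small C sketch of the triple sign bit idea for a 6-bit field, assuming the straightforward reading of the post: fields whose top three bits are '111' (56 to 63) decode as -8 to -1, everything else decodes as itself (0 to 55). The function name is hypothetical.

```c
#include <assert.h>

/* Decode a 6-bit field where only the pattern 111xxx is negative. */
static int decode_triple_sign6(unsigned field)
{
    assert(field < 64u);
    if ((field & 0x38u) == 0x38u)   /* top three bits are all ones */
        return (int)field - 64;      /* 56..63 -> -8..-1 */
    return (int)field;               /* 0..55 stay positive */
}
```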
Tue May 21, 2019 3:35 am

joanlluch Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia	Re: 74xx based CPU (yet another)

Hi Rob,

What you observe about a non-symmetrical positive and negative encoding is interesting. I hadn't thought of that, and it's certainly something that I can consider. The hardware decoding should not be much more difficult because I'm doing most of it inside a PAL device.

About the frequency of negative offsets, in my case I found that I only have them when a frame pointer is used, but not when accessing struct members or zero-based arrays (as is always the case in C). I have not disclosed my use of a frame pointer before, but I am also using it in the cases where the SP instructions would not reach the whole stack frame. I worked for some time on the compiler backend to make this as optimal as possible. In the general case, with relatively small offsets that reach both the arguments and the local vars, I use only the SP. This is an example:

Code:
void doSomething( int *arr );
int torna( int a, int b, int c, int d )
{
  const int Size = 100;
  int arr[Size];
  arr[0] = 11;
  arr[Size-1] = 22;
  doSomething(arr);
  return d;
}

The above C code gets compiled like this:

CPU74 Code:
torna: # @torna
# %bb.0: # %entry
    sub SP, 200, SP
    mov 22, r0
    st.w r0, [SP, 198]
    mov 11, r0
    st.w r0, [SP, 0]
    mov SP, r0
    call &doSomething
    ld.w [SP, 202], r0
    add SP, 200, SP
    ret

As you can see, only the SP is used in this case. All offsets are positive and within the 8 bit range. The offsets for the local array go from 0 to 198. The offset to argument 'd' is 202 because the return address (at offset 200) is skipped. Only argument 'd' is passed on the stack because arguments 'a', 'b', 'c' are passed in registers (although not used in this example).

If I increase the size of the array to a size that would not fit in SP offsets, the function gets compiled like this:

Code:
void doSomething( int *arr );
int torna( int a, int b, int c, int d )
{
  const int Size = 200;
  int arr[Size];
  arr[0] = 11;
  arr[Size-1] = 22;
  doSomething(arr);
  return d;
}

CPU74 Code:
torna: # @torna
# %bb.0: # %entry
    push r7
    mov SP, r7
    mov -400, r0
    add SP, r0, SP
    mov 22, r0
    mov -2, r1
    st.w r0, [r7, r1]
    mov 11, r0
    st.w r0, [SP, 0]
    mov SP, r0
    call &doSomething
    ld.w [r7, 4], r0
    mov 400, r1
    add SP, r1, SP
    pop r7
    ret

In this case register 'r7' is used as the frame pointer. The SP adjustments at the beginning and end of the function require a register and an SP+register addition. The compiler is clever enough to use r0 for this even though it holds parameter 'a'; that's legal because 'a' is not used in the function. Argument 'd' is accessed through 'r7' with positive offset 4. It's 4 because both the return address and the saved r7 must be skipped. Element 0 of the array is still accessed through the SP (because it fits). The last element of the array, at byte offset 398, could be accessed through either the SP or r7. In this case r7 is used because the offset would not fit in the SP instructions. The offset from r7 results in a negative value (-2). The choice between r7 and SP in this case is debatable because in both cases an intermediate register is required, so in terms of both performance and size either would produce the same result. This is a clear case where your proposal of asymmetric offsets would be beneficial.

Also in this particular case, it is debatable whether it's best to go to the trouble of using a frame pointer (r7), or whether we could still stay with the SP alone, using intermediate register instructions to get the offsets. I chose to use a frame pointer in all cases where offsets would be too big for the SP alone, because the frame pointer sits very near the end of the frame while the SP is always at the beginning of it, so there are more chances to reach variables and arguments with small offsets.

I hope this makes sense, and furthermore your idea of having an asymmetric offset for the frame pointer related instructions would add extra benefits.

Joan
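The offsets quoted in the two listings can be reproduced with a little arithmetic. The following C sketch is only a model of the frame layout as described in the post (2-byte words, return address directly above the locals, saved r7 directly below the return address when a frame pointer is used); the function names are hypothetical, and the real backend's decision logic is certainly more involved.

```c
#include <assert.h>
#include <stdbool.h>

enum { WORD = 2, SP_IMM_MAX = 255 };  /* 8 bit zero-extended SP offset */

/* SP-relative offset of the n-th stack-passed argument when only the SP
   is used: skip the locals, then the return address. */
static int sp_offset_of_stack_arg(int frame_bytes, int n)
{
    assert(frame_bytes >= 0 && n >= 0);
    return frame_bytes + WORD + n * WORD;
}

/* r7-relative offset of the same argument: r7 points at the saved r7,
   so skip the saved r7 and the return address. */
static int fp_offset_of_stack_arg(int n)
{
    assert(n >= 0);
    return 2 * WORD + n * WORD;
}

/* r7-relative offset of a local placed `off` bytes above the frame bottom,
   assuming the locals sit directly below the saved r7. */
static int fp_offset_of_local(int frame_bytes, int off)
{
    assert(off >= 0 && off < frame_bytes);
    return off - frame_bytes;
}

/* Does an offset fit the immediate field of the SP-relative load/store? */
static bool fits_sp_immediate(int offset)
{
    return offset >= 0 && offset <= SP_IMM_MAX;
}
```

With a 200-byte frame the argument 'd' lands at SP+202, which fits; with a 400-byte frame it would be SP+402, which does not, while from r7 it is +4 and the last array element is -2.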
Tue May 21, 2019 6:00 am

joanlluch Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia	Re: 74xx based CPU (yet another)

I got variable number of arguments working. The following example shows a function with a variable number of arguments using the venerable C va_arg macro and its friends:

Code:
#include <stdarg.h>
int sumAll(int count, ...)
{
  va_list ap;
  va_start(ap, count);
  int total = 0;
  for(int i=0; i<count; i++)
    total += va_arg(ap, int);
  va_end(ap);
  return total;
}
int callSumAll()
{
  return sumAll( 3, 10, 20, 30);
}

CPU74 Code:
sumAll: # @sumAll
# %bb.0: # %entry
    push r4
    push r3
    sub SP, 2, SP
    mov SP, r0
    add r0, 10, r0
    st.w r0, [SP, 0]
    mov 0, r1
    ld.w [SP, 8], r2
    mov 0, r0
    jmp &LBB0_2
LBB0_1: # %for.body
# in Loop: Header=BB0_2 Depth=1
    ld.w [SP, 0], r3
    mov r3, r4
    add r4, 2, r4
    st.w r4, [SP, 0]
    ld.w [r3, 0], r3
    add r3, r0, r0
    add r1, 1, r1
LBB0_2: # %for.cond
# =>This Inner Loop Header: Depth=1
    cmp r1, r2
    brlt &LBB0_1
# %bb.3: # %for.cond.cleanup
    add SP, 2, SP
    pop r3
    pop r4
    ret

CPU74 Code:
callSumAll: # @callSumAll
# %bb.0: # %entry
    sub SP, 8, SP
    mov 30, r0
    st.w r0, [SP, 6]
    mov 20, r0
    st.w r0, [SP, 4]
    mov 10, r0
    st.w r0, [SP, 2]
    mov 3, r0
    st.w r0, [SP, 0]
    call &sumAll
    add SP, 8, SP
    ret

For functions with a variable number of arguments, I use the classic pass-everything-on-the-stack calling convention. A possible optimisation would be to still pass some of the fixed arguments in registers as usual, but in this case I just implemented the good old approach that determines the stack address of the extra arguments starting from the address of the last fixed one. The ARM Thumb backend implementation does a better job at that because it still allows both fixed and extra arguments to be passed in registers as per the normal calling convention. But this requires a lot of case by case code that I thought was not worth implementing.

So the "callSumAll" function passes all the arguments on the stack, including the first one. The "sumAll" function starts by creating a placeholder for the 'ap' va_list variable, storing in it the address of the first extra argument (the one after 'count'). Inside the loop, this address is used to load the argument and is incremented by the va_arg macro, which in the assembly code happens just after the LBB0_1 label.

The va_end macro does not really do anything on my architecture. It can be omitted from the C source and the compiler frontend doesn't even complain. Of course, if a second iteration through the arguments is required, the va_start macro can be invoked again in the usual way.

Joan
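What the va_start/va_arg pair boils down to under this all-on-the-stack convention can be modelled in plain C. In this sketch the caller's outgoing argument area is represented by an ordinary array of 16 bit words (a hypothetical stand-in for the stack slots written by "callSumAll"), and `ap` is just a pointer that starts one word past 'count' and advances by one word per argument.

```c
#include <assert.h>
#include <stdint.h>

/* args[0] is 'count'; args[1..count] are the extra arguments,
   laid out at ascending addresses like the stack slots in the listing. */
static int16_t sum_all_model(const int16_t *args)
{
    int16_t count = args[0];
    assert(count >= 0);

    const int16_t *ap = args + 1;        /* va_start: address just past 'count' */
    int16_t total = 0;
    for (int16_t i = 0; i < count; i++) {
        total = (int16_t)(total + *ap);  /* va_arg: load the current argument... */
        ap++;                            /* ...and step to the next word */
    }
    return total;                        /* va_end: nothing to do */
}
```

Calling it on the words {3, 10, 20, 30} mirrors the `sumAll( 3, 10, 20, 30)` call from the post.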
Wed May 22, 2019 6:21 pm

joanlluch Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia	Re: 74xx based CPU (yet another)

Also, I have variable-sized stack objects working. The following example shows what I mean.

Code:
extern void test1 ( int *a, int *x );
void callTest1( int size )
{
  int array[size];
  int a=22;
  test1( &a, array );
}

This gets compiled like this:

CPU74 Code:
callTest1: # @callTest1
# %bb.0: # %entry
    push r7
    mov SP, r7          # Use r7 as the Frame Pointer
    sub SP, 2, SP
    lsl r0              # 'size' is in r0. It gets multiplied by 2
    mov SP, r1
    sub r1, r0, r1      # 'r1' now contains the starting address of 'array'
    mov r1, SP          # The SP gets dynamically updated
    mov 22, r0
    st.w r0, [r7, -2]   # Local variable 'a' uses the first stack slot, accessed through the Frame Pointer
    mov r7, r0
    sub r0, 2, r0       # Pass '&a' in 'r0', 'r1' already contains 'array'
    call &test1         # Call test1
    mov r7, SP          # Recover SP from 'r7' rather than incrementing it
    pop r7
    ret
Wed May 22, 2019 7:05 pm

joanlluch Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia	Re: 74xx based CPU (yet another)

A more complex example of variable-sized arrays.

Code:
void test ( int *a, int *x, int *y );
void callTest( int siz0, int siz1)
{
  int arr0[siz0];
  int a=22;
  int arr1[siz1];
  test( &a, arr0, arr1 );
}

CPU74 Code:
callTest: # @callTest
# %bb.0: # %entry
    push r7
    mov SP, r7          # Use r7 as the Frame Pointer
    push r3
    sub SP, 2, SP
    lsl r0              # 'siz0' is in r0. It gets multiplied by 2
    mov SP, r2
    sub r2, r0, r3      # 'r3' now contains the starting address of 'arr0'
    mov r3, SP          # SP is dynamically updated to a larger frame
    mov 22, r0
    st.w r0, [r7, -4]   # 'a' is in the upper stack slot, accessed through the Frame Pointer
    lsl r1              # 'siz1' is in r1. It gets multiplied by 2
    mov SP, r0
    sub r0, r1, r2      # 'r2' now contains the starting address of 'arr1'
    mov r2, SP          # SP is dynamically updated to an even larger frame
    mov r7, r0
    sub r0, 4, r0       # 'r0' is now '&a'
    mov r3, r1          # 'r1' is now 'arr0', 'r2' already contains 'arr1'
    call &test          # Call test
    mov r7, SP          # Recover SP from 'r7' rather than incrementing it
    sub SP, 2, SP       # Account for the callee saved registers. Sub rather than Add
    pop r3
    pop r7
    ret

The strategy is as follows:

- Always use a Frame Pointer (register r7) when there are variable-sized objects. By placing all fixed-sized objects in the upper stack slots, they can always be accessed through the Frame Pointer regardless of any dynamic updates of the SP.
- Keep a registry of the starting addresses of all the variable-sized objects. These addresses are kept in registers as long as there are enough available, or spilled to the upper stack slots if the compiler runs out of registers. (The latter is not shown in the examples above.)
- Update the SP to new frame bottoms as the variable-sized object addresses are computed.
- Access the variable-sized objects from their addresses, available in registers or in the upper stack frame, using double indirection rather than simple immediate offsets.
- At the end of the function, recover the SP from the Frame Pointer, rather than incrementing it by a constant offset. As this happens just before the restoring of any callee saved registers, the SP must be decremented by the stack space used by these registers.

A careful observer may notice that instruction sequences like this

Code:
mov SP, r0
sub r0, r1, r2

could be replaced by

Code:
sub SP, r1, r2

The reason for this not happening is that the CPU74 processor does not have a subtract Register from SP instruction. There are 'add SP, Rd, SP' and 'add Rd, SP, Rd' belonging to the T10 group, but not the equivalent sub instructions. This is a small price to pay for having SP specific instructions separated from the general purpose register instructions.

Beyond these examples, there are a couple of more complicated scenarios that may require stack frame saves and restores depending on several circumstances, which I have not shown. But I think that the general approach is already described in enough detail.

Joan

Last edited by joanlluch on Wed May 22, 2019 10:00 pm, edited 2 times in total.
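The stack pointer bookkeeping in the "callTest" listing can be followed with a small C model. This is only an address-arithmetic sketch under the layout visible in the listing (saved r7, then saved r3, then the slot for 'a', then the two arrays); the `Frame` type and function names are hypothetical and nothing here touches a real stack.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint16_t sp;      /* current stack pointer */
    uint16_t fp;      /* r7, the frame pointer */
    uint16_t a_addr;  /* address of the fixed-size local 'a' */
    uint16_t arr0;    /* start of the first variable-sized array */
    uint16_t arr1;    /* start of the second variable-sized array */
} Frame;

/* Prologue plus the two dynamic allocations (sizes in 16 bit elements). */
static Frame enter_call_test(uint16_t sp_on_entry, uint16_t siz0, uint16_t siz1)
{
    assert(sp_on_entry >= 6u + 2u * siz0 + 2u * siz1);
    Frame f;
    f.sp = (uint16_t)(sp_on_entry - 2);   /* push r7 */
    f.fp = f.sp;                          /* mov SP, r7 */
    f.sp = (uint16_t)(f.sp - 2);          /* push r3 (callee saved) */
    f.sp = (uint16_t)(f.sp - 2);          /* sub SP, 2, SP: slot for 'a' */
    f.a_addr = (uint16_t)(f.fp - 4);      /* 'a' is reached as [r7, -4] */
    f.sp = (uint16_t)(f.sp - 2 * siz0);   /* new frame bottom */
    f.arr0 = f.sp;
    f.sp = (uint16_t)(f.sp - 2 * siz1);   /* even lower frame bottom */
    f.arr1 = f.sp;
    return f;
}

/* Epilogue: recover SP from r7 instead of adding a constant. */
static uint16_t leave_call_test(Frame f)
{
    uint16_t sp = f.fp;                   /* mov r7, SP */
    sp = (uint16_t)(sp - 2);              /* sub SP, 2, SP: step down to saved r3 */
    sp = (uint16_t)(sp + 2);              /* pop r3 */
    sp = (uint16_t)(sp + 2);              /* pop r7 */
    return sp;
}
```

Whatever the run-time sizes are, the epilogue returns the SP to its entry value, because it never depends on how far the arrays pushed the SP down.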
Wed May 22, 2019 8:27 pm

joanlluch Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia	Re: 74xx based CPU (yet another)

Btw, I forgot to add that in all the examples of my recent posts I have already implemented Rob's suggestion of asymmetrical immediate offsets. In particular, all the offsets from 'r7' shown in the examples are embedded in the instruction, even if they are negative. The range that I have chosen is displaced by 16 units, so the 6 bit immediate field effectively goes from -16 to 47.

Joan
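A C sketch of a 6-bit field covering -16 to 47. The post gives only the range, not the bit encoding, so the mapping used here is an assumption: field values 0 to 47 stand for themselves and 48 to 63 stand for -16 to -1 (that is, the two top bits '11' mark a negative value). All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Can this offset be embedded in the instruction? */
static bool fits_disp6(int offset)
{
    return offset >= -16 && offset <= 47;
}

/* Encode: keep the low six bits of the two's complement value. */
static unsigned encode_disp6(int offset)
{
    assert(fits_disp6(offset));
    return (unsigned)offset & 0x3Fu;
}

/* Decode under the assumed mapping: 48..63 -> -16..-1, else unchanged. */
static int decode_disp6(unsigned field)
{
    assert(field < 64u);
    return (field >= 48u) ? (int)field - 64 : (int)field;
}
```

Under this mapping the `[r7, -2]` and `[r7, -4]` accesses from the earlier listings fit in the instruction, while an offset such as 48 or -17 would still need an intermediate register.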
Wed May 22, 2019 8:46 pm

BigEd Joined: Wed Jan 09, 2013 6:54 pm Posts: 1782	Re: 74xx based CPU (yet another) An interesting innovation! I don't think I've come across it before, or considered it.
Wed May 22, 2019 8:56 pm

joanlluch Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia	Re: 74xx based CPU (yet another)

BigEd wrote:
An interesting innovation! I don't think I've come across it before, or considered it.

I recall having seen this somewhere before, but I do not remember where. I thought it was the ARM Thumb, but I re-checked and it's not exactly the case. The ARM has ranges in the positive area that start above zero (for example 8 to 255). I can see them implemented in the LLVM ARM Thumb backend for the ADDi instruction, but I think they are really a compiler trick to help select different instructions depending on range, rather than a limitation/feature of the ARM instruction set per se. So at this time, I can only say that when Rob mentioned it, it didn't come as totally alien to me, but I can't recall where I've seen it before.
Wed May 22, 2019 9:56 pm

robfinch Joined: Sat Feb 02, 2013 9:40 am Posts: 2095 Location: Canada	Re: 74xx based CPU (yet another)

I think I first used the skewed value trick on the Butterfly core, for a 4-bit immediate field. It's probably only worth it for small fields. Once fields are more than eight bits it's better just to sign extend. I suspect there is some performance lost because an extra gate is required. I don't think it matters much on an FPGA though; the first gate is likely combined into other logic for sign extension. For really small fields a table mapping of common values could be set up. The 68000 maps the value zero to eight for shift instructions, which use a three-bit count field.

_________________
Robert Finch http://www.finitron.ca
Fri May 24, 2019 6:39 am

joanlluch Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia	Re: 74xx based CPU (yet another)

robfinch wrote:
I think I first used the skewed value trick on the Butterfly core, for a 4-bit immediate field. It's probably only worth it for small fields. Once fields are more than eight bits it's better just to sign extend. I suspect there is some performance lost because an extra gate is required. I don't think it matters much on an FPGA though; the first gate is likely combined into other logic for sign extension. For really small fields a table mapping of common values could be set up. The 68000 maps the value zero to eight for shift instructions, which use a three-bit count field.

Also, for immediate fields up to (at least) 8 bits long, most architectures use zero-extending semantics, i.e. all values in the positive range, if there are complementary instructions that can account for the missing negative values. The immediate ADD/SUB instructions and the Shift instructions are cases of this.

In my case, I think that decoding the displaced offsets should not cause any performance loss, because the intention is that immediate field decoding is done in a PLD such as the ATF22V10C. So, if I understand it correctly, that essentially becomes a table of inputs that produce custom outputs, with identical propagation times regardless of programming.
Fri May 24, 2019 8:27 am

joanlluch Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia	Re: 74xx based CPU (yet another)

I am now looking at 32 bit arithmetic for my 16 bit processor. The processor does not explicitly support it, except for the presence of the ADD/SUB with Carry instructions and the Shift through Carry instructions. So everything must be done by the compiler. These are some examples (with comments) of what I have achieved so far:

32 bit vars add

Code:
long arith32( long a, long b )
{
  return a+b;
}

CPU74 Code:
arith32: # @arith32
# %bb.0: # %entry
    add r2, r0, r0
    ld.w [SP, 2], r2
    addc r2, r1, r1
    ret

The function has 2 parameters, but the arguments are 32 bits long, so four 16 bit slots are used. Technically it's like a 4 parameter function. As usual, only the first three 16 bit values are passed in registers, so the fourth one is fetched from the stack. The addition is performed with the typical add/addc sequence, but the processor really has no notion of 32 bit arithmetic, so any register is valid for that. Interestingly, the r2 register gets reused for the second half of the addition.

The next example is a constant addition:

32 bit var + constant add

Code:
long arith32( long a )
{
  return a+2;
}

CPU74 Code:
arith32: # @arith32
# %bb.0: # %entry
    mov 2, r2
    add r0, r2, r0
    mov 0, r2
    addc r1, r2, r1
    ret

The constant is sign extended and split into its two 16 bit components. Note that since the 'mov' instruction does not affect flags, it can be inserted between the 'add' and the 'addc', so a register can be saved.

The sign-extension of constants is shown more explicitly with negative values:

32 bit var with negative constant

Code:
long arith32( long a )
{
  return a-2;
}

CPU74 Code:
arith32: # @arith32
# %bb.0: # %entry
    mov -2, r2
    add r0, r2, r0
    mov -1, r2
    addc r1, r2, r1
    ret

In this case -1 (all ones) is used for the 'addc' because that's the upper word of the sign-extended -2.

32 bit and 16 bit arithmetic can of course be mixed, as per the 'C' specification. In the following example an unsigned int var (16 bits) is added to a signed 32 bit var:

32 bit var with 16 bit unsigned var

Code:
long arith32( long a, unsigned int b )
{
  return a+b;
}

CPU74 Code:
arith32: # @arith32
# %bb.0: # %entry
    add r2, r0, r0      # 'a' is in r1,r0, 'b' is in 'r2'
    mov 0, r2
    addc r1, r2, r1
    ret

This is actually the easier case because the extension of unsigned 'b' is always all zeros. Signed addition is trickier, because the 16 bit value must be sign extended to 32 bits:

32 bit var with 16 bit signed var

Code:
long arith32( long a, int b )
{
  return a+b;
}

CPU74 Code:
arith32: # @arith32
# %bb.0: # %entry
    add r2, r0, r0
    cmp r2, 0
    setlt r2
    neg r2
    addc r2, r1, r1
    ret

The sign extension is implemented by testing the sign bit. If the value is positive then all zeros are used for the upper word; if it's negative then all ones are used. By default the compiler does that by performing a 15 bit arithmetic shift right on the variable, which would give the desired result with a single instruction. However, CPU74 does not support multi-bit shift instructions, so I replaced that by a 'set' instruction followed by a 'neg'. The resulting assembly code is still quite optimal, especially if we compare it with the output for similar architectures such as the MSP430 and the AVR. I'm not pasting the MSP430 or AVR assembly code because it's excessively long for this particular case.

Last edited by joanlluch on Fri May 24, 2019 11:12 am, edited 1 time in total.
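The add/addc pairing and the splitting of an operand into two 16 bit words can be written out in C. This is a sketch of the arithmetic only, not of the compiler's lowering; `Pair32` and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint16_t lo, hi; } Pair32;  /* a 32 bit value as two words */

static Pair32 split32(uint32_t v)
{
    Pair32 p = { (uint16_t)(v & 0xFFFFu), (uint16_t)(v >> 16) };
    return p;
}

static uint32_t join32(Pair32 p)
{
    return ((uint32_t)p.hi << 16) | p.lo;
}

/* Upper word of a sign-extended 16 bit value: all ones if negative,
   all zeros otherwise (what the setlt + neg pair produces). */
static uint16_t sign_word(int16_t b)
{
    return (uint16_t)-(b < 0);
}

/* Widen a signed 16 bit value to a two-word pair. */
static Pair32 widen_signed(int16_t b)
{
    Pair32 p = { (uint16_t)b, sign_word(b) };
    return p;
}

/* 32 bit addition from two 16 bit additions. */
static Pair32 add32(Pair32 a, Pair32 b)
{
    Pair32 r;
    uint32_t low = (uint32_t)a.lo + b.lo;     /* add: low words */
    unsigned carry = (unsigned)(low >> 16) & 1u;
    r.lo = (uint16_t)low;
    r.hi = (uint16_t)(a.hi + b.hi + carry);   /* addc: high words plus carry */
    return r;
}
```

For example, `widen_signed(-2)` yields the words 0xFFFE and 0xFFFF, matching the `mov -2` / `mov -1` pair in the "a-2" listing.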
Fri May 24, 2019 10:44 am

joanlluch Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia	Re: 74xx based CPU (yet another)

An aspect that is interesting to note is the behaviour of the CPU74 Status Registers. As I stated in an earlier post, and also in the instruction set document, I have two status registers, SR and ASR. The SR is used for comparisons in combination with the SETcc, SELECTcc and BRcc instructions. The ASR is used for arithmetic instruction results. So their semantics do not mix, and this ultimately helps to produce shorter code.

I am now also considering carefully selecting which instructions should modify the ASR and which should not. This will create further optimisation opportunities. For example, the 'neg' instruction in the last example above will not modify any status register, not even the ASR. The 'cmp' instruction will affect the SR but not the ASR. This enabled the compiler to safely insert an entire block of code between the 'add' and the 'addc' instructions, which saved one precious register and one stack save/restore pair, and increased code density.

It would be nice if, assuming reasonably well designed hardware, I could beat the microprocessors of the 80's (or even the MSP430 or the AVRs at the same clock rate).
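The benefit of keeping the carry (ASR) apart from the comparison result (SR) can be seen in a toy C model of the five-instruction signed addition from the previous post. Everything here is an assumption made for illustration: the `Cpu` struct, the operand order `op src1, src2, dst`, `setlt` writing 1 or 0, and the exact set of flags each instruction touches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint16_t r[8];
    bool asr_carry;  /* ASR: written by add/addc only */
    bool sr_lt;      /* SR: written by cmp only */
} Cpu;

static void op_add(Cpu *c, int a, int b, int d)
{
    uint32_t s = (uint32_t)c->r[a] + c->r[b];
    c->r[d] = (uint16_t)s;
    c->asr_carry = (s >> 16) & 1u;
}

static void op_addc(Cpu *c, int a, int b, int d)
{
    uint32_t s = (uint32_t)c->r[a] + c->r[b] + (c->asr_carry ? 1u : 0u);
    c->r[d] = (uint16_t)s;
    c->asr_carry = (s >> 16) & 1u;
}

/* cmp touches the SR only, so a pending carry survives it. */
static void op_cmp_imm(Cpu *c, int a, int16_t imm)
{
    c->sr_lt = (int16_t)c->r[a] < imm;
}

static void op_setlt(Cpu *c, int d) { c->r[d] = c->sr_lt ? 1u : 0u; }

/* neg touches no status register at all in this model. */
static void op_neg(Cpu *c, int d) { c->r[d] = (uint16_t)(0u - c->r[d]); }

/* long a in r1:r0, int b in r2; result in r1:r0. */
static void add_long_int(Cpu *c)
{
    op_add(c, 2, 0, 0);     /* add r2, r0, r0  (sets carry in ASR) */
    op_cmp_imm(c, 2, 0);    /* cmp r2, 0       (SR only)           */
    op_setlt(c, 2);         /* setlt r2                            */
    op_neg(c, 2);           /* neg r2          (no flags)          */
    op_addc(c, 2, 1, 1);    /* addc r2, r1, r1 (uses carry in ASR) */
}
```

If `cmp` or `neg` overwrote the carry, the sign-extension block could not sit between the `add` and the `addc` without first saving the low-word result's carry somewhere.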
Fri May 24, 2019 10:56 am

BigEd Joined: Wed Jan 09, 2013 6:54 pm Posts: 1782	Re: 74xx based CPU (yet another) Interesting - and very positive - that you can get a compiler to make good use of the long life of a flag. To some extent the humble 6502 (and therefore the less humble ARM) sets N and Z more readily than C and V: there are two kinds of flags, or three if you could distinguish the use of C for shifting and the use of C for arithmetic. It might be that having the different sets of flags in different registers is crucial for the compiler.
Fri May 24, 2019 3:03 pm