


 74xx based CPU (yet another) 

Joined: Fri Mar 22, 2019 8:03 am
Posts: 168
Location: Girona-Catalonia
I completed my third iteration of the ISA definition, which I expect to be the final one. I made the following changes with respect to the previous one.

(1) It's still a pure load-store architecture. Memory access never happens in the same instruction as an ALU operation. This should help to keep critical paths at bay.

(2) The processor is now pure 16 bit. This means that 8 bit ALU operations are not supported or implemented. All ALU operations are exclusively 16 bit. 8 bit memory loads are provided through zero-extending and sign-extending loads. 8 bit stores are implemented as truncating stores. This is fully consistent with the C language specification, which requires all intermediate operations to be promoted to the 'int' type unless it's guaranteed that the result will be the same. The 'int' type for this architecture is 16 bit. As a result, the total number of instructions can be reduced.

(3) I reduced the number of registers from a total of 16 including SP and PC, to 10 including SP and PC. (8 General purpose, SP, PC). I found that more than 8 general purpose registers are rarely used by the compiler. Also, memory access (SRAM) will be at the processor clock frequency, so the advantages of a large number of registers are diminished in this case.

(4) I no longer have the SP and PC in the same register bank as the general purpose registers. In reality, these two registers are special ones, and I see little justification for making them available to general ALU operations. The register separation required the implementation of dedicated PC and SP instructions, such as "push", "pop" and "add", "sub" on the SP, which were not explicitly necessary before or could be performed with the existing addressing modes. The register separation adds some complexity to the instruction set, but it frees some encoding space overall because I can now refer to up to 8 general purpose registers with 3 bit fields.

(5) I have now taken a lot more inspiration from the ARM Thumb instruction set, and less from the MSP430. After looking at compiler output, I found that the Thumb instruction set appears to be a lot more balanced. In particular, the MSP430 has many addressing modes and instructions that are hardly used by the compiler.

(6) I favoured incorporating as many instructions as possible with small constant fields or memory access offsets embedded in them. This improves code density and execution efficiency, as instructions of this kind are used all the time. This is no different from the Thumb instruction set approach.

So this is the Opcode Summary:

Attachment: CPU74InstrSetV3..png


I have also attached the complete list:


Attachment: CPU74InstrSetV3.pdf
Wed Apr 10, 2019 8:57 pm

Joined: Fri Mar 22, 2019 8:03 am
Posts: 168
Location: Girona-Catalonia
There is a total of 63 unique instructions. It seems a lot, but that's in part because SP and PC instructions count as separate ones, and instructions with different addressing modes count as separate. I also counted all the conditional branch instructions separately. There's a dedicated "ret" (return from subroutine) instruction. I chose to have it for assembler cleanliness, but it could be implemented as a "pop" followed by a "jump". That's what the ARM does, but as I said, I chose to have a proper "ret" instruction.

There are 3 available addressing modes for load/stores from/to memory:

(1) Load/store with immediate offset. This adds a constant offset to an address in a register and loads the contents of the resulting address. This is mostly used for C pointer access to data, or for C++ object data member access. For example:

Code:
ld  [#4, r3], r0


(2) Load/store with register offset. This adds an offset in a register to an address in another register and loads the contents of the resulting address. This is useful for accessing array elements. For example:

Code:
ld  [r2, r3], r0


(3) Load/store with absolute address. This gets the contents of an address specified as a direct value. This is useful for accessing C global variables. For example:

Code:
ld [&globVar], r0


- There are no pre/post increment/decrement modes for general purpose registers, but the SP implements "push rd" and "pop rd" instructions that do just that with the SP as the pointer register. The SP also implements load/store with immediate offset to retrieve data from function stack frames.

- There are no more addressing modes because the above already cover the most common scenarios, and virtually everything else that could be required can (or so I think) be performed by combining one of the above with an ALU operation.

- Loads and stores can be performed as byte (8 bits) or as word (16 bits). In the case of byte stores, the higher byte is ignored, effectively truncating the 16 bit register contents to its lower 8 bits. This is consistent with the C specification.

- 8 bit loads are performed by loading the 8 bit memory contents into the lower byte of a register, and either setting the higher byte to zero or sign extending the 8 bit value into a 16 bit value. The zero or sign extension is performed by the load instruction, so there's no need to emit an additional instruction.

- Since function arguments can be passed to functions through registers (registers R0 to R3 are used like that by default before using the stack frame), and arguments are always word aligned and promoted to 16 bit values before being passed, register-to-register zero-extend and sign-extend instructions are also required. They are used when byte sized arguments are specified in C code.
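To make the load/store rules above a bit more concrete, here is a minimal C sketch of how they could be modelled in software. The `mem` array and all helper names (`ea_imm`, `ea_reg`, `ld_zb`, `ld_sb`, `ld_w`, `st_b`) are hypothetical and invented for this illustration only, and little-endian byte order is assumed for word access.

```c
#include <stdint.h>

/* Hypothetical 64 KiB memory image, for illustration only. */
static uint8_t mem[65536];

/* (1) immediate offset and (2) register offset both add to a base register;
   the sum wraps around at 16 bits. (3) absolute addressing needs no helper:
   the effective address is the constant itself. */
static uint16_t ea_imm(uint16_t base, uint16_t offset) { return (uint16_t)(base + offset); }
static uint16_t ea_reg(uint16_t base, uint16_t index)  { return (uint16_t)(base + index); }

/* Byte load with zero extension: the upper byte of the result is 0. */
static uint16_t ld_zb(uint16_t addr) { return mem[addr]; }

/* Byte load with sign extension: bit 7 is copied into bits 8..15. */
static uint16_t ld_sb(uint16_t addr)
{
    uint16_t v = mem[addr];
    return (v & 0x80u) ? (uint16_t)(v | 0xFF00u) : v;
}

/* Word load (little-endian order is an assumption of this sketch). */
static uint16_t ld_w(uint16_t addr)
{
    return (uint16_t)(mem[addr] | (mem[(uint16_t)(addr + 1)] << 8));
}

/* Byte store: the upper byte of the register is simply ignored. */
static void st_b(uint16_t reg, uint16_t addr) { mem[addr] = (uint8_t)(reg & 0xFFu); }
```

With this model, storing 0x12F0 as a byte leaves 0xF0 in memory, and reading it back gives 0x00F0 with the zero-extending load or 0xFFF0 with the sign-extending one.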

In some circumstances, access to non aligned bytes in register function arguments may be required, for example when passing a struct containing byte fields by value to a function. In these cases the compiler generates shift instructions to "shift" the required value to the lower register byte. The ARM architecture features instructions to perform constant shifts of any length in a single instruction. I chose not to implement that because it would require a "barrel shifter" in hardware, which may take several ICs. Instead I added a Swap instruction that swaps the lower and higher bytes of a register. By combining a swap with a zero-extend or a sign-extend I can achieve the required result. It's two instructions instead of one, but it simplifies the hardware. It works like this:

(1) An 8 bit logical shift left is equivalent to "zero-extend" followed by "swap".
(2) An 8 bit logical shift right is equivalent to "swap" followed by "zero-extend".
(3) An 8 bit arithmetic shift right is equivalent to "swap" followed by "sign-extend".

So these things take two instructions on my architecture instead of a single one on the ARM, but it's still much better and more elegant than the awful output that the compiler generates for the AVR and the MSP430, consisting of a series of 8 shift instructions in a row.
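The three equivalences can be checked with a small C model. This is only a sketch: the function names mirror the instruction mnemonics, but they are software stand-ins written for this illustration, not part of any CPU74 tool.

```c
#include <stdint.h>

/* Software stand-ins for the three register instructions. */
static uint16_t swapb(uint16_t r) { return (uint16_t)((r << 8) | (r >> 8)); }
static uint16_t zext(uint16_t r)  { return (uint16_t)(r & 0x00FFu); }
static uint16_t sext(uint16_t r)
{
    /* Copy bit 7 of the low byte into bits 8..15. */
    return (r & 0x0080u) ? (uint16_t)(r | 0xFF00u) : (uint16_t)(r & 0x00FFu);
}

/* (1) 8 bit logical shift left: zero-extend, then swap. */
static uint16_t lsl8(uint16_t r) { return swapb(zext(r)); }

/* (2) 8 bit logical shift right: swap, then zero-extend. */
static uint16_t lsr8(uint16_t r) { return zext(swapb(r)); }

/* (3) 8 bit arithmetic shift right: swap, then sign-extend. */
static uint16_t asr8(uint16_t r) { return sext(swapb(r)); }
```

For example, `lsl8(0x1234)` gives 0x3400, `lsr8(0x1234)` gives 0x0012, and `asr8(0x8234)` gives 0xFF82.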

That's the only target independent optimisation that I implemented so far. In fact, I do not anticipate that I would need any more target optimisations because the compiler is THAT good at producing optimised assembly from C code that as long as you adhere to the type of instructions that the compiler likes to have, I think there's no need to complicate things further.

In my next posts, I will try to show simple C code examples and the output the compiler generates for my ISA. "if" statements and branches still do not work, but I should be able to show some examples of passing function arguments and performing several linear operations.

Joan


Wed Apr 10, 2019 10:13 pm

Joined: Fri Aug 01, 2014 3:00 pm
Posts: 23
Just found this project and it looks really neat!

I've been working on an assembly project with the MSP430 and had a few thoughts:
Quote:
The MSP430 appears to rely on a significant number of registers, 16 registers, but indexed stack access is comparatively more expensive because it's a two word instruction, the second word contains the indexed offset field. The instruction set also enables extensive memory access for all ALU instructions, albeit expensive, thanks to totally orthogonal addressing modes, which in my opinion defeats the need for such many registers.
Quote:
I reduced the number of registers from a total of 16 including SP and PC, to 10 including SP and PC. (8 General purpose, SP, PC). I found that more than 8 general purpose registers are rarely used by the compiler.
All register to register operations are 1 cycle on the MSP430, whereas accessing memory can take up to 5 cycles depending on the addressing mode. It's a shame if the compiler isn't using every single last register! You can really speed things up, so I am always running out of them. On the other hand, there are occasionally times when doing an ALU operation directly on an address turns out a cycle or two faster than loading the data into a register, modifying, and storing.

One slightly inconvenient thing is that the @Rn and @Rn+ addressing modes only work on source operands (not sure if you consider this less orthogonal). MOV @R3,R4 (which moves the data pointed to by R3 to R4) works but you can't do MOV R4,@R3. MOV R4,0(R3) accomplishes the same thing but then you have the extra word of data to encode the 0 offset. Likewise, you can do MOV @R3+,R4 but not MOV R4,@R3+ or MOV R4,0(R3+).


Sat Apr 13, 2019 4:12 pm

Joined: Fri Mar 22, 2019 8:03 am
Posts: 168
Location: Girona-Catalonia
Hi Druzyek,

Thank you very much for your input.

Druzyek wrote:
All register to register operations are 1 cycle on the MSP430, whereas accessing memory can take up to 5 cycles depending on the addressing mode. It's a shame if the compiler isn't using every single last register! You can really speed things up, so I am always running out of them. On the other hand, there are occasionally times when doing an ALU operation directly on an address turns out a cycle or two faster than loading the data into a register, modifying, and storing


I agree with these points, but it's not that the compiler isn't able to use every single register. The compiler of course attempts to use as many registers as possible, to generate as few memory accesses as possible, and to keep as much temporary data as possible in registers. However, my point is that the circumstances of real compiled programs do not always enable the compiler to use all registers, all the time. C programs and particularly C++ programs tend to access structures or class data members that are created in global memory or on the stack. Data is commonly passed by reference to functions, and therefore primary access is still to memory, even if the reference is in a register. Many functions tend to be short, thus not requiring many registers. Furthermore, functions must preserve used registers on the stack to avoid conflicts with the calling functions' register usage. The saving of registers on the stack is itself a memory operation (and the MSP430 is not particularly efficient at that compared with other processors). The result is that, in practice, there are not that many opportunities to use a big number of registers in an optimal way even if they are available to the compiler. It's not a fault of the compiler, but just the way things are.

That said, I'm not claiming that having more registers is not advantageous; it definitely is. In fact most modern RISC processors have plenty of them. But in my opinion that's because having them is relatively cheap in terms of the number of additional gates or transistors required. The proportional benefit of having, say, double the registers is not that great in my opinion. I mean, a processor is by no means much faster for having 16 registers instead of 8, or 32 instead of 16. The performance increase is possibly only marginal.

For my architecture, I got to the conclusion that a total number of 8 general purpose registers is adequate considering that (1) I will have to implement them on 74xx chips, that (2) I am able to encode them with 3 bit fields in the instruction set, and that (3) this allows me to incorporate many 3-register instructions (using only 9 bits of the instruction encoding), including all ALU operations with two sources and one destination, which in turn helps the compiler to have a lower demand for additional registers and to generate more compact code through fewer register to register moves.


Sat Apr 13, 2019 7:59 pm

Joined: Fri Aug 01, 2014 3:00 pm
Posts: 23
Quote:
Data is commonly passed by reference to functions, and therefore primary access is still to memory, even if the reference is in a register. Many functions tend to be short, thus not requiring many registers. Furthermore, functions must preserve used registers on the stack to avoid conflicts with the calling functions' register usage. The saving of registers on the stack is itself a memory operation (and the MSP430 is not particularly efficient at that compared with other processors). The result is that, in practice, there are not that many opportunities to use a big number of registers in an optimal way even if they are available to the compiler. It's not a fault of the compiler, but just the way things are.
I see what you mean about passing by reference. Does the compiler always push registers when it jumps to a new function? A lot of times it is possible to juggle the register assignments so that the calling function and the functions it calls use different registers, so there is minimal stack usage. All the registers stay full when you are several levels deep into the call stack. I think you might benefit a lot from more registers if the compiler can figure out it doesn't need to push anything when it enters some functions.

Quote:
For my architecture, I got to the conclusion that a total number of 8 general purpose registers is adequate
Sounds good :)


Sun Apr 14, 2019 1:45 pm

Joined: Fri Mar 22, 2019 8:03 am
Posts: 168
Location: Girona-Catalonia
Druzyek wrote:

I see what you mean about passing by reference. Does the compiler always push registers when it jumps to a new function? A lot of times it is possible to juggle the register assignments so that the calling function and the functions it calls use different registers, so there is minimal stack usage. All the registers stay full when you are several levels deep into the call stack. I think you might benefit a lot from more registers if the compiler can figure out it doesn't need to push anything when it enters some functions.


I see what you say, but a compiler is a compiler. It does a lot of amazing optimisations, some of which would take a human a lot of work to find, especially complex pattern matching. But the compiler's code optimiser is just a computer algorithm, not a human. A compiler cannot predict when a particular function will be called by other functions, or whether it will be called at all. In particular, the compiler cannot predict runtime execution paths that may involve conditional branches, jump tables, or runtime related events. Therefore, it's not generally possible for a compiler to choose a register assignment for a function that will not interfere with its possible callers, simply because runtime execution paths cannot be determined at compile time. At best, the compiler may decide that the callee is short enough, or that it can only ever be called from a single location, so it can be inlined and optimised by avoiding the function call altogether.

What you can do for the compiler is to set the rules for dealing with function arguments. This is generally called the "calling convention". For example, you can define a subset of registers that will be used by default for passing function arguments and for return values (so the stack will only be used if there's not enough register space for the arguments). You can also define another subset of registers that can be freely used without preserving their values. The remaining subset then has to be preserved upon function return, because those registers are potentially in use by some caller.

There's some flexibility in defining the calling convention, and you can even define conditional rules for it. For example, you may decide that some data types are only passed in registers and others only on the stack, if that would improve performance on your architecture. But the important thing to consider is that the "calling convention" can only be a set of general rules, because the compiler can't know the program execution flow at compile time.

That said, I am aware that JIT compilers are able to extend optimisations beyond the possibilities of static compilers. JIT compilers are able to gather probabilistic runtime information as the program executes, and do things such as reordering jump tables to take advantage of the fastest path, or redefining calling conventions to take advantage of the more probable execution scenarios. But that's beyond my level of expertise.
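As a rough illustration of the calling-convention rules described above, here is a minimal C sketch. Only the use of r0..r3 for arguments comes from this thread; the split into freely usable registers (r0..r3) and preserved registers (r4..r7), the masks, and the function names are assumptions made up for the example, and the real CPU74 convention may differ.

```c
#include <stdint.h>

enum { NUM_ARG_REGS = 4 };   /* r0..r3 carry arguments by default */

/* Assumed split, for illustration only (bit n stands for register rn):
   r0..r3 may be overwritten freely, r4..r7 must be preserved. */
#define CALLER_SAVED_MASK 0x0Fu
#define CALLEE_SAVED_MASK 0xF0u

/* Returns the register number (0..3) that carries word-sized argument
   `index`, or -1 when the argument has to be passed on the stack. */
static int arg_register(int index)
{
    return (index >= 0 && index < NUM_ARG_REGS) ? index : -1;
}

/* Given a mask of the registers a function writes, returns the registers
   it has to push on entry and pop again before returning. */
static uint8_t regs_to_preserve(uint8_t used_mask)
{
    return (uint8_t)(used_mask & CALLEE_SAVED_MASK);
}
```

Under these assumed masks, a function that writes r0, r1 and r4 (mask 0x13) only needs to save r4 (mask 0x10). Note that the rule is fixed in advance and never looks at who the callers are.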

Joan


Sun Apr 14, 2019 8:46 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1258
(There's another technology which can be used these days: link time optimisation. At link time, much more is known about the whole program, and if the linker gets to see enough detail of what the code generator knew, it can make big improvements. However, this is a big subject, probably too big for an independent hobby project! Looks like both LLVM and GCC have something.)


Mon Apr 15, 2019 8:03 am

Joined: Fri Mar 22, 2019 8:03 am
Posts: 168
Location: Girona-Catalonia
Hi BigEd,

Indeed a clever linker can do its own optimisations too, but as you say that's too much for my project, as I will not even target the compiler to produce 'linkable' object files. I will just produce assembly code with the compiler, which I will then translate into actual machine code with my custom made assembler. That seems to me the most reasonable approach in my case.

Still, the compiler is able to create cross function optimised code by itself, in a linker-like fashion, if you declare functions as static. A static function in a C or C++ file is not meant to be called by (or even available to) code outside the same file. The compiler is clever enough to take this into account and is able to apply more aggressive optimisations to such functions, as it has full control over their scope. So in fact, to some extent you can force some linker-like optimisations on the compiler by explicitly declaring some functions as static.
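A minimal example of the idea, with made-up function names. Whether the compiler actually inlines `scale` or gives it a custom register assignment depends on the optimisation level and the target, so take this as a sketch of the opportunity rather than a guarantee.

```c
/* Internal linkage: only code in this file can call scale(), so the
   compiler sees every call site and is free to inline it or to drop
   the standard calling convention for it. */
static int scale(int x)
{
    return x * 3;
}

/* External linkage: this function may be called from other files, so
   the general calling convention must be respected here. */
int scaled_sum(int a, int b)
{
    return scale(a) + scale(b);
}
```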


Mon Apr 15, 2019 8:43 am

Joined: Fri Mar 22, 2019 8:03 am
Posts: 168
Location: Girona-Catalonia
Ok, I now have some examples of early compiler code generation for my architecture. I'll make a selection of examples that I think are interesting to share. I will be comparing them with the output for the MSP430 and ARM-Thumb architectures. Just for reference, I have named my architecture "CPU74".

The first example is a simple computation with a struct variable in global memory or passed by reference to a function. This is the C code:

Code:
struct A
{
  unsigned  char l[2];
  unsigned  char m;
  unsigned  char n;
} globa;

int test0( int aaa )
{
  int ss = 34;
  globa.m = ss + globa.l[0] + globa.l[1] + aaa;
  return globa.m;
}

int test00( struct A *a, int aaa )
{
  int ss = 34;
  a->m = ss + a->l[0] + a->l[1] + aaa;
  return a->m;
}


This is the resulting assembly for several architectures (I am omitting here some parts of the assembly files that are less interesting):

CPU74
Code:
test0:                                  ; @test0
; %bb.0:                                ; %entry
   ld.zb   [&globa], r1
   add   r0, r1, r0
   ld.zb   [&globa+1], r1
   add   r0, r1, r0
   add   #34, r0
   st.b   r0, [&globa+2]
   zext   r0, r0
   ret

test00:                                 ; @test00
; %bb.0:                                ; %entry
   ld.zb   [r0, #0], r2
   add   r1, r2, r1
   ld.zb   [r0, #1], r2
   add   r1, r2, r1
   add   #34, r1
   st.b   r1, [r0, #2]
   zext   r1, r0
   ret



MSP430
Code:
test0:                                  ; @test0
; %bb.0:                                ; %entry
   mov.b   &globa, r13
   add.w   r12, r13
   mov.b   &globa+1, r12
   add.w   r13, r12
   add.w   #34, r12
   mov.b   r12, &globa+2
   mov.b   r12, r12
   ret

test00:                                 ; @test00
; %bb.0:                                ; %entry
   mov.b   0(r12), r14
   add.w   r13, r14
   mov.b   1(r12), r13
   add.w   r14, r13
   add.w   #34, r13
   mov.b   r13, 2(r12)
   mov.b   r13, r13
   mov.w   r13, r12
   ret


ARM-Thumb

Code:
_test0:
@ %bb.0:                                @ %entry
   ldr   r1, LCPI0_0
LPC0_0:
   add   r1, pc
   ldr   r1, [r1]
   ldrb   r2, [r1]
   adds   r0, r0, r2
   ldrb   r2, [r1, #1]
   adds   r2, r0, r2
   adds   r2, #34
   strb   r2, [r1, #2]
   movs   r0, #255
   ands   r0, r2
   bx   lr

_test00:
@ %bb.0:                                @ %entry
   ldrb   r2, [r0]
   adds   r1, r1, r2
   ldrb   r2, [r0, #1]
   adds   r1, r1, r2
   adds   r1, #34
   strb   r1, [r0, #2]
   movs   r0, #255
   ands   r0, r1
   bx   lr


For all architectures the function arguments are passed in registers (MSP430 is a bit odd because it uses reverse numbering for registers: the upper registers are the ones used to pass arguments). Some observations are worth making:
(0) Note that both the CPU74 and MSP430 assemblies follow the convention that the destination operand is the LAST one specified in the assembly instruction, whereas for Thumb it is the first one. I like the "last is destination" convention because it reminds me of the VAX-11 architecture. It's just a matter of preference, and there's no other reason for it.
(1) The C source code specifies several additions with mixed types: unsigned char and int, with an unsigned char store. The function return value must be promoted to int. This is achieved in several ways depending on the architecture. Both CPU74 and Thumb use implicitly zero-extending load instructions to move unsigned chars to word length registers, then perform normal adds, and finally zero extend the result before returning.
(2) Thumb uses an and instruction for the last zero extension, because there's no explicit instruction for that on registers. In other cases, Thumb may use combinations of left and right shifts to achieve the desired results. CPU74 has explicit instructions for zero extending and sign extending register contents, and shares with Thumb the ability to perform zero-extended and sign-extended loads.
(3) MSP430 uses byte instructions that implicitly zero extend to word when registers are involved. If signed data types were used in the source code, this would have resulted in explicit sign extend instructions inserted after the loads for the MSP430.
(4) Thumb does not have instructions to directly load word sized immediates into a register, so it uses a funny trick with the program counter to load an address reference to the global variable into a register.
(5) Both Thumb and CPU74 have three operand instructions. This helps to reduce code size through an effective choice of registers. On the other hand, MSP430 requires an additional instruction to move from register to register before returning from the function.
(6) It's also interesting to note that the Thumb does not have an explicit ret instruction. It uses a special register to store the return address, and it completely avoids using the stack in leaf functions. That's an interesting approach that surely adds some performance on heavily used small functions. However, for the CPU74 I chose the classical approach of pushing/popping the return address onto the stack as part of the call/return sequence, so I avoid implementing a special register for that.
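To see where each extension and truncation in the listings comes from, the implicit C conversions of test00 can be written out by hand. This is only an annotated restatement of the C source above; the name `test00_explicit` and the local variables are made up, and the comments map each step to the CPU74 instruction that appears to do the job.

```c
struct A
{
    unsigned char l[2];
    unsigned char m;
    unsigned char n;
};

/* Same computation as test00, with every implicit conversion spelled out. */
int test00_explicit(struct A *a, int aaa)
{
    int l0 = (int)a->l[0];          /* zero-extending byte load (ld.zb) */
    int l1 = (int)a->l[1];          /* zero-extending byte load (ld.zb) */
    int sum = 34 + l0 + l1 + aaa;   /* plain word-sized adds (add) */
    a->m = (unsigned char)sum;      /* truncating byte store (st.b) */
    return (int)a->m;               /* zero extension of the stored byte (zext) */
}
```

For instance, with l[0] = 200, l[1] = 100 and aaa = 5 the sum is 339, the byte stored in m is 339 modulo 256 = 83, and the function returns 83. This is why a final zero extension is needed even though all the adds are word sized.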


Last edited by joanlluch on Mon Apr 15, 2019 12:00 pm, edited 2 times in total.



Mon Apr 15, 2019 9:36 am

Joined: Fri Mar 22, 2019 8:03 am
Posts: 168
Location: Girona-Catalonia
The next example is an array access, with the array either in global memory or passed by reference to a function. This is the C code, and the resulting assembly for the chosen architectures:

Code:
char ss[30] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,30};

 int test3( int i, int j )
{
  return ss[i]+ss[j];
}

 int test30( char *ss, int i, int j )
{
  return ss[i]+ss[j];
}


CPU74
Code:
test3:                                  ; @test3
; %bb.0:                                ; %entry
   mov   &ss, r2
   ld.sb   [r0, r2], r0
   ld.sb   [r1, r2], r1
   add   r1, r0, r0
   ret

test30:                                 ; @test30
; %bb.0:                                ; %entry
   ld.sb   [r0, r1], r1
   ld.sb   [r0, r2], r0
   add   r0, r1, r0
   ret


MSP430
Code:
test3:                                  ; @test3
; %bb.0:                                ; %entry
   mov.b   ss(r12), r14
   sxt   r14
   mov.b   ss(r13), r12
   sxt   r12
   add.w   r14, r12
   ret

test30:                                 ; @test30
; %bb.0:                                ; %entry
   add.w   r12, r14
   add.w   r12, r13
   mov.b   0(r13), r13
   sxt   r13
   mov.b   0(r14), r12
   sxt   r12
   add.w   r13, r12
   ret


ARM-Thumb
Code:
_test3:
@ %bb.0:                                @ %entry
   lsls   r0, r0, #16
   asrs   r0, r0, #16
   ldr   r2, LCPI0_0
LPC0_0:
   add   r2, pc
   ldrsb   r0, [r2, r0]
   lsls   r1, r1, #16
   asrs   r1, r1, #16
   ldrsb   r1, [r2, r1]
   adds   r0, r1, r0
   bx   lr

_test30:
@ %bb.0:                                @ %entry
   lsls   r1, r1, #16
   asrs   r1, r1, #16
   ldrsb   r1, [r0, r1]
   lsls   r2, r2, #16
   asrs   r2, r2, #16
   ldrsb   r0, [r0, r2]
   adds   r0, r0, r1
   bx   lr


Comments:
(0) The shortest code is for the CPU74, thanks to the implicit sign-extending load instructions (same as Thumb), the register, register indirect addressing mode (like Thumb) and the three operand ALU instructions (like Thumb).
(1) Thumb code appears to be larger, but in this case that's because registers are 32 bits long and therefore the 16 bit array indexes must be extended to the length of the registers. The sign extension is performed by means of a pair of shift instructions: a logical shift left by 16 bits followed by an arithmetic shift right by 16 bits effectively results in a sign extension of a 16 bit integer into a 32 bit integer. The root cause of this extra code is that I have set the compiler to 16 bit integer types. The exact same C code with 32 bit ints would not have produced such extra instructions on the Thumb, but would have created more code on both the CPU74 and MSP430.
(2) In this case the worst scenario is for the MSP430, because it must explicitly insert sign-extend instructions and use two word long instructions for indirect data access.
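The shift pair used by Thumb in comment (1) can be modelled in C as follows. This is a sketch with made-up helper names; the arithmetic shift is written out by hand because right-shifting a negative signed value is implementation-defined in C.

```c
#include <stdint.h>

/* Logical shift left on a 32 bit register (models lsls). */
static uint32_t lsl32(uint32_t r, unsigned n) { return r << n; }

/* Arithmetic shift right (models asrs), written portably: shift, then
   refill the vacated upper bits with copies of the original sign bit.
   Valid for 0 < n < 32. */
static uint32_t asr32(uint32_t r, unsigned n)
{
    uint32_t shifted = r >> n;
    uint32_t fill = (r & 0x80000000u) ? (0xFFFFFFFFu << (32u - n)) : 0u;
    return shifted | fill;
}

/* The "lsls #16" / "asrs #16" pair: whatever is in the upper half of the
   register is discarded and replaced by copies of bit 15. */
static uint32_t sext16_to_32(uint32_t r)
{
    return asr32(lsl32(r, 16), 16);
}
```

So a 16 bit index of 0xFFFF (that is, -1) becomes 0xFFFFFFFF in the 32 bit register, while 0x7FFF stays 0x00007FFF regardless of any garbage in the upper half.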


Mon Apr 15, 2019 9:57 am

Joined: Fri Mar 22, 2019 8:03 am
Posts: 168
Location: Girona-Catalonia
The third example is a non-aligned byte access. To create this situation I pass a struct by value to a function. That's not a particularly common scenario, and certainly not the right use of structs in C, but I found the resulting assembly code interesting to share. This is the code:
Code:
struct A
{
  unsigned  char l[2];
  unsigned  char m;
  unsigned  char n;
};

int test( struct A a )
{
  return a.n;
}


In this case the compiler must generate an access to byte 3 of struct A (counting from 0, of course). If this were a memory access it would be no problem, because byte data at any alignment can be accessed from memory (as single bytes) on all architectures. However, the calling convention dictates that function parameters up to 8 bytes long are passed in registers. The struct is 4 bytes long, so it's passed entirely in registers. All architectures are little endian, so what happens is that struct field n gets passed in the most significant byte of a register. These are the results:

CPU74
Code:
test:                                   ; @test
; %bb.0:                                ; %entry
   swapb   r1, r0
   zext   r0, r0
   ret


MSP430
Code:
test:                                   ; @test
; %bb.0:                                ; %entry
   clrc
   rrc.w   r13
   rra.w   r13
   rra.w   r13
   rra.w   r13
   rra.w   r13
   rra.w   r13
   rra.w   r13
   rra.w   r13
   mov.w   r13, r12
   ret


ARM-Thumb
Code:
_test:
@ %bb.0:                                @ %entry
   lsrs   r0, r0, #24
   bx   lr


Notes:
(0) The trick in all cases is to shift the affected register right in order to get the required value into its lower bits.
(1) Only the Thumb has explicit instructions to perform constant shift amounts in a single instruction. In this case the entire struct is passed in a single 32 bit register and the required byte is found by shifting 24 bits to the right. The function just returns the same register (this time as an int rather than a struct) after the shift.
(2) The CPU74 does not incorporate constant shift instructions of an arbitrary amount. But I added a Swap byte instruction to help with constant shifts that are multiples of 8, and instructed the compiler to use it in combination with extend instructions to create the desired effect. So the result is a two instruction function. The first instruction moves the byte-swapped contents of R1 into R0 to get the higher byte into the lower bits, then R0 is zero extended to clear the most significant bits.
(3) The MSP430 target compiler implementation looks kind of incomplete in this case, because the compiler could do the same as the CPU74 implementation does in this circumstance, given that the processor also features a swap byte instruction. Instead, it generates a series of single bit shift instructions that certainly look unnecessary to my eyes. In its defence, the same kind of rather unoptimised code is also produced by the AVR compiler, and it's not the first time that I have found such series of chained single shift instructions on other 8 bit architectures that I studied in the past.
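The register layout behind these listings can be sketched in C. The packing helpers below are hypothetical and only model what the little-endian calling convention appears to do, judging from the assembly above (field n arriving in the high byte of r1 on the CPU74, and in bits 24..31 of r0 on Thumb).

```c
#include <stdint.h>

struct A
{
    unsigned char l[2];
    unsigned char m;
    unsigned char n;
};

/* CPU74 view: the struct takes two 16 bit registers. r0 would hold l[0]
   and l[1]; r1 holds m in its low byte and n in its high byte. */
static uint16_t pack_r1(struct A a) { return (uint16_t)(a.m | (a.n << 8)); }

/* swapb followed by zext, as in the CPU74 listing. */
static uint16_t extract_n_cpu74(uint16_t r1)
{
    uint16_t swapped = (uint16_t)((r1 << 8) | (r1 >> 8));
    return (uint16_t)(swapped & 0x00FFu);
}

/* Thumb view: the whole struct fits in one 32 bit register. */
static uint32_t pack_r0_thumb(struct A a)
{
    return (uint32_t)a.l[0] | ((uint32_t)a.l[1] << 8) |
           ((uint32_t)a.m << 16) | ((uint32_t)a.n << 24);
}

/* lsrs #24, as in the Thumb listing. */
static uint32_t extract_n_thumb(uint32_t r0) { return r0 >> 24; }
```

With l = {0x11, 0x22}, m = 0x33 and n = 0x44, r1 is 0x4433 on the CPU74 and r0 is 0x44332211 on Thumb, and both extraction sequences return 0x44.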


Mon Apr 15, 2019 10:34 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1258
Thanks for these worked examples and analyses. Interesting that byte-permute isn't (presently) picked up by the compiler.


Mon Apr 15, 2019 7:05 pm

Joined: Fri Mar 22, 2019 8:03 am
Posts: 168
Location: Girona-Catalonia
Hi BigEd,

BigEd wrote:
Thanks for these worked examples and analyses. Interesting that byte-permute isn't (presently) picked up by the compiler.


I think this is because the compiler takes a rather agnostic approach to the intermediate code it generates. It actually hardly resembles any real target, as it tends to be a higher level implementation of an assembly language. There's not a particularly big number of instructions available, but the existing ones are quite powerful, and there's an infinite number of registers too.

It's the responsibility of the target backend to lower the LLVM form to the available machine instructions, and that too often requires custom code because the LLVM intermediate representation does not fit well with real architectures. The following document describes (towards the end, in the "Instruction reference" section) all the instructions that the compiler may generate as intermediate code:

http://llvm.org/docs/LangRef.html

These are the instructions that must be converted to real assembly code by the backend implementation. It's a lot harder than I initially assumed because many aspects of the LLVM intermediate code differ significantly from real architectures.


Mon Apr 15, 2019 9:02 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1258
Is the answer some kind of peephole pass which can consolidate inefficient instruction sequences into compact ones?


Mon Apr 15, 2019 9:06 pm

Joined: Fri Mar 22, 2019 8:03 am
Posts: 168
Location: Girona-Catalonia
BigEd wrote:
Is the answer some kind of peephole pass which can consolidate inefficient instruction sequences into compact ones?

Yes, it sort of works like that. There are a number of passes that are performed to convert the intermediate representation to the target form. Some of them are more or less automated: you create target description files in a kind of meta language that is processed into C++ code by a utility called tablegen. This covers things such as the register and instruction operand mapping, and most of the calling conventions. Others must be explicitly implemented in code, such as the SelectionDAG legalize phase, branch folding, instruction selection, and machine code lowering.
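Just to illustrate the general peephole idea from the question, here is a toy sketch in plain C. It has nothing to do with the actual LLVM passes or data structures: the `insn` type, the opcodes and the `peephole` function are all invented for this example. It scans a linear instruction list and replaces eight consecutive one-bit logical right shifts of the same register with a byte swap followed by a zero extend.

```c
#include <stddef.h>

/* Hypothetical opcodes for a toy single-register instruction stream. */
typedef enum { OP_LSR1, OP_SWAPB, OP_ZEXT, OP_OTHER } opcode;

typedef struct
{
    opcode op;
    int reg;   /* register the instruction reads and writes */
} insn;

/* Returns 1 when code[i..i+7] are eight one-bit logical right shifts
   of the same register. */
static int is_lsr8_run(const insn *code, size_t n, size_t i)
{
    if (n - i < 8)
        return 0;
    for (size_t k = 0; k < 8; k++)
        if (code[i + k].op != OP_LSR1 || code[i + k].reg != code[i].reg)
            return 0;
    return 1;
}

/* Rewrites the stream in place and returns the new instruction count.
   The write index never overtakes the read index, so this is safe. */
static size_t peephole(insn *code, size_t n)
{
    size_t out = 0;
    for (size_t i = 0; i < n; ) {
        if (is_lsr8_run(code, n, i)) {
            int reg = code[i].reg;
            code[out++] = (insn){ OP_SWAPB, reg };
            code[out++] = (insn){ OP_ZEXT, reg };
            i += 8;
        } else {
            code[out++] = code[i++];
        }
    }
    return out;
}
```

A ten instruction stream with eight such shifts in the middle shrinks to four instructions. In a real backend this kind of pattern is usually caught earlier, while the shift by 8 is still a single operation in the intermediate representation, rather than after it has been expanded.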

The basics are described in the link below, but to be honest this document makes it look easier than it actually is, because it only covers the surface.

http://llvm.org/docs/WritingAnLLVMBackend.html

I am by no means proficient in it, as I learn things only as I need them to move on to the next step, and most of the time I have no other choice than trying to understand what the existing code for the available architectures does, and just start from that to implement what is required for my target.


Mon Apr 15, 2019 9:30 pm