 Some C compilers 

Joined: Mon Oct 07, 2019 1:26 pm
Posts: 24
Hi Joan, Robinson, thank you for your replies.

Joan, yes, my code generator does not do any optimizing yet. Everything simply gets allocated in the workspace area.
A pity that all the code is optimized away; that leaves almost nothing to compare.
Perhaps I should use different demo code that cannot be optimized away so easily.


Robinsonb5, your assembly code takes some practice to understand. I read through your postings and I understand that it is this way because you can assemble it to machine code without having a dedicated assembler. Personally I think PDP-11 or 68000 code is easier to read; that's why I chose PDP-11-lookalike instructions for the Kobold assembly. In your case I would rename temp to ACC and have explicit instructions like MOV (R2),ACC

It is surprising to see that your CPU uses up to two times more instructions than Kobold to do the same thing, and the Kobold code has not even been optimized. But since you're using an FPGA there is plenty of speed, so I guess it's not a problem for you.
I like your endeavour to get performance out of 8-bit instructions. But the single 'temp' accumulator seems to be a bottleneck. Did you think about using a small stack of accumulators (8, 16 or 32 of them)?
It's good that your C compiler doesn't optimize (as Joan's does), because that would also optimize everything away and leave nothing to compare.


Sun Dec 29, 2019 9:44 am

Joined: Fri Mar 22, 2019 8:03 am
Posts: 240
Location: Girona-Catalonia
roelh wrote:
Hi Joan, Robinson, thank you for your replies.
Joan, yes, my code generator does not do any optimizing yet. Everything simply gets allocated in the workspace area.
A pity that all the code is optimized away; that leaves almost nothing to compare.
Perhaps I should use different demo code that cannot be optimized away so easily.

Hi Roelh, LLVM optimises away loads/stores to local variables that have no effect outside the function, or that can be evaluated as constant expressions. The trick to prevent this is to use function arguments that are actually used to compute return values, to update global vars, or to update arguments passed by reference. Also, small functions within the visibility scope of the compiler (namely in the same C source file, or with known bodies in included header files) tend to be inlined so they are not called at all, but this can be explicitly prevented if required by applying the 'noinline' attribute to these functions. I made some changes to your code to show this, and this is the result:

Modified C code
Code:
__attribute__((noinline))
int sum (int num1, int num2)
{
    return num1+num2;
}
int (*f2p) (int, int);
int n;

int f(int op1, int *op2)
{
    *op2 = sum(op1,n);
    f2p = sum;
    //Calling a function using function pointer
    op1 = (*f2p)(10, 13);
    return op1;
}

// demo structure member, array element, and if statement

char ar[8];

struct tt {
  int r1;
  int r2;
};

int i;

void test(struct tt *p) {
  if (i == 5)
    { i = p->r2; }
  else
    ar[6] = i;
}


CPU74

Code:
# ---------------------------------------------
# sum
# ---------------------------------------------
   .globl   sum
sum:
   add   r1, r0, r0
   ret

# ---------------------------------------------
# f
# ---------------------------------------------
   .globl   f
f:                        // function enters with arguments in r0 and r1
   add   SP, -2, SP          // make room on the stack for callee register saving (this is equivalent to 'push r4', but I removed push/pop instructions from the set)
   st.w   r4, [SP, 0]       // save r4 to the stack (required as per the calling conventions)
   mov   r1, r4              // second argument (it's a pointer, but this does not matter at this point) goes to r4
   ld.w   [&n], r1          // load r1 with the contents of global 'n'
   call   @sum              // call sum with r0 and r1 as arguments
   st.w   r0, [r4, 0]       // store the sum return value to the address pointed to by r4
   mov   &sum, r0            // load the address of sum in r0
   st.w   r0, [&f2p]        // store the address of sum into global var f2p
   mov   10, r0              // set r0 to 10
   mov   13, r1              // set r1 to 13
   call   @sum              // call sum with arguments in r0 and r1, result goes to r0
   ld.w   [SP, 0], r4       // restore r4 (this, along with the next instruction, is equivalent to 'pop r4')
   add   SP, 2, SP           // restore the previous frame
   ret                     // return with the result in r0

# ---------------------------------------------
# test
# ---------------------------------------------
   .globl   test
test:
   ld.w   [&i], r1
   cmp.ne   r1, 5
   brcc   .LBB2_2
   ld.w   [r0, 2], r0
   st.w   r0, [&i]
   ret
.LBB2_2:
   st.b   r1, [&ar+6]
   ret

# ---------------------------------------------
# Global Data
# ---------------------------------------------
   .comm   n,2,2
   .comm   f2p,2,2
   .comm   i,2,2
   .comm   ar,8,2





If I do not apply the noinline attribute to the sum function, then I get what would be more natural for LLVM, which is this:

Code:
# ---------------------------------------------
# sum
# ---------------------------------------------
   .globl   sum
sum:
   add   r1, r0, r0
   ret

# ---------------------------------------------
# f
# ---------------------------------------------
   .globl   f
f:                        // function enters with arguments in r0 and r1 (r1 is a pointer)
   ld.w   [&n], r2          // load variable n in r2
   add   r2, r0, r0          // add to first argument (this is the inlined sum function)
   st.w   r0, [r1, 0]       // store the sum return value to the address pointed to by r1
   mov   &sum, r0            // load the address of sum in r0
   st.w   r0, [&f2p]        // store the address of sum into global var f2p
   mov   23, r0              // this again is the inlined call to sum, this time resulting in a constant value
   ret                     // return r0

# ---------------------------------------------
# test
# ---------------------------------------------
   .globl   test
test:
   ld.w   [&i], r1
   cmp.ne   r1, 5
   brcc   .LBB2_2
   ld.w   [r0, 2], r0
   st.w   r0, [&i]
   ret
.LBB2_2:
   st.b   r1, [&ar+6]
   ret

# ---------------------------------------------
# Global Data
# ---------------------------------------------
   .comm   n,2,2
   .comm   f2p,2,2
   .comm   i,2,2
   .comm   ar,8,2



Now, the sum function is inlined and the result is much faster and shorter code.

In either example, the function pointer is still not actually used to call a function. This is because the compiler knows at compile time which function the pointer points to, so it just generates code to call that function directly. For the compiler to actually use a function pointer, it must be dereferenced in a scope where the target function is not known, for example if the pointer is set in one function and used in a different one, or if the pointer is passed as an argument.
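
For example, here is a minimal sketch (the function names are just illustrative, not part of the demo code) where the pointer cannot be resolved at compile time, so a real indirect call must be emitted:
Code:
int add (int a, int b) { return a + b; }
int mul (int a, int b) { return a * b; }

// Compiled on its own, 'apply' does not know whether it will receive
// add, mul, or some function from another file, so it must perform a
// genuine indirect call through the pointer.
int apply (int (*op)(int, int), int x, int y)
{
    return op(x, y);
}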

I hope this is of interest.

Joan


Sun Dec 29, 2019 12:18 pm

Joined: Mon Oct 07, 2019 1:26 pm
Posts: 24
Hi Joan, I took your modified code as the new demo code. I also modified the array access to show some real indexing (with i).

For function f, the difference is not very great: 18 instructions for Kobold versus 14 for CPU74. The other functions will also improve when I put some optimizations into the generated code.

It is tempting to work on the code generator, but I have an empty Kobold PCB waiting to be populated...


Sun Dec 29, 2019 1:06 pm

Joined: Fri Mar 22, 2019 8:03 am
Posts: 240
Location: Girona-Catalonia
roelh wrote:
Hi Joan, I took your modified code as the new demo code. I also modified the array access to show some real indexing (with i).

For function f, the difference is not very great: 18 instructions for Kobold versus 14 for CPU74. The other functions will also improve when I put some optimizations into the generated code.

It is tempting to work on the code generator, but I have an empty Kobold PCB waiting to be populated...


Indeed, the difference is not that great. One aspect that slightly hurts the CPU74 ISA is the lack of push/pop instructions. The ISA defines that up to 4 function arguments are passed in registers (r0 to r3) and the remaining ones (or any varargs) are passed on the stack. Functions are not required to preserve registers r0 to r3, but must preserve r4 to r7 (the SP and PC are not part of the general-purpose set). Any function can potentially modify r0...r3, so these registers cannot be used across calls; that is why r4 is used in the code example above, and since r4 is one of the 'must preserve' registers, it has to be saved and restored on the stack.

The lack of push/pop means 4 instructions in total just for saving and restoring one register, which is why inlining benefits this particular case so much. In the more general case the penalty is not that big, because a stack frame generally must be allocated anyway for local variables or register spills, or a greater number of registers must be saved. In some cases, implementing the 'tail call' optimisation would also help, but I have not gone that far...
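
As a rough illustration (not taken from the demo code), a tail call is a call after which the caller has nothing left to do, so its frame could be released before jumping to the callee:
Code:
int g (int x);   // defined elsewhere

int f (int x)
{
    // Tail position: with the optimisation, f would release its own
    // frame and jump to g, and g would return directly to f's caller.
    return g(x + 1);
}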

I think you will have to consider the same kind of balance when you define what to do with registers across function calls: should they all be preserved, only some of them, or none of them? In every case this adds generated code in order to adhere to the chosen convention. Your current approach seems to be to store all arguments in the local function frame, which is fine, since you can then access arguments across function calls, and presumably you assume that registers are not preserved by called functions. Depending on the available memory addressing modes, this may or may not reduce code compared with making more extensive use of registers. My gut feeling is that pure load/store architectures benefit more from registers saved and restored once per function call than from a potentially larger number of memory accesses. But I guess this is highly debatable and dependent on the actual architecture.
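
To make the trade-off concrete, here is a small illustrative sketch (again not from the demo code, the names are made up): a value that stays live across a call must either sit in a preserved register, paid for once in the prologue/epilogue, or be spilled to memory around every call:
Code:
int sum (int a, int b);   // defined elsewhere

int three_sum (int a, int b, int c)
{
    int total = sum(a, b);
    // 'total' is live across the second call: either it occupies a
    // callee-saved register (one save/restore per function), or it
    // must be stored to the stack and reloaded around the call.
    total = total + sum(total, c);
    return total;
}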


Sun Dec 29, 2019 2:18 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1351
(Is this something one could explore and tune? Recompile the same benchmarks with the compiler set up to honour different calling conventions and see how many registers it's best to use for scratch, how many for parameters, how many to preserve?)


Sun Dec 29, 2019 3:01 pm

Joined: Wed Nov 20, 2019 12:56 pm
Posts: 21
roelh wrote:
Robinsonb5, your assembly code takes some practice to understand. I read through your postings and I understand that it is this way because you can assemble it to machine code without having a dedicated assembler. Personally I think PDP-11 or 68000 code is easier to read; that's why I chose PDP-11-lookalike instructions for the Kobold assembly.


68000 was the first assembly language I learned, so it's naturally the one I consider easiest to read!

Quote:
In your case I would rename temp to ACC and have explicit instructions like MOV (R2),ACC


I originally intended that register to be an accumulator, i.e. the target of ALU instructions, but I actually ended up with ALU instructions writing their results to the nominated register instead, with tmp simply providing the implicit second operand, so I'm not sure ACC is strictly appropriate. It would certainly be clearer to specify the second operand explicitly, though.

Quote:
It is surprising to see that your CPU uses up to two times more instructions than Kobold to do the same thing, and the Kobold code has not even been optimized.


Yes, though the code currently emitted by the compiler is functional but far from optimal - I've found that by hand-coding I can reduce code size by anything between a quarter and two thirds - for exactly the reasons that prompted you to start this project. So I shall be watching its progress with interest!

Quote:
But since you're using an FPGA there is plenty of speed so I guess it's not a problem for you.


True - an Altera DE2 board can run this comfortably at 133MHz.

Quote:
I like your endeavour to get performance out of 8-bit instructions. But the single 'temp' accumulator seems to be a bottleneck.


Yes, it is - especially since I haven't yet implemented any kind of result forwarding.

Quote:
Did you think about using a small stack of accumulators (8, 16 or 32 of them)?


Yes, I considered that, and it would be interesting to explore. I haven't done it yet because this project doesn't exist purely for its own sake - I have an application in mind for it: a project where there's next to no block RAM left, so I need a CPU with a small logic footprint, no block RAM used for the register file, and good code density so the ROM doesn't take up too much block RAM either. I'm hoping I can improve the code generator to the point where the last requirement is met!


Sun Dec 29, 2019 8:52 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1026
Location: Canada
I figured I'd share the output from my compiler. The compiler isn't smart enough to use registers automatically; if one wants to use registers, the register keyword must be used. Register arguments can generally be used only for leaf routines.
I modified the test program slightly:
Code:
int sum (register int num1, register int num2)
{
    return num1+num2;
}
int (*f2p) (register int, register int);
int n;

int f(int op1, int *op2)
{
    *op2 = sum(op1,n);
    f2p = sum;
    //Calling a function using function pointer
    op1 = (*f2p)(10, 13);
    return op1;
}

// demo structure member, array element, and if statement

char ar[8];

struct tt {
  int r1;
  int r2;
};

int i;

void test(register struct tt *p) {
  if (i == 5)
    { i = p->r2; }
  else
    ar[6] = i;
}

Compiler output isn't too bad, but it's no match for LLVM.
Code:
   code
   align   8
;====================================================
; Basic Block 0
;====================================================
public code _sum:
            add         $v0,$a0,$a1
TestC74_8:
            ret         $lk1
endpublic



   bss
   align   2
public bss _n:
   fill.b   4,0x00

endpublic
   code
   align   8
;====================================================
; Basic Block 0
;====================================================
public code _f:
            sub         $sp,$sp,#16
            st          $fp,[$sp]
            st          $r0,4[$sp]
            mov         $r23,$lk1
            st          $r23,12[$sp]
            mov         $fp,$sp
            sub         $sp,$sp,#16
            st          $r11,0[$sp]
            ld          $r11,16[$fp]
;     *op2 = sum(op1,n);
            ld          $t0,20[$fp]
            st          $t0,-8[$fp]
            ld          $a1,_n
            mov         $a0,$r11
            jal         $lk1,_sum
            ld          $t0,-8[$fp]
            mov         $t1,$v0
            st          $t1,[$t0]
;     f2p = sum;
            ldi         $t0,#_sum
            st          $t0,_f2p
;     op1 = (*f2p)(10, 13);
            ld          $t0,_f2p
            st          $t0,-8[$fp]
            ldi         $a1,#13
            ldi         $a0,#10
            jal         $lk1,[$t0]
            mov         $t0,$v0
            mov         $r11,$t0
;     return op1;
            mov         $v0,$r11
TestC74_13:
TestC74_16:
            ld          $r11,0[$sp]
            mov         $sp,$fp
            ld          $fp,[$sp]
            ld          $r23,12[$sp]
            mov         $lk1,$r23
            add         $sp,$sp,#24
            ret         $lk1
endpublic



   bss
   align   8
   align   8
   dw   $FFF0200000000001
public bss _ar:
   fill.b   8,0x00

endpublic
   align   2
public bss _i:
   fill.b   4,0x00

endpublic
   code
   align   8
;====================================================
; Basic Block 0
;====================================================
public code _test:
; if (i==5)
            ld          $v0,_i
            cmp         $cr2,$v0,#5
            bne         $cr2,TestC74_27
;   { i = p-> r2; }
            ld          $v0,4[$a0]
            st          $v0,_i
            bra         TestC74_28
TestC74_27:
; else ar[6] = i;
            ldb         $v0,_i
            stb         $v0,_ar+6
TestC74_28:
TestC74_23:
TestC74_26:
            ret         $lk1
endpublic



   rodata
   align   16
;   global   _test
;   global   _ar
;   global   _f2p
;   global   _sum
;   global   _f
;   global   _i
;   global   _n

_________________
Robert Finch http://www.finitron.ca


Mon Dec 30, 2019 6:57 am

Joined: Mon Oct 07, 2019 1:26 pm
Posts: 24
Hi Rob, thanks for sharing your compiler output.

Is this the C64 compiler that I found on your website? I tried to look at C64.zip but got a 404 error (on this page: http://www.finitron.ca/Projects/C64.html ).

For which processor is this compiled?

$lk1, $t0, $t1, $v0 puzzled me. Are these alias names for certain registers?

I saw that when a stack frame is needed, the entry and exit code is quite large...


Mon Dec 30, 2019 9:34 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 1026
Location: Canada
Quote:
Is this the C64 compiler that I found on your website? I tried to look at C64.zip but got a 404 error (on this page: http://www.finitron.ca/Projects/C64.html ).

No, it's a much more recent version with a lot of fixes. I should really update the website.

Quote:
For which processor is this compiled?

The compiled code is for a custom processor I'm working on. It has an oddball number of bits per byte, to make things interesting.

Quote:
$lk1, $t0, $t1, $v0 puzzled me. Are these alias names for certain registers?

Yes, those are aliases for certain registers. $tx is a temporary, $vx is a return value, and $lkx is a link register. $crx is a compare-result register, for which there is a separate register file (8 entries). There are four link registers available as a separate register file.
Quote:
I saw that when a stack frame is needed, the entry and exit code is quite large...

Yes, it's 16 bytes. It's a fixed size to keep the compiler simpler: there's the return address, the exception return address, and the frame pointer to store, plus one extra slot. I wasn't too worried about the frame size because there's lots of RAM and stacks generally aren't very deep.

_________________
Robert Finch http://www.finitron.ca


Mon Dec 30, 2019 1:52 pm