Last visit was: Sun Nov 10, 2024 7:16 pm
It is currently Sun Nov 10, 2024 7:16 pm



 [ 11 posts ] 
 Perils of Optimization On Modern High-Performance Processors 
Author Message

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
Recently, I've been studying the book "From Mathematics to Generic Programming," by Alexander A. Stepanov and Daniel E. Rose. I took up studying this book in order to gain an understanding of the process by which Stepanov designed the C++ Standard Template Library. With regard to the book, I find that it is well written, and intellectually stimulating. I particularly enjoy the authors' linkage of the algorithms described in the book to the mathematical foundation for those algorithms. (Note: In the spirit of full disclosure of any personal biases regarding this subject, I am not a particular fan of canned algorithms. In particular, I am not a particular fan of canned algorithms such as those in the Standard Template Library, Matlab, Simulink, and similar toolsets which, in my opinion, many software developers, engineers, and scientists are using blindly.)

There is a mathematical basis for many algorithms commonly available in standard libraries. However, there appears to be a blind trust that many of these algorithms behave correctly under a wide range of conditions. This assumption is generally unsupported by the documentation that accompanies many of these collections of algorithms. To be fair to the developers/suppliers of these libraries and toolsets, many software developers, engineers, and scientists do not test in a critical manner the libraries for suitability to their particular application domain. Perhaps the lack of curiosity is driven in part by a lack of confidence with regard to mathematics. There are many algorithms whose mathematical basis is certainly difficult for many, but there are also many algorithms whose mathematical basis should be easily understood by anyone capable of learning and using a programming language.

The second chapter of the book relates to the Egyptian/Russian Peasant multiplication algorithm applicable to unsigned non-zero integers. This is an algorithm that should be easily understood by anybody who has learned to program computers. I decided to understand the development of the authors' implementations of the algorithm in C. Therefore, I implemented each example that the authors' presented in C, and compiled and tested each "improved" version of the algorithm. After completing that exercise, I believed that I understood each of the authors' algorithms.

After developing and testing each of the 8 algorithms, which progress from simple recursion to non-recursive looping, I decided to test the efficiency of the algorithms using a test program that exhaustively calculated all 65536 products for two arguments, each ranging from 1 to 256. Interestingly, I found that a recursive algorithm yielded the best time instead of the non-recursive looping implementation. Throughout the elaboration of the various algorithms, the authors clearly emphasized the cost of testing variables and calling subroutines. In other words, they wanted the reader to pick up on the processing costs associated with testing variables and recursively calling the core multiplication subroutine.

With respect to recursion (Algorithms 1-5), the authors built up the algorithmic progression until Algorithm 5 was implemented in a strict tail recursive form. I've used recursion in the past, but it is not a natural way for me to develop algorithms. Until reading the authors' definition for strictly tail recursive they provide in the book, I had not actually followed through on determining the conditions needed to refer to an implementation as tail recursive. For the non-recursive looping algorithms (Algorithms 6-8), the authors go through the development of these algorithms with the objective of reducing the number of tests and the number of loops required to implement the desired product. (Note: In the multiplication algorithm being considered, the multiplier should be less than multiplicand. Examination of the multiplier is used to develop the partial products which are subsequently added together to form the product. To minimize the number of recursions/loops, the multiplier should be the lesser of the two numbers. The algorithms implemented do not ensure that the multiplier is less than the multiplicand. My testing is performed in such a manner that for roughly half of the products computed the multiplier is less than the multiplicand, and greater than or equal to the multiplicand for the remaining products.)

The following table shows the results I obtained for the 8 algorithms implemented for this multiplication technique. I used Eclipse and GCC on a 64-bit Linux Mint platform powered by a 1 GHz, 4 GB, AMD E1 processor. My test program consists of three nested loops: n - iteration loop to ensure a sufficiently long test time; k - multiplier loop; l - multiplicand loop. The multiplier and multiplicand loops iterate from 1 to 256 to generate 65536 products. The iteration loop repeats the inner two loops. The resulting products generated by the 8 implementations are tested against the product of k and l. If there is an error, an error message is printed and a break is taken. I use the argument vector to input the number of iterations and the algorithm number, which I use as an index into an array of function pointers. Each algorithm is timed using the time utility of Linux: $ time test_pgm n_iterations algorithm_index.
Code:
Routine_Nm,Dbg/-O3,Rls/-Os,Rls/-O3,Rls/-O2,Rls/-O1,Rls/-O0
Multiply_1,  7.234,  9.585,  7.389,  9.904, 10.778, 17.481
Multiply_2,  4.975,  7.063,  4.988,  8.005, 15.532, 17.464
Multiply_3,  4.725,  6.218,  4.721,  4.711, 11.791, 17.492
Multiply_4,  6.633,  6.236,  6.625,  6.687, 12.668, 17.995
Multiply_5,  4.286,  6.065,  4.283,  4.287, 14.193, 17.850
Multiply_6,  5.939,  6.961,  5.930,  5.935,  7.199,  9.832
Multiply_7,  4.481,  6.667,  4.481,  4.481,  7.178,  9.755
Multiply_8,  5.016,  6.192,  5.013,  5.015,  6.709,  9.001

Across the top are the optimizations employed. Dbg signifies the Debug output, and Rls signifies the Release output from the GCC compiler. I captured results for all 5 optimization levels that can be selected using the -Ox optimization option. When using optimization levels -O2 or -O3, the strictly tail recursive Algorithm 5 provides the best times. Without optimization (-O0) or minimal optimization (-O1), the non-recursive looping algorithms, Algorithm 6-8, handily beat out the recursive algorithms, Algorithms 1-5.

There was some variation in the measurements using the time utility, so I captured the following data using the perf utility, which provides more consistent timing data by using the performance counters inside the processor. I captured data using the Rls/-O3 optimization for all 8 algorithms. Interestingly, there is a wide variation in the effective number of instructions per cycle, the total number of branches, and the number missed branches. Algorithm 5 is still the fastest, but its number of instructions per cycle is not the highest, nor does it have the least number of branches or missed branches.
Code:
morrisma@morrisma5 ~/Development/CProjects/Stepanov $ perf stat Release/Stepanov 2000 1

 Performance counter stats for 'Release/Stepanov 2000 1':

       7280.156042      task-clock (msec)         #    0.995 CPUs utilized         
               677      context-switches          #    0.093 K/sec                 
                28      cpu-migrations            #    0.004 K/sec                 
                52      page-faults               #    0.007 K/sec                 
     7,238,360,838      cycles                    #    0.994 GHz                     [50.01%]
                 0      stalled-cycles-frontend   #    0.00% frontend cycles idle    [50.05%]
                 0      stalled-cycles-backend    #    0.00% backend  cycles idle    [50.21%]
    13,485,495,935      instructions              #    1.86  insns per cycle         [50.04%]
     1,680,883,125      branches                  #  230.886 M/sec                   [49.97%]
           861,832      branch-misses             #    0.05% of all branches         [49.87%]

       7.317273564 seconds time elapsed

morrisma@morrisma5 ~/Development/CProjects/Stepanov $ perf stat Release/Stepanov 2000 2

 Performance counter stats for 'Release/Stepanov 2000 2':

       5015.594286      task-clock (msec)         #    0.994 CPUs utilized         
               478      context-switches          #    0.095 K/sec                 
                13      cpu-migrations            #    0.003 K/sec                 
                53      page-faults               #    0.011 K/sec                 
     4,979,463,334      cycles                    #    0.993 GHz                     [49.96%]
                 0      stalled-cycles-frontend   #    0.00% frontend cycles idle    [50.03%]
                 0      stalled-cycles-backend    #    0.00% backend  cycles idle    [50.18%]
     9,100,716,091      instructions              #    1.83  insns per cycle         [50.14%]
     1,450,183,464      branches                  #  289.135 M/sec                   [50.03%]
           848,119      branch-misses             #    0.06% of all branches         [49.84%]

       5.045531490 seconds time elapsed

morrisma@morrisma5 ~/Development/CProjects/Stepanov $ perf stat Release/Stepanov 2000 3

 Performance counter stats for 'Release/Stepanov 2000 3':

       4666.820274      task-clock (msec)         #    0.994 CPUs utilized         
               594      context-switches          #    0.127 K/sec                 
                 5      cpu-migrations            #    0.001 K/sec                 
                53      page-faults               #    0.011 K/sec                 
     4,644,182,243      cycles                    #    0.995 GHz                     [50.08%]
                 0      stalled-cycles-frontend   #    0.00% frontend cycles idle    [50.00%]
                 0      stalled-cycles-backend    #    0.00% backend  cycles idle    [49.97%]
     7,500,004,657      instructions              #    1.61  insns per cycle         [49.99%]
     1,449,465,718      branches                  #  310.590 M/sec                   [50.03%]
           807,622      branch-misses             #    0.06% of all branches         [50.15%]

       4.696254481 seconds time elapsed

morrisma@morrisma5 ~/Development/CProjects/Stepanov $ perf stat Release/Stepanov 2000 4

 Performance counter stats for 'Release/Stepanov 2000 4':

       6437.925075      task-clock (msec)         #    0.989 CPUs utilized         
               673      context-switches          #    0.105 K/sec                 
                25      cpu-migrations            #    0.004 K/sec                 
                52      page-faults               #    0.008 K/sec                 
     6,405,362,667      cycles                    #    0.995 GHz                     [49.99%]
                 0      stalled-cycles-frontend   #    0.00% frontend cycles idle    [50.03%]
                 0      stalled-cycles-backend    #    0.00% backend  cycles idle    [50.07%]
     6,836,760,873      instructions              #    1.07  insns per cycle         [50.11%]
     2,232,507,206      branches                  #  346.774 M/sec                   [50.03%]
         8,277,203      branch-misses             #    0.37% of all branches         [49.96%]

       6.508887512 seconds time elapsed

morrisma@morrisma5 ~/Development/CProjects/Stepanov $ perf stat Release/Stepanov 2000 5

 Performance counter stats for 'Release/Stepanov 2000 5':

       4329.726333      task-clock (msec)         #    0.996 CPUs utilized         
               386      context-switches          #    0.089 K/sec                 
                15      cpu-migrations            #    0.003 K/sec                 
                51      page-faults               #    0.012 K/sec                 
     4,306,586,028      cycles                    #    0.995 GHz                     [49.95%]
                 0      stalled-cycles-frontend   #    0.00% frontend cycles idle    [49.96%]
                 0      stalled-cycles-backend    #    0.00% backend  cycles idle    [50.15%]
     6,839,918,869      instructions              #    1.59  insns per cycle         [50.10%]
     2,235,540,097      branches                  #  516.324 M/sec                   [50.04%]
         9,797,084      branch-misses             #    0.44% of all branches         [49.95%]

       4.345760144 seconds time elapsed

morrisma@morrisma5 ~/Development/CProjects/Stepanov $ perf stat Release/Stepanov 2000 6

 Performance counter stats for 'Release/Stepanov 2000 6':

       5819.323941      task-clock (msec)         #    0.994 CPUs utilized         
               531      context-switches          #    0.091 K/sec                 
                25      cpu-migrations            #    0.004 K/sec                 
                55      page-faults               #    0.009 K/sec                 
     5,790,127,256      cycles                    #    0.995 GHz                     [49.91%]
                 0      stalled-cycles-frontend   #    0.00% frontend cycles idle    [49.96%]
                 0      stalled-cycles-backend    #    0.00% backend  cycles idle    [50.08%]
     7,097,873,094      instructions              #    1.23  insns per cycle         [50.11%]
     2,234,531,073      branches                  #  383.985 M/sec                   [50.14%]
         7,263,670      branch-misses             #    0.33% of all branches         [49.98%]

       5.851833687 seconds time elapsed

morrisma@morrisma5 ~/Development/CProjects/Stepanov $ perf stat Release/Stepanov 2000 7

 Performance counter stats for 'Release/Stepanov 2000 7':

       4518.088795      task-clock (msec)         #    0.969 CPUs utilized         
               521      context-switches          #    0.115 K/sec                 
                 9      cpu-migrations            #    0.002 K/sec                 
                54      page-faults               #    0.012 K/sec                 
     4,493,347,007      cycles                    #    0.995 GHz                     [50.03%]
                 0      stalled-cycles-frontend   #    0.00% frontend cycles idle    [50.03%]
                 0      stalled-cycles-backend    #    0.00% backend  cycles idle    [50.00%]
     7,025,496,576      instructions              #    1.56  insns per cycle         [49.97%]
     2,226,158,240      branches                  #  492.721 M/sec                   [50.02%]
        12,654,222      branch-misses             #    0.57% of all branches         [50.03%]

       4.664287850 seconds time elapsed

morrisma@morrisma5 ~/Development/CProjects/Stepanov $ perf stat Release/Stepanov 2000 8

 Performance counter stats for 'Release/Stepanov 2000 8':

       5037.158903      task-clock (msec)         #    0.996 CPUs utilized         
               466      context-switches          #    0.093 K/sec                 
                18      cpu-migrations            #    0.004 K/sec                 
                53      page-faults               #    0.011 K/sec                 
     5,012,721,116      cycles                    #    0.995 GHz                     [50.09%]
                 0      stalled-cycles-frontend   #    0.00% frontend cycles idle    [50.09%]
                 0      stalled-cycles-backend    #    0.00% backend  cycles idle    [50.10%]
     6,643,054,184      instructions              #    1.33  insns per cycle         [49.91%]
     2,097,877,713      branches                  #  416.480 M/sec                   [50.00%]
         5,236,597      branch-misses             #    0.25% of all branches         [49.95%]

       5.058676239 seconds time elapsed

morrisma@morrisma5 ~/Development/CProjects/Stepanov $

Not exactly sure what these results mean for a different application. My objective was simply to time the various algorithms. The implication from the book was that each succeeding algorithm (higher number) was a better implementation than the previous algorithm. From the results presented above, that is clearly not the case. It appears to me that the more the program construction is able to successfully coerce instruction level parallelism (ILP) from the optimizer the faster the program will be. I don't know exactly how to interpret the performance effects of the branch prediction logic.

The code used in this test is included below. With these results in mind, I will attempt to build and test a version of Algorithm 5 using the MiniCPU instruction set. The next chapter in book deals with the prime number sieve algorithm ascribed to Eratosthenes. I am looking forward to developing and testing the algorithms developed by Stepanov and Rose for this ancient algorithm, which is often used as a processor/language benchmark.
Code:
/*
 ============================================================================
 Name        : Stepanov.c
 Author      : Michael A. Morris
 Version     :
 Copyright   : Copyright 2017 - GPLv3
 Description : Various Algorithms for Egyptian/Russian Peasant Multiplication in C, Ansi-style
 ============================================================================
 */

#include <stdio.h>
#include <stdlib.h>

#define half(x) ((x) >> 1)
#define odd(x)   ((x) & 1)

int multiply1(int, int);

int multiply2(int, int);
int mult_acc0(int, int, int);

int multiply3(int, int);
int mult_acc1(int, int, int);

int multiply4(int, int);
int mult_acc2(int, int, int);

int multiply5(int, int);
int mult_acc3(int, int, int);

int multiply6(int, int);
int mult_acc4(int, int, int);

int multiply7(int, int);
int multiply8(int, int);

typedef int (*MFP_t)(int, int);

int main(int argc, char *argv[])
{
   int i, j, k, l;
   int n, prod;
   MFP_t mfp[8] = {
                      &multiply1,
                  &multiply2,
                  &multiply3,
                  &multiply4,
                  &multiply5,
                  &multiply6,
                  &multiply7,
                  &multiply8
                  };

   i = atoi(argv[1]);
   j = atoi(argv[2]) - 1;

   for(n = 0; n < i; n++){
      //printf("\n********************* n=%5d **************************\n\n", n);
      for(k = 1; k <= 256; k++) {
         //printf("k=%d\n", k);
         for(l = 1; l <= 256; l++) {
            prod = mfp[j](k, l);
            if(prod != (k * l)) {
               printf("error: %d; %d\n", prod, k*l);
               break;
            }
         }
      }
   }

   return EXIT_SUCCESS;
}


/*
 * int multiply1(int n, int a)   -   Multiply two unsigned integers using Egyptian
 *                                  multiplication algorithm.
 */

int multiply1(int n, int a)
{
   if(1 == n) return(a);
   int result = multiply1(half(n), a << 1);
   if(odd(n)) return(result + a);
   return(result);
}


int multiply2(int n, int a)
{
   if(1 == n) return(a);
   return(mult_acc0(0, n, a));
}

int mult_acc0(int r, int n, int a)
{
   if(1 == n) return(r + a);
   if(odd(n)) {
      return(mult_acc0(r + a, half(n), a << 1));
   } else {
      return(mult_acc0(r    , half(n), a << 1));
   }
}


int multiply3(int n, int a)
{
   if(1 == n) return(a);
   return(mult_acc1(0, n, a));
}

int mult_acc1(int r, int n, int a)
{
   if(1 == n) return(r + a);
   if(odd(n)) r += a;
   return(mult_acc1(r, half(n), a << 1));
}


int multiply4(int n, int a)
{
   if(1 == n) return(a);
   return(mult_acc2(0, n, a));
}

int mult_acc2(int r, int n, int a)
{
   if(odd(n)) {
      r += a;
      if(1 == n) return(r);
   }
   return(mult_acc2(r, half(n), a << 1));
}


int multiply5(int n, int a)
{
   if(1 == n) return(a);
   return(mult_acc3(0, n, a));
}

int mult_acc3(int r, int n, int a)
{
   if(odd(n)) {
      r += a;
      if(1 == n) return(r);
   }
   n = half(n);
   a = a << 1;
   return(mult_acc3(r, n, a));
}


int multiply6(int n, int a)
{
   if(1 == n) return(a);
   return(mult_acc4(a, n - 1, a));
}

int mult_acc4(int r, int n, int a)
{
   while(1) {
      if(odd(n)) {
         r += a;
         if(1 == n) return(r);
      }
      n = half(n);
      a = a << 1;
   }
}


int multiply7(int n, int a)
{
   while(!odd(n)) {
      a = a << 1;
      n = half(n);
   }
   if(1 == n) return(a);
   return(mult_acc4(a, n - 1, a));
}


int multiply8(int n, int a)
{
   while(!odd(n)) {
      a = a << 1;
      n = half(n);
   }
   if(1 == n) return(a);
   return(mult_acc4(a, half(n - 1), a << 1));
}

_________________
Michael A.


Sun Sep 03, 2017 11:46 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1806
Interesting! A very thorough job of doing the exercises, measuring and then having a think. I'm guilty of almost always skipping exercises, which is very self-limiting.

I was just going to see if I could add anything by colouring - first colouring according to the best/worst per row, which is looking at the compiler, and then colouring according to best/worst in each column, which is looking at the execution. But it seems I can't add colour to code blocks.

Still, I think even writing that paragraph shows that we've got three things going on
- the algorithm, in the abstract
- the compiler's competence in building a code sequence
- the CPU's competence in executing that code sequence

As your thread title indicates, the modern CPU is something of a beast in forging ahead: speculatively executing through conditional branches, and making light work of jumps, calls, and returns.

Let's have a look without the benefit of colour:
MichaelM wrote:
Code:
Routine_Nm,Dbg/-O3,Rls/-Os,Rls/-O3,Rls/-O2,Rls/-O1,Rls/-O0
Multiply_1,  7.234,  9.585,  7.389,  9.904, 10.778, 17.481
Multiply_2,  4.975,  7.063,  4.988,  8.005, 15.532, 17.464
Multiply_3,  4.725,  6.218,  4.721,  4.711, 11.791, 17.492
Multiply_4,  6.633,  6.236,  6.625,  6.687, 12.668, 17.995
Multiply_5,  4.286,  6.065,  4.283,  4.287, 14.193, 17.850
Multiply_6,  5.939,  6.961,  5.930,  5.935,  7.199,  9.832
Multiply_7,  4.481,  6.667,  4.481,  4.481,  7.178,  9.755
Multiply_8,  5.016,  6.192,  5.013,  5.015,  6.709,  9.001


We can see from the final column, where the compiler isn't being clever, that the 6, 7 and 8 versions are a lot "better" - but we can't tell if it's the algorithm or the CPU.

In an effort to try to gain some insight, I've cropped down the 'perf' results to see only what I think might be the most informative stats, noting that these are for the O3 compilation and therefore the compiler has already tried to squeeze out as much as it can from the source code:
MichaelM wrote:
Code:
 Performance counter stats for 'Release/Stepanov 2000 1':
     7,238,360,838      cycles                    #    0.994 GHz                     [50.01%]
    13,485,495,935      instructions              #    1.86  insns per cycle         [50.04%]
     1,680,883,125      branches                  #  230.886 M/sec                   [49.97%]
           861,832      branch-misses             #    0.05% of all branches         [49.87%]

 Performance counter stats for 'Release/Stepanov 2000 2':
     4,979,463,334      cycles                    #    0.993 GHz                     [49.96%]
     9,100,716,091      instructions              #    1.83  insns per cycle         [50.14%]
     1,450,183,464      branches                  #  289.135 M/sec                   [50.03%]
           848,119      branch-misses             #    0.06% of all branches         [49.84%]

 Performance counter stats for 'Release/Stepanov 2000 3':
     4,644,182,243      cycles                    #    0.995 GHz                     [50.08%]
     7,500,004,657      instructions              #    1.61  insns per cycle         [49.99%]
     1,449,465,718      branches                  #  310.590 M/sec                   [50.03%]
           807,622      branch-misses             #    0.06% of all branches         [50.15%]

 Performance counter stats for 'Release/Stepanov 2000 4':
     6,405,362,667      cycles                    #    0.995 GHz                     [49.99%]
     6,836,760,873      instructions              #    1.07  insns per cycle         [50.11%]
     2,232,507,206      branches                  #  346.774 M/sec                   [50.03%]
         8,277,203      branch-misses             #    0.37% of all branches         [49.96%]

 Performance counter stats for 'Release/Stepanov 2000 5':
     4,306,586,028      cycles                    #    0.995 GHz                     [49.95%]
     6,839,918,869      instructions              #    1.59  insns per cycle         [50.10%]
     2,235,540,097      branches                  #  516.324 M/sec                   [50.04%]
         9,797,084      branch-misses             #    0.44% of all branches         [49.95%]

 Performance counter stats for 'Release/Stepanov 2000 6':
     5,790,127,256      cycles                    #    0.995 GHz                     [49.91%]
     7,097,873,094      instructions              #    1.23  insns per cycle         [50.11%]
     2,234,531,073      branches                  #  383.985 M/sec                   [50.14%]
         7,263,670      branch-misses             #    0.33% of all branches         [49.98%]

 Performance counter stats for 'Release/Stepanov 2000 7':
     4,493,347,007      cycles                    #    0.995 GHz                     [50.03%]
     7,025,496,576      instructions              #    1.56  insns per cycle         [49.97%]
     2,226,158,240      branches                  #  492.721 M/sec                   [50.02%]
        12,654,222      branch-misses             #    0.57% of all branches         [50.03%]

 Performance counter stats for 'Release/Stepanov 2000 8':
     5,012,721,116      cycles                    #    0.995 GHz                     [50.09%]
     6,643,054,184      instructions              #    1.33  insns per cycle         [49.91%]
     2,097,877,713      branches                  #  416.480 M/sec                   [50.00%]
         5,236,597      branch-misses             #    0.25% of all branches         [49.95%]


We do at least see these points, I think:
- the number of executed instructions goes down at first, from 1, 2, 3 and then the rest of the pack.
- the final code version, 8, executes the fewest branches of the 4-8 group.
- And 8 misses the fewest in that group, too. Which means they are predictable branches.

I'd like to be able to say something useful about the instructions per cycle figure, but I'm not sure if I can.


Mon Sep 04, 2017 3:53 am

Joined: Tue Aug 08, 2017 1:33 pm
Posts: 11
I just tested this on a Linux box (dual Xeon E5630, Ubuntu Mate 16.04, clang), and found version 3 to be the fastest with -O3.

At first, I didn't even notice version 3 (since I was focussing on version 5), and so wrote this:

Code:
int multiply9(int n, int a)
{
  return mult_acc5(0, n, a);
}

int mult_acc5(int r, int n, int a)
{
  r+= (odd(n)?a:0);
  return ((n>1) ? mult_acc5(r, n>>1, a<<1) : r);
}


which gives identical runtime to version 3 on my setup.

The tail-recursive implementations are turned into iterative code at sufficiently high optimization levels (basically, turning the recursive calls into gotos to a point near the start of the function), so there is effectively no function-call overhead in this situation.

Disassembling the code should be interesting, too.


Mon Sep 04, 2017 8:25 am

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
Very nice, compact realization.

I tried several construction changes like those rwiker includes in multiply9. With GCC, I did not get performance improvements using the trigraph (?:) construction, but I did see a minor performance improvement of a few percent using a+a instead of a<<1.

I don't have ready access to different x86 processor variants, but it may be interesting to see a comparison of results from several variants using multiply3, multiply5, and multiply9, while varying the compiler and optimization level. I suspect that, even for a simple benchmark like these multiply routines, we should see performance variations that can be attributed to the architectural differences between the processors. AMD and Intel processors, although they appear to be compatible ISAs, have significant differences in micro-architecture: different caching policies, different instruction decoding logic, different pipeline depths and ILP strategies, etc. I suspect, that on average, the compiler code generators and optimization routines produce code that is a good match to the x86 processors at some level.

I copied rwiker's code and ran it in my environment. The results are posted below:
Code:
morrisma@morrisma5 ~/Development/CProjects/Stepanov $ perf stat Release/Stepanov 2000 9

 Performance counter stats for 'Release/Stepanov 2000 9':

       4677.083417      task-clock (msec)         #    0.998 CPUs utilized         
               411      context-switches          #    0.088 K/sec                 
                 8      cpu-migrations            #    0.002 K/sec                 
                51      page-faults               #    0.011 K/sec                 
     4,653,583,491      cycles                    #    0.995 GHz                     [50.04%]
                 0      stalled-cycles-frontend   #    0.00% frontend cycles idle    [50.08%]
                 0      stalled-cycles-backend    #    0.00% backend  cycles idle    [50.07%]
     8,707,304,550      instructions              #    1.87  insns per cycle         [49.98%]
     1,578,000,951      branches                  #  337.390 M/sec                   [50.03%]
           732,542      branch-misses             #    0.05% of all branches         [50.00%]

       4.687538150 seconds time elapsed

morrisma@morrisma5 ~/Development/CProjects/Stepanov $

In my environment, the code improves on the performance given by my tests for multiply3 using the -O3 optimization level, but it doesn't perform better than multiply5 in my environment. Perhaps the difference in the results can be attributed to the micro-achitectural differences between the Intel Xeon E5630 and the AMD E1 processors.

_________________
Michael A.


Mon Sep 04, 2017 1:14 pm

Joined: Tue Aug 08, 2017 1:33 pm
Posts: 11
It's quite likely that both architectural differences and compiler details affect the performance of the different implementations. I compiled the code using clang (-O3) on my MacBook, and disassembled the code in the debugger:

Code:

.../stepanov {703} $ lldb stepanov
(lldb) target create "stepanov"
Current executable set to 'stepanov' (x86_64).
(lldb) disass -n multiply9
stepanov`multiply9:
stepanov[0x100000e00] <+0>:  pushq  %rbp
stepanov[0x100000e01] <+1>:  movq   %rsp, %rbp
stepanov[0x100000e04] <+4>:  movl   %edi, %eax
stepanov[0x100000e06] <+6>:  andl   $0x1, %eax
stepanov[0x100000e09] <+9>:  cmovnel %esi, %eax
stepanov[0x100000e0c] <+12>: cmpl   $0x2, %edi
stepanov[0x100000e0f] <+15>: jl     0x100000e33               ; <+51>
stepanov[0x100000e11] <+17>: nopw   %cs:(%rax,%rax)
stepanov[0x100000e20] <+32>: sarl   %edi
stepanov[0x100000e22] <+34>: addl   %esi, %esi
stepanov[0x100000e24] <+36>: movl   %edi, %ecx
stepanov[0x100000e26] <+38>: andl   $0x1, %ecx
stepanov[0x100000e29] <+41>: cmovnel %esi, %ecx
stepanov[0x100000e2c] <+44>: addl   %ecx, %eax
stepanov[0x100000e2e] <+46>: cmpl   $0x1, %edi
stepanov[0x100000e31] <+49>: jg     0x100000e20               ; <+32>
stepanov[0x100000e33] <+51>: popq   %rbp
stepanov[0x100000e34] <+52>: retq   
stepanov[0x100000e35] <+53>: nopw   %cs:(%rax,%rax)

(lldb) disass -n mult_acc5
stepanov`mult_acc5:
stepanov[0x100000f30] <+0>:  pushq  %rbp
stepanov[0x100000f31] <+1>:  movq   %rsp, %rbp
stepanov[0x100000f34] <+4>:  movl   %esi, %eax
stepanov[0x100000f36] <+6>:  andl   $0x1, %eax
stepanov[0x100000f39] <+9>:  cmovnel %edx, %eax
stepanov[0x100000f3c] <+12>: addl   %edi, %eax
stepanov[0x100000f3e] <+14>: cmpl   $0x2, %esi
stepanov[0x100000f41] <+17>: jl     0x100000f63               ; <+51>
stepanov[0x100000f43] <+19>: nopw   %cs:(%rax,%rax)
stepanov[0x100000f50] <+32>: sarl   %esi
stepanov[0x100000f52] <+34>: addl   %edx, %edx
stepanov[0x100000f54] <+36>: movl   %esi, %ecx
stepanov[0x100000f56] <+38>: andl   $0x1, %ecx
stepanov[0x100000f59] <+41>: cmovnel %edx, %ecx
stepanov[0x100000f5c] <+44>: addl   %ecx, %eax
stepanov[0x100000f5e] <+46>: cmpl   $0x1, %esi
stepanov[0x100000f61] <+49>: jg     0x100000f50               ; <+32>
stepanov[0x100000f63] <+51>: popq   %rbp
stepanov[0x100000f64] <+52>: retq   



It seems to me that the compiler decides to inline mult_acc5 into multiply9, as well as making mult_acc5 available to callers outside the current file. I keep getting surprised by how good modern compilers can be (then again, I've been surprised by compilers since at least 1992)...

It would be interesting to see how well GCC does on this.


Mon Sep 04, 2017 5:21 pm

Joined: Tue Aug 08, 2017 1:33 pm
Posts: 11
If I move mult_acc5 into a separate source file, the disassembly of multipl9 looks like this:

Code:
(lldb) disass -n multiply9
stepanov`multiply9:
stepanov[0x100000e20] <+0>:  pushq  %rbp
stepanov[0x100000e21] <+1>:  movq   %rsp, %rbp
stepanov[0x100000e24] <+4>:  movl   %esi, %eax
stepanov[0x100000e26] <+6>:  movl   %edi, %ecx
stepanov[0x100000e28] <+8>:  xorl   %edi, %edi
stepanov[0x100000e2a] <+10>: movl   %ecx, %esi
stepanov[0x100000e2c] <+12>: movl   %eax, %edx
stepanov[0x100000e2e] <+14>: popq   %rbp
stepanov[0x100000e2f] <+15>: jmp    0x100000f30               ; mult_acc5
stepanov[0x100000e34] <+20>: nopw   %cs:(%rax,%rax)

(lldb)


So, clang definitely inline-expanded mult_acc5 into multiply9 when both functions were in the same source file.


Mon Sep 04, 2017 5:27 pm

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
The following code represents the disassembly of my Stepanov object module. I am much more familiar with 8086/80186 assembly language than that of the larger members of the ISA, but it looks to me that the mult_accX functions have been inlined in the multiplyY functions. The transformations/optimizations that these compilers appear to be implementing are rather amazing. Need to find and read an architecture description for these processors; my knowledge of the ISA is rather dated. (Note: there is a very interesting discussion on the repz ret instruction that follows the je offset instruction that continues the loop/recursion in the multiply5 function. Can't say that the reason given for that specific construction applies to Intel platforms, but apparently it is a holdover in gcc to support the recommendations of AMD for the K8 processor. Another side note is that one recommendation by AMD for the early 64-bit devices was to limit the number of branches to 3 or less in a block of 16 bytes, which I interpret to include returns instructions as well. Apparently, that processor linked branch prediction to the instruction cache and provided three branch predictions per 16 byte block. )
Code:
morrisma@morrisma5 ~/Development/CProjects/Stepanov $ objdump -d Release/src/Stepanov.o

Release/src/Stepanov.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <multiply2>:
   0:   31 c0                   xor    %eax,%eax
   2:   83 ff 01                cmp    $0x1,%edi
   5:   74 29                   je     30 <multiply2+0x30>
   7:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
   e:   00 00
  10:   89 f9                   mov    %edi,%ecx
  12:   8d 14 30                lea    (%rax,%rsi,1),%edx
  15:   d1 ff                   sar    %edi
  17:   83 e1 01                and    $0x1,%ecx
  1a:   85 c9                   test   %ecx,%ecx
  1c:   0f 45 c2                cmovne %edx,%eax
  1f:   01 f6                   add    %esi,%esi
  21:   83 ff 01                cmp    $0x1,%edi
  24:   75 ea                   jne    10 <multiply2+0x10>
  26:   01 f0                   add    %esi,%eax
  28:   c3                      retq   
  29:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
  30:   89 f0                   mov    %esi,%eax
  32:   c3                      retq   
  33:   66 66 66 66 2e 0f 1f    data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
  3a:   84 00 00 00 00 00

0000000000000040 <multiply3>:
  40:   31 c0                   xor    %eax,%eax
  42:   83 ff 01                cmp    $0x1,%edi
  45:   74 29                   je     70 <multiply3+0x30>
  47:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
  4e:   00 00
  50:   8d 14 30                lea    (%rax,%rsi,1),%edx
  53:   40 f6 c7 01             test   $0x1,%dil
  57:   0f 45 c2                cmovne %edx,%eax
  5a:   d1 ff                   sar    %edi
  5c:   01 f6                   add    %esi,%esi
  5e:   83 ff 01                cmp    $0x1,%edi
  61:   75 ed                   jne    50 <multiply3+0x10>
  63:   01 f0                   add    %esi,%eax
  65:   c3                      retq   
  66:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  6d:   00 00 00
  70:   89 f0                   mov    %esi,%eax
  72:   c3                      retq   
  73:   66 66 66 66 2e 0f 1f    data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
  7a:   84 00 00 00 00 00

0000000000000080 <multiply4>:
  80:   83 ff 01                cmp    $0x1,%edi
  83:   74 23                   je     a8 <multiply4+0x28>
  85:   31 c0                   xor    %eax,%eax
  87:   eb 0b                   jmp    94 <multiply4+0x14>
  89:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
  90:   01 f6                   add    %esi,%esi
  92:   d1 ff                   sar    %edi
  94:   40 f6 c7 01             test   $0x1,%dil
  98:   74 f6                   je     90 <multiply4+0x10>
  9a:   01 f0                   add    %esi,%eax
  9c:   83 ff 01                cmp    $0x1,%edi
  9f:   75 ef                   jne    90 <multiply4+0x10>
  a1:   f3 c3                   repz retq
  a3:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  a8:   89 f0                   mov    %esi,%eax
  aa:   c3                      retq   
  ab:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

00000000000000b0 <multiply5>:
  b0:   83 ff 01                cmp    $0x1,%edi
  b3:   74 23                   je     d8 <multiply5+0x28>
  b5:   31 c0                   xor    %eax,%eax
  b7:   eb 0b                   jmp    c4 <multiply5+0x14>
  b9:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
  c0:   d1 ff                   sar    %edi
  c2:   01 f6                   add    %esi,%esi
  c4:   40 f6 c7 01             test   $0x1,%dil
  c8:   74 f6                   je     c0 <multiply5+0x10>
  ca:   01 f0                   add    %esi,%eax
  cc:   83 ff 01                cmp    $0x1,%edi
  cf:   75 ef                   jne    c0 <multiply5+0x10>
  d1:   f3 c3                   repz retq
  d3:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  d8:   89 f0                   mov    %esi,%eax
  da:   c3                      retq   
  db:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

00000000000000e0 <multiply6>:
  e0:   83 ff 01                cmp    $0x1,%edi
  e3:   89 f0                   mov    %esi,%eax
  e5:   74 1a                   je     101 <multiply6+0x21>
  e7:   83 ef 01                sub    $0x1,%edi
  ea:   89 f2                   mov    %esi,%edx
  ec:   eb 06                   jmp    f4 <multiply6+0x14>
  ee:   66 90                   xchg   %ax,%ax
  f0:   d1 ff                   sar    %edi
  f2:   01 d2                   add    %edx,%edx
  f4:   40 f6 c7 01             test   $0x1,%dil
  f8:   74 f6                   je     f0 <multiply6+0x10>
  fa:   01 d0                   add    %edx,%eax
  fc:   83 ff 01                cmp    $0x1,%edi
  ff:   75 ef                   jne    f0 <multiply6+0x10>
 101:   f3 c3                   repz retq
 103:   66 66 66 66 2e 0f 1f    data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
 10a:   84 00 00 00 00 00

0000000000000110 <multiply7>:
 110:   40 f6 c7 01             test   $0x1,%dil
 114:   89 f0                   mov    %esi,%eax
 116:   75 12                   jne    12a <multiply7+0x1a>
 118:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
 11f:   00
 120:   d1 ff                   sar    %edi
 122:   01 c0                   add    %eax,%eax
 124:   40 f6 c7 01             test   $0x1,%dil
 128:   74 f6                   je     120 <multiply7+0x10>
 12a:   83 ff 01                cmp    $0x1,%edi
 12d:   74 22                   je     151 <multiply7+0x41>
 12f:   83 ef 01                sub    $0x1,%edi
 132:   89 c2                   mov    %eax,%edx
 134:   eb 0e                   jmp    144 <multiply7+0x34>
 136:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
 13d:   00 00 00
 140:   d1 ff                   sar    %edi
 142:   01 d2                   add    %edx,%edx
 144:   40 f6 c7 01             test   $0x1,%dil
 148:   74 f6                   je     140 <multiply7+0x30>
 14a:   01 d0                   add    %edx,%eax
 14c:   83 ff 01                cmp    $0x1,%edi
 14f:   75 ef                   jne    140 <multiply7+0x30>
 151:   f3 c3                   repz retq
 153:   66 66 66 66 2e 0f 1f    data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
 15a:   84 00 00 00 00 00

0000000000000160 <multiply8>:
 160:   40 f6 c7 01             test   $0x1,%dil
 164:   89 f0                   mov    %esi,%eax
 166:   75 12                   jne    17a <multiply8+0x1a>
 168:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
 16f:   00
 170:   d1 ff                   sar    %edi
 172:   01 c0                   add    %eax,%eax
 174:   40 f6 c7 01             test   $0x1,%dil
 178:   74 f6                   je     170 <multiply8+0x10>
 17a:   83 ff 01                cmp    $0x1,%edi
 17d:   74 22                   je     1a1 <multiply8+0x41>
 17f:   83 ef 01                sub    $0x1,%edi
 182:   8d 14 00                lea    (%rax,%rax,1),%edx
 185:   d1 ff                   sar    %edi
 187:   eb 0b                   jmp    194 <multiply8+0x34>
 189:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
 190:   d1 ff                   sar    %edi
 192:   01 d2                   add    %edx,%edx
 194:   40 f6 c7 01             test   $0x1,%dil
 198:   74 f6                   je     190 <multiply8+0x30>
 19a:   01 d0                   add    %edx,%eax
 19c:   83 ff 01                cmp    $0x1,%edi
 19f:   75 ef                   jne    190 <multiply8+0x30>
 1a1:   f3 c3                   repz retq
 1a3:   66 66 66 66 2e 0f 1f    data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
 1aa:   84 00 00 00 00 00

00000000000001b0 <multiply9>:
 1b0:   83 ff 01                cmp    $0x1,%edi
 1b3:   7e 23                   jle    1d8 <multiply9+0x28>
 1b5:   31 c0                   xor    %eax,%eax
 1b7:   eb 0b                   jmp    1c4 <multiply9+0x14>
 1b9:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
 1c0:   01 f6                   add    %esi,%esi
 1c2:   d1 ff                   sar    %edi
 1c4:   89 fa                   mov    %edi,%edx
 1c6:   83 e2 01                and    $0x1,%edx
 1c9:   0f 45 d6                cmovne %esi,%edx
 1cc:   01 d0                   add    %edx,%eax
 1ce:   83 ff 01                cmp    $0x1,%edi
 1d1:   75 ed                   jne    1c0 <multiply9+0x10>
 1d3:   f3 c3                   repz retq
 1d5:   0f 1f 00                nopl   (%rax)
 1d8:   89 f0                   mov    %esi,%eax
 1da:   c3                      retq   
 1db:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

00000000000001e0 <multiply1>:
 1e0:   41 57                   push   %r15
 1e2:   41 56                   push   %r14
 1e4:   41 55                   push   %r13
 1e6:   41 54                   push   %r12
 1e8:   41 89 f4                mov    %esi,%r12d
 1eb:   55                      push   %rbp
 1ec:   53                      push   %rbx
 1ed:   48 83 ec 18             sub    $0x18,%rsp
 1f1:   83 ff 01                cmp    $0x1,%edi
 1f4:   0f 84 b1 00 00 00       je     2ab <multiply1+0xcb>
 1fa:   89 fa                   mov    %edi,%edx
 1fc:   89 f3                   mov    %esi,%ebx
 1fe:   89 fd                   mov    %edi,%ebp
 200:   d1 fa                   sar    %edx
 202:   44 8d 24 36             lea    (%rsi,%rsi,1),%r12d
 206:   83 fa 01                cmp    $0x1,%edx
 209:   0f 84 92 00 00 00       je     2a1 <multiply1+0xc1>
 20f:   89 f9                   mov    %edi,%ecx
 211:   44 8d 2c b5 00 00 00    lea    0x0(,%rsi,4),%r13d
 218:   00
 219:   c1 f9 02                sar    $0x2,%ecx
 21c:   83 f9 01                cmp    $0x1,%ecx
 21f:   74 76                   je     297 <multiply1+0xb7>
 221:   41 89 f8                mov    %edi,%r8d
 224:   44 8d 34 f5 00 00 00    lea    0x0(,%rsi,8),%r14d
 22b:   00
 22c:   41 c1 f8 03             sar    $0x3,%r8d
 230:   41 83 f8 01             cmp    $0x1,%r8d
 234:   74 57                   je     28d <multiply1+0xad>
 236:   41 89 f9                mov    %edi,%r9d
 239:   41 89 f7                mov    %esi,%r15d
 23c:   41 c1 f9 04             sar    $0x4,%r9d
 240:   41 c1 e7 04             shl    $0x4,%r15d
 244:   41 83 f9 01             cmp    $0x1,%r9d
 248:   44 89 0c 24             mov    %r9d,(%rsp)
 24c:   74 34                   je     282 <multiply1+0xa2>
 24e:   c1 e6 05                shl    $0x5,%esi
 251:   c1 ff 05                sar    $0x5,%edi
 254:   44 89 44 24 0c          mov    %r8d,0xc(%rsp)
 259:   89 4c 24 08             mov    %ecx,0x8(%rsp)
 25d:   89 54 24 04             mov    %edx,0x4(%rsp)
 261:   e8 00 00 00 00          callq  266 <multiply1+0x86>
 266:   44 8b 0c 24             mov    (%rsp),%r9d
 26a:   44 8b 44 24 0c          mov    0xc(%rsp),%r8d
 26f:   41 01 c7                add    %eax,%r15d
 272:   8b 4c 24 08             mov    0x8(%rsp),%ecx
 276:   8b 54 24 04             mov    0x4(%rsp),%edx
 27a:   41 83 e1 01             and    $0x1,%r9d
 27e:   44 0f 44 f8             cmove  %eax,%r15d
 282:   45 01 fe                add    %r15d,%r14d
 285:   41 83 e0 01             and    $0x1,%r8d
 289:   45 0f 44 f7             cmove  %r15d,%r14d
 28d:   45 01 f5                add    %r14d,%r13d
 290:   83 e1 01                and    $0x1,%ecx
 293:   45 0f 44 ee             cmove  %r14d,%r13d
 297:   45 01 ec                add    %r13d,%r12d
 29a:   83 e2 01                and    $0x1,%edx
 29d:   45 0f 44 e5             cmove  %r13d,%r12d
 2a1:   44 01 e3                add    %r12d,%ebx
 2a4:   83 e5 01                and    $0x1,%ebp
 2a7:   44 0f 45 e3             cmovne %ebx,%r12d
 2ab:   48 83 c4 18             add    $0x18,%rsp
 2af:   44 89 e0                mov    %r12d,%eax
 2b2:   5b                      pop    %rbx
 2b3:   5d                      pop    %rbp
 2b4:   41 5c                   pop    %r12
 2b6:   41 5d                   pop    %r13
 2b8:   41 5e                   pop    %r14
 2ba:   41 5f                   pop    %r15
 2bc:   c3                      retq   
 2bd:   0f 1f 00                nopl   (%rax)

00000000000002c0 <mult_acc0>:
 2c0:   83 fe 01                cmp    $0x1,%esi
 2c3:   89 d0                   mov    %edx,%eax
 2c5:   74 21                   je     2e8 <mult_acc0+0x28>
 2c7:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
 2ce:   00 00
 2d0:   89 f1                   mov    %esi,%ecx
 2d2:   8d 04 12                lea    (%rdx,%rdx,1),%eax
 2d5:   d1 fe                   sar    %esi
 2d7:   83 e1 01                and    $0x1,%ecx
 2da:   01 fa                   add    %edi,%edx
 2dc:   85 c9                   test   %ecx,%ecx
 2de:   0f 45 fa                cmovne %edx,%edi
 2e1:   83 fe 01                cmp    $0x1,%esi
 2e4:   89 c2                   mov    %eax,%edx
 2e6:   75 e8                   jne    2d0 <mult_acc0+0x10>
 2e8:   01 f8                   add    %edi,%eax
 2ea:   c3                      retq   
 2eb:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

00000000000002f0 <mult_acc1>:
 2f0:   eb 11                   jmp    303 <mult_acc1+0x13>
 2f2:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
 2f8:   40 f6 c6 01             test   $0x1,%sil
 2fc:   0f 45 f8                cmovne %eax,%edi
 2ff:   01 d2                   add    %edx,%edx
 301:   d1 fe                   sar    %esi
 303:   83 fe 01                cmp    $0x1,%esi
 306:   8d 04 17                lea    (%rdi,%rdx,1),%eax
 309:   75 ed                   jne    2f8 <mult_acc1+0x8>
 30b:   f3 c3                   repz retq
 30d:   0f 1f 00                nopl   (%rax)

0000000000000310 <mult_acc2>:
 310:   89 f8                   mov    %edi,%eax
 312:   eb 08                   jmp    31c <mult_acc2+0xc>
 314:   0f 1f 40 00             nopl   0x0(%rax)
 318:   01 d2                   add    %edx,%edx
 31a:   d1 fe                   sar    %esi
 31c:   40 f6 c6 01             test   $0x1,%sil
 320:   74 f6                   je     318 <mult_acc2+0x8>
 322:   01 d0                   add    %edx,%eax
 324:   83 fe 01                cmp    $0x1,%esi
 327:   75 ef                   jne    318 <mult_acc2+0x8>
 329:   f3 c3                   repz retq
 32b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

0000000000000330 <mult_acc3>:
 330:   89 f8                   mov    %edi,%eax
 332:   eb 08                   jmp    33c <mult_acc3+0xc>
 334:   0f 1f 40 00             nopl   0x0(%rax)
 338:   d1 fe                   sar    %esi
 33a:   01 d2                   add    %edx,%edx
 33c:   40 f6 c6 01             test   $0x1,%sil
 340:   74 f6                   je     338 <mult_acc3+0x8>
 342:   01 d0                   add    %edx,%eax
 344:   83 fe 01                cmp    $0x1,%esi
 347:   75 ef                   jne    338 <mult_acc3+0x8>
 349:   f3 c3                   repz retq
 34b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

0000000000000350 <mult_acc4>:
 350:   89 f8                   mov    %edi,%eax
 352:   eb 08                   jmp    35c <mult_acc4+0xc>
 354:   0f 1f 40 00             nopl   0x0(%rax)
 358:   d1 fe                   sar    %esi
 35a:   01 d2                   add    %edx,%edx
 35c:   40 f6 c6 01             test   $0x1,%sil
 360:   74 f6                   je     358 <mult_acc4+0x8>
 362:   01 d0                   add    %edx,%eax
 364:   83 fe 01                cmp    $0x1,%esi
 367:   75 ef                   jne    358 <mult_acc4+0x8>
 369:   f3 c3                   repz retq
 36b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

0000000000000370 <mult_acc5>:
 370:   89 f8                   mov    %edi,%eax
 372:   eb 08                   jmp    37c <mult_acc5+0xc>
 374:   0f 1f 40 00             nopl   0x0(%rax)
 378:   01 d2                   add    %edx,%edx
 37a:   d1 fe                   sar    %esi
 37c:   89 f1                   mov    %esi,%ecx
 37e:   83 e1 01                and    $0x1,%ecx
 381:   0f 45 ca                cmovne %edx,%ecx
 384:   01 c8                   add    %ecx,%eax
 386:   83 fe 01                cmp    $0x1,%esi
 389:   7f ed                   jg     378 <mult_acc5+0x8>
 38b:   f3 c3                   repz retq

Disassembly of section .text.startup:

0000000000000000 <main>:
   0:   41 57                   push   %r15
   2:   41 56                   push   %r14
   4:   ba 0a 00 00 00          mov    $0xa,%edx
   9:   41 55                   push   %r13
   b:   41 54                   push   %r12
   d:   55                      push   %rbp
   e:   53                      push   %rbx
   f:   48 89 f3                mov    %rsi,%rbx
  12:   48 83 ec 58             sub    $0x58,%rsp
  16:   48 8b 7e 08             mov    0x8(%rsi),%rdi
  1a:   31 f6                   xor    %esi,%esi
  1c:   48 c7 04 24 00 00 00    movq   $0x0,(%rsp)
  23:   00
  24:   48 c7 44 24 08 00 00    movq   $0x0,0x8(%rsp)
  2b:   00 00
  2d:   48 c7 44 24 10 00 00    movq   $0x0,0x10(%rsp)
  34:   00 00
  36:   48 c7 44 24 18 00 00    movq   $0x0,0x18(%rsp)
  3d:   00 00
  3f:   48 c7 44 24 20 00 00    movq   $0x0,0x20(%rsp)
  46:   00 00
  48:   48 c7 44 24 28 00 00    movq   $0x0,0x28(%rsp)
  4f:   00 00
  51:   48 c7 44 24 30 00 00    movq   $0x0,0x30(%rsp)
  58:   00 00
  5a:   48 c7 44 24 38 00 00    movq   $0x0,0x38(%rsp)
  61:   00 00
  63:   48 c7 44 24 40 00 00    movq   $0x0,0x40(%rsp)
  6a:   00 00
  6c:   e8 00 00 00 00          callq  71 <main+0x71>
  71:   48 8b 7b 10             mov    0x10(%rbx),%rdi
  75:   49 89 c6                mov    %rax,%r14
  78:   ba 0a 00 00 00          mov    $0xa,%edx
  7d:   31 f6                   xor    %esi,%esi
  7f:   e8 00 00 00 00          callq  84 <main+0x84>
  84:   45 85 f6                test   %r14d,%r14d
  87:   8d 50 ff                lea    -0x1(%rax),%edx
  8a:   7e 64                   jle    f0 <main+0xf0>
  8c:   48 63 d2                movslq %edx,%rdx
  8f:   45 31 ff                xor    %r15d,%r15d
  92:   4c 8b 2c d4             mov    (%rsp,%rdx,8),%r13
  96:   41 bc 01 00 00 00       mov    $0x1,%r12d
  9c:   0f 1f 40 00             nopl   0x0(%rax)
  a0:   44 89 e5                mov    %r12d,%ebp
  a3:   bb 01 00 00 00          mov    $0x1,%ebx
  a8:   eb 14                   jmp    be <main+0xbe>
  aa:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
  b0:   83 c3 01                add    $0x1,%ebx
  b3:   44 01 e5                add    %r12d,%ebp
  b6:   81 fb 01 01 00 00       cmp    $0x101,%ebx
  bc:   74 1c                   je     da <main+0xda>
  be:   89 de                   mov    %ebx,%esi
  c0:   44 89 e7                mov    %r12d,%edi
  c3:   41 ff d5                callq  *%r13
  c6:   39 c5                   cmp    %eax,%ebp
  c8:   74 e6                   je     b0 <main+0xb0>
  ca:   89 c6                   mov    %eax,%esi
  cc:   89 ea                   mov    %ebp,%edx
  ce:   bf 00 00 00 00          mov    $0x0,%edi
  d3:   31 c0                   xor    %eax,%eax
  d5:   e8 00 00 00 00          callq  da <main+0xda>
  da:   41 83 c4 01             add    $0x1,%r12d
  de:   41 81 fc 01 01 00 00    cmp    $0x101,%r12d
  e5:   75 b9                   jne    a0 <main+0xa0>
  e7:   41 83 c7 01             add    $0x1,%r15d
  eb:   45 39 f7                cmp    %r14d,%r15d
  ee:   75 a6                   jne    96 <main+0x96>
  f0:   48 83 c4 58             add    $0x58,%rsp
  f4:   31 c0                   xor    %eax,%eax
  f6:   5b                      pop    %rbx
  f7:   5d                      pop    %rbp
  f8:   41 5c                   pop    %r12
  fa:   41 5d                   pop    %r13
  fc:   41 5e                   pop    %r14
  fe:   41 5f                   pop    %r15
 100:   c3                      retq   
morrisma@morrisma5 ~/Development/CProjects/Stepanov $

_________________
Michael A.


Mon Sep 04, 2017 8:31 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1806
In the land of GCC there are also
-march
-mtune
See perhaps https://stackoverflow.com/questions/106 ... chitecture


Tue Sep 05, 2017 1:14 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1806
Also well worth a mention is Matt Godbolt's GCC Explorer, which allows you to explore
- different versions of GCC
- GCC for x86 or for ARM
- any different command line options
and to do that for any short sequence of C code you care to paste in, from the comfort of your browser.
https://godbolt.org/

Oh, in a variety of source languages, and also to share links to interesting findings.


Tue Sep 05, 2017 6:12 pm

Joined: Tue Aug 08, 2017 1:33 pm
Posts: 11
Just to muddle the waters a bit: I implemented this algorithm in Common Lisp, using (optional) declarations to induce the compiler (SBCL, in this case) to use fixnum arithmetic. This implementation took about 5.5 seconds (5000 iterations), which compares fairly well with the 3.16 seconds (I think) that I got from multiply3 with clang -O3.


Tue Sep 05, 2017 8:09 pm

Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
Big Ed:

Very cool link. Bookmarked it.

_________________
Michael A.


Wed Sep 06, 2017 2:01 am
 [ 11 posts ] 

Who is online

Users browsing this forum: No registered users and 0 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Jump to:  
Powered by phpBB® Forum Software © phpBB Group
Designed by ST Software