Perils of Optimization On Modern High-Performance Processors
MichaelM
Joined: Wed Apr 24, 2013 9:40 pm Posts: 213 Location: Huntsville, AL
Recently, I've been studying the book "From Mathematics to Generic Programming" by Alexander A. Stepanov and Daniel E. Rose. I took up the book to gain an understanding of the process by which Stepanov designed the C++ Standard Template Library. I find it well written and intellectually stimulating, and I particularly enjoy the authors' linkage of the algorithms described in the book to their mathematical foundations.

(Note: in the spirit of full disclosure of my personal biases on this subject, I am not a particular fan of canned algorithms such as those in the Standard Template Library, Matlab, Simulink, and similar toolsets, which, in my opinion, many software developers, engineers, and scientists use blindly.)

There is a mathematical basis for many of the algorithms commonly available in standard libraries. However, there appears to be a blind trust that these algorithms behave correctly under a wide range of conditions, an assumption generally unsupported by the documentation that accompanies many of these collections of algorithms. To be fair to the developers/suppliers of these libraries and toolsets, many software developers, engineers, and scientists do not critically test the libraries for suitability to their particular application domain. Perhaps the lack of curiosity is driven in part by a lack of confidence with regard to mathematics. Some algorithms rest on mathematics that is genuinely difficult for many, but many others have a mathematical basis that should be easily understood by anyone capable of learning and using a programming language.

The second chapter of the book covers the Egyptian/Russian Peasant multiplication algorithm, applicable to unsigned non-zero integers.
This is an algorithm that should be easily understood by anybody who has learned to program computers. I decided to work through the development of the authors' implementations of the algorithm in C. Therefore, I implemented each example the authors presented, and compiled and tested each "improved" version of the algorithm. After completing that exercise, I believed that I understood each of the authors' algorithms. After developing and testing each of the 8 algorithms, which progress from simple recursion to non-recursive looping, I decided to test their efficiency using a test program that exhaustively calculates all 65536 products of two arguments, each ranging from 1 to 256.

Interestingly, I found that a recursive algorithm yielded the best time, rather than the non-recursive looping implementation. Throughout the elaboration of the various algorithms, the authors clearly emphasize the costs of testing variables and calling subroutines; they want the reader to pick up on the processing costs associated with testing variables and recursively calling the core multiplication subroutine. With respect to recursion (Algorithms 1-5), the authors build up the algorithmic progression until Algorithm 5 is implemented in strict tail-recursive form. I've used recursion in the past, but it is not a natural way for me to develop algorithms, and until reading the definition of "strictly tail recursive" the authors provide in the book, I had never followed through on determining the conditions needed to call an implementation tail recursive. For the non-recursive looping algorithms (Algorithms 6-8), the authors work through the development with the objective of reducing the number of tests and loop iterations required to form the desired product.

(Note: in the multiplication algorithm being considered, the multiplier should be less than the multiplicand.
Examination of the multiplier is used to develop the partial products which are subsequently added together to form the product. To minimize the number of recursions/loops, the multiplier should therefore be the lesser of the two numbers. The algorithms as implemented do not ensure that the multiplier is less than the multiplicand. My testing is performed in such a manner that for roughly half of the products computed the multiplier is less than the multiplicand, and greater than or equal to it for the remaining products.)

The following table shows the results I obtained for the 8 algorithms implemented for this multiplication technique. I used Eclipse and GCC on a 64-bit Linux Mint platform powered by a 1 GHz AMD E1 processor with 4 GB of RAM. My test program consists of three nested loops: n, an iteration loop to ensure a sufficiently long test time; k, the multiplier loop; and l, the multiplicand loop. The multiplier and multiplicand loops iterate from 1 to 256 to generate 65536 products, and the iteration loop repeats the inner two loops. The product generated by each of the 8 implementations is tested against the product of k and l; if there is an error, an error message is printed and a break is taken. I use the argument vector to input the number of iterations and the algorithm number, which I use as an index into an array of function pointers. Each algorithm is timed using the Linux time utility: $ time test_pgm n_iterations algorithm_index.

Code:
Routine_Nm, Dbg/-O3, Rls/-Os, Rls/-O3, Rls/-O2, Rls/-O1, Rls/-O0
Multiply_1,   7.234,   9.585,   7.389,   9.904,  10.778,  17.481
Multiply_2,   4.975,   7.063,   4.988,   8.005,  15.532,  17.464
Multiply_3,   4.725,   6.218,   4.721,   4.711,  11.791,  17.492
Multiply_4,   6.633,   6.236,   6.625,   6.687,  12.668,  17.995
Multiply_5,   4.286,   6.065,   4.283,   4.287,  14.193,  17.850
Multiply_6,   5.939,   6.961,   5.930,   5.935,   7.199,   9.832
Multiply_7,   4.481,   6.667,   4.481,   4.481,   7.178,   9.755
Multiply_8,   5.016,   6.192,   5.013,   5.015,   6.709,   9.001
Across the top of the table are the optimizations employed: Dbg signifies the Debug build and Rls the Release build from the GCC compiler. I captured results for all 5 optimization levels that can be selected using the -Ox option. At optimization levels -O2 and -O3, the strictly tail-recursive Algorithm 5 provides the best times. Without optimization (-O0) or with minimal optimization (-O1), the non-recursive looping algorithms (6-8) handily beat the recursive algorithms (1-5).

There was some variation in the measurements made with the time utility, so I captured the following data using the perf utility, which provides more consistent timing data by using the performance counters inside the processor. I captured data at the Rls/-O3 optimization level for all 8 algorithms. Interestingly, there is wide variation in the effective number of instructions per cycle, the total number of branches, and the number of missed branches. Algorithm 5 is still the fastest, but its instructions-per-cycle figure is not the highest, nor does it have the fewest branches or branch misses.

Code:
morrisma@morrisma5 ~/Development/CProjects/Stepanov $ perf stat Release/Stepanov 2000 1
Performance counter stats for 'Release/Stepanov 2000 1':
       7280.156042 task-clock (msec)        #  0.995 CPUs utilized
               677 context-switches         #  0.093 K/sec
                28 cpu-migrations           #  0.004 K/sec
                52 page-faults              #  0.007 K/sec
     7,238,360,838 cycles                   #  0.994 GHz [50.01%]
                 0 stalled-cycles-frontend  #  0.00% frontend cycles idle [50.05%]
                 0 stalled-cycles-backend   #  0.00% backend cycles idle [50.21%]
    13,485,495,935 instructions             #  1.86 insns per cycle [50.04%]
     1,680,883,125 branches                 #  230.886 M/sec [49.97%]
           861,832 branch-misses            #  0.05% of all branches [49.87%]
7.317273564 seconds time elapsed
morrisma@morrisma5 ~/Development/CProjects/Stepanov $ perf stat Release/Stepanov 2000 2
Performance counter stats for 'Release/Stepanov 2000 2':
       5015.594286 task-clock (msec)        #  0.994 CPUs utilized
               478 context-switches         #  0.095 K/sec
                13 cpu-migrations           #  0.003 K/sec
                53 page-faults              #  0.011 K/sec
     4,979,463,334 cycles                   #  0.993 GHz [49.96%]
                 0 stalled-cycles-frontend  #  0.00% frontend cycles idle [50.03%]
                 0 stalled-cycles-backend   #  0.00% backend cycles idle [50.18%]
     9,100,716,091 instructions             #  1.83 insns per cycle [50.14%]
     1,450,183,464 branches                 #  289.135 M/sec [50.03%]
           848,119 branch-misses            #  0.06% of all branches [49.84%]
5.045531490 seconds time elapsed
morrisma@morrisma5 ~/Development/CProjects/Stepanov $ perf stat Release/Stepanov 2000 3
Performance counter stats for 'Release/Stepanov 2000 3':
       4666.820274 task-clock (msec)        #  0.994 CPUs utilized
               594 context-switches         #  0.127 K/sec
                 5 cpu-migrations           #  0.001 K/sec
                53 page-faults              #  0.011 K/sec
     4,644,182,243 cycles                   #  0.995 GHz [50.08%]
                 0 stalled-cycles-frontend  #  0.00% frontend cycles idle [50.00%]
                 0 stalled-cycles-backend   #  0.00% backend cycles idle [49.97%]
     7,500,004,657 instructions             #  1.61 insns per cycle [49.99%]
     1,449,465,718 branches                 #  310.590 M/sec [50.03%]
           807,622 branch-misses            #  0.06% of all branches [50.15%]
4.696254481 seconds time elapsed
morrisma@morrisma5 ~/Development/CProjects/Stepanov $ perf stat Release/Stepanov 2000 4
Performance counter stats for 'Release/Stepanov 2000 4':
       6437.925075 task-clock (msec)        #  0.989 CPUs utilized
               673 context-switches         #  0.105 K/sec
                25 cpu-migrations           #  0.004 K/sec
                52 page-faults              #  0.008 K/sec
     6,405,362,667 cycles                   #  0.995 GHz [49.99%]
                 0 stalled-cycles-frontend  #  0.00% frontend cycles idle [50.03%]
                 0 stalled-cycles-backend   #  0.00% backend cycles idle [50.07%]
     6,836,760,873 instructions             #  1.07 insns per cycle [50.11%]
     2,232,507,206 branches                 #  346.774 M/sec [50.03%]
         8,277,203 branch-misses            #  0.37% of all branches [49.96%]
6.508887512 seconds time elapsed
morrisma@morrisma5 ~/Development/CProjects/Stepanov $ perf stat Release/Stepanov 2000 5
Performance counter stats for 'Release/Stepanov 2000 5':
       4329.726333 task-clock (msec)        #  0.996 CPUs utilized
               386 context-switches         #  0.089 K/sec
                15 cpu-migrations           #  0.003 K/sec
                51 page-faults              #  0.012 K/sec
     4,306,586,028 cycles                   #  0.995 GHz [49.95%]
                 0 stalled-cycles-frontend  #  0.00% frontend cycles idle [49.96%]
                 0 stalled-cycles-backend   #  0.00% backend cycles idle [50.15%]
     6,839,918,869 instructions             #  1.59 insns per cycle [50.10%]
     2,235,540,097 branches                 #  516.324 M/sec [50.04%]
         9,797,084 branch-misses            #  0.44% of all branches [49.95%]
4.345760144 seconds time elapsed
morrisma@morrisma5 ~/Development/CProjects/Stepanov $ perf stat Release/Stepanov 2000 6
Performance counter stats for 'Release/Stepanov 2000 6':
       5819.323941 task-clock (msec)        #  0.994 CPUs utilized
               531 context-switches         #  0.091 K/sec
                25 cpu-migrations           #  0.004 K/sec
                55 page-faults              #  0.009 K/sec
     5,790,127,256 cycles                   #  0.995 GHz [49.91%]
                 0 stalled-cycles-frontend  #  0.00% frontend cycles idle [49.96%]
                 0 stalled-cycles-backend   #  0.00% backend cycles idle [50.08%]
     7,097,873,094 instructions             #  1.23 insns per cycle [50.11%]
     2,234,531,073 branches                 #  383.985 M/sec [50.14%]
         7,263,670 branch-misses            #  0.33% of all branches [49.98%]
5.851833687 seconds time elapsed
morrisma@morrisma5 ~/Development/CProjects/Stepanov $ perf stat Release/Stepanov 2000 7
Performance counter stats for 'Release/Stepanov 2000 7':
       4518.088795 task-clock (msec)        #  0.969 CPUs utilized
               521 context-switches         #  0.115 K/sec
                 9 cpu-migrations           #  0.002 K/sec
                54 page-faults              #  0.012 K/sec
     4,493,347,007 cycles                   #  0.995 GHz [50.03%]
                 0 stalled-cycles-frontend  #  0.00% frontend cycles idle [50.03%]
                 0 stalled-cycles-backend   #  0.00% backend cycles idle [50.00%]
     7,025,496,576 instructions             #  1.56 insns per cycle [49.97%]
     2,226,158,240 branches                 #  492.721 M/sec [50.02%]
        12,654,222 branch-misses            #  0.57% of all branches [50.03%]
4.664287850 seconds time elapsed
morrisma@morrisma5 ~/Development/CProjects/Stepanov $ perf stat Release/Stepanov 2000 8
Performance counter stats for 'Release/Stepanov 2000 8':
       5037.158903 task-clock (msec)        #  0.996 CPUs utilized
               466 context-switches         #  0.093 K/sec
                18 cpu-migrations           #  0.004 K/sec
                53 page-faults              #  0.011 K/sec
     5,012,721,116 cycles                   #  0.995 GHz [50.09%]
                 0 stalled-cycles-frontend  #  0.00% frontend cycles idle [50.09%]
                 0 stalled-cycles-backend   #  0.00% backend cycles idle [50.10%]
     6,643,054,184 instructions             #  1.33 insns per cycle [49.91%]
     2,097,877,713 branches                 #  416.480 M/sec [50.00%]
         5,236,597 branch-misses            #  0.25% of all branches [49.95%]
5.058676239 seconds time elapsed
morrisma@morrisma5 ~/Development/CProjects/Stepanov $
I'm not exactly sure what these results mean for a different application; my objective was simply to time the various algorithms. The implication in the book is that each succeeding algorithm (higher number) is a better implementation than the previous one. From the results presented above, that is clearly not the case. It appears to me that the more the program's construction allows the optimizer to extract instruction-level parallelism (ILP), the faster the program will be. I don't know exactly how to interpret the performance effects of the branch prediction logic.

The code used in this test is included below. With these results in mind, I will attempt to build and test a version of Algorithm 5 using the MiniCPU instruction set. The next chapter in the book deals with the prime number sieve algorithm ascribed to Eratosthenes. I am looking forward to developing and testing the algorithms developed by Stepanov and Rose for this ancient algorithm, which is often used as a processor/language benchmark.

Code:
/*
 ============================================================================
 Name        : Stepanov.c
 Author      : Michael A. Morris
 Version     :
 Copyright   : Copyright 2017 - GPLv3
 Description : Various Algorithms for Egyptian/Russian Peasant
               Multiplication in C, Ansi-style
 ============================================================================
 */
#include <stdio.h>
#include <stdlib.h>

#define half(x) ((x) >> 1)
#define odd(x)  ((x) & 1)

int multiply1(int, int);
int multiply2(int, int); int mult_acc0(int, int, int);
int multiply3(int, int); int mult_acc1(int, int, int);
int multiply4(int, int); int mult_acc2(int, int, int);
int multiply5(int, int); int mult_acc3(int, int, int);
int multiply6(int, int); int mult_acc4(int, int, int);
int multiply7(int, int); int multiply8(int, int);

typedef int (*MFP_t)(int, int);

int main(int argc, char *argv[])
{
    int i, j, k, l;
    int n, prod;
    MFP_t mfp[8] = {
        &multiply1, &multiply2, &multiply3, &multiply4,
        &multiply5, &multiply6, &multiply7, &multiply8
    };

    i = atoi(argv[1]);
    j = atoi(argv[2]) - 1;

    for(n = 0; n < i; n++) {
        //printf("\n********************* n=%5d **************************\n\n", n);
        for(k = 1; k <= 256; k++) {
            //printf("k=%d\n", k);
            for(l = 1; l <= 256; l++) {
                prod = mfp[j](k, l);
                if(prod != (k * l)) {
                    printf("error: %d; %d\n", prod, k*l);
                    break;
                }
            }
        }
    }

    return EXIT_SUCCESS;
}
/*
 * int multiply1(int n, int a) - Multiply two unsigned integers using the
 * Egyptian multiplication algorithm.
 */

int multiply1(int n, int a)
{
    if(1 == n) return(a);
    int result = multiply1(half(n), a << 1);
    if(odd(n)) return(result + a);
    return(result);
}

int multiply2(int n, int a)
{
    if(1 == n) return(a);
    return(mult_acc0(0, n, a));
}

int mult_acc0(int r, int n, int a)
{
    if(1 == n) return(r + a);
    if(odd(n)) {
        return(mult_acc0(r + a, half(n), a << 1));
    } else {
        return(mult_acc0(r,     half(n), a << 1));
    }
}

int multiply3(int n, int a)
{
    if(1 == n) return(a);
    return(mult_acc1(0, n, a));
}

int mult_acc1(int r, int n, int a)
{
    if(1 == n) return(r + a);
    if(odd(n)) r += a;
    return(mult_acc1(r, half(n), a << 1));
}

int multiply4(int n, int a)
{
    if(1 == n) return(a);
    return(mult_acc2(0, n, a));
}

int mult_acc2(int r, int n, int a)
{
    if(odd(n)) {
        r += a;
        if(1 == n) return(r);
    }
    return(mult_acc2(r, half(n), a << 1));
}

int multiply5(int n, int a)
{
    if(1 == n) return(a);
    return(mult_acc3(0, n, a));
}

int mult_acc3(int r, int n, int a)
{
    if(odd(n)) {
        r += a;
        if(1 == n) return(r);
    }
    n = half(n);
    a = a << 1;
    return(mult_acc3(r, n, a));
}

int multiply6(int n, int a)
{
    if(1 == n) return(a);
    return(mult_acc4(a, n - 1, a));
}

int mult_acc4(int r, int n, int a)
{
    while(1) {
        if(odd(n)) {
            r += a;
            if(1 == n) return(r);
        }
        n = half(n);
        a = a << 1;
    }
}

int multiply7(int n, int a)
{
    while(!odd(n)) {
        a = a << 1;
        n = half(n);
    }
    if(1 == n) return(a);
    return(mult_acc4(a, n - 1, a));
}

int multiply8(int n, int a)
{
    while(!odd(n)) {
        a = a << 1;
        n = half(n);
    }
    if(1 == n) return(a);
    return(mult_acc4(a, half(n - 1), a << 1));
}
_________________ Michael A.
Sun Sep 03, 2017 11:46 pm
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1806
Interesting! A very thorough job of doing the exercises, measuring, and then having a think. I'm guilty of almost always skipping exercises, which is very self-limiting.

I was going to see if I could add anything by colouring - first colouring according to the best/worst per row, which looks at the compiler, and then according to the best/worst in each column, which looks at the execution. But it seems I can't add colour to code blocks. Still, I think even writing that paragraph shows that we've got three things going on:
- the algorithm, in the abstract
- the compiler's competence in building a code sequence
- the CPU's competence in executing that code sequence

As your thread title indicates, the modern CPU is something of a beast in forging ahead: speculatively executing through conditional branches, and making light work of jumps, calls, and returns. Let's have a look without the benefit of colour:

MichaelM wrote:
Code:
Routine_Nm, Dbg/-O3, Rls/-Os, Rls/-O3, Rls/-O2, Rls/-O1, Rls/-O0
Multiply_1,   7.234,   9.585,   7.389,   9.904,  10.778,  17.481
Multiply_2,   4.975,   7.063,   4.988,   8.005,  15.532,  17.464
Multiply_3,   4.725,   6.218,   4.721,   4.711,  11.791,  17.492
Multiply_4,   6.633,   6.236,   6.625,   6.687,  12.668,  17.995
Multiply_5,   4.286,   6.065,   4.283,   4.287,  14.193,  17.850
Multiply_6,   5.939,   6.961,   5.930,   5.935,   7.199,   9.832
Multiply_7,   4.481,   6.667,   4.481,   4.481,   7.178,   9.755
Multiply_8,   5.016,   6.192,   5.013,   5.015,   6.709,   9.001
We can see from the final column, where the compiler isn't being clever, that the 6, 7 and 8 versions are a lot "better" - but we can't tell if it's the algorithm or the CPU. In an effort to try to gain some insight, I've cropped down the 'perf' results to see only what I think might be the most informative stats, noting that these are for the O3 compilation and therefore the compiler has already tried to squeeze out as much as it can from the source code:

MichaelM wrote:
Code:
Performance counter stats for 'Release/Stepanov 2000 1':
     7,238,360,838 cycles         #   0.994 GHz [50.01%]
    13,485,495,935 instructions   #   1.86 insns per cycle [50.04%]
     1,680,883,125 branches       # 230.886 M/sec [49.97%]
           861,832 branch-misses  #   0.05% of all branches [49.87%]

Performance counter stats for 'Release/Stepanov 2000 2':
     4,979,463,334 cycles         #   0.993 GHz [49.96%]
     9,100,716,091 instructions   #   1.83 insns per cycle [50.14%]
     1,450,183,464 branches       # 289.135 M/sec [50.03%]
           848,119 branch-misses  #   0.06% of all branches [49.84%]

Performance counter stats for 'Release/Stepanov 2000 3':
     4,644,182,243 cycles         #   0.995 GHz [50.08%]
     7,500,004,657 instructions   #   1.61 insns per cycle [49.99%]
     1,449,465,718 branches       # 310.590 M/sec [50.03%]
           807,622 branch-misses  #   0.06% of all branches [50.15%]

Performance counter stats for 'Release/Stepanov 2000 4':
     6,405,362,667 cycles         #   0.995 GHz [49.99%]
     6,836,760,873 instructions   #   1.07 insns per cycle [50.11%]
     2,232,507,206 branches       # 346.774 M/sec [50.03%]
         8,277,203 branch-misses  #   0.37% of all branches [49.96%]

Performance counter stats for 'Release/Stepanov 2000 5':
     4,306,586,028 cycles         #   0.995 GHz [49.95%]
     6,839,918,869 instructions   #   1.59 insns per cycle [50.10%]
     2,235,540,097 branches       # 516.324 M/sec [50.04%]
         9,797,084 branch-misses  #   0.44% of all branches [49.95%]

Performance counter stats for 'Release/Stepanov 2000 6':
     5,790,127,256 cycles         #   0.995 GHz [49.91%]
     7,097,873,094 instructions   #   1.23 insns per cycle [50.11%]
     2,234,531,073 branches       # 383.985 M/sec [50.14%]
         7,263,670 branch-misses  #   0.33% of all branches [49.98%]

Performance counter stats for 'Release/Stepanov 2000 7':
     4,493,347,007 cycles         #   0.995 GHz [50.03%]
     7,025,496,576 instructions   #   1.56 insns per cycle [49.97%]
     2,226,158,240 branches       # 492.721 M/sec [50.02%]
        12,654,222 branch-misses  #   0.57% of all branches [50.03%]

Performance counter stats for 'Release/Stepanov 2000 8':
     5,012,721,116 cycles         #   0.995 GHz [50.09%]
     6,643,054,184 instructions   #   1.33 insns per cycle [49.91%]
     2,097,877,713 branches       # 416.480 M/sec [50.00%]
         5,236,597 branch-misses  #   0.25% of all branches [49.95%]
We do at least see these points, I think:
- the number of executed instructions goes down at first, from 1 to 2 to 3, and then levels off for the rest of the pack;
- the final code version, 8, executes the fewest branches of the 4-8 group;
- and 8 misses the fewest in that group, too - which means its branches are the most predictable.

I'd like to be able to say something useful about the instructions-per-cycle figure, but I'm not sure if I can.
Mon Sep 04, 2017 3:53 am
rwiker
Joined: Tue Aug 08, 2017 1:33 pm Posts: 11
I just tested this on a Linux box (dual Xeon E5630, Ubuntu Mate 16.04, clang), and found version 3 to be the fastest with -O3. At first, I didn't even notice version 3 (since I was focussing on version 5), and so wrote this:

Code:
int multiply9(int n, int a)
{
    return mult_acc5(0, n, a);
}

int mult_acc5(int r, int n, int a)
{
    r += (odd(n) ? a : 0);
    return ((n > 1) ? mult_acc5(r, n >> 1, a << 1) : r);
}

which gives identical runtime to version 3 on my setup. The tail-recursive implementations are turned into iterative code at sufficiently high optimization levels (basically, turning the recursive calls into gotos to a point near the start of the function), so there is effectively no function-call overhead in this situation. Disassembling the code should be interesting, too.
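To make the "recursive calls become gotos" transformation concrete, here is a hand-written iterative equivalent of mult_acc5 - my own sketch of the loop the optimizer effectively produces, not actual compiler output - reusing the half/odd macros from the original source and assuming n >= 1:

```c
#define half(x) ((x) >> 1)
#define odd(x)  ((x) & 1)

/* Iterative form of the tail-recursive mult_acc5: the recursive call is
 * replaced by updating r, n, a in place and branching back to the top,
 * which is essentially what the compiler emits at -O3. Assumes n >= 1. */
int mult_acc5_iter(int r, int n, int a)
{
    for(;;) {
        r += (odd(n) ? a : 0);
        if(n <= 1) return r;
        n = half(n);
        a <<= 1;
    }
}
```

No stack frames are created past the first, which is why the tail-recursive and looping versions end up so close in the timings.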
Mon Sep 04, 2017 8:25 am
MichaelM
Joined: Wed Apr 24, 2013 9:40 pm Posts: 213 Location: Huntsville, AL
Very nice, compact realization. I tried several construction changes like those rwiker includes in multiply9. With GCC, I did not get performance improvements using the conditional operator (?:) construction, but I did see a minor performance improvement of a few percent using a+a instead of a<<1.

I don't have ready access to different x86 processor variants, but it may be interesting to see a comparison of results from several variants using multiply3, multiply5, and multiply9, while varying the compiler and optimization level. I suspect that, even for a simple benchmark like these multiply routines, we should see performance variations that can be attributed to the architectural differences between the processors. AMD and Intel processors, although they implement compatible ISAs, have significant differences in micro-architecture: different caching policies, different instruction decoding logic, different pipeline depths and ILP strategies, etc. I suspect that, on average, the compiler code generators and optimization routines produce code that is a good match to the x86 processors at some level.

I copied rwiker's code and ran it in my environment. The results are posted below:

Code:
morrisma@morrisma5 ~/Development/CProjects/Stepanov $ perf stat Release/Stepanov 2000 9
Performance counter stats for 'Release/Stepanov 2000 9':
       4677.083417 task-clock (msec)        #  0.998 CPUs utilized
               411 context-switches         #  0.088 K/sec
                 8 cpu-migrations           #  0.002 K/sec
                51 page-faults              #  0.011 K/sec
     4,653,583,491 cycles                   #  0.995 GHz [50.04%]
                 0 stalled-cycles-frontend  #  0.00% frontend cycles idle [50.08%]
                 0 stalled-cycles-backend   #  0.00% backend cycles idle [50.07%]
     8,707,304,550 instructions             #  1.87 insns per cycle [49.98%]
     1,578,000,951 branches                 #  337.390 M/sec [50.03%]
           732,542 branch-misses            #  0.05% of all branches [50.00%]
4.687538150 seconds time elapsed
morrisma@morrisma5 ~/Development/CProjects/Stepanov $
In my environment, the code improves on the performance given by my tests for multiply3 at the -O3 optimization level, but it doesn't perform better than multiply5. Perhaps the difference in the results can be attributed to the micro-architectural differences between the Intel Xeon E5630 and the AMD E1 processors.
_________________ Michael A.
Mon Sep 04, 2017 1:14 pm
rwiker
Joined: Tue Aug 08, 2017 1:33 pm Posts: 11
It's quite likely that both architectural differences and compiler details affect the performance of the different implementations. I compiled the code using clang (-O3) on my MacBook, and disassembled the code in the debugger:

Code:
.../stepanov {703} $ lldb stepanov
(lldb) target create "stepanov"
Current executable set to 'stepanov' (x86_64).
(lldb) disass -n multiply9
stepanov`multiply9:
stepanov[0x100000e00] <+0>:  pushq   %rbp
stepanov[0x100000e01] <+1>:  movq    %rsp, %rbp
stepanov[0x100000e04] <+4>:  movl    %edi, %eax
stepanov[0x100000e06] <+6>:  andl    $0x1, %eax
stepanov[0x100000e09] <+9>:  cmovnel %esi, %eax
stepanov[0x100000e0c] <+12>: cmpl    $0x2, %edi
stepanov[0x100000e0f] <+15>: jl      0x100000e33 ; <+51>
stepanov[0x100000e11] <+17>: nopw    %cs:(%rax,%rax)
stepanov[0x100000e20] <+32>: sarl    %edi
stepanov[0x100000e22] <+34>: addl    %esi, %esi
stepanov[0x100000e24] <+36>: movl    %edi, %ecx
stepanov[0x100000e26] <+38>: andl    $0x1, %ecx
stepanov[0x100000e29] <+41>: cmovnel %esi, %ecx
stepanov[0x100000e2c] <+44>: addl    %ecx, %eax
stepanov[0x100000e2e] <+46>: cmpl    $0x1, %edi
stepanov[0x100000e31] <+49>: jg      0x100000e20 ; <+32>
stepanov[0x100000e33] <+51>: popq    %rbp
stepanov[0x100000e34] <+52>: retq
stepanov[0x100000e35] <+53>: nopw    %cs:(%rax,%rax)

(lldb) disass -n mult_acc5
stepanov`mult_acc5:
stepanov[0x100000f30] <+0>:  pushq   %rbp
stepanov[0x100000f31] <+1>:  movq    %rsp, %rbp
stepanov[0x100000f34] <+4>:  movl    %esi, %eax
stepanov[0x100000f36] <+6>:  andl    $0x1, %eax
stepanov[0x100000f39] <+9>:  cmovnel %edx, %eax
stepanov[0x100000f3c] <+12>: addl    %edi, %eax
stepanov[0x100000f3e] <+14>: cmpl    $0x2, %esi
stepanov[0x100000f41] <+17>: jl      0x100000f63 ; <+51>
stepanov[0x100000f43] <+19>: nopw    %cs:(%rax,%rax)
stepanov[0x100000f50] <+32>: sarl    %esi
stepanov[0x100000f52] <+34>: addl    %edx, %edx
stepanov[0x100000f54] <+36>: movl    %esi, %ecx
stepanov[0x100000f56] <+38>: andl    $0x1, %ecx
stepanov[0x100000f59] <+41>: cmovnel %edx, %ecx
stepanov[0x100000f5c] <+44>: addl    %ecx, %eax
stepanov[0x100000f5e] <+46>: cmpl    $0x1, %esi
stepanov[0x100000f61] <+49>: jg      0x100000f50 ; <+32>
stepanov[0x100000f63] <+51>: popq    %rbp
stepanov[0x100000f64] <+52>: retq
It seems to me that the compiler decides to inline mult_acc5 into multiply9, as well as making mult_acc5 available to callers outside the current file. I keep getting surprised by how good modern compilers can be (then again, I've been surprised by compilers since at least 1992)... It would be interesting to see how well GCC does on this.
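For anyone who wants to compare the out-of-line code path without moving mult_acc5 into a separate file, one option is the noinline function attribute, which both GCC and clang honor (a sketch - whether it reproduces the exact code shown above on any given compiler is an assumption on my part):

```c
#define odd(x) ((x) & 1)

/* Marking mult_acc5 noinline keeps it a separate function even at -O3,
 * so multiply9 should compile down to a bare tail call (jmp) to it
 * rather than absorbing the loop. */
__attribute__((noinline))
int mult_acc5(int r, int n, int a)
{
    r += (odd(n) ? a : 0);
    return ((n > 1) ? mult_acc5(r, n >> 1, a << 1) : r);
}

int multiply9(int n, int a)
{
    return mult_acc5(0, n, a);
}
```

Disassembling a build of this should show the call/jmp boundary that inlining normally erases.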
Mon Sep 04, 2017 5:21 pm
rwiker
Joined: Tue Aug 08, 2017 1:33 pm Posts: 11
If I move mult_acc5 into a separate source file, the disassembly of multiply9 looks like this:

Code:
(lldb) disass -n multiply9
stepanov`multiply9:
stepanov[0x100000e20] <+0>:  pushq %rbp
stepanov[0x100000e21] <+1>:  movq  %rsp, %rbp
stepanov[0x100000e24] <+4>:  movl  %esi, %eax
stepanov[0x100000e26] <+6>:  movl  %edi, %ecx
stepanov[0x100000e28] <+8>:  xorl  %edi, %edi
stepanov[0x100000e2a] <+10>: movl  %ecx, %esi
stepanov[0x100000e2c] <+12>: movl  %eax, %edx
stepanov[0x100000e2e] <+14>: popq  %rbp
stepanov[0x100000e2f] <+15>: jmp   0x100000f30 ; mult_acc5
stepanov[0x100000e34] <+20>: nopw  %cs:(%rax,%rax)
(lldb)
So, clang definitely inline-expanded mult_acc5 into multiply9 when both functions were in the same source file.
Mon Sep 04, 2017 5:27 pm
MichaelM
Joined: Wed Apr 24, 2013 9:40 pm Posts: 213 Location: Huntsville, AL
The following code represents the disassembly of my Stepanov object module. I am much more familiar with 8086/80186 assembly language than with that of the larger members of the ISA, but it looks to me as if the mult_accX functions have been inlined into the multiplyY functions. The transformations/optimizations that these compilers implement are rather amazing. I need to find and read an architecture description for these processors; my knowledge of the ISA is rather dated.

(Note: there is a very interesting discussion of the repz ret instruction that follows the je offset instruction continuing the loop/recursion in the multiply5 function. I can't say that the reason given for that specific construction applies to Intel platforms, but apparently it is a holdover in gcc to support the recommendations of AMD for the K8 processor. Another side note is that one recommendation by AMD for the early 64-bit devices was to limit the number of branches to 3 or fewer in a 16-byte block, which I interpret to include return instructions as well. Apparently, that processor linked branch prediction to the instruction cache and provided three branch predictions per 16-byte block.)

Code:
morrisma@morrisma5 ~/Development/CProjects/Stepanov $ objdump -d Release/src/Stepanov.o
Release/src/Stepanov.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <multiply2>:
   0:	31 c0                	xor    %eax,%eax
   2:	83 ff 01             	cmp    $0x1,%edi
   5:	74 29                	je     30 <multiply2+0x30>
   7:	66 0f 1f 84 00 00 00 	nopw   0x0(%rax,%rax,1)
   e:	00 00
  10:	89 f9                	mov    %edi,%ecx
  12:	8d 14 30             	lea    (%rax,%rsi,1),%edx
  15:	d1 ff                	sar    %edi
  17:	83 e1 01             	and    $0x1,%ecx
  1a:	85 c9                	test   %ecx,%ecx
  1c:	0f 45 c2             	cmovne %edx,%eax
  1f:	01 f6                	add    %esi,%esi
  21:	83 ff 01             	cmp    $0x1,%edi
  24:	75 ea                	jne    10 <multiply2+0x10>
  26:	01 f0                	add    %esi,%eax
  28:	c3                   	retq
  29:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
  30:	89 f0                	mov    %esi,%eax
  32:	c3                   	retq
  33:	66 66 66 66 2e 0f 1f 	data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
  3a:	84 00 00 00 00 00
0000000000000040 <multiply3>:
  40:	31 c0                	xor    %eax,%eax
  42:	83 ff 01             	cmp    $0x1,%edi
  45:	74 29                	je     70 <multiply3+0x30>
  47:	66 0f 1f 84 00 00 00 	nopw   0x0(%rax,%rax,1)
  4e:	00 00
  50:	8d 14 30             	lea    (%rax,%rsi,1),%edx
  53:	40 f6 c7 01          	test   $0x1,%dil
  57:	0f 45 c2             	cmovne %edx,%eax
  5a:	d1 ff                	sar    %edi
  5c:	01 f6                	add    %esi,%esi
  5e:	83 ff 01             	cmp    $0x1,%edi
  61:	75 ed                	jne    50 <multiply3+0x10>
  63:	01 f0                	add    %esi,%eax
  65:	c3                   	retq
  66:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
  6d:	00 00 00
  70:	89 f0                	mov    %esi,%eax
  72:	c3                   	retq
  73:	66 66 66 66 2e 0f 1f 	data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
  7a:	84 00 00 00 00 00
0000000000000080 <multiply4>:
  80:	83 ff 01             	cmp    $0x1,%edi
  83:	74 23                	je     a8 <multiply4+0x28>
  85:	31 c0                	xor    %eax,%eax
  87:	eb 0b                	jmp    94 <multiply4+0x14>
  89:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
  90:	01 f6                	add    %esi,%esi
  92:	d1 ff                	sar    %edi
  94:	40 f6 c7 01          	test   $0x1,%dil
  98:	74 f6                	je     90 <multiply4+0x10>
  9a:	01 f0                	add    %esi,%eax
  9c:	83 ff 01             	cmp    $0x1,%edi
  9f:	75 ef                	jne    90 <multiply4+0x10>
  a1:	f3 c3                	repz retq
  a3:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
  a8:	89 f0                	mov    %esi,%eax
  aa:	c3                   	retq
  ab:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
00000000000000b0 <multiply5>:
  b0:	83 ff 01             	cmp    $0x1,%edi
  b3:	74 23                	je     d8 <multiply5+0x28>
  b5:	31 c0                	xor    %eax,%eax
  b7:	eb 0b                	jmp    c4 <multiply5+0x14>
  b9:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
  c0:	d1 ff                	sar    %edi
  c2:	01 f6                	add    %esi,%esi
  c4:	40 f6 c7 01          	test   $0x1,%dil
  c8:	74 f6                	je     c0 <multiply5+0x10>
  ca:	01 f0                	add    %esi,%eax
  cc:	83 ff 01             	cmp    $0x1,%edi
  cf:	75 ef                	jne    c0 <multiply5+0x10>
  d1:	f3 c3                	repz retq
  d3:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
  d8:	89 f0                	mov    %esi,%eax
  da:	c3                   	retq
  db:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
00000000000000e0 <multiply6>:
  e0:	83 ff 01             	cmp    $0x1,%edi
  e3:	89 f0                	mov    %esi,%eax
  e5:	74 1a                	je     101 <multiply6+0x21>
  e7:	83 ef 01             	sub    $0x1,%edi
  ea:	89 f2                	mov    %esi,%edx
  ec:	eb 06                	jmp    f4 <multiply6+0x14>
  ee:	66 90                	xchg   %ax,%ax
  f0:	d1 ff                	sar    %edi
  f2:	01 d2                	add    %edx,%edx
  f4:	40 f6 c7 01          	test   $0x1,%dil
  f8:	74 f6                	je     f0 <multiply6+0x10>
  fa:	01 d0                	add    %edx,%eax
  fc:	83 ff 01             	cmp    $0x1,%edi
  ff:	75 ef                	jne    f0 <multiply6+0x10>
 101:	f3 c3                	repz retq
 103:	66 66 66 66 2e 0f 1f 	data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
 10a:	84 00 00 00 00 00
0000000000000110 <multiply7>:
 110:	40 f6 c7 01          	test   $0x1,%dil
 114:	89 f0                	mov    %esi,%eax
 116:	75 12                	jne    12a <multiply7+0x1a>
 118:	0f 1f 84 00 00 00 00 	nopl   0x0(%rax,%rax,1)
 11f:	00
 120:	d1 ff                	sar    %edi
 122:	01 c0                	add    %eax,%eax
 124:	40 f6 c7 01          	test   $0x1,%dil
 128:	74 f6                	je     120 <multiply7+0x10>
 12a:	83 ff 01             	cmp    $0x1,%edi
 12d:	74 22                	je     151 <multiply7+0x41>
 12f:	83 ef 01             	sub    $0x1,%edi
 132:	89 c2                	mov    %eax,%edx
 134:	eb 0e                	jmp    144 <multiply7+0x34>
 136:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
 13d:	00 00 00
 140:	d1 ff                	sar    %edi
 142:	01 d2                	add    %edx,%edx
 144:	40 f6 c7 01          	test   $0x1,%dil
 148:	74 f6                	je     140 <multiply7+0x30>
 14a:	01 d0                	add    %edx,%eax
 14c:	83 ff 01             	cmp    $0x1,%edi
 14f:	75 ef                	jne    140 <multiply7+0x30>
 151:	f3 c3                	repz retq
 153:	66 66 66 66 2e 0f 1f 	data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
 15a:	84 00 00 00 00 00
0000000000000160 <multiply8>: 160: 40 f6 c7 01 test $0x1,%dil 164: 89 f0 mov %esi,%eax 166: 75 12 jne 17a <multiply8+0x1a> 168: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1) 16f: 00 170: d1 ff sar %edi 172: 01 c0 add %eax,%eax 174: 40 f6 c7 01 test $0x1,%dil 178: 74 f6 je 170 <multiply8+0x10> 17a: 83 ff 01 cmp $0x1,%edi 17d: 74 22 je 1a1 <multiply8+0x41> 17f: 83 ef 01 sub $0x1,%edi 182: 8d 14 00 lea (%rax,%rax,1),%edx 185: d1 ff sar %edi 187: eb 0b jmp 194 <multiply8+0x34> 189: 0f 1f 80 00 00 00 00 nopl 0x0(%rax) 190: d1 ff sar %edi 192: 01 d2 add %edx,%edx 194: 40 f6 c7 01 test $0x1,%dil 198: 74 f6 je 190 <multiply8+0x30> 19a: 01 d0 add %edx,%eax 19c: 83 ff 01 cmp $0x1,%edi 19f: 75 ef jne 190 <multiply8+0x30> 1a1: f3 c3 repz retq 1a3: 66 66 66 66 2e 0f 1f data16 data16 data16 nopw %cs:0x0(%rax,%rax,1) 1aa: 84 00 00 00 00 00
00000000000001b0 <multiply9>: 1b0: 83 ff 01 cmp $0x1,%edi 1b3: 7e 23 jle 1d8 <multiply9+0x28> 1b5: 31 c0 xor %eax,%eax 1b7: eb 0b jmp 1c4 <multiply9+0x14> 1b9: 0f 1f 80 00 00 00 00 nopl 0x0(%rax) 1c0: 01 f6 add %esi,%esi 1c2: d1 ff sar %edi 1c4: 89 fa mov %edi,%edx 1c6: 83 e2 01 and $0x1,%edx 1c9: 0f 45 d6 cmovne %esi,%edx 1cc: 01 d0 add %edx,%eax 1ce: 83 ff 01 cmp $0x1,%edi 1d1: 75 ed jne 1c0 <multiply9+0x10> 1d3: f3 c3 repz retq 1d5: 0f 1f 00 nopl (%rax) 1d8: 89 f0 mov %esi,%eax 1da: c3 retq 1db: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
00000000000001e0 <multiply1>: 1e0: 41 57 push %r15 1e2: 41 56 push %r14 1e4: 41 55 push %r13 1e6: 41 54 push %r12 1e8: 41 89 f4 mov %esi,%r12d 1eb: 55 push %rbp 1ec: 53 push %rbx 1ed: 48 83 ec 18 sub $0x18,%rsp 1f1: 83 ff 01 cmp $0x1,%edi 1f4: 0f 84 b1 00 00 00 je 2ab <multiply1+0xcb> 1fa: 89 fa mov %edi,%edx 1fc: 89 f3 mov %esi,%ebx 1fe: 89 fd mov %edi,%ebp 200: d1 fa sar %edx 202: 44 8d 24 36 lea (%rsi,%rsi,1),%r12d 206: 83 fa 01 cmp $0x1,%edx 209: 0f 84 92 00 00 00 je 2a1 <multiply1+0xc1> 20f: 89 f9 mov %edi,%ecx 211: 44 8d 2c b5 00 00 00 lea 0x0(,%rsi,4),%r13d 218: 00 219: c1 f9 02 sar $0x2,%ecx 21c: 83 f9 01 cmp $0x1,%ecx 21f: 74 76 je 297 <multiply1+0xb7> 221: 41 89 f8 mov %edi,%r8d 224: 44 8d 34 f5 00 00 00 lea 0x0(,%rsi,8),%r14d 22b: 00 22c: 41 c1 f8 03 sar $0x3,%r8d 230: 41 83 f8 01 cmp $0x1,%r8d 234: 74 57 je 28d <multiply1+0xad> 236: 41 89 f9 mov %edi,%r9d 239: 41 89 f7 mov %esi,%r15d 23c: 41 c1 f9 04 sar $0x4,%r9d 240: 41 c1 e7 04 shl $0x4,%r15d 244: 41 83 f9 01 cmp $0x1,%r9d 248: 44 89 0c 24 mov %r9d,(%rsp) 24c: 74 34 je 282 <multiply1+0xa2> 24e: c1 e6 05 shl $0x5,%esi 251: c1 ff 05 sar $0x5,%edi 254: 44 89 44 24 0c mov %r8d,0xc(%rsp) 259: 89 4c 24 08 mov %ecx,0x8(%rsp) 25d: 89 54 24 04 mov %edx,0x4(%rsp) 261: e8 00 00 00 00 callq 266 <multiply1+0x86> 266: 44 8b 0c 24 mov (%rsp),%r9d 26a: 44 8b 44 24 0c mov 0xc(%rsp),%r8d 26f: 41 01 c7 add %eax,%r15d 272: 8b 4c 24 08 mov 0x8(%rsp),%ecx 276: 8b 54 24 04 mov 0x4(%rsp),%edx 27a: 41 83 e1 01 and $0x1,%r9d 27e: 44 0f 44 f8 cmove %eax,%r15d 282: 45 01 fe add %r15d,%r14d 285: 41 83 e0 01 and $0x1,%r8d 289: 45 0f 44 f7 cmove %r15d,%r14d 28d: 45 01 f5 add %r14d,%r13d 290: 83 e1 01 and $0x1,%ecx 293: 45 0f 44 ee cmove %r14d,%r13d 297: 45 01 ec add %r13d,%r12d 29a: 83 e2 01 and $0x1,%edx 29d: 45 0f 44 e5 cmove %r13d,%r12d 2a1: 44 01 e3 add %r12d,%ebx 2a4: 83 e5 01 and $0x1,%ebp 2a7: 44 0f 45 e3 cmovne %ebx,%r12d 2ab: 48 83 c4 18 add $0x18,%rsp 2af: 44 89 e0 mov %r12d,%eax 2b2: 5b pop %rbx 2b3: 5d pop %rbp 2b4: 41 
5c pop %r12 2b6: 41 5d pop %r13 2b8: 41 5e pop %r14 2ba: 41 5f pop %r15 2bc: c3 retq 2bd: 0f 1f 00 nopl (%rax)
00000000000002c0 <mult_acc0>: 2c0: 83 fe 01 cmp $0x1,%esi 2c3: 89 d0 mov %edx,%eax 2c5: 74 21 je 2e8 <mult_acc0+0x28> 2c7: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1) 2ce: 00 00 2d0: 89 f1 mov %esi,%ecx 2d2: 8d 04 12 lea (%rdx,%rdx,1),%eax 2d5: d1 fe sar %esi 2d7: 83 e1 01 and $0x1,%ecx 2da: 01 fa add %edi,%edx 2dc: 85 c9 test %ecx,%ecx 2de: 0f 45 fa cmovne %edx,%edi 2e1: 83 fe 01 cmp $0x1,%esi 2e4: 89 c2 mov %eax,%edx 2e6: 75 e8 jne 2d0 <mult_acc0+0x10> 2e8: 01 f8 add %edi,%eax 2ea: c3 retq 2eb: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
00000000000002f0 <mult_acc1>: 2f0: eb 11 jmp 303 <mult_acc1+0x13> 2f2: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1) 2f8: 40 f6 c6 01 test $0x1,%sil 2fc: 0f 45 f8 cmovne %eax,%edi 2ff: 01 d2 add %edx,%edx 301: d1 fe sar %esi 303: 83 fe 01 cmp $0x1,%esi 306: 8d 04 17 lea (%rdi,%rdx,1),%eax 309: 75 ed jne 2f8 <mult_acc1+0x8> 30b: f3 c3 repz retq 30d: 0f 1f 00 nopl (%rax)
0000000000000310 <mult_acc2>: 310: 89 f8 mov %edi,%eax 312: eb 08 jmp 31c <mult_acc2+0xc> 314: 0f 1f 40 00 nopl 0x0(%rax) 318: 01 d2 add %edx,%edx 31a: d1 fe sar %esi 31c: 40 f6 c6 01 test $0x1,%sil 320: 74 f6 je 318 <mult_acc2+0x8> 322: 01 d0 add %edx,%eax 324: 83 fe 01 cmp $0x1,%esi 327: 75 ef jne 318 <mult_acc2+0x8> 329: f3 c3 repz retq 32b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
0000000000000330 <mult_acc3>: 330: 89 f8 mov %edi,%eax 332: eb 08 jmp 33c <mult_acc3+0xc> 334: 0f 1f 40 00 nopl 0x0(%rax) 338: d1 fe sar %esi 33a: 01 d2 add %edx,%edx 33c: 40 f6 c6 01 test $0x1,%sil 340: 74 f6 je 338 <mult_acc3+0x8> 342: 01 d0 add %edx,%eax 344: 83 fe 01 cmp $0x1,%esi 347: 75 ef jne 338 <mult_acc3+0x8> 349: f3 c3 repz retq 34b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
0000000000000350 <mult_acc4>: 350: 89 f8 mov %edi,%eax 352: eb 08 jmp 35c <mult_acc4+0xc> 354: 0f 1f 40 00 nopl 0x0(%rax) 358: d1 fe sar %esi 35a: 01 d2 add %edx,%edx 35c: 40 f6 c6 01 test $0x1,%sil 360: 74 f6 je 358 <mult_acc4+0x8> 362: 01 d0 add %edx,%eax 364: 83 fe 01 cmp $0x1,%esi 367: 75 ef jne 358 <mult_acc4+0x8> 369: f3 c3 repz retq 36b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
0000000000000370 <mult_acc5>: 370: 89 f8 mov %edi,%eax 372: eb 08 jmp 37c <mult_acc5+0xc> 374: 0f 1f 40 00 nopl 0x0(%rax) 378: 01 d2 add %edx,%edx 37a: d1 fe sar %esi 37c: 89 f1 mov %esi,%ecx 37e: 83 e1 01 and $0x1,%ecx 381: 0f 45 ca cmovne %edx,%ecx 384: 01 c8 add %ecx,%eax 386: 83 fe 01 cmp $0x1,%esi 389: 7f ed jg 378 <mult_acc5+0x8> 38b: f3 c3 repz retq
Disassembly of section .text.startup:
0000000000000000 <main>: 0: 41 57 push %r15 2: 41 56 push %r14 4: ba 0a 00 00 00 mov $0xa,%edx 9: 41 55 push %r13 b: 41 54 push %r12 d: 55 push %rbp e: 53 push %rbx f: 48 89 f3 mov %rsi,%rbx 12: 48 83 ec 58 sub $0x58,%rsp 16: 48 8b 7e 08 mov 0x8(%rsi),%rdi 1a: 31 f6 xor %esi,%esi 1c: 48 c7 04 24 00 00 00 movq $0x0,(%rsp) 23: 00 24: 48 c7 44 24 08 00 00 movq $0x0,0x8(%rsp) 2b: 00 00 2d: 48 c7 44 24 10 00 00 movq $0x0,0x10(%rsp) 34: 00 00 36: 48 c7 44 24 18 00 00 movq $0x0,0x18(%rsp) 3d: 00 00 3f: 48 c7 44 24 20 00 00 movq $0x0,0x20(%rsp) 46: 00 00 48: 48 c7 44 24 28 00 00 movq $0x0,0x28(%rsp) 4f: 00 00 51: 48 c7 44 24 30 00 00 movq $0x0,0x30(%rsp) 58: 00 00 5a: 48 c7 44 24 38 00 00 movq $0x0,0x38(%rsp) 61: 00 00 63: 48 c7 44 24 40 00 00 movq $0x0,0x40(%rsp) 6a: 00 00 6c: e8 00 00 00 00 callq 71 <main+0x71> 71: 48 8b 7b 10 mov 0x10(%rbx),%rdi 75: 49 89 c6 mov %rax,%r14 78: ba 0a 00 00 00 mov $0xa,%edx 7d: 31 f6 xor %esi,%esi 7f: e8 00 00 00 00 callq 84 <main+0x84> 84: 45 85 f6 test %r14d,%r14d 87: 8d 50 ff lea -0x1(%rax),%edx 8a: 7e 64 jle f0 <main+0xf0> 8c: 48 63 d2 movslq %edx,%rdx 8f: 45 31 ff xor %r15d,%r15d 92: 4c 8b 2c d4 mov (%rsp,%rdx,8),%r13 96: 41 bc 01 00 00 00 mov $0x1,%r12d 9c: 0f 1f 40 00 nopl 0x0(%rax) a0: 44 89 e5 mov %r12d,%ebp a3: bb 01 00 00 00 mov $0x1,%ebx a8: eb 14 jmp be <main+0xbe> aa: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1) b0: 83 c3 01 add $0x1,%ebx b3: 44 01 e5 add %r12d,%ebp b6: 81 fb 01 01 00 00 cmp $0x101,%ebx bc: 74 1c je da <main+0xda> be: 89 de mov %ebx,%esi c0: 44 89 e7 mov %r12d,%edi c3: 41 ff d5 callq *%r13 c6: 39 c5 cmp %eax,%ebp c8: 74 e6 je b0 <main+0xb0> ca: 89 c6 mov %eax,%esi cc: 89 ea mov %ebp,%edx ce: bf 00 00 00 00 mov $0x0,%edi d3: 31 c0 xor %eax,%eax d5: e8 00 00 00 00 callq da <main+0xda> da: 41 83 c4 01 add $0x1,%r12d de: 41 81 fc 01 01 00 00 cmp $0x101,%r12d e5: 75 b9 jne a0 <main+0xa0> e7: 41 83 c7 01 add $0x1,%r15d eb: 45 39 f7 cmp %r14d,%r15d ee: 75 a6 jne 96 <main+0x96> f0: 48 83 c4 58 add $0x58,%rsp f4: 31 c0 xor 
%eax,%eax f6: 5b pop %rbx f7: 5d pop %rbp f8: 41 5c pop %r12 fa: 41 5d pop %r13 fc: 41 5e pop %r14 fe: 41 5f pop %r15 100: c3 retq morrisma@morrisma5 ~/Development/CProjects/Stepanov $
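For readers without the book to hand, the multiply/mult_acc pair behind listings like mult_acc4 above can be sketched in C roughly as follows. This is my reconstruction from Stepanov and Rose's pseudocode; the names and exact form are assumptions, not the source that was compiled for the listings.

```c
#include <assert.h>

/* Russian Peasant (Egyptian) multiplication: returns r + n * a
 * for n >= 1, using only halving, doubling, and addition. */
static int mult_acc(int r, int n, int a)
{
    while (1) {
        if (n & 1) {         /* odd(n): fold one copy of a into r */
            r += a;
            if (n == 1)      /* last bit of n consumed: done */
                return r;
        }
        n >>= 1;             /* half(n) */
        a += a;              /* double a */
    }
}

static int multiply(int n, int a)
{
    return mult_acc(0, n, a);   /* n * a = 0 + n * a */
}
```

Judging from the listings, the differences between the mult_acc* variants largely come down to how the compiler renders the odd-test (a branch versus cmovne) and the ordering of the halving and doubling steps.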
_________________ Michael A.
Mon Sep 04, 2017 8:31 pm
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1806
In the land of GCC there are also -march and -mtune. See perhaps https://stackoverflow.com/questions/106 ... chitecture
Tue Sep 05, 2017 1:14 pm
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1806
Also well worth a mention is Matt Godbolt's GCC Explorer (https://godbolt.org/), which allows you to explore
- different versions of GCC
- GCC for x86 or for ARM
- any different command line options
and to do that for any short sequence of C code you care to paste in, from the comfort of your browser. Oh, and it works in a variety of source languages, and also lets you share links to interesting findings.
Tue Sep 05, 2017 6:12 pm
rwiker
Joined: Tue Aug 08, 2017 1:33 pm Posts: 11
Just to muddy the waters a bit: I implemented this algorithm in Common Lisp, using (optional) declarations to induce the compiler (SBCL, in this case) to use fixnum arithmetic. This implementation took about 5.5 seconds (5000 iterations), which compares fairly well with the 3.16 seconds (I think) that I got from multiply3 with clang -O3.
Tue Sep 05, 2017 8:09 pm
MichaelM
Joined: Wed Apr 24, 2013 9:40 pm Posts: 213 Location: Huntsville, AL
Big Ed:
Very cool link. Bookmarked it.
_________________ Michael A.
Wed Sep 06, 2017 2:01 am