Optimizing code via compiler flags
Posted by Steve on Mon 12 Sep 2011 at 19:12
When you're developing performance-critical code you will most likely receive performance increases in one of two ways; via the selection of the most appropriate algorithmic solution, or via the expense of additional hardware. Here we'll look at an interesting alternative - optimising via compiler flags.
Genetic algorithms have fascinated me for years, and one interesting project I've always been vaguely curious about is the research-based acovea tool. Given a small benchmark program which is a current bottleneck in your code this tool will recompile it over and over and over again, with different compiler flags.
After compiling the code with a given set of compiler flags, and timing the result, the best results will be kept and "bred" together. In short if you have a suitable program you can expect to return (many hours later) with a faster program with no human involvement.
To demonstrate this I've put together a trivial prime-number factoring program. This simple script does nothing fancy, and coule be optimised algorithmically with ease - however it is a suitable demonstration. (I suspect it'd be almost a thousand times faster without the I/O caused by printf. Ignore that!)
The sample code is as follows:
#include<stdio.h>
int
main (int argc, char *argv[])
{
int i = 1, num;
int j, k;
for (num = 2; num < 32768; num++)
{
while (i <= num)
{
k = 0;
if (num % i == 0)
{
j = 1;
while (j <= i)
{
if (i % j == 0)
k++;
j++;
}
if (k == 2)
printf ("%d is prime\n", i);
}
i++;
}
}
return 0;
}
Once you've saved this code to ~/bench.c you can compile it and time its execution with:
skx@birthday:~$ gcc -o bench bench.c skx@birthday:~$ time ./bench ... ... real 0m9.487s user 0m9.273s sys 0m0.008s
The execution time on your system may be different, but for me a simple execution took 9.5 seconds. Now lets make it faster. We'll start by installing the appropriate package:
root@birthday:# apt-get install acovea Reading package lists... Done Building dependency tree Reading state information... Done ... The following extra packages will be installed: libacovea-5.1-5 libcoyotl-3.1-4 libevocosm-3.1-1 Recommended packages: acovea-results The following NEW packages will be installed: acovea libacovea-5.1-5 libcoyotl-3.1-4 libevocosm-3.1-1 .. Setting up libevocosm-3.1-1 (3.1.0-3.1) ... Setting up acovea (5.1.1-2) ...
Once installed we can then start working. The first thing we need to do is load a "profile". A profile is a simple XML file which describes the available compiler flags which may be tested - so you could update the tool to work with Intels icc compiler, for example.
When installed several default profile are deployed beneath /usr/share/libacovea/config, so if you do wish to tweak things that is where you should start from.As I'm running on an Intel host I started the process by executing:
skx@birthday:~$ runacovea -config gcc34_intel.acovea -input bench.c
Note: With no tweaks to the number of population sizes, generations, or similar this compilation and benchmarking effort took most of a day. and the end result was:
generation 20 complete, average fitness: 461.928
Acovea completed its analysis at 2011 Sep 06 19:06:37
Optimistic options:
-ftree-copyrename (1.509)
-fmerge-constants (2.071)
-fcrossjumping (1.951)
-fcse-follow-jumps (1.911)
-fno-inline (1.951)
-ftracer (1.79)
Pessimistic options:
-momit-leaf-frame-pointer (-1.545)
-floop-optimize2 (-1.786)
-fforce-mem (-1.786)
-fsched-spec (-1.505)
-funroll-all-loops (-1.626)
-fbranch-target-load-optimize2 (-1.706)
-mfpmath=387 (-1.545)
-mfpmath=sse (-1.746)
Acovea's Best-of-the-Best:
gcc -lrt -lm -std=gnu99 -O1 -march=opteron -fno-thread-jumps \
-fno-if-conversion2 -fno-delayed-branch -fno-loop-optimize \
-ftree-dce -ftree-sra -ftree-copyrename -ftree-fre -ftree-ch \
-fmerge-constants -fcrossjumping -fcse-follow-jumps -fcse-skip-blocks \
-fexpensive-optimizations -fstrength-reduce -frerun-cse-after-loop \
-frerun-loop-opt -fforce-addr -fpeephole2 -fschedule-insns2 -fregmove \
-freorder-blocks -fsched-interblock -funit-at-a-time \
-falign-functions -falign-jumps -falign-loops -falign-labels \
-finline-functions -fno-inline -ftracer -fmodulo-sched -fgcse-sm \
-freschedule-modulo-scheduled-loops -ftree-loop-im -ftree-loop-ivcanon \
-maccumulate-outgoing-args -mno-align-stringops -D__NO_MATH_INLINES \
-funsafe-math-optimizations -fno-signaling-nans \
-o /tmp/ACOVEAF92D8CC5 bench.c
As you can see it has managed to generate a simply horrifically large compile command - with many options I've never seen before. However the important question is how well did it do?
Recompiling the code manually using the suggested flags yielded an execution time of 7.5 seconds. No code changes, yet the program became faster by 2 seconds. That is a pretty astonishing result, albeit one we could have probably replicated via a better algorithm in the first place.
For this tool to be useful I expect you'll need to be writing code which is CPU-bound in some fashion, but which is still simple enough to condense into a useful benchmark. Given that you must compile and execute the benchmark many times you can't expect speed which is a potential downside. (i.e. If you have to "simplify" your real code to make it worth trying acovea on it you might find the suggested results just don't apply to the real thing.)
[ Send Message | View Steve's Scratchpad | View Weblogs ]
That's a good point. Numbers below.
$ for i in -O0 -O1 -O2 -O3 ; do gcc $i -o ./x-$i bench.c ; done $ time ./x--O0 real 0m8.269s user 0m9.677s sys 0m0.260s $ time ./x--O1 real 0m8.170s user 0m8.268s sys 0m0.000s $ time ./x--O2 real 0m8.172s user 0m8.276s sys 0m0.000s $ time ./x--O3 real 0m7.775s user 0m7.776s sys 0m0.000s
Hrm. The "-O3" version is almost as good as the acovea result. That kinda ruins my happy happy joy joy moment!
[ Parent | Reply to this comment ]
[ Parent | Reply to this comment ]
[ Parent | Reply to this comment ]