
Optimizing code via compiler flags

Posted by Steve on Mon 12 Sep 2011 at 19:12

When you're developing performance-critical code you will most likely gain performance in one of two ways: by choosing the most appropriate algorithm, or by spending money on additional hardware. Here we'll look at an interesting alternative - optimising via compiler flags.

Genetic algorithms have fascinated me for years, and one interesting project I've always been vaguely curious about is the research-based acovea tool. Given a small benchmark program which is a current bottleneck in your code, this tool will recompile it over and over again with different compiler flags.

After compiling the code with a given set of compiler flags and timing the result, the best-performing flag sets are kept and "bred" together. In short, if you have a suitable program you can expect to return (many hours later) to a faster program, with no human involvement.

To demonstrate this I've put together a trivial program which finds prime numbers by counting factors. It does nothing fancy, and could easily be optimised algorithmically - however it is a suitable demonstration. (I suspect it'd be almost a thousand times faster without the I/O caused by printf. Ignore that!)

The sample code is as follows:


#include <stdio.h>

int main (int argc, char *argv[])
{
  int num, j, k;

  /* Naive primality test: count the divisors of each number and
     report those with exactly two (1 and the number itself). */
  for (num = 2; num < 32768; num++)
    {
      k = 0;
      j = 1;
      while (j <= num)
        {
          if (num % j == 0)
            k++;
          j++;
        }
      if (k == 2)
        printf ("%d is prime\n", num);
    }
  return 0;
}

Once you've saved this code to ~/bench.c you can compile it and time its execution with:

skx@birthday:~$ gcc -o bench bench.c 
skx@birthday:~$ time ./bench 
real    0m9.487s
user    0m9.273s
sys     0m0.008s

The execution time on your system may be different, but for me a simple execution took 9.5 seconds. Now let's make it faster. We'll start by installing the appropriate package:

root@birthday:# apt-get install acovea
Reading package lists... Done
Building dependency tree 
Reading state information... Done
The following extra packages will be installed:
  libacovea-5.1-5 libcoyotl-3.1-4 libevocosm-3.1-1
Recommended packages:
The following NEW packages will be installed:
  acovea libacovea-5.1-5 libcoyotl-3.1-4 libevocosm-3.1-1
Setting up libevocosm-3.1-1 (3.1.0-3.1) ...
Setting up acovea (5.1.1-2) ...

Once installed we can start working. The first thing we need to do is load a "profile". A profile is a simple XML file which describes the available compiler flags which may be tested - so you could, for example, update the tool to work with Intel's icc compiler.

Several default profiles are deployed beneath /usr/share/libacovea/config, so if you do wish to tweak things that is where you should start. As I'm running on an Intel host I started the process by executing:

skx@birthday:~$ runacovea -config gcc34_intel.acovea -input bench.c

Note: With no tweaks to the population size, number of generations, or similar, this compilation and benchmarking effort took most of a day, and the end result was:

generation 20 complete, average fitness: 461.928

Acovea completed its analysis at 2011 Sep 06 19:06:37

Optimistic options:

                       -ftree-copyrename  (1.509)
                       -fmerge-constants  (2.071)
                          -fcrossjumping  (1.951)
                      -fcse-follow-jumps  (1.911)
                             -fno-inline  (1.951)
                                -ftracer  (1.79)

Pessimistic options:

               -momit-leaf-frame-pointer  (-1.545)
                        -floop-optimize2  (-1.786)
                             -fforce-mem  (-1.786)
                            -fsched-spec  (-1.505)
                      -funroll-all-loops  (-1.626)
          -fbranch-target-load-optimize2  (-1.706)
                            -mfpmath=387  (-1.545)
                            -mfpmath=sse  (-1.746)

Acovea's Best-of-the-Best:
gcc -lrt -lm -std=gnu99 -O1 -march=opteron -fno-thread-jumps \
    -fno-if-conversion2 -fno-delayed-branch -fno-loop-optimize \
    -ftree-dce -ftree-sra -ftree-copyrename -ftree-fre -ftree-ch  \
    -fmerge-constants -fcrossjumping -fcse-follow-jumps -fcse-skip-blocks \
    -fexpensive-optimizations -fstrength-reduce -frerun-cse-after-loop  \
    -frerun-loop-opt -fforce-addr -fpeephole2 -fschedule-insns2 -fregmove \
    -freorder-blocks -fsched-interblock -funit-at-a-time \
    -falign-functions -falign-jumps -falign-loops -falign-labels \
    -finline-functions -fno-inline -ftracer -fmodulo-sched -fgcse-sm \
    -freschedule-modulo-scheduled-loops -ftree-loop-im -ftree-loop-ivcanon \
    -maccumulate-outgoing-args -mno-align-stringops -D__NO_MATH_INLINES \
    -funsafe-math-optimizations -fno-signaling-nans \
    -o /tmp/ACOVEAF92D8CC5 bench.c

As you can see it has managed to generate a horrifically large compile command, with many options I've never seen before. However the important question is: how well did it do?

Recompiling the code manually with the suggested flags yielded an execution time of 7.5 seconds. No code changes, yet the program became two seconds faster. That is a pretty astonishing result, albeit one we could probably have replicated with a better algorithm in the first place.

For this tool to be useful I expect you'll need to be writing code which is CPU-bound in some fashion, but which is still simple enough to condense into a useful benchmark. Given that the benchmark must be compiled and executed many times, the process is slow, which is a potential downside. (i.e. If you have to "simplify" your real code to make it worth trying acovea on it, you might find the suggested results just don't apply to the real thing.)



Re: Optimizing code via compiler flags
Posted by Anonymous (82.8.xx.xx) on Tue 13 Sep 2011 at 21:17
You should have compared the acovea optimum with the "usual" optimization flags (-O2) and not with the unoptimized compilation. Do you have the numbers for that?


Re: Optimizing code via compiler flags
Posted by Steve (90.193.xx.xx) on Tue 13 Sep 2011 at 21:25

That's a good point. Numbers below.

$ for i in -O0 -O1 -O2 -O3 ; do gcc $i -o ./x-$i bench.c ; done

$ time ./x--O0
real    0m8.269s
user    0m9.677s
sys     0m0.260s

$ time ./x--O1
real    0m8.170s
user    0m8.268s
sys     0m0.000s

$ time ./x--O2
real    0m8.172s
user    0m8.276s
sys     0m0.000s

$ time ./x--O3
real    0m7.775s
user    0m7.776s
sys     0m0.000s

Hrm. The "-O3" version is almost as good as the acovea result. That kinda ruins my happy happy joy joy moment!



Re: Optimizing code via compiler flags
Posted by jiteo (98.124.xx.xx) on Sat 1 Oct 2011 at 18:54
It's still a nifty idea - I enjoyed reading this. Although I think the article is mis-titled. It's not so much about optimizing code with compiler flags (using -O3 is optimizing code with compiler flags), it's about using genetic algorithms to figure out what compiler flags to use.
