Moar Speed Please!

The number one complaint from ASP users: make it faster. The number two complaint: what is the LE90 of their DEM? I’m only going to take a stab at answering the first one in this post. However, we’ll look at it from the lazy perspective of just changing the compiler and not implementing any crazy new algorithms, because algorithms are hard.

When we build ASP’s binaries, we use an Apple variant of GCC 4.2 on OSX. When we build our Linux binaries, we use GCC 4.4.6 from RHEL6. Those compilers are relatively old; the newest, GCC 4.4.6, was compiled back in 2010. Since then, new versions of GCC have been released, and Clang++ has been maturing. New processor instruction sets have even been released, such as the 256-bit-wide AVX.

The first test I performed was simply recording the run time of our unit tests for the Bayes EM subpixel refinement algorithm. I tested on an OSX 10.7.5 system with a Core i7-2675QM and on an Ubuntu 12.04 system with an AMD FX 8120. Both systems support the AVX instruction set. I was able to get newer compilers on the OSX system by using MacPorts. For the Ubuntu box, I installed almost everything through Aptitude. However, I got the Clang-3.1 binaries directly from the LLVM website.
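The post does not show how the run times were collected, so the following is only a minimal sketch of one way to get an average and a standard deviation from repeated runs. The function name `time_command`, the repeat count, and the placeholder workload are all hypothetical, and whether the tables use a sample or a population standard deviation is not stated; the sketch assumes the sample form.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, repeats=5):
    """Run `cmd` `repeats` times; return (mean, sample std dev) of wall-clock seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises if the command fails, so a crashing binary is not timed silently.
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    # statistics.stdev needs at least two samples.
    return statistics.mean(samples), statistics.stdev(samples)


if __name__ == "__main__":
    # Placeholder workload (assumption): replace with the compiled unit test binary.
    cmd = [sys.executable, "-c", "sum(range(100000))"]
    avg, std = time_command(cmd, repeats=5)
    print(f"avg {avg:.3f} s, std dev {std:.3f} s")
```

Running the same command several times per compiler and flag set, as here, is what makes a Std Dev column possible at all; a single run would give no sense of the noise.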

Bayes EM timings for Ubuntu 12.04

| Compiler | -O3 -mno-avx | | | -O3 -mavx | | | -O3 -mavx -funsafe | | | -Ofast -mavx -funsafe | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | Sum Px Error | Avg Time | Std Dev | Sum Px Error | Avg Time | Std Dev | Sum Px Error | Avg Time | Std Dev | Sum Px Error | Avg Time | Std Dev |
| g++-4.4 | 0.871589 | 2.714 | 0.146 | 0.871589 | 2.629 | 0.142 | 0.887567 | 2.629 | 0.172 | | | |
| g++-4.5 | 0.871589 | 2.621 | 0.05 | 0.871589 | 2.587 | 0.034 | 0.887566 | 2.669 | 0.183 | | | |
| g++-4.6 | 0.871589 | 2.493 | 0.009 | 0.871589 | 2.743 | 0.1 | 0.88774 | 2.542 | 0.173 | 0.88774 | 2.285 | 0.125 |
| g++-4.7 | 0.871589 | 2.439 | 0.017 | 0.871589 | 2.62 | 0.127 | 0.887566 | 2.581 | 0.111 | 0.887566 | 2.36 | 0.202 |
| clang++-2.9 | segfaulted | | | | | | | | | | | |
| clang++-3.0 | 0.871589 | 2.29 | 0.195 | 14.2007 | 2.475 | 0.159 | 14.2007 | 2.44 | 0.102 | | | |
| clang++-3.1 | 0.871589 | 2.434 | 0.215 | 0.871589 | 2.492 | 0.238 | 0.87157 | 2.309 | 0.225 | | | |

Bayes EM timings for OSX 10.7.5

| Compiler | -O3 | | | -O3 -funsafe | | | -Ofast | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | Sum Px Error | Avg Time | Std Dev | Sum Px Error | Avg Time | Std Dev | Sum Px Error | Avg Time | Std Dev |
| g++-4.2 | 0.871582 | 2.59 | 0.103 | 0.887563 | 2.52 | 0.111 | | | |
| g++-4.4.7 | 0.871582 | 2.48 | 0.212 | 0.887563 | 2.27 | 0.027 | | | |
| g++-4.5.4 | 0.871582 | 2.265 | 0.03 | 0.887564 | 2.187 | 0.032 | | | |
| g++-4.7.1 | 0.871582 | 2.122 | 0.036 | 0.887777 | 2.005 | 0.02 | 0.887777 | 1.905 | 0.011 |
| clang++-2.1 | 0.871582 | 2.193 | 0.021 | 0.871582 | 2.485 | 0.313 | | | |
| clang++-2.9 | 0.871582 | 2.273 | 0.014 | 0.871582 | 2.247 | 0.039 | | | |
| clang++-3.1 | 0.871582 | 1.996 | 0.013 | 0.871586 | 1.91 | 0.014 | | | |
| llvm-g++-4.2 | 0.871582 | 2.149 | 0.008 | 0.871582 | 2.19 | 0.027 | | | |

I also tested Clang-2.9 on Ubuntu. Unfortunately, every compile operation resulted in an internal segfault. Clang-3.0 worked most of the time, until I manually turned on ‘-mavx’. That gave no improvement in Bayes EM performance, but it did cause the test code to return bad results (a Sum Px Error of 14.2007 instead of 0.871589). Overall, GCC 4.7 and Clang 3.1 cut the run time relative to GCC 4.4 by roughly 10% on Ubuntu and by about 15% to 20% on OSX at -O3.
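Under GCC, the Sum Px Error in the tables above moves from 0.871589 to about 0.8876 whenever ‘-funsafe’ is on. Assuming that column is shorthand for GCC’s -funsafe-math-optimizations (the post does not spell it out), one plausible contributor is that the compiler is then allowed to regroup floating-point additions, and floating-point addition is not associative. The sketch below is not the Bayes EM code; it only illustrates with made-up values how regrouping the same sum changes the answer.

```python
import math

# Illustrative values (assumption): one huge term, its negation, and two small terms.
values = [1e16, 1.0, -1e16, 1.0]

# Strict left-to-right order, as the source code would spell it out.
# 1e16 + 1.0 rounds back to 1e16, so the first 1.0 is lost.
left_to_right = ((values[0] + values[1]) + values[2]) + values[3]

# A regrouping that an optimizer with relaxed floating-point rules may choose,
# for example to sum independent pairs in parallel.
regrouped = (values[0] + values[2]) + (values[1] + values[3])

print("left to right:", left_to_right)
print("regrouped:    ", regrouped)
print("exact (fsum): ", math.fsum(values))
```

Neither ordering is wrong from the compiler’s point of view once relaxed math is allowed, which is why a small shift in an accumulated error metric is not by itself proof of a miscompile, unlike the 14.2007 result from Clang-3.0 with ‘-mavx’.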

I also tested the performance of our integer correlator under different compilers.

Integer Correlator timings for Ubuntu 12.04

| Compiler | -O3 -mno-avx | | -O3 -mavx | | -O3 -mavx -funsafe | | -Ofast -mavx -funsafe | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | Avg Time | Std Dev | Avg Time | Std Dev | Avg Time | Std Dev | Avg Time | Std Dev |
| g++-4.4 | 8.288 | 0.037 | 8.136 | 0.03 | 8.127 | 0.032 | | |
| g++-4.5 | 8.396 | 0.014 | 8.267 | 0.024 | 8.326 | 0.022 | | |
| g++-4.6 | 5.168 | 0.078 | 5.094 | 0.022 | 5.102 | 0.019 | 5.11 | 0.022 |
| g++-4.7 | 4.525 | 0.019 | 4.624 | 0.014 | 4.669 | 0.012 | 4.638 | 0.017 |
| clang++-2.9 | | | | | | | | |
| clang++-3.0 | 5.147 | 0.053 | 5.079 | 0.094 | 5.06 | 0.012 | | |
| clang++-3.1 | 5.119 | 0.012 | 5.059 | 0.32 | 4.949 | 0.016 | | |

Integer Correlator timings for OSX 10.7.5

| Compiler | -O3 | | -O3 -funsafe | | -Ofast | |
| --- | --- | --- | --- | --- | --- | --- |
| | Avg Time | Std Dev | Avg Time | Std Dev | Avg Time | Std Dev |
| g++-4.2 | 8.973 | 0.096 | 8.654 | 0.047 | | |
| g++-4.4.7 | 8.61 | 0.034 | 8.654 | 0.181 | | |
| g++-4.5.4 | 8.131 | 0.083 | 7.67 | 0.033 | | |
| g++-4.7.1 | 4.044 | 0.024 | 4.084 | 0.03 | 3.9 | 0.023 |
| clang++-2.1 | 5.077 | 0.023 | 5.072 | 0.029 | | |
| clang++-2.9 | 5.211 | 0.032 | 5.192 | 0.013 | | |
| clang++-3.1 | 4.966 | 0.018 | 4.973 | 0.027 | | |
| llvm-g++-4.2 | 5.097 | 0.023 | 5.113 | 0.021 | | |

Here, the newer compilers showed significant performance gains. At -O3, GCC 4.7 ran roughly twice as fast as GCC 4.4 (about 1.8 times as fast on Ubuntu and 2.1 times on OSX), and Clang 3.1 ran about 1.6 to 1.7 times as fast. Clang also managed to compile the code correctly every time, unlike in the Bayes EM tests. However, I would still recommend sticking with the safe and stable GCC. Its 4.7 release got as much performance as the Clang compilers, or more. GCC also provides the peace of mind of knowing that it has always been able to compile VW correctly. Clang still has me on edge: it has burned me many times by producing segfaulting assembly instructions, since it is so aggressive with SIMD.
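The speedup over GCC 4.4 can be read straight off the Avg Time columns as baseline time divided by new time. The short sketch below does that arithmetic for the -O3 columns; the numbers are copied from the integer correlator tables, while the dictionary layout and the `speedup` helper are just illustrative names.

```python
# Avg Time values copied from the integer correlator tables (baseline listed first).
timings = {
    "Ubuntu 12.04, -O3 -mno-avx": {
        "g++-4.4": 8.288, "g++-4.7": 4.525, "clang++-3.1": 5.119,
    },
    "OSX 10.7.5, -O3": {
        "g++-4.4.7": 8.610, "g++-4.7.1": 4.044, "clang++-3.1": 4.966,
    },
}


def speedup(baseline_time, new_time):
    """How many times as fast the new build is compared to the baseline."""
    return baseline_time / new_time


for system, rows in timings.items():
    # The first entry in each group is the GCC 4.4 baseline.
    names = list(rows)
    baseline_name = names[0]
    for name in names[1:]:
        ratio = speedup(rows[baseline_name], rows[name])
        print(f"{system}: {name} is {ratio:.2f}x as fast as {baseline_name}")
```

A ratio of 2.0 corresponds to a "100% speed improvement" in the sense of doing twice the work per unit time, which is the same thing as halving the run time.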

MacPorts Portfiles available for VW and ASP

As of just a minute ago, I’ve committed a portfile for VW and one for ASP to their respective code repositories. The ASP one doesn’t support ISIS or point2mesh. It’s only good for performing stereo on pinhole sessions (MER/Personal Robots) or DG sessions (Digital Globe). I hope that MacPorts will eventually accept them into its distribution as vw-devel and asp-devel. Until that day, you can use these portfiles manually using these instructions.