Moar Processes

Oleg slipped a new feature into ASP 2.1 that I didn’t call much attention to: the “--left-image-crop-win” option, which makes the code process only a subsection of the stereo pair, defined in the coordinates of the left image. Pair this ability with GDAL’s gdalbuildvrt to composite tiles and you have yourself an interesting multiprocess variant of ASP. In the next release we’ll be providing a script called “stereo_mpi” that does all of this for you, and across multiple computers if you have an MPI environment set up (such as on a cluster or supercomputer).
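To make the tiling idea concrete, here is a minimal sketch of how the left image could be cut into crop windows, one per process. It assumes the option takes four integers (xoff, yoff, xsize, ysize) in left-image pixels; the 2048 pixel tile size and both function names are made up for illustration, and this is not necessarily how stereo_mpi lays out its tiles.

```python
from typing import Iterator, List, Tuple

Window = Tuple[int, int, int, int]  # (xoff, yoff, xsize, ysize) in left-image pixels


def crop_windows(width: int, height: int, tile: int) -> Iterator[Window]:
    """Yield non-overlapping windows that together cover a width x height image."""
    if tile <= 0:
        raise ValueError("tile must be positive")
    for yoff in range(0, height, tile):
        for xoff in range(0, width, tile):
            # Tiles on the right and bottom edges are clipped to the image.
            yield (xoff, yoff, min(tile, width - xoff), min(tile, height - yoff))


def crop_args(window: Window) -> List[str]:
    """Format one window as command-line arguments (argument order is an assumption)."""
    return ["--left-image-crop-win"] + [str(v) for v in window]


if __name__ == "__main__":
    # Hypothetical 5000 x 3000 left image cut into 2048 pixel tiles.
    for win in crop_windows(5000, 3000, 2048):
        print(" ".join(crop_args(win)))
```

Each process would then work on one window, and the per-tile outputs could be stitched back into a single virtual mosaic with gdalbuildvrt.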

Our code was already multithreaded and could bring a single computer to its knees with ease. Being a multiprocess application allows us to take over multiple computers. It also allows us to speed up sections of the code that are not thread-safe through parallelization. That is because processes don’t share memory with each other like threads do. Each process gets its own copy of the non-thread-safe ISIS and CSpice libraries and can thus run them simultaneously. However, this also means that our image cache system is not shared among the processes. I haven’t noticed this to be too much of a hit in performance.

I still have an account on NASA’s Pleiades, so I decided to create a 3D model of Aeolis Planum using CTX imagery and 3 methods now available in the ASP code repo. Those options are the traditional stereo command using one node, stereo_mpi using one node, and finally stereo_mpi using two nodes. Here are the results:

Aeolis Planum processing runs on Westmere Nodes

| Command | Walltime | CPU Time | Mem Used |
|---|---|---|---|
| stereo | 00:44:40 | 08:46:57 | 1.71 GB |
| stereo_mpi --mpi 1 | 00:32:11 | 11:28:44 | 5.77 GB |
| stereo_mpi --mpi 2 | 00:21:46 | 10:10:22 | 5.52 GB |
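For anyone who wants the walltimes above as speedup factors, here is a small sketch that converts the HH:MM:SS values to seconds and compares each run against the traditional stereo run. The run labels and function names are just for illustration; the numbers are copied from the table.

```python
from typing import Dict, Tuple


def to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# (walltime, CPU time) for each run, copied from the table above.
RUNS: Dict[str, Tuple[str, str]] = {
    "stereo": ("00:44:40", "08:46:57"),
    "stereo_mpi, 1 node": ("00:32:11", "11:28:44"),
    "stereo_mpi, 2 nodes": ("00:21:46", "10:10:22"),
}


def summarize(runs: Dict[str, Tuple[str, str]], baseline: str = "stereo"):
    """Return {run: (walltime speedup vs baseline, CPU time / walltime)}."""
    base_wall = to_seconds(runs[baseline][0])
    summary = {}
    for name, (wall, cpu) in runs.items():
        wall_s = to_seconds(wall)
        summary[name] = (base_wall / wall_s, to_seconds(cpu) / wall_s)
    return summary


if __name__ == "__main__":
    for name, (speedup, busy) in summarize(RUNS).items():
        print(f"{name}: {speedup:.2f}x walltime speedup, CPU/wall = {busy:.1f}")
```

That works out to roughly a 1.39x speedup for stereo_mpi on one node and 2.05x on two nodes. The CPU time to walltime ratio is a rough measure of how many cores were kept busy on average.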

The stereo_mpi command is faster in walltime than the traditional stereo command entirely because it can run the triangulation step in parallel. Unfortunately, not every step of ASP can be performed with multiple processes due to interdependencies of the tiles. Here’s a quick handy chart for which steps can be multiprocessed or multithreaded. (Well … we could make the processes actually communicate with each other via MPI but … that would be hard).

ASP Steps: Multiprocess or Multithreaded
Multithread x x x x DG/RPC sessions only
Multiprocess x x x

Just for reference, here’s my VWRC file I used for all 3 runs and the PBS job script for the 2 node example. All runs were performed with Bayes EM subpixel and homography pre-alignment.

default_num_threads = 24
write_pool_size = 15
system_cache_size = 200000000

#PBS -S /bin/bash
#PBS -W group_list=#####
#PBS -q normal
#PBS -l select=2:model=wes
#PBS -l walltime=1:30:00
#PBS -m e

# Load MPI so we have the MPIEXEC command for the Stereo_MPI script
module load mpi-mvapich2/1.4.1/intel

cd /u/zmoratto/nobackup/Mars/CTX/Aeolis_Planum_SE
stereo_mpi mpi2/mpi2 --mpiexec 2 --processes 16 --threads-multi 4 --threads-single 24

Come to think of it, I was probably cheating the traditional stereo by leaving the write pool size set to 15.

Update 2/4/13

I also tried this same experiment with the HiRISE stereo pair of Hrad Vallis that we ship in our binaries. Unfortunately, the single node runs didn’t finish in 8 hours and were shut off. Again, this is homography alignment plus Bayes EM subpixel. Everything would have finished much sooner if I had used parabola subpixel instead.

HiRISE Hrad Vallis processing runs on Westmere Nodes

| Command | Walltime | CPU Time | Mem Used | Completed |
|---|---|---|---|---|
| stereo | 08:00:24 | 106:31:38 | 2.59 GB | Nope |
| stereo_mpi --mpi 1 | 08:00:28 | 181:55:00 | 10.0 GB | Nope |
| stereo_mpi --mpi 6 | 02:18:19 | 221:41:52 | 44.9 GB | Yep |

OpenJPEG might be usable

You know about JPEG 2000, right? Wavelet compression, top-notch work of the 90’s, an image compression format that promises better results than JPEG and can be lossless for some pixel types. Well, it totally exists and commercial software uses it quite a bit. It has taken quite a hold of the satellite imaging sector, as it allows image compression to 1/6th the size of a traditional TIFF. Unfortunately, there don’t seem to be any good open source libraries available for everyone else. There’s Jasper, OpenJPEG, and CQJ2K, but they were always an order of magnitude or more slower than the commercial product Kakadu.

OpenJPEG had an official 2.0.0 release on December 1st of last year and it is actually worth a glance. Unfortunately, the current release of GDAL, version 1.9.2, doesn’t support this new release. It was designed for a prototype of OpenJPEG found at revision 2230 of OpenJPEG’s SVN repo. If you are willing, though, the new OpenJPEG v2 release contains the executables opj_decompress and opj_compress for converting JP2 files to and from the TIFF, PNG, and JPEG formats. Another alternative is to download the current development version of GDAL 1.10, which supports the new OpenJPEG v2 library and can use it to read things like NTF. I performed some rough / unscientific tests of these configurations this weekend and my notes are below. My conclusion is that OpenJPEG 2.0.0 is decently nice and I can’t wait for the next release of GDAL so that I can roll it into Ames Stereo Pipeline.

Conversion times for 400 MB JP2 to 2 GB Tiled TIFF

| Command | Time | Peak Memory |
|---|---|---|
| OJP r2230's j2k_to_image | greater than 2 days | ~1 MB |
| GDAL 1.9.2's gdal_translate w/ OJP r2230 | greater than 2 days | ~10 MB |
| OJP v2's opj_decompress | 4 min | ~2 GB |
| GDAL 1.10's gdal_translate w/ OJP v2 w/ GDAL_CACHEMAX=512 | 5 min | ~600 MB |
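As a rough illustration of the last row, here is a sketch of how such a conversion could be launched from Python. The exact flags used in the test above are not recorded, so this is an assumption: it uses gdal_translate's GTiff driver with the TILED=YES creation option and sets GDAL_CACHEMAX through the environment. The file names and the function name are placeholders.

```python
import os
from typing import Dict, List, Tuple


def tiled_tiff_job(src: str, dst: str, cache_mb: int = 512) -> Tuple[List[str], Dict[str, str]]:
    """Build the command line and environment for a JP2 to tiled TIFF conversion."""
    # -of GTiff selects the GeoTIFF driver; -co TILED=YES asks for a tiled layout.
    cmd = ["gdal_translate", "-of", "GTiff", "-co", "TILED=YES", src, dst]
    # GDAL_CACHEMAX limits GDAL's block cache; small values are read as megabytes.
    env = dict(os.environ, GDAL_CACHEMAX=str(cache_mb))
    return cmd, env


if __name__ == "__main__":
    cmd, env = tiled_tiff_job("input.jp2", "output.tif")
    print("GDAL_CACHEMAX=" + env["GDAL_CACHEMAX"], " ".join(cmd))
    # To actually run it (requires a GDAL build with JPEG 2000 support):
    #   import subprocess
    #   subprocess.run(cmd, env=env, check=True)
```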

Moar Speed Please!

The number one complaint from ASP users is: make it faster. The number two complaint is: what is the LE90 of my DEM? I’m only going to take a stab at the first one in this post. However, we’ll look at it from the lazy perspective of just changing the compiler and not implementing any new crazy algorithms, because algorithms are hard.

When we build ASP’s binaries, we use an Apple variant of GCC 4.2 on OSX. When we build our Linux binaries, we use GCC 4.4.6 from RHEL6. Those compilers are relatively old; the newer of the two, GCC 4.4.6, was compiled back in 2010. Since then, new versions of GCC have been released. Clang++ has also been maturing. There have even been new processor instruction sets released, like the 256 bit wide AVX.

The first test I performed was simply recording the run time for our unit tests on the Bayes EM subpixel refinement algorithm. I tested on both an OSX 10.7.5 system with a Core i7-2675QM and an Ubuntu 12.04 system with an AMD FX 8120. Both systems support the AVX instruction set. I was able to get newer compilers on the OSX system by using MacPorts. For the Ubuntu box, I installed almost everything through Aptitude. However, I got the Clang-3.1 binaries directly from the LLVM website.
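If you want to collect the same kind of numbers yourself, a minimal timing harness could look like the sketch below. It runs a callable several times and reports the average and sample standard deviation of the wall-clock time. The repeat count and the test binary name in the comment are placeholders, not the setup used for the tables that follow.

```python
import statistics
import time
from typing import Callable, Tuple


def time_runs(func: Callable[[], object], repeats: int = 5) -> Tuple[float, float]:
    """Return (average, sample standard deviation) of func's run time in seconds."""
    if repeats < 2:
        raise ValueError("need at least two runs to compute a standard deviation")
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)


if __name__ == "__main__":
    # Stand-in workload. For a compiled unit test you could instead pass something like
    #   lambda: subprocess.run(["./TestBinary"], check=True)
    # where "./TestBinary" is a placeholder name.
    avg, dev = time_runs(lambda: sum(i * i for i in range(200_000)))
    print(f"avg {avg:.4f} s, std dev {dev:.4f} s")
```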

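Bayes EM timings for Ubuntu 12.04

| Compiler | Sum Px Error (-O3 -mno-avx) | Avg Time (-O3 -mno-avx) | Std Dev (-O3 -mno-avx) | Sum Px Error (-O3 -mavx) | Avg Time (-O3 -mavx) | Std Dev (-O3 -mavx) | Sum Px Error (-O3 -mavx -funsafe) | Avg Time (-O3 -mavx -funsafe) | Std Dev (-O3 -mavx -funsafe) | Sum Px Error (-Ofast -mavx -funsafe) | Avg Time (-Ofast -mavx -funsafe) | Std Dev (-Ofast -mavx -funsafe) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| g++-4.4 | 0.871589 | 2.714 | 0.146 | 0.871589 | 2.629 | 0.142 | 0.887567 | 2.629 | 0.172 | | | |
| g++-4.5 | 0.871589 | 2.621 | 0.05 | 0.871589 | 2.587 | 0.034 | 0.887566 | 2.669 | 0.183 | | | |
| g++-4.6 | 0.871589 | 2.493 | 0.009 | 0.871589 | 2.743 | 0.1 | 0.88774 | 2.542 | 0.173 | 0.88774 | 2.285 | 0.125 |
| g++-4.7 | 0.871589 | 2.439 | 0.017 | 0.871589 | 2.62 | 0.127 | 0.887566 | 2.581 | 0.111 | 0.887566 | 2.36 | 0.202 |
| clang++-2.9 | segfaulted | | | | | | | | | | | |
| clang++-3.0 | 0.871589 | 2.29 | 0.195 | 14.2007 | 2.475 | 0.159 | 14.2007 | 2.44 | 0.102 | | | |
| clang++-3.1 | 0.871589 | 2.434 | 0.215 | 0.871589 | 2.492 | 0.238 | 0.87157 | 2.309 | 0.225 | | | |
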
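Bayes EM timings for OSX 10.7.5

| Compiler | Sum Px Error (-O3) | Avg Time (-O3) | Std Dev (-O3) | Sum Px Error (-O3 -funsafe) | Avg Time (-O3 -funsafe) | Std Dev (-O3 -funsafe) | Sum Px Error (-Ofast) | Avg Time (-Ofast) | Std Dev (-Ofast) |
|---|---|---|---|---|---|---|---|---|---|
| g++-4.2 | 0.871582 | 2.59 | 0.103 | 0.887563 | 2.52 | 0.111 | | | |
| g++-4.4.7 | 0.871582 | 2.48 | 0.212 | 0.887563 | 2.27 | 0.027 | | | |
| g++-4.5.4 | 0.871582 | 2.265 | 0.03 | 0.887564 | 2.187 | 0.032 | | | |
| g++-4.7.1 | 0.871582 | 2.122 | 0.036 | 0.887777 | 2.005 | 0.02 | 0.887777 | 1.905 | 0.011 |
| clang++-2.1 | 0.871582 | 2.193 | 0.021 | 0.871582 | 2.485 | 0.313 | | | |
| clang++-2.9 | 0.871582 | 2.273 | 0.014 | 0.871582 | 2.247 | 0.039 | | | |
| clang++-3.1 | 0.871582 | 1.996 | 0.013 | 0.871586 | 1.91 | 0.014 | | | |
| llvm-g++-4.2 | 0.871582 | 2.149 | 0.008 | 0.871582 | 2.19 | 0.027 | | | |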

I tested Clang-2.9 on Ubuntu. Unfortunately, every compile operation resulted in an internal seg-fault. Clang-3.0 worked most of the time, until I manually turned on ‘-mavx’. This caused no improvement in Bayes EM performance; however, it did cause the test code to return bad results (a summed pixel error of 14.2007 instead of 0.871589). Overall, GCC 4.7 and Clang 3.1 showed roughly a 10 to 20% improvement in speed over GCC 4.4.

I also tested the performance of our integer correlator under different compilers.

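Integer Correlator timings for Ubuntu 12.04

| Compiler | Avg Time (-O3 -mno-avx) | Std Dev (-O3 -mno-avx) | Avg Time (-O3 -mavx) | Std Dev (-O3 -mavx) | Avg Time (-O3 -mavx -funsafe) | Std Dev (-O3 -mavx -funsafe) | Avg Time (-Ofast -mavx -funsafe) | Std Dev (-Ofast -mavx -funsafe) |
|---|---|---|---|---|---|---|---|---|
| g++-4.4 | 8.288 | 0.037 | 8.136 | 0.03 | 8.127 | 0.032 | | |
| g++-4.5 | 8.396 | 0.014 | 8.267 | 0.024 | 8.326 | 0.022 | | |
| g++-4.6 | 5.168 | 0.078 | 5.094 | 0.022 | 5.102 | 0.019 | 5.11 | 0.022 |
| g++-4.7 | 4.525 | 0.019 | 4.624 | 0.014 | 4.669 | 0.012 | 4.638 | 0.017 |
| clang++-3.0 | 5.147 | 0.053 | 5.079 | 0.094 | 5.06 | 0.012 | | |
| clang++-3.1 | 5.119 | 0.012 | 5.059 | 0.32 | 4.949 | 0.016 | | |
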
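Integer Correlator timings for OSX 10.7.5

| Compiler | Avg Time (-O3) | Std Dev (-O3) | Avg Time (-O3 -funsafe) | Std Dev (-O3 -funsafe) | Avg Time (-Ofast) | Std Dev (-Ofast) |
|---|---|---|---|---|---|---|
| g++-4.2 | 8.973 | 0.096 | 8.654 | 0.047 | | |
| g++-4.4.7 | 8.61 | 0.034 | 8.654 | 0.181 | | |
| g++-4.5.4 | 8.131 | 0.083 | 7.67 | 0.033 | | |
| g++-4.7.1 | 4.044 | 0.024 | 4.084 | 0.03 | 3.9 | 0.023 |
| clang++-2.1 | 5.077 | 0.023 | 5.072 | 0.029 | | |
| clang++-2.9 | 5.211 | 0.032 | 5.192 | 0.013 | | |
| clang++-3.1 | 4.966 | 0.018 | 4.973 | 0.027 | | |
| llvm-g++-4.2 | 5.097 | 0.023 | 5.113 | 0.021 | | |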

Here, the newer compilers showed significant performance gains. Compared to GCC 4.4, GCC 4.7 ran the correlator about twice as fast and Clang 3.1 about 1.6 to 1.7 times as fast. Clang also managed to compile the code correctly every time, unlike in the Bayes EM tests. However, I would still recommend sticking with the safe and stable GCC. Its 4.7 release was able to get just as much or better performance than the Clang compilers. GCC also provides the peace of mind of knowing that it has always been able to compile VW correctly. Clang still has me on edge: it has burned me many times by producing segfaulting assembly instructions, since it is so aggressive with SIMD.
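For reference, the speedup factors can be worked out directly from the integer correlator tables above. The sketch below uses the average times from the first flag column of each table (plain -O3, or -O3 -mno-avx on Ubuntu) and takes GCC 4.4 as the baseline; the dictionary layout and function name are just for illustration.

```python
from typing import Dict

# Average correlator times at -O3 (-O3 -mno-avx on Ubuntu), copied from the tables above.
CORRELATOR_AVG_TIME: Dict[str, Dict[str, float]] = {
    "Ubuntu 12.04": {"g++-4.4": 8.288, "g++-4.7": 4.525, "clang++-3.1": 5.119},
    "OSX 10.7.5": {"g++-4.4.7": 8.61, "g++-4.7.1": 4.044, "clang++-3.1": 4.966},
}
BASELINE = {"Ubuntu 12.04": "g++-4.4", "OSX 10.7.5": "g++-4.4.7"}


def speedups(times: Dict[str, float], baseline: str) -> Dict[str, float]:
    """Return baseline_time / compiler_time for every non-baseline compiler."""
    base = times[baseline]
    return {name: base / t for name, t in times.items() if name != baseline}


if __name__ == "__main__":
    for system, times in CORRELATOR_AVG_TIME.items():
        for compiler, factor in speedups(times, BASELINE[system]).items():
            print(f"{system}: {compiler} runs at {factor:.2f}x the speed of {BASELINE[system]}")
```

A factor of 2.0 here corresponds to a "100% speed improvement", meaning the test finishes in half the time.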