Noproj’n LRO Imagery

Earlier this year I found out that I got my first proposal funded. I’ve had directed funding before thanks to NASA’s cyrosphere program. I’ve also been a Co-I on numerous other funded proposals with co-workers and friends at USGS. However my LASER proposal to perform Mass DEM production from LROC-NA imagery was something I wrote as PI and was competed. Now that it is accepted, I have two years to follow through and I’d like to share the whole process here on the blog.

The first problem I’m going to write about has bugged me for a while. That problem is that each LROC-NA observation is actually 2 files and makes using ASP awkward. In the past I’ve been telling people to run perform 4 runs with stereo. Do the combinations of LE-LE, LE-RE, RE-LE, and RE-RE. However UofA had the same problem with HiRISE, which comes down in 20 different files. They had worked out an ISIS process chain that would noproj those files together among other things to make a single observation. I don’t know why such step doesn’t exist for LROC-NA but today I’ll show you what I’ve come up with.

If you try right now to noproj the RE cub file to the LE cube file you’ll find that the program errors out because an ideal camera model for LROC has not been defined. Definitions for ideal noproj cameras for all ISIS cameras are defined in a file at $ISIS3DATA/base/applications/ noprojInstruments003.pvl. Looking at that file we see that there are 5 elements that can be defined for the ideal camera model, TransY, ItransS, TransX, ItransL, and DetectorSamples. DetectorSamples is easy; it’s what the output image width should be in the final image.  The Trans* variables are measured in millimeters and define a focal plane offset from the original camera model we are using. I plan to noproj with the match file being the LE file. Thus Trans* references a shift from the original position of the LE sensors. The Itrans* variables are pixel conversion of the millimeter measurements. It’s used by ISIS to run the math the other direction. If Trans* and Itrans* variable don’t match, funny errors will occur where the CCDs noproj correctly but the triangulation in ASP is completely bonkers. Relating the two is easy, just use the pixel pitch. For LROC-NA that is 7 micrometers per pixel.

So how do we decide what to set the TransY and TransX values to be? If those values are left to zero, the LE image will be centered on the new image and the RE will be butted up beside but falling off the edge of the new image. A good initial guess would be to set TransY to be a shift half the CCD width. A better idea I thought to use was to look at the FK kernel and identify the angle differences between the two cameras and then use the instantaneous field of view measurement to convert to pixel offset between the two CCD origins. Use pixel pitch to convert to millimeters and then divide by two will be the shift that we want. Below are the final noproj settings I used for LROC.

 Group = "LUNAR RECONNAISSANCE ORBITER/NACL"
    TransY = 16.8833
    ItransS = -2411.9
    TransX = 0.6475
    ItransL = -92.5
    DetectorSamples = 10000
  End_Group

At this point we can noproj the imagery and then handmos them together. A naïve version would look something like the following.

> noproj from=originalRE.cub to=RE.cub match=originalLE.cub
> noproj from=originalLE.cub to=LE.cub match=origianlLE.cub
> cp LE.cub LERE_mosaic.cub
> handmos from=RE.cub mosaic=LERE_mosaic.cub

Alas, the LE and RE imagery does not perfectly align. In the HiRISE processing world we would use hijitreg to determine a mean pixel offset. There is no generic form of hijitreg that we can use for LROC-NA. There is the coreg application, but in all my tests this program failed to find any correct match points between the images. I’ve tried two other successful methods. I’ve manually measured the offset using Qtie. Warning: Sending this images to Qtie requires twiddling with how serial numbers are made for ideal cameras. This is something I should document at some point as it would allow bundle adjusting ideal cameras like fully formed HiRISE and LROC images. My other solution was the example correlate program in Vision Workbench.  I did the following to make processing quick (5 minutes run time).

> crop from=LE.cub to=LE.centerline.cub sample=4900 nsamples=200
> crop from=RE.cub to=RE.centerline.cub sample=4900 nsamples=200
> parallel isis2std from={} to={.}.tif format=tiff ::: *centerline.cub
> correlate --h-corr-min -60 --h-corr-max 0 --v-corr-min 0 --v-corr-max 60 LE.centerline.tif RE.centerline.tif

That creates a disparity TIF file. The average of the valid pixels (third channel is 1) can then be used to solve for the mean translation. That translation can then be used during handmos by using the outsample and outline options. Ideally this would all be one program like hijitreg but that is for another time. Below is the final result.

Hijitreg actually does more than just solve for the translation between CCDs on HiRISE. It correlates a line of pixels between the CCDs in hope of determining the jitter of the spacecraft. I can do the same!

From the above plots, it doesn’t look like there is much jitter or the state of the data is not in form that I could determine. The only interesting bit here is that the offset in vertical direction changes with the line number. I think this might be disparity due to terrain. The imagery I used for my testing was M123514622 and M123521405, which happened to be focused on the wall of Slipher crater. The NA cameras are angled 0.106 degrees off from each other in the vertical direction. Ideal stereo geometry would 30 degrees or 15 degrees, but 0.106 degrees should allow some disparity given the massive elevation change into the crater. I wouldn’t recommend using this for 3D reconstruction but it would explain the vertical offset signal. The horizontal signal has less amplitude but does seem like it might be seeing spacecraft jitter. However it looks aliased, impossible to determine what the frequency is.

Anyways, making a noproj version of the LROC-NA observation is a huge boon for processing with Ames Stereo Pipeline. Using the options of affine epipolar alignment, no map projection, simple parabola subpixel, and it is possible to make a beautiful DEM/DTM in 2.5 hours. 70% of that time was just during triangulation because ISIS is single threaded. That would be faster with the application parallel_stereo (renamed from stereo_mpi in ASP 2.2.2).

Mass CTX Processing

A few weeks back, Ross Beyer presented my blog posts on autonomous HiRISE DEM processing at the HiRISE team meeting in Tucson, AZ. This brought about a question of could this be performed for CTX. Of course! ASP can be applied to do bulk processing for any of the missions it supports. Earth or any place ISIS and CSpice have defined coordinate system for. Just put in some safeties about run time into the processing scripts because ASP still occasionally goes mad and eats a whole bunch of processing time for no output. (Were working on it!)

Processing CTX stereo pairs however is in fact a little more difficult compared to the HiRISE processing I was doing before. HiRISE lists all their stereo pairs from their website or Dr. Shane Byrne’s website. There’s no equivalent for CTX. Luckily for me, some folks at UofA wrote PairendipityCTX (Chris Schaller?). They provided Ross and I a detailed report of overlapping files and other statistics. I cut out everything but the filenames for my own use and you can get a copy of the list with its 1,542 stereo pairs here.

Another difference was how these two missions stored their data on PDS. I can look at a HiRISE filename and work out its download path in PDS with no trouble. CTX on the other hand seems to have arbitrary volume breaks, thus the download URL is not predictable. My solution is a bad solution, but a quick solution. I wrote a python script that scraped PDS’s servers and identified all the CTX images it has and what their URLs are. I then just ‘grep’ for the URL in the processing scripts. Here’s the resulting text file that lists all of the 50,708 CTX images that existed at the time of my scraping. This is a mean trick because my script can make HTTP requests much faster than a human can. In a sense, I’m DOS’ing the PDS servers. So please copy my results rather than my methods.

Processing scripts

Previously with my autonomous HiRISE processing efforts, I just wrote a Bash script and then spawned it multiple times on my server using GNU parallel. That’s great, but looking around my office I saw a lot of unused computers that I’d like to do my bidding. What I wanted was a job management system like PBS, which is what they use on the super computer. But PDS is a little too heavy and sometimes cost money. So I instead found two alternative projects that seemed to fit the bill, Gearman and Celery. Gearman is the minimalist. Celery required a database backing and multiple ports open with each slave worker. I decided to use Gearman since it seemed simpler to learn.

My code for this little project is available on Github in my Mars3DGearman project. I hope to eventually add HiRISE support. Here’s how it plays out. All machines make a working directory that contains the CTX stereo list, the CTX url lookup list, and the folders DEM and DRG. One machine is designated the server, in my case, my workstation at home. It starts the only instance of ctx_processor.py with the help of a backing ‘gearmand’ executable (gearman daemon). All of the slaves then SSH back to my server and forward 4730, the port used by Gearman for communication. Then all the slave machines can start one or more instances of ctx_worker.py.

The part I haven’t worked out is how to send home the completed results to my main server. The ctx_worker script produces a DEM and orthophoto and then just dumps it locally into the DEM and DRG folder. Gearman does allow sending binary strings back the main server, but I’m betting a 100 MB long string would break something. I chose a different route. Since I already have all the slaves SSH’ing back to my main server, I decided to simply rsync all the slaves’ DEM and DRG folder back home. I don’t have to re-enter my password as I’ve enabled SSH ControlMaster which re-uses previous connections. For now, I just put that rsync in a watch command that fires every 2 hours. Alternatively it could be inside the ctx_worker script. A better bet would be to use SSH keys.

Another worthwhile improvement compared to my HiRISE processing scripts is the inclusion of a timeout method for each step. When it comes to CTX, if the correlation doesn’t finish in 2 hours, I’m not interested. This timeout is achieved through the run_cmd and process_timeout functions in the ctx_worker script. The Internet helped me a lot in figuring out how to make that a reality. Thanks Internet!

Results

These last few days I’ve roped 4 machines into doing my bidding. So far they’ve produced ~260 DEMs. 5 new DEMs completed just while I was writing this article. There are still some hiccups in the process. But when the stars align, I seem to produce over 50 new DEMs every day. I’m not going to show you all of them as that felt like a lot of work to post on to the blog. Instead I’m just going to show off a couple screenshots of some interesting places in Google Mars. The color ramp is a little funky because someday I need to learn to reference everything against the Mars Aeroid and not just the sphere.

Not everything looks as great as those screenshots. Here are some places that failed to correlate. I’m not sure what went wrong and I unfortunately don’t have the time to investigate.

In conclusion, this is just another proof of concept of what is possible. I hope that someday someone else will attempt this and do a better job than me. ASP is not perfect, but it can achieve a lot of processing on its own that could be beneficial to the scientific community.

Related

Shean, D. E., et al. “MRO CTX Stereo Image Processing and Preliminary DEM Quality Assessment.“ Lunar and Planetary Institute Science Conference Abstracts. Vol. 42. 2011.

David Shean, MSSS, and Larry Edwards actually already attempted this once before! In the above abstract you can see that they went above and beyond my little demo and processed 1180 stereo pairs. They also developed some initial steps for registering the output DEMs to MOLA and plotted the relationship convergence angle has on the outcome of a stereo pair.

Moar Processes

Oleg slipped a new feature in ASP 2.1 that I didn’t call too much attention to. That is the “–left-image-crop-win” option which causes the code to process only a subsection of the stereo pair as defined in the coordinates of the left image. Pair this ability with GDAL’s buildvrt to composite tiles and you have yourself an interesting multiprocess variant of ASP. In the next release we’ll be providing a script called “stereo_mpi” that does all of this for you and across multiple computers if you have an MPI environment setup (such as on a cluster or supercomputer).

Our code was already multithreaded and could bring a single computer to its knees with ease. Being a multiprocess application allows us to take over multiple computers. It also allows us to speed up sections of the code that are not thread-safe through parallelization. That is because processes don’t share memory across each other like threads do. Each process gets their own copy of the non-thread-safe ISIS and CSpice libraries and can thus run them simultaneously. However this also means that our image cache system is not shared among the processes. I haven’t noticed this to be too much of a hit in performance.

I still have an account on NASA’s Pleiades, so I decided to create a 3D model of Aeolis Planum using CTX imagery and 3 methods now available in the ASP code repo. Those options are the traditional stereo command using one node, stereo_mpi using one node, and finally stereo_mpi using two nodes. Here are the results:

Aeolis Planum processing runs on Westmere Nodes
Command Walltime CPU Time Mem Used
stereo 00:44:40 08:46:57 1.71 GB
stereo_mpi –mpi 1 00:32:11 11:28:44 5.77 GB
stereo_mpi –mpi 2 00:21:46 10:10:22 5.52 GB

The stereo_mpi command is faster in walltime compared to traditional stereo command entirely because it can parallel process the triangulation step. Unfortunately not every step of ASP can be performed with multiple processes due to interdependencies of the tiles. Here’s a quick handy chart for which steps can be multiprocessed or multithreaded. (Well … we could make the processes actually communicate with each other via MPI but … that would be hard).

ASP Steps: Multiprocess or Multithreaded
PPRC CORR RFNE FLTR TRI
Multithread x x x x DG/RPC sessions only
Multiprocess x x x

Just for reference, here’s my VWRC file I used for all 3 runs and the PBS job script for the 2 node example. All runs were performed with Bayes EM subpixel and homography pre-alignment.

[general]
default_num_threads = 24
write_pool_size = 15
system_cache_size = 200000000
#PBS -S /bin/bash
#PBS -W group_list=#####
#PBS -q normal
#PBS -l select=2:model=wes
#PBS -l walltime=1:30:00
#PBS -m e

# Load MPI so we have the MPIEXEC command for the Stereo_MPI script
module load mpi-mvapich2/1.4.1/intel

cd /u/zmoratto/nobackup/Mars/CTX/Aeolis_Planum_SE
stereo_mpi P02_002002_1738_XI_06S208W.cal.cub P03_002279_1737_XI_06S208W.cal.cub mpi2/mpi2 --mpiexec 2 --processe
s 16 --threads-multi 4 --threads-single 24

Come to think of it, I was probably cheating the traditional stereo by leaving the write pool size set to 15.

Update 2/4/13

I also tried this same experiment with the HiRISE stereo pair of Hrad Vallis that we ship in our binaries. Unfortunately the single node runs didn’t finish in 8 hours and were shut off. Again, this is homography alignment plus Bayes EM subpixel. Everything would have finished much sooner if I used parabola.

HiRISE Hrad Vallis processing runs on Westmere Nodes
Command Walltime CPU Time Mem Used Completed
stereo 08:00:24 106:31:38 2.59 GB Nope
stereo_mpi –mpi 1 08:00:28 181:55:00 10.0 GB Nope
stereo_mpi –mpi 6 02:18:19 221:41:52 44.9 GB Yep

Ames Stereo Pipeline 2.1 Released!

Hooray! New software for the masses! This release includes a bunch of bug fixes plus a few new features. Most importantly we’ve added support for a generic satellite camera model called the RPC model. RPCs are just big polynomials that map geodetic coordinates to image coordinates but most importantly just about every commercial satellite company ships an RPC with their imagery. This allows Ames Stereo Pipeline to process imagery from new sources that we haven’t previously been able to work with like GeoEye.

The picture on right is an example shaded colorized elevation model of the city Hobart in Australia. That image was created from example stereo imagery provided from GeoEye’s website and represents a difficult stereo pair for us to process. On the northeast corner of the image is a bunch of salt and pepper noise, which represents the water of the bay that we couldn’t correlate into 3D. In the southwest are mountains that have a dense forest with a texture that changes rapidly with viewing angle. Despite these problems you can see that our software was able to extract roads and buildings to some degree. This is interesting primarily because we wrote the software to work primarily on bare rock found on the Moon or Mars. Slowly we are improving so that we can support all kinds of terrain. For now, we recommend that our users apply ASP to imagery of bare rock, grasslands, snow, and ice for best results.

Also interesting is this colorized elevation model of Lucknow India created with Digital Globe sample imagery. The red patches are actually trees! Both of these stereo pairs are discussed as examples in our documentation. This means you can reproduce this work on your computer too!

Bundle Adjusting HiRISE

Last week, I showed off a method for processing LRO-NAC. Now I’m going to show an even more difficult process with HiRISE. Each observation of LRO-NAC was 2 CCDs and it made for a lot of click work. In the case of HiRISE, it has a whopping 10 CCDs. This will make bundle adjustment very tricky if we treated all 20 files as individual cameras. To get around this problem we’ll deploy a new feature in Jigsaw (the Observation option) and we’ll have to modify the HiEDR2Mosaic script that is released with Ames Stereo Pipeline.

HiRISE Preparation

I recently found out that there is a nifty UofA site from Shane Byrne that details all the stereo pairs captured by HiRISE. It is available at this link. I then stalked the user ‘mcewen’ and process a stereo pair he selected. If I was a nice guy I would process stuff selected by user ‘rbeyer’, but he images boring places. Loser! Anyways, for this demo I’ll be processing ESP_013660_1475 and ESP_013950_1475. UofA says it’s a “Gullied 35 Kilometer Diameter Impact Crater in Promethei Terra”. Whatever, I just want 3D.

At this point I would normally tell you to run HiEDR2Mosaic blindly. Unfortunately that won’t work because that script will attempt to project the outer CCDs into the RED5 or RED4’s frame of reference. Any projection is bad because it requires using the spacecraft’s ephemeris, which we don’t trust and we haven’t corrected yet. It also won’t work because noproj will drop some observation serial or whatnots. To fix this, we want to stop HiEDR2Mosaic right before it does noproj. Then we’ll perform our jigsaw and then afterwards we want to resume the process of HiEDR2Mosaic. Doing that required the modification that I checked into Github here. Just download the ‘py.in’ file and rename it to ‘py’. The ‘in’ suffix just means that our build system is going to burn the ASP version into the script at compile time.

Now the run looks like the following:

download all ESP_013660_1475 IMGs
download all ESP_013950_1475 IMGs
hiedr2mosaic.py --stop-at-no-proj ESP_013660*IMG
hiedr2mosaic.py --stop-at-no-proj ESP_013950*IMG

Bundle Adjusting

No magic here, it’s same process as usual for creating a control network. I just picked a special XSpacing of 200 meters and YSpacing of 3 km for Autoseed. This was to make sure that there would be control points on all 10 CCDs. I was guessing that a single HiRISE CCD had a swath of ~500 meters. I also made the MinimumThickness a very small number (something like .0000001) since the HiRISE CCDs are very thin strips. After Autoseed, you then proceed to manually clean up the control network and it will take a very long while. Then you should perform a couple of jigsaw runs to identify control points that have mistakes and correct them in qnet. However, for HiRISE all your jigsaw runs should use the option, observations=yes. This says that images with the same image serial should be treated as the same camera or observation. So CCDs 0 through 9 will be treated as a single camera. Without this flag, jigsaw will never converge.

You might be asking why I didn’t use this for LRO-NAC, that camera also had multiple CCDs. That’s because in the case of LRO-NAC, the CCDs are in two separate optical housing whose position and angle changes noticeably with the thermal cycling of the spacecraft. On HiRISE, all the CCDs are on the same optical plane and they all use the same optics. It is not noticeable at all. If you use the observation option for LRO-NAC, you’ll find that it is impossible to get a sigma0 value under 2 pixels.

In my last post, I tried an idea where I didn’t use any ground control points and tried to make jigsaw auto register to the default ISIS DEM, which is usually the best altimeter data available. I wanted to try that again, unfortunately HiRISE doesn’t have a big enough footprint to cover much detail in MOLA so I added 2 CTX images to the jigsaw problem. Those images were B10_013660_1473_XN_32S256W and B11_013950_1473_XN_32S256W. I made their control network separately and then used cnetmerge to add them to the HiRISE network. I then proceeded to match a bunch of the control points between the two imagers. This would have been easier if I had just processed the images from the beginning together.

Another problem I noticed from the last post was that with every jigsaw run, all control points would start from the ‘apriori’ position. I was updating their radius with cnetnewradii, but their latitude and longitude kept being locked back to their original position. I wanted this to be like ICP, so I modified my script so that after every jigsaw run, ‘apriori’ latitude and longitude would be replaced by their ‘adjusted’ solution. Below is that code.

Adjusted2Apriori.py:

#!/usr/bin/env python
# Expect ./adjusted2apriori.py <input pvl> <output pvl>
import sys

inf = open(sys.argv[1], 'r')
outf = open(sys.argv[2], 'w')

delayed_write = False
delayed_lines = []

for line in inf:
    # Search for Apriori X
    if 'AprioriX' in line:
        delayed_write = True

    if 'AdjustedX' in line:
        delayed_lines.insert(5,line.replace('Adjusted','Apriori'))
    if 'AdjustedY' in line:
        delayed_lines.insert(6,line.replace('Adjusted','Apriori'))
    if 'AdjustedZ' in line:
        delayed_lines.insert(7,line.replace('Adjusted','Apriori'))
        delayed_write = False
        for dline in delayed_lines:
            outf.write( dline )
        delayed_lines = []

    if not delayed_write:
        outf.write( line )
    else:
        if 'AprioriX' in line or 'AprioriY' in line or 'AprioriZ' in line:
            pass
        else:
            delayed_lines.append(line)

bundleadjust.sh:

#!/bin/bash

input_control=control_comb_pointreg_const.net
radius_source=/home/zmoratto/raid/isis/isis3.4.1_ubuntu1204/data/base/dems/molaMarsPlanetaryRadius0005.cub

cp $input_control control_loop.net

for i in `seq 1 50`; do
    echo Iteration $i

    # Convert point's apriori position to be adjusted position
    cnetbin2pvl from= control_loop.net to= control_loop.pvl
    ./adjusted2apriori.py control_loop.pvl control_loop2.pvl
    cnetpvl2bin from= control_loop2.pvl to= control_loop.net

    cnetnewradii cnet= control_loop.net onet= output.net model= $radius_source getlatlon= apriori
    mv output.net control_loop.net

    jigsaw fromlist=cube.lis radius=yes twist=yes cnet= control_loop.net  onet= output.net update=yes spsolve= position camsolve= velocities observations= yes maxits= 100
    mv output.net control_loop.net

    #Gathering statistics for user monitoring
    grep Sigma0: bundleout.txt
    list_length=`wc -l < bundleout.txt`
    interesting_part=`grep -n "POINTS DETAIL" bundleout.txt | awk -F ":" '{print $1}'`
    tail -n $(expr ${list_length} - ${interesting_part}) bundleout.txt | grep RADIUS --color=no | awk -F " " '{print $4}' | awk '{sum+=$1; sumsq+=$1*$1;} END {print "stdev = " sqrt(sumsq/NR - (sum/NR)**2) " meters";}'
    tail -n $(expr ${list_length} - ${interesting_part}) bundleout.txt | grep RADIUS --color=no | awk -F " " '{print $4}' | awk '{mean+=$1} END {print "mean = " mean/NR " meters";}'
done

Running jigsaw in this case was just running my script.

./bundleadjust

Unfortunately I didn’t see the standard deviation of the radius correction reducing ever. So possibly my whole idea of a jigsaw without control points is flawed, or I need a better height source. I really want to go back and redo LRO-NAC now.

After we finish our bundle adjustment, we now finish our HiEDR2Mosaic to create a single image of a HiRISE observation. This looks like the following with the modified script. Notice specifically that I’m giving the script the histitch cube files this time instead of the IMG files.

./hiedr2mosiac.py --resume-at-no-proj ESP_013660*histitch.cub
./hiedr2mosiac.py --resume-at-no-proj ESP_013950*histitch.cub

Processing in ASP

I ran the stereo command like this and I used the default stereo.options file. That just means everything is in full auto and I’m only using parabola subpixel. Thus, my output DEM looks little ugly up close.

stereo ESP_013660_1475_RED.mos_hijitreged.norm.cub ESP_013950_1475_RED.mos_hijitreged.norm.cub HiRISE/HiRISE

Then I used the following point2dem option. Notice I’m using a custom latitude of scale so that the crater of interest will be circular in the projection.

point2dem --t_srs "+proj=eqc +lat_ts=-32 +lat_0=0 +lon_0=0 +x_0=0 +y_0=0 +a=3396000 +b=3396000 +units=m +no_defs" --orthoimage HiRISE-L.tif HiRISE-PC.tif --tr 2

Results

Here are the difference maps between MOLA and a non Bundle Adjusted ASP HiRISE DEM (Raw), a Bundle Adjusted DEM with no GCPs, a Bundle Adjusted DEM with no GCPs but with the addition of the CTX imagery. We can see that the bundle adjustment helps definitely; it removes a 200-meter error.

Adding the CTX imagery definitely helped reduce the error against MOLA. However if we look at a DEM that can be created from the additional 2 CTX images we processed, we’ll see that there are still large pockets of error around the rim of the crater. When I flip back and forth between the MOLA hillshade and the CTX hillshade, I think I can definitely see a shift where the CTX DEM is too low and slightly shrunk.

So possibly my whole ‘no-gcps’ idea might be bunk. For the case of Mars, it is really easy to go get ground control points against the THEMIS mosaic or a processed HRSC observation. You’ll just want to chain THEMIS Mosaic to CTX and then to HiRISE since there is such a large resolution change. I’ll leave it as a homework assignment for someone to work out the exact commands you need to run. At least now you should understand how to bundle adjust HiRISE.

Moar Speed Please!

Number one complaint from ASP users, make it faster. Number two complaint, is what is the LE90 of their DEM. I’m only going to take a stab at answering the first one in this post. However we’ll look at it from the lazy perspective of just changing the compiler and not implementing any new crazy algorithms, because algorithms are hard.

When we build ASP’s binaries, we use an Apple variant of GCC-4.2 on OSX. When we build our Linux binaries, we use GCC 4.4.6 from RHEL6. Those compilers are relatively old, the newest GCC 4.4.6, was compiled back in 2010. Since then, new versions of GCC have been released. Clang++ has also been maturing. There have even been new processor instruction sets that have been released, like the 256 bit wide AVX.

The first test I performed was simply recording the run time for our unit tests on the Bayes EM subpixel refinement algorithm.  I tested on both an OSX 10.7.5 system with a Core i7-2675QM and then an Ubuntu 12.04 system with an AMD FX 8120. Both systems support the AVX instruction set. I was able to get newer compilers on the OSX system by using MacPorts. For the Ubuntu box, I installed almost everything through Aptitude. However, I got the Clang-3.1 binaries directly from the LLVM website.

Bayes EM timings for Ubuntu 12.04
Compiler -O3 -mno-avx -O3 -mavx -O3 -mavx -funsafe -0fast -mavx -funsafe
Sum Px Error Avg Time Std Dev Sum Px Error Avg Time Std Dev Sum Px Error Avg Time Std Dev Sum Px Error Avg Time Std Dev
g++-4.4 0.871589 2.714 0.146 0.871589 2.629 0.142 0.887567 2.629 0.172
g++-4.5 0.871589 2.621 0.05 0.871589 2.587 0.034 0.887566 2.669 0.183
g++-4.6 0.871589 2.493 0.009 0.871589 2.743 0.1 0.88774 2.542 0.173 0.88774 2.285 0.125
g++-4.7 0.871589 2.439 0.017 0.871589 2.62 0.127 0.887566 2.581 0.111 0.887566 2.36 0.202
clang++-2.9 segfaulted
clang++-3.0 0.871589 2.29 0.195 14.2007 2.475 0.159 14.2007 2.44 0.102
clang++-3.1 0.871589 2.434 0.215 0.871589 2.492 0.238 0.87157 2.309 0.225
Bayes EM timings for OSX 10.7.5
Compiler -O3 -O3 -funsafe -Ofast
Sum Px Error Avg Time Std Dev Sum Px Error Avg Time Std Dev Sum Px Error Avg Time Std Dev
g++-4.2 0.871582 2.59 0.103 0.887563 2.52 0.111
g++-4.4.7 0.871582 2.48 0.212 0.887563 2.27 0.027
g++-4.5.4 0.871582 2.265 0.03 0.887564 2.187 0.032
g++-4.7.1 0.871582 2.122 0.036 0.887777 2.005 0.02 0.887777 1.905 0.011
clang++-2.1 0.871582 2.193 0.021 0.871582 2.485 0.313
clang++-2.9 0.871582 2.273 0.014 0.871582 2.247 0.039
clang++-3.1 0.871582 1.996 0.013 0.871586 1.91 0.014
llvm-g++-4.2 0.871582 2.149 0.008 0.871582 2.19 0.027
 

I tested Clang-2.9 on Ubuntu. Unfortunately every compile operation resulted in an internal seg-fault. Clang-3.0 worked most of the time, until I manually turned on ‘-mavx’. This caused no improvement in BayesEM performance, however it did cause the test code to return bad results. Overall, GCC 4.7 and Clang 3.1 showed about 20% improvement in speed over GCC 4.4.

I also tested the performance of our integer correlator under different compilers.

Integer Correlator timings for Ubuntu 12.04
Compiler -O3 -mno-avx -O3 -mavx -O3 -mavx -funsafe -Ofast -mavx -funsafe
Avg Time Std Dev Avg Time Std Dev Avg Time Std Dev Avg Time Std Dev
g++-4.4 8.288 0.037 8.136 0.03 8.127 0.032
g++-4.5 8.396 0.014 8.267 0.024 8.326 0.022
g++-4.6 5.168 0.078 5.094 0.022 5.102 0.019 5.11 0.022
g++-4.7 4.525 0.019 4.624 0.014 4.669 0.012 4.638 0.017
clang++-2.9
clang++-3.0 5.147 0.053 5.079 0.094 5.06 0.012
clang++-3.1 5.119 0.012 5.059 0.32 4.949 0.016

Integer Correlator timings for OSX 10.7.5
Compiler -O3 -O3 -funsafe -Ofast
Avg Time Std Dev Avg Time Std Dev Avg Time Std Dev
g++-4.2 8.973 0.096 8.654 0.047
g++-4.4.7 8.61 0.034 8.654 0.181
g++-4.5.4 8.131 0.083 7.67 0.033
g++-4.7.1 4.044 0.024 4.084 0.03 3.9 0.023
clang++-2.1 5.077 0.023 5.072 0.029
clang++-2.9 5.211 0.032 5.192 0.013
clang++-3.1 4.966 0.018 4.973 0.027
llvm-g++-4.2 5.097 0.023 5.113 0.021
 

Here, the newer compilers showed significant performance gains. GCC 4.7 and Clang 3.1 both showed a 100% speed improvement over GCC 4.4. Clang also managed to compile the code correctly every time unlike in the Bayes EM tests. However I would still recommend sticking with the safe and stable GCC. Their 4.7 release was able to get just as much or better performance than the Clang compilers. GCC just provides the comfort of mind knowing that it has always been able to compile VW correctly. Clang still has me on edge since it burned me so many times because it produced segfaulting assembly instructions since it is so aggressive with SIMD.