# Patch Match with Smoothness Constraints

I really wanted to implement the Heise’s 2013 paper about Patch Match with Huber regularization [1]. However that is a little too much for me to bite off, I kept getting lost in the derivation of the Huber-ROF model. So instead I chose to implement something in between Heise’s paper and Steinbrucker’s 2009 paper about large displacement optical flow [2]. I shall call what I implemented the ghetto Patch Match with squared regularization. I’m fancy.

# Idea

The idea behind this implementation is that we want to minimize the following equation over the image omega.

If this equation looks familiar, it’s because you’ve seen in Semi Global Matching and it is the basis for most all “global” matching ideas. An interesting technique that I think originates for Steinbrucker’s paper is the idea that solving this directly is impossible. However if we separate this equation into two problems with convex relaxation, we can solve both halves with independent algorithms. In the equation below, both U and V are the disparity image. When theta is large, U and V should be the same disparity image.

The first half of the equation is the data term or template correlation term, which Steinbrucker recommend solving with a brute force search and where Heise instead solved with Patch Match stereo. The right half of the equation is the same as the image denoising algorithm called Total Variation Minimization, first written up in 1992 by Ruden et al (ROF) [3]. However, don’t use the 1992 implementation for solving that equation. Chambolle described a better solution using a primal dual algorithm [4]. I’d like to say I understood that paper immediately and implemented it, however I instead lifted an implementation from Julia written by Tim Holy [5] and then rewrote in C++ for speed. (Julia isn’t as fast as they claim, especially when it has to JIT its whole standard library on boot.)

Solving this new form of the equation is done with theta close to zero. Each iteration U is solved with a fixed V. Then V is solved with a fixed U. Finally in preparation for the next iteration, theta is incremented. After many iterations, it is hoped that U and V then converge on the same disparity image.

The algorithm I’ve implemented for this blog post is considerably different from Heise’s paper. I’ve parameterized my disparity with an x and y offset. This is different from Heise’s x offset and normal vector x and y measurement. I also used a sum of absolute differences while Heise used adaptive support weights. Heise’s method also used a Huber cost metric on the smoothing term while I use nothing but squared error. This all means that my implementation is much closer to Steinbrucker’s optical flow algorithm.

# Results

Here is a result I got with cones using a 3×3 kernel size.

That 3×3 kernel size should impress you. I can greatly drop the kernel size when there is a smoothness constraint.

However images of cones and perspective shots in general are not what I care about, I care about the application of this algorithm to terrain. Here it is applied to a subsampled image of a glacier imaged by World View 2.

Crap! Instead of converging to some actual disparity value, it just blurred the initial result.

# Disaster Analysis

Possibly this didn’t work because I didn’t tune parameters correctly. With all global matching algorithms there a bunch of lambdas that try to bend the two cost metrics to equal footing. But is this really the answer?

I think it has something to do with the fact that Patch Match produces the same result as brute force search result, a.k.a. the global minimum across all disparity values. If you look at the global minimum result for both of these input images shown below, you’ll see that the World View imagery is exceptionally ill conditioned.

It seems that the smoothness constraint isn’t strong enough to over come the initial global minimum from correlation. Despite the noise, in the ‘cones’ example it is clear there is just high frequency noise around the final disparity image that we want. However with the World View example, there is no such clear view. That disparity image should be a simple gradient along the horizontal. ASP’s take on this region is shown right. However since the texture is self-similar across the image, this creates patches of global minimum that are the incorrect and resemble palette knife marks.

Seeding the input correlation result with ASP’s rough solution helps things to some degree. But ultimately it seems that this convex relaxation trick doesn’t insure that we converge on the correct (and what I hope is the global) minimum from the original equation. Another detail that I learned is that if Patch Match is going to converge, it will happen in the first 10 iterations.

# What’s Next?

My conclusion is that having a mostly globally minimum cost metric is required. Unfortunately increasing kernel size doesn’t help much. What I think is worth investigating next is: (1) implement the affine matching and second derivative smoothing detailed in Heise’s paper; (2) Cop out and implement hierarchical Patch Match; (3) recant my sins and investigate fSGM. Eventually I’ll find something good for ASP.

# Reference

[1] Heise, Philipp, et al. “PM-Huber: PatchMatch with Huber Regularization for Stereo Matching.”
[2] Steinbrucker, Frank, Thomas Pock, and Daniel Cremers. “Large displacement optical flow computation withoutwarping.” Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009.
[3] Rudin, Leonid I., Stanley Osher, and Emad Fatemi. “Nonlinear total variation based noise removal algorithms.” Physica D: Nonlinear Phenomena 60.1 (1992): 259-268.
[4] Chambolle, Antonin. “An algorithm for total variation minimization and applications.” Journal of Mathematical imaging and vision 20.1-2 (2004): 89-97.
[5] https://github.com/timholy/Images.jl

# Code to Semi Global Matching

I’ve received a few emails asking me for the code to my implementation of Semi Global Matching. So here it is in the state that I last touched it. This is my icky dev code and I have no plans for maintaining it. In another month I’ll probably forget how it works. Before you look at the code, take to heart that this is not a full implementation of what Hirschmuller described in his papers. I didn’t implement 16-path integration or the mutual information cost metric.

TestSemiGlobalMatching is my main function. I developed inside a GTest framework while I was writing this code. [source]

The core of the algorithm is inside the following header and source file links, titled SemiGlobalMatching. As you can see it is all in VW-ese. Most of the math looks Eigen-like and there’s a lot of STL in action that I hope you can still read along. [header] [source]

Also, I haven’t forgotten that I promised to write another article on what I thought was a cool correlator idea. I’m still working on that code base so that I can show off cool urban 3D reconstruction using the satellite imagery I have of NASA Ames Research Center and Mountain View. (I think I have Google’s HQ in the shot.)

## Update 3/3/2014

Someone emailed me to figure out what is the MOC imagery that the code keeps referring to? In this case MOC stands for the Mars Orbital Camera on board the 1996 Mars Global Surveyor mission. The stereo pair is epipolar rectified images of Hrad Vallis that I use regularly and represent a classical problem for me when performing satellite stereo correlation. A copy of the input imagery is now available here [epi-L.tif][epi-R.tif].

# Semi-Global Matching

My co-worker Oleg Alexandrov has been working on Ames Stereo Pipeline for a while now. He’s just about touched all parts of the code. This is crystal clear when you look at our logs on Github. One of things he ribs me most about in ASP is that he doesn’t like that we advertise ASP as using up-to-date stereo correlation algorithms. “Come ‘on Man! That’s just not true!” he tells me. Depending on whom you talk to, we’re using primo 90’s research[7] or something re-hashed from the 70’s[1]. Either way, it is clear, we haven’t been tracking along with the current research in our field when it comes to integer correlation. This blog post covers the first part of my own personal research to find a new correlator algorithm that would improve ASP in terms of both runtime and quality. In this post, I’ll be reviewing an algorithm called Semi-Global Matching.

Semi-Global Matching or SGM is a method developed by Heiko Hirschmueller from the DLR. He first wrote about this in his 2005 paper[2]. He then elaborated and proposed further improvements in [3][4]. The best paper to learn the method is probably his second paper[3]. In my opinion his first paper gets side tracked in using a Mutual Information (MI) cost metric when the interesting bit is just SGM. The most exciting bit about this work is that it comes from DLR and is an algorithm they have applied to aerial and satellite mapping. I believe this is the method that was used to create the wonderful HRSC DTMs that some how managed to overcome the weird JPEG artifacts in their raw imagery.

## The Algorithm

Heiko might have sped over his SGM algorithm in his first paper because he didn’t view it as being as challenging to implement when compared to the MI cost metric. SGM shares a lot in common with scanline optimization stereo, which has had a lot of prior research but now-a-days is considered a dead end. Let’s review how that worked. Also, the images used for this testing are from the Middlebury Stereo dataset. More information about this data and stereo algorithms applied to them can be found in [8][9][10][11].

Scanline optimization stereo is essentially Viterbi decoding in my mind. We evaluate along an epipolar line. In the case of a rectified image, this is along the horizontal scanline. For each pixel along that scanline we evaluate each possible disparity result. The costs of each pixel along the scanline can then be stacked into a matrix. A scanline was highlighted in the above picture. The cost of each pixel (x-direction) versus each possible disparity value (y-direction) is shown in the picture below. The solution for the disparity along this scanline is then the path through this matrix/image that has minimum costs (dark areas). We also have to include some smoothness constraint otherwise our disparity result could follow the jagged jumps in this map that don’t represent reality.

Finding the minimum path is then an application of Linear Programming. We iterate through the matrix left to right and take a rolling sum. The cost of an element in the rolling sum vector for the current pixel and disparity combination is equal to the cost for the current location plus the lowest summed cost from the set of all possible disparities for the prior pixel location. Heiko applies some additional constraints in that he penalizes the cost when ever the disparity changes. He penalizes higher for multiple disparity value transitions than he does for 1. Penality for an increment of 1 in disparity is P1 and anything greater is P2. This entire paragraph can more elegantly be described in the following equation.

Applying this forward and backward for each scanline we can solve for a disparity map. Here’s an example.

Notice there’s a lot of tearing between the scanlines. The image looks as if we had tracking error on a VCR. We could fix this by using a larger kernel size. For the above, the kernel size was 1 x 1 px. Something more unique would insure matches that are constrained between lines. Another approach would be to insure some smoothness constraint across lines as opposed to just disparity transitions. Heiko’s solution to this issue is what makes SGM what it is. He opted to instead perform scanline optimization at multiple angles and then take the summed cost vector to determine the final disparity result. Note, that even though we evaluate the scanline along an angle, the disparity is still defined as going along the epipolar line (perfectly horizontal in this case). Each line direction produces results like the following:

The sum of their cost vectors and then taking the minimum produces a beautiful result like the following:

## My Implementation and Results

All of the pictures above were created with my implementation of SGM. In my version, I only evaluate 8 line directions. So my results are noisier than what’s seen in Heiko’s original paper. Despite this, the end results are pretty impressive. Here’s line up of ASP result, my SGM result, Heiko’s result, and the ground truth result. ASP performs so badly because it has a large kernel size that can’t handle the sudden jumps in depth. ASP then blurs the disparity discontinuities.

Unfortunately I must mention the bad side of this method. There are several cons the first and weakest of arguments is the required CPU time. My implementation of this method takes about 23 seconds to evaluate this with 8 paths. 16 paths like the paper would have doubled the processing time. ASP chops through this image in seconds. Heiko says he got the processing time down to 1.3 seconds in 2005. So I’m doing something horribly wrong and could improve my implementation. However speed is always an issue, some ideas to address this issue are iSGM[5] and wSGM[6]. These are hierarchical methods of SGM and fancy maps to reduce the length required to integrate paths for cost evaluation.

A bigger issue is that SGM requires an absurd amount of memory. All costs for all pixels and all possible disparity values are evaluated up front in a big tensor that has a size of W * H * D * 1 byte. We also need a copy of this tensor for evaluating paths and another to store summing for all paths. Those are two allocations of memory that are W * H * D * 2 bytes. They need to be a higher data type to avoid integer rollover artifacts. This demo set is 450 x 375 px and I evaluated it across 64 possible disparities. Thus SGM required 51 MB. That doesn’t include the memory cost of just loading the images up and allocating space for the disparity result. Imagine tackling a satellite image where we we have a disparity range of 2000 pixels.

Another pesky complaint against SGM is how to figure out what the values should be for the two penalty arguments. Heiko never mentioned what he used; likely he tuned the values for each stereo pair to get best results. However these penalty values ultimately determine how this algorithm responds to occlusion and angled surfaces. What works for World View 1 in glacier regions (low frequencies) might not necessarily apply to World View 1 in the city (square wave patterns). In practice, we would want to have tuned parameters for each instrument we work on and for each type of terrain.

The final and most harsh criticism of SGM is that it can only be applied to 1D disparity searches and the range must be defined beforehand. 1D searches work for calibrated stereo rigs such as the imagery used in this post. However it is my opinion that real data always has imperfections and finding the true disparity requires searching in the Y direction still. Examples of this are linescan cameras that have jitter but the spacecraft ephemeris isn’t sampled high enough to capture such as MOC, HiRISE, and LROC. There’s also the case were the camera isn’t perfect such as the World View cameras where there is a subpixel misregistration of all 50 CCDs. They can’t be easily corrected for because we can’t have raw imagery. ASP also doesn’t have a perfect method for epipolar rectification of linescan cameras. We have a linear approximation with our affine method but the problem is nonlinear.

SGM is still an amazing algorithm that is incredibly useful. There are ton of papers out there that find it to be perfect for their applications. Beside the incredible detail it resolves, my other favorite bit about the algorithm is that its runtime is deterministic. It depends squarely on search range and there is no worst-case path versus best-case path that we have to deal with in ASP’s binary search approach. Despite this, SGM seems to be a non-ideal match for ASP. ASP hopes to address the generic correlation problem where we don’t always trust our camera information or our data. I want something that can still handle 2D searching. In my next post I’ll show off another promising algorithm that seems to address that concern along with runtime and memory requirements. Until then, have some music.

Update: Code is now available here.

## Works Cited

[1]  Barnea, D. (1972). A Class of Algorithms for Fast Digital Image Registration. IEEE Transactions on Computers.
[2]  Hirschmuller, H. (2005). Accurate and Efficient Stereo Processing by Semi Global Matching and Mutual Information. CVPR .
[3]  Hirschmuller, H. (2008). Stereo Processing by Semiglobal Matching and Mutual Information. Pattern Analysis and Machine Intelligence .
[4]  Hirschmulller, H., Buder, M., & Ernst, I. (2012). Memory Efficient Semi-Global Matching. Remote Sensing and Spatial Information Sciences .
[5]  Klette, S. H. (2012). Iterative Semi-Global Matching for Robust Driver Assistance Systems. ACCV .
[6]  Spangenberg, R., Langner, T., & Rojas, R. (2013). Weighted Semi-Global Matching and Center-Symmetric Census Transform for Robust Driver Assistance. CAIP .
[7]  Sun, C. (1997). A Fast Stereo Matching Method. Digital Image Computing: Techniques and Application.
[8] Scharstein, D., Szeliski, R. (2002). A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision .
[9] Scharstein, D., Szeliski, R. (2003). High-accuracy stereo depth maps using structured light. CVPR .
[10] Scharstein, D., Pal, C. (2007). Learning conditional random fields for stereo. CVPR .
[11] Hirschmuller, H., Scharstein, D. (2007). Evaluation of cost functions for stereo matching. CVPR .