Patch Match with Smoothness Constraints

I really wanted to implement the Heise’s 2013 paper about Patch Match with Huber regularization [1]. However that is a little too much for me to bite off, I kept getting lost in the derivation of the Huber-ROF model. So instead I chose to implement something in between Heise’s paper and Steinbrucker’s 2009 paper about large displacement optical flow [2]. I shall call what I implemented the ghetto Patch Match with squared regularization. I’m fancy.

Idea

The idea behind this implementation is that we want to minimize the following equation over the image omega.

If this equation looks familiar, it’s because you’ve seen in Semi Global Matching and it is the basis for most all “global” matching ideas. An interesting technique that I think originates for Steinbrucker’s paper is the idea that solving this directly is impossible. However if we separate this equation into two problems with convex relaxation, we can solve both halves with independent algorithms. In the equation below, both U and V are the disparity image. When theta is large, U and V should be the same disparity image.

The first half of the equation is the data term or template correlation term, which Steinbrucker recommend solving with a brute force search and where Heise instead solved with Patch Match stereo. The right half of the equation is the same as the image denoising algorithm called Total Variation Minimization, first written up in 1992 by Ruden et al (ROF) [3]. However, don’t use the 1992 implementation for solving that equation. Chambolle described a better solution using a primal dual algorithm [4]. I’d like to say I understood that paper immediately and implemented it, however I instead lifted an implementation from Julia written by Tim Holy [5] and then rewrote in C++ for speed. (Julia isn’t as fast as they claim, especially when it has to JIT its whole standard library on boot.)

Solving this new form of the equation is done with theta close to zero. Each iteration U is solved with a fixed V. Then V is solved with a fixed U. Finally in preparation for the next iteration, theta is incremented. After many iterations, it is hoped that U and V then converge on the same disparity image.

The algorithm I’ve implemented for this blog post is considerably different from Heise’s paper. I’ve parameterized my disparity with an x and y offset. This is different from Heise’s x offset and normal vector x and y measurement. I also used a sum of absolute differences while Heise used adaptive support weights. Heise’s method also used a Huber cost metric on the smoothing term while I use nothing but squared error. This all means that my implementation is much closer to Steinbrucker’s optical flow algorithm.

Results

Here is a result I got with cones using a 3×3 kernel size.

That 3×3 kernel size should impress you. I can greatly drop the kernel size when there is a smoothness constraint.

However images of cones and perspective shots in general are not what I care about, I care about the application of this algorithm to terrain. Here it is applied to a subsampled image of a glacier imaged by World View 2.

Crap! Instead of converging to some actual disparity value, it just blurred the initial result.

Disaster Analysis

Possibly this didn’t work because I didn’t tune parameters correctly. With all global matching algorithms there a bunch of lambdas that try to bend the two cost metrics to equal footing. But is this really the answer?

I think it has something to do with the fact that Patch Match produces the same result as brute force search result, a.k.a. the global minimum across all disparity values. If you look at the global minimum result for both of these input images shown below, you’ll see that the World View imagery is exceptionally ill conditioned.

It seems that the smoothness constraint isn’t strong enough to over come the initial global minimum from correlation. Despite the noise, in the ‘cones’ example it is clear there is just high frequency noise around the final disparity image that we want. However with the World View example, there is no such clear view. That disparity image should be a simple gradient along the horizontal. ASP’s take on this region is shown right. However since the texture is self-similar across the image, this creates patches of global minimum that are the incorrect and resemble palette knife marks.

Seeding the input correlation result with ASP’s rough solution helps things to some degree. But ultimately it seems that this convex relaxation trick doesn’t insure that we converge on the correct (and what I hope is the global) minimum from the original equation. Another detail that I learned is that if Patch Match is going to converge, it will happen in the first 10 iterations.

What’s Next?

My conclusion is that having a mostly globally minimum cost metric is required. Unfortunately increasing kernel size doesn’t help much. What I think is worth investigating next is: (1) implement the affine matching and second derivative smoothing detailed in Heise’s paper; (2) Cop out and implement hierarchical Patch Match; (3) recant my sins and investigate fSGM. Eventually I’ll find something good for ASP.

Reference

[1] Heise, Philipp, et al. “PM-Huber: PatchMatch with Huber Regularization for Stereo Matching.”
[2] Steinbrucker, Frank, Thomas Pock, and Daniel Cremers. “Large displacement optical flow computation withoutwarping.” Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009.
[3] Rudin, Leonid I., Stanley Osher, and Emad Fatemi. “Nonlinear total variation based noise removal algorithms.” Physica D: Nonlinear Phenomena 60.1 (1992): 259-268.
[4] Chambolle, Antonin. “An algorithm for total variation minimization and applications.” Journal of Mathematical imaging and vision 20.1-2 (2004): 89-97.
[5] https://github.com/timholy/Images.jl

Google’s Lens Blur Feature

Earlier today Google announced that they added a new feature to their android camera application called “Lens Blur”. From their blog post we can see that they take a series of images and solve for a multiview solution for the depth image of the first frame in the sequence. That depth image is then used to refocus the image. It is probably better described that they selectively blur an already fully focused image. This is similar to how they create bokeh in video games, however their blog states that they actually invoke the thin lens equations to get a realistic blur.

I thought this was really cool that they can solve for a depth image on a cellphone in only a few seconds. I also wanted access to that depth image so I can play with it and make 3D renderings! However that will come at another time.

Accessing the Data

Everything the GoogleCamera app needs to refocus an image is contained right inside the JPEG file recorded to /sdcard/DCIM/Camera after you perform a Lens Blur capture. Go ahead and download that IMG of the picture you took. How is this possible? It seems they use Adobe’s XMP format to store a depth image as PNG inside the header of the original JPEG. Unfortunately, I couldn’t figure out how to make usually tools, GDAL, read that.

So instead let’s do a manual method. If you open the JPEG file up in Emacs, right away you’ll see the XMP header which is human readable. Scroll down till you find the GDepth:Data section and copy everything between the quotes to a new file. Now things get weird. For whatever reason, the XMP format embeds binary containing strings of extension definition and a hash periodically through this binary PNG data you just copied. This doesn’t belong in the PNG and libpng will be very upset with you! Using Emacs again you can search this binary data for the http extension definition string and then delete everything between the \377 or OxFF upfront to the 8 bytes after the hash string. Depending on the file length you’ll have to perform this multiple times.

At this point you now have the raw PNG string! Unfortunately it is still in base64, so we need to decode it.

> base64 –D < header > header.png

Viola! You now have a depth image that you can open with any viewer.

Going back to the original XMP in the header of the source JPEG, you can find some interesting details on what these pixels mean like the following:

GFocus:BlurAtInfinity="0.036530018"
GFocus:FocalDistance="20.225327"
GFocus:FocalPointX="0.59722227"
GFocus:FocalPointY="0.5756945"
GImage:Mime="image/jpeg"
GDepth:Format="RangeInverse"
GDepth:Near="12.516363143920898"
GDepth:Far="47.468021392822266"
GDepth:Mime="image/png"

How did they do this?

There is no way for me to determine this. However looking at GoogleCamera we can see that it refers a librefocus.so which seems to contain the first half of the Lens Blur feature’s native code. Doing a symbol dump gives hints about what stereo algorithm they use.

librefocus.so:00221b68 D vtable for vision::optimization::belief_propagation::BinaryCost
librefocus.so:00221b88 D vtable for vision::optimization::belief_propagation::GridProblem
librefocus.so:00221bc0 D vtable for vision::optimization::belief_propagation::LinearTruncatedCost
librefocus.so:00221ca0 V vtable for vision::sfm::RansacSolver<vision::sfm::EssentialMatrixProblem>
librefocus.so:00221cb8 V vtable for vision::sfm::RansacSolver<vision::sfm::FundamentalMatrixProblem>
librefocus.so:00221c78 D vtable for vision::sfm::StdlibRandom
librefocus.so:00221b58 D vtable for vision::image::FixedPointPyramid
librefocus.so:00221c00 D vtable for vision::shared::EGLOffscreenContext
librefocus.so:00221800 V vtable for vision::shared::Progress
librefocus.so:00221be0 V vtable for vision::shared::GLContext
librefocus.so:00221cd0 V vtable for vision::stereo::PlaneSweep
librefocus.so:00221ce8 D vtable for vision::stereo::GPUPlaneSweep
librefocus.so:00221d20 D vtable for vision::stereo::PhotoConsistencySAD
librefocus.so:00221d00 V vtable for vision::stereo::PhotoConsistencyBase
librefocus.so:00221d60 D vtable for vision::stereo::MultithreadPlaneSweep
librefocus.so:00221ae8 V vtable for vision::features::FeatureDetectorInterface
librefocus.so:00221b00 D vtable for vision::features::fast::FastDetector
librefocus.so:00221d78 V vtable for vision::tracking::KLTPyramid
librefocus.so:00221a90 V vtable for vision::tracking::KLTPyramidFactory
librefocus.so:00221dd0 D vtable for vision::tracking::KLTFeatureDetector
librefocus.so:00221da0 D vtable for vision::tracking::GaussianPyramidFactory
librefocus.so:00221db8 D vtable for vision::tracking::LaplacianPyramidFactory

So it appears they use a KLT system that is fed to an SfM model to solve for camera extrinsics and intrinsics. Possibly they solve for the fundamental between the first sequential images, decompose out the intrinsics and then solve for the essential matrix on the remainder of the sequence. After that point, they use a GPU accelerated plane sweep stereo algorithm that apparently has some belief propagation smoothing. That’s interesting stereo choice given how old plane sweeping is and the lack of popularity in the Middlebury tests. However you can’t doubt the speed. Very cool!

I’m not Dead

However I’ve been really busy working with Google’s project Tango. I encourage you to watch the video if you haven’t already.

What is NASA doing with project Tango? Well currently there is a very vague article available here. However the plan is to apply Tango to the SPHERES project to perform visual navigation. Lately, I’ve been overwhelmed with trying to meet the schedule of 0-g testing and all the hoops there are with getting hardware and software onboard the ISS. This has left very little time to write, let alone sleep. In a few weeks NASA export control will have gone over our collected data and I’ll be able to share here.

In the short term, project Tango represent an amazing opportunity to perform local mapping. The current hardware has little application to the large-scale satellite mapping that I usually discuss. However I think the ideas present in project Tango will have application in low-cost UAV mapping. Something David Shean of U of W has been pursuing. In the more immediate term I think the Tango hardware would have application to scientists wanting to perform local surveys of a glacial wall, cave, or anything you can walk all over. It’s ability to export its observations as a 3D model makes it perfect for sharing with others and perform long-term temporal studies. Yes the 3D sensor won’t work outside, however stereo observations and post processing with things like Photoscan are still possible with the daylight imagery. Tango will then be reduced to providing an amazing 6-DOF measurement of where each picture was taken. If this sounds interesting to you, I encourage you to apply for a prototype device! I’d be interested in helping you tackle a scientific objective with project Tango.

This picture is of Mark and I dealing with our preflight jitters of being onboard the “Vomit Comet” while 0-g testing the space-rated version of Project Tango. This shares my current state of mind. Also, there aren’t enough pictures of my ugly mug on this blog. I’m the guy on the right.

Patch Match Stereo

3 months ago during my Semi Global Matching post, I mentioned I would have a follow along post about another algorithm I was interested in. That algorithm is PatchMatch, and is in my opinion a better fit for satellite processing. It was first written for hole filling [1] and then later it was applied to stereo [2]. Unlike Ames Stereo Pipeline’s currently implemented hierarchical discrete correlator, PatchMatch easily applies to high dimensionality searching and it has a run time that corresponds to number of iterations and image size. Not image content, which I hope interprets into predictable run times. Unfortunately my current implementation is really slow and has several faults. However my hope is that eventually we’ll make it fast and that it’s predictability will enable our users to budget for the CPU cost of creating a DEM while still improving quality.

How it Works

The algorithm for PatchMatch is actually very simple. The disparity image is seeded with uniform noise. That noise is random guesses for what the disparity should be. However since every pixel is another guess, and the image is quite a bit larger than the search range we are looking over, a handful of these guesses are bound to be close to correct. We then propagate disparities with low cost to the neighboring disparities.  The cycle repeats by then adding more noise to the current propagated disparity but at half the previous amplitude. This cycle repeats until everything converges. Surprisingly, a recognizable disparity appears after the first iteration.

If you are having a hard time visualizing this, here is a video from the original authors of Patch Match Stereo.

PatchMatch can be quick in the hole filling application and probably for stereo from video. (Propagating between sequential frames can greatly help convergence.) Unfortunately it is quite slow in use to pairwise stereo. When performing stereo correlation in ASP 2.3, we group kernel evaluations by being at the same disparity value (sometimes called disparity plane in literature). This means that overlapping kernel evaluations will reuse pixel comparisons. The reuse of pixel comparisons is a huge speed boost. My implementation of PatchMatch has none of that. My implementation is also solving for a floating-point precision of disparity. While this gives me very detailed disparity maps, the downside is that my implementation spends 75% of its time performing interpolation. I think for this algorithm to become useful for researchers, I’ll need to discretize the disparities and prerender the input as super sampled to avoid repeated interpolation calculations.

I have one final statement to make on PatchMatch algorithm. In practice, PatchMatch produces results that are very similar to performing a brute force search over the entire search range where winner (lowest cost) takes all. That is different from ASP’s hierarchal search, which at each pyramid level falls into local minimums. It is only the hierarchal part of it that has any use in finding the global. What this means for PatchMatch is that we need to use a cost metric that is globally minimal. For narrow baseline stereo in a controlled lighting environment, the fast Sum of Absolute Differences (SAD) fits the bill. But in the cruel realities of satellite imagery, the only solution is Normalized Cross Correlation (NCC). Unfortunately, NCC is quite a bit slower to evaluate than SAD. Also, in cases where SAD worked, NCC performs worse (probably due to being sensitive to the calculation of mean?).

Where PatchMatch does poorly

I’ve already hit my primary concern, which is the speed of Patch Match. However I think if we view PatchMatch as replacement for both correlation and BayesEM, it already appears cost effective. A different concern is what happens when the search range’s area is larger than the area of the image being correlated. A real world example would be rasterizing a 256^2 tile for a WV2 image where the search range is 1400×100. The area of the tile is less than the search’s. The number of close valid guesses seeded by the initial uniform noise drops dramatically. It might be a good idea to then take the interest points that ASP finds and dump them in the initial random disparity guess that PatchMatch evaluates. This would insure there is a disparity to propagate from.

Previously I mentioned that PatchMatch in my experiments seems to behave as a brute force winner takes all algorithms. This means that increase the search size also means a decrease in quality because our odds of finding an outlier with less cost than the actually correct match have gone up. So maybe completely abandoning hierarchal search entirely is a bad idea.  Other ideas might be enforcing a smoothness constraint that would heavily penalize the random jumping that characterizes outliers. The enforcement of smoothness constraint was the driving force behind the power of Semi Global Matching.

Expanding the Algorithm

Since kernel evaluations in PatchMatch are individual problems and there is no shared arithmetic like there is in ASP’s stereo correlation, it makes it easy to add on some advance algorithms to PatchMatch. The original PatchMatch Stereo paper [#] mentioned adding on parameters to the disparity maps so that an affine window could be used for matching. This would improve results on slopes and I believe would be a better alternative to prior map projecting the input imagery.

Another idea mentioned in the paper was adding Adaptive Support Weights (ASW) [3] to each kernel evaluation. It adds weighting to pixels that match the color of the center of the kernel in addition to weighting central pixels more importantly than pixels at the edge. The idea is that pixels of the same color are likely to be at the same disparity. While not always correct, it definitely applies to scenarios where the top of a building is a different color than its sides. Implementations I show in my figures operate only on grayscale imagery and not color like the original paper proposed.

In practice, this additional weighting does a good job at preserving edges at depth discontinuity. This is important for cliffs and for buildings. A proposed improvement is geodesic adaptive support weights, which weighs same color pixel heavily that are connected to the central pixels. This fixes problems like a picture of blades of grass, where the blades have the same color but are actually at different disparities.

Wrapping back around to the idea that Patch Match needs to have a cost metric that is globally minimal. It might be worth while exploring different cost metric such as Mutual Information or if I remember correctly, the fast evaluating Matching by Tone Mapping (MTM) [4]. Nice side effect of this would be that correlation is completely lighting invariant and could even correlate between infrared and visible.

Additional Comparisons

Here’s a disparity image of NASA Ames Research Center captured from a Pleiades satellite. Whatever they do to pre-process their imagery, I couldn’t epipolar rectify that imagery. Search range given to both ASP and PatchMatch was [-40,-20,70,40]. Despite being noisy, I like how the PatchMatch w/ ASW preserved straight edges. The specularity of some of the roof lights on a hanger however really threw ASW for a loop.

I also processed a crop from a World View 2 image of somewhere in the McMurdo Dry Valleys. This really shot a hole in my argument that Patch Match would be completely invariant to search range. The search range was [-700,-50,820,50]. I did however have to hand tune the search range to get this excellent result from ASP. The automatic detection initially missed the top of the mountains.

Source Code

I’m being messy because this is my research code and not something production. You’ll need to understand VW and Makefiles if you want to compile and modify.

https://github.com/zmoratto/PatchMatch

Further Research

I’m still diving through papers during my free evenings. I don’t do this at work because I have another project that is taking my soul. However there appears to be a good paper called Patch Match Filter that tries to improve speed through super pixel calculation [5]. There is also an implementation of PatchMatch that adds a smoothness constraint that performs better than the original PatchMatch paper [6]. When it came out, it was the best performing algorithm on the Middlebury dataset. However, recently another graph cut algorithm dethroned it. I myself will also just look at applying patch match as a refinement to a noisy disparity result from ASP 2.3 using a kernel size of 3. Having a small kernel still evaluates extremely quickly even if the search range is huge.

I’m sorry this post doesn’t end in a definitive conclusion. However I hope you found it interesting.

References

[1] Barnes, Connelly, et al. “PatchMatch: a randomized correspondence algorithm for structural image editing.” ACM Transactions on Graphics-TOG 28.3 (2009): 24.
[2] Bleyer, Michael, Christoph Rhemann, and Carsten Rother. “PatchMatch Stereo-Stereo Matching with Slanted Support Windows.” BMVC. Vol. 11. 2011.
[3] Yoon, Kuk-Jin, and In So Kweon. “Adaptive support-weight approach for correspondence search.” Pattern Analysis and Machine Intelligence, IEEE Transactions on 28.4 (2006): 650-656.
[4] Hel-Or, Yacov, Hagit Hel-Or, and Eyal David. “Fast template matching in non-linear tone-mapped images.” Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.
[5] Lu, Jiangbo, et al. “Patch Match Filter: Efficient Edge-Aware Filtering Meets Randomized Search for Fast Correspondence Field Estimation.” Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference. 2013.
[6] Heise, Philipp, et al. “PM-Huber: PatchMatch with Huber Regularization for Stereo Matching.”