[Paper] Multi-View Stereo: A Tutorial

Posted by Tong on April 4, 2020

Abstract 1

This tutorial presents a hands-on view of the field of multi-view stereo with a focus on practical algorithms. Multi-view stereo algorithms are able to construct highly detailed 3D models from images alone. They take a possibly very large set of images and construct a plausible 3D geometry that explains the images under some reasonable assumptions, the most important being scene rigidity. The tutorial frames the multi-view stereo problem as an image/geometry consistency optimization problem. It describes in detail its two main ingredients: robust implementations of photometric consistency measures, and efficient optimization algorithms. It then presents how these main ingredients are used by some of the most successful algorithms, applied in real applications, and deployed as products in the industry. Finally, it describes more advanced approaches exploiting domain-specific knowledge such as structural priors, and gives an overview of the remaining challenges and future research directions.

1 Introduction

  1. All the MVS algorithms described in the following chapters assume the same input:
    1. a set of images.
    2. their corresponding camera parameters (pose and intrinsic parameters).
  2. An MVS algorithm is only as good as the quality of the input images and camera parameters.
  3. MVS pipeline
    1. Collect images.
    2. Compute camera parameters for each image.
    3. Reconstruct the 3D geometry of the scene from the set of images and corresponding camera parameters.
    4. Optionally reconstruct the materials of the scene.
  4. SfM (Structure from Motion) pipeline (a two-view sketch follows this list):
    1. Detect 2D features in every input image.
    2. Match 2D features between images.
    3. Construct 2D tracks from the matches.
    4. Solve for the SfM model from the 2D tracks.
    5. Refine the SfM model using bundle adjustment.
  5. The robustness of MVS to camera reprojection error depends mainly on how tolerant the matching criterion (photo-consistency) is to misalignments.
    1. The larger the domain of the photo-consistency measure, the more robust the measure is.
    2. Large domains also tend to produce over-smoothed geometry.
  6. Because reprojection error is measured in pixels, one can downsample the input images and rescale the camera parameters until the reprojection error drops below a certain threshold.
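
As a concrete illustration of steps 1–4 of the SfM pipeline, here is a minimal two-view sketch using OpenCV; it builds no multi-view tracks and omits bundle adjustment (step 5), so it is a schematic fragment rather than a full incremental SfM system:

```python
import cv2
import numpy as np

def two_view_sfm(img1, img2, K):
    """Minimal two-view analogue of SfM steps 1-4 (no bundle adjustment)."""
    # 1. Detect 2D features in each image.
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    # 2. Match features between the images (Lowe's ratio test).
    matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]
    p1 = np.float32([kp1[m.queryIdx].pt for m in good])
    p2 = np.float32([kp2[m.trainIdx].pt for m in good])
    # 3-4. Solve for the relative pose from the essential matrix,
    # then triangulate the matches into 3D points.
    E, inliers = cv2.findEssentialMat(p1, p2, K, method=cv2.RANSAC)
    _, R, t, _ = cv2.recoverPose(E, p1, p2, K, mask=inliers)
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    X = cv2.triangulatePoints(P1, P2, p1.T, p2.T)
    return R, t, (X[:3] / X[3]).T  # relative pose and 3D points
```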
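The downsampling idea in item 6 can be sketched as follows: halve the image (and the intrinsics) until the expected reprojection error, measured in pixels, drops below a target. The one-pixel target and the use of OpenCV for resizing are assumptions:

```python
import cv2
import numpy as np

def downsample_for_reprojection(image, K, reproj_error_px, target_px=1.0):
    """Halve the image until the reprojection error falls below target_px.

    K is a 3x3 pinhole intrinsic matrix; fx, fy, cx, cy scale with the
    image, while the camera pose (extrinsics) is unchanged."""
    scale = 1.0
    while reproj_error_px * scale > target_px:
        scale *= 0.5
    if scale < 1.0:
        h, w = image.shape[:2]
        image = cv2.resize(image, (int(w * scale), int(h * scale)),
                           interpolation=cv2.INTER_AREA)
        K = np.diag([scale, scale, 1.0]) @ K  # rescale fx, fy, cx, cy
    return image, K
```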

2 Multi-view Photo-consistency

  • A crucial requirement for photo-consistency measures is to compute photo-consistency on a set of images that see the same 3D geometry.

2.1 Photo-consistency measures

  • Summary of different similarity measures

| Measure | Requires support domain? | Invariance |
| --- | --- | --- |
| Sum of Squared Differences (SSD) | no | none |
| Sum of Absolute Differences (SAD) | no | none |
| Normalized Cross Correlation (NCC) | yes | bias / gain |
| Census 2 | yes | bias / gain |
| Rank 2 | yes | bias / gain / rotation |
| Mutual Information (MI) | yes | any bijection |
  • The size of a support domain needs to be proportional to the image resolution and the viewpoint separation, and inversely proportional to the distance to the scene.
  • A related concept to the domain that is very common in stereo algorithms is photo-consistency aggregation 3, which consists in spatially aggregating the photo-consistency measure to increase its robustness.
  • Given color images, different strategies exist to deal with different channels:
    • Convert the color image into gray scale before computing the photo-consistency
    • Compute the photo-consistency per color channel independently, and return the average.
    • Concatenate the vectors from all the color channels into a single larger vector.
  • The handling of color images in NCC is to compute and subtract the average intensity per color channel independently, but to concatenate all the color channels into a single vector when computing the variance (see the sketch after this list).
  • SSD computed on mean-subtracted, variance-normalized vectors is equivalent to NCC.
  • The main advantage of Census over NCC is around depth boundaries. The underlying assumption (that the appearance in the domain is intrinsic to the object and, to some extent, invariant to illumination and viewpoint changes) breaks at depth discontinuities, because the domain then contains both foreground and background objects; Census is more robust than NCC in these regions.
  • The main characteristic of MI (mutual information) is that it is highly invariant, in particular to any transformation that does not change the amount of information of a signal, e.g., bijective mappings. However, it is not a common measure since its invariance comes at the cost of degraded accuracy.
  • Normalization is important for parameter tuning as well as for combining photo-consistency measures together (a sketch of typical transforms follows this list).
    • Typical transforms: exponential, linear truncated, smooth step.
    • The shape of the normalization function is expected to be sigmoid-like, with two plateaus separated by a relatively steep ramp.
    • The rationale behind it is that the steep ramp acts as the optimal threshold separating inliers from outliers.
      • For SAD, inlier errors < 5 levels (out of 255), while outlier values are > 10 levels.
      • NCC values below \(\frac{1}{\sqrt{2}}\) are typically considered not accurate enough and discarded.
  • Photo-consistency aggregation
    • box filter
    • anisotropic filters
      • bilateral filter 4
      • guided filter 5
      • Weighted median filter 6
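
As referenced in the list, NCC's color handling (per-channel mean subtraction, a single concatenated vector for the normalization) can be sketched in NumPy; the H x W x 3 patch layout is an assumption:

```python
import numpy as np

def ncc_color(patch_a, patch_b, eps=1e-8):
    """NCC between two color patches given as (H, W, 3) arrays.

    The mean is computed and subtracted per color channel independently,
    but the channels are concatenated into one vector for the variance
    normalization, making the score invariant to bias and gain."""
    a = patch_a.reshape(-1, 3).astype(np.float64)
    b = patch_b.reshape(-1, 3).astype(np.float64)
    a -= a.mean(axis=0)          # per-channel mean subtraction
    b -= b.mean(axis=0)
    a, b = a.ravel(), b.ravel()  # concatenate the channels
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))
```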
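The sigmoid-like normalization can likewise be sketched with a smooth step; the thresholds are the SAD and NCC figures quoted above, and the cubic smoothstep is one common choice of ramp:

```python
import numpy as np

def smoothstep(x, lo, hi):
    """Sigmoid-like ramp: 0 below lo, 1 above hi, smooth in between."""
    t = np.clip((x - lo) / (hi - lo), 0.0, 1.0)
    return t * t * (3.0 - 2.0 * t)

def normalize_sad(sad):
    # SAD is a dissimilarity: inlier errors are < 5 gray levels,
    # outliers > 10, so low SAD maps to high confidence.
    return 1.0 - smoothstep(sad, 5.0, 10.0)

def normalize_ncc(ncc):
    # NCC is a similarity: values below 1/sqrt(2) are discarded.
    return smoothstep(ncc, 1.0 / np.sqrt(2.0), 1.0)
```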

2.2 Visibility estimation in state-of-the-art algorithms

  • Space-carving for visibility estimation 7 8
    • Given a 3D volume partitioned into a 3D grid of voxels, the volume is iteratively carved out by removing voxels that are not photo-consistent.
    • The main contribution of the work was the proposal of a geometric constraint on the camera centers such that there exists an ordinal visibility constraint on all the 3D voxels in the scene. That is, starting from an initial bounding volume, one can visit all the voxels in an order that guarantees that any potential occluder voxel is always visited before its potential occluded voxel.
  • Visibility estimate for large image collections typically happen in two phases
    • In the first phase, visibility is estimated coarsely by clustering the initial set of images and reducing the large-scale MVS problem into a sequence of small sub-problems. 9 10 11 12
    • In the second phase, more fine-scale visibility estimation is conducted per 3D point basis.
      • One popular approach is to use the current reconstructed geometry to compute occlusions (e.g., z-buffer testing), select which views see which parts of the geometry, and iterate visibility estimation and reconstruction (a minimal test is sketched after this list).
      • Another popular solution is to rely on robust photo-consistency statistics without explicitly estimating occlusion.
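
The z-buffer test mentioned above can be sketched as follows, assuming a pinhole camera with x_cam = R X + t and a relative depth tolerance:

```python
import numpy as np

def is_visible(X, K, R, t, zbuffer, rel_tol=0.01):
    """z-buffer visibility test for a 3D point X in one view.

    zbuffer is a depthmap rendered from the current reconstruction;
    X is visible when its depth matches the rendered depth."""
    x_cam = R @ X + t
    depth = x_cam[2]
    if depth <= 0:               # behind the camera
        return False
    u, v, _ = K @ (x_cam / depth)
    col, row = int(round(u)), int(round(v))
    if not (0 <= row < zbuffer.shape[0] and 0 <= col < zbuffer.shape[1]):
        return False             # projects outside the image
    return depth <= zbuffer[row, col] * (1.0 + rel_tol)
```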

3 Algorithms: From Photo-Consistency to 3D Reconstruction

| Representation | View-dependent texture-mapping | Free-viewpoint rendering | Ease of manipulation (e.g., merging/splitting) |
| --- | --- | --- | --- |
| Depthmaps | Good | Fair (point-based rendering) | Fair |
| Point-cloud | Poor | Fair (point-based rendering) | Good |
| Voxels | No | No | Good |
| Mesh | Poor | Good | Bad |

  • Related benchmark: the Robust Vision Challenge.

3.1 Depthmap Reconstruction

  • Winner-Takes-All Depthmaps
    • Evaluate photo-consistency values throughout the depth range, and pick the depth value with the highest photo-consistency score for each pixel independently (see the plane-sweep sketch after this list).
    • In addition to the depth value with the highest photo-consistency, the algorithm often evaluates a confidence measure so that low-confidence depth values can be ignored or down-weighted in the later model merging step 13.
  • Robust Photo-Consistency Depthmaps
    • Robust photo-consistency function 14: a sum of kernel functions (e.g., Gaussians) centered at the major local maxima of the photo-consistency curve (see the sketch after this list).
    • Another effective approach is to ignore photo-consistency scores that are below a certain threshold. 15
  • Markov Random Field (MRF) Depthmaps
    • In the presence of severe occlusions, there may not exist a corresponding match in most other images. A standard solution is to enforce spatial consistency, under the assumption that neighboring pixels have similar depth values.
    • The MRF depth map formulations 16 can be seen as a combinatorial optimization problem, where an input depth range is discretized into a finite set of depth values.
    • The cost function is comprised of two parts
      • Unary potential: photo-consistency information
      • Pairwise interaction potentials: spatial regularization and proportional to the amount of depth discrepancy at neighboring pixels.
  • Multiple Hypothesis MRF Depthmaps
    • Instead of blindly using discretized depth values as the label set for an entire image, the algorithm extracts local maxima from the photo-consistency curve of each pixel; the MRF formulation is then used to assign the depth of one such local maximum to each pixel. 17
    • They also allow the “unknown” label to indicate that no correct depth can be estimated in certain cases.
    • The process consists of two phases
      1. Extraction of depth labels;
      2. MRF optimization to assign extracted depth labels.
  • More Depthmap Reconstruction Algorithms
    • Real-time plane sweeping depthmap reconstruction 18
      • It sweeps a family of parallel planes in a scene, projects images onto a plane via planar homographies, then evaluates photo-consistency values on each plane.
    • Second order smoothness 19
      • MRF formulation, but a second order smoothness prior over triple cliques (i.e., three pixels) that enforces piece-wise planar surfaces
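
A schematic Winner-Takes-All depthmap in the plane-sweep spirit, as referenced above: candidate depths are swept, each yields a per-pixel score map, and every pixel independently keeps the best depth. The `photo_consistency` callback (e.g., NCC of the reference image against neighbors warped by the plane homography) is an assumed input:

```python
import numpy as np

def wta_depthmap(depth_range, photo_consistency, conf_thresh=0.5):
    """Winner-Takes-All depth selection over a swept depth range.

    depth_range: 1D array of candidate depths (the swept planes).
    photo_consistency: d -> (H, W) score map at depth d, e.g., NCC
        against neighbor images warped by the homography of plane d."""
    depth_range = np.asarray(depth_range, dtype=np.float64)
    scores = np.stack([photo_consistency(d) for d in depth_range])
    best = scores.argmax(axis=0)                 # per-pixel winner
    confidence = scores.max(axis=0)              # winning score as confidence
    depthmap = depth_range[best]
    depthmap[confidence < conf_thresh] = np.nan  # drop unreliable depths
    return depthmap, confidence
```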
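And the occlusion-robust photo-consistency of 14 can be sketched as a sum of Gaussian kernels centered at the major local maxima of the raw score curve; the kernel width and the number of maxima kept are illustrative assumptions:

```python
import numpy as np

def robust_photo_consistency(depths, scores, num_peaks=3, sigma=None):
    """Sum of Gaussian kernels at the major local maxima of a curve."""
    if sigma is None:
        sigma = 5.0 * (depths[1] - depths[0])   # a few depth steps wide
    i = np.arange(1, len(scores) - 1)
    peaks = i[(scores[i] > scores[i - 1]) & (scores[i] > scores[i + 1])]
    peaks = peaks[np.argsort(scores[peaks])[::-1][:num_peaks]]
    curve = np.zeros_like(depths, dtype=np.float64)
    for p in peaks:                             # one kernel per major maximum
        curve += scores[p] * np.exp(-0.5 * ((depths - depths[p]) / sigma) ** 2)
    return curve
```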

3.2 Point-cloud Reconstruction

  • A 3D point with a surface normal estimation or a local region support is referred to as an oriented point or a patch.

  • A common characteristic of point-cloud reconstruction algorithms is that they make use of a spatial consistency assumption and grow or expand the point cloud over the surface of the scene during the reconstruction process.

  • This article focuses on the work 9.
    • The algorithm also follows a greedy expansion approach,
    • but one key difference is that it iterates between the expansion and the filtering steps after reconstructing an initial seed of patches via feature matching.
    • The filtering step analyzes consistency of patches across all the views and removes falsely reconstructed ones.
  • Key Elements
    • Patch Model: A patch \(p\) is essentially a local tangent plane approximation of a surface, whose geometry is determined by its center \(\mathbf{c}(p)\) and unit normal vector \(\mathbf{n}(p)\).
      • The robust photo-consistency function is simply evaluated by using the patch as a proxy geometry to sample pixel colors.
      • In practice, the patch center is constrained to lie on a viewing ray that is fixed before and throughout the optimization, so that one parameter for position (depth along the ray) and two parameters for the normal (orientation angles) are optimized via a standard non-linear least-squares technique (see the sketch after this list).
    • Image-based Data Structure
      • Image projections of reconstructed patches in the visible images are used to search for and access neighboring patches, and then to enforce regularization.
  • Algorithm
    • Initial Feature Matching
      • The purpose is to generate a sparse set of patches
      • Difference-of-Gaussian (DoG) + Harris operators
      • Search matches within two pixels from the corresponding epipolar line
      • Triangulation to initialize a patch
      • Patch optimization
    • Expansion
      • The goal is to reconstruct at least one patch in every image cell, where they repeat taking existing patches and generating new ones in nearby empty spaces.
      • The process repeats until expansion has been attempted from every patch that has been reconstructed.
    • Filtering
      • Two filters are used to remove erroneous patches.
        • The first filter relies on visibility consistency.
        • The second filter enforces a weak form of regularization.
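
The patch optimization referenced above can be sketched as follows: the center moves along a fixed reference viewing ray (one depth parameter) and the normal is parametrized by two angles; `patch_score`, the mean photo-consistency of the proxy plane over the visible images, is an assumed input:

```python
import numpy as np
from scipy.optimize import least_squares

def optimize_patch(origin, ray, depth0, theta0, phi0, patch_score):
    """Refine a patch (c(p), n(p)) by non-linear least squares.

    origin, ray: reference camera center and viewing ray, so that
        c(p) = origin + depth * ray (a single positional parameter).
    theta, phi: spherical angles parametrizing the unit normal n(p).
    patch_score(c, n): mean photo-consistency in [-1, 1]."""
    def normal(theta, phi):
        return np.array([np.sin(theta) * np.cos(phi),
                         np.sin(theta) * np.sin(phi),
                         np.cos(theta)])

    def residual(x):
        depth, theta, phi = x
        c = origin + depth * ray
        return np.atleast_1d(1.0 - patch_score(c, normal(theta, phi)))

    res = least_squares(residual, x0=np.array([depth0, theta0, phi0]))
    depth, theta, phi = res.x
    return origin + depth * ray, normal(theta, phi)
```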

3.3 Volumetric data fusion

  • Volumetric surface extraction is flexible, and the input 3D information can come from many different sources such as photo-consistency volumes, depthmaps, MVS point clouds, laser scanned 3D points, or any combination of those. It is a challenging task to fuse such a diverse set of 3D measurements into a single clean mesh model with the right topology.
  • A standard yet powerful solution is to accumulate evidence in a 3D voxel grid and extract a surface model via the Marching Cubes algorithm.
    • One formulation is to compute a signed distance function field over the voxel grid, then pose surface reconstruction as zero iso-surface extraction (a fusion sketch follows this list). 20 21
    • The other formulation poses it as a 3D binary segmentation problem, where the boundary between the two segments can be extracted as a surface model. 21 22 23 24 25
  • Volumetric Graph-Cuts on a Voxel Grid
    • Given a bounding box that contains the solution surface, the space is discretized with a voxel grid.
    • The objective is to find the optimal label assignment (“interior” or “exterior”) to all the voxels that minimizes the following cost function: \(E\left(\left\{k_{v}\right\}\right)=\sum_{v} \Phi\left(k_{v}\right)+\sum_{(v, w) \in \mathcal{N}} \Psi\left(k_{v}, k_{w}\right)\)
    • The first term (unary term) is the summation of per voxel cost over the entire domain, where \(\Phi\left(k_{v}\right)\) encodes the cost of assigning a label to voxel \(v\).
    • The second term (pairwise interaction term) is the summation over all the pairs of adjacent voxels denoted as \(\mathcal{N}\).
    • One possibility is to use a constant ballooning term as the unary term.
    • The optimization can be solved exactly and efficiently with a graph-cuts algorithm, as long as each pairwise term is submodular. 26
    • One limitation of the use of a voxel grid is that the memory allocation quickly becomes very expensive.
      • One effective solution is an octree data structure.
    • Due to the shrinkage bias, the reconstruction with the constant ballooning term fails to reconstruct deep concavities in various parts of the model.
  • A better alternative to a uniform voxel grid is to first reconstruct a sparse 3D point cloud of the scene, then use the 3D points as nodes of the space discretization grid.
    • 3D Delaunay triangulation 27
    • Explicit regularization term to penalize label change
      • It is usually necessary at every pair of adjacent cells for a uniform voxel grid structure.
      • In the case of a 3D Delaunay triangulation, explicit regularization in the pairwise term is not crucial.
    • If the 3D evidence is weak, this technique will miss thin structures due to the MRF regularization.
      • Solution: add a step to reinforce the evidence of structure by analyzing the gradient of exterior evidence. 28
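
The signed-distance formulation 20 can be sketched as follows: each depthmap votes a truncated signed distance into the voxel grid, the votes are averaged, and the zero iso-surface is extracted with Marching Cubes (here scikit-image's implementation). The grid layout, truncation band, and pinhole model are assumptions:

```python
import numpy as np
from skimage.measure import marching_cubes

def fuse_tsdf(depthmaps, cameras, grid_pts, shape, trunc=0.05):
    """Average truncated signed distances on a voxel grid, then extract
    the zero iso-surface as a mesh.

    grid_pts: (N, 3) voxel centers, in row-major order of `shape`.
    cameras:  list of (K, R, t) with x_cam = R @ X + t."""
    tsdf = np.zeros(len(grid_pts))
    weight = np.zeros(len(grid_pts))
    for depthmap, (K, R, t) in zip(depthmaps, cameras):
        x_cam = grid_pts @ R.T + t
        z = x_cam[:, 2]
        front = z > 0
        uv = np.zeros_like(x_cam)
        uv[front] = (x_cam[front] / z[front, None]) @ K.T
        col = uv[:, 0].round().astype(np.int64)
        row = uv[:, 1].round().astype(np.int64)
        ok = front & (row >= 0) & (row < depthmap.shape[0]) \
                   & (col >= 0) & (col < depthmap.shape[1])
        d = np.full(len(grid_pts), np.nan)
        d[ok] = depthmap[row[ok], col[ok]]      # observed surface depth
        sdf = np.clip(d - z, -trunc, trunc)     # + in front of, - behind surface
        seen = ~np.isnan(d) & (d - z > -trunc)  # skip voxels far behind surface
        tsdf[seen] += sdf[seen]
        weight[seen] += 1.0
    tsdf[weight > 0] /= weight[weight > 0]
    tsdf[weight == 0] = trunc                   # unobserved space counts as empty
    verts, faces, _, _ = marching_cubes(tsdf.reshape(shape), level=0.0)
    return verts, faces
```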

3.4 MVS Mesh Refinement

  • Mesh refinement algorithms move the location of each vertex \(v_i\) one by one or simultaneously, while minimizing an objective function defined over the mesh model.
  • The objective function \(E(\{v_i\})\) typically consists of
    • a data term \(E_p\), which is based on a photometric consistency measure
    • a regularization term \(E_r\), which measures the smoothness of the mesh.
    • Optionally, when image silhouettes are given as input, a silhouette consistency measure \(E_s\) enforces that the mesh is consistent with the image silhouettes.
  • Gradient descent is usually the method of optimization (a single-step sketch follows this list).
    • In practice, the gradients of \(E_p\) and \(E_s\) are usually projected along the surface normal.
    • The regularization term \(E_r\) is used to prefer uniform vertex distribution, and the direct derivative is calculated with respect to \(x_i\), \(y_i\), and \(z_i\).
  • Photometric consistency term
    • Three examples: 9 11 29
    • Different from many methods that use tangent planes or fronto-parallel surface approximations, the use of a triangulated mesh model for photometric consistency evaluation has several advantages:
      • First, the core texture reprojection (i.e., morphing) operations can be effectively and easily performed over an entire mesh by GPUs.
      • Second, it can handle sharp edge structures such as staircases properly.
  • Regularization term
    • Mesh Laplacian and Mesh Bi-Laplacian are often used to define the regularization force. 30
    • Laplacian becomes zero when the surface is planar.
    • Bi-Laplacian becomes zero when the surface has constant or uniform curvature.
  • Silhouette consistency term
    • Silhouettes often delineate a foreground object from background.
    • Example: 11
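
A single-step sketch of the refinement loop described above: the photometric force moves each vertex along its normal, while a uniform one-ring Laplacian (which vanishes on planar regions) supplies the regularization force. `photometric_force` and the step/weight values are assumptions:

```python
import numpy as np

def refine_step(V, normals, neighbors, photometric_force, step=0.1, lam=0.5):
    """One gradient-descent step of MVS mesh refinement.

    V: (N, 3) vertex positions; neighbors[i]: indices of vertex i's ring.
    photometric_force(i): signed step along the normal that decreases the
        photo-consistency term E_p at vertex i (i.e., minus its gradient,
        projected along the surface normal)."""
    V_new = V.copy()
    for i in range(len(V)):
        data = photometric_force(i) * normals[i]
        laplacian = V[neighbors[i]].mean(axis=0) - V[i]  # zero when planar
        V_new[i] += step * (data + lam * laplacian)
    return V_new
```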

4 Multi-view Stereo and Structure Priors

  • Our world is replete with textureless and highly non-Lambertian objects that make MVS algorithms fail.
  • Strong structural regularities
    • planarity
    • orthogonality
  • Change the definition of labels to enforce high-level structure priors such as planarity, while still keeping the submodularity to enable the simple application of the graph-cuts technique.

4.1 Departure from Depthmap to Planemap

  • Manhattan-world stereo 31
    • The MRF formulation is used to estimate a plane ID per pixel, which is called a planemap as opposed to a depthmap.
    • The data term measures visibility inconsistency between a plane hypothesis at a pixel and all of the input 3D points.
    • The smoothness term enforces spatial consistency. The penalty is down-weighted in the presence of image edges, where depth discontinuity is reasonable.
    • Given the MRF formulation, the \(\alpha\)-expansion algorithm is used to minimize the energy and estimate a planemap (a sketch of the energy follows this list). 32 33
  • Extended work allowing planes of arbitrary orientations 34
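
A schematic evaluation of the planemap energy: the data term is a truncated disagreement between the depth induced by the hypothesized plane and the depth evidence from the 3D points, and the Potts smoothness penalty is down-weighted across image edges. In practice the energy would be minimized with \(\alpha\)-expansion 33; only the evaluation is sketched here, and all weights are illustrative:

```python
import numpy as np

def planemap_energy(labels, plane_depth, evidence_depth, edge_map,
                    trunc=1.0, lam=2.0, edge_discount=0.2):
    """MRF energy of a planemap (one plane ID per pixel).

    plane_depth:    (P, H, W) depth induced at each pixel by each plane.
    evidence_depth: (H, W) depth evidence from 3D points (NaN if none).
    edge_map:       (H, W) boolean mask of image edges."""
    H, W = labels.shape
    rows, cols = np.indices((H, W))
    d = plane_depth[labels, rows, cols]
    # Data term: truncated disagreement with the point evidence.
    data = np.nansum(np.minimum(np.abs(d - evidence_depth), trunc))
    # Smoothness: penalize label changes, cheaper across image edges.
    smooth = 0.0
    for changed, edges in ((labels[:, 1:] != labels[:, :-1], edge_map[:, 1:]),
                           (labels[1:, :] != labels[:-1, :], edge_map[1:, :])):
        smooth += (changed * np.where(edges, edge_discount, 1.0)).sum()
    return data + lam * smooth
```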

4.2 Departure from Planes to Geometric Primitives

  • The work 35 estimates an elevation model, or a height-field representation of cities, formulated as a 2D MRF problem in a top-down view.
  • The formulation is very similar to the piecewise planar stereo algorithms in the previous section, where an MRF is used to assign a primitive ID, but with two key differences:
    • The first difference is that the domain is discretized by a grid of much larger rectangular cells based on the 3D line segments reconstructed by a multi-view line matching algorithm 36. A primitive ID is assigned to each cell, which has additional data-aware regularization effects.
    • The second difference is the handling of surfaces of revolution as geometric primitives, which may not be effective for arbitrary scenes.

4.3 Image Classification for Structure Priors

  • Image classification techniques can be employed to label an image into planar and non-planar regions, where structure priors (i.e., piecewise planarity) are enforced only for regions with planar classification. 37
    • They train a classifier based on color and texture features to distinguish planar from non-planar surfaces; both a simple k-nearest neighbor classifier and an SVM classifier were tested and produced similar results (a minimal sketch follows).
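
A minimal scikit-learn sketch of such a classifier; mean color plus local gradient energy stand in for the color and texture features of 37, and the labeled training regions are assumed inputs:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def region_features(patches):
    """Crude color/texture descriptor per region: mean color plus the
    mean gradient energy of the gray-scale patch."""
    feats = []
    for p in patches:
        gray = p.mean(axis=2)
        gy, gx = np.gradient(gray)
        feats.append(np.r_[p.reshape(-1, 3).mean(axis=0),
                           np.mean(gx ** 2 + gy ** 2)])
    return np.array(feats)

def train_planarity_classifier(train_patches, train_is_planar, k=5):
    """k-NN planar vs. non-planar classifier on the features above."""
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(region_features(train_patches), train_is_planar)
    return clf  # clf.predict(region_features(test_patches)) -> labels
```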

5 Software, Best Practices, and Successful Applications

6 Limitations and Future Directions

  • Limitations
    • Lack of texture
    • Thin structures
    • Non-Lambertian surfaces
  • Open Problems
    • Aerial and ground fusion
    • Inverse CAD
    • Semantic modeling
    • Dynamic scene modeling

Literature

  1. Yasutaka Furukawa and Carlos Hernández. Multi-view stereo: A tutorial. Foundations and Trends® in Computer Graphics and Vision, 9(1-2):1–148, 2015. 

  2. Ramin Zabih and John Woodfill. Non-parametric local transforms for computing visual correspondence. In European Conference on Computer Vision, pages 151–158, Stockholm, Sweden, 1994. 

  3. D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1/2/3):7–42, 2002. 

  4. C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In IEEE International Conference on Computer Vision, pages 839–846, 1998. 

  5. Kaiming He, Jian Sun, and Xiaoou Tang. Guided image filtering. In European Conference on Computer Vision, ECCV’10, pages 1–14, Berlin, Heidelberg, 2010. 

  6. Ziyang Ma, Kaiming He, Yichen Wei, Jian Sun, and Enhua Wu. Constant time weighted median filtering for stereo matching and beyond. In IEEE International Conference on Computer Vision, 2013. 

  7. Steven M. Seitz and Charles R. Dyer. Photorealistic scene reconstruction by voxel coloring. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1067–, Washington, DC, USA, 1997. IEEE Computer Society. 

  8. K.N. Kutulakos and S. M. Seitz. A theory of shape by space carving. International Journal of Computer Vision, 38(3):199–218, 2000. 

  9. Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multi-view stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(8):1362–1376, August 2010. 

  10. Yasutaka Furukawa, Brian Curless, Steven M. Seitz, and Richard Szeliski. Towards Internet-scale multiview stereo. In IEEE Conference on Computer Vision and Pattern Recognition, 2010. 

  11. Carlos Hernández and Francis Schmitt. Silhouette and stereo fusion for 3D object modeling. Computer Vision and Image Understanding, 96(3):367–392, 2004. 

  12. M. Goesele, N. Snavely, B. Curless, H. Hoppe, and S.M. Seitz. Multi-view stereo for community photo collections. In IEEE International Conference on Computer Vision, pages 1–8, 2007. 

  13. Xiaoyan Hu and P. Mordohai. A quantitative evaluation of confidence measures for stereo vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2121–2133, Nov 2012. 

  14. G. Vogiatzis, C. Hernández, P. H. S. Torr, and R. Cipolla. Multi-view stereo via volumetric graph-cuts and occlusion robust photo-consistency. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(12):2241–2246, 2007. 

  15. Michael Goesele, Brian Curless, and Steven M. Seitz. Multi-view stereo revisited. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2402–2409, 2006. 

  16. V. Kolmogorov and R. Zabih. Computing visual correspondence with occlusions using graph cuts. In IEEE International Conference on Computer Vision, volume 2, pages 508–515, 2001. 

  17. Neill D.F. Campbell, George Vogiatzis, Carlos Hernández, and Roberto Cipolla. Using multiple hypotheses to improve depth-maps for multi-view stereo. In 10th European Conference on Computer Vision, volume 5302 of LNCS, pages 766–779, 2008. 

  18. David Gallup, Jan-Michael Frahm, Philippos Mordohai, Qingxiong Yang, and Marc Pollefeys. Real-time plane-sweeping stereo with multiple sweeping directions. In IEEE Conference on Computer Vision and Pattern Recognition, 2007. 

  19. O. J. Woodford, P. H. S. Torr, I. D. Reid, and A. W. Fitzgibbon. Global stereo reconstruction under second order smoothness priors. In IEEE Conference on Computer Vision and Pattern Recognition, 2008. 

  20. Brian Curless and Marc Levoy. A volumetric method for building complex models from range images. In ACM SIGGRAPH, 1996. 

  21. C. Zach, T. Pock, and H. Bischof. A globally optimal algorithm for robust TV-L1 range image integration. In IEEE International Conference on Computer Vision, 2007. 

  22. Yasutaka Furukawa, Brian Curless, Steven M. Seitz, and Richard Szeliski. Reconstructing building interiors from images. In IEEE International Conference on Computer Vision, 2009. 

  23. Carlos Hernández, George Vogiatzis, and Roberto Cipolla. Probabilistic visibility for multi-view stereo. In IEEE Conference on Computer Vision and Pattern Recognition, 2007. 

  24. S.N. Sinha, P. Mordohai, and M. Pollefeys. Multi-view stereo via graph cuts on the dual of an adaptive tetrahedral mesh. In IEEE International Conference on Computer Vision, pages 1–8, 2007. 

  25. George Vogiatzis, P.H.S. Torr, and Roberto Cipolla. Multi-view stereo via volumetric graph-cuts. In IEEE Conference on Computer Vision and Pattern Recognition, 2005. 

  26. V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2):147–159, 2004. 

  27. Patrick Labatut, Jean-Philippe Pons, and Renaud Keriven. Efficient multi-view reconstruction of large-scale scenes using interest points, delaunay triangulation and graph cuts. In IEEE International Conference on Computer Vision, 2007. 

  28. M. Jancosek and T. Pajdla. Multi-view reconstruction preserving weakly-supported surfaces. In IEEE Conference on Computer Vision and Pattern Recognition, 2011. 

  29. H-H. Vu, P. Labatut, R. Keriven, and J.-P Pons. High accuracy and visibility-consistent dense multi-view stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(5):889–901, May 2012. 

  30. Mario Botsch and Olga Sorkine. On linear variational surface deformation methods. IEEE Transactions on Visualization and Computer Graphics, 14(1):213–230, 2008. 

  31. Yasutaka Furukawa, Brian Curless, Steven M. Seitz, and Richard Szeliski. Manhattan-world stereo. In IEEE Conference on Computer Vision and Pattern Recognition, 2009. 

  32. Yuri Boykov and Vladimir Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9):1124–1137, September 2004. 

  33. Yuri Boykov, Olga Veksler, and Ramin Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11):1222–1239, November 2001. 

  34. Sudipta Sinha, Drew Steedly, and Richard Szeliski. Piecewise planar stereo for image-based rendering. In IEEE International Conference on Computer Vision, 2009. 

  35. L. Zebedin, J. Bauer, K. Karner, and H. Bischof. Fusion of feature and area-based information for urban buildings modeling from aerial imagery. In IEEE International Conference on Computer Vision, pages IV: 873–886, 2008. 

  36. C. Baillard, C. Schmid, A. Zisserman, and A. W. Fitzgibbon. Automatic line matching and 3D reconstruction of buildings from multiple views. In ISPRS Conference on Automatic Extraction of GIS Objects from Digital Imagery, pages 69–80, 1999. 

  37. David Gallup, Jan-Michael Frahm, and Marc Pollefeys. Piecewise planar and non-planar stereo for urban scene reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition, 2010.