{"title": "Learning a Multi-View Stereo Machine", "book": "Advances in Neural Information Processing Systems", "page_first": 365, "page_last": 376, "abstract": "We present a learnt system for multi-view stereopsis. In contrast to recent learning based methods for 3D reconstruction, we leverage the underlying 3D geometry of the problem through feature projection and unprojection along viewing rays. By formulating these operations in a differentiable manner, we are able to learn the system end-to-end for the task of metric 3D reconstruction. End-to-end learning allows us to jointly reason about shape priors while conforming to geometric constraints, enabling reconstruction from much fewer images (even a single image) than required by classical approaches as well as completion of unseen surfaces. We thoroughly evaluate our approach on the ShapeNet dataset and demonstrate the benefits over classical approaches and recent learning based methods.", "full_text": "Learning a Multi-View Stereo Machine\n\nAbhishek Kar\nUC Berkeley\n\nakar@berkeley.edu\n\nChristian H\u00e4ne\n\nUC Berkeley\n\nchaene@berkeley.edu\n\nJitendra Malik\nUC Berkeley\n\nmalik@berkeley.edu\n\nAbstract\n\nWe present a learnt system for multi-view stereopsis. In contrast to recent learning\nbased methods for 3D reconstruction, we leverage the underlying 3D geometry of\nthe problem through feature projection and unprojection along viewing rays. By\nformulating these operations in a differentiable manner, we are able to learn the\nsystem end-to-end for the task of metric 3D reconstruction. End-to-end learning\nallows us to jointly reason about shape priors while conforming to geometric\nconstraints, enabling reconstruction from much fewer images (even a single image)\nthan required by classical approaches as well as completion of unseen surfaces. 
We thoroughly evaluate our approach on the ShapeNet dataset and demonstrate the benefits over classical approaches and recent learning based methods.

1 Introduction

Multi-view stereopsis (MVS) is classically posed as the following problem: given a set of images with known camera poses, produce a geometric representation of the underlying 3D world. This representation can be a set of disparity maps, a 3D volume in the form of voxel occupancies, signed distance fields, etc. An early example of such a system is the stereo machine from Kanade et al. [26] that computes disparity maps from image streams of six video cameras. Modern approaches focus on acquiring the full 3D geometry in the form of volumetric representations or polygonal meshes [48]. The underlying principle behind MVS is simple: a 3D point looks locally similar when projected to different viewpoints [29]. Thus, classical methods use the basic principle of finding dense correspondences in images and triangulating them to obtain a 3D reconstruction.

The question we try to address in this work is: can we learn a multi-view stereo system? For the binocular case, Becker and Hinton [1] demonstrated that a neural network can learn to predict a depth map from random dot stereograms. A recent work [28] shows convincing results for binocular stereo by using an end-to-end learning approach with binocular geometry constraints.

In this work, we present Learnt Stereo Machines (LSM) - a system which is able to reconstruct object geometry as voxel occupancy grids or per-view depth maps from a small number of views, including just a single image. We design our system inspired by classical approaches, while learning each component from data within an end-to-end system. LSMs have built-in projective geometry, enabling reasoning in metric 3D space and effectively exploiting the geometric structure of the MVS problem.
Compared to classical approaches, which are designed to exploit a specific cue such as silhouettes or photo-consistency, our system learns to exploit the cues that are relevant to the particular instance while also using priors about shape to predict geometry for unseen regions.

Recent work from Choy et al. [5] (3D-R2N2) trains convolutional neural networks (CNNs) to predict object geometry given only images. While this work relied primarily on semantic cues for reconstruction, our formulation enables us to exploit strong geometric cues. In our experiments, we demonstrate that a straightforward way of incorporating camera poses for volumetric occupancy prediction does not lead to the expected gains, while our geometrically grounded method is able to effectively utilize the additional information.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Figure 1: Overview of a Learnt Stereo Machine (LSM). It takes as input one or more views and camera poses. The images are processed through a feature encoder and then unprojected into the 3D world frame using a differentiable unprojection operation. These grids {G_i^f}_{i=1}^n are then matched in a recurrent manner to produce a fused grid G^p, which is then transformed by a 3D CNN into G^o. LSMs can produce two kinds of outputs - voxel occupancy grids (Voxel LSM) decoded from G^o, or per-view depth maps (Depth LSM) decoded after a projection operation.

Classical multi-view stereopsis is traditionally able to handle both objects and scenes - we only showcase our system for the case of objects, with scenes left for future work. We thoroughly evaluate our system on the synthetic ShapeNet [3] dataset. We compare to classical plane sweeping stereo, visual hulls and several challenging learning-based baselines. Our experiments show that we are able to reconstruct objects with fewer images than classical approaches.
Compared to recent learning based reconstruction approaches, our system is able to better use camera pose information, leading to significant improvements when adding more views. Finally, we show successful generalization to unseen object categories, demonstrating that our network goes beyond semantic cues and strongly uses geometric information for unified single and multi-view 3D reconstruction.

2 Related Work

Extracting 3D information from images is one of the classical problems in computer vision. Early works focused on the problem of extracting a disparity map from a binocular image pair [36]. We refer the reader to [47] for an overview of classical binocular stereo matching algorithms. In the multi-view setting, early work focused on using silhouette information via visual hulls [32], incorporating photo-consistency to deal with concavities (photo hull) [29], and shape refinement using optimization [55, 50, 7, 15]. [39, 35, 54] directly reason about viewing rays in a voxel grid, while [34] recovers a quasi-dense point cloud. In our work, we aim to learn a multi-view stereo machine grounded in geometry, that learns to use these classical constraints while also being able to reason about semantic shape cues from the data. Another approach to MVS involves representing the reconstruction as a collection of depth maps [6, 57, 41, 13, 40]. This allows recovery of fine details for which a consistent global estimate may be hard to obtain. These depth maps can then be fused using a variety of different techniques [38, 8, 33, 59, 30].
Our learnt system is able to produce a set of per-view depth maps along with a globally consistent volumetric representation, which allows us to preserve fine details while conforming to global structure.

Learning has been used for multi-view reconstruction in the form of shape priors for objects [2, 9, 58, 20, 27, 52], or semantic class-specific surface priors for scenes [22, 17, 45]. These works use learnt shape models and either directly fit them to input images or utilize them in a joint representation that fuses semantic and geometric information. Most recently, CNN based learning methods have been proposed for 3D reconstruction by learning image patch similarity functions [60, 18, 23] and end-to-end disparity regression from stereo pairs [37, 28]. Approaches which predict shape from a single image have been proposed in the form of direct depth map regression [46, 31, 10], generating multiple depth maps from novel viewpoints [51], producing voxel occupancies [5, 16], geometry images [49] and point clouds [11]. [12] study a related problem of view interpolation, where a rough depth estimate is obtained within the system.

A line of recent works, complementary to ours, has proposed to incorporate ideas from multi-view geometry in a learning framework to train single view prediction systems [14, 56, 53, 42, 61] using multiple views as a supervisory signal. These works use the classical cues of photo-consistency and silhouette consistency only during training - their goal during inference is to only perform single image shape prediction. In contrast, we also use geometric constraints during inference to produce high quality outputs.

Figure 2: Illustrations of projection and unprojection operations between 1D maps and 2D grids. (a) The projection operation samples values along the ray at equally spaced z-values into a 1D canvas/image. The sampled features (shown by colors here) at the z planes are stacked into channels to form the projected feature map. (b) The unprojection operation takes features from a feature map (here in 1-D) and places them along rays at grid blocks where the respective rays intersect. Best viewed in color.

Closest to our work is the work of Kendall et al. [28], which demonstrates incorporating binocular stereo geometry into deep networks by formulating a cost volume in terms of disparities and regressing depth values using a differentiable arg-min operation. We generalize to multiple views by tracing rays through a discretized grid and handle a variable number of views via incremental matching using recurrent units. We also propose a differentiable projection operation which aggregates features along viewing rays and learns a nonlinear combination function, instead of using the differentiable arg-min, which is susceptible to multiple modes. Moreover, we can also infer 3D from a single image during inference.

3 Learnt Stereo Machines

Our goal in this paper is to design an end-to-end learnable system that produces a 3D reconstruction given one or more input images and their corresponding camera poses. To this end, we draw inspiration from classical geometric approaches, where the underlying guiding principle is the following: the reconstructed 3D surface has to be photo-consistent with all the input images that depict this particular surface. Such approaches typically operate by first computing dense features for correspondence matching in image space. These features are then assembled into a large cost volume of geometrically feasible matches based on the camera pose.
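This classical recipe - dense features assembled into a cost volume over geometrically feasible matches, followed by a winner-take-all optimum - can be illustrated with a toy fronto-parallel 1-D example, where each depth hypothesis corresponds to an integer disparity. This is a minimal sketch of the general idea; the setup and names are ours, not from the paper:

```python
import numpy as np

def cost_volume_1d(left, right, max_disp):
    """Toy 1-D plane sweep: score each (pixel, disparity) hypothesis by the
    squared feature difference; low cost means a photo-consistent match."""
    n = len(left)
    cost = np.full((n, max_disp + 1), np.inf)  # inf where x - d is out of view
    for d in range(max_disp + 1):
        # hypothesis: left pixel x corresponds to right pixel x - d
        cost[d:, d] = (left[d:] - right[:n - d]) ** 2
    return cost

# right is left shifted by one pixel, so the true disparity is 1 everywhere
left = np.array([0.0, 0.0, 1.0, 0.5, 0.2, 0.9])
right = np.array([0.0, 1.0, 0.5, 0.2, 0.9, 0.0])
cost = cost_volume_1d(left, right, max_disp=2)
disparity = np.argmin(cost, axis=1)  # winner-take-all optimum over hypotheses
```

Note that the first pixel, lying in a textureless region, is ambiguous between disparities 0 and 1 - exactly the kind of case where the priors discussed below are needed.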
Finally, the optimum of this matching volume (along with certain priors) results in an estimate of the 3D volume/surface/disparity maps of the underlying shape from which the images were produced.

Our proposed system, shown in Figure 1, largely follows the principles mentioned above. It uses a discrete grid as the internal representation of the 3D world and operates in metric 3D space. The input images {I_i}_{i=1}^n are first processed through a shared image encoder which produces dense feature maps {F_i}_{i=1}^n, one for each image. The features are then unprojected into 3D feature grids {G_i^f}_{i=1}^n by rasterizing the viewing rays with the known camera poses {P_i}_{i=1}^n. This unprojection operation aligns the features along epipolar lines, enabling efficient local matching. This matching is modelled using a recurrent neural network which processes the unprojected grids sequentially to produce a grid of local matching costs G^p. This cost volume is typically noisy and is smoothed in an energy optimization framework with a data term and a smoothness term. We model this step by a feed-forward 3D convolution-deconvolution CNN that transforms G^p into a 3D grid G^o of smoothed costs, taking context into account. Based on the desired output, we propose to either let the final grid be a volumetric occupancy map or a grid of features which is projected back into 2D feature maps {O_i}_{i=1}^n using the given camera poses. These 2D maps are then mapped to a view-specific representation of the shape, such as a per-view depth/disparity map. The key components of our system are the differentiable projection and unprojection operations, which allow us to learn the system end-to-end while injecting the underlying 3D geometry in a metrically accurate manner.
We refer to our system as a Learnt Stereo Machine (LSM). We present two variants - one that produces per-voxel occupancy maps (Voxel LSM) and another that outputs a depth map per input image (Depth LSM) - and provide details about the components and the rationale behind them below.

2D Image Encoder. The first step in a stereo algorithm is to compute a good set of features to match across images. Traditional stereo algorithms typically use raw patches as features. We model this as a feed-forward CNN with a convolution-deconvolution architecture with skip connections (UNet) [44] to enable the features to have a large enough receptive field while at the same time having access to lower level features (using skip connections) whenever needed. Given images {I_i}_{i=1}^n, the feature encoder produces dense feature maps {F_i}_{i=1}^n in 2D image space, which are passed to the unprojection module along with the camera parameters to be lifted into metric 3D space.

Differentiable Unprojection. The goal of the unprojection operation is to lift information from the 2D image frame to the 3D world frame. Given a 2D point p, its feature representation F(p) and our global 3D grid representation, we replicate F(p) into locations along the viewing ray for p in the metric 3D grid (a 2D illustration is presented in Figure 2). In the case of perspective projection specified by an intrinsic camera matrix K and an extrinsic camera matrix [R|t], the unprojection operation uses this camera pose to trace viewing rays in the world and copy the image features into voxels in this 3D world grid. Instead of analytically tracing rays, given the centers of the blocks in our 3D grid {X_w^k}_{k=1}^{N_V}, we compute the feature for the kth block by projecting X_w^k into the image space using the camera projection equation p'_k = K[R|t]X_w^k. p'_k is a continuous quantity, whereas F is defined at discrete 2D locations.
Thus, we use the differentiable bilinear sampling operation to sample from the discrete grid [25] to obtain the feature at X_w^k. Such an operation has the highly desirable property that features from pixels in multiple images that may correspond to the same 3D world point unproject to the same location in the 3D grid - trivially enforcing epipolar constraints. As a result, any further processing on these unprojected grids has easy access to corresponding features to make matching decisions, foregoing the need for long-range image connections for feature matching in image space. Also, by projecting discrete 3D points into 2D and bilinearly sampling from the feature map, rather than analytically tracing rays in 3D, we implicitly handle the issue that the probability of a grid voxel being hit by a ray decreases with distance from the camera due to the projective nature of the rays. In our formulation, every voxel gets a "soft" feature assigned based on where it projects back in the image, making the feature grids G^f smooth and providing stable gradients. This geometric procedure of lifting features from 2D maps into 3D space is in contrast with recent learning based approaches [5, 51], which either reshape flattened feature maps into 3D grids for subsequent processing or inject pose into the system using fully connected layers. This procedure effectively saves the network from having to implicitly learn projective geometry and directly bakes this given fact into the system. In LSMs, we use this operation to unproject the feature maps {F_i}_{i=1}^n in image space produced by the feature encoder into feature grids {G_i^f}_{i=1}^n that lie in metric 3D space.

For single image prediction, LSMs cannot match features from multiple images to reason about where to place surfaces. Therefore, we append geometric features along the rays during the projection and unprojection operations to facilitate single view prediction.
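The unprojection described above - project each grid-block center with the camera and bilinearly sample the feature map - can be sketched at a shape level in numpy. This is an illustrative loop with our own helper names, not the paper's implementation (which runs as batched matrix multiplications on the GPU):

```python
import numpy as np

def bilinear_sample(F, u, v):
    """Bilinearly interpolate feature map F (h, w, c) at the continuous
    location (u, v); returns zeros for samples outside the image."""
    h, w, _ = F.shape
    if not (0 <= u <= w - 1 and 0 <= v <= h - 1):
        return np.zeros(F.shape[-1])
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    u1, v1 = min(u0 + 1, w - 1), min(v0 + 1, h - 1)
    a, b = u - u0, v - v0
    return ((1 - a) * (1 - b) * F[v0, u0] + a * (1 - b) * F[v0, u1] +
            (1 - a) * b * F[v1, u0] + a * b * F[v1, u1])

def unproject(F, K, R, t, centers):
    """Lift 2D features into 3D: for each grid-block center X_w, project it
    with p' = K (R X_w + t) and sample F at p' (cf. p'_k = K[R|t] X_w^k)."""
    G = np.zeros((len(centers), F.shape[-1]))
    for k, Xw in enumerate(centers):
        Xc = R @ Xw + t                  # world -> camera frame
        if Xc[2] > 0:                    # only points in front of the camera
            u, v = (K @ Xc)[:2] / Xc[2]  # perspective projection
            G[k] = bilinear_sample(F, u, v)
    return G
```

Because every voxel samples the image at a continuous location, each voxel receives the "soft" feature mentioned above, and gradients flow back through the bilinear weights.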
Specifically, we add the depth value and the ray direction at each sampling point.

Recurrent Grid Fusion. The 3D feature grids {G_i^f}_{i=1}^n encode information about the individual input images and need to be fused to produce a single grid so that further stages may reason jointly over all the images. For example, a simple strategy to fuse them would be to just use a point-wise function, e.g. max or average. This approach poses an issue: the combination is too spatially local and fuses all the information from the individual grids too early. Another extreme is concatenating all the feature grids before further processing. The complexity of this approach scales linearly with the number of inputs and poses issues when processing a variable number of images. Instead, we choose to process the grids in a sequential manner using a recurrent neural network. Specifically, we use a 3D convolutional variant of the Gated Recurrent Unit (GRU) [24, 4, 5] which combines the grids {G_i^f}_{i=1}^n using 3D convolutions (and non-linearities) into a single grid G^p. Using convolutions helps us effectively exploit neighborhood information in 3D space for incrementally combining the grids while keeping the number of parameters low. Intuitively, this step can be thought of as mimicking incremental matching in MVS, where the hidden state of the GRU stores a running belief about the matching scores obtained by matching features in the observations it has seen. One issue that arises is that we now have to define an ordering on the input images, whereas the output should be independent of the image ordering. We tackle this issue by randomly permuting the image sequences during training while constraining the output to be the same. During inference, we empirically observe that the final output has very little variance with respect to the ordering of the input image sequence.

3D Grid Reasoning.
Once the fused grid G^p is constructed, a classical multi-view stereo approach would directly evaluate the photo-consistency at the grid locations by comparing the appearance of the individual views, and extract the surface at voxels where the images agree. We model this step with a 3D UNet that transforms the fused grid G^p into G^o. The purpose of this network is to use shape cues present in G^p, such as feature matches and silhouettes, as well as to build in shape priors like smoothness, symmetries and knowledge about object classes, enabling it to produce complete shapes even when only partial information is visible. The UNet architecture yet again allows the system to use large enough receptive fields for doing multi-scale matching while also using lower level information directly when needed to produce its final estimate G^o. In the case of full 3D supervision (Voxel LSM), this grid can be made to represent a per-voxel occupancy map. G^o can also be seen as a feature grid containing the final representation of the 3D world our system produces, from which views can be rendered using the projection operation described below.

Differentiable Projection. Given a 3D feature grid G and a camera P, the projection operation produces a 2D feature map O by gathering information along viewing rays. The direct method would be to trace rays for every pixel and accumulate information from all the voxels on the ray's path. Such an implementation would require handling the fact that different rays can pass through different numbers of voxels on their way. For example, one can define a reduction function along the rays to aggregate information (e.g. max, mean), but this would fail to capture spatial relationships between the ray features.
Instead, we choose to adopt a plane sweeping approach where we sample from locations on depth planes at equally spaced z-values {z_k}_{k=1}^{N_z} along the ray. Consider a 3D point X_w that lies along the ray corresponding to a 2D point p in the projected feature grid at depth z_w - i.e. p = K[R|t]X_w and z(X_w) = z_w. The corresponding feature O(p) is computed by sampling from the grid G at the (continuous) location X_w. This sampling can be done differentiably in 3D using trilinear interpolation. In practice, we use nearest neighbor interpolation in 3D for computational efficiency. Samples along each ray are concatenated in ascending z-order to produce the 2D map O, where the features are stacked along the channel dimension. Rays in this feature grid can be trivially traversed by just following columns along the channel dimension, allowing us to learn the function to pool along these rays using 1x1 convolutions on these feature maps and progressively reducing the number of feature channels.

Architecture Details. As mentioned above, we present two versions of LSMs - Voxel LSM (V-LSM) and Depth LSM (D-LSM). Given one or more images and cameras, Voxel LSM produces a voxel occupancy grid, whereas D-LSM produces a depth map per input view. Both systems share the same set of CNN architectures (UNet) for the image encoder, grid reasoning and the recurrent pooling steps. We use instance normalization for all our convolution operations and layer normalization for the 3D convolutional GRU. In V-LSM, the final grid G^o is transformed into a probabilistic voxel occupancy map V ∈ R^{v_h × v_w × v_d} by a 3D convolution followed by a softmax operation. We use a simple binary cross entropy loss between the ground truth occupancy maps and V.
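The plane-sweep projection described above can be sketched as follows. This is a nearest-neighbor numpy illustration with our own helper names; the grid is assumed to cover the unit cube at the origin, and the learnt 1x1-convolution pooling over the stacked channels is stood in for by a fixed weight vector:

```python
import numpy as np

def project(G, K, R, t, hw, z_vals):
    """Sample grid G (D, D, D, c), covering the unit cube at the origin,
    along each pixel's ray at depths z_vals; stack the samples as channels."""
    D, c = G.shape[0], G.shape[-1]
    h, w = hw
    Kinv = np.linalg.inv(K)
    O = np.zeros((h, w, len(z_vals) * c))
    for v in range(h):
        for u in range(w):
            for zi, z in enumerate(z_vals):
                Xc = Kinv @ np.array([u * z, v * z, z])     # back-project pixel
                Xw = R.T @ (Xc - t)                         # camera -> world
                idx = np.floor((Xw + 0.5) * D).astype(int)  # nearest voxel
                if np.all(idx >= 0) and np.all(idx < D):
                    O[v, u, zi * c:(zi + 1) * c] = G[tuple(idx)]
    return O

def pool_rays(O, weights):
    """Stand-in for the learnt 1x1 convolution pooling along each ray."""
    return O @ weights  # (h, w, Nz*c) x (Nz*c,) -> (h, w)
```

Since the samples of one ray occupy the channels of one pixel, pooling along rays really is just a 1x1 convolution over the channel dimension, as described above.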
In D-LSM, G^o is first projected into 2D feature maps {O_i}_{i=1}^n, which are then transformed into metric depth maps {d_i}_{i=1}^n by 1x1 convolutions, to learn the reduction function along rays, followed by deconvolution layers to upsample the feature map back to the size of the input image. We use the absolute L1 error in depth to train D-LSM. We also add skip connections between early layers of the image encoder and the last deconvolution layers producing depth maps, giving them access to high frequency information in the images.

Figure 3: Voxel grids produced by V-LSM for example image sequences alongside a learning based baseline which uses pose information in a fully connected manner. V-LSM produces geometrically meaningful reconstructions (e.g. the curved arm rests instead of perpendicular ones (in R2N2) in the chair on the top left, and the siren lights on top of the police car) instead of relying on purely semantic cues. More visualizations in the supplementary material.

4 Experiments

In this section, we demonstrate the ability of LSMs to learn 3D shape reconstruction in a geometrically accurate manner. First, we present quantitative results for V-LSMs on the ShapeNet dataset [3] and compare them to various baselines, both classical and learning based. We then show that LSMs generalize to unseen object categories, validating our hypothesis that LSMs go beyond object/class specific priors and use photo-consistency cues to perform category-agnostic reconstruction. Finally, we present qualitative and quantitative results from D-LSM and compare it to traditional multi-view stereo approaches.

Dataset and Metrics. We use the synthetic ShapeNet dataset [3] to generate posed image-sets, ground truth 3D occupancy maps and depth maps for all our experiments.
More specifically, we use a subset of 13 major categories (same as [5]) containing around 44k 3D models, resized to lie within the unit cube centered at the origin, with a train/val/test split of [0.7, 0.1, 0.2]. We generated a large set of realistic renderings for the models, sampled from a viewing sphere with θ_az ∈ [0, 360) and θ_el ∈ [-20, 30] degrees and random lighting variations. We also rendered the depth images corresponding to each rendered image. For the volumetric ground truth, we voxelize each of the models at a resolution of 32 × 32 × 32. In order to evaluate the outputs of V-LSM, we binarize the probabilities at a fixed threshold (0.4 for all methods except visual hull (0.75)) and use the voxel intersection over union (IoU) as the similarity measure. To aggregate the per-model IoU, we compute a per-class average and take the mean as a per-dataset measure. All our models are trained in a class agnostic manner.

Implementation. We use 224 × 224 images to train LSMs with a shape batch size of 4 and 4 views per shape. Our world grid is at a resolution of 32^3. We implemented our networks in TensorFlow and trained both variants of LSMs for 100k iterations using Adam. The projection and unprojection operations are trivially implemented on the GPU with batched matrix multiplications and bilinear/nearest sampling, enabling inference at around 30 models/sec on a GTX 1080Ti. We unroll the GRU for up to 4 time steps while training and apply the trained models to an arbitrary number of views at test time.

Multi-view Reconstruction on ShapeNet.
We evaluate V-LSMs on the ShapeNet test set and compare to the following baselines: a visual hull baseline which uses silhouettes to carve out volumes; 3D-R2N2 [5], a previously proposed system which performs multi-view reconstruction without camera poses; and 3D-R2N2 w/pose, an extension of 3D-R2N2 where camera pose is injected using fully connected layers. For the experiments, we implemented the 3D-R2N2 system (and the 3D-R2N2 w/pose) and trained it on our generated data (images and voxel grids). Due to the difference in training data/splits and the implementation, the numbers are not directly comparable to the ones reported in [5], but we observe similar performance trends. For the 3D-R2N2 w/pose system, we use the camera pose quaternion as the pose representation and process it through 2 fully connected layers before concatenating it with the feature passed into the LSTM.

# Views           1      2      3      4
3D-R2N2 [5]      55.6   59.6   61.3   62.0
Visual Hull      18.0   36.9   47.0   52.4
3D-R2N2 w/pose   55.1   59.4   61.2   62.1
V-LSM            61.5   72.1   76.2   78.2
V-LSM w/bg       60.5   69.8   73.7   75.6

Table 1: Mean Voxel IoU on the ShapeNet test set. Note that the original 3D-R2N2 system does not use camera pose, whereas the 3D-R2N2 w/pose system is trained with pose information. V-LSM w/bg refers to Voxel LSM trained and tested with random images as backgrounds instead of white backgrounds only.

Figure 4: Generalization performance for V-LSM and 3D-R2N2 w/pose, measured by the gap in voxel IoU when tested on unseen object categories.

Figure 5: Qualitative results for per-view depth map prediction on ShapeNet. We show the depth maps predicted by Depth-LSM (visualized with shading from a shifted viewpoint) and the point cloud obtained by unprojecting them into world coordinates.

Table 1 reports the mean voxel IoU (across 13 categories) for sequences of {1, 2, 3, 4} views.
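The evaluation protocol described in the dataset and metrics paragraph - binarize the predicted occupancy probabilities at a fixed threshold, compute IoU per model, average per class, then take the mean over classes - amounts to the following small sketch (function names are ours):

```python
import numpy as np

def voxel_iou(prob, gt, thresh=0.4):
    """Binarize predicted occupancies at `thresh` and compute the voxel
    intersection over union against the ground-truth occupancy grid."""
    pred = prob >= thresh
    gt = gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both grids empty: treat as a perfect match
    return np.logical_and(pred, gt).sum() / union

def mean_voxel_iou(ious_by_class):
    """Per-class average of per-model IoUs, then the mean over classes."""
    return float(np.mean([np.mean(v) for v in ious_by_class.values()]))
```

The per-class averaging keeps large categories from dominating the dataset-level score.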
The accuracy increases with the number of views for all methods, but the jump is much smaller for the R2N2 methods, indicating that they already produce a good estimate at the beginning but fail to effectively use multiple views to improve their reconstructions significantly. The R2N2 system with naively integrated pose fails to improve over the base version, completely ignoring the pose in favor of just image-based information. On the other hand, our system, designed specifically to exploit these geometric multi-view cues, improves significantly with more views. Figure 3 shows some example reconstructions for V-LSM and 3D-R2N2 w/pose. Our system progressively improves based on the viewpoints it receives, while the R2N2 w/pose system makes very confident predictions early on (sometimes "retrieving" a completely different instance) and then stops improving as much. As we use a geometric approach, we end up memorizing less and reconstructing when possible. More detailed results can be found in the supplementary material.

Generalization. In order to test how well LSMs generalize to unseen data, we split our data into 2 parts with disjoint sets of classes - split 1 has data from 6 classes while split 2 has data from the other 7. We train three V-LSMs - trained on split 1 (V-LSM-S1), on split 2 (V-LSM-S2) and on both splits combined (V-LSM-All). The quantity we are interested in is the change in performance when we test the system on a category it hasn't seen during training. We use the difference in test IoU of a category C between V-LSM-All and V-LSM-S1 if C is not in split 1, and vice versa. Figure 4 shows the mean of this quantity across all classes as the number of views changes.
It can be seen that for a single view the difference in performance is fairly high; as we see more views, the difference in performance decreases, indicating that our system has learned to exploit category-agnostic shape cues. On the other hand, the 3D-R2N2 w/pose system fails to generalize with more views. Note that the V-LSMs have been trained with a time horizon of 4 but are evaluated for up to 8 steps here.

Sensitivity to noisy camera pose and masks. We conducted experiments to quantify the effects of noisy camera poses and segmentations on the performance of V-LSMs. We evaluated models trained with perfect poses on data with perturbed camera extrinsics and observed that performance degrades (as expected) yet still remains better than the baseline (at 10° noise). We also trained new models with synthetically perturbed extrinsics and achieved significantly higher robustness to noisy poses while maintaining competitive performance, as illustrated in Figure 6. The perturbation is introduced by generating a random rotation matrix which rotates the viewing axis by a maximum angular magnitude θ while still pointing at the object of interest.

Figure 6: Sensitivity to noise in camera pose estimates for V-LSM for systems trained with and without pose perturbation.

We also trained LSMs on images with random image backgrounds (V-LSM w/bg in Table 1) rather than only white backgrounds and saw a very small drop in performance. This shows that our method learns to match features rather than relying heavily on perfect segmentations.

Multi-view Depth Map Prediction. We show qualitative results from Depth LSM in Figure 5. We manage to obtain thin structures in challenging examples (chairs/tables) while predicting consistent geometry for all the views.
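The pose perturbation used in the robustness experiments above - a random rotation with bounded angular magnitude - can be generated, for instance, with an axis-angle rotation via Rodrigues' formula. This is our own construction, one of several valid ones, and it omits the re-aiming of the camera at the object described in the paper:

```python
import numpy as np

def rodrigues(axis, angle):
    """Rotation matrix for a rotation of `angle` radians about unit `axis`."""
    kx, ky, kz = axis
    Kx = np.array([[0.0, -kz, ky], [kz, 0.0, -kx], [-ky, kx, 0.0]])
    return np.eye(3) + np.sin(angle) * Kx + (1.0 - np.cos(angle)) * (Kx @ Kx)

def random_bounded_rotation(max_deg, rng):
    """Random rotation whose angular magnitude is at most `max_deg` degrees,
    usable for perturbing camera extrinsics."""
    axis = rng.standard_normal(3)
    axis /= np.linalg.norm(axis)           # uniform random rotation axis
    angle = np.deg2rad(rng.uniform(0.0, max_deg))
    return rodrigues(axis, angle)
```

Composing this with the original extrinsic rotation yields a perturbed pose whose viewing axis deviates by at most the chosen angle.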
We note that the skip connections from the image to the last layers of D-LSM do help in directly using low level image features while producing depth maps. The depth maps are viewed with shading in order to point out that we produce metrically accurate geometry. The unprojected point clouds also align well with each other, showing the merits of jointly predicting the depth maps from a global volume rather than processing them independently.

Comparison to Plane Sweeping. We qualitatively compare D-LSM to the popular plane sweeping (PS) approach [6, 57] for stereo matching. Figure 7 shows the unprojected point clouds from per-view depth maps produced using PS and D-LSM using 5 and 10 images. We omit an evaluation with fewer images, as plane sweeping completely fails in that regime. We use the publicly available implementation of the PS algorithm [19] with 5x5 zero-mean normalized cross correlation as matching windows and 300 depth planes. We can see that our approach is able to produce much cleaner point clouds with fewer input images. It is robust to texture-less areas where traditional stereo algorithms fail (e.g. the car windows) by using shape priors to reason about them. We also conducted a quantitative comparison using PS and D-LSM with 10 views (D-LSM was trained using only four images). The evaluation region is limited to a depth range of ±√3/2 (the maximal possible depth range) around the origin, as the original models lie in a unit cube centered at the origin. Furthermore, pixels where PS is not able to provide a depth estimate are not taken into account. Note that all these choices disadvantage our method. We compute the per-depth-map error as the median absolute depth difference over the valid pixels, aggregate to a per-category mean error, and report the average of the per-category means: 0.051 for PS and 0.024 for D-LSM. Please refer to the supplementary material for detailed results.

Figure 7: Comparison between Depth-LSM and plane sweeping stereo (PS) with varying numbers of images: (a) PS 5 images, (b) LSM 5 images, (c) PS 10 images, (d) LSM 10 images, (e) PS 20 images.

5 Discussion

We have presented Learnt Stereo Machines (LSM) - an end-to-end learnt system that performs multi-view stereopsis. The key insight of our system is to use ideas from projective geometry to differentiably transfer features between 2D images and the 3D world and vice-versa. In our experiments we showed the benefits of our formulation over direct methods - we are able to generalize to new object categories and produce compelling reconstructions with fewer images than classical systems. However, our system also has some limitations. We discuss some below and describe how they lead to future work.

A limiting factor in our current system is the coarse resolution (32^3) of the world grid. Classical algorithms typically work at much higher resolutions, frequently employing special data structures such as octrees. We can borrow ideas from recent works [43, 21] which show that CNNs can predict such high resolution volumes. We also plan to apply LSMs to more general geometry than objects, eventually leading to a system which can reconstruct single/multiple objects and entire scenes. The main challenge in this setup is to find the right global grid representation. In scenes, for example, a grid in terms of a per-view camera frustum might be more appropriate than a globally aligned euclidean grid.

In our experiments we evaluated classical multi-view 3D reconstruction, where the goal is to produce 3D geometry from images with known poses.
However, our system is more general and the projection modules can be used wherever one needs to move between 2D image and 3D world frames. Instead of predicting just depth maps from our final world representation, one can also predict other view-specific representations such as silhouettes or pixel-wise part segmentation labels. We can also project the final world representation into views that we have not observed as inputs (we would omit the skip connections from the image encoder to make the projection unconditional). This can be used to perform view synthesis grounded in 3D.

Acknowledgments

This work was supported in part by NSF Award IIS-1212798 and ONR MURI-N00014-10-1-0933. Christian Häne is supported by an "Early Postdoc.Mobility" fellowship No. 165245 from the Swiss National Science Foundation. The authors would like to thank David Fouhey, Saurabh Gupta and Shubham Tulsiani for valuable discussions and Fyusion Inc. for providing GPU hours for the work.

References

[1] S. Becker and G. E. Hinton. Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 1992.

[2] V. Blanz and T. Vetter. A morphable model for the synthesis of 3d faces. In Conference on Computer Graphics and Interactive Techniques, 1999.

[3] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.

[4] K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), 2014.

[5] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction.
In European Conference on Computer Vision (ECCV), 2016.

[6] R. T. Collins. A space-sweep approach to true multi-image matching. In Conference on Computer Vision and Pattern Recognition (CVPR), 1996.

[7] D. Cremers and K. Kolev. Multiview stereo and silhouette consistency via convex functionals over convex domains. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 33(6):1161–1174, 2011.

[8] B. Curless and M. Levoy. A volumetric method for building complex models from range images. In Conference on Computer Graphics and Interactive Techniques, 1996.

[9] A. Dame, V. A. Prisacariu, C. Y. Ren, and I. Reid. Dense reconstruction using 3d object shape priors. In Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

[10] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Neural Information Processing Systems (NIPS), 2014.

[11] H. Fan, H. Su, and L. Guibas. A point set generation network for 3d object reconstruction from a single image. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[12] J. Flynn, I. Neulander, J. Philbin, and N. Snavely. Deepstereo: Learning to predict new views from the world's imagery. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[13] Y. Furukawa and J. Ponce. Accurate, dense, and robust multi-view stereopsis. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2010.

[14] R. Garg, G. Carneiro, and I. Reid. Unsupervised cnn for single view depth estimation: Geometry to the rescue. In European Conference on Computer Vision (ECCV), 2016.

[15] P. Gargallo, E. Prados, and P. Sturm. Minimizing the reprojection error in surface reconstruction from images. In International Conference on Computer Vision (ICCV), pages 1–8, 2007.

[16] R. Girdhar, D. F. Fouhey, M. Rodriguez, and A. Gupta.
Learning a predictable and generative vector representation for objects. In European Conference on Computer Vision (ECCV), 2016.

[17] C. Haene, C. Zach, A. Cohen, and M. Pollefeys. Dense semantic 3d reconstruction. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2016.

[18] X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg. Matchnet: Unifying feature and metric learning for patch-based matching. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[19] C. Häne, L. Heng, G. H. Lee, A. Sizov, and M. Pollefeys. Real-time direct dense matching on fisheye images using plane-sweeping stereo. In International Conference on 3D Vision (3DV), 2014.

[20] C. Häne, N. Savinov, and M. Pollefeys. Class specific 3d object shape priors using surface normals. In Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

[21] C. Häne, S. Tulsiani, and J. Malik. Hierarchical surface prediction for 3d object reconstruction. In International Conference on 3D Vision (3DV), 2017.

[22] C. Häne, C. Zach, A. Cohen, R. Angst, and M. Pollefeys. Joint 3d scene reconstruction and class segmentation. In Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

[23] W. Hartmann, S. Galliani, M. Havlena, K. Schindler, and L. V. Gool. Learned multi-patch similarity. In International Conference on Computer Vision (ICCV), 2017.

[24] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.

[25] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In Neural Information Processing Systems (NIPS), 2015.

[26] T. Kanade, H. Kano, S. Kimura, A. Yoshida, and K. Oda. Development of a video-rate stereo machine. In International Conference on Intelligent Robots and Systems (IROS), 1995.

[27] A. Kar, S. Tulsiani, J. Carreira, and J. Malik.
Category-specific object reconstruction from a single image. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[28] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry. End-to-end learning of geometry and context for deep stereo regression. In International Conference on Computer Vision (ICCV), 2017.

[29] K. N. Kutulakos and S. M. Seitz. A theory of shape by space carving. International Journal of Computer Vision (IJCV), 38(3):199–218, 2000.

[30] P. Labatut, J.-P. Pons, and R. Keriven. Efficient multi-view reconstruction of large-scale scenes using interest points, delaunay triangulation and graph cuts. In International Conference on Computer Vision (ICCV), 2007.

[31] L. Ladicky, J. Shi, and M. Pollefeys. Pulling things out of perspective. In Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

[32] A. Laurentini. The visual hull concept for silhouette-based image understanding. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 16(2):150–162, 1994.

[33] V. Lempitsky and Y. Boykov. Global optimization for shape fitting. In Conference on Computer Vision and Pattern Recognition (CVPR), 2007.

[34] M. Lhuillier and L. Quan. A quasi-dense approach to surface reconstruction from uncalibrated images. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2005.

[35] S. Liu and D. B. Cooper. Statistical inverse ray tracing for image-based 3d modeling. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 36(10):2074–2088, 2014.

[36] D. Marr and T. Poggio. Cooperative computation of stereo disparity. In From the Retina to the Neocortex, pages 239–243, 1976.

[37] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation.
In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[38] P. Merrell, A. Akbarzadeh, L. Wang, P. Mordohai, J.-M. Frahm, R. Yang, D. Nistér, and M. Pollefeys. Real-time visibility-based fusion of depth maps. In International Conference on Computer Vision (ICCV), 2007.

[39] T. Pollard and J. L. Mundy. Change detection in a 3-d world. In Conference on Computer Vision and Pattern Recognition (CVPR), 2007.

[40] M. Pollefeys, D. Nistér, J.-M. Frahm, A. Akbarzadeh, P. Mordohai, B. Clipp, C. Engels, D. Gallup, S.-J. Kim, P. Merrell, C. Salmi, S. Sinha, B. Talton, L. Wang, Q. Yang, H. Stewenius, R. Yang, G. Welch, and H. Towles. Detailed real-time urban 3d reconstruction from video. International Journal of Computer Vision (IJCV), 78(2):143–167, 2008.

[41] M. Pollefeys, L. Van Gool, M. Vergauwen, F. Verbiest, K. Cornelis, J. Tops, and R. Koch. Visual modeling with a hand-held camera. International Journal of Computer Vision (IJCV), 59(3):207–232, 2004.

[42] D. J. Rezende, S. A. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg, and N. Heess. Unsupervised learning of 3d structure from images. In Neural Information Processing Systems (NIPS), 2016.

[43] G. Riegler, A. O. Ulusoy, H. Bischof, and A. Geiger. Octnetfusion: Learning depth fusion from data. In International Conference on 3D Vision (3DV), 2017.

[44] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015.

[45] N. Savinov, C. Häne, L. Ladicky, and M. Pollefeys. Semantic 3d reconstruction with continuous regularization and ray potentials using a visibility consistency constraint. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[46] A. Saxena, J. Schulte, and A. Y. Ng. Depth estimation using monocular and stereo cues.
In Neural Information Processing Systems (NIPS), 2005.

[47] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision (IJCV), 47(1-3):7–42, 2002.

[48] S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In Conference on Computer Vision and Pattern Recognition (CVPR), 2006.

[49] A. Sinha, J. Bai, and K. Ramani. Deep learning 3d shape surfaces using geometry images. In European Conference on Computer Vision (ECCV), 2016.

[50] S. N. Sinha, P. Mordohai, and M. Pollefeys. Multi-view stereo via graph cuts on the dual of an adaptive tetrahedral mesh. In International Conference on Computer Vision (ICCV), 2007.

[51] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Multi-view 3d models from single images with a convolutional network. In European Conference on Computer Vision (ECCV), 2016.

[52] S. Tulsiani, A. Kar, J. Carreira, and J. Malik. Learning category-specific deformable 3d models for object reconstruction. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2016.

[53] S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[54] A. O. Ulusoy, A. Geiger, and M. J. Black. Towards probabilistic volumetric reconstruction using ray potentials. In International Conference on 3D Vision (3DV), 2015.

[55] G. Vogiatzis, P. H. Torr, and R. Cipolla. Multi-view stereo via volumetric graph-cuts. In Conference on Computer Vision and Pattern Recognition (CVPR), 2005.

[56] X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision.
In Neural Information Processing Systems (NIPS), 2016.

[57] R. Yang, G. Welch, and G. Bishop. Real-time consensus-based scene reconstruction using commodity graphics hardware. In Computer Graphics Forum, 2003.

[58] S. Yingze Bao, M. Chandraker, Y. Lin, and S. Savarese. Dense object reconstruction with semantic priors. In Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

[59] C. Zach, T. Pock, and H. Bischof. A globally optimal algorithm for robust TV-L1 range image integration. In International Conference on Computer Vision (ICCV), 2007.

[60] J. Zbontar and Y. LeCun. Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research (JMLR), 17(1-32):2, 2016.

[61] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.