{"title": "Depth from a Single Image by Harmonizing Overcomplete Local Network Predictions", "book": "Advances in Neural Information Processing Systems", "page_first": 2658, "page_last": 2666, "abstract": "A single color image can contain many cues informative towards different aspects of local geometric structure. We approach the problem of monocular depth estimation by using a neural network to produce a mid-level representation that summarizes these cues. This network is trained to characterize local scene geometry by predicting, at every image location, depth derivatives of different orders, orientations and scales. However, instead of a single estimate for each derivative, the network outputs probability distributions that allow it to express confidence about some coefficients, and ambiguity about others. Scene depth is then estimated by harmonizing this overcomplete set of network predictions, using a globalization procedure that finds a single consistent depth map that best matches all the local derivative distributions. We demonstrate the efficacy of this approach through evaluation on the NYU v2 depth data set.", "full_text": "Depth from a Single Image by Harmonizing\n\nOvercomplete Local Network Predictions\n\nAyan Chakrabarti\n\nTTI-Chicago\nChicago, IL\n\nayanc@ttic.edu\n\nJingyu Shao\n\nDept. of Statistics, UCLA\u2217\n\nLos Angeles, CA\n\nshaojy15@ucla.edu\n\nGregory Shakhnarovich\n\nTTI-Chicago\nChicago, IL\n\ngregory@ttic.edu\n\nAbstract\n\nA single color image can contain many cues informative towards different as-\npects of local geometric structure. We approach the problem of monocular depth\nestimation by using a neural network to produce a mid-level representation that\nsummarizes these cues. This network is trained to characterize local scene geom-\netry by predicting, at every image location, depth derivatives of different orders,\norientations and scales. 
However, instead of a single estimate for each derivative,\nthe network outputs probability distributions that allow it to express con\ufb01dence\nabout some coef\ufb01cients, and ambiguity about others. Scene depth is then estimated\nby harmonizing this overcomplete set of network predictions, using a globalization\nprocedure that \ufb01nds a single consistent depth map that best matches all the local\nderivative distributions. We demonstrate the ef\ufb01cacy of this approach through\nevaluation on the NYU v2 depth data set.\n\n1\n\nIntroduction\n\nIn this paper, we consider the task of monocular depth estimation\u2014i.e., recovering scene depth from\na single color image. Knowledge of a scene\u2019s three-dimensional (3D) geometry can be useful in\nreasoning about its composition, and therefore measurements from depth sensors are often used to\naugment image data for inference in many vision, robotics, and graphics tasks. However, the human\nvisual system can clearly form at least an approximate estimate of depth in the absence of stereo and\nparallax cues\u2014e.g., from two-dimensional photographs\u2014and it is desirable to replicate this ability\ncomputationally. Depth information inferred from monocular images can serve as a useful proxy\nwhen explicit depth measurements are unavailable, and be used to re\ufb01ne these measurements where\nthey are noisy or ambiguous.\n\nThe 3D co-ordinates of a surface imaged by a perspective camera are physically ambiguous along a\nray passing through the camera center. However, a natural image often contains multiple cues that can\nindicate aspects of the scene\u2019s underlying geometry. 
For example, the projected scale of a familiar object of known size indicates how far it is; foreshortening of regular textures provides information about surface orientation; gradients due to shading indicate both orientation and curvature; strong edges and corners can correspond to convex or concave depth boundaries; and occluding contours or the relative position of key landmarks can be used to deduce the coarse geometry of an object or the whole scene. While a given image may be rich in such geometric cues, it is important to note that these cues are present in different image regions, and each indicates a different aspect of 3D structure.

*Part of this work was done while JS was a visiting student at TTI-Chicago.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Figure 1: To recover depth from a single image, we first use a neural network trained to characterize local depth structure. This network produces distributions for values of various depth derivatives, of different orders, at multiple scales and orientations, at every pixel, using global scene features and those from a centered local image patch (top left). A distributional output allows the network to determine different derivatives at different locations with different degrees of certainty (right). An efficient globalization algorithm is then used to produce a single consistent depth map estimate.

We propose a neural network-based approach to monocular depth estimation that explicitly leverages this intuition. Prior neural methods have largely sought to directly regress to depth [1, 2], with some additionally making predictions about smoothness across adjacent regions [4], or predicting relative depth ordering between pairs of image points [7]. In contrast, we train a neural network with a rich distributional output space. 
Our network characterizes various aspects of the local geometric structure\nby predicting values of a number of derivatives of the depth map\u2014at various scales, orientations, and\nof different orders (including the 0th derivative, i.e., the depth itself)\u2014at every image location.\n\nHowever, as mentioned above, we expect different image regions to contain cues informative towards\ndifferent aspects of surface depth. Therefore, instead of over-committing to a single value, our\nnetwork outputs parameterized distributions for each derivative, allowing it to effectively characterize\nthe ambiguity in its predictions. The full output of our network is then this set of multiple distributions\nat each location, characterizing coef\ufb01cients in effectively an overcomplete representation of the depth\nmap. To recover the depth map itself, we employ an ef\ufb01cient globalization procedure to \ufb01nd the\nsingle consistent depth map that best agrees with this set of local distributions.\n\nWe evaluate our approach on the NYUv2 depth data set [11], and \ufb01nd that it achieves state-of-the-art\nperformance. Beyond the bene\ufb01ts to the monocular depth estimation task itself, the success of our\napproach suggests that our network can serve as a useful way to incorporate monocular cues in more\ngeneral depth estimation settings\u2014e.g., when sparse or noisy depth measurements are available. Since\nthe output of our network is distributional, it can be easily combined with partial depth cues from other\nsources within a common globalization framework. 
Moreover, we expect our general approach, of learning to predict distributions in an overcomplete representation followed by globalization, to be useful broadly in tasks that involve recovering other kinds of scene value maps that have rich structure, such as optical or scene flow, surface reflectances, illumination environments, etc.

2 Related Work

Interest in monocular depth estimation dates back to the early days of computer vision, with methods that reasoned about geometry from cues such as diffuse shading [12], or contours [13, 14]. However, the last decade has seen accelerated progress on this task [1–10], largely owing to the availability of cheap consumer depth sensors, and consequently, large amounts of depth data for training learning-based methods. Most recent methods are based on training neural networks to map RGB images to geometry [1–7]. Eigen et al. [1, 2] set up their network to regress directly to per-pixel depth values, although they provide deeper supervision to their network by requiring an intermediate layer to explicitly output a coarse depth map. Other methods [3, 4] use conditional random fields (CRFs) to smooth their neural estimates. Moreover, the network in [4] also learns to predict one aspect of depth structure, in the form of the CRF's pairwise potentials.

Some methods are trained to exploit other individual aspects of geometric structure. Wang et al. [6] train a neural network to output surface normals instead of depth (Eigen et al. [1] do so as well, for a network separately trained for this task). In a novel approach, Zoran et al. [7] were able to train a network to predict the relative depth ordering between pairs of points in the image: whether one surface is behind, in front of, or at the same depth as the other. 
However, their globalization scheme\nto combine these outputs was able to achieve limited accuracy at estimating actual depth, due to the\nlimited information carried by ordinal pair-wise predictions.\n\nIn contrast, our network learns to reason about a more diverse set of structural relationships, by\npredicting a large number of coef\ufb01cients at each location. Note that some prior methods [3, 5] also\nregress to coef\ufb01cients in some basis instead of to depth values directly. However, their motivation\nfor this is to reduce the complexity of the output space, and use basis sets that have much lower\ndimensionality than the depth map itself. Our approach is different\u2014our predictions are distributions\nover coef\ufb01cients in an overcomplete representation, motivated by the expectation that our network\nwill be able to precisely characterize only a small subset of the total coef\ufb01cients in our representation.\n\nOur overall approach is similar to, and indeed motivated by, the recent work of Chakrabarti et al. [15],\nwho proposed estimating a scene map (they considered disparity estimation from stereo images)\nby \ufb01rst using local predictors to produce distributional outputs from many overlapping regions at\nmultiple scales, followed by a globalization step to harmonize these outputs. However, in addition to\nthe fact that we use a neural network to carry out local inference, our approach is different in that\ninference is not based on imposing a restrictive model (such as planarity) on our local outputs. Instead,\nwe produce independent local distributions for various derivatives of the depth map. Consequently,\nour globalization method need not explicitly reason about which local predictions are \u201coutliers\u201d with\nrespect to such a model. 
Moreover, since our coefficients can be related to the global depth map through convolutions, we are able to use Fourier-domain computations for efficient inference.

3 Proposed Approach

We formulate our problem as that of estimating a scene map y(n) ∈ R, which encodes point-wise scene depth, from a single RGB image x(n) ∈ R^3, where n ∈ Z^2 indexes location on the image plane. We represent this scene map y(n) in terms of a set of coefficients {w_i(n)}_{i=1}^K at each location n, corresponding to various spatial derivatives. Specifically, these coefficients are related to the scene map y(n) through convolution with a bank of derivative filters {k_i}_{i=1}^K, i.e.,

    w_i(n) = (y ∗ k_i)(n).   (1)

For our task, we define {k_i} to be a set of 2D derivative-of-Gaussian filters with standard deviations 2^s pixels, for scales s = {1, 2, 3}. We use the zeroth order derivative (i.e., the Gaussian itself), first order derivatives along eight orientations, as well as second order derivatives, along each of the orientations and orthogonal orientations (see Fig. 1 for examples). We also use the impulse filter, which can be interpreted as the zeroth derivative at scale 0, with the corresponding coefficients w_i(n) = y(n); this gives us a total of K = 64 filters. We normalize the first and second order filters to be unit norm. The zeroth order filters' coefficients typically have higher magnitudes, and in practice we find it useful to normalize them as ‖k_i‖_2 = 1/4 to obtain a more balanced representation.

To estimate the scene map y(n), we first use a convolutional neural network to output distributions for the coefficients p(w_i(n)), for every filter i and location n. We choose a parametric form for these distributions p(·), with the network predicting the corresponding parameters for each coefficient. 
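To make the representation concrete, here is a minimal numpy sketch of such a filter bank (our own illustration, not the authors' code: it builds the impulse filter, the zeroth-order Gaussians, and steered first-order derivatives at the three scales, with the norm conventions described above; the orientation sampling at multiples of π/8 and the truncation radius are our assumptions, and the paper's full 64-filter set additionally includes second-order filters):

```python
import numpy as np

def gauss_1d(sigma, order):
    # sampled 1-D Gaussian (order 0) or its analytic first derivative (order 1)
    r = int(np.ceil(3 * sigma))
    t = np.arange(-r, r + 1, dtype=float)
    g = np.exp(-t ** 2 / (2.0 * sigma ** 2))
    g /= g.sum()
    return (-t / sigma ** 2) * g if order == 1 else g

def filter_bank(scales=(1, 2, 3), n_orient=8):
    # impulse filter: zeroth derivative at scale 0, so w_i(n) = y(n)
    bank = [np.ones((1, 1))]
    for s in scales:
        sigma = 2.0 ** s
        g0 = np.outer(gauss_1d(sigma, 0), gauss_1d(sigma, 0))
        bank.append(0.25 * g0 / np.linalg.norm(g0))       # zeroth order, norm 1/4
        gx = np.outer(gauss_1d(sigma, 0), gauss_1d(sigma, 1))
        gy = np.outer(gauss_1d(sigma, 1), gauss_1d(sigma, 0))
        for theta in np.arange(n_orient) * np.pi / n_orient:
            d1 = np.cos(theta) * gx + np.sin(theta) * gy  # steered first derivative
            bank.append(d1 / np.linalg.norm(d1))          # unit norm
    return bank
```

The coefficient maps w_i(n) of (1) would then be obtained by convolving the depth map with each filter in the bank.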
The network is trained to produce these distributions for each set of coefficients {w_i(n)} by using as input a local region centered around n in the RGB image x. We then form a single consistent estimate of y(n) by solving a global optimization problem that maximizes the likelihood of the different coefficients of y(n) under the distributions provided by our network. We now describe the different components of our approach (which is summarized in Fig. 1): the parametric form for our local coefficient distributions, the architecture of our neural network, and our globalization method.

Figure 2: We train a neural network to output distributions for K depth derivatives {w_i(n)} at each location n, using a color image as input. The distributions are parameterized as Gaussian mixtures, and the network produces the M mixture weights for each coefficient. Our network includes a local path (green) with a cascade of convolution layers to extract features from a 97 × 97 patch around each location n; and a scene path (red) with pre-trained VGG-19 layers to compute a single scene feature vector. We learn a linear map (with 32× upsampling) from this scene vector to per-location features. The local and scene features are concatenated and used to generate the final distributions (blue).

3.1 Parameterizing Local Distributions

Our neural network has to output a distribution, rather than a single estimate, for each coefficient w_i(n). We choose Gaussian mixtures as a convenient parametric form for these distributions:

    p_{i,n}(w_i(n)) = Σ_{j=1}^M p̂_i^j(n) · (1/(√(2π) σ_i)) · exp(−|w_i(n) − c_i^j|² / (2σ_i²)),   (2)

where M is the number of mixture components (64 in our implementation), σ_i² is a common variance for all components for derivative i, and {c_i^j} are the individual component means. 
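For reference, the density in (2) can be evaluated as follows (a minimal numpy sketch with our own variable names: an M-vector of mixture weights and means, and a shared standard deviation per filter):

```python
import numpy as np

def mixture_logpdf(w, weights, means, sigma):
    # log p_{i,n}(w) for the Gaussian mixture of Eq. (2):
    # sum over j of weights[j] * N(w; means[j], sigma^2), shared variance
    log_comp = (np.log(weights)
                - 0.5 * np.log(2.0 * np.pi * sigma ** 2)
                - (w - means) ** 2 / (2.0 * sigma ** 2))
    m = log_comp.max()                 # log-sum-exp for numerical stability
    return m + np.log(np.exp(log_comp - m).sum())
```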
A distribution for a specific coefficient w_i(n) can then be characterized by our neural network by producing the mixture weights {p̂_i^j(n)}, Σ_j p̂_i^j(n) = 1, for each w_i(n) from the scene's RGB image.

Prior to training the network, we fix the means {c_i^j} and variances {σ_i²} based on a training set of ground truth depth maps. We use one-dimensional K-means clustering on sets of training coefficient values {w_i} for each derivative i, and set the means c_i^j in (2) above to the cluster centers. We set σ_i² to the average in-cluster variance; however, since these coefficients have heavy-tailed distributions, we compute this average only over clusters with more than a minimum number of assignments.

3.2 Neural Network-based Local Predictions

Our method uses a neural network to predict the mixture weights p̂_i^j(n) of the parameterization in (2) from an input color image. We train our network to output K × M numbers at each pixel location n, which we interpret as a set of M-dimensional vectors corresponding to the weights {p̂_i^j(n)}_j, for each of the K distributions of the coefficients {w_i(n)}_i. This training is done with respect to a loss between the predicted p̂_i^j(n) and the best fit of the parametric form in (2) to the ground truth derivative value w_i(n). Specifically, we define q_i^j(n) in terms of the true w_i(n) as:

    q_i^j(n) ∝ exp(−|w_i(n) − c_i^j|² / (2σ_i²)),   Σ_j q_i^j(n) = 1,   (3)

and define the training loss L in terms of the KL-divergence between these vectors q_i^j(n) and the network predictions p̂_i^j(n), weighting the loss for each derivative by its variance σ_i²:

    L = −(1/(NK)) Σ_{i,n} σ_i² Σ_{j=1}^M q_i^j(n) (log p̂_i^j(n) − log q_i^j(n)),   (4)

where N is the total number of locations n.

Our network has a fairly high-dimensional output space, corresponding to K × M numbers, with (M − 1) × K degrees of freedom, at each location n. Its architecture, detailed in Fig. 2, uses a cascade of seven convolution layers (each with ReLU activations) to extract a 1024-dimensional local feature vector from each 97 × 97 local patch in the input image. To further add scene-level semantic context, we include a separate path that extracts a single 4096-dimensional feature vector from the entire image, using pre-trained layers (up to pool5) from the VGG-19 [16] network, followed by downsampling with averaging by a factor of two, and a fully connected layer with a ReLU activation that is trained with dropout. This global vector is used to derive a 64-dimensional vector for each location n, using a learned layer that generates a feature map at a coarser resolution that is then bi-linearly upsampled by a factor of 32 to yield an image-sized map.

The concatenated local and scene-level features are passed through two more hidden layers (with ReLU activations). The final layer produces the K × M-vector of mixture weights p̂_i^j(n), applying a separate softmax to each of the M-dimensional vectors {p̂_i^j(n)}_j. All layers in the network are learned end-to-end, with the VGG-19 layers finetuned with a reduced learning rate factor of 0.1 compared to the rest of the network. 
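The training targets and loss of (3) and (4) can be sketched for a single coefficient as follows (our own minimal numpy version; the means and sigma are the fixed per-filter mixture parameters):

```python
import numpy as np

def soft_targets(w_true, means, sigma):
    # q_i^j(n) of Eq. (3): normalized component responsibilities
    # of the fixed mixture for the ground-truth coefficient value
    logits = -(w_true - means) ** 2 / (2.0 * sigma ** 2)
    q = np.exp(logits - logits.max())    # stabilized softmax
    return q / q.sum()

def coeff_loss(p_hat, q, sigma):
    # one (i, n) term of Eq. (4): sigma^2 times KL(q || p_hat)
    eps = 1e-12
    return sigma ** 2 * np.sum(q * (np.log(q + eps) - np.log(p_hat + eps)))
```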
The local path of the network is applied in a “fully convolutional” way [17] during training and inference, allowing efficient reuse of computations between overlapping patches.

3.3 Global Scene Map Estimation

Applying our neural network to a given input image produces a dense set of distributions p_{i,n}(w_i(n)) for all derivative coefficients at all locations. We combine these to form a single coherent estimate by finding the scene map y(n) whose coefficients {w_i(n)} have high likelihoods under the corresponding distributions {p_{i,n}(·)}. We do this by optimizing the following objective:

    y = arg max_y Σ_{i,n} σ_i² log p_{i,n}((k_i ∗ y)(n)),   (5)

where, as in (4), the log-likelihoods for different derivatives are weighted by their variance σ_i².

The objective in (5) is a summation over a large (K times image-size) number of non-convex terms, each of which depends on scene values y(n) at multiple locations n in a local neighborhood, based on the support of filter k_i. Despite the apparent complexity of this objective, we find that approximate inference using an alternating minimization algorithm, as in [15], works well in practice. Specifically, we create explicit auxiliary variables w_i(n) for the coefficients, and solve the following modified optimization problem:

    y = arg min_y min_{{w_i(n)}} −Σ_{i,n} σ_i² log p_{i,n}(w_i(n)) + (β/2) Σ_{i,n} (w_i(n) − (k_i ∗ y)(n))² + (1/2) R(y).   (6)

Note that the second term above forces coefficients of y(n) to be equal to the corresponding auxiliary variables w_i(n), as β → ∞. 
We iteratively solve (6) by alternating between minimizing the objective with respect to y(n) and to {w_i(n)}, keeping the other fixed, while increasing the value of β across iterations.

Note that there is also a third regularization term R(y) in (6), which we define as

    R(y) = Σ_r Σ_n ‖(∇_r ∗ y)(n)‖²,   (7)

using 3 × 3 Laplacian filters, at four orientations, for {∇_r}. In practice, this term only affects the computation of y(n) in the initial iterations when the value of β is small; in later iterations it is dominated by the values of w_i(n). However, we find that adding this regularization allows us to increase the value of β faster, and therefore converge in fewer iterations.

Each step of our alternating minimization can be carried out efficiently. When y(n) is fixed, the objective in (6) can be minimized with respect to each coefficient w_i(n) independently as:

    w_i(n) = arg min_w −log p_{i,n}(w) + (β/(2σ_i²)) (w − w̄_i(n))²,   (8)

where w̄_i(n) = (k_i ∗ y)(n) is the corresponding derivative of the current estimate of y(n). Since p_{i,n}(·) is a mixture of Gaussians, the objective in (8) can also be interpreted as the (scaled) negative log-likelihood of a Gaussian mixture, with “posterior” mixture means w̄_i^j(n) and weights p̄_i^j(n):

    w̄_i^j(n) = (c_i^j + β w̄_i(n)) / (1 + β),   p̄_i^j(n) ∝ p̂_i^j(n) exp(−(β/(β + 1)) · (c_i^j − w̄_i(n))² / (2σ_i²)).   (9)

While there is no closed form solution to (8), we find that a reasonable approximation is to simply set w_i(n) to the posterior mean value w̄_i^j(n) for which the weight p̄_i^j(n) is the highest.

The second step at each iteration involves minimizing (6) with respect to y given the current estimates of w_i(n). 
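The approximate per-coefficient update of (8) and (9) can be sketched as follows (our own minimal numpy version, for one coefficient: it forms the posterior mixture and returns the mean of its highest-weight component):

```python
import numpy as np

def w_step(w_bar, p_hat, means, sigma, beta):
    # Eq. (9): posterior component means and (log) weights, then the
    # approximation from the text: pick the mean of the top component
    post_means = (means + beta * w_bar) / (1.0 + beta)
    log_w = (np.log(p_hat + 1e-12)
             - (beta / (beta + 1.0)) * (means - w_bar) ** 2 / (2.0 * sigma ** 2))
    return post_means[np.argmax(log_w)]
```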
This is a simple least-squares minimization, given by

    y = arg min_y β Σ_{i,n} ((k_i ∗ y)(n) − w_i(n))² + Σ_{r,n} ‖(∇_r ∗ y)(n)‖².   (10)

Note that since all terms above are related to y by convolutions with different filters, we can carry out this minimization very efficiently in the Fourier domain.

We initialize our iterations by setting w_i(n) simply to the component mean c_i^j for which our predicted weight p̂_i^j(n) is highest. Then, we apply the y and {w_i(n)} minimization steps alternately, while increasing β from 2^−10 to 2^7, by a factor of 2^(1/8) at each iteration.

4 Experimental Results

We train and evaluate our method on the NYU v2 depth dataset [11]. To construct our training and validation sets, we adopt the standard practice of using the raw videos corresponding to the training images from the official train/test split. We randomly select 10% of these videos for validation, and use the rest for training our network. Our training set is formed by sub-sampling video frames uniformly, and consists of roughly 56,000 color image-depth map pairs. Monocular depth estimation algorithms are evaluated on their accuracy in the 561 × 427 crop of the depth map that contains a valid depth projection (including filled-in areas within this crop). We use the same crop of the color image as input to our algorithm, and train our network accordingly.

We let the scene map y(n) in our formulation correspond to the reciprocal of metric depth, i.e., y(n) = 1/z(n). While other methods use different compressive transforms (e.g., [1, 2] regress to log z(n)), our choice is motivated by the fact that points on the image plane are related to their world co-ordinates by a perspective transform. 
This implies, for example, that in planar regions the \ufb01rst derivatives of\ny(n) will depend only on surface orientation, and that second derivatives will be zero.\n\n4.1 Network Training\n\nWe use data augmentation during training, applying random rotations of \u00b15\u25e6 and horizontal \ufb02ips\nsimultaneously to images and depth maps, and random contrast changes to images. We use a fully\nconvolutional version of our architecture during training with a stride of 8 pixels, yielding nearly\n4000 training patches per image. We train the network using SGD for a total of 14 epochs, using a\nbatch size of only one image and a momentum value of 0.9. We begin with a learning rate of 0.01,\nand reduce it after the 4th, 8th, 10th, 12th, and 13th epochs, each time by a factor of two. This\nschedule was set by tracking the post-globalization depth accuracy on a validation set.\n\n4.2 Evaluation\n\nFirst, we analyze the informativeness of individual distributional outputs from our neural network.\nFigure 3 visualizes the accuracy and con\ufb01dence of the local per-coef\ufb01cient distributions produced by\nour network on a typical image. For various derivative \ufb01lters, we display maps of the absolute error\nbetween the true coef\ufb01cient values wi(n) and the mean of the corresponding predicted distributions\n{pi,n(\u00b7)}. Alongside these errors, we also visualize the network\u2019s \u201ccon\ufb01dence\u201d in terms of a map of\nthe standard deviations of {pi,n(\u00b7)}. We see that the network makes high con\ufb01dence predictions for\ndifferent derivatives in different regions, and that the number of such high con\ufb01dence predictions is\nleast for zeroth order derivatives. 
Table 1: Effect of Individual Derivatives on Global Estimation Accuracy (on 100 validation images)

Filters | RMSE (lin.) | RMSE (log) | Abs Rel. | Sqr Rel. | δ < 1.25 | δ < 1.25² | δ < 1.25³
Full | 0.6921 | 0.2533 | 0.1887 | 0.1926 | 76.62% | 91.58% | 96.62%
Scale 0,1 (All orders) | 0.7471 | 0.2684 | 0.2019 | 0.2411 | 75.33% | 90.90% | 96.28%
Scale 0,1,2 (All orders) | 0.7241 | 0.2626 | 0.1967 | 0.2210 | 75.82% | 91.12% | 96.41%
Order 0 (All scales) | 0.7971 | 0.2775 | 0.2110 | 0.2735 | 73.64% | 90.40% | 95.99%
Order 0,1 (All scales) | 0.6966 | 0.2542 | 0.1894 | 0.1958 | 76.56% | 91.53% | 96.62%
Scale 0 (Pointwise Depth) | 0.7424 | 0.2656 | 0.2005 | 0.2177 | 74.50% | 90.66% | 96.30%

(Lower is better for the first four metrics; higher is better for the δ thresholds.)

Figure 3: We visualize the informativeness of the local predictions from our network (on an image from the validation set). We show the accuracy and confidence of the predicted distributions for coefficients of different derivative filters (shown inset), in terms of the error between the distribution mean and true coefficient value, and the distribution standard deviation respectively. We find that errors are always low in regions of high confidence (low standard deviation). We also find that despite the fact that individual coefficients have many low-confidence regions, our globalization procedure is able to combine them to produce an accurate depth map.

Moreover, we find that all regions with high predicted confidence (i.e., low standard deviation) also have low errors. Figure 3 also displays the corresponding global depth estimates, along with their accuracy relative to the ground truth. We find that despite having large low-confidence regions for individual coefficients, our final depth map is still quite accurate. 
This\nsuggests that the information from different coef\ufb01cients\u2019 predicted distributions is complementary.\n\nTo quantitatively characterize the contribution of the various components of our overcomplete\nrepresentation, we conduct an ablation study on 100 validation images. With the same trained\nnetwork, we include different subsets of \ufb01lter coef\ufb01cients for global estimation\u2014leaving out either\nspeci\ufb01c derivative orders, or scales\u2014and report their accuracy in Table 1. We use the standard\nmetrics from [2] for accuracy between estimated and true depth values \u02c6z(n) and z(n) across all\npixels in all images: root mean square error (RMSE) of both z and log z, mean relative error\n(|z(n) \u2212 \u02c6z(n)|/z(n)) and relative square error (|z(n) \u2212 \u02c6z(n)|2/z(n)), as well as percentages of\npixels with error \u03b4 = max(z(n)/\u02c6z(n), \u02c6z(n)/z(n)) below different thresholds. We \ufb01nd that removing\neach of these subsets degrades the performance of the global estimation method\u2014with second order\nderivatives contributing least to \ufb01nal estimation accuracy. Interestingly, combining multiple scales but\nwith only zeroth order derivatives performs worse than using just the point-wise depth distributions.\n\nFinally, we evaluate the performance of our method on the NYU v2 test set. Table 2 reports the\nquantitative performance of our method, along with other state-of-the-art approaches over the entire\ntest set, and we \ufb01nd that the proposed method yields superior performance on most metrics. Figure 4\nshows example predictions from our approach and that of [1]. We see that our approach is often able\nto better reproduce local geometric structure in its predictions (desk & chair in column 1, bookshelf\nin column 4), although it occasionally mis-estimates the relative position of some objects (e.g., globe\nin column 5). 
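For reference, the standard depth-error metrics from [2] described above can be computed as follows (a minimal numpy sketch with our own function name; z_hat and z are arrays of estimated and true depths):

```python
import numpy as np

def depth_metrics(z_hat, z):
    # standard metrics from [2]: RMSE of z and log z, mean relative error,
    # relative squared error, and threshold accuracies delta < 1.25^k
    rmse_lin = np.sqrt(np.mean((z_hat - z) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(z_hat) - np.log(z)) ** 2))
    abs_rel = np.mean(np.abs(z - z_hat) / z)
    sqr_rel = np.mean((z - z_hat) ** 2 / z)
    delta = np.maximum(z / z_hat, z_hat / z)
    acc = [float(np.mean(delta < 1.25 ** k)) for k in (1, 2, 3)]
    return rmse_lin, rmse_log, abs_rel, sqr_rel, acc
```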
At the same time, it is also usually able to correctly estimate the depth of large and texture-less planar regions (but see column 6 for an example failure case).

Our overall inference method (network predictions and globalization) takes 24 seconds per image when using an NVIDIA Titan X GPU. The source code for our implementation, along with a pre-trained network model, is available at http://www.ttic.edu/chakrabarti/mdepth.

Table 2: Depth Estimation Performance on NYUv2 [11] Test Set

Method | RMSE (lin.) | RMSE (log) | Abs Rel. | Sqr Rel. | δ < 1.25 | δ < 1.25² | δ < 1.25³
Proposed | 0.620 | 0.205 | 0.149 | 0.118 | 80.6% | 95.8% | 98.7%
Eigen 2015 [1] (VGG) | 0.641 | 0.214 | 0.158 | 0.121 | 76.9% | 95.0% | 98.8%
Wang [3] | 0.745 | 0.262 | 0.220 | 0.210 | 60.5% | 89.0% | 97.0%
Baig [5] | 0.802 | - | 0.241 | - | 61.0% | - | -
Eigen 2014 [2] | 0.877 | 0.283 | 0.214 | 0.204 | 61.4% | 88.8% | 97.2%
Liu [4] | 0.824 | - | 0.230 | - | 61.4% | 88.3% | 97.1%
Zoran [7] | 1.22 | 0.43 | 0.41 | 0.57 | - | - | -

(Lower is better for the first four metrics; higher is better for the δ thresholds.)

Figure 4: Example depth estimation results on the NYU v2 test set.

5 Conclusion

In this paper, we described an alternative approach to reasoning about scene geometry from a single image. Instead of formulating the task as a regression to point-wise depth values, we trained a neural network to probabilistically characterize local coefficients of the scene depth map in an overcomplete representation. We showed that these local predictions could then be reconciled to form an estimate of the scene depth map using an efficient globalization procedure. We demonstrated the utility of our approach by evaluating it on the NYU v2 depth benchmark.

Its performance on the monocular depth estimation task suggests that our network's local predictions effectively summarize the depth cues present in a single image. 
In future work, we will explore how\nthese predictions can be used in other settings\u2014e.g., to aid stereo reconstruction, or improve the\nquality of measurements from active and passive depth sensors. We are also interested in exploring\nwhether our approach of training a network to make overcomplete probabilistic local predictions can\nbe useful in other applications, such as motion estimation or intrinsic image decomposition.\n\nAcknowledgments. AC acknowledges support for this work from the National Science Foundation\nunder award no. IIS-1618021, and from a gift by Adobe Systems. AC and GS thank NVIDIA\nCorporation for donations of Titan X GPUs used in this research.\n\n8\n\n\fReferences\n\n[1] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common\n\nmulti-scale convolutional architecture. In Proc. ICCV, 2015.\n\n[2] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a\n\nmulti-scale deep network. In NIPS, 2014.\n\n[3] P. Wang, X. Shen, Z. Lin, S. Cohen, B. Price, and A. Yuille. Towards uni\ufb01ed depth and semantic\n\nprediction from a single image. In Proc. CVPR, 2015.\n\n[4] F. Liu, C. Shen, and G. Lin. Deep convolutional neural \ufb01elds for depth estimation from a single\n\nimage. In Proc. CVPR, 2015.\n\n[5] M. Baig and L. Torresani. Coupled depth learning. In Proc. WACV, 2016.\n[6] X. Wang, D. Fouhey, and A. Gupta. Designing deep networks for surface normal estimation. In\n\nProc. CVPR, 2015.\n\n[7] D. Zoran, P. Isola, D. Krishnan, and W. T. Freeman. Learning ordinal relationships for mid-level\n\nvision. In Proc. ICCV, 2015.\n\n[8] K. Karsch, C. Liu, and S. B. Kang. Depth extraction from video using non-parametric sampling.\n\nIn Proc. ECCV. 2012.\n\n[9] L. Ladicky, J. Shi, and M. Pollefeys. Pulling things out of perspective. In Proc. CVPR, 2014.\n\n[10] A. Saxena, S. H. Chung, and A. Y. Ng. Learning depth from single monocular images. In NIPS,\n\n2005.\n\n[11] N. Silberman, D. 
Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference\n\nfrom rgbd images. In Proc. ECCV. 2012.\n\n[12] B. K. Horn and M. J. Brooks. Shape from shading. MIT Press, 1986.\n[13] M. B. Clowes. On seeing things. Arti\ufb01cial intelligence, 1971.\n[14] K. Sugihara. Machine interpretation of line drawings. MIT Press, 1986.\n[15] A. Chakrabarti, Y. Xiong, S. Gortler, and T. Zickler. Low-level vision by consensus in a spatial\n\nhierarchy of regions. In Proc. CVPR, 2015.\n\n[16] K. Chat\ufb01eld, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details:\n\nDelving deep into convolutional nets. In Proc. BMVC, 2014.\n\n[17] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation.\n\nIn Proc. CVPR, 2015.\n\n9\n\n\f", "award": [], "sourceid": 1372, "authors": [{"given_name": "Ayan", "family_name": "Chakrabarti", "institution": "TTI Chicago"}, {"given_name": "Jingyu", "family_name": "Shao", "institution": "UCLA"}, {"given_name": "Greg", "family_name": "Shakhnarovich", "institution": "TTI-Chicago"}]}