{"title": "DISN: Deep Implicit Surface Network for High-quality Single-view 3D Reconstruction", "book": "Advances in Neural Information Processing Systems", "page_first": 492, "page_last": 502, "abstract": "Reconstructing 3D shapes from single-view images has been a long-standing\nresearch problem. In this paper, we present DISN, a Deep Implicit Surface Net-\nwork which can generate a high-quality detail-rich 3D mesh from a 2D image by\npredicting the underlying signed distance fields. In addition to utilizing global\nimage features, DISN predicts the projected location for each 3D point on the\n2D image and extracts local features from the image feature maps. Combin-\ning global and local features significantly improves the accuracy of the signed\ndistance field prediction, especially for the detail-rich areas. To the best of our\nknowledge, DISN is the first method that constantly captures details such as\nholes and thin structures present in 3D shapes from single-view images. DISN\nachieves the state-of-the-art single-view reconstruction performance on a variety\nof shape categories reconstructed from both synthetic and real images. Code is\navailable at https://github.com/laughtervv/DISN. The supplemen-\ntary can be found at https://xharlie.github.io/images/neurips_\n2019_supp.pdf", "full_text": "DISN: Deep Implicit Surface Network for\nHigh-quality Single-view 3D Reconstruction\n\nWeiyue Wang*,1\n\nQiangeng Xu*,1\n\nDuygu Ceylan2\n\n1University of Southern California\n\nLos Angeles, California\n\nRadomir Mech2\n2Adobe\n\nSan Jose, California\n\nUlrich Neumann1\n\n{weiyuewa,qiangenx,uneumann}@usc.edu\n\n{ceylan,rmech}@adobe.com\n\nAbstract\n\nReconstructing 3D shapes from single-view images has been a long-standing\nresearch problem. In this paper, we present DISN, a Deep Implicit Surface Net-\nwork which can generate a high-quality detail-rich 3D mesh from a 2D image by\npredicting the underlying signed distance \ufb01elds. 
In addition to utilizing global image features, DISN predicts the projected location for each 3D point on the 2D image and extracts local features from the image feature maps. Combining global and local features significantly improves the accuracy of the signed distance field prediction, especially for detail-rich areas. To the best of our knowledge, DISN is the first method that consistently captures details such as holes and thin structures present in 3D shapes from single-view images. DISN achieves state-of-the-art single-view reconstruction performance on a variety of shape categories reconstructed from both synthetic and real images. Code is available at https://github.com/laughtervv/DISN. The supplementary can be found at https://xharlie.github.io/images/neurips_2019_supp.pdf.\n\n1 Introduction\nOver recent years, a multitude of single-view 3D reconstruction methods have been proposed, among which deep learning based methods have achieved especially promising results. To represent 3D shapes, many of these methods utilize either voxels [2\u20139] or point clouds [10] due to the ease of encoding them in a neural network. However, such representations are often limited in terms of resolution. A few recent methods [11\u201313] have explored utilizing explicit surface representations in a neural network but make the assumption of a fixed topology, limiting the flexibility of these approaches. Moreover, point- and mesh-based methods use Chamfer Distance (CD) and Earth Mover\u2019s Distance (EMD) as training losses. 
However, these distances only provide approximated metrics for measuring shape similarity.\n\nFigure 1: Single-view reconstruction results using OccNet [1] and DISN on synthetic and real images.\n\n* indicates equal contributions.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nTo address the aforementioned limitations of voxels, point clouds, and meshes, in this paper we study an alternative implicit 3D surface representation, Signed Distance Functions (SDFs). SDFs have recently attracted attention from researchers, and a few other works [14\u201316, 1] also choose to reconstruct 3D shapes by generating an implicit field. However, such methods either generate a binary occupancy grid or consider only global information. Therefore, while they succeed in recovering the overall shape, they fail to recover fine-grained details. After exploring different forms of the implicit field and the information that preserves local details, we present an efficient, flexible, and effective Deep Implicit Surface Network (DISN) for predicting SDFs from single-view images (Figure 1).\nAn SDF simply encodes the signed distance of each point sample in 3D from the boundary of the underlying shape. Thus, given a set of signed distance values, the shape can be extracted by identifying the iso-surface using methods such as Marching Cubes [17]. As illustrated in Figure 4, given a convolutional neural network (CNN) that encodes the input image into a feature vector, DISN predicts the SDF value of a given 3D point using this feature vector. 
By sampling different 3D point\nlocations, DISN is able to generate an implicit \ufb01eld of the underlying surface with in\ufb01nite resolution.\nMoreover, without the need of a \ufb01xed topology assumption, the regressing target for DISN is an\naccurate ground truth instead of an approximated metric.\nWhile many single-view 3D reconstruction methods [2, 10, 16, 1] that learn a shape embedding from\na 2D image are able to capture the global shape properties, they have a tendency to ignore details\nsuch as holes or thin structures. Such \ufb01ne-grained details only occupy a small portion in 3D space\nand thus sacri\ufb01cing them does not incur a high loss compared to ground truth shape. However, such\nresults can be visually unsatisfactory.\nTo address this problem, we introduce a local feature extraction module. Speci\ufb01cally, we estimate the\nviewpoint parameters of the input image. We utilize this information to project each query point onto\nthe input image to identify a corresponding local patch. We extract local features from such patches\nand use them in conjunction with global image features to predict the SDF values of the 3D points.\nThis module enables the network to learn the relations between projected pixels and 3D space and\nsigni\ufb01cantly improves the reconstruction quality of \ufb01ne-grained details in the resulting 3D shape. As\nshown in Figure 1, DISN is able to generate shape details, such as the patterns on the bench back and\nholes on the ri\ufb02e handle, which previous state-of-the-art methods fail to produce. To the best of our\nknowledge, DISN is the \ufb01rst deep learning model that is able to capture such high-quality details\nfrom single-view images.\nWe evaluate our approach on various shape categories using both synthetic data generated from\n3D shape datasets as well as online product images. 
Qualitative and quantitative comparisons\ndemonstrate that our network outperforms state-of-the-art methods and generates plausible shapes\nwith high-quality details. Furthermore, we also extend DISN to multi-view reconstruction and other\napplications such as shape interpolation.\n\n2 Related Work\nThere have been extensive studies on learning-based single-view 3D reconstruction using various 3D\nrepresentations including voxels [2\u20138], octrees [18\u201320], points [10], and primitives [21, 22]. More\nrecently, Sinha et al. [23] propose to generate the surface of an object using geometry images. Tang\net al. [24] use shape skeletons for surface reconstruction, however, their method requires additional\nshape primitives dataset. Groueix et al. [11] present AtlasNet to generate surfaces of 3D shapes using\na set of parametric surface elements. Wang et al. [12] introduce a graph-based network Pix2Mesh\nto reconstruct 3D manifold shapes from input images whereas Wang et al. [13] present 3DN to\nreconstruct a 3D shape by deforming a given source mesh.\nMost of the aforementioned methods use explicit 3D representations and often suffer from problems\nsuch as limited resolution and \ufb01xed mesh topology. Implicit representations provide an alternative\nrepresentation to overcome these limitations. In our work, we adopt the Signed Distance Functions\n(SDF) which are among the most popular implicit surface representations. Several deep learning\napproaches have utilized SDFs recently. Dai et al. [14] use a voxel-based SDF representation for shape\ninpainting. Nevertheless, 3D CNNs are known to suffer from high memory usage and computation\ncost. Park et al. [15] introduce DeepSDF for shape completion using an auto-decoder structure.\nHowever, their network is not feed-forward and requires optimizing the embedding vector during test\ntime which limits the ef\ufb01ciency and capability of the approach. 
Chen and Zhang [16] use SDFs in deep networks for the task of shape generation. While their method achieves promising results for the generation task, it fails to recover fine-grained details of 3D objects for single-view reconstruction. Mescheder et al. [1] learn an implicit representation by predicting the probability of each cell in a volumetric grid being occupied or not, i.e., being inside or outside of a 3D model. By iteratively subdividing each active cell (i.e., cells surrounded by occupied and empty cells) into sub-cells and repeating the prediction for each sub-cell, they alleviate the limited resolution of volumetric grids. Finally, in concurrent work, Saito et al. [25] utilize local image features to predict whether a 3D point sample is inside or outside the surface of a mesh and demonstrate high-quality human reconstruction results. In contrast, our method predicts not only the sign (i.e., inside or outside) of sampled points but also the distance, which is continuous. We compare our method with recent approaches in Section 4.1 and demonstrate state-of-the-art results.\n3 Method\nGiven an image of an object, our goal is to reconstruct a 3D shape that captures both the overall structure and the fine-grained details of the object. We consider modeling a 3D shape as a signed distance function (SDF). As illustrated in Figure 2, an SDF is a continuous function that maps a given spatial point p = (x, y, z) \u2208 R3 to a real value s \u2208 R: s = SDF(p). Unlike more common 3D representations such as depth [26], an SDF encodes geometry directly: the absolute value of s indicates the distance of the point to the surface, while the sign of s indicates whether the point is inside or outside the surface. The iso-surface S0 = {p|SDF(p) = 0} implicitly represents the underlying 3D shape.\nIn this paper, we use a feed-forward deep neural network, Deep Implicit Surface Network (DISN), to predict the SDF from an input image. 
DISN takes a single image as input\nand predicts the SDF value for any given point. Unlike the\n3D CNN methods [14] which generate a volumetric grid\nwith \ufb01xed resolution, DISN produces a continuous \ufb01eld\nwith arbitrary resolution. Moreover, we introduce a local\nfeature extraction method to improve recovery of shape\ndetails.\n3.1 DISN: Deep Implicit Surface Network\nThe overview of our method is illustrated in Figure 4.\nGiven an image, DISN consists of two parts: camera pose\nestimation and SDF prediction. DISN \ufb01rst estimates the camera parameters that map an object in\nworld coordinates to the image plane. Given the predicted camera parameters, we project each 3D\nquery point onto the image plane and collect multi-scale CNN features for the corresponding image\npatch. DISN then decodes the given spatial point to an SDF value using both the multi-scale local\nimage features and the global image features.\n\nFigure 2: Illustration of SDF. (a) Ren-\ndered 3D surface with s = 0. (b) Cross-\nsection of the SDF. A point is outside the\nsurface if s > 0, inside if s < 0, and on\nthe surface if s = 0.\n\nFigure 3: Local feature extraction. Given a 3D\npoint p, we use the estimated camera parameters\nto project p onto the image plane. Then we identify\nthe projected location on each feature map layer of\nthe encoder. We concatenate features at each layer\nto get the local features of point p.\n\nFigure 4: Given an image and a point p, we estimate\nthe camera pose and project p onto the image plane.\nDISN uses the local features at the projected location,\nthe global features, and the point features to predict the\nSDF of p. \u2018MLPs\u2019 denotes multi-layer perceptrons.\n\n3.1.1 Camera Pose Estimation\nGiven an input image, our \ufb01rst goal is to estimate the corresponding viewpoint. We train our network\non the ShapeNet Core dataset [27] where all the models are aligned. 
Therefore, we use this aligned model space as the world space that our camera parameters are defined with respect to, and we assume a fixed set of intrinsic parameters. Regressing camera parameters from an input image directly with a CNN often fails to produce accurate poses, as discussed in [28]. To overcome this issue, Insafutdinov and Dosovitskiy [28] introduce a distilled ensemble approach that regresses the camera pose by combining several pose candidates. However, this method requires a large number of network parameters and a complex training procedure. We present a more efficient and effective network, illustrated in Figure 5. In a recent work, Zhou et al. [29] show that a 6D rotation representation is continuous and easier for a neural network to regress than more commonly used representations such as quaternions and Euler angles. Thus, we employ the 6D rotation representation b = (bx, by), where b \u2208 R6, bx \u2208 R3, by \u2208 R3. Given b, the rotation matrix R = (Rx, Ry, Rz)^T \u2208 R3\u00d73 is obtained by\n\nRx = N(bx), Rz = N(Rx \u00d7 by), Ry = Rz \u00d7 Rx,   (1)\n\nwhere Rx, Ry, Rz \u2208 R3, N(\u00b7) is the normalization function, and \u2018\u00d7\u2019 denotes the cross product. The translation t \u2208 R3 from world space to camera space is directly predicted by the network.\nInstead of calculating losses on camera parameters directly as in [28], we use the predicted camera pose to transform a given point cloud from the world space to the camera coordinate space. 
We compute the loss Lcam as the mean squared error between the transformed point cloud and the ground truth point cloud in the camera space:\n\nLcam = (1/N) \u03a3_{pw \u2208 PCw} ||pG \u2212 (R pw + t)||_2^2,   (2)\n\nwhere PCw \u2208 R^{N\u00d73} is the point cloud in the world space and N is the number of points in PCw. For each pw \u2208 PCw, pG represents the corresponding ground truth point location in the camera space, and || \u00b7 ||_2^2 is the squared L2 distance.\n\nFigure 5: Camera Pose Estimation Network. \u2018PC\u2019 denotes point cloud. \u2018GT Cam\u2019 and \u2018Pred Cam\u2019 denote the ground truth and predicted cameras.\n\n3.1.2 SDF Prediction with Deep Neural Network\nGiven an image I, we denote the ground truth SDF by SDF_I(\u00b7); the goal of our network f(\u00b7) is to estimate SDF_I(\u00b7). Unlike the commonly used CD and EMD losses in previous reconstruction methods [10, 11], our supervision is the true ground truth rather than an approximated metric.\nPark et al. [15] recently propose DeepSDF, a direct approach to regress SDFs with a neural network. DeepSDF concatenates the location of a query 3D point with a shape embedding extracted from a depth image or a point cloud and uses an auto-decoder to obtain the corresponding SDF value. The auto-decoder structure requires optimizing the shape embedding for each object. In our initial experiments, when we applied a similar network architecture in a feed-forward manner, we observed convergence issues. Alternatively, Chen and Zhang [16] propose to concatenate the global features of an input image and the location of a query point to every layer of a decoder. While this approach works better in practice, it also results in a significant increase in the number of network parameters. Our solution is to use a multi-layer perceptron to map the given point location to a higher-dimensional feature space. 
This high dimensional feature is then concatenated with global and local image\nfeatures respectively and used to regress the SDF value. We provide the details of our network in the\nsupplementary.\nLocal Feature Extraction As shown in Figure 6(a), our initial\nexperiments showed that it is hard to capture shape details such\nas holes and thin structures when only global image features are\nused. Thus, we introduce a local feature extraction method to focus\non reconstructing \ufb01ne-grained details, such as the back poles of a\nchair (Figure 6). As illustrated in Figure 3, a 3D point p \u2208 R3\nis projected to a 2D location q \u2208 R2 on the image plane with the\nestimated camera parameters. We retrieve features on each feature\nmap corresponding to location q and concatenate them to get the\nlocal image features. Since the feature maps in the later layers are\nsmaller in dimension than the original image, we resize them to\nthe original size with bilinear interpolation and extract the resized\nfeatures at location q.\n\nFigure 6: Shape reconstruc-\ntion results (a) without and (b)\nwith local feature extraction.\n\nInput\n\n(b)\n\n(a)\n\n4\n\nCNNTranslationRotationPC in World SpaceApply TransformationPC in Pred Cam SpacePC in GT Cam SpaceMSE\fTwo decoders then take the global and local image features respectively as input with the point\nfeatures and make an SDF prediction. The \ufb01nal SDF is the sum of these two predictions. Figure 6\ncompares the results of our approach with and without local feature extraction. With only global\nfeatures, the network is able to predict the overall shape but fails to produce details. Local feature\nextraction helps to recover these missing details by predicting the residual SDF.\nLoss Functions We regress continuous SDF values instead of formulating a binary classi\ufb01cation\nproblem (e.g., inside or outside of a shape) as in [16]. This strategy enables us to extract surfaces that\ncorrespond to different iso-values. 
To ensure that the network concentrates on recovering the details near and inside the iso-surface S0, we propose a weighted loss function. Our loss is defined by\n\nLSDF = \u03a3_p m |f(I, p) \u2212 SDF_I(p)|, with m = m1 if SDF_I(p) < \u03b4 and m = m2 otherwise,   (3)\n\nwhere | \u00b7 | is the L1-norm, m1 and m2 are different weights, and for points whose signed distance is below the threshold \u03b4 we use the higher weight m1.\n3.2 Surface Reconstruction\nTo generate a mesh surface, we first define a dense 3D grid and predict SDF values for each grid point. Once we compute the SDF values for each point in the dense grid, we use Marching Cubes [17] to obtain the 3D mesh that corresponds to the iso-surface S0.\n\n4 Experiments\nWe perform quantitative and qualitative comparisons on single-view 3D reconstruction with state-of-the-art methods [11\u201313, 16, 1] in Section 4.1.\n\nFigure 7: Single-view reconstruction results of various methods (columns include Input, AtlasNet, Pix2Mesh, 3DN, 3DCNN, IMNET, OccNet, Ourscam, Ours, and GT). \u2018GT\u2019 denotes ground truth shapes. Best viewed on screen with zooming in.\n\nWe also compare the performance of our method on camera pose estimation with [28] in Section 4.2. We further conduct ablation studies in Section 4.3 and showcase several applications in Section 4.4. More qualitative results and all detailed network architectures can be found in the supplementary.\nDataset For both camera prediction and SDF prediction, we follow the settings of [11\u201313, 1] and use the ShapeNet Core dataset [27], which includes 13 object categories, with the official training/testing split to train and test our method. We train a single network on all categories and report the test results generated by this network.\nChoy et al. 
[30] provide a dataset of renderings of ShapeNet Core models where each model is rendered from 24 views with limited variation in terms of camera orientation. In order to make our method more general, we provide a new 2D dataset 1 composed of renderings of the models in ShapeNet Core. Specifically, for each mesh model, our dataset provides 36 renderings with smaller variation (similar to [30]) and 36 views with larger variation (a bigger yaw angle range and larger distance variation). Unlike Choy et al., we allow the object to move away from the origin, therefore providing more degrees of freedom in terms of camera parameters. We ignore the \"Roll\" angle of the camera since it is very rare in real-world scenarios. We also render higher-resolution images (224 by 224 instead of the original 137 by 137). Finally, to facilitate future studies, we pair each rendered RGBA image with a depth image, a normal map, and an albedo image as shown in Figure 8.\n\nFigure 8: Each view of each object has four corresponding representations: RGBA, depth, normal, and albedo.\n\nData Preparation and Implementation Details For each 3D mesh in ShapeNet Core, we first generate an SDF grid with resolution 256^3 using [31, 32]. Models in ShapeNet Core are aligned, and we choose this aligned model space as our world space, where each render view in [30] represents a transformation to a different camera space.\nWe train our camera pose estimation network and SDF prediction network separately. For both networks, we use VGG-16 [33] as the image encoder. When training the SDF prediction network, we extract the local features using the ground truth camera parameters. As mentioned in Section 3.1, DISN is able to generate a signed distance field with arbitrary resolution by continuously sampling points and regressing their SDF values. However, in practice, we are interested in points near the iso-surface S0. 
Therefore, we use Monte Carlo sampling to choose 2048 grid points under the Gaussian distribution N(0, 0.1) during training. We choose m1 = 4, m2 = 1, and \u03b4 = 0.01 as the parameters of Equation 3. Our network is implemented in TensorFlow. We use the Adam optimizer with a learning rate of 1 \u00d7 10^-4 and a batch size of 16.\nFor testing, we first use the camera pose prediction network to estimate the camera parameters for the input image and feed the estimated parameters as input to SDF prediction. We follow the aforementioned surface reconstruction procedure (Section 3.2) to generate the output mesh.\n\nEvaluation Metrics For quantitative evaluations, we apply four commonly used metrics to compute the difference between a reconstructed mesh and its ground truth mesh: (1) Chamfer Distance (CD) and (2) Earth Mover\u2019s Distance (EMD) between uniformly sampled point clouds, (3) Intersection over Union (IoU) on voxelized meshes, and (4) F-Score [34]. The definitions of CD and EMD can be found in the supplemental.\n\n4.1 Single-view Reconstruction Comparison With State-of-the-art Methods\nIn this section, we compare our approach on single-view reconstruction with state-of-the-art methods: AtlasNet [11], Pixel2Mesh [12], 3DN [13], OccNet [1], and IMNET [16]. AtlasNet [11] and Pixel2Mesh [12] generate a fixed-topology mesh from a 2D image. 3DN [13] deforms a given source mesh to reconstruct the target model; when comparing to this method, we choose a source mesh from a given set of templates by querying a template embedding as proposed in the original work. IMNET [16] and OccNet [1] both predict the sign of the SDF to reconstruct 3D shapes. Since IMNET trains an individual model for each category, we implement their model following the original paper and train a single model on all 13 categories. \n\n1 https://github.com/Xharlie/ShapenetRender_more_variation\n\n
Due to a mismatch between the scales of shapes reconstructed by our method and OccNet, we only report their IoU, which is scale-invariant. In addition, we train a 3D CNN model, denoted \u20183DCNN\u2019, where the encoder is the same as DISN\u2019s and the decoder is a volumetric 3D CNN structure with an output dimension of 64^3. The ground truth for 3DCNN is the SDF values at all 64^3 grid locations. For both IMNET and 3DCNN, we use the same surface reconstruction method as ours to output reconstructed meshes. We also report the results of DISN using estimated camera poses and ground truth poses, denoted \u2018Ourscam\u2019 and \u2018Ours\u2019 respectively. AtlasNet, Pixel2Mesh, and 3DN use explicit surface generation, while 3DCNN, IMNET, OccNet, and our methods reconstruct implicit surfaces.\nAs shown in Table 1, DISN outperforms all other models in EMD and IoU. Only 3DN performs better than our model on CD; however, 3DN requires more information than ours in the form of a source mesh as input. Figure 7 shows qualitative results. As illustrated in both the quantitative and qualitative results, the implicit surface representation provides a flexible way of generating topology-variant 3D meshes. Comparisons to 3DCNN show that predicting SDF values for given points produces smoother surfaces than generating a fixed 3D volume from an image embedding; we speculate that this is because the SDF is a continuous function of point location, and it is harder for a deep network to approximate an overall SDF volume with global image features only. Moreover, our method outperforms IMNET and OccNet in terms of recovering shape details. For example, in Figure 7, local feature extraction enables our method to generate the different patterns of the chair backs in the first three rows, while other methods fail to capture such details. We further validate the effectiveness of our local feature extraction module in Section 4.3. 
Although using ground truth camera poses (i.e., \u2018Ours\u2019) outperforms using predicted camera poses (i.e., \u2018Ourscam\u2019) in the quantitative results, the respective qualitative results demonstrate no significant difference.\n\nTable 1: Quantitative results on ShapeNet Core for various methods. Metrics are CD (\u00d70.001, the smaller the better), EMD (\u00d7100, the smaller the better) and IoU (%, the larger the better). CD and EMD are computed on 2048 points.\n\nEMD: plane bench box car chair display lamp speaker rifle sofa table phone boat Mean\nAtlasNet: 3.39 3.22 3.36 3.72 3.86 3.12 5.29 3.75 3.35 3.14 3.98 3.19 4.39 3.67\nPxl2mesh: 2.98 2.58 3.44 3.43 3.52 2.92 5.15 3.56 3.04 2.70 3.52 2.66 3.94 3.34\n3DN: 3.30 2.98 3.21 3.28 4.45 3.91 3.99 4.47 2.78 3.31 3.94 2.70 3.92 3.56\nIMNET: 2.90 2.80 3.14 2.73 3.01 2.81 5.85 3.80 2.65 2.71 3.39 2.14 2.75 3.13\n3D CNN: 3.36 2.90 3.06 2.52 3.01 2.85 4.73 3.35 2.71 2.60 3.09 2.10 2.67 3.00\nOurscam: 2.67 2.48 3.04 2.67 2.67 2.73 4.38 3.47 2.30 2.62 3.11 2.06 2.77 2.84\nOurs: 2.45 2.41 2.99 2.52 2.62 2.63 4.11 3.37 1.93 2.55 3.07 2.00 2.55 2.71\n\nCD: plane bench box car chair display lamp speaker rifle sofa table phone boat Mean\nAtlasNet: 5.98 6.98 13.76 17.04 13.21 7.18 38.21 15.96 4.59 8.29 18.08 6.35 15.85 13.19\nPxl2mesh: 6.10 6.20 12.11 13.45 11.13 6.39 31.41 14.52 4.51 6.54 15.61 6.04 12.66 11.28\n3DN: 6.75 7.96 8.34 7.09 17.53 8.35 12.79 17.28 3.26 8.27 14.05 5.18 10.20 9.77\nIMNET: 12.65 15.10 11.39 8.86 11.27 13.77 63.84 21.83 8.73 10.30 17.82 7.06 13.25 16.61\n3D CNN: 10.47 10.94 10.40 5.26 11.15 11.78 35.97 17.97 6.80 9.76 13.35 6.30 9.80 12.30\nOurscam: 9.96 8.98 10.19 5.39 7.71 10.23 25.76 17.90 5.58 9.16 13.59 6.40 11.91 10.98\nOurs: 9.01 8.32 9.98 4.92 7.54 9.58 22.73 16.70 4.36 8.71 13.29 6.21 10.87 10.17\n\nIoU: plane bench box car chair display lamp speaker rifle sofa table phone boat Mean\nAtlasNet: 39.2 34.2 20.7 22.0 25.7 36.4 21.3 23.2 45.3 27.9 23.3 42.5 28.1 30.0\nPxl2mesh: 51.5 40.7 43.4 50.1 40.2 55.9 29.1 52.3 50.9 60.0 31.2 69.4 40.1 47.3\n3DN: 54.3 39.8 49.4 59.4 34.4 47.2 35.4 45.3 57.6 60.7 31.3 71.4 46.4 48.7\nIMNET: 55.4 49.5 51.5 74.5 52.2 56.2 29.6 52.6 52.3 64.1 45.0 70.9 56.6 54.6\n3D CNN: 50.6 44.3 52.3 76.9 52.6 51.5 36.2 58.0 50.5 67.2 50.3 70.9 57.4 55.3\nOccNet: 54.7 45.2 73.2 73.1 50.2 47.9 34.7 65.3 45.8 67.1 50.6 70.9 52.1 56.4\nDISNcam: 57.5 52.9 52.3 74.3 54.3 56.4 37.0 54.9 59.2 65.9 47.9 72.9 55.9 57.0\nDISN: 61.7 54.2 53.1 77.0 54.9 57.7 39.7 55.9 68.0 67.1 48.9 73.6 60.2 59.4\n\nWe also compute the F-score (see Table 2), which measures the percentage of surface area that is reconstructed correctly and thus provides a reliable metric [34]. In our evaluations, we use F1 = 2 \u00b7 (Precision \u00b7 Recall)/(Precision + Recall). We uniformly sample points from both the ground truth and generated meshes. We define precision as the percentage of generated points whose distance to the closest ground truth point is less than a threshold. Similarly, we define recall as the percentage of ground truth points whose distance to the closest generated point is less than a threshold.\n\nTable 2: F-Score for varying thresholds (% of reconstruction volume side length, same as [34]) on all categories.\nThreshold(%): 0.5% 1% 2% 5% 10% 20%\n3DCNN: 0.064 0.295 0.691 0.935 0.984 0.997\nIMNet: 0.063 0.286 0.673 0.922 0.977 0.995\nDISN: 0.079 0.327 0.718 0.943 0.984 0.996\nDISNcam: 0.070 0.307 0.700 0.940 0.986 0.998\n\nTable 3: Camera pose estimation comparison. The unit of d2D is pixels.\nMetric: [28] | Ours | Oursnew\nd3D: 0.073 | 0.047 | 0.059\nd2D: 4.86 | 2.95 | 4.38/2.67\n\n4.2 Camera Pose Estimation\nWe compare our camera pose estimation with [28]. Given a point cloud PCw in world coordinates for an input image, we transform PCw using the predicted camera pose and compute the mean distance d3D between the transformed point cloud and the ground truth point cloud in camera space. We also compute the 2D reprojection error d2D of the transformed point cloud after we project it onto the input image. 
Table 3 reports d3D and d2D for [28] and our method. With the help of the 6D rotation representation, our method outperforms [28] by about 2 pixels in terms of 2D reprojection error. We also train and test pose estimation on the new 2D dataset. Even though these images possess more view variation, thanks to the better rendering quality we achieve an average 2D distance of 4.38 pixels on 224 by 224 images (2.67 pixels if normalized to the original resolution of 137 by 137).\n\n4.3 Ablation Studies\nTo show the impact of camera pose estimation, local feature extraction, and different network architectures, we conduct ablation studies on the ShapeNet \u201cchair\u201d category, since it has the greatest variety. Table 4 reports the quantitative results and Figure 9 shows the qualitative results.\n\nFigure 9: Qualitative results of our method under different settings (Input, Binarycam, Binary, Global, One-streamcam, One-stream, Two-streamcam, Two-stream, GT). \u2018GT\u2019 denotes ground truth shapes, and \u2018cam\u2019 denotes models with estimated camera parameters.\n\nCamera Pose Estimation As shown in Section 4.2, camera pose estimation potentially introduces uncertainty into the local feature extraction process, with an average reprojection error of 2.95 pixels. Although the quantitative reconstruction results with ground truth camera parameters are consistently superior to the results with estimated parameters in Table 4, Figure 9 demonstrates that a small difference in the image projection does not affect the reconstruction quality significantly.\nBinary Classification Previous studies [1, 16] formulate SDF prediction as a binary classification problem by predicting the probability of a point being inside or outside the surface S0. 
Even though Section 4.1 illustrates our superior performance over [1, 16], we further validate the effectiveness of our regression supervision by comparing with classification supervision using our own network structure. Instead of producing an SDF value, we train our network with classification supervision and output the probability of a point being inside the mesh surface. We use a softmax cross entropy loss to optimize this network. We report the result of this classification network as ‘Binary’.

Local Feature Extraction  Local image features of each point provide access to the corresponding local information that captures shape details. To validate the effectiveness of this information, we remove the ‘local feature extraction’ module from DISN and denote this setting by ‘Global’. This model predicts the SDF value solely based on the global image features. By comparing ‘Global’ with other methods in Table 4 and Figure 9, we conclude that local feature extraction helps the model capture shape details and improves the reconstruction quality by a large margin.

Network Structures  To further assess the impact of different network architectures, in addition to our original architecture with two decoders (which we call ‘Two-stream’), we also introduce a ‘One-stream’ architecture where the global features, the local features, and the point features are concatenated and fed into a single decoder which predicts the SDF value. The detailed structure of this architecture can be found in the supplementary. As illustrated in Table 4 and Figure 9, the original Two-stream setting is slightly superior to One-stream, which shows that DISN is robust to different network architectures.

             Binary        Global       One-stream    Two-stream
Camera Pose  gt | est      gt | n/a     gt | est      gt | est
EMD          2.88 | 2.99   2.75 | n/a   2.71 | 2.74   2.62 | 2.65
CD           8.27 | 8.80   7.64 | n/a   7.86 | 8.30   7.55 | 7.63
IoU          54.9 | 53.5   54.8 | n/a   53.6 | 53.5   55.3 | 53.9

Table 4: Quantitative results on the category “chair”. CD (×0.001), EMD (×100) and IoU (%). ‘gt’ and ‘est’ denote results with ground truth and estimated camera pose, respectively; ‘Global’ uses no local features, so results with an estimated pose are not applicable.

4.4 Applications
Shape interpolation  Figure 10 shows shape interpolation results where we interpolate both global and local image features going from the leftmost sample to the rightmost. We see that the generated shape is gradually transformed.

Figure 10: Shape interpolation result.

Test with online product images  Figure 11 illustrates 3D reconstruction results by DISN on online product images. Since our model is trained on rendered images, this experiment validates the domain transferability of DISN.

Figure 11: Test of our model on online product images.

Multi-view reconstruction  Our model can also take multiple 2D views of the same object as input. After extracting the global and the local image features for each view, we apply max pooling and use the resulting features as input to each decoder. We have retrained our network for 3 input views and visualize some results in Figure 12. Combining multi-view features helps DISN to further capture shape details.

Figure 12: Multi-view reconstruction results. (a) Single-view input. (b) Reconstruction result from (a). (c)&(d) Two other views. (e) Multi-view reconstruction result from (a), (c) and (d).

5 Conclusion
In this paper, we present DISN, a deep implicit surface network for single-view reconstruction. Given a 3D point and an input image, DISN predicts the SDF value for the point. We introduce a local feature extraction module by projecting the 3D point onto the image plane with an estimated camera pose. With the help of such local features, DISN is able to capture fine-grained details and generate high-quality 3D models. Qualitative and quantitative experiments validate the superior performance of DISN over state-of-the-art methods and the flexibility of our model.
Though we achieve state-of-the-art performance in single-view reconstruction, our method can only handle objects with a clear background, since it is trained with rendered images. To address this limitation, our future work includes extending SDF generation with texture prediction using a differentiable renderer [35].

References
[1] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. In CVPR, 2019.
[2] Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In NeurIPS, 2016.
[3] R. Girdhar, D. F. Fouhey, M. Rodriguez, and A. Gupta. Learning a predictable and generative vector representation for objects. In ECCV, 2016.
[4] Rui Zhu, Hamed Kiani Galoogahi, Chaoyang Wang, and Simon Lucey. Rethinking reprojection: Closing the loop for pose-aware shape reconstruction from a single image. In ICCV, 2017.
[5] Jiajun Wu, Yifan Wang, Tianfan Xue, Xingyuan Sun, William T Freeman, and Joshua B Tenenbaum. MarrNet: 3D shape reconstruction via 2.5D sketches. In NeurIPS, 2017.
[6] Shubham Tulsiani, Tinghui Zhou, Alexei A. Efros, and Jitendra Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR, 2017.
[7] Jiajun Wu, Chengkai Zhang, Xiuming Zhang, Zhoutong Zhang, William T. Freeman, and Joshua B. Tenenbaum. Learning shape priors for single-view 3d completion and reconstruction. In NeurIPS, 2018.
[8] Guandao Yang, Yin Cui, Serge Belongie, and Bharath Hariharan. Learning single-view 3d reconstruction with limited pose supervision.
In ECCV, 2018.

[9] Weiyue Wang, Qiangui Huang, Suya You, Chao Yang, and Ulrich Neumann. Shape inpainting using 3d generative adversarial network and recurrent convolutional networks. In ICCV, 2017.
[10] Haoqiang Fan, Hao Su, and Leonidas J Guibas. A point set generation network for 3d object reconstruction from a single image. In CVPR, 2017.
[11] Thibault Groueix, Matthew Fisher, Vladimir G. Kim, Bryan Russell, and Mathieu Aubry. AtlasNet: A papier-mâché approach to learning 3D surface generation. In CVPR, 2018.
[12] Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2mesh: Generating 3d mesh models from single rgb images. arXiv preprint arXiv:1804.01654, 2018.
[13] Weiyue Wang, Duygu Ceylan, Radomir Mech, and Ulrich Neumann. 3dn: 3d deformation network. In CVPR, 2019.
[14] Angela Dai, Charles Ruizhongtai Qi, and Matthias Nießner. Shape completion using 3d-encoder-predictor cnns and shape synthesis. In CVPR, 2017.
[15] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. Deepsdf: Learning continuous signed distance functions for shape representation. arXiv preprint arXiv:1901.05103, 2019.
[16] Zhiqin Chen and Hao Zhang. Learning implicit fields for generative shape modeling. arXiv preprint arXiv:1812.02822, 2018.
[17] William E Lorensen and Harvey E Cline. Marching cubes: A high resolution 3d surface construction algorithm. In ACM SIGGRAPH Computer Graphics, 1987.
[18] Christian Häne, Shubham Tulsiani, and Jitendra Malik. Hierarchical surface prediction for 3d object reconstruction. In 3DV, 2017.
[19] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs. In ICCV, 2017.
[20] Peng-Shuai Wang, Chun-Yu Sun, Yang Liu, and Xin Tong. Adaptive o-cnn: A patch-based deep representation of 3d shapes.
arXiv preprint arXiv:1809.07917, 2018.

[21] Chuhang Zou, Ersin Yumer, Jimei Yang, Duygu Ceylan, and Derek Hoiem. 3d-prnn: Generating shape primitives with recurrent neural networks. In ICCV, 2017.
[22] Chengjie Niu, Jun Li, and Kai Xu. Im2struct: Recovering 3d shape structure from a single rgb image. In CVPR, 2018.
[23] Ayan Sinha, Asim Unmesh, Qixing Huang, and Karthik Ramani. Surfnet: Generating 3d shape surfaces using deep residual networks. In CVPR, 2018.
[24] Jiapeng Tang, Xiaoguang Han, Junyi Pan, Kui Jia, and Xin Tong. A skeleton-bridged deep learning approach for generating meshes of complex topologies from single rgb images. In CVPR, 2019.
[25] Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization. arXiv preprint arXiv:1905.05172, 2019.
[26] Yiqi Zhong, Cho-Ying Wu, Suya You, and Ulrich Neumann. Deep rgb-d canonical correlation analysis for sparse depth completion. In NeurIPS, 2019.
[27] Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. Shapenet: An information-rich 3d model repository. arXiv, 2015.
[28] Eldar Insafutdinov and Alexey Dosovitskiy. Unsupervised learning of shape and pose with differentiable point clouds. In NeurIPS, 2018.
[29] Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. arXiv preprint arXiv:1812.07035, 2018.
[30] Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In ECCV, 2016.
[31] Hongyi Xu and Jernej Barbič. Signed distance fields for polygon soup meshes. In Proceedings of Graphics Interface 2014, pages 35–41.
Canadian Information Processing Society, 2014.

[32] Fun Shing Sin, Daniel Schroeder, and Jernej Barbič. Vega: non-linear fem deformable object simulator. In Computer Graphics Forum, volume 32, pages 36–48. Wiley Online Library, 2013.
[33] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[34] Maxim Tatarchenko, Stephan R Richter, René Ranftl, Zhuwen Li, Vladlen Koltun, and Thomas Brox. What do single-view 3d reconstruction networks learn? In CVPR, pages 3405–3414, 2019.
[35] Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. Neural 3d mesh renderer. In CVPR, 2018.