{"title": "RenderNet: A deep convolutional network for differentiable rendering from 3D shapes", "book": "Advances in Neural Information Processing Systems", "page_first": 7891, "page_last": 7901, "abstract": "Traditional computer graphics rendering pipelines are designed for procedurally\ngenerating 2D images from 3D shapes with high performance. The nondifferentiability due to discrete operations (such as visibility computation) makes it hard to explicitly correlate rendering parameters and the resulting image, posing a significant challenge for inverse rendering tasks. Recent work on differentiable rendering achieves differentiability either by designing surrogate gradients for non-differentiable operations or via an approximate but differentiable renderer. These methods, however, are still limited when it comes to handling occlusion, and restricted to particular rendering effects. We present RenderNet, a differentiable rendering convolutional network with a novel projection unit that can render 2D images from 3D shapes. Spatial occlusion and shading calculation are automatically encoded in the network. Our experiments show that RenderNet can successfully learn to implement different shaders, and can be used in inverse rendering tasks to estimate shape, pose, lighting and texture from a single image.", "full_text": "RenderNet: A deep convolutional network for\n\ndifferentiable rendering from 3D shapes\n\nThu Nguyen-Phuoc\nUniversity of Bath\n\nChuan Li\n\nLambda Labs\n\nStephen Balaban\n\nLambda Labs\n\nT.Nguyen.Phuoc@bath.ac.uk\n\nc@lambdalabs.com\n\ns@lambdalabs.com\n\nYong-Liang Yang\nUniversity of Bath\n\nY.Yang@cs.bath.ac.uk\n\nAbstract\n\nTraditional computer graphics rendering pipelines are designed for procedu-\nrally generating 2D images from 3D shapes with high performance. 
The non-\ndifferentiability due to discrete operations (such as visibility computation) makes it\nhard to explicitly correlate rendering parameters and the resulting image, posing\na signi\ufb01cant challenge for inverse rendering tasks. Recent work on differentiable\nrendering achieves differentiability either by designing surrogate gradients for\nnon-differentiable operations or via an approximate but differentiable renderer.\nThese methods, however, are still limited when it comes to handling occlusion, and\nrestricted to particular rendering effects. We present RenderNet, a differentiable\nrendering convolutional network with a novel projection unit that can render 2D im-\nages from 3D shapes. Spatial occlusion and shading calculation are automatically\nencoded in the network. Our experiments show that RenderNet can successfully\nlearn to implement different shaders, and can be used in inverse rendering tasks to\nestimate shape, pose, lighting and texture from a single image.\n\n1\n\nIntroduction\n\nRendering refers to the process of forming a realistic or stylized image from a description of the 3D\nvirtual object (e.g., shape, pose, material, texture), and the illumination condition of the surrounding\nscene (e.g., light position, distribution, intensity). On the other hand, inverse rendering (graphics)\naims at estimating these properties from a single image. The two most popular rendering methods,\nrasterization-based rendering and ray tracing, are designed to achieve fast performance and realism\nrespectively, but not for inverse graphics. These two methods rely on discrete operations, such\nas z-buffering and ray-object intersection, to identify point visibility in a rendering scene, which\nmakes these techniques non-differentiable. 
Although it is possible to treat them as non-differentiable renderers in computer vision tasks [1], inferring parameters, such as shapes or poses, from the rendered images using traditional graphics pipelines is still a challenging task. A differentiable renderer that can correlate the change in a rendered image with the change in rendering parameters will therefore facilitate a range of applications, such as vision-as-inverse-graphics tasks or image-based 3D modelling and editing.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Recent work in differentiable rendering achieves differentiability in various ways. Loper and Black [2] propose an approximate renderer which is differentiable. Kato et al. [3] achieve differentiability by proposing an approximate gradient for the rasterization operation. Recent work on image-based reconstruction uses differentiable projections of 3D objects onto silhouette masks as a surrogate for a rendered image of the objects [4, 5]. Wu et al. [6] and Tulsiani et al. [7] derive differentiable projective functions from normal, depth, and silhouette maps, but can only handle orthographic projection or require multiple input images, respectively. These projections can then be used to construct an error signal for the reconstruction process. All of these approaches, however, are restricted to specific rendering styles (rasterization) [2, 3, 8], input geometry types [9, 10], or limited output formats such as depth or silhouette maps [4, 5, 6, 7, 11, 12]. Moreover, none of these approaches tackles the problem from the point of view of network architecture design. Recent progress in machine learning shows that network architecture plays an important role in improving the performance of many tasks. For example, in classification, ResNet [13] and DenseNet [14] have contributed significant performance gains.
In segmentation tasks, U-Net [15] proves that having short-cut connections can greatly improve the detail level of the segmentation masks. In this paper, we therefore focus on designing a neural network architecture suitable for the task of rendering and inverse rendering.
We propose RenderNet, a convolutional neural network (CNN) architecture that can be trained end-to-end for rendering 3D objects, including object visibility computation and pixel color calculation (shading). Our method explores the novel idea of combining the ability of CNNs with inductive biases about the 3D world for geometry-based image synthesis. This is different from recent image-generating CNNs driven by object attributes [16], noise [17], semantic maps [18], or pixel attributes [19], which make very few assumptions about the 3D world and the image formation process. Inspired by the computer graphics literature, we propose the projection unit, which incorporates prior knowledge about the 3D world, and how it is rendered, into RenderNet. The projection unit, through learning, is a differentiable approximation of the non-differentiable visibility computation step, making RenderNet an end-to-end system. Unlike the non-learnt approaches in previous work, a learnt projection unit uses deep features instead of low-level primitives, making RenderNet generalize well to a variety of input geometries, robust to erroneous or low-resolution input, and able to learn multi-style rendering with the same network architecture.
RenderNet is differentiable and can be easily integrated into other neural networks, benefiting various inverse rendering tasks, such as novel-view synthesis, pose prediction, or image-based 3D shape reconstruction, unlike previous image-based inverse rendering work that can recover only part of the full 3D shape [20, 21].
We choose the voxel representation of 3D shapes for its regularity and flexibility, and its application in visualizing volumetric data such as medical images. Although voxel grids are traditionally memory-inefficient, computers are becoming more powerful, and recent work also addresses this inefficiency using octrees [22, 23], enabling high-resolution voxel grids. In this paper, we focus on voxel data, and leave other data formats such as polygon meshes and unstructured point clouds as possible future extensions. We demonstrate that RenderNet can generate renderings of high quality, even from low-resolution and noisy voxel grids. This is a significant advantage over mesh renderers, including more recent work in differentiable rendering, which do not handle erroneous inputs well.
By framing the rendering process as a feed-forward CNN, RenderNet has the ability to learn to express different shaders with the same network architecture. We demonstrate a number of rendering styles, ranging from simple shaders such as Phong shading [24] and suggestive contour shading [25], to more complex shaders such as a composite of contour shading and cartoon shading [26], or ambient occlusion [27], some of which are time-consuming and computationally expensive.
RenderNet also has the potential to be combined with neural style transfer to improve the synthesized results, or with other complex shaders that are hard to define explicitly.
In summary, the proposed RenderNet can benefit both rendering and inverse rendering: RenderNet can learn to generate images with different appearances, and can also be used for vision-as-inverse-graphics tasks. Our main contributions are threefold.

• A novel convolutional neural network architecture that learns to render in different styles from a 3D voxel grid input. To our knowledge, we are the first to propose a neural renderer for 3D shapes with a projection unit that enables both rendering and inverse rendering.
• We show that RenderNet generalizes well to objects of unseen categories and more complex scene geometry. RenderNet can also produce textured images from textured voxel grids, where the input textures can be RGB colors or deep features computed from semantic inputs.
• We show that our model can be integrated into other modules for applications such as texturing or image-based reconstruction.

2 Related work

Our work is related to three categories of learning-based work: image-based rendering, geometry-based rendering, and image-based shape reconstruction. In this section, we review some landmark methods that are closely related to our work. In particular, we focus on neural-network-based methods.
Image-based rendering  There is a rich literature of CNN-based rendering by learning from images. Dosovitskiy et al. [16] create 2D images from low-dimensional vectors and attributes of 3D objects. Cascaded refinement networks [18] and Pix2Pix [28] additionally condition on semantic maps or sketches as inputs.
Using a model that is more deeply grounded in computer graphics, DeepShading [19] learns to create images with high fidelity and complex visual effects from per-pixel attributes. DC-IGN [29] learns a disentangled representation of images with respect to transformations, such as out-of-plane rotations and lighting variations, and is thus able to edit images with respect to these factors. Relevant works on novel 3D view synthesis [30] leverage category-specific shape priors and optical flow to deal with occlusion/disocclusion. While these methods yield impressive results, we argue that geometry-based methods, which make stronger assumptions about the 3D world and how it produces 2D images, will be able to perform better in certain tasks, such as out-of-plane rotation, image relighting, and shape texturing. This also coincides with Rematas et al. [31], Yang et al. [32] and Su et al. [33], who use strong 3D priors to assist the novel-view synthesis task.
Geometry-based rendering  Despite the rich rendering literature in computer graphics, there is far less work on differentiable rendering techniques. OpenDR [2] has been a popular framework for differentiable rendering. However, being a more general method, it is harder to integrate into other neural networks and machine learning frameworks. Kato et al. [3] approximate the gradient of the rasterization operation to make rendering differentiable. However, this method is limited to rasterization-based rendering, making it difficult to represent more complex effects that are usually achieved by ray tracing, such as global illumination, reflection, or refraction.
Image-based 3D shape reconstruction  Reconstructing a 3D shape from a 2D image can be treated as estimating the posterior of the 3D shape conditioned on the 2D information. The prior on the shape could be a simple smoothness prior or a prior learned from 3D shape datasets.
The likelihood term,\non the other hand, requires estimating the distribution of 2D images given the 3D shape. Recent\nwork has been using 2D silhouette maps of the images [4, 5]. While this proves effective, silhouette\nimages contain little information about the shape. Hence a large number of images or views of the\nobject is required for the reconstruction task. For normal maps and depth maps of the shape, Wu et al.\n[6] derive differentiable projective functions assuming orthographic projection. Similarly, Tulsiani\net al. [7] propose a differentiable formulation that enables computing gradients of the 3D shape given\nmultiple observations of depth, normal or pixel color maps from arbitrary views. In our work, we\npropose RenderNet as a powerful model for the likelihood term. To reconstruct 3D shapes from 2D\nimages, we do MAP estimation using our trained rendering network as the likelihood function, in\naddition to a shape prior that is learned from a 3D shape dataset. We show that we can recover not\nonly the pose and shape, but also lighting and texture from a single image.\n\n3 Model\n\nThe traditional computer graphics pipeline renders images from the viewpoint of a virtual pin-hole\ncamera using a common perspective projection. The viewing direction is assumed to be along\nthe negative z-axis in the camera coordinate system. Therefore, the 3D content de\ufb01ned in the\nworld coordinate system needs to be transformed into the camera coordinate system before being\nrendered. The two currently popular rendering methods, rasterization-based rendering and ray tracing,\nprocedurally compute the color of each pixel in the image with two major steps: testing visibility in\nthe scene, and computing shaded color value under an illumination model.\nRenderNet jointly learns both steps of the rendering process from training data, which can be\ngenerated using either rasterization or ray tracing. 
Inspired by the traditional rendering pipeline, we also adopt the world-space-to-camera-space coordinate transformation strategy, and assume that the camera is axis-aligned and looks along the negative z-axis of the volumetric grid that discretizes the input shape. Instead of having the network learn operations that are differentiable and easy to implement, such as rigid-body coordinate transformation or the interaction of light with surface normals (e.g., assuming a Phong illumination model [24]), we provide most of them explicitly to the network. This allows RenderNet to focus its capacity on more complex aspects of the rendering task, such as recognizing visibility and producing shaded color.

Figure 1: Network architecture. See Section 2 in the supplementary document for details.

RenderNet receives a voxel grid as input, and applies a rigid-body transformation to convert from the world coordinate system to the camera coordinate system. The transformed input, after being trilinearly sampled, is then fed to a CNN with a projection unit to produce a rendered 2D image. RenderNet consists of 3D convolutions, a projection unit that computes the visibility of objects in the scene and projects them onto 2D feature maps, followed by 2D convolutions that compute shading. We train RenderNet using a pixel-space loss between the target image and the output. Optionally, the network can produce normal maps of the 3D input, which can be combined with light sources to illuminate the scene. While the projection unit can easily incorporate orthographic projections, the 3D convolutions can morph the scene and allow for perspective camera views. In future versions of RenderNet, perspective transformation may also be explicitly incorporated into the network.

3.1 Rotation and resampling

Transforming the input via a rigid-body motion ensures that the camera is always in the same canonical pose relative to the voxel grid being rendered.
The transformation is parameterized by the rotations around the y-axis and z-axis, which correspond to the azimuth and elevation, and a distance R that determines the scaling factor, i.e., how close the object is to the camera. We embed the input voxel grid into a larger grid to make sure the object is not cut off after rotation. The total transformation therefore includes scaling, rotation, translation, and trilinear resampling.

3.2 Projection unit

The input of RenderNet is a voxel grid V of dimension H_V × W_V × D_V × C_V (corresponding to height, width, depth, and channel), and the output is an image I of dimension H_I × W_I × C_I (corresponding to height, width and channel). To bridge the disparity between the 3D input and the 2D output, we devise a novel projection unit. The design of this unit is straightforward: it consists of a reshaping layer and a multilayer perceptron (MLP). Max pooling is often used to flatten the 3D input across the depth dimension [4, 5], but this can only create the silhouette map of the 3D shape. The projection unit, on the other hand, learns not only to perform projection, but also to determine the visibility of different parts of the 3D input along the depth dimension after projection.
For the reshaping step of the unit, we collapse the depth dimension with the feature maps to map the incoming 4D tensor to a squeezed 3D tensor V′ of dimension W × H × (D·C). This is immediately followed by an MLP, which is capable of learning more complex structure within the local receptive field than a conventional linear filter [13]. We apply the MLP on each (D·C) vector, which we implement using a 1×1 convolution in this project.
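As a concrete sketch, the reshaping step and the per-pixel MLP (the 1×1 convolution) can be written in a few lines of NumPy. The grid size, the output channel count K = 8, and the fixed PReLU slope of 0.2 are illustrative assumptions, not the paper's settings (in the paper the PReLU slope is a learnt parameter):

```python
import numpy as np

def projection_unit(v, w, b):
    """Sketch of the projection unit: collapse depth into channels,
    then apply a per-pixel MLP (equivalent to a 1x1 convolution).

    v: voxel feature tensor of shape (H, W, D, C)
    w: MLP weights of shape (D * C, K); b: bias of shape (K,)
    Returns a 2D feature map of shape (H, W, K).
    """
    h, wd, d, c = v.shape
    v2 = v.reshape(h, wd, d * c)           # reshaping step: squeeze depth into channels
    out = v2 @ w + b                       # 1x1 convolution as a matmul over the (D*C) axis
    return np.where(out > 0, out, 0.2 * out)  # parametric ReLU (slope fixed at 0.2 here)

# toy example: a 4x4x4 grid with one channel, projected to K = 8 feature maps
rng = np.random.default_rng(0)
v = rng.standard_normal((4, 4, 4, 1))
w = rng.standard_normal((4 * 1, 8))
out = projection_unit(v, w, np.zeros(8))
print(out.shape)  # (4, 4, 8)
```

Unlike max pooling over depth, the learnt weights w can express depth-dependent decisions, which is what lets the unit approximate visibility rather than a mere silhouette.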
The reshaping step allows each unit of the MLP to access the features across different channels and the depth dimension of the input, enabling the network to learn the projection operation and visibility computation along the depth axis. Given the squeezed 3D tensor V′ with (D·C) channels, the projection unit produces a 3D tensor with K channels as follows:

    I_{i,j,k} = f( Σ_{dc} w_{k,dc} · V′_{i,j,dc} + b_k )    (1)

where i, j are pixel coordinates, k is the image channel, dc indexes the squeezed depth–channel dimension (d and c being the depth and channel dimensions of the original 4D tensor, respectively), and f is a non-linear function (parametric ReLU in our experiments).

3.3 Extending RenderNet

We can combine RenderNet with other networks to handle more rendering parameters and perform more complex tasks such as shadow rendering or texture mapping. We model a conditional renderer p(I | V, h), where h can be an extra rendering parameter such as lights, or a spatially-varying parameter such as texture.
Here we demonstrate the extensibility of RenderNet using the example of the Phong illumination model [24]. The per-pixel shaded color is calculated by S = max(0, l⃗ · n⃗ + a), where l⃗ is the unit light direction vector, n⃗ is the normal vector, whose components are encoded by the RGB channels of the normal map, and a is an ambient constant. The shading S and albedo map A are then combined to create the final image I via I = A ⊙ S [34].
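The Phong shading and composition equations can be sketched directly in NumPy; the image resolution, light direction, and albedo values below are arbitrary illustrative choices:

```python
import numpy as np

def phong_compose(normal_map, albedo_map, light_dir, ambient=0.1):
    """Per-pixel shading S = max(0, l.n + a), then composition I = A * S.

    normal_map: (H, W, 3) unit normals; albedo_map: (H, W, 3); light_dir: (3,).
    """
    l = np.asarray(light_dir, dtype=float)
    l = l / np.linalg.norm(l)                      # unit light direction
    s = np.maximum(0.0, normal_map @ l + ambient)  # shading term S
    return albedo_map * s[..., None]               # element-wise (Hadamard) product A * S

# toy example: a flat surface facing the light, so S = 1 everywhere
n = np.zeros((2, 2, 3)); n[..., 2] = 1.0
a = np.full((2, 2, 3), 0.5)
img = phong_compose(n, a, light_dir=[0, 0, 1], ambient=0.0)
print(img[0, 0])  # [0.5 0.5 0.5]
```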
This is illustrated in Section 4.1, where we combine the albedo map and normal map rendered by the combination of a texture-mapping network and RenderNet to render shaded images of faces.

4 Experiments

To explore the generality of RenderNet, we test our method on both computer graphics and vision tasks. First, we experiment with rendering tasks of varying degrees of complexity, including challenging cases such as texture mapping and surface relighting. Second, we experiment with vision applications such as image-based pose and shape reconstruction.
Datasets  We use the chair dataset from ShapeNet Core [35]. Apart from being one of the categories with the largest number of data points (6778 objects), the chair category also has large intra-class variation. We convert the ShapeNet dataset to 64×64×64 voxel grids using volumetric convolution [36]. We randomly sample 120 views of each object to render training images at 512×512 resolution. The elevation and azimuth are uniformly sampled from [10, 170] degrees and [0, 359] degrees, respectively. Camera radii are set at 3 to 6.3 units from the origin, with the object's axis-aligned bounding box normalized to 1 unit length. For the texture-mapping task, we generate 100,000 faces from the Basel Face Dataset [37], and render them with azimuths between [220, 320] degrees and elevations between [70, 110] degrees. We use Blender3D to generate the Ambient Occlusion (AO) dataset, and VTK for the other datasets. For the contour dataset, we implemented the pixel-based suggestive contour [25] algorithm in VTK.
Training  We adopt a patch training strategy to speed up the training process. We train the network using randomly cropped samples (along the width and height dimensions) of the training voxel grids, while keeping the depth and channel dimensions intact. We only use the full-sized voxel grid input during inference.
The patch size starts as small as 1/8 of the full-sized grid, and progressively increases towards 1/2 of the full-sized grid at the end of the training.
We train RenderNet using a pixel-space regression loss. We use a mean squared error loss for colored images, and binary cross entropy for grayscale images. We use the Adam optimizer [38] with a learning rate of 0.00001.
Code, data and trained models will be available at: https://github.com/thunguyenphuoc/RenderNet.

4.1 Learning to render and apply texture

Figure 2: Left: Different types of shaders generated by RenderNet (input at the top). Right: Comparing Phong shading between RenderNet, a standard OpenGL mesh renderer, and a standard Marching Cubes algorithm. RenderNet produces competitive results with the OpenGL mesh renderer without suffering from mesh artefacts (notice the seating pad of chair (c) or the leg of chair (d) in the mesh renderer), and does not suffer from low-resolution input like Marching Cubes.

Figure 2 shows that RenderNet is able to learn different types of shaders, including Phong shading, contour line shading, complex multi-pass shading (cartoon shading), and a ray-tracing effect (Ambient Occlusion) with the same network architecture. RenderNet was trained on datasets for each of these shaders, and the figure shows outputs generated for unseen test 3D shapes. We report the PSNR score for each shader in Figure 5.
RenderNet generalizes well to shapes of unseen categories. While it was trained on chairs, it can also render non-man-made objects such as the Stanford Bunny and Monkey (Figure 3). The method also works very well when there are multiple objects in the scene, suggesting that the network correctly resolves the visibility of the objects in the scene.
RenderNet can also handle corrupted or low-resolution volumetric data.
For example, Figure 3 shows that the network is able to produce plausible renderings of the Bunny when the input model was artificially corrupted by adding 50% random noise. When the input model is downsampled (here we linearly downsampled the input by 50%), RenderNet can still render a high-resolution image with smooth details. This is advantageous compared to traditional computer graphics mesh rendering, which requires a clean and high-quality mesh in order to achieve good rendered results.
It is also straightforward to combine RenderNet with other modules for tasks such as mapping and rendering texture (Figure 4). We create a texture-mapping network to map a 1D texture vector representation (the PCA coefficients for generating albedo texture using the Basel Face dataset) to a 3D representation of the texture that has the same width, height and depth as the shape input. This output is concatenated along the channel dimension with the input 3D shape before being given to RenderNet to render the albedo map. This is equivalent to assigning a texture value to the corresponding voxel in the binary shape voxel grid. We also add another output branch of 2D convolutions to RenderNet to render the normal map. The albedo map and the normal map produced by RenderNet are then combined to create shaded renderings of faces, as described in Section 3.3. See Section 2.3 in the supplementary document for network architecture details.

4.2 Architecture comparison

In this section, we compare RenderNet with two baseline encoder-decoder architectures for rendering Phong-shaded images. Similar to RenderNet, the networks receive the 3D shape, pose, light position and light intensity as input. In contrast to RenderNet, the 3D shape given to the alternative networks is in the canonical pose, and the networks have to learn to transform the 3D input to the given pose.
The first network follows the network architecture by Dosovitskiy et al.
[16], which consists of a series of fully-connected layers and up-convolution layers. The second network is similar but has a deeper decoder than the first one by adding residual blocks. For the 3D shape, we use an encoding network to map the input to a latent shape vector (refer to Section 2.2 in the supplementary document for details). We call these two networks EC and EC-Deep, respectively. These networks are trained directly on shaded images with a binary cross-entropy loss, using the chair category from ShapeNet. RenderNet, on the other hand, first renders the normal map, and combines this with the lighting input to create the shaded image using the shading equation in Section 3.3.

Figure 3: Generalization. Even with input from unseen categories or of low quality, RenderNet can still produce good results in different styles (left) and from different views (right).

Figure 4: Rendering texture and manipulating rendering inputs. Best viewed in color.

As shown in Figure 5, the alternative model (here we show the EC model) fails to produce important details of the objects and achieves a lower PSNR score on the Phong-shaded chair dataset. More importantly, this architecture "remembers" the global structure of the objects and fails to generalize to objects of unseen category due to the use of the fully connected layers. In contrast, our model is better for rendering tasks as it generalizes well to different categories of shapes and scenes.

4.3 Shape reconstruction from images

Here we demonstrate that RenderNet can be used for single-image reconstruction. It achieves this goal via an iterative optimization that minimizes the following reconstruction loss:

    minimize_{z, θ, φ, η}  ‖I − f(z, θ, φ, η)‖²    (2)

where I is the observed image and f is our pre-trained RenderNet.
z is the shape to reconstruct, θ and η are the pose and lighting parameters, and φ is the texture variable. In essence, this process maximizes the likelihood of observing the image I given the shape z.

Render style         PSNR
RenderNet Phong      25.39
EC Phong             24.21
EC-Deep Phong        20.88
RenderNet Contour    19.70
RenderNet Toon       17.77
RenderNet AO         22.37
RenderNet Face       27.43

Figure 5: Left: Architecture comparison in different tasks: a) Novel-view synthesis, b) Relighting and c) Generalization. Right: PSNR score of different shaders, including the two alternative architectures.

However, directly minimizing this loss often leads to noisy, unstable results (shown in Figure 2 in the supplementary document). In order to improve the reconstruction, we use a shape prior for regularizing the process: a pre-trained 3D auto-encoder similar to the TL-embedding network [39] with 80000 shapes. Instead of optimizing z, we optimize its latent representation z′:

    minimize_{z′, θ, φ′, η}  ‖I − f(g(z′), θ, h(φ′), η)‖²    (3)

where g is the decoder of the 3D auto-encoder. It regularizes the reconstructed shape g(z′) by using the prior shape knowledge (weights in the decoder) for shape generation. Similarly, we use the decoder h that was trained with RenderNet for the texture rendering task in Section 4.1 to regularize the texture variable φ′. This corresponds to MAP estimation, where the prior term is the shape decoder and the likelihood term is given by RenderNet.
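The reconstruction-by-optimization idea can be illustrated with a deliberately tiny sketch. Here the "renderer" is a fixed linear map with orthonormal columns, the decoder is the identity, and the gradient is taken by finite differences so the example runs with plain NumPy; the paper instead backpropagates through the trained RenderNet and shape decoder:

```python
import numpy as np

def reconstruct(image, decoder, render, z0, lr=0.1, steps=200, eps=1e-4):
    """Toy sketch of reconstruction by optimization: adjust a latent code z'
    so that render(decoder(z')) matches the observed image. Uses a
    finite-difference gradient of the squared pixel loss (toy scale only)."""
    z = np.array(z0, dtype=float)
    for _ in range(steps):
        base = np.sum((image - render(decoder(z))) ** 2)
        grad = np.zeros_like(z)
        for i in range(z.size):
            zp = z.copy()
            zp[i] += eps
            grad[i] = (np.sum((image - render(decoder(zp))) ** 2) - base) / eps
        z -= lr * grad
    return z

# toy "renderer": a fixed linear map with orthonormal columns; identity "decoder"
rng = np.random.default_rng(1)
A, _ = np.linalg.qr(rng.standard_normal((6, 3)))
z_true = np.array([1.0, -2.0, 0.5])
observed = A @ z_true
z_hat = reconstruct(observed, decoder=lambda z: z, render=lambda x: A @ x, z0=np.zeros(3))
print(np.allclose(z_hat, z_true, atol=1e-2))  # True
```

Optimizing in a learnt latent space, as in Eq. (3), plays the role of the prior: the decoder constrains g(z′) to plausible shapes, which the raw pixel loss alone does not.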
Note that it is straightforward to extend this method to the multi-view reconstruction task by summing over multiple per-image losses with shared shape and appearance.
We compare RenderNet with DC-IGN by Kulkarni et al. [29] in Figure 6. DC-IGN learns to decompose images into a graphics code Z, a disentangled representation containing a set of latent variables for shape, pose and lighting, which allows these properties to be manipulated to generate novel views or perform image relighting. In contrast to their work, we explicitly reconstruct the 3D geometry, pose, lighting and texture, which greatly improves tasks such as out-of-plane rotation, and allows us to do re-texturing. We also generate results at much higher resolution (512×512) compared to DC-IGN (150×150). Our results show that having an explicit reconstruction not only creates sharper images with a higher level of detail in the novel-view prediction task, but also gives us more control in the relighting task, such as over light color, brightness, or light position (here we manipulate the elevation and azimuth of the light position), and especially in the re-texturing task.
For the face dataset, we report an Intersection-over-Union (IoU) between the ground-truth and reconstructed voxel grids of 42.99 ± 0.64 (95% confidence interval). We also perform the same experiment for the chair dataset; refer to Section 1 in the supplementary material for implementation details and additional results.

Figure 6: Image-based reconstruction. We show both the reconstructed images and normal maps from a single image. The cross indicates a factor not learnt by the network. Note: for the re-texturing task, we only show the albedo to visualize the change in texture more clearly.
Best viewed in color.

5 Discussion and conclusion

In this paper, we presented RenderNet, a convolutional differentiable rendering network that can be trained end-to-end with a pixel-space regression loss. Despite the simplicity of the network architecture and the projection unit, our experiments demonstrate that RenderNet successfully performs rendering and inverse rendering. Moreover, as shown in Section 4.1, there is the potential to combine different shaders in one network that shares the same 3D convolutions and projection unit, instead of training different networks for different shaders. This opens up room for improvement and exploration, such as extending RenderNet to work with unlabelled data, using other losses such as adversarial or perceptual losses, or combining RenderNet with other architectures, such as U-Net or a multi-scale architecture in which the projection unit is used at different resolutions. Another interesting possibility is to combine RenderNet with a style-transfer loss for the stylization of 3D renderings.
The real world is three-dimensional, yet the majority of current image-synthesis CNNs, such as GANs [17] or DC-IGN [29], operate only in 2D feature space and make almost no assumptions about the 3D world. Although these methods yield impressive results, we believe that a more geometrically grounded approach can greatly improve the performance and fidelity of the generated images, especially for tasks such as novel-view synthesis, or more fine-grained editing tasks such as texture editing. For example, instead of having a GAN generate images from a noise vector via 2D convolutions, a GAN using RenderNet could first generate a 3D shape, which is then rendered to create the final image.
We hope that RenderNet can draw more attention to the computer graphics literature, especially to geometry-grounded approaches, and inspire future developments in computer vision.

Acknowledgments

We thank Christian Richardt for helpful discussions. We thank Lucas Theis for helpful discussions and feedback on the manuscript. This work was supported in part by the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 665992, the UK's EPSRC Centre for Doctoral Training in Digital Entertainment (CDE), EP/L016540/1, and CAMERA, the RCUK Centre for the Analysis of Motion, Entertainment Research and Applications, EP/M023281/1. We also received GPU support from Lambda Labs.

References

[1] Danilo Jimenez Rezende, S. M. Ali Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, and Nicolas Heess. Unsupervised learning of 3d structure from images. In NIPS, pages 4996–5004, 2016.

[2] Matthew M. Loper and Michael J. Black. OpenDR: An approximate differentiable renderer. In ECCV, pages 154–169, 2014.

[3] Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. Neural 3d mesh renderer. In IEEE CVPR, 2018.

[4] Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In NIPS, pages 1696–1704, 2016.

[5] Rui Zhu, Hamed Kiani Galoogahi, Chaoyang Wang, and Simon Lucey. Rethinking reprojection: Closing the loop for pose-aware shape reconstruction from a single image. In IEEE CVPR, pages 57–65, 2017.

[6] Jiajun Wu, Yifan Wang, Tianfan Xue, Xingyuan Sun, William T Freeman, and Joshua B Tenenbaum. MarrNet: 3D shape reconstruction via 2.5D sketches. In NIPS, 2017.

[7] Shubham Tulsiani, Tinghui Zhou, Alexei A. Efros, and Jitendra Malik.
Multi-view supervision for single-view reconstruction via differentiable ray consistency. In IEEE CVPR, pages 209–217, 2017.

[8] Paul Henderson and Vittorio Ferrari. Learning to generate and reconstruct 3d meshes with only 2d supervision. In British Machine Vision Conference (BMVC), 2018.

[9] Kyle Genova, Forrester Cole, Aaron Maschinot, Aaron Sarna, Daniel Vlasic, and William T. Freeman. Unsupervised training for 3d morphable model regression. In IEEE CVPR, June 2018.

[10] Abhijit Kundu, Yin Li, and James M. Rehg. 3d-rcnn: Instance-level 3d object reconstruction via render-and-compare. In IEEE CVPR, 2018.

[11] E. Richardson, M. Sela, R. Or-El, and R. Kimmel. Learning detailed face reconstruction from a single image. In IEEE CVPR, pages 5553–5562, July 2017. doi: 10.1109/CVPR.2017.589.

[12] JunYoung Gwak, Christopher B Choy, Manmohan Chandraker, Animesh Garg, and Silvio Savarese. Weakly supervised 3d reconstruction with adversarial constraint. In International Conference on 3D Vision (3DV), 2017.

[13] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. In ICLR, 2014.

[14] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In IEEE CVPR, pages 2261–2269, 2017.

[15] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), volume 9351, pages 234–241, 2015.

[16] Alexey Dosovitskiy, Jost Tobias Springenberg, Maxim Tatarchenko, and Thomas Brox. Learning to generate chairs, tables and cars with convolutional networks. IEEE TPAMI, 39(4):692–705, 2017.

[17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio.
Generative adversarial nets. In NIPS, pages 2672–2680, 2014.

[18] Qifeng Chen and Vladlen Koltun. Photographic image synthesis with cascaded refinement networks. In IEEE ICCV, pages 1520–1529, 2017.

[19] Oliver Nalbach, Elena Arabadzhiyska, Dushyant Mehta, Hans-Peter Seidel, and Tobias Ritschel. Deep shading: Convolutional neural networks for screen-space shading. Computer Graphics Forum (Proc. EGSR), 36(4):65–78, 2017.

[20] Jonathan T Barron and Jitendra Malik. Shape, illumination, and reflectance from shading. TPAMI, 2015.

[21] Tatsunori Taniai and Takanori Maehara. Neural inverse rendering for general reflectance photometric stereo. In ICML, volume 80 of JMLR Workshop and Conference Proceedings, pages 4864–4873, 2018.

[22] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs. In IEEE ICCV, pages 2107–2115, 2017.

[23] Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, and Xin Tong. O-cnn: Octree-based convolutional neural networks for 3d shape analysis. ACM TOG (Siggraph), 36(4):72:1–72:11, 2017.

[24] Bui Tuong Phong. Illumination for computer generated pictures. Commun. ACM, 18(6):311–317, 1975.

[25] Doug DeCarlo, Adam Finkelstein, Szymon Rusinkiewicz, and Anthony Santella. Suggestive contours for conveying shape. ACM TOG (Siggraph), 22(3):848–855, July 2003.

[26] Holger Winnemöller, Sven C. Olsen, and Bruce Gooch. Real-time video abstraction. ACM TOG (Siggraph), 25(3):1221–1226, 2006.

[27] Gavin Miller. Efficient algorithms for local and global accessibility shading. In Proc.
ACM SIGGRAPH, pages 319–326, 1994.

[28] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In IEEE CVPR, pages 5967–5976, 2017.

[29] Tejas D Kulkarni, William F. Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In NIPS, pages 2539–2547, 2015.

[30] Eunbyung Park, Jimei Yang, Ersin Yumer, Duygu Ceylan, and Alexander C. Berg. Transformation-grounded image generation network for novel 3d view synthesis. In IEEE CVPR, pages 702–711, 2017.

[31] K. Rematas, C. H. Nguyen, T. Ritschel, M. Fritz, and T. Tuytelaars. Novel views of objects from a single image. IEEE Trans. Pattern Anal. Mach. Intell., 39(8):1576–1590, 2017.

[32] Jimei Yang, Scott E Reed, Ming-Hsuan Yang, and Honglak Lee. Weakly-supervised disentangling with recurrent transformations for 3d view synthesis. In NIPS, pages 1099–1107, 2015.

[33] Hao Su, Fan Wang, Li Yi, and Leonidas J. Guibas. 3d-assisted image feature synthesis for novel views of an object. CoRR, abs/1412.0003, 2014.

[34] Berthold K. P. Horn. Determining lightness from an image. Computer Graphics and Image Processing, 3(4):277–299, 1974.

[35] Angel X. Chang, Thomas A. Funkhouser, Leonidas J. Guibas, Pat Hanrahan, Qi-Xing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. Shapenet: An information-rich 3d model repository. CoRR, abs/1512.03012, 2015.

[36] F. S. Nooruddin and G. Turk. Simplification and repair of polygonal models using volumetric techniques. IEEE Trans. on Vis. and Comp. Graphics, 9(2):191–205, 2003.

[37] Pascal Paysan, Reinhard Knothe, Brian Amberg, Sami Romdhani, and Thomas Vetter. A 3d face model for pose and illumination invariant face recognition.
In Proceedings of the 2009 Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance, pages 296–301, 2009.

[38] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

[39] Rohit Girdhar, David F. Fouhey, Mikel Rodriguez, and Abhinav Gupta. Learning a predictable and generative vector representation for objects. In ECCV, pages 484–499, 2016.