{"title": "Self-Supervised Intrinsic Image Decomposition", "book": "Advances in Neural Information Processing Systems", "page_first": 5936, "page_last": 5946, "abstract": "Intrinsic decomposition from a single image is a highly challenging task, due to its inherent ambiguity and the scarcity of training data. In contrast to traditional fully supervised learning approaches, in this paper we propose learning intrinsic image decomposition by explaining the input image. Our model, the Rendered Intrinsics Network (RIN), joins together an image decomposition pipeline, which predicts reflectance, shape, and lighting conditions given a single image, with a recombination function, a learned shading model used to recompose the original input based off of intrinsic image predictions. Our network can then use unsupervised reconstruction error as an additional signal to improve its intermediate representations. This allows large-scale unlabeled data to be useful during training, and also enables transferring learned knowledge to images of unseen object categories, lighting conditions, and shapes. Extensive experiments demonstrate that our method performs well on both intrinsic image decomposition and knowledge transfer.", "full_text": "Self-Supervised Intrinsic Image Decomposition\n\nMichael Janner\n\nMIT\n\nJiajun Wu\n\nMIT\n\nTejas D. Kulkarni\n\nDeepMind\n\njanner@mit.edu\n\njiajunwu@mit.edu\n\ntejasdkulkarni@gmail.com\n\nIlker Yildirim\n\nMIT\n\nilkery@mit.edu\n\nJoshua B. Tenenbaum\n\nMIT\n\njbt@mit.edu\n\nAbstract\n\nIntrinsic decomposition from a single image is a highly challenging task, due to\nits inherent ambiguity and the scarcity of training data. In contrast to traditional\nfully supervised learning approaches, in this paper we propose learning intrinsic\nimage decomposition by explaining the input image. 
Our model, the Rendered\nIntrinsics Network (RIN), joins together an image decomposition pipeline, which\npredicts re\ufb02ectance, shape, and lighting conditions given a single image, with a\nrecombination function, a learned shading model used to recompose the original\ninput based off of intrinsic image predictions. Our network can then use unsu-\npervised reconstruction error as an additional signal to improve its intermediate\nrepresentations. This allows large-scale unlabeled data to be useful during train-\ning, and also enables transferring learned knowledge to images of unseen object\ncategories, lighting conditions, and shapes. Extensive experiments demonstrate\nthat our method performs well on both intrinsic image decomposition and knowl-\nedge transfer.\n\n1\n\nIntroduction\n\nThere has been remarkable progress in computer vision, particularly for answering questions such\nas \u201cwhat is where?\u201d given raw images. This progress has been possible due to large labeled training\nsets and representation learning techniques such as convolutional neural networks [LeCun et al.,\n2015]. However, the general problem of visual scene understanding will require algorithms that\nextract not only object identities and locations, but also their shape, re\ufb02ectance, and interactions\nIntuitively disentangling the contributions from these three components, or intrinsic\nwith light.\nimages, is a major triumph of human vision and perception. Conferring this type of intuition to an\nalgorithm, though, has proven a dif\ufb01cult task, constituting a major open problem in computer vision.\nThis problem is challenging in particular because it is fundamentally underconstrained. Consider\nthe porcelain vase in Figure 1a. Most individuals would have no dif\ufb01culty identifying the true colors\nand shape of the vase, along with estimating the lighting conditions and the resultant shading on the\nobject, as those shown in 1b. 
However, the alternatives in 1c, which posits a \ufb02at shape, and 1d, with\nunnatural red lighting, are entirely consistent in that they compose to form the correct observed vase\nin 1a.\nThe task of \ufb01nding appropriate intrinsic images for an object is then not a question of simply \ufb01nd-\ning a valid answer, as there are countless factorizations that would be equivalent in terms of their\nrendered combination, but rather of \ufb01nding the most probable answer. Roughly speaking, there are\ntwo methods of tackling such a problem: a model must either (1) employ handcrafted priors on the\nre\ufb02ectance, shape, and lighting conditions found in the natural world in order to assign probabilities\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fFigure 1: A porcelain vase (a) along with three predictions (b-d) for its underlying intrinsic images.\nThe set in (c) assumes the contribution from shading is negligible by predicting a completely \ufb02at\nrather than rounded shape. The re\ufb02ectance is therefore indistinguishable from the observed image.\nThe set in (d) includes the correct shape but assumes red lighting and a much brighter blue color in\nthe regions affected by shading. While the decomposition in (b) is much more intuitively pleasing\nthan either of these alternatives, all of these options are valid in that they combine to exactly form\nthe observed vase. (e) shows a sphere with our visualized normals map as a shape reference.\n\nto intrinsic image proposals or (2) have access to a library of ground truth intrinsic images and their\ncorresponding composite images.\nUnfortunately, there are limitations to both methods. Although there has been success with the\n\ufb01rst route in the past [Barron and Malik, 2015], strong priors are often dif\ufb01cult to hand-tune in a\ngenerally useful fashion. 
On the other hand, requiring access to complete, high quality ground truth\nintrinsic images for real world scenes is also limiting, as creating such a training set requires an\nenormous amount of human effort and millions of crowd-sourced annotations [Bell et al., 2014].\nIn this paper, we propose a deep structured autoencoder, the Rendered Intrinsics Network (RIN),\nthat disentangles intrinsic image representations and uses them to reconstruct the input. The de-\ncomposition model consists of a shared convolutional encoder for the observation and three separate\ndecoders for the re\ufb02ectance, shape, and lighting. The shape and lighting predictions are used to\ntrain a differentiable shading function. The output of the shader is combined with the re\ufb02ectance\nprediction to reproduce the observation. The minimal structure imposed in the model \u2013 namely, that\nintrinsic images provide a natural way of disentangling real images and that they provide enough\ninformation to be used as input to a graphics engine \u2013 makes RIN act as an autoencoder with useful\nintermediate representations.\nThe structure of RIN also exploits two natural sources of supervision: one applied to the interme-\ndiate representations themselves, and the other to the reconstructed image. This provides a way for\nRIN to improve its representations with unlabeled data. By avoiding the need for intrinsic image\nlabels for all images in the dataset, RIN can adapt to new types of inputs even in the absence of\nground truth data. We demonstrate the utility of this approach in three transfer experiments. RIN is\n\ufb01rst trained on a simple set of \ufb01ve geometric primitives in a supervised manner and then transferred\nto common computer vision test objects. Next, RIN is trained on a dataset with a skewed underlying\nlighting distribution and \ufb01lls in the missing lighting conditions on the basis of unlabeled observa-\ntions. 
Finally, RIN is trained on a single Shapenet category and then transferred to a separate, highly\ndissimilar category.\nOur contributions are three-fold. First, we propose a novel formulation for intrinsic image decom-\nposition, incorporating a differentiable, unsupervised reconstruction loss into the loop. Second, we\ninstantiate this approach with the RIN, a new model that uses convolutional neural networks for both\nintrinsic image prediction and recombination via a learned shading function. This is also the \ufb01rst\nwork to apply deep learning to the full decomposition into re\ufb02ectance, shape, lights, and shading,\nas prior work has focused on the re\ufb02ectance-shading decomposition. Finally, we show that RIN can\nmake use of unlabeled data to improve its intermediate intrinsic image representations and transfer\nknowledge to new objects unseen during training.\n\n2\n\nabcdeed\fFigure 2: RIN contains two convolutional encoder-decoders, one used for predicting the intrinsic\nimages from an input and another for predicting the shading stemming from a light source applied\non a shape. The two networks together function as a larger structured autoencoder, forcing a speci\ufb01c\ntype of intermediate representation in order to reconstruct the input image.\n\n2 Related Work\n\nIntrinsic images were introduced by Barrow and Tenenbaum as useful mid-level scene descrip-\ntors [Barrow and Tenenbaum, 1978]. The model posits that an image can be expressed as the\npointwise product between contributions from the true colors of an object, or its re\ufb02ectance, and\ncontributions from the shading on that object:\n\nimage I = re\ufb02ectance R \u00b7 shading S\n\n(1)\n\nDecomposing one step further, the shading is expressed as some function of an object\u2019s shape and\nthe ambient lighting conditions. 
The exact nature of this shading function varies by implementation.\nEarly work on intrinsic image decomposition was based on insights from Land\u2019s Retinex The-\nory [Land and McCann, 1971]. Horn [1974] separated images into true colors and shading using the\nassumption that large image gradients tend to correspond to re\ufb02ectance changes and small gradients\nto lighting changes. While this assumption works well for a hypothetical Mondrian World of \ufb02at\ncolors, it does not always hold for natural images. In particular, Weiss [2001] found that this model\nof re\ufb02ectance and lighting is rarely true for outdoor scenes.\nMore recently, Barron and Malik [2015] developed an iterative algorithm called SIRFS that max-\nimizes the likelihood of intrinsic image proposals under priors derived from regularities in natural\nimages. SIRFS proposes shape and lighting estimates and combines them via a spherical harmonics\nrenderer to produce a shading image. Lombardi and Nishino [2012, 2016] and Oxholm and Nishino\n[2016] proposed a Bayesian formulation of such an optimization procedure, also formulating priors\nbased on the distribution of material properties and the physics of lighting in the real world. Re-\nsearchers have also explored reconstructing full 3D shapes through intrinsic images by making use\nof richer generative models [Kar et al., 2015, Wu et al., 2017].\nTang et al. [2012] combined Lambertian re\ufb02ectance assumptions with Deep Belief Networks to\nlearn a prior over the re\ufb02ectance of greyscale images and applied their Deep Lambertian Network\nto one-shot face recognition. Narihira et al. [2015b] applied deep learning to intrinsic images \ufb01rst\nusing human judgments on real images and later in the context of animated movie frames [Narihira\net al., 2015a]. Rematas et al. [2016] and Hold-Geoffroy et al. 
[2017] also used convolutional neural\nnetworks to estimate re\ufb02ectance maps and illumination parameters, respectively, in unconstrained\noutdoor settings.\nInnamorati et al. [2017] generalized the intrinsic image decomposition by considering the contribu-\ntions of specularity and occlusion in a direction-dependent model. Shi et al. [2017] found improved\nperformance in the full decomposition by incorporating skip layer connections [He et al., 2016] in\nthe network architecture, which were used to generate much crisper images. Our work can be seen\n\n3\n\n\fFigure 3: In contrast to simple Lambertian shading techniques, our learned shading model can\nhandle shadows cast between objects. Inputs to the shader are shape and lighting parameter pairs.\n\nas a further extension of these models which aims to relax the need for a complete set of ground\ntruth data by modeling the image combination process, as in Nalbach et al. [2017].\nIncorporating a domain-speci\ufb01c decoder to reconstruct input images has been explored by Hinton\net al. in their transforming autoencoders [Hinton et al., 2011], which also learned natural represen-\ntations of images in use by the vision community. Our work differs in the type of representation in\nquestion, namely images rather than descriptors like af\ufb01ne transformations or positions. Kulkarni\net al. [2015] were also interested in learning disentangled representations in an autoencoder, which\nthey achieved by selective gradient updates during training. Similarly, Chen et al. [2016] showed\nthat a mutual information objective could drive disentanglement of a deep network\u2019s intermediate\nrepresentation.\n\n3 Model\n\n3.1 Use of Reconstruction\n\nRIN differs most strongly with past work in its use of the reconstructed input. Other approaches\nhave fallen into roughly two groups in this regard:\n\n1. Those that solve for one of the intrinsic images to match the observed image. 
SIRFS, for ex-\nample, predicts shading and then solves equation 1 for re\ufb02ectance given its prediction and the\ninput [Barron and Malik, 2015]. This ensures that the intrinsic image estimations combine to\nform exactly the observed image, but also deprives the model of any reconstruction error.\n\n2. Data-driven techniques that rely solely on ground truth labelings [Narihira et al., 2015a, Shi\net al., 2017]. These approaches assume access to ground truth labels for all inputs and do not\nexplicitly model the reconstruction of the input image based on intrinsic image predictions.\n\nMaking use of the reconstruction for this task has been previously unexplored because such an\nerror signal can be dif\ufb01cult to interpret. Just as the erroneous intrinsic images in Fig 1c-d combine\nto reconstruct the input exactly, one cannot assume that low reconstruction error implies accurate\nintrinsic images. An even simpler degenerate solution that yields zero reconstruction error is:\n\n\u02c6R = I\n\nand\n\n\u02c6S = 11T ,\n\n(2)\n\nwhere \u02c6S is the all-ones matrix. It is necessary to further constrain the predictions such that the model\ndoes not converge to such explanations.\n\n3.2 Shading Engine\n\nRIN decomposes an observation into re\ufb02ectance, shape, and lighting conditions. As opposed to\nmodels which estimate only re\ufb02ectance and shading, which may make direct use of Equation 1\nto generate a reconstruction, we must employ a function that transforms our shape and lighting\npredictions into a shading estimate. 
Linear Lambertian assumptions could reduce such a function to\na straightforward dot product, but would produce a shading function incapable of modeling lighting\nconditions that drastically change across an image or ray-tracing for the purposes of casting shadows.\n\n4\n\nInputsShader OutputsInputsShader Outputs\fFigure 4: Our shading model\u2019s outputs after training only on synthetic car models from the ShapeNet\ndataset [Chang et al., 2015]. (a) shows the effect of panning the light horizontally and (b) shows\nthe effect of changing the intensity of the light. The input lights are visualized by rendering them\nonto a sphere. Even though the shader was trained only on synthetic data, it generalizes well to real\nshapes with no further training. The shape input to (c) is an estimated normals map of a Beethoven\nbust [Qu\u00b4eau and Durou, 2015].\n\nInstead, we opt to learn a shading model. Such a model is not limited in the way that a pre-de\ufb01ned\nshading function would be, as evidenced by shadows cast between objects in Fig 3. Learning a\nshader also has the bene\ufb01t of allowing for different representations of lighting conditions. In our\nexperiments, lights are de\ufb01ned by a position in three-dimensional space and a magnitude, but alter-\nnate representations such as the radius, orientation, and color of a spotlight could be just as easily\nadopted. For work that employs the shading engine from SIRFS [Barron and Malik, 2015] instead\nof learning a shader in a similar disentanglement context, see Shu et al. [2017]. The SIRFS engine\nrepresents lights as spherical harmonics coef\ufb01cient vectors.\n\n3.3 Architecture\n\nOur model consists of two convolutional encoder-decoder networks, the \ufb01rst of which predicts in-\ntrinsic images from an observed image, and the second of which approximates the shading process of\na rendering engine. Both networks employ mirror-link connections introduced by Shi et al. 
[2017],\nwhich connect layers of the encoder and decoder of the same size. These connections yield sharper\nresults than the blurred outputs characteristic of many deconvolutional models.\nThe \ufb01rst network has a single encoder for the observation and three separate decoders for the re-\n\ufb02ectance, lighting, and shape. Unlike Shi et al. [2017], we do not link layers between the decoders\nso that it is possible to update the weights of one of the decoders without substantially affecting\nthe others, as is useful in the transfer learning experiments. The encoder has 5 convolutional lay-\ners with {16, 32, 64, 128, 256} \ufb01lters of size 3\u00d73 and stride of 2. Batch normalization [Ioffe and\nSzegedy, 2015] and ReLU activation are applied after every convolutional layer. The layers in the\nre\ufb02ectance and shape decoders have the same number of features as the encoder but in reverse order\nplus a \ufb01nal layer with 3 output channels. Spatial upsampling is applied after the convolutional lay-\ners in the decoders. The lighting decoder is a simple linear layer with an output dimension of four\n(corresponding to a position in three-dimensional space and an intensity of the light).\nThe shape is passed as input to the shading encoder directly. The lighting estimate is passed to\na fully-connected layer with output dimensionality matching that of the shading encoder\u2019s output,\nwhich is concatenated to the encoded shading representation. The shading decoder architecture is\nthe same as that of the \ufb01rst network. The \ufb01nal component of RIN, with no learnable parameters,\nis a componentwise multiplication between the output of the shading network and the predicted\nre\ufb02ectance.\n\n5\n\nabcInputsShader OutputsInputsShader OutputsaInputs\fFigure 5: Intrinsic image prediction from our model on objects from the training category (motor-\nbikes) as well as an example from outside this category (an airplane). 
The quality of the airplane\nintrinsic images is signi\ufb01cantly lower, which is re\ufb02ected in the reconstruction (labeled \u201cRender\u201d in\nthe RIN rows). This allows reconstruction to drive the improvement of the intermediate intrinsic\nimage representations. Predictions from SIRFS are shown for comparison. Note that the re\ufb02ectance\nin SIRFS is de\ufb01ned based on the difference between the observation and shading prediction, so there\nis not an analogous reconstruction.\n\n         Motorbike (Train)              Airplane (Transfer)\n         Re\ufb02ectance  Shape   Lights    Re\ufb02ectance  Shape   Lights\nRIN      0.0021       0.0044  0.1398    0.0042       0.0119  0.4873\nSIRFS    0.0059       0.0094  \u2013         0.0054       0.0080  \u2013\n\nTable 1: MSE of our model and SIRFS on a test set of ShapeNet motorbikes, the category used\nto train RIN, and airplanes, a held-out class. The lighting representation of SIRFS (a vector with\n27 components) is suf\ufb01ciently different from that of our model that we do not attempt to compare\nperformance here directly. Instead, see the visualization of lights in Fig 5.\n\n4 Experiments\n\nRIN makes use of unlabeled data by comparing its reconstruction to the original input image. Be-\ncause our shading model is fully differentiable, as opposed to most shaders that involve ray-tracing,\nthe reconstruction error may be backpropagated to the intrinsic image predictions and optimized via\na standard coordinate ascent algorithm. RIN has one shared encoder for the intrinsic images but\nthree separate decoders, so the appropriate decoder can be updated while the others are held \ufb01xed.\nIn the following experiments, we \ufb01rst train RIN (including the shading model) on a dataset with\nground truth labels for intrinsic images. This is treated as a standard supervised learning problem\nusing mean squared error on the intrinsic image predictions as a loss. 
The model is then trained\nfurther on an additional set of unlabeled data using only reconstruction loss as an error signal.\nWe refer to this as the self-supervised transfer. For both modes of learning, we optimize using\nAdam [Kingma and Ba, 2015].\nDuring transfer, one half of a minibatch will consist of the unlabeled transfer data and the other half will\ncome from the labeled data. This ensures that the representations do not shift too far from those\nlearned during the initial supervised phase, as the underconstrained nature of the problem can drive\nthe model to degenerate solutions. When evaluating our model on test data, we use the outputs of the\nthree decoders and the learned shader directly; we do not enforce that the predictions must explain\nthe input exactly.\n\nFigure 6: Predictions of RIN before (\u201cDirect transfer\u201d) and after (\u201cSelf-supervised\u201d) it adapts to new\nshapes on the basis of unlabeled data.\n\n                 Stanford Bunny    Utah Teapot       Blender Suzanne\n                 Shape   Shading   Shape   Shading   Shape   Shading\nDirect transfer  0.074   0.071     0.036   0.043     0.086   0.104\nSelf-supervised  0.048   0.005     0.029   0.003     0.058   0.007\n\nTable 2: MSE of RIN trained on \ufb01ve geometric primitives before and after self-supervised learning\nof more complicated shapes.\n\nBelow, we demonstrate that our model can effectively transfer to different shapes, lighting condi-\ntions, and object categories without ground truth intrinsic images. 
However, for this unsupervised\ntransfer to yield bene\ufb01ts, there must be a suf\ufb01cient number of examples of the new, unlabeled data.\nFor example, the MIT Intrinsic Images dataset [Grosse et al., 2009], containing twenty real-world\nimages, is not large enough for the unsupervised learning to affect the representations of our model.\nIn the absence of any unsupervised training, our model is similar to that of Shi et al. [2017] adapted\nto predict the full set of intrinsic images.\n\n4.1 Supervised training\n\nData\nThe majority of data was generated from ShapeNet [Chang et al., 2015] objects rendered in\nBlender. For the labeled datasets, the rendered composite images were accompanied by the object\u2019s\nre\ufb02ectance, a map of the surface normals at each point, and the parameters of the lamp used to\nlight the scene. Surface normals are visualized by mapping the XYZ components of normals to\nappropriate RGB ranges. For the following supervised learning experiments, we used a dataset size\nof 40,000 images.\nIntrinsic image decomposition\nThe model in Fig 5 was trained on ShapeNet motorbikes. Al-\nthough it accurately predicts the intrinsic images of the train class, its performance drops when\ntested on other classes. In particular, the shape predictions suffer the most, as they are the most\ndissimilar from anything seen in the training set. Crucially, the poor intrinsic image predictions are\nre\ufb02ected in the reconstruction of the input image. This motivates the use of reconstruction error to\ndrive improvement of intrinsic images when there is no ground truth data.\nShading model\nIn contrast with the intrinsic image decomposition, shading prediction general-\nized well outside of the training set. The shader was trained on the shapes and lights from the same\nset of rendered synthetic cars as above. 
Even though this represents only a narrow distribution over\nshapes, we found that the shader produced plausible predictions for even real-world objects (Fig 4).\nBecause the shader generalized without any further effort, its parameters were never updated during\nself-supervised training. Freezing the parameters of the shader prevents our model from producing\nnonsensical shading images.\n\nFigure 7: Predictions of RIN trained on left-lit images before and after self-supervised learning\non right-lit images. RIN uncovers the updated lighting distribution without external supervision or\nground truth data.\n\n4.2 Shape transfer\n\nData We generated a dataset of \ufb01ve shape primitives (cubes, spheres, cones, cylinders, and\ntoruses) viewed at random orientations using the Blender rendering engine. These images are\nused for supervised training. Three common reference shapes (Stanford bunny, Utah teapot, and\nBlender\u2019s Suzanne) are used as the unlabeled transfer class. To isolate the effects of shape mismatch\nin the labeled versus unlabeled data, all eight shapes were rendered with random monochromatic\nmaterials and a uniform distribution over lighting positions within a contained region of space in\nfront of the object. The datasets consisted of each shape rendered with 500 different colors, with\neach colored shape being viewed at 10 orientations.\nResults\nBy only updating weights for the shape decoder during self-supervised transfer, the pre-\ndictions for held-out shapes improve by 29% (averaged across the three shapes). Because a shape\nonly affects a rendered image via shading, the improvement in shapes comes alongside an improve-\nment in shading predictions as well. 
Shape-speci\ufb01c results are given in Table 2 and visualized in\nFig 6.\n\n4.3 Lighting transfer\n\nData\nCars from the ShapeNet 3D model repository were rendered at random orientations and\nscales. In the labeled data, they were lit only from the left side. In the unlabeled data, they were lit\nfrom both the left and right.\nResults\nBefore self-supervised training on the unlabeled data, the model\u2019s distribution over light-\ning predictions mirrored that of the labeled training set. When tested on images lit from the right,\nthen, it tended to predict centered lighting. After updating the lighting decoder based on recon-\nstruction error from these right-lit images though, the model\u2019s lighting predictions more accurately\nre\ufb02ected the new distribution and lighting mean-squared error reduces by 18%. Lighting predictions,\nalong with reconstructions, for right-lit images are shown in Fig 7.\n\n4.4 Category transfer\n\nIn the previous transfer experiments, only one intrinsic image was mismatched between the labeled\nand unlabeled data, so only one of RIN\u2019s decoders needed updating during transfer. When transfer-\nring between object categories, though, there is not such a guarantee. Although it might be expected\nthat a model trained on suf\ufb01ciently many object categories would learn a generally-useful distribu-\ntion over re\ufb02ectances, it is dif\ufb01cult to ensure that this is the case. We are interested in these sorts of\n\n8\n\nInput / Reference RIN: Direct transferRIN: Self-supervisedRenderLightsInput / Reference RIN: Direct transferRIN: Self-supervised\fFigure 8: RIN was \ufb01rst trained on ShapeNet airplanes and then tested on cars. Because most of the\nairplanes were white, the re\ufb02ectance predictions were washed out even for colorful cars. 
RIN \ufb01xed\nthe mismatch between datasets without any ground truth intrinsic images of cars.\n\n                 Re\ufb02ectance  Shape  Lights  Shading  Render\nDirect transfer  0.019        0.014  0.584   0.065    0.035\nSelf-supervised  0.015        0.014  0.572   0.044    0.006\n\nTable 3: MSE of RIN trained on ShapeNet airplanes before and after self-supervised transfer to cars.\nAlthough RIN improves its shading predictions, these are not necessarily driven by an improvement\nin shape prediction.\n\nscenarios to determine how well self-supervised transfer works when more than one decoder needs\nto be updated to account for unlabeled data.\nData\nDatasets of ShapeNet cars and airplanes were created analogously to those in Section 4.1.\nThe airplanes had a completely different color distribution than the cars as they were mostly white,\nwhereas the cars had a more varied re\ufb02ectance distribution. The airplanes were used as the labeled\ncategory to ensure a mismatch between the train and transfer data.\nResults\nTo transfer to the new category, we allowed updates to all three of the RIN decoders. (The\nshader was left \ufb01xed as usual.) There were pronounced improvements in the shading predictions\n(32%) accompanied by modest improvements in re\ufb02ectances (21%). The improved shading predictions were\nnot always caused by improved shape estimates. Because there is a many-to-one mapping from\nshape to shading (conditioned on a lighting condition), it is possible for the shape predictions to\nworsen in order to improve the shading estimates. The lighting predictions also remained largely\nunchanged, although for the opposite reason: because no lighting regions were intentionally left out\nof the training data, the lighting predictions were adequate on the transfer classes even without\nself-supervised learning.\n\n5 Conclusion\n\nIn this paper, we proposed the Rendered Intrinsics Network for intrinsic image prediction. 
We\nshowed that by learning both the image decomposition and recombination functions, RIN can make\nuse of reconstruction loss to improve its intermediate representations. This allowed unlabeled data\nto be used during training, which we demonstrated with a variety of transfer tasks driven solely\nby self-supervision. When there existed a mismatch between the underlying intrinsic images of\nthe labeled and unlabeled data, RIN could also adapt its predictions in order to better explain the\nunlabeled examples.\n\n9\n\nInput / Reference RIN: Direct transferRIN: Self-supervisedRenderReflectanceShadingInput / Reference RIN: Direct transferRIN: Self-supervised\fAcknowledgements\n\nThis work is supported by ONR MURI N00014-16-1-2007, the Center for Brain, Minds and Ma-\nchines (NSF #1231216), Toyota Research Institute, and Samsung.\n\nReferences\n\nJonathan T Barron and Jitendra Malik. Shape, illumination, and re\ufb02ectance from shading. IEEE TPAMI, 37\n\n(8):1670\u20131687, 2015.\n\nH.G. Barrow and J.M. Tenenbaum. Recovering intrinsic scene characteristics from images. Computer Vision\n\nSystems, 1978.\n\nSean Bell, Kavita Bala, and Noah Snavely. Intrinsic images in the wild. ACM TOG, 33(4):159, 2014.\n\nAngel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio\nSavarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model reposi-\ntory. arXiv preprint arXiv:1512.03012, 2015.\n\nXi Chen, Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan:\nInterpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2016.\n\nRoger Grosse, Micah K. Johnson, Edward H. Adelson, and William T. Freeman. Ground-truth dataset and\n\nbaseline evaluations for intrinsic image algorithms. In ICCV, 2009.\n\nKaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 
In\n\nCVPR, 2016.\n\nGeoffrey E Hinton, Alex Krizhevsky, and Sida D Wang. Transforming auto-encoders. In ICANN, 2011.\n\nYannick Hold-Geoffroy, Kalyan Sunkavalli, Sunil Hadap, Emiliano Gambaretto, and Jean-Francois Lalonde.\n\nDeep outdoor illumination estimation. In CVPR, 2017.\n\nBerthold K.P. Horn. Determining lightness from an image. Computer Graphics and Image Processing, 3:\n\n277\u2013299, 1974.\n\nCarlo Innamorati, Tobias Ritschel, Tim Weyrich, and Niloy J. Mitra. Decomposing single images for layered\n\nphoto retouching. Computer Graphics Forum, 36:15\u201325, 07 2017.\n\nSergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing\n\ninternal covariate shift. In ICML, 2015.\n\nAbhishek Kar, Shubham Tulsiani, Joao Carreira, and Jitendra Malik. Category-speci\ufb01c object reconstruction\n\nfrom a single image. In CVPR, 2015.\n\nDiederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.\n\nTejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse\n\ngraphics network. In NIPS, 2015.\n\nEdwin H. Land and John J. McCann. Lightness and retinex theory. Journal of the Optical Society of America,\n\n61:1\u201311, 1971.\n\nYann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436\u2013444, 2015.\n\nStephen Lombardi and Ko Nishino. Single image multimaterial estimation. In CVPR, 2012.\n\nStephen Lombardi and Ko Nishino. Re\ufb02ectance and illumination recovery in the wild. IEEE TPAMI, 38(1):\n\n129\u2013141, 2016.\n\nOliver Nalbach, Elena Arabadzhiyska, Dushyant Mehta, Hans-Peter Seidel, and Tobias Ritschel. Deep shading:\n\nConvolutional neural networks for screen-space shading. Computer Graphics Forum, 36(4), 2017.\n\nTakuya Narihira, Michael Maire, and Stella X. Yu. Direct intrinsics: Learning albedo-shading decomposition\n\nby convolutional regression. In ICCV, 2015a.\n\nTakuya Narihira, Michael Maire, and Stella X. 
Yu. Learning lightness from human judgement on relative\n\nre\ufb02ectance. In CVPR, 2015b.\n\n10\n\n\fGeoffrey Oxholm and Ko Nishino. Shape and re\ufb02ectance estimation in the wild. IEEE TPAMI, 38(2):376\u2013389,\n\n2016.\n\nYvain Qu\u00b4eau and Jean-Denis Durou. Edge-preserving integration of a normal \ufb01eld: Weighted least-squares,\ntv and L1 approaches. In International Conference on Scale Space and Variational Methods in Computer\nVision, 2015.\n\nKonstantinos Rematas, Tobias Ritschel, Mario Fritz, Efstratios Gavves, and Tinne Tuytelaars. Deep re\ufb02ectance\n\nmaps. In CVPR, June 2016.\n\nJian Shi, Yue Dong, Hao Su, and Stella X. Yu. Learning non-lambertian object intrinsics across shapenet\n\ncategories. In CVPR, 2017.\n\nZhixin Shu, Ersin Yumer, Sunil Hadap, Kalyan Sunkavalli, Eli Shechtman, and Dimitris Samaras. Neural face\n\nediting with intrinsic image disentangling. In CVPR, July 2017.\n\nYichuan Tang, Ruslan Salakhutdinov, and Geoffrey Hinton. Deep lambertian networks. In ICML, 2012.\n\nYair Weiss. Deriving intrinsic images from image sequences. In ICCV, 2001.\n\nJiajun Wu, Yifan Wang, Tianfan Xue, Xingyuan Sun, William T Freeman, and Joshua B Tenenbaum. Marrnet:\n\n3d shape reconstruction via 2.5d sketches. In NIPS, 2017.\n\n11\n\n\f", "award": [], "sourceid": 3022, "authors": [{"given_name": "Michael", "family_name": "Janner", "institution": "MIT"}, {"given_name": "Jiajun", "family_name": "Wu", "institution": "MIT"}, {"given_name": "Tejas", "family_name": "Kulkarni", "institution": "DeepMind"}, {"given_name": "Ilker", "family_name": "Yildirim", "institution": "MIT"}, {"given_name": "Josh", "family_name": "Tenenbaum", "institution": "MIT"}]}