{"title": "Weakly-supervised Disentangling with Recurrent Transformations for 3D View Synthesis", "book": "Advances in Neural Information Processing Systems", "page_first": 1099, "page_last": 1107, "abstract": "An important problem for both graphics and vision is to synthesize novel views of a 3D object from a single image. This is in particular challenging due to the partial observability inherent in projecting a 3D object onto the image space, and the ill-posedness of inferring object shape and pose. However, we can train a neural network to address the problem if we restrict our attention to specific object classes (in our case faces and chairs) for which we can gather ample training data. In this paper, we propose a novel recurrent convolutional encoder-decoder network that is trained end-to-end on the task of rendering rotated objects starting from a single image. The recurrent structure allows our model to capture long- term dependencies along a sequence of transformations, and we demonstrate the quality of its predictions for human faces on the Multi-PIE dataset and for a dataset of 3D chair models, and also show its ability of disentangling latent data factors without using object class labels.", "full_text": "Weakly-supervised Disentangling with\n\nRecurrent Transformations for 3D View Synthesis\n\nJimei Yang1 Scott Reed2 Ming-Hsuan Yang1 Honglak Lee2\n\n1University of California, Merced\n\n{jyang44, mhyang}@ucmerced.edu\n{reedscot, honglak}@umich.edu\n\n2University of Michigan, Ann Arbor\n\nAbstract\n\nAn important problem for both graphics and vision is to synthesize novel views\nof a 3D object from a single image. This is particularly challenging due to the\npartial observability inherent in projecting a 3D object onto the image space, and\nthe ill-posedness of inferring object shape and pose. 
However, we can train a neu-\nral network to address the problem if we restrict our attention to speci\ufb01c object\ncategories (in our case faces and chairs) for which we can gather ample training\ndata. In this paper, we propose a novel recurrent convolutional encoder-decoder\nnetwork that is trained end-to-end on the task of rendering rotated objects start-\ning from a single image. The recurrent structure allows our model to capture\nlong-term dependencies along a sequence of transformations. We demonstrate\nthe quality of its predictions for human faces on the Multi-PIE dataset and for a\ndataset of 3D chair models, and also show its ability to disentangle latent factors\nof variation (e.g., identity and pose) without using full supervision.\n\nIntroduction\n\n1\nNumerous graphics algorithms have been established to synthesize photorealistic images from 3D\nmodels and environmental variables (lighting and viewpoints), commonly known as rendering. At\nthe same time, recent advances in vision algorithms enable computers to gain some form of un-\nderstanding of objects contained in images, such as classi\ufb01cation [16], detection [10], segmenta-\ntion [18], and caption generation [27], to name a few. These approaches typically aim to deduce\nabstract representations from raw image pixels. However, it has been a long-standing problem for\nboth graphics and vision to automatically synthesize novel images by applying intrinsic transfor-\nmations (e.g., 3D rotation and deformation) to the subject of an input image. From an arti\ufb01cial\nintelligence perspective, this can be viewed as answering questions about object appearance when\nthe view angle or illumination is changed, or some action is taken. 
These synthesized images may\nthen be perceived by humans in photo editing [14], or evaluated by other machine vision systems,\nsuch as the game playing agent with vision-based reinforcement learning [20, 21].\nIn this paper, we consider the problem of predicting transformed appearances of an object when\nit is rotated in 3D from a single image. In general, this is an ill-posed problem due to the loss of\ninformation inherent in projecting a 3D object into the image space. Classic geometry-based ap-\nproaches either recover a 3D object model from multiple related images, i.e., multi-view stereo and\nstructure-from-motion, or register a single image of a known object category to its prior 3D model,\ne.g., faces [5]. The resulting mesh can be used to re-render the scene from novel viewpoints. How-\never, having 3D meshes as intermediate representations, these methods are 1) limited to particular\nobject categories, 2) vulnerable to image alignment mistakes and 3) easy to generate artifacts dur-\ning unseen texture synthesis. To overcome these limitations, we propose a learning-based approach\nwithout explicit 3D model recovery. Having observed rotations of similar 3D objects (e.g., faces,\nchairs, household objects), the trained model can both 1) better infer the true pose, shape and texture\nof the object, and 2) make plausible assumptions about potentially ambiguous aspects of appearance\nin novel viewpoints. Thus, the learning algorithm relies on mappings between Euclidean image\n\n1\n\n\fspace and underlying nonlinear manifold. In particular, 3D view synthesis can be cast as pose man-\nifold traversal where a desired rotation can be decomposed into a sequence of small steps. A major\nchallenge arises due to the long-term dependency among multiple rotation steps; the key identify-\ning information (e.g., shape, texture) from the original input must be remembered along the entire\ntrajectory. 
Furthermore, the local rotation at each step must generate the correct result on the data\nmanifold, or subsequent steps will also fail.\nClosely related to the image generation task considered in this paper is the problem of 3D invariant\nrecognition, which involves comparing object images from different viewpoints or poses with dra-\nmatic changes of appearance. Shepard and Metzler in their mental rotation experiments [23] found\nthat the time taken for humans to match 3D objects from two different views increased proportion-\nally with the angular rotational difference between them. It was as if the humans were rotating their\nmental images at a steady rate. Inspired by this mental rotation phenomenon, we propose a recur-\nrent convolutional encoder-decoder network with action units to model the process of pose manifold\ntraversal. The network consists of four components: a deep convolutional encoder [16], shared\nidentity units, recurrent pose units with rotation action inputs, and a deep convolutional decoder [8].\nRather than training the network to model a speci\ufb01c rotation sequence, we provide control signals\nat each time step instructing the model how to move locally along the pose manifold. The rotation\nsequences can be of varying length. To improve the ease of training, we employed curriculum learn-\ning, similar to that used in other sequence prediction problems [28]. Intuitively, the model should\nlearn how to make one-step 15\u25e6 rotation before learning how to make a series of such rotations.\nThe main contributions of this work are summarized as follows. First, a novel recurrent convolu-\ntional encoder-decoder network is developed for learning to apply out-of-plane rotations to human\nfaces and 3D chair models. Second, the learned model can generate realistic rotation trajectories\nwith a control signal supplied at each step by the user. 
Third, despite only being trained to synthe-\nsize images, our model learns discriminative view-invariant features without using class labels. This\nweakly-supervised disentangling is especially notable with longer-term prediction.\n2 Related Work\nThe transforming autoencoder [12] introduces the notion of capsules in deep networks, which tracks\nboth the presence and position of visual features in the input image. These models can apply af\ufb01ne\ntransformations and 3D rotations to images. We address a similar task of rendering object appear-\nance undergoing 3D rotations, but we use a convolutional network architecture in lieu of capsules,\nand incorporate action inputs and recurrent structure to handle repeated rotation steps. The Predic-\ntive Gating Pyramid [19] is developed for time-series prediction and can learn image transformations\nincluding shifts and rotation over multiple time steps. Our task is related to this time-series predic-\ntion, but our formulation includes a control signal, uses disentangled latent features, and uses con-\nvolutional encoder and decoder networks to model detailed images. Ding and Taylor [7] proposed\na gating network to directly model mental rotation by optimizing transforming distance. Instead of\nextracting invariant recognition features in one shot, their model learns to perform recognition by\nexploring a space of relevant transformations. Similarly, our model can explore the space of rotation\nabout an object image by setting the control signal at each time step of our recurrent network.\nThe problem of training neural networks that generate images is studied in [26]. Dosovitskiy et al.\n[8] proposed a convolutional network mapping shape, pose and transformation labels to images for\ngenerating chairs. It is able to control these factors of variation and generate high-quality render-\nings. 
We also generate chair renderings in this paper, but our model adds several additional features:\na deep encoder network (so that we can generalize to novel images, rather than only decode), dis-\ntributed representations for appearance and pose, and recurrent structure for long-term prediction.\nContemporary to our work, the Inverse Graphics Network (IGN) [17] also adds an encoding func-\ntion to learn graphics codes of images, along with a decoder similar to that in the chair generating\nnetwork. As in our model, IGN uses a deep convolutional encoder to extract image representations,\napply modi\ufb01cations to these, and then re-render. Our model differs in that 1) we train a recur-\nrent network to perform trajectories of multiple transformations, 2) we add control signal input at\neach step, and 3) we use deterministic feed-forward training rather than the variational auto-encoder\n(VAE) framework [15] (although our approach could be extended to a VAE version).\nA related line of work to ours is disentangling the latent factors of variation that generate natural\nimages. Bilinear models for separating style and content are developed in [25], and are shown to\n\n2\n\n\fFigure 1: Deep convolutional encoder-decoder network for learning 3d rotation\n\nbe capable of separating handwriting style and character identity, and also separating face iden-\ntity and pose. The disentangling Boltzmann Machine (disBM) [22] applies this idea to augment\nthe Restricted Boltzmann Machine by partitioning its hidden state into distinct factors of variation\nand modeling their higher-order interaction. The multi-view perceptron [30] employs a stochastic\nfeedforward network to disentangle the identity and pose factors of face images in order to achieve\nview-invariant recognition. The encoder network for IGN is also trained to learn a disentangled rep-\nresentation of images by extracting a graphics code for each factor. 
In [6], the (potentially unknown)\nlatent factors of variation are both discovered and disentangled using a novel hidden unit regularizer.\nOur work is also loosely related to the \u201cDeepStereo\u201d algorithm [9] that synthesizes novel views of\nscenes from multiple images using deep convolutional networks.\n3 Recurrent Convolutional Encoder-Decoder Network\nIn this section we describe our model formulation. Given an image of 3D object, our goal is to syn-\nthesize its rotated views. Inspired by recent success of convolutional networks (CNNs) in mapping\nimages to high-level abstract representations [16] and synthesizing images from graphics codes [8],\nwe base our model on deep convolutional encoder-decoder networks. One example network struc-\nture is shown in Figure 1. The encoder network used 5 \u00d7 5 convolution-relu layers with stride 2 and\n2-pixel padding so that the dimension is halved at each convolution layer, followed by two fully-\nconnected layers. In the bottleneck layer, we de\ufb01ne a group of units to represent the pose (pose\nunits) where the desired transformations can be applied. The other group of units represent what\ndoes not change during transformations, named as identity units. The decoder network is symmetric\nto the encoder. To increase dimensionality we use \ufb01xed upsampling as in [8]. We found that \ufb01xed\nstride-2 convolution and upsampling worked better than max-pooling and unpooling with switches,\nbecause when applying transformations the encoder pooling switches would not in general match\nthe switches produced by the target image. The desired transformations are re\ufb02ected by the action\nunits. We used a 1-of-3 encoding, in which [100] encoded a clockwise rotation, [010] encoded a no-\nop, and [001] encoded a counter-clockwise rotation. The triangle indicates a tensor product taking\nas input the pose units and action units, and producing the transformed pose units. 
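As a concrete sketch of this tensor product, the 1-of-3 action vector can be viewed as selecting one of three pose-transformation matrices. The following NumPy toy example (dimensions and weights are illustrative, not the paper's trained parameters) shows the selection and a multi-step rollout:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; the paper uses 128 pose units for Multi-PIE and 3 actions:
# [100] = clockwise, [010] = no-op, [001] = counter-clockwise.
POSE_DIM, N_ACTIONS = 8, 3

# One learnable transformation matrix per action (random here, for illustration).
W = rng.normal(scale=0.1, size=(N_ACTIONS, POSE_DIM, POSE_DIM))

def transform_pose(pose, action):
    # Tensor product of pose and action units: for a 1-of-3 action vector,
    # the weighted sum over matrices reduces to selecting a single matrix.
    T = np.einsum('a,aij->ij', action, W)
    return T @ pose

def rollout(pose0, actions):
    # Recurrent pose-manifold traversal: one local transformation per step,
    # applied entirely in the latent space.
    poses, p = [], pose0
    for a in actions:
        p = transform_pose(p, a)
        poses.append(p)
    return poses

pose0 = rng.normal(size=POSE_DIM)
cw, noop = np.eye(3)[0], np.eye(3)[1]
traj = rollout(pose0, [cw, cw, noop])
# Two clockwise steps compose W[0] twice; the no-op code then applies W[1].
assert np.allclose(traj[1], W[0] @ (W[0] @ pose0))
assert np.allclose(traj[2], W[1] @ traj[1])
```

Note that in this sketch the matrices compose across steps, which is exactly why a recurrent formulation is needed rather than accumulating linear increments in the action vector.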
Equivalently, the action unit selects the matrix that transforms the input pose units to the output pose units.
The action units introduce a small linear increment to the pose units, which essentially models the local transformations in the nonlinear pose manifold. However, in order to achieve longer rotation trajectories, if we simply accumulate the linear increments from the action units (e.g., [2 0 0] for a two-step clockwise rotation), the pose units will fall off the manifold, resulting in bad predictions. To overcome this problem, we generalize the model to a recurrent neural network, which has been shown to capture long-term dependencies for a wide variety of sequence modeling problems. In essence, we use recurrent pose units to model the step-by-step pose manifold traversals. The identity units are shared across all time steps since we assume that all training sequences preserve the identity while only changing the pose. Figure 2 shows the unrolled version of our RNN model. We only perform encoding at the first time step, and all transformations are carried out in the latent space; i.e., the model predictions at time step t are not fed into the next time step's input. The training objective is based on pixel-wise prediction over all time steps for training sequences:

L_rnn = \sum_{i=1}^{N} \sum_{t=1}^{T} || y^{(i,t)} - g(f_pose(x^{(i)}, a^{(i)}, t), f_id(x^{(i)})) ||_2^2    (1)

where a^{(i)} is the sequence of T actions, f_id(x^{(i)}) produces the identity features invariant to all time steps, f_pose(x^{(i)}, a^{(i)}, t) produces the transformed pose features at time step t, g(·, ·) is the image decoder producing an image given the outputs of f_id(·) and f_pose(·, ·, ·), x^{(i)} is the i-th input image, and y^{(i,t)} is the training image target for the i-th sequence at step t.

Figure 2: Unrolled recurrent convolutional network for learning to rotate 3D objects.
The convolutional encoder\nand decoder have been abstracted out, represented here as vertical rectangles.\n\n3.1 Curriculum Training\nWe trained the network parameters using backpropagation through time and the ADAM optimization\nmethod [3]. To effectively train our recurrent network, we found it bene\ufb01cial to use curriculum\nlearning [4], in which we gradually increase the dif\ufb01culty of training by increasing the trajectory\nlength. This appears to be useful for sequence prediction with recurrent networks in other domains\nas well [21, 28]. In Section 4, we show that increasing the training sequence length improves both\nthe model\u2019s image prediction performance as well as the pose-invariant recognition performance of\nidentity features.\nAlso, longer training sequences force the identity units to better disentangle themselves from the\npose. If the same identity units need to be used to predict both a 15\u25e6-rotated and a 120\u25e6-rotated\nimage during training, these units cannot pick up pose-related information. In this way, our model\ncan learn disentangled features (i.e., identity units can do invariant identity recognition but are not\ninformative of pose, and vice versa) without explicitly regularizing to achieve this effect. We did not\n\ufb01nd it necessary to use gradient clipping.\n4 Experiments\nWe carry out experiments to achieve the following objectives. First, we examine the ability of our\nmodel to synthesize high-quality images of both face and complex 3D objects (chairs) in a wide\nrange of rotational angles. Second, we evaluate the discriminative performance of disentangled\nidentity units through cross-view object recognition. Third, we demonstrate the ability to generate\nand rotate novel object classes by interpolating identity units of query objects.\n4.1 Datasets\nMulti-PIE. The Multi-PIE [11] dataset consists of 754,204 face images from 337 people. 
The\nimages are captured from 15 viewpoints under 20 illumination conditions in different sessions. To\nevaluate our model for rotating faces, we select a subset of Multi-PIE that covers 7 viewpoints\nevenly from \u221245\u25e6 to 45\u25e6 under neutral illumination. Each face image is aligned through manually\nannotated landmarks on eyes, nose and mouth corners, and then cropped to 80 \u00d7 60 \u00d7 3 pixels. We\nuse the images of \ufb01rst 200 people for training and the remaining 137 people for testing.\nChairs. This dataset contains 1393 chair CAD models made publicly available by Aubry et al. [2].\nEach chair model is rendered from 31 azimuth angles (with steps of 11\u25e6 or 12\u25e6) and 2 elevation\nangles (20\u25e6 and 30\u25e6) at a \ufb01xed distance to the virtual camera. We use a subset of 809 chair models\nin our experiments, which are selected out of 1393 by Dosovitskiy et al. [8] in order to remove near-\nduplicate models (e.g., models differing only in color) or low-quality models. We crop the rendered\nimages to have a small border and resize them to a common size of 64 \u00d7 64 \u00d7 3 pixels. We also\nprepare their binary masks by subtracting the white background. We use the images of the \ufb01rst 500\nmodels as the training set and the remaining 309 models as the test set.\n4.2 Network Architectures and Training Details\nMulti-PIE. The encoder network for the Multi-PIE dataset used two convolution-relu layers with\nstride 2 and 2-pixel padding, followed by one fully-connected layer: 5\u00d75\u00d764\u22125\u00d75\u00d7128\u22121024. The\nnumber of identity and pose units are 512 and 128, respectively. The decoder network is symmetric\nto the encoder. 
The curriculum training procedure starts with the single-step rotation model, which we call RNN1.

Figure 3: 3D view synthesis on Multi-PIE. For each panel, the first row shows the ground truth from −45° to 45°; the second and third rows show the re-renderings of 6-step clockwise rotation from an input image of −45° (red box) and of 6-step counter-clockwise rotation from an input image of 45° (red box), respectively.

Figure 4: Comparing face pose normalization results with the 3D morphable model [29].

We prepare the training samples by pairing face images of the same person captured in the same session with adjacent camera viewpoints. For example, x^(i) at −30° is mapped to y^(i) at −15° with action a^(i) = [001]; x^(i) at −15° is mapped to y^(i) at −30° with action a^(i) = [100]; and x^(i) at −30° is mapped to y^(i) at −30° with action a^(i) = [010]. For face images at the ending viewpoints −45° and 45°, only one-way rotation is feasible. We train the network using the ADAM optimizer with fixed learning rate 10^−4 for 400 epochs.¹
Since there are 7 viewpoints per person per session, we schedule the curriculum training with t = 2, t = 4 and t = 6 stages, which we call RNN2, RNN4 and RNN6, respectively. To sample training sequences with fixed length, we allow both clockwise and counter-clockwise rotations.
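The one-step pairing scheme just described can be sketched as follows (a hypothetical helper, not the paper's code; the view angles and action codes follow the text, with [100] rotating toward smaller angles and [001] toward larger ones):

```python
# Multi-PIE views and the 1-of-3 action codes from the text.
VIEWS = [-45, -30, -15, 0, 15, 30, 45]
CW, NOOP, CCW = (1, 0, 0), (0, 1, 0), (0, 0, 1)

def one_step_pairs():
    """All (input_view, target_view, action) one-step training pairs.
    Endpoint views (±45°) only admit one-way rotation."""
    pairs = []
    for i, v in enumerate(VIEWS):
        pairs.append((v, v, NOOP))                    # no-op: same view
        if i > 0:
            pairs.append((v, VIEWS[i - 1], CW))       # e.g. -15° -> -30° with [100]
        if i < len(VIEWS) - 1:
            pairs.append((v, VIEWS[i + 1], CCW))      # e.g. -30° -> -15° with [001]
    return pairs

pairs = one_step_pairs()
assert len(pairs) == 7 + 6 + 6          # 7 no-ops, 6 per rotation direction
assert (-30, -15, CCW) in pairs and (-15, -30, CW) in pairs
```

Fixed-length training sequences for the later curriculum stages chain such steps, reversing direction when a sequence reaches the ±45° endpoints.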
For example, when t = 4, one input image x^(i) at 30° is mapped to (y^(i,1), y^(i,2), y^(i,3), y^(i,4)) with corresponding angles (45°, 30°, 15°, 0°) and action inputs ([001], [100], [100], [100]). In each stage, we initialize the network parameters with those of the previous stage and fine-tune the network with fixed learning rate 10^−5 for 10 additional epochs.
Chairs. The encoder network for chairs used three convolution-relu layers with stride 2 and 2-pixel padding, followed by two fully-connected layers: 5×5×64 − 5×5×128 − 5×5×256 − 1024 − 1024. The decoder network is symmetric, except that after the fully-connected layers it branches into image and mask prediction layers. The mask prediction indicates whether a pixel belongs to the foreground or the background. We adopted this idea from the generative CNN [8] and found it beneficial to training efficiency and image synthesis quality. A tradeoff parameter λ = 0.1 is applied to the mask prediction loss. We train the single-step network parameters with fixed learning rate 10^−4 for 500 epochs. We schedule the curriculum training with t = 2, t = 4, t = 8 and t = 16, which we call RNN2, RNN4, RNN8 and RNN16. Note that the curriculum training stops at t = 16 because we reached the limit of GPU memory. Since the images of each chair model are rendered from 31 viewpoints evenly sampled between 0° and 360°, we can easily prepare training sequences of clockwise or counter-clockwise t-step rotations around the circle. Similarly, the network parameters of the current stage are initialized with those of the previous stage and fine-tuned with learning rate 10^−5 for 50 epochs.
4.3 3D View Synthesis of Novel Objects
We first examine the re-rendering quality of our RNN models for novel object instances that were not seen during training. On the Multi-PIE dataset, given one input image from the test set with possible views between −45° and 45°, the encoder produces identity units and pose units, and then the decoder renders images progressively with fixed identity units and action-driven recurrent pose units for up to t steps. Examples are shown in Figure 3 of the longest rotations, i.e., clockwise from −45° to 45° and counter-clockwise from 45° to −45° with RNN6. High-quality renderings are generated with smooth transformations between adjacent views. The characteristics of faces, such as gender, expression, eyes, nose and glasses, are also preserved during rotation. We also compare our RNN model with a state-of-the-art 3D morphable model for face pose normalization [29] in Figure 4. It can be observed that our RNN model produces stable renderings while the 3D morphable model is sensitive to facial landmark localization. One advantage of the 3D morphable model is that it preserves facial textures well.
On the chair dataset, we use RNN16 to synthesize 16 rotated views of novel chairs in the test set. Given a chair image of a certain view, we define two action sequences: one for progressive clockwise rotation and another for counter-clockwise rotation. This is a more challenging task compared to rotating faces due to the complex 3D shapes of chairs and the large rotation angles (more than 180° after 16-step rotations). Since no previous method tackles the exact same chair re-rendering problem, we use a k-nearest-neighbor (KNN) method for baseline comparisons. The KNN baseline is implemented as follows. We first extract the CNN features "fc7" from the VGG-16 net [24] for all the chair images.

¹We carry out experiments using Caffe [13] on Nvidia K40c and Titan X GPUs.
For each test chair image, we find its K nearest neighbors in the training set by comparing their "fc7" features. The retrieved top-K images are expected to be similar to the query in terms of both style and pose [1]. Given a desired rotation angle, we synthesize rotated views of the test image by averaging the corresponding rotated views of the retrieved top-K images in the training set at the pixel level. We tune the value of K in [1, 3, 5, 7], namely KNN1, KNN3, KNN5 and KNN7, to achieve the best performance. Two examples are shown in Figure 5. In our RNN model, the 3D shapes are well preserved with clear boundaries for all 16 rotated views from different inputs, and the appearance changes smoothly between adjacent views with a consistent style.

Figure 5: 3D view synthesis of 16-step rotations on Chairs. In each panel, we compare synthesis results of the RNN16 model (top) and of the KNN5 baseline (bottom). The first two panels belong to the same chair with different starting views, while the last two panels are from another chair with two starting views. Input images are marked with red boxes.

Note that conceptually the learned network parameters from different stages of curriculum training can be used to process an arbitrary number of rotation steps. The RNN1 model (the first row in Figure 6) works well for the first rotation step, but it produces degenerate results from the second step on. The RNN2 model (the second row), trained with two-step rotations, generates reasonable results at the third step. Progressively, RNN4 and RNN8 seem to generalize well on chairs with longer predictions (t = 6 for RNN4 and t = 12 for RNN8). We measure the quantitative performance of KNN and our RNN by the mean squared error (MSE) in (1) in Figure 7.
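The KNN baseline above can be sketched as follows (hypothetical helper names; extracting the "fc7" features is assumed to have been done elsewhere):

```python
import numpy as np

def knn_synthesize(query_feat, train_feats, train_rotated_views, k=5):
    """KNN baseline sketch: retrieve the top-k training chairs whose
    features are closest to the query, then average their corresponding
    rotated views at the pixel level."""
    dists = np.linalg.norm(train_feats - query_feat, axis=1)
    topk = np.argsort(dists)[:k]
    return train_rotated_views[topk].mean(axis=0)

# Toy check: with k=1 and an exact feature match, the baseline returns
# that training chair's rotated view unchanged.
feats = np.array([[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]])
views = np.arange(3 * 4 * 4, dtype=float).reshape(3, 4, 4)  # 3 chairs, 4x4 "images"
out = knn_synthesize(feats[1], feats, views, k=1)
assert np.allclose(out, views[1])
```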
As a result, the best KNN with 5 retrievals (KNN5) obtains ∼310 MSE, which is comparable to our RNN4 model, but our RNN16 model obtains ∼179 MSE, significantly outperforming KNN5 with a 42% relative improvement.
4.4 Cross-View Object Recognition
In this experiment, we examine and compare the discriminative performance of disentangled representations through cross-view object recognition.

Figure 6: Comparing chair synthesis results from RNNs at different curriculum stages.

Figure 7: Comparing reconstruction mean squared errors (MSE) on chairs with RNNs and KNNs.

Multi-PIE. We create 7 gallery/probe splits from the test set. In each split, the face images of the same view, e.g., −45°, are collected as the gallery and those of the remaining views serve as probes. We extract 512-d features from the identity units of the RNNs for all the test images, so that the probes are matched to the gallery by their cosine distance. A match is considered a success if the matched gallery image has the same identity as the probe. We also categorize the probes in each split by measuring their angle offsets from the gallery. In particular, the angle offsets range from 15° to 90°, and the recognition difficulty increases with the angle offset. To demonstrate the discriminative performance of our learned representations, we also implement a convolutional network classifier. The CNN architecture is set up by connecting our encoder and identity units to a 200-way softmax output layer, and its parameters are learned on the training set with ground-truth class labels. The 512-d features extracted from the layer before the softmax layer are used to perform cross-view object recognition as above. Figure 8 (left) compares the average success rates of the RNNs and the CNN, with their standard deviations over the 7 splits, for each angle offset.
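The gallery/probe matching by cosine distance can be sketched as follows (a hypothetical helper operating on precomputed identity features):

```python
import numpy as np

def match_probes(gallery_feats, probe_feats):
    """Return, for each probe, the index of the gallery feature with the
    highest cosine similarity (equivalently, the smallest cosine distance)."""
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    p = probe_feats / np.linalg.norm(probe_feats, axis=1, keepdims=True)
    return np.argmax(p @ g.T, axis=1)

# Toy check: scaled copies of gallery features match their own identity,
# since cosine similarity ignores feature magnitude.
gallery = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]])
probes = 2.5 * gallery[[2, 0, 1]]
assert match_probes(gallery, probes).tolist() == [2, 0, 1]
```

A probe counts as a success when the matched gallery index has the same identity label as the probe.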
The success rates of RNN1 drop more\nthan 20% from angle offset 15\u25e6 to 90\u25e6. The success rates keep improving in general with curriculum\ntraining of RNNs, and the best results are achieved with RNN6. As expected, the performance gap\nfor RNN6 between 15\u25e6 to 90\u25e6 reduces to 10%. This phenomenon demonstrates that our RNN\nmodel gradually learns pose/viewpoint-invariant representations for 3D face recognition. Without\nusing any class labels, our RNN model achieves competitive results against the CNN.\nChairs. The experimental setup is similar to Multi-PIE. There are in total 31 azimuth views per\nchair instance. For each view, we create its gallery/probe split so that we have 31 splits. We ex-\ntract 512-d features from identity units of RNN1, RNN2, RNN4, RNN8 and RNN16. The probes\nfor each split are sorted by their angle offsets from the gallery images. Note that this experiment\nis particularly challenging because chair matching is a \ufb01ne-grained recognition task and chair ap-\npearances change signi\ufb01cantly with 3D rotations. We also compare our model against CNN, but\ninstead of training CNN from scratch we use the pre-trained VGG-16 net [24] to extract the 4096-d\n\u201cfc7\u201d features for chair matching. The success rates are shown in Figure 8 (right). The performance\ndrops quickly when the angle offset is greater than 45\u25e6, but the RNN16 signi\ufb01cantly improves the\noverall success rates especially for large angle offsets. We notice that the standard deviations are\nlarge around the angle offsets 70\u25e6 to 120\u25e6. This is because some views contain more information\nabout the chair 3D shapes than the other views so that we see performance variations. Interestingly,\nthe performance of VGG-16 net surpasses our RNN model when the angle offset is greater than\n120\u25e6. We hypothesize that this phenomenon results from the symmetric structures of most of the\nchairs. 
The VGG-16 net was trained with mirroring data augmentation to achieve a certain symmetric invariance, while our RNN model does not exploit this structure.
To further demonstrate the disentangling property of our RNN model, we use the pose units extracted from the input images to repeat the above cross-view recognition experiments. The mean success rates are shown in Table 1. It turns out that the better the identity units perform, the worse the pose units perform. When the identity units achieve near-perfect recognition on Multi-PIE, the pose units obtain a mean success rate of only 1.4%, which is close to the random-guess rate of 0.5% for 200 classes.

Figure 8: Comparing cross-view recognition success rates for faces (left) and chairs (right).

Table 1: Comparing mean cross-view recognition success rates (%) with identity and pose units.

Models      RNN: identity   RNN: pose   CNN
Multi-PIE        93.3           1.4      92.6
Chairs           56.8           9.0      52.5

4.5 Class Interpolation and View Synthesis
In this experiment, we demonstrate the ability of our RNN model to generate novel chairs by interpolating between two existing ones. Given two chair images of the same view from different instances, the encoder network is used to compute their identity units z_id^1, z_id^2 and pose units z_pose^1, z_pose^2, respectively. The interpolation is computed by z_id = β·z_id^1 + (1 − β)·z_id^2 and z_pose = β·z_pose^1 + (1 − β)·z_pose^2, where β ∈ {0.0, 0.2, 0.4, 0.6, 0.8, 1.0}. The interpolated z_id and z_pose are then fed into the recurrent decoder network to render its rotated views. Example interpolations between four chair instances are shown in Figure 9.
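This interpolation of identity and pose units can be sketched as a toy NumPy illustration (decoding the blended codes to images is omitted, and the vectors here are illustrative placeholders):

```python
import numpy as np

def interpolate_units(z1_id, z2_id, z1_pose, z2_pose,
                      betas=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """Linearly blend identity and pose units between two instances,
    returning one (z_id, z_pose) pair per beta."""
    return [(b * z1_id + (1 - b) * z2_id,
             b * z1_pose + (1 - b) * z2_pose) for b in betas]

z1_id, z2_id = np.array([1.0, 0.0]), np.array([0.0, 1.0])
z1_p, z2_p = np.array([0.5]), np.array([-0.5])
codes = interpolate_units(z1_id, z2_id, z1_p, z2_p)
# beta = 0 reproduces instance 2; beta = 1 reproduces instance 1.
assert np.allclose(codes[0][0], z2_id) and np.allclose(codes[-1][0], z1_id)
assert np.allclose(codes[-1][1], z1_p)
```

Each blended pair would then be fed to the recurrent decoder, which traverses the pose manifold while holding the interpolated identity fixed.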
The interpolated chairs exhibit smooth stylistic transformations between each pair of input instances (each row in Figure 9), and their unique stylistic characteristics are well preserved across their rotated views (each column in Figure 9).

Figure 9: Chair style interpolation and view synthesis. Given four chair images of the same view (first row) from the test set, each row presents renderings of a style-manifold traversal with a fixed view, while each column presents renderings of a pose-manifold traversal with a fixed interpolated identity.

5 Conclusion
In this paper, we develop a recurrent convolutional encoder-decoder network and demonstrate its effectiveness for synthesizing 3D views of unseen object instances. On the Multi-PIE dataset and a database of 3D chair CAD models, the model predicts accurate renderings across trajectories of repeated rotations. The proposed curriculum training, which gradually increases the trajectory length of training sequences, yields both better image appearance and more discriminative features for pose-invariant recognition. We also show that a trained model can interpolate across the identity manifold of chairs at a fixed pose, and traverse the pose manifold while fixing the identity. This generative disentangling of chair identity and pose emerged from our recurrent rotation-prediction objective, even though we do not explicitly regularize the hidden units to be disentangled.
Our future work includes introducing actions other than rotation into the proposed model, handling objects embedded in complex scenes, and handling one-to-many mappings for which a transformation yields a multi-modal distribution over future states in the trajectory.

Acknowledgments This work was supported in part by ONR N00014-13-1-0762, NSF CAREER IIS-1453651, and NSF CMMI-1266184. We thank NVIDIA for donating a Tesla K40 GPU.

[Figure 8 axes: angle offset (◦) vs. classification success rate (%); legends RNN1, RNN2, RNN4, RNN6, CNN (left) and RNN1, RNN2, RNN4, RNN8, RNN16, VGG-16 (right).]

References
[1] M. Aubry and B. C. Russell. Understanding deep features with computer-generated imagery. In ICCV, 2015.
[2] M. Aubry, D. Maturana, A. A. Efros, B. Russell, and J. Sivic. Seeing 3D chairs: exemplar part-based 2D-3D alignment using a large dataset of CAD models. In CVPR, 2014.
[3] J. Ba and D. Kingma. Adam: A method for stochastic optimization. In ICLR, 2015.
[4] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, 2009.
[5] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In SIGGRAPH, 1999.
[6] B. Cheung, J. Livezey, A. Bansal, and B. Olshausen. Discovering hidden factors of variation in deep networks. In ICLR, 2015.
[7] W. Ding and G. Taylor. Mental rotation by optimizing transforming distance. In NIPS Deep Learning and Representation Learning Workshop, 2014.
[8] A. Dosovitskiy, J. Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. In CVPR, 2015.
[9] J. Flynn, I. Neulander, J. Philbin, and N. Snavely. DeepStereo: Learning to predict new views from the world's imagery. arXiv preprint arXiv:1506.06825, 2015.
[10] R. Girshick. Fast R-CNN. In ICCV, 2015.
[11] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-PIE. Image and Vision Computing, 28(5):807–813, May 2010.
[12] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In ICANN, 2011.
[13] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[14] N. Kholgade, T. Simon, A. Efros, and Y. Sheikh. 3D object manipulation in a single photograph using stock 3D models. In SIGGRAPH, 2014.
[15] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[17] T. D. Kulkarni, W. Whitney, P. Kohli, and J. B. Tenenbaum. Deep convolutional inverse graphics network. In NIPS, 2015.
[18] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[19] V. Michalski, R. Memisevic, and K. Konda. Modeling deep temporal dependencies with recurrent “grammar cells”. In NIPS, 2014.
[20] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. In NIPS Deep Learning Workshop, 2013.
[21] J. Oh, X. Guo, H. Lee, R. Lewis, and S. Singh. Action-conditional video prediction using deep networks in Atari games. In NIPS, 2015.
[22] S. Reed, K. Sohn, Y. Zhang, and H. Lee. Learning to disentangle factors of variation with manifold interaction. In ICML, 2014.
[23] R. N. Shepard and J. Metzler. Mental rotation of three-dimensional objects. Science, 171(3972):701–703, 1971.
[24] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[25] J. B. Tenenbaum and W. T. Freeman. Separating style and content with bilinear models. Neural Computation, 12(6):1247–1283, 2000.
[26] T. Tieleman. Optimizing neural networks that generate images. PhD thesis, University of Toronto, 2014.
[27] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
[28] W. Zaremba and I. Sutskever. Learning to execute. arXiv preprint arXiv:1410.4615, 2014.
[29] X. Zhu, Z. Lei, J. Yan, D. Yi, and S. Z. Li. High-fidelity pose and expression normalization for face recognition in the wild. In CVPR, 2015.
[30] Z. Zhu, P. Luo, X. Wang, and X. Tang. Multi-view perceptron: a deep model for learning face identity and view representations. In NIPS, 2014.