{"title": "Deep Visual Analogy-Making", "book": "Advances in Neural Information Processing Systems", "page_first": 1252, "page_last": 1260, "abstract": "In addition to identifying the content within a single image, relating images and generating related images are critical tasks for image understanding. Recently, deep convolutional networks have yielded breakthroughs in producing image labels, annotations and captions, but have only just begun to be used for producing high-quality image outputs. In this paper we develop a novel deep network trained end-to-end to perform visual analogy making, which is the task of transforming a query image according to an example pair of related images. Solving this problem requires both accurately recognizing a visual relationship and generating a transformed query image accordingly. Inspired by recent advances in language modeling, we propose to solve visual analogies by learning to map images to a neural embedding in which analogical reasoning is simple, such as by vector subtraction and addition. In experiments, our model effectively models visual analogies on several datasets: 2D shapes, animated video game sprites, and 3D car models.", "full_text": "Deep Visual Analogy-Making\n\nScott Reed\n\nYi Zhang\n\nUniversity of Michigan, Ann Arbor, MI 48109, USA\n\n{reedscot,yeezhang,yutingzh,honglak}@umich.edu\n\nYuting Zhang Honglak Lee\n\nAbstract\n\nIn addition to identifying the content within a single image, relating images and\ngenerating related images are critical tasks for image understanding. Recently,\ndeep convolutional networks have yielded breakthroughs in predicting image la-\nbels, annotations and captions, but have only just begun to be used for generat-\ning high-quality images. In this paper we develop a novel deep network trained\nend-to-end to perform visual analogy making, which is the task of transforming a\nquery image according to an example pair of related images. 
Solving this problem requires both accurately recognizing a visual relationship and generating a transformed query image accordingly. Inspired by recent advances in language modeling, we propose to solve visual analogies by learning to map images to a neural embedding in which analogical reasoning is simple, such as by vector subtraction and addition. In experiments, our model effectively models visual analogies on several datasets: 2D shapes, animated video game sprites, and 3D car models.\n\n1 Introduction\nHumans are good at considering \u201cwhat-if?\u201d questions about objects in their environment. What if this chair were rotated 30 degrees clockwise? What if I dyed my hair blue? We can easily imagine roughly how objects would look according to various hypothetical questions. However, current generative models of images struggle to perform this kind of task without encoding significant prior knowledge about the environment and restricting the allowed transformations.\nOften, these visual hypothetical questions can be effectively answered by analogical reasoning.1 Having observed many similar objects rotating, one could learn to mentally rotate new objects. Having observed objects with different colors (or textures), one could learn to mentally re-color (or re-texture) new objects.\nSolving the analogy problem requires the ability to identify relationships among images and transform query images accordingly. In this paper, we propose to solve the problem by directly training on visual analogy completion; that is, to generate the transformed image output. Note that we do not make any claim about how humans solve the problem, but we show that in many cases thinking by analogy is enough to solve it, without exhaustively encoding first principles into a complex model.\nWe denote a valid analogy as a 4-tuple A : B :: C : D, often spoken as \u201cA is to B as C is to D\u201d. 
Given such an analogy, there are several questions one might ask:\n\n\u2022 A ? B :: C ? D - What is the common relationship?\n\u2022 A : B ? C : D - Are A and B related in the same way that C and D are related?\n\u2022 A : B :: C : ? - What is the result of applying the transformation A : B to C?\n\nFigure 1: Visual analogy making concept. We learn an encoder function f mapping images into a space in which analogies can be performed, and a decoder g mapping back to the image space.\n\n1See [2] for a deeper philosophical discussion of analogical reasoning.\n\nThe first two questions can be viewed as discriminative tasks, and could be formulated as classification problems. The third question requires generating an appropriate image to make a valid analogy. Since a model with this capability would be of practical interest, we focus on this question.\nOur proposed approach is to learn a deep encoder function f : R^D \u2192 R^K that maps images to an embedding space suitable for reasoning about analogies, and a deep decoder function g : R^K \u2192 R^D that maps from the embedding back to the image space. (See Figure 1.) Our encoder function is inspired by word2vec [21], GloVe [22] and other embedding methods that map inputs to a space supporting analogies by vector addition. In those models, analogies could be performed via\n\nd = arg max_{w \u2208 V} cos(f(w), f(b) \u2212 f(a) + f(c))\n\nwhere V is the vocabulary and (a, b, c, d) form an analogy tuple such that a : b :: c : d. Other variations on this inference, such as a multiplicative version [18], have been proposed. 
The vector\nf (b) \u2212 f (a) represents the transformation, which is applied to a query c by vector addition in the\nembedding space.\nIn the case of images, we can modify this naturally by replacing the cosine\nsimilarity and argmax over the vocabulary with application of a decoder function mapping from\nthe embedding back to the image space.\nClearly, this simple vector addition will not accurately model transformations for low-level repre-\nsentations such as raw pixels, and so in this work we seek to learn a high-level representation. In\nour experiments, we parametrize the encoder f and decoder g as deep convolutional neural net-\nworks (CNN), but in principle other methods could be used to model f and g. In addition to vector\naddition, we also propose more powerful methods of applying the inferred transformations to new\nimages, such as higher-order multiplicative interactions and multi-layer additive interactions.\nWe \ufb01rst demonstrate visual analogy making on a 2D shapes benchmark, with variation in shape,\ncolor, rotation, scaling and position, and evaluate the performance on analogy completion. Second,\nwe generate a dataset of animated 2D video game character sprites using graphics assets from the\nLiberated Pixel Cup [1]. We demonstrate the capability of our model to transfer animations onto\nnovel characters from a single frame, and to perform analogies that traverse the manifold induced\nby an animation. Third, we apply our model to the task of analogy making on 3D car models, and\nshow that our model can perform 3D pose transfer and rotation by analogy.\n2 Related Work\nHertzmann et al. [12] developed a method for applying new textures to images by analogy. This\nproblem is of practical interest, e.g., for stylizing animations [3]. Our model can also synthesize\nnew images by analogy to examples, but we study global transformations rather than only changing\nthe texture of the image.\nDoll\u00b4ar et al. 
[9] developed Locally-Smooth Manifold Learning to traverse image manifolds. We\nshare a similar motivation when analogical reasoning requires walking along a manifold (e.g. pose\nanalogies), but our model leverages a deep encoder and decoder trainable by backprop.\nMemisevic and Hinton [19] proposed the Factored Gated Boltzmann Machine for learning to repre-\nsent transformations between pairs of images. This and related models [25, 8, 20] use 3-way tensors\nor their factorization to infer translations, rotations and other transformations from a pair of images,\nand apply the same transformation to a new image. In this work, we share a similar goal, but we\ndirectly train a deep predictive model for the analogy task without requiring 3-way multiplicative\nconnections, with the intent to scale to bigger images and learn more subtle relationships involving\narticulated pose, multiple attributes and out-of-plane rotation.\nOur work is related to several previous works on disentangling factors of variation, for which a\ncommon application is analogy-making. As an early example, bilinear models [27] were proposed\nto separate style and content factors of variation in face images and speech signals. Tang et al. [26]\ndeveloped the tensor analyzer which uses a factor loading tensor to model the interaction among\nlatent factor groups, and was applied to face modeling. Several variants of higher-order Boltzmann\nmachine were developed to tackle the disentangling problem, featuring multiple groups of hidden\nunits, with each group corresponding to a single factor [23, 7]. Disentangling was also considered\nin the discriminative case in the Contractive Discriminative Analysis model [24]. Our work differs\nfrom these in that we train a deep end-to-end network for generating images by analogy.\nRecently several methods were proposed to generate high-quality images using deep networks.\nDosovitskiy et al. 
[10] used a CNN to generate chair images with controllable variation in appear-\n\n2\n\n\fance, shape and 3D pose. Contemporary to our work, Kulkarni et al. [17] proposed the Deep Convo-\nlutional Inverse Graphics Network, which is a form of variational autoencoder (VAE) [15] in which\nthe encoder disentangles factors of variation. Other works have considered a semi-supervised exten-\nsion of the VAE [16] incorporating class labels associated to a subset of the training images, which\ncan control the label units to perform some visual analogies. Cohen and Welling [6] developed a\ngenerative model of commutative Lie groups (e.g. image rotation, translation) that produced invari-\nant and disentangled representations. In [5], this work is extended to model the non-commutative\n3D rotation group SO(3). Zhu et al. [30] developed the multi-view perceptron for modeling face\nidentity and viewpoint, and generated high quality faces subject to view changes. Cheung et al.\n[4] also use a convolutional encoder-decoder model, and develop a regularizer to disentangle latent\nfactors of variation from a discriminative target.\nAnalogies have been well-studied in the NLP community; Turney [28] used analogies from SAT\ntests to evaluate the performance of text analogy detection methods. In the visual domain, Hwang\net al. [13] developed an analogy-preserving visual-semantic embedding model that could both detect\nanalogies and as a regularizer improve visual recognition performance. Our work is related to these,\nbut we focus mainly on generating images to complete analogies rather than detecting analogies.\n3 Method\nSuppose that A is the set of valid analogy tuples in the training set. For example, (a, b, c, d) \u2208 A\nimplies the statement \u201ca is to b as c is to d\u201d. Let the input image space for images a, b, c, d be RD,\nand the embedding space be RK (typically K < D). Denote the encoder as f : RD \u2192 RK and the\ndecoder as g : RK \u2192 RD. 
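To make this notation concrete, here is a minimal illustrative sketch in NumPy. The linear maps standing in for the learned encoder f and decoder g are purely hypothetical assumptions; the paper parametrizes both as deep convolutional networks.

```python
import numpy as np

# Toy stand-ins for the learned encoder f: R^D -> R^K and decoder
# g: R^K -> R^D. Linear maps here are an illustrative assumption only.
rng = np.random.default_rng(0)
D, K = 16, 4  # image dimension D, embedding dimension K (K < D)
W_enc = rng.standard_normal((K, D))
W_dec = rng.standard_normal((D, K))

def f(x):
    return W_enc @ x  # encoder

def g(z):
    return W_dec @ z  # decoder

def complete_analogy_add(a, b, c):
    # Analogy completion by vector addition in the embedding space:
    # d_hat = g(f(b) - f(a) + f(c))
    return g(f(b) - f(a) + f(c))

a, b, c = (rng.standard_normal(D) for _ in range(3))
d_hat = complete_analogy_add(a, b, c)
assert d_hat.shape == (D,)
```

Note that with a trivial relation (b = a) the prediction reduces to g(f(c)), a plain reconstruction of the query.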
Figure 2 illustrates our architectures for visual analogy making.\n3.1 Making analogies by vector addition\nNeural word representations (e.g., [21, 22]) have been shown to be capable of analogy-making by addition and subtraction of word embeddings. Analogy-making capability appears to be an emergent property of these embeddings, but for images we propose to directly train on the objective of analogy completion. Concretely, we propose the following objective for vector-addition-based analogies:\n\nLadd = \u2211_{a,b,c,d \u2208 A} ||d \u2212 g(f(b) \u2212 f(a) + f(c))||^2_2 (1)\n\nThis objective has the advantage of being very simple to implement and train. In addition, with a modest number of labeled relations, a large number of training analogies can be mined.\n3.2 Making analogy transformations dependent on the query context\nIn some cases, a purely additive model of applying transformations may not be ideal. For example, in the case of rotation, the manifold of a rotated object is circular, and after enough rotation has been applied, one returns to the original point. In the vector-addition model, we can add the same rotation vector f(b) \u2212 f(a) multiple times to a query f(c), but we will never return to the original point (except when f(b) = f(a)). The decoder g could (in principle) solve this problem by learning to perform a \u201cmodulus\u201d operation, but this would make the training significantly more difficult. Instead, we propose to parametrize the transformation increment to f(c) as a function of both f(b) \u2212 f(a) and f(c) itself. In this way, analogies can be applied in a context-dependent way.\nWe present two variants of our training objective to solve this problem. The first, which we will call Lmul, uses multiplicative interactions between f(b) \u2212 f(a) and f(c) to generate the increment. 
The second, which we call Ldeep, uses multiple fully connected layers to form a multi-layer perceptron (MLP) without using multiplicative interactions:\n\nLmul = \u2211_{a,b,c,d \u2208 A} ||d \u2212 g(f(c) + W \u00d71 [f(b) \u2212 f(a)] \u00d72 f(c))||^2_2 (2)\n\nLdeep = \u2211_{a,b,c,d \u2208 A} ||d \u2212 g(f(c) + h([f(b) \u2212 f(a); f(c)]))||^2_2 (3)\n\nFor Lmul, W \u2208 R^{K\u00d7K\u00d7K} is a 3-way tensor.2 In practice, to reduce the number of weights we used a factorized tensor parametrized as Wijl = \u2211_f W(1)_if W(2)_jf W(3)_lf. Multiplicative interactions were similarly used in bilinear models [27], disentangling Boltzmann Machines [23] and Tensor Analyzers [26].\n\n2For a tensor W \u2208 R^{K\u00d7K\u00d7K} and vectors v, w \u2208 R^K, we define the tensor multiplication W \u00d71 v \u00d72 w \u2208 R^K as (W \u00d71 v \u00d72 w)_l = \u2211^K_{i=1} \u2211^K_{j=1} Wijl vi wj, \u2200l \u2208 {1, ..., K}.\n\nFigure 2: Illustration of the network structure for analogy making. The top portion shows the encoder, transformation module, and decoder. The bottom portion illustrates the transformations used for Ladd, Lmul and Ldeep. The \u2297 icon in Lmul indicates a tensor product. We share weights with all three encoder networks shown on the top left.\n
Note that our multiplicative interaction in Lmul is different from [19] in that we use the difference between two encoding vectors (i.e., f(b) \u2212 f(a)) to infer the transformation (or relation), rather than using a higher-order interaction (e.g., tensor product) for this inference. For Ldeep, h : R^2K \u2192 R^K is an MLP (a deep network without 3-way multiplicative interactions) and [f(b) \u2212 f(a); f(c)] denotes concatenation of the transformation vector with the query embedding.\n\nAlgorithm 1: Manifold traversal by analogy, with transformation function T (Eq. 5).\nGiven images a, b, c, and N (# steps)\nz \u2190 f(c)\nfor i = 1 to N do\n  z \u2190 z + T(f(a), f(b), z)\n  xi \u2190 g(z)\nreturn generated images xi (i = 1, ..., N)\n\nOptimizing the above objectives teaches the model to predict analogy completions in image space, but in order to traverse image manifolds (e.g. for repeated analogies) as in Algorithm 1, we also want accurate analogy completions in the embedding space. To encourage this property, we introduce a regularizer to make the predicted transformation increment T(f(a), f(b), f(c)) match the difference of encoder embeddings f(d) \u2212 f(c):\n\nR = \u2211_{a,b,c,d \u2208 A} ||f(d) \u2212 f(c) \u2212 T(f(a), f(b), f(c))||^2_2, where (4)\n\nT(x, y, z) = { y \u2212 x when using Ladd; W \u00d71 [y \u2212 x] \u00d72 z when using Lmul; MLP([y \u2212 x; z]) when using Ldeep } (5)\n\nThe overall training objective is a weighted combination of analogy prediction and the above regularizer, e.g. Ldeep + \u03b1R. We set \u03b1 = 0.01 by cross-validation on the shapes data and found it worked well for all models on sprites and 3D cars as well. 
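As an illustrative sketch (hypothetical NumPy with randomly initialized parameters, not the authors' Caffe implementation), the three increment functions T of Eq. (5) and the traversal loop of Algorithm 1 can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4  # embedding dimension

# Hypothetical parameters standing in for the learned ones.
W = rng.standard_normal((K, K, K)) * 0.1    # 3-way tensor for L_mul
W1 = rng.standard_normal((K, 2 * K)) * 0.1  # MLP weights for L_deep
W2 = rng.standard_normal((K, K)) * 0.1

def T(x, y, z, mode):
    # Transformation increment T(f(a), f(b), f(c)) from Eq. (5).
    diff = y - x
    if mode == 'add':   # L_add: the increment is just the difference
        return diff
    if mode == 'mul':   # L_mul: (W x1 diff x2 z)_l = sum_ij W_ijl diff_i z_j
        return np.einsum('ijl,i,j->l', W, diff, z)
    if mode == 'deep':  # L_deep: MLP with relu on the concatenation [diff; z]
        return W2 @ np.maximum(0.0, W1 @ np.concatenate([diff, z]))
    raise ValueError(mode)

def traverse(fa, fb, fc, g, n_steps, mode='deep'):
    # Algorithm 1: repeatedly apply the inferred transformation to the
    # query embedding and decode each intermediate point.
    z, frames = fc.copy(), []
    for _ in range(n_steps):
        z = z + T(fa, fb, z, mode)
        frames.append(g(z))
    return frames
```

With mode='add', N steps give exactly f(c) + N(f(b) \u2212 f(a)), which is why the purely additive variant cannot model circular manifolds such as rotation.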
All parameters were trained with backpropagation using stochastic gradient descent (SGD).\n3.3 Analogy-making with a disentangled feature representation\nVisual analogies change some aspects of a query image, and leave others unchanged; for example, changing the viewpoint but preserving the shape and texture of an object. To exploit this fact, we incorporate disentangling into our analogy prediction model. A disentangled representation is simply a concatenation of coordinates along each underlying factor of variation. If one can reliably infer these disentangled coordinates, a subset of analogies can be solved simply by swapping sets of coordinates among a reference and query embedding, and projecting back into the image space. However, in general, disentangling alone cannot solve analogies that require traversing the manifold structure of a given factor, and by itself does not capture image relationships.\nIn this section we show how to incorporate disentangled features into our analogy model. The disentangling component makes each group of embedding features encode its respective factor of variation and be invariant to the others. The analogy component enables the model to traverse the manifold of a given factor or subset of factors.\n\nAlgorithm 2: Disentangling training update. The switches s determine which units from f(a) and f(b) are used to reconstruct image c.\nGiven input images a, b and target c\nGiven switches s \u2208 {0, 1}^K\nz \u2190 s \u00b7 f(a) + (1 \u2212 s) \u00b7 f(b)\n\u2206\u03b8 \u221d \u2202/\u2202\u03b8 (||g(z) \u2212 c||^2_2)\n\nFigure 3: The encoder f learns a disentangled representation, in this case for pitch, elevation and identity of 3D car models. 
In the example above, switches s would be a block [0; 1; 1] vector.\n\nFor learning a disentangled representation, we require three-image tuples: a pair from which to\nextract hidden units, and a third to act as a target for prediction. As shown in Figure 3, We use a\nvector of switch units s that decides which elements from f (a) and which from f (b) will be used\nto form the hidden representation z \u2208 RK. Typically s will have a block structure according to the\ngroups of units associated to each factor of variation. Once z has been extracted, it is projected back\ninto the image space via the decoder g(z).\nThe key to learning disentangled features is that images a, b, c should be distinct, so that there is no\npath from any image to itself. This way, the reconstruction target forces the network to separate the\nvisual concepts shared by (a, c) and (b, c), respectively, rather than learning the identity mapping.\nConcretely, the disentangling objective can be written as:\n\nLdis =\n\n||c \u2212 g(s \u00b7 f (a) + (1 \u2212 s) \u00b7 f (b))||2\n\n2\n\n(6)\n\n(cid:88)\n\na,b,c,s\u2208D\n\nNote that unlike analogy training, disentangling only requires a dataset D of 3-tuple of images a, b, c\nalong with a switch unit vector s. Intuitively, s describes the sense in which a, b and c are related.\nAlgorithm 2 describes the learning update we used to learn a disentangled representation.\n4 Experiments\nWe evaluated our methods using three datasets. The \ufb01rst is a set of 2D colored shapes, which is\na simple yet nontrivial benchmark for visual analogies. The second is a set of 2D sprites from the\nopen-source video game project called Liberated Pixel Cup [1], which we chose in order to get\ncontrolled variation in a large number of character attributes and animations. 
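Returning briefly to Section 3.3, the switch-based mixing at the heart of Eq. (6) and Algorithm 2 can be sketched as follows (illustrative NumPy; the block sizes are hypothetical):

```python
import numpy as np

def mix_with_switches(fa, fb, s):
    # Algorithm 2: z = s * f(a) + (1 - s) * f(b). The binary switch
    # vector s selects, per unit, whether the embedding coordinate
    # comes from image a or from image b.
    return s * fa + (1.0 - s) * fb

rng = np.random.default_rng(0)
K = 6
# Hypothetical block structure: first 2 units encode identity,
# remaining 4 encode pose. Take identity from a and pose from b:
s = np.array([1, 1, 0, 0, 0, 0], dtype=float)
fa, fb = rng.standard_normal(K), rng.standard_normal(K)
z = mix_with_switches(fa, fb, s)
# Training would then minimize ||g(z) - c||^2 against a target image c
# that shares identity with a and pose with b, as in Eq. (6).
assert np.allclose(z[:2], fa[:2]) and np.allclose(z[2:], fb[2:])
```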
The third is a set of\n3D car model renderings [11], which allowed us to train a model to perform out-of-plane rotation.\nWe used Caffe [14] to train our encoder and decoder networks, with a custom Matlab wrapper\nimplementing our analogy sampling and training objectives. Many additional qualitative results of\nimages generated by our model are presented in the supplementary material.\n4.1 Transforming shapes: comparison of analogy models\nThe shapes dataset was used to benchmark performance on rotation, scaling and translation analo-\ngies. Speci\ufb01cally, we generated 48 \u00d7 48 images scaled to [0, 1] with four shapes, eight colors, four\nscales, \ufb01ve row and column positions, and 24 rotation angles.\nWe compare the performance of our models trained with Ladd, Lmul and Ldeep objectives, respec-\ntively. We did not perform disentangling training in this experiment. The encoder f consisted\nof 4096-1024-512-dimensional fully connected layers, with recti\ufb01ed linear nonlinearities (relu) for\nintermediate layers. The \ufb01nal embedding layer did not use any nonlinearity. The decoder g archi-\ntecture mirrors the encoder, but did not share weights. We trained for 200k steps with mini-batch\nsize 25 (i.e. 25 analogy 4-tuples per mini-batch). 
We used SGD with momentum 0.9, base learning rate 0.001, and decayed the learning rate by factor 0.1 every 100k steps.\n\nTable 1: Comparison of squared pixel prediction error of Ladd, Lmul and Ldeep on shape analogies.\nModel | Rotation steps (1, 2, 3, 4) | Scaling steps (1, 2, 3, 4) | Translation steps (1, 2, 3, 4)\nLadd | 8.39, 11.0, 15.1, 21.5 | 5.57, 6.09, 7.22, 14.6 | 5.44, 5.66, 6.25, 7.45\nLmul | 8.04, 11.2, 13.5, 14.2 | 4.36, 4.70, 5.78, 14.8 | 4.24, 4.45, 5.24, 6.90\nLdeep | 1.98, 2.19, 2.45, 2.87 | 3.97, 3.94, 4.37, 11.9 | 3.84, 3.81, 3.96, 4.61\n\nFigure 4: Analogy predictions made by Ldeep for rotation, scaling and translation, respectively by row. Ladd and Lmul perform as well for scaling and translation, but fail for rotation.\nFigure 5: Mean-squared prediction error on repeated application of rotation analogies.\n\nFigure 4 shows repeated predictions from Ldeep on rotation, scaling and translation test set analogies, showing that our model has learned to traverse these manifolds. Table 1 shows that Ladd and Lmul perform similarly for scaling and translation, but only Ldeep can perform accurate rotation analogies. Further extrapolation results with repeated rotations are shown in Figure 5. Though both Lmul and Ldeep are in principle capable of learning the circular pose manifold, we suspect that Ldeep has much better performance due to the difficulty of training multiplicative models such as Lmul.\n4.2 Generating 2D video game sprites\nGame developers often use what are known as \u201csprites\u201d to portray characters and objects in 2D video games (more commonly on older systems, but still seen on phones and indie games). 
This entails\nsigni\ufb01cant human effort to draw each frame of each common animation for each character.3 In this\nsection we show how animations can be transferred to new characters by analogy.\nOur dataset consists of 60 \u00d7 60 color images of sprites scaled to [0, 1], with 7 attributes: body type,\nsex, hair type, armor type, arm type, greaves type, and weapon type, with 672 total unique characters.\nFor each character, there are 5 animations each from 4 viewpoints: spellcast, thrust, walk, slash and\nshoot. Each animation has between 6 and 13 frames. We split the data by characters: 500 training,\n72 validation and 100 for testing.\nWe conducted experiments using the Ladd and Ldeep variants of our objective, with and without dis-\nentangled features. We also experimented with a disentangled feature version in which the identity\nunits are taken to be the 22-dimensional character attribute vector, from which the pose is disentan-\ngled. In this case, the encoder for identity units acts as multiple softmax classi\ufb01ers, one for each\nattribute, hence we refer to this objective in experiments as Ldis+cls.\nThe encoder network consisted of two layers of 5\u00d7 5 convolution with stride 2 and relu, followed by\ntwo fully-connected and relu layers, followed by a projection onto the 1024-dimensional embedding.\nThe decoder mirrors the encoder. To increase the spatial dimension we use simple upsampling in\nwhich we copy each input cell value to the upper-left corner of its corresponding 2 \u00d7 2 output.\nFor Ldis, we used 512 units for identity and 512 for pose. For Ldis+cls, we used 22 categorical\nunits for identity, which is the attribute vector, and the remaining 490 for pose. During training for\nLdis+cls, we did not backpropagate reconstruction error through the identity units; we only used\nthe attribute classi\ufb01cation objective for those units. 
When Ldeep is used, the internal layers of the transformation function T (see Figure 2) had dimension 300, and were each followed by relu. We trained the models using SGD with momentum 0.9 and learning rate 0.00001, decayed by factor 0.1 every 100k steps. Training was conducted for 200k steps with mini-batch size 25.\nFigure 6 demonstrates the task of animation transfer, with predictions from a model trained on Ladd. Table 2 provides a quantitative comparison of Ladd, Ldis and Ldis+cls. We found that the disentangling and additive analogy models perform similarly, and that using attributes for disentangled identity features provides a further gain. We conjecture that Ldis+cls wins because changes in certain aspects of appearance, such as arm color, have a very small effect in pixel space, yielding a weak signal for pixel prediction, but still provide a strong signal to an attribute classifier.\n\n3In some cases the work may be decreased by projecting 3D models to 2D or by other heuristics, but in general the work scales with the number of animations and characters.\n\nFigure 6: Transferring animations. The top row shows the reference, and the bottom row shows the transferred animation, where the first frame (in red) is the starting frame of a test set character.\n\nTable 2: Mean-squared pixel error on test analogies, by animation.\nModel | spellcast | thrust | walk | shoot | slash | average\nLadd | 41.0 | 55.7 | 53.8 | 77.6 | 52.1 | 56.0\nLdis | 40.8 | 52.6 | 55.8 | 79.8 | 53.5 | 56.5\nLdis+cls | 13.3 | 24.6 | 17.2 | 40.8 | 18.9 | 23.0\n\nFrom a practical perspective, the ability to transfer poses accurately to unseen characters could help decrease the manual labor of drawing (at least of drawing the assets comprising each character in each animation frame). 
However, training this model required that each transferred animation already has hundreds of examples. Ideally, the model could be shown a small number of examples for a new animation, and transfer it to the existing character database. We call this setting \u201cfew-shot\u201d analogy-making because only a small number of the target animations are provided.\n\nTable 3: Mean-squared pixel-prediction error for few-shot analogy transfer of the \u201cspellcast\u201d animation from each of 4 viewpoints. Ldis outperforms Ladd, and Ldis+cls performs the best even with only 6 examples.\nModel | 6 examples | 12 | 24 | 48\nLadd | 42.8 | 42.7 | 42.3 | 41.0\nLdis | 19.3 | 18.9 | 17.4 | 16.3\nLdis+cls | 15.0 | 12.0 | 11.3 | 10.4\n\nFigure 7: Few-shot prediction with 48 examples.\nTable 3 provides a quantitative comparison and Figure 7 provides a qualitative comparison of our proposed models in this task. We find that Ldis+cls provides the best performance by a wide margin. Unlike in Table 2, Ldis outperforms Ladd, suggesting that disentangling may allow new animations to be learned in a more data-efficient manner. However, Ldis has an advantage in that it can average the identity features of multiple views of a query character, which Ladd cannot do.\nThe previous analogies only required us to combine disentangled features from two characters, e.g. the identity from one and the pose from another, and so disentangling was sufficient. However, our analogy method enables us to perform more challenging analogies by learning the manifold of character animations, defined by the sequence of frames in each animation. Adjacent frames are thus neighbors on the manifold, and each animation sequence can be viewed as a fiber in this manifold.\nWe trained a model by forming analogy tuples across animations as depicted in Fig. 8, using disentangled identity and pose features. 
Pose transformations were modeled by deep additive interactions, and we used Ldis+cls to disentangle pose from identity units. Figure 9 shows the result of several analogies and their extrapolations, including character rotation for which we created animations.\n\nFigure 8: A cartoon visualization of the \u201cshoot\u201d animation manifold for two different characters in different viewpoints. The model can learn the structure of the animation manifold by forming analogy tuples during training; example tuples are circled in red and blue above.\n\nFigure 9: Extrapolating by analogy. The model sees the reference / output pair and repeatedly applies the inferred transformation to the query. This inference requires learning the manifold of animation poses, and cannot be done by simply combining and decoding disentangled features.\n4.3 3D car analogies\nIn this section we apply our model to analogy-making on 3D car renderings subject to changes in appearance and rotation angle. Unlike in the case of shapes, this requires the ability of the model to perform out-of-plane rotation, and the depicted objects are more complex.\n\nTable 4: Measuring the disentangling performance on 3D cars. Pose AUC refers to area under the ROC curve for same-or-different pose verification, and ID AUC for same-or-different car verification on pairs of test set images.\nFeatures | Pose AUC | ID AUC\nPose units | 95.6 | 85.2\nID units | 50.1 | 98.5\nCombined | 94.6 | 98.4\n\nFigure 10: 3D car analogies. The column \u201cGT\u201d denotes ground truth.\nWe use the car CAD models from [11]. For each of the 199 car models, we generated 64 \u00d7 64 color renderings from 24 rotation angles, each offset by 15 degrees. We split the models into 100 training, 49 validation and 50 testing. 
The same convolutional network architecture was used as in the sprites experiments, and we used 512 units for identity and 128 for pose.\n\nFigure 11: Repeated rotation analogies in forward and reverse directions, starting from frontal pose.\nFigure 10 shows test set predictions of our model trained on Ldis, where images in the fourth column combine pose units from the first column and identity units from the second. Table 4 shows that the learned features are in fact disentangled, and discriminative for identity and pose matching despite not being discriminatively trained. Figure 11 shows repeated rotation analogies on test set cars using a model trained on Ldeep, demonstrating that our model can perform out-of-plane rotation. This type of extrapolation is difficult because the query image shows a different car from a different starting pose. We expect that a recurrent architecture can further improve the results, as shown in [29].\n5 Conclusions\nWe studied the problem of visual analogy making using deep neural networks, and proposed several new models. Our experiments showed that our proposed models are very general and can learn to make analogies based on appearance, rotation, 3D pose, and various object attributes. We provided a connection between analogy making and disentangling factors of variation, and showed that our proposed analogy representations can overcome certain limitations of disentangled representations.\nAcknowledgements This work was supported in part by NSF GRFP grant DGE-1256260, ONR grant N00014-13-1-0762, NSF CAREER grant IIS-1453651, and NSF grant CMMI-1266184. We thank NVIDIA for donating a Tesla K40 GPU.\n\nReferences\n\n[1] Liberated pixel cup. http://lpc.opengameart.org/. Accessed: 2015-05-21.\n[2] P. Bartha. Analogy and analogical reasoning. In The Stanford Encyclopedia of Philosophy. 
Fall 2013\n\nedition, 2013.\n\n[3] P. B\u00b4enard, F. Cole, M. Kass, I. Mordatch, J. Hegarty, M. S. Senn, K. Fleischer, D. Pesare, and K. Breeden.\n\nStylizing animation by example. ACM Transactions on Graphics, 32(4):119, 2013.\n\n[4] B. Cheung, J. A. Livezey, A. K. Bansal, and B. A. Olshausen. Discovering hidden factors of variation in\n\ndeep networks. In ICLR Workshop, 2015.\n\n[5] T. Cohen and M. Welling. Learning the irreducible representations of commutative Lie groups. In ICML,\n\n2014.\n\n[6] T. Cohen and M. Welling. Transformation properties of learned visual representations. In ICLR, 2015.\n[7] G. Desjardins, A. Courville, and Y. Bengio. Disentangling factors of variation via generative entangling.\n\narXiv preprint arXiv:1210.5474, 2012.\n\n[8] W. Ding and G. W. Taylor. Mental rotation by optimizing transforming distance.\n\narXiv:1406.3010, 2014.\n\narXiv preprint\n\n[9] P. Doll\u00b4ar, V. Rabaud, and S. Belongie. Learning to traverse image manifolds. In NIPS, 2007.\n[10] A. Dosovitskiy, J. T. Springenberg, and T. Brox. Learning to generate chairs with convolutional neural\n\nnetworks. In CVPR, 2015.\n\n[11] S. Fidler, S. Dickinson, and R. Urtasun. 3d object detection and viewpoint estimation with a deformable\n\n3d cuboid model. In NIPS, 2012.\n\n[12] A. Hertzmann, C. Jacobs, N. Oliver, B. Curless, and D. Salesin. Image analogies. In SIGGRAPH, 2001.\n[13] S. J. Hwang, K. Grauman, and F. Sha. Analogy-preserving semantic embedding for visual object catego-\n\nrization. In NIPS, 2013.\n\n[14] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe:\n\nConvolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.\n\n[15] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.\n[16] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised learning with deep genera-\n\ntive models. In NIPS, 2014.\n\n[17] T. D. Kulkarni, W. Whitney, P. 
Kohli, and J. B. Tenenbaum. Deep convolutional inverse graphics network.\n\nIn NIPS, 2015.\n\n[18] O. Levy, Y. Goldberg, and I. Ramat-Gan. Linguistic regularities in sparse and explicit word representa-\n\ntions. In CoNLL-2014, 2014.\n\n[19] R. Memisevic and G. E. Hinton. Learning to represent spatial transformations with factored higher-order\n\nboltzmann machines. Neural Computation, 22(6):1473\u20131492, 2010.\n\n[20] V. Michalski, R. Memisevic, and K. Konda. Modeling deep temporal dependencies with recurrent gram-\n\nmar cells. In NIPS, 2014.\n\n[21] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and\n\nphrases and their compositionality. In NIPS, 2013.\n\n[22] J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. In EMNLP,\n\n2014.\n\n[23] S. Reed, K. Sohn, Y. Zhang, and H. Lee. Learning to disentangle factors of variation with manifold\n\ninteraction. In ICML, 2014.\n\n[24] S. Rifai, Y. Bengio, A. Courville, P. Vincent, and M. Mirza. Disentangling factors of variation for facial\n\nexpression recognition. In ECCV. 2012.\n\n[25] J. Susskind, R. Memisevic, G. Hinton, and M. Pollefeys. Modeling the joint density of two images under\n\na variety of transformations. In CVPR, 2011.\n\n[26] Y. Tang, R. Salakhutdinov, and G. Hinton. Tensor analyzers. In ICML, 2013.\n[27] J. B. Tenenbaum and W. T. Freeman. Separating style and content with bilinear models. Neural compu-\n\ntation, 12(6):1247\u20131283, 2000.\n\n[28] P. D. Turney. Similarity of semantic relations. Computational Linguistics, 32(3):379\u2013416, 2006.\n[29] J. Yang, S. Reed, M.-H. Yang, and H. Lee. Weakly-supervised disentangling with recurrent transforma-\n\ntions for 3d view synthesis. In NIPS, 2015.\n\n[30] Z. Zhu, P. Luo, X. Wang, and X. Tang. Multi-view perceptron: a deep model for learning face identity\n\nand view representations. 
In NIPS, 2014.\n\n9\n\n\f", "award": [], "sourceid": 778, "authors": [{"given_name": "Scott", "family_name": "Reed", "institution": "University of Michigan"}, {"given_name": "Yi", "family_name": "Zhang", "institution": "University of Michigan"}, {"given_name": "Yuting", "family_name": "Zhang", "institution": "University of Michigan"}, {"given_name": "Honglak", "family_name": "Lee", "institution": "U. Michigan"}]}