{"title": "Coupled Generative Adversarial Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 469, "page_last": 477, "abstract": "We propose the coupled generative adversarial nets (CoGAN) framework for generating pairs of corresponding images in two different domains. The framework consists of a pair of generative adversarial nets, each responsible for generating images in one domain. We show that by enforcing a simple weight-sharing constraint, the CoGAN learns to generate pairs of corresponding images without existence of any pairs of corresponding images in the two domains in the training set. In other words, the CoGAN learns a joint distribution of images in the two domains from images drawn separately from the marginal distributions of the individual domains. This is in contrast to the existing multi-modal generative models, which require corresponding images for training. We apply the CoGAN to several pair image generation tasks. For each task, the CoGAN learns to generate convincing pairs of corresponding images. We further demonstrate the applications of the CoGAN framework for the domain adaptation and cross-domain image generation tasks.", "full_text": "Coupled Generative Adversarial Networks\n\nMitsubishi Electric Research Labs (MERL),\n\nMitsubishi Electric Research Labs (MERL),\n\nMing-Yu Liu\n\nmliu@merl.com\n\nOncel Tuzel\n\noncel@merl.com\n\nAbstract\n\nWe propose coupled generative adversarial network (CoGAN) for learning a joint\ndistribution of multi-domain images. In contrast to the existing approaches, which\nrequire tuples of corresponding images in different domains in the training set,\nCoGAN can learn a joint distribution without any tuple of corresponding images.\nIt can learn a joint distribution with just samples drawn from the marginal distri-\nbutions. 
This is achieved by enforcing a weight-sharing constraint that limits the\nnetwork capacity and favors a joint-distribution solution over a mere product of the\nmarginal distributions. We apply CoGAN to several joint distribution learning tasks, including\nlearning a joint distribution of color and depth images, and learning a joint\ndistribution of face images with different attributes. For each task it successfully\nlearns the joint distribution without any tuple of corresponding images. We also\ndemonstrate its applications to domain adaptation and image transformation.\n\n1\n\nIntroduction\n\nThis paper concerns the problem of learning a joint distribution of multi-domain images from data. A\njoint distribution of multi-domain images is a probability density function that assigns a density value\nto each joint occurrence of images in different domains, such as images of the same scene in different\nmodalities (color and depth images) or images of the same face with different attributes (smiling and\nnon-smiling). Once a joint distribution of multi-domain images is learned, it can be used to generate\nnovel tuples of images. In addition to movie and game production, joint image distribution learning\nfinds applications in image transformation and domain adaptation. When training data are given as\ntuples of corresponding images in different domains, several existing approaches [1, 2, 3, 4] can be\napplied. However, building a dataset with tuples of corresponding images is often a challenging task.\nThis correspondence dependency greatly limits the applicability of the existing approaches.\nTo overcome the limitation, we propose the coupled generative adversarial networks (CoGAN)\nframework. It can learn a joint distribution of multi-domain images without any corresponding\nimages across domains in the training set. Only a set of images drawn separately from the\nmarginal distributions of the individual domains is required. 
CoGAN is based on the generative\nadversarial networks (GAN) framework [5], which has been established as a viable solution for image\ndistribution learning tasks. CoGAN extends GAN for joint image distribution learning tasks.\nCoGAN consists of a tuple of GANs, each for one image domain. When trained naively, the CoGAN\nlearns a product of marginal distributions rather than a joint distribution. We show that by enforcing a\nweight-sharing constraint the CoGAN can learn a joint distribution without existence of corresponding\nimages in different domains. The CoGAN framework is inspired by the idea that deep neural networks\nlearn a hierarchical feature representation. By enforcing the layers that decode high-level semantics\nin the GANs to share the weights, it forces the GANs to decode the high-level semantics in the\nsame way. The layers that decode low-level details then map the shared representation to images in\nindividual domains for confusing the respective discriminative models. CoGAN is for multi-image\ndomains but, for ease of presentation, we focused on the case of two image domains in the paper.\nHowever, the discussions and analyses can be easily generalized to multiple image domains.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fWe apply CoGAN to several joint image distribution learning tasks. Through convincing visualization\nresults and quantitative evaluations, we verify its effectiveness. We also show its applications to\nunsupervised domain adaptation and image transformation.\n\n2 Generative Adversarial Networks\n\nA GAN consists of a generative model and a discriminative model. The objective of the generative\nmodel is to synthesize images resembling real images, while the objective of the discriminative model\nis to distinguish real images from synthesized ones. 
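This adversarial objective, formalized as Eq. (1) below, can be illustrated with a minimal numeric sketch. The helper name and the sample values here are our own illustration, not part of the paper:

```python
import numpy as np

def gan_value(f_real, f_fake):
    """Monte Carlo estimate of V(f, g) = E_x[-log f(x)] + E_z[-log(1 - f(g(z)))].

    f_real: discriminator outputs f(x) on real images, values in (0, 1).
    f_fake: discriminator outputs f(g(z)) on synthesized images, values in (0, 1).
    """
    return float(-np.mean(np.log(f_real)) - np.mean(np.log1p(-f_fake)))

# The discriminator minimizes V: confident, correct predictions yield a
# lower value than chance-level predictions, while the generator tries to
# push f_fake toward 1 and thereby increase V.
v_confident = gan_value(np.array([0.9, 0.95]), np.array([0.1, 0.05]))
v_chance = gan_value(np.array([0.5, 0.5]), np.array([0.5, 0.5]))
```

At chance level (all outputs 0.5) the value is 2 log 2, the well-known equilibrium value of the GAN game.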
Both the generative and discriminative models\nare realized as multilayer perceptrons.\nLet x be a natural image drawn from a distribution pX, and let z be a random vector in R^d. In this paper we\nonly consider z drawn from a uniform distribution with support [-1, 1]^d, but other distributions,\nsuch as a multivariate normal distribution, can be used as well. Let g and f be the generative and\ndiscriminative models, respectively. The generative model takes z as input and outputs an image,\ng(z), that has the same support as x. Denote the distribution of g(z) by pG. The discriminative model\nestimates the probability that an input image is drawn from pX. Ideally, f(x) = 1 if x ~ pX and\nf(x) = 0 if x ~ pG. The GAN framework corresponds to a minimax two-player game, and the\ngenerative and discriminative models can be trained jointly by solving\n\nmax_g min_f V(f, g) ≡ E_{x~pX}[-log f(x)] + E_{z~pZ}[-log(1 - f(g(z)))].   (1)\n\nIn practice, (1) is solved by alternating the following two gradient update steps:\n\nStep 1: θ_f^{t+1} = θ_f^t - λ_t ∇_{θ_f} V(f^t, g^t),   Step 2: θ_g^{t+1} = θ_g^t + λ_t ∇_{θ_g} V(f^{t+1}, g^t)\n\nwhere θ_f and θ_g are the parameters of f and g, λ is the learning rate, and t is the iteration number.\nGoodfellow et al. [5] show that, given enough capacity for f and g and sufficient training iterations,\nthe distribution pG converges to pX. In other words, starting from a random vector z, the network g can\nsynthesize an image, g(z), that resembles one drawn from the true distribution pX.\n\n3 Coupled Generative Adversarial Networks\n\nCoGAN, as illustrated in Figure 1, is designed for learning a joint distribution of images in two different\ndomains. It consists of a pair of GANs, GAN1 and GAN2; each is responsible for synthesizing\nimages in one domain. 
During training, we force them to share a subset of parameters. As a result,\nthe two GANs learn to synthesize pairs of corresponding images without correspondence supervision.\nGenerative Models: Let x1 and x2 be images drawn from the marginal distribution of the 1st\ndomain, x1 ~ pX1, and the marginal distribution of the 2nd domain, x2 ~ pX2, respectively. Let g1\nand g2 be the generative models of GAN1 and GAN2, which map a random vector input z to images\nthat have the same support as x1 and x2, respectively. Denote the distributions of g1(z) and g2(z) by\npG1 and pG2. Both g1 and g2 are realized as multilayer perceptrons:\n\ng1(z) = g1^(m1)( g1^(m1-1)( ... g1^(2)( g1^(1)(z) ) ) ),   g2(z) = g2^(m2)( g2^(m2-1)( ... g2^(2)( g2^(1)(z) ) ) )\n\nwhere g1^(i) and g2^(i) are the ith layers of g1 and g2 and m1 and m2 are the numbers of layers in g1 and\ng2. Note that m1 need not equal m2. Also note that the support of x1 need not equal that of x2.\nThrough layers of perceptron operations, the generative models gradually decode information from\nmore abstract concepts to more material details. The first layers decode high-level semantics and the\nlast layers decode low-level details. Note that this information flow direction is opposite to that in a\ndiscriminative deep neural network [6], where the first layers extract low-level features while the last\nlayers extract high-level features.\nBased on the idea that a pair of corresponding images in two domains share the same high-level\nconcepts, we force the first layers of g1 and g2 to have identical structure and share the weights.\nThat is, θ_g1^(i) = θ_g2^(i), for i = 1, 2, ..., k, where k is the number of shared layers and θ_g1^(i) and θ_g2^(i)\nare the parameters of g1^(i) and g2^(i), respectively. This constraint forces the high-level semantics to\nbe decoded in the same way in g1 and g2. No constraints are enforced on the last layers. They can\nmaterialize the shared high-level representation differently to fool the respective discriminators.\n\nFigure 1: CoGAN consists of a pair of GANs: GAN1 and GAN2. Each has a generative model for synthesizing\nrealistic images in one domain and a discriminative model for classifying whether an image is real or synthesized.\nWe tie the weights of the first few layers (responsible for decoding high-level semantics) of the generative models,\ng1 and g2. We also tie the weights of the last few layers (responsible for encoding high-level semantics) of the\ndiscriminative models, f1 and f2. This weight-sharing constraint allows CoGAN to learn a joint distribution of\nimages without correspondence supervision. A trained CoGAN can be used to synthesize pairs of corresponding\nimages, i.e., pairs of images sharing the same high-level abstraction but having different low-level realizations.\n\nDiscriminative Models: Let f1 and f2 be the discriminative models of GAN1 and GAN2, given by\n\nf1(x1) = f1^(n1)( f1^(n1-1)( ... f1^(2)( f1^(1)(x1) ) ) ),   f2(x2) = f2^(n2)( f2^(n2-1)( ... f2^(2)( f2^(1)(x2) ) ) )\n\nwhere f1^(i) and f2^(i) are the ith layers of f1 and f2 and n1 and n2 are the numbers of layers. The\ndiscriminative models map an input image to a probability score, estimating the likelihood that the\ninput is drawn from a true data distribution. The first layers of the discriminative models extract\nlow-level features, while the last layers extract high-level features. 
Because the input images are\nrealizations of the same high-level semantics in two different domains, we force f1 and f2 to have\nthe same last layers, which is achieved by sharing the weights of the last layers via\nθ_f1^(n1-i) = θ_f2^(n2-i), for i = 0, 1, ..., l-1, where l is the number of weight-sharing layers in the discriminative\nmodels, and θ_f1^(i) and θ_f2^(i) are the network parameters of f1^(i) and f2^(i), respectively. The weight-sharing\nconstraint in the discriminators helps reduce the total number of parameters in the network,\nbut it is not essential for learning a joint distribution.\nLearning: The CoGAN framework corresponds to a constrained minimax game given by\n\nmax_{g1,g2} min_{f1,f2} V(f1, f2, g1, g2), subject to θ_g1^(i) = θ_g2^(i) for i = 1, 2, ..., k, and θ_f1^(n1-j) = θ_f2^(n2-j) for j = 0, 1, ..., l-1,   (2)\n\nwhere the value function V is given by\n\nV(f1, f2, g1, g2) = E_{x1~pX1}[-log f1(x1)] + E_{z~pZ}[-log(1 - f1(g1(z)))] + E_{x2~pX2}[-log f2(x2)] + E_{z~pZ}[-log(1 - f2(g2(z)))].   (3)\n\nIn the game, there are two teams and each team has two players. The generative models form a\nteam and work together to synthesize a pair of images in two different domains to confuse the\ndiscriminative models. The discriminative models try to differentiate images drawn from the training\ndata distribution in the respective domains from those drawn from the respective generative models.\nThe collaboration between the players in the same team is established through the weight-sharing\nconstraint. Similar to GAN, CoGAN can be trained by back propagation with the alternating gradient\nupdate steps. 
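The weight-sharing constraint can be made concrete with a small sketch: two toy generators that reuse the same first-layer parameter array. All sizes, names, and the tanh nonlinearity below are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# First (shared) generator layer decodes high-level semantics: theta_g1^(1) = theta_g2^(1).
W_shared = 0.1 * rng.standard_normal((64, 100))
# Last (domain-specific) layers materialize the shared code as domain-1/2 "images".
W1_last = 0.1 * rng.standard_normal((16, 64))
W2_last = 0.1 * rng.standard_normal((16, 64))

def g1(z):
    return W1_last @ np.tanh(W_shared @ z)

def g2(z):
    return W2_last @ np.tanh(W_shared @ z)

z = rng.standard_normal(100)
pair = (g1(z), g2(z))  # same high-level code, different low-level renderings
```

Because both generators read the same `W_shared` array, any gradient update to it changes the high-level decoding of both domains at once, which is the effect the equality constraints in (2) enforce.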
The details of the learning algorithm are given in the supplementary materials.\nRemarks: CoGAN learning requires training samples drawn from the marginal distributions, pX1\nand pX2. It does not rely on samples drawn from the joint distribution, pX1,X2, for which correspondence\nsupervision would be available. Our main contribution is in showing that with just samples drawn\nseparately from the marginal distributions, CoGAN can learn a joint distribution of images in the\ntwo domains. Both the weight-sharing constraint and adversarial training are essential for enabling\nthis capability. Unlike autoencoder learning [3], which encourages a generated pair of images\nto be identical to the target pair of corresponding images in the two domains to minimize the\nreconstruction loss1, the adversarial training only encourages the generated pair of images to be\nindividually resembling the images in the respective domains.\n\n1This is why [3] requires samples from the joint distribution for learning the joint distribution.\n\nFigure 2: Left (Task A): generation of digit and corresponding edge images. Right (Task B): generation of digit\nand corresponding negative images. Each of the top and bottom pairs was generated using the same input noise.\nWe visualized the results by traversing the input space.\n\nFigure 3: The figures plot the average pixel agreement ratios of the CoGANs with different weight-sharing\nconfigurations for Tasks A and B. The larger the pixel agreement ratio, the better the pair generation performance.\nWe found that the performance was positively correlated with the number of weight-sharing layers in the\ngenerative models but was uncorrelated with the number of weight-sharing layers in the discriminative models.\nCoGAN learned the joint distribution without weight-sharing layers in the discriminative models.\n\n
With this more relaxed adversarial\ntraining setting, the weight-sharing constraint can then kick in for capturing correspondences between\ndomains. With the weight-sharing constraint, the generative models must utilize the capacity more\nef\ufb01ciently for fooling the discriminative models, and the most ef\ufb01cient way of utilizing the capacity\nfor generating a pair of realistic images in two domains is to generate a pair of corresponding images\nsince the neurons responsible for decoding high-level semantics can be shared.\nCoGAN learning is based on existence of shared high-level representations in the domains. If such a\nrepresentation does not exist for the set of domains of interest, it would fail.\n\n4 Experiments\n\nIn the experiments, we emphasized there were no corresponding images in the different domains in the\ntraining sets. CoGAN learned the joint distributions without correspondence supervision. We were\nunaware of existing approaches with the same capability and hence did not compare CoGAN with\nprior works. Instead, we compared it to a conditional GAN to demonstrate its advantage. Recognizing\nthat popular performance metrics for evaluating generative models all subject to issues [7], we\nadopted a pair image generation performance metric for comparison. Many details including the\nnetwork architectures and additional experiment results are given in the supplementary materials. An\nimplementation of CoGAN is available in https://github.com/mingyuliutw/cogan.\nDigits: We used the MNIST training set to train CoGANs for the following two tasks. Task A is\nabout learning a joint distribution of a digit and its edge image. Task B is about learning a joint\ndistribution of a digit and its negative image. In Task A, the 1st domain consisted of the original\nhandwritten digit images, while the 2nd domain consisted of their edge images. We used an edge\ndetector to compute training edge images for the 2nd domain. 
In the supplementary materials, we also\nshow an experiment for learning a joint distribution of a digit and its 90-degree in-plane rotation.\nWe used deep convolutional networks to realize the CoGAN. The two generative models had an identical\nstructure; both had 5 layers and were fully convolutional. The stride lengths of the convolutional\nlayers were fractional. The models also employed batch normalization [8] and\nparametrized rectified linear units [9]. We shared the parameters of all the layers except\nfor the last convolutional layers. For the discriminative models, we used a variant of LeNet [10].\nThe inputs to the discriminative models were batches containing output images from the generative\nmodels and images from the two training subsets (each pixel value linearly scaled to [0, 1]).\nWe divided the training set into two equal-size non-overlapping subsets. One was used to train GAN1\nand the other was used to train GAN2. We used the ADAM algorithm [11] for training and set the\nlearning rate to 0.0002, the 1st momentum parameter to 0.5, and the 2nd momentum parameter to\n0.999, as suggested in [12]. The mini-batch size was 128. We trained the CoGAN for 25000 iterations.\nThese hyperparameters were fixed for all the visualization experiments.\nThe CoGAN learning results are shown in Figure 2. We found that although the CoGAN was\ntrained without corresponding images, it learned to render corresponding ones for both Tasks A and\nB. 
This was due to the weight-sharing constraint imposed to the layers that were responsible for\ndecoding high-level semantics. Exploiting the correspondence between the two domains allowed\nGAN1 and GAN2 to utilize more capacity in the networks to better \ufb01t the training data. Without the\nweight-sharing constraint, the two GANs just generated two unrelated images in the two domains.\nWeight Sharing: We varied the numbers of weight-sharing layers in the generative and discriminative\nmodels to create different CoGANs for analyzing the weight-sharing effect for both tasks. Due to\nlack of proper validation methods, we did a grid search on the training iteration hyperparameter\nand reported the best performance achieved by each network. For quantifying the performance, we\ntransformed the image generated by GAN1 to the 2nd domain using the same method employed\nfor generating the training images in the 2nd domain. We then compared the transformed image\nwith the image generated by GAN2. A perfect joint distribution learning should render two identical\nimages. Hence, we used the ratios of agreed pixels between 10K pairs of images generated by\neach network (10K randomly sampled z) as the performance metric. We trained each network 5\ntimes with different initialization weights and reported the average pixel agreement ratios over the 5\ntrials for each network. The results are shown in Figure 3. We observed that the performance was\npositively correlated with the number of weight-sharing layers in the generative models. With more\nsharing layers in the generative models, the rendered pairs of images resembled true pairs drawn\nfrom the joint distribution more. We also noted that the performance was uncorrelated to the number\nof weight-sharing layers in the discriminative models. 
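A minimal implementation of this pixel agreement ratio might look as follows; the function name and the tolerance handling are our assumptions, since the paper does not specify them:

```python
import numpy as np

def pixel_agreement_ratio(imgs_a, imgs_b, tol=0.5):
    """Fraction of pixels on which two image batches agree within a tolerance.

    imgs_a: images generated by GAN1, transformed into the 2nd domain.
    imgs_b: images generated by GAN2 from the same random vectors z.
    """
    diff = np.abs(imgs_a.astype(np.float64) - imgs_b.astype(np.float64))
    return float(np.mean(diff <= tol))

a = np.array([[0, 255], [255, 0]])
b = np.array([[0, 255], [0, 0]])
ratio = pixel_agreement_ratio(a, b)  # 3 of 4 pixels agree -> 0.75
```

A perfectly learned joint distribution would drive this ratio to 1.0, since the transformed GAN1 output and the GAN2 output would be identical images.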
However, we still preferred discriminator\nweight-sharing because this reduces the total number of network parameters.\nComparison with Conditional GANs: We compared the CoGAN with conditional GANs [13].\nWe designed a conditional GAN with the generative and discriminative models identical to those in\nthe CoGAN. The only difference was that the conditional GAN took an additional binary variable as input,\nwhich controlled the domain of the output image. When the binary variable was 0, it generated an\nimage resembling images in the 1st domain; otherwise, it generated an image resembling images in\nthe 2nd domain. Similarly, no pairs of corresponding images were given during the conditional GAN\ntraining. We applied the conditional GAN to both Tasks A and B, hoping to empirically answer\nwhether a conditional model can learn to render corresponding images without correspondence\nsupervision. The pixel agreement ratio was used as the performance metric. The experiment results\nshowed that for Task A, CoGAN achieved an average ratio of 0.952, outperforming 0.909 achieved\nby the conditional GAN. For Task B, CoGAN achieved a score of 0.967, which was much better\nthan 0.778 achieved by the conditional GAN. The conditional GAN just generated two different\ndigits with the same random noise input but different binary variable values. These results showed\nthat the conditional model failed to learn a joint distribution from samples drawn from the marginal\ndistributions. We note that when the supports of the two domains are different, such as the\ncolor and depth image domains, the conditional model cannot even be applied.\nFaces: We applied CoGAN to learn a joint distribution of face images with different attributes. We trained\nseveral CoGANs, each for generating a face with an attribute and a corresponding face without the\nattribute. We used the CelebFaces Attributes dataset [14] for the experiments. 
The dataset covers\nlarge pose variations and background clutter. Each face image has several attributes, including\nblond hair, smiling, and eyeglasses. The face images with an attribute constituted the 1st domain, and\nthose without the attribute constituted the 2nd domain. No corresponding face images between the\ntwo domains were given. We resized the images to a resolution of 132 × 132 and randomly sampled\n128 × 128 regions for training. The generative and discriminative models were both 7-layer deep\nconvolutional neural networks.\nThe experiment results are shown in Figure 4. We randomly sampled two points in the 100-\ndimensional input noise space and visualized the rendered face images while traveling from one point to\nthe other.\n\nFigure 4: Generation of face images with different attributes using CoGAN. From top to bottom, the figure\nshows pair face generation results for the blond-hair, smiling, and eyeglasses attributes. For each pair, the 1st\nrow contains faces with the attribute, while the 2nd row contains corresponding faces without the attribute.\n\nWe found CoGAN generated pairs of corresponding faces, resembling those of the same\nperson with and without an attribute. As we travel in the space, the faces gradually change from one\nperson to another. Such deformations were consistent for both domains. Note that it is difficult to\ncreate a dataset with corresponding images for some attributes, such as blond hair, since the subjects\nwould have to color their hair. An approach like CoGAN that does not require corresponding\nimages is therefore preferable. We also noted that the number of faces with an attribute was often several times\nsmaller than that without the attribute in the dataset. However, CoGAN learning was not hindered by\nthis mismatch.\nColor and Depth Images: We used the RGBD dataset [15] and the NYU dataset [16] for learning\njoint distributions of color and depth images. 
The RGBD dataset contains registered color and depth\nimages of 300 objects captured by the Kinect sensor from different viewpoints. We partitioned the\ndataset into two equal-size non-overlapping subsets. The color images in the 1st subset were used for\ntraining GAN1, while the depth images in the 2nd subset were used for training GAN2. There were\nno corresponding depth and color images in the two subsets. The images in the RGBD dataset have\ndifferent resolutions. We resized them to a fixed resolution of 64 × 64. The NYU dataset contains\ncolor and depth images captured from indoor scenes using the Kinect sensor. We used the 1449\nprocessed depth images for the depth domain. The training images for the color domain were from\nall the color images in the raw dataset except for those registered with the processed depth images.\nWe resized both the depth and color images to a resolution of 176 × 132 and randomly cropped\n128 × 128 patches for training.\n\nFigure 5: Generation of color and depth images using CoGAN. The top figure shows the results for the RGBD\ndataset: the 1st row contains the color images, the 2nd row contains the depth images, and the 3rd and 4th rows\nvisualize the depth profile under different viewpoints. The bottom figure shows the results for the NYU dataset.\n\nFigure 5 shows the generation results. We found the rendered color and depth images resembled\ncorresponding RGB and depth image pairs even though no registered images existed in the two domains\nin the training set. 
The CoGAN recovered the appearance-depth correspondence without supervision.\n\n5 Applications\n\nIn addition to rendering novel pairs of corresponding images for movie and game production,\nCoGAN finds applications in the unsupervised domain adaptation and image transformation tasks.\nUnsupervised Domain Adaptation (UDA): UDA concerns adapting a classifier trained in one\ndomain to classify samples in a new domain for which no labeled examples are available\nfor re-training the classifier. Early works have explored ideas from subspace learning [17, 18] to\ndeep discriminative network learning [19, 20, 21]. We show that CoGAN can be applied to the UDA\nproblem. We studied the problem of adapting a digit classifier from the MNIST dataset to the USPS\ndataset. Due to domain shift, a classifier trained using one dataset achieves poor performance on the\nother. We followed the experiment protocol in [17, 20], which randomly samples 2000 images from\nthe MNIST dataset, denoted as D1, and 1800 images from the USPS dataset, denoted as D2, to define\na UDA problem. The USPS digits have a different resolution. We resized them to have the same\nresolution as the MNIST digits. We employed the CoGAN used for the digit generation task. For\nclassifying digits, we attached a softmax layer to the last hidden layer of the discriminative models.\nWe trained the CoGAN by jointly solving the digit classification problem in the MNIST domain,\nwhich used the images and labels in D1, and the CoGAN learning problem, which used the images\nin both D1 and D2. This produced two classifiers: c1(x1) ≡ c(f1^(3)(f1^(2)(f1^(1)(x1)))) for MNIST\nand c2(x2) ≡ c(f2^(3)(f2^(2)(f2^(1)(x2)))) for USPS. No label information in D2 was used. Note that\nf1^(2) ≡ f2^(2) and f1^(3) ≡ f2^(3) due to weight sharing, and c denotes the softmax layer. We then applied\nc2 to classify digits in the USPS dataset. The classifier adaptation from USPS to MNIST can be\nachieved in the same way. The learning hyperparameters were determined via a validation set. We\nreported the average accuracy over 5 trials with different randomly selected D1 and D2.\nTable 1 reports the performance of the proposed CoGAN approach in comparison to the state-of-the-art\nmethods for the UDA task. The results for the other methods were duplicated from [20].\nWe observed that CoGAN significantly outperformed the state-of-the-art methods. It improved the\naccuracy from 0.64 to 0.90, which translates to a 72% error reduction rate.\n\nTable 1: Unsupervised domain adaptation performance comparison. The table reports classification\naccuracies achieved by competing algorithms.\n\nMethod             | [17]  | [18]  | [19]  | [20]  | CoGAN\nFrom MNIST to USPS | 0.408 | 0.467 | 0.478 | 0.607 | 0.912 ±0.008\nFrom USPS to MNIST | 0.274 | 0.355 | 0.631 | 0.673 | 0.891 ±0.008\nAverage            | 0.341 | 0.411 | 0.554 | 0.640 | 0.902\n\nFigure 6: Cross-domain image transformation. For each pair, left is the input; right is the transformed image.\n\nCross-Domain Image Transformation: Let x1 be an image in the 1st domain. Cross-domain image\ntransformation is about finding the corresponding image in the 2nd domain, x2, such that the joint\nprobability density, p(x1, x2), is maximized. Let L be a loss function measuring the difference between\ntwo images. Given g1 and g2, the transformation can be achieved by first finding the random vector\nthat generates the query image in the 1st domain, z* = arg min_z L(g1(z), x1). After finding z*, one\ncan apply g2 to obtain the transformed image, x2 = g2(z*). In Figure 6, we show several CoGAN\ncross-domain transformation results, computed using the Euclidean loss function and the L-BFGS\noptimization algorithm. 
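The two-step inversion procedure can be sketched with toy linear generators. We solve the least-squares search for z* in closed form here, whereas the paper runs L-BFGS on its nonlinear generators; all matrices and names below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
A1 = rng.standard_normal((8, 4))  # toy linear generator for domain 1: g1(z) = A1 @ z
A2 = rng.standard_normal((8, 4))  # toy linear generator for domain 2: g2(z) = A2 @ z

def transform(x1):
    """Step 1: z* = argmin_z ||g1(z) - x1||^2 (closed-form least squares here).
    Step 2: render the corresponding image in the 2nd domain as g2(z*)."""
    z_star, *_ = np.linalg.lstsq(A1, x1, rcond=None)
    return A2 @ z_star

z_true = rng.standard_normal(4)
x1 = A1 @ z_true    # a query image that is covered by g1
x2 = transform(x1)  # its corresponding image in the 2nd domain
```

When the query image is exactly covered by g1, as in this toy setup, the inversion recovers z_true and the transformation is exact; for images outside the generator's range, only an approximate z* exists, which matches the blurriness reported below.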
We found the transformation was successful when the input image was\ncovered by g1 (i.e., the input image could be generated by g1) but produced blurry images when this was not\nthe case. To improve the coverage, we hypothesize that more training images and a better objective\nfunction are required, which are left as future work.\n\n6 Related Work\n\nNeural generative models have recently received an increasing amount of attention. Several approaches,\nincluding generative adversarial networks [5], variational autoencoders (VAE) [22], attention\nmodels [23], moment matching [24], stochastic back-propagation [25], and diffusion processes [26],\nhave shown that a deep network can learn an image distribution from samples. The learned networks\ncan be used to generate novel images. Our work was built on [5]. However, we studied a different\nproblem, the problem of learning a joint distribution of multi-domain images. We were interested\nin whether a joint distribution of images in different domains can be learned from samples drawn\nseparately from the marginal distributions of the individual domains. We showed that it is achievable via\nthe proposed CoGAN framework. Note that our work is different from the Attribute2Image work [27],\nwhich is based on a conditional VAE model [28]. The conditional model can be used to generate\nimages of different styles, but it is unsuitable for generating images in two different domains\nsuch as the color and depth image domains.\nFollowing [5], several works improved the image generation quality of GAN, including a Laplacian\npyramid implementation [29], a deeper architecture [12], and conditional models [13]. Our work\nextended GAN to deal with joint distributions of images.\nOur work is related to the prior works in multi-modal learning, including joint embedding space\nlearning [30] and multi-modal Boltzmann machines [1, 3]. 
These approaches can be used for generating corresponding samples in different domains only when correspondence annotations are given during training. The same limitation also applies to dictionary learning-based approaches [2, 4]. Our work is also related to the prior works in cross-domain image generation [31, 32, 33], which studied transforming an image in one style to the corresponding image in another style. However, we focus on learning the joint distribution in an unsupervised fashion, while [31, 32, 33] focus on learning a transformation function directly in a supervised fashion.

7 Conclusion

We presented the CoGAN framework for learning a joint distribution of multi-domain images. We showed that by enforcing a simple weight-sharing constraint on the layers that are responsible for decoding abstract semantics, the CoGAN learned the joint distribution of images using just samples drawn separately from the marginal distributions. In addition to convincing image generation results on faces and RGBD images, we also showed promising results of the CoGAN framework for the image transformation and unsupervised domain adaptation tasks.

References

[1] Nitish Srivastava and Ruslan R Salakhutdinov. Multimodal learning with deep boltzmann machines. In NIPS, 2012.
[2] Shenlong Wang, Lei Zhang, Yan Liang, and Quan Pan. Semi-coupled dictionary learning with applications to image super-resolution and photo-sketch synthesis. In CVPR, 2012.
[3] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. Multimodal deep learning. In ICML, 2011.
[4] Jianchao Yang, John Wright, Thomas S Huang, and Yi Ma. Image super-resolution via sparse representation. IEEE TIP, 2010.
[5] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
[6] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[7] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. In ICLR, 2016.
[8] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
[10] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
[11] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[12] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
[13] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv:1411.1784, 2014.
[14] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In ICCV, 2015.
[15] Kevin Lai, Liefeng Bo, Xiaofeng Ren, and Dieter Fox. A large-scale hierarchical multi-view rgb-d object dataset. In ICRA, 2011.
[16] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012.
[17] Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, and Philip Yu. Transfer feature learning with joint distribution adaptation. In ICCV, 2013.
[18] Basura Fernando, Tatiana Tommasi, and Tinne Tuytelaars. Joint cross-domain classification and subspace learning for unsupervised adaptation.
Pattern Recognition Letters, 65:60–66, 2015.
[19] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv:1412.3474, 2014.
[20] Artem Rozantsev, Mathieu Salzmann, and Pascal Fua. Beyond sharing weights for deep domain adaptation. arXiv:1603.06432, 2016.
[21] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. JMLR, 2016.
[22] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.
[23] Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. Draw: A recurrent neural network for image generation. In ICML, 2015.
[24] Yujia Li, Kevin Swersky, and Richard Zemel. Generative moment matching networks. In ICML, 2015.
[25] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
[26] Jascha Sohl-Dickstein, Eric A Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.
[27] Xinchen Yan, Jimei Yang, Kihyuk Sohn, and Honglak Lee. Attribute2image: Conditional image generation from visual attributes. arXiv:1512.00570, 2015.
[28] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In NIPS, 2014.
[29] Emily L Denton, Soumith Chintala, Rob Fergus, et al. Deep generative image models using a laplacian pyramid of adversarial networks. In NIPS, 2015.
[30] Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv:1411.2539, 2014.
[31] Junho Yim, Heechul Jung, ByungIn Yoo, Changkyu Choi, Dusik Park, and Junmo Kim.
Rotating your face using multi-task deep neural network. In CVPR, 2015.
[32] Scott E Reed, Yi Zhang, Yuting Zhang, and Honglak Lee. Deep visual analogy-making. In NIPS, 2015.
[33] Alexey Dosovitskiy, Jost Tobias Springenberg, and Thomas Brox. Learning to generate chairs with convolutional neural networks. In CVPR, 2015.