{"title": "Multi-mapping Image-to-Image Translation via Learning Disentanglement", "book": "Advances in Neural Information Processing Systems", "page_first": 2994, "page_last": 3004, "abstract": "Recent advances of image-to-image translation focus on learning the one-to-many mapping from two aspects: multi-modal translation and multi-domain translation. However, the existing methods only consider one of the two perspectives, which makes them unable to solve each other's problem. To address this issue, we propose a novel unified model, which bridges these two objectives. First, we disentangle the input images into the latent representations by an encoder-decoder architecture with a conditional adversarial training in the feature space. Then, we encourage the generator to learn multi-mappings by a random cross-domain translation. As a result, we can manipulate different parts of the latent representations to perform multi-modal and multi-domain translations simultaneously. \nExperiments demonstrate that our method outperforms state-of-the-art methods.", "full_text": "Multi-mapping Image-to-Image Translation via\n\nLearning Disentanglement\n\nXiaoming Yu1,2, Yuanqi Chen1,2, Thomas Li1,3, Shan Liu4, and Ge Li 1,2\n\n1School of Electronics and Computer Engineering, Peking University 2Peng Cheng Laboratory\n3Advanced Institute of Information Technology, Peking University\n\n4Tencent America\n\nxiaomingyu@pku.edu.cn, cyq373@pku.edu.cn\n\ntli@aiit.org.cn, shanl@tencent.com, geli@ece.pku.edu.cn\n\nAbstract\n\nRecent advances of image-to-image translation focus on learning the one-to-many\nmapping from two aspects: multi-modal translation and multi-domain translation.\nHowever, the existing methods only consider one of the two perspectives, which\nmakes them unable to solve each other\u2019s problem. To address this issue, we propose\na novel uni\ufb01ed model, which bridges these two objectives. 
First, we disentangle the input images into latent representations with an encoder-decoder architecture trained by conditional adversarial learning in the feature space. Then, we encourage the generator to learn multi-mappings through random cross-domain translation. As a result, we can manipulate different parts of the latent representations to perform multi-modal and multi-domain translations simultaneously. Experiments demonstrate that our method outperforms state-of-the-art methods. Code will be available at https://github.com/Xiaoming-Yu/DMIT.

1 Introduction

Image-to-image (I2I) translation is a broad concept that aims to translate images from one domain to another. Many computer vision and image processing problems can be handled in this framework, e.g. image colorization [16], image inpainting [39], style transfer [45], etc. Previous works [16, 45, 40, 18, 24] present impressive results on tasks with a deterministic one-to-one mapping, but suffer from mode collapse when the outputs correspond to multiple possibilities. For example, in the season transfer task, as shown in Fig. 1, a summer image may correspond to multiple winter scenes with different styles of lighting, sky, and snow. To tackle this problem and generalize the applicable scenarios of I2I, recent studies focus on one-to-many translation and explore the problem from two perspectives: multi-domain translation [20, 3, 25] and multi-modal translation [46, 22, 15, 42, 39]. Multi-domain translation aims to learn mappings between each domain and the other domains. Under a single unified framework, recent works realize translation among multiple domains. However, between any two domains, what these methods learn is still a deterministic one-to-one mapping, so they fail to capture the multi-modal nature of the image distribution within each image domain. Another line of work is multi-modal translation. 
BicycleGAN [46] achieves a one-to-many mapping between the source domain and the target domain by combining the objectives of cVAE-GAN [21] and cLR-GAN [2, 5, 7]. MUNIT [15] and DRIT [22] extend the method to learn two one-to-many mappings between two image domains in an unsupervised setting, i.e., domain A to domain B and vice versa. While capable of generating diverse and realistic translation outputs, these methods are limited when there are multiple image domains to be translated. In order to adapt to a new task, the domain-specific encoder-decoder architecture in these methods needs to be duplicated for each additional image domain. Moreover, they assume that there is no correlation between the styles of different domains, while we argue that the styles could be aligned, as shown in Fig. 1. Besides, existing one-to-many mapping methods usually assume that the state of the domain is finite and discrete, which limits their application scenarios.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Multi-mapping image-to-image translation. The images with a black border are the input images, and the other images are generated by our method. The images in the same column have the same style, which indicates that the styles between image domains could be aligned.

In this paper, we focus on bridging the objectives of multi-domain translation and multi-modal translation with an unsupervised unified framework. For clarity, we refer to our task as multi-mapping translation. Simultaneously modeling these two problems not only makes the framework more efficient but also encourages the model to learn efficient representations for diverse translations.

To instantiate the idea, as shown in Fig. 
2(d), we assume that the images can be disentangled into two latent representation spaces, a content space C and a style space S, and propose an encoder-decoder architecture to learn the disentangled representations. Our assumption is developed from the shared latent space assumption [24], but we disentangle the latent space into two separate parts to model the multi-modal distribution and to achieve cross-domain translation. Unlike the partially-shared latent space assumption [15, 22], which treats style information as domain-specific, our assumption aligns the styles between image domains, as shown in Fig. 1. Specifically, the style representations in this work are low-dimensional vectors that contain no spatial information and hence can only control the global appearance of the outputs. By using a unified style encoder to learn style representations, and thus fully utilizing samples from all image domains, the sample space of our style representation is denser than one learned from a single image domain. As for the content representations, they are feature maps capturing the spatial structure information shared across domains. To mitigate the effects of distribution shift among domains, we eliminate domain-specific information in the content representations via conditional adversarial learning. To achieve multi-mapping translation with a single unified decoder, we concatenate the disentangled style representation with the target domain label, then adopt a style-based injection method to render the content representation into our desired output. Through learning the inverse mapping of the disentanglement, we can change the domain label to translate an image to a specific domain, or modify the style representation to produce multi-modal outputs. 
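The style-based injection described above can be illustrated with a toy conditional instance normalization: each channel of the content feature map is normalized over its spatial positions, then scaled and shifted by affine parameters predicted from the concatenated style/domain vector. This is only a minimal pure-Python sketch of the general mechanism, not the paper's actual CBIN layers; `weights_gamma` and `weights_beta` are hypothetical per-channel projection weights.

```python
import math

def conditional_instance_norm(feature_map, cond, weights_gamma, weights_beta, eps=1e-5):
    """Normalize each channel of `feature_map` (C x H x W nested lists) over its
    spatial positions, then modulate it with a scale/shift predicted from the
    condition vector `cond` (style vector concatenated with the domain label)."""
    out = []
    for c, channel in enumerate(feature_map):
        values = [v for row in channel for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        # Affine parameters are linear projections of the condition vector.
        gamma = sum(w * x for w, x in zip(weights_gamma[c], cond))
        beta = sum(w * x for w, x in zip(weights_beta[c], cond))
        out.append([[gamma * (v - mean) / math.sqrt(var + eps) + beta for v in row]
                    for row in channel])
    return out

# One 2x2 feature channel; condition = 2-D style vector plus a 2-D one-hot domain label.
feature = [[[1.0, 2.0], [3.0, 4.0]]]
cond = [0.5, -0.3, 1.0, 0.0]        # [style | domain one-hot]
gamma_w = [[0.0, 0.0, 1.0, 0.0]]    # yields gamma = 1 for domain 0
beta_w = [[0.0, 0.0, 0.0, 0.0]]     # yields beta = 0
normalized = conditional_instance_norm(feature, cond, gamma_w, beta_w)
```

With these toy weights the channel is simply whitened; a different domain label or style vector would produce a different scale and shift, which is how one decoder can render the same content for many targets.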
Furthermore, we can extend our framework to the more challenging task of semantic image synthesis, whose domains can be considered as an uncountable set and cannot be modeled by existing I2I approaches.

The contributions of this work are summarized as follows:

• We introduce an unsupervised unified multi-mapping framework, which unites the objectives of multi-domain and multi-modal translations.

• By aligning latent representations among image domains, our model is efficient in learning disentanglement and performing finer image translation.

• Experimental results show our model is superior to the state-of-the-art methods.

2 Related Work

Image-to-image translation. The problem of I2I is first defined by Isola et al. [16]. Based on generative adversarial networks [11, 27], they propose a general-purpose framework (pix2pix) to handle I2I. To get rid of the constraint of paired data in pix2pix, [45, 40, 18] utilize cycle-consistency for the stability of training. UNIT [24] assumes a shared latent space for two image domains. It achieves unsupervised translation by learning the bijection between the latent and image spaces using two generators. However, these methods only learn a one-to-one mapping between two domains and thus produce a deterministic output for an input image. Recent studies focus on multi-domain translation [20, 3, 42, 25] and multi-modal translation [46, 39, 22, 15, 42, 31]. Unfortunately, neither multi-modal translation nor multi-domain translation considers the other's scenario, which makes them unable to solve each other's problem. Table 1 shows a feature-by-feature comparison among various unsupervised I2I models. Different from the aforementioned methods, we explore a combination of these two problems rather than treating them separately, which makes our model more efficient and general purpose. Concurrent with our work, several independent studies [4, 33, 37] also tackle the multi-mapping problem from different perspectives.

Figure 2: Comparisons of unsupervised I2I translation methods. Denote Xk as the k-th image domain. The solid lines and dashed lines represent the flow of the encoders and the generators respectively. The lines with the same color indicate that they belong to the same module.

Table 1: Comparisons with recent works on unsupervised image-to-image translation

Method      Multi-modal   Multi-domain   Multi-mapping   Unified     Feature            Representation
            translation   translation    translation     structure   disentanglement    alignment
UNIT        -             -              -               -           -                  ✓
StarGAN     -             ✓              -               ✓           -                  -
MUNIT       ✓             -              -               -           ✓                  Partial
DRIT        ✓             -              -               -           ✓                  Partial
SingleGAN   ✓             ✓              -               -           -                  -
Ours        ✓             ✓              ✓               ✓           ✓                  ✓

Representation disentanglement. To achieve finer manipulation in image generation, disentangling the factors of data variation has attracted a great amount of attention [19, 13, 2]. 
Some previous works [20, 25] aim to learn domain-invariant representations from data across multiple domains, and then generate different realistic versions of an input image by varying the domain labels. Others [22, 15, 10] focus on disentangling the images into domain-invariant and domain-specific representations to facilitate learning diverse cross-domain mappings. Inspired by these works, we attempt to disentangle the images into two mutually independent parts: content and style. Moreover, we align these representations among image domains, which allows us to utilize rich content and style from different domains and to manipulate the translation in finer detail.

Semantic image synthesis. The goal of semantic image synthesis is to generate an image that matches the given text while retaining the text-irrelevant information of the input image. Dong et al. [6] train a conditional GAN to synthesize a manipulated version of the image given an original image and a target text description. To preserve the text-irrelevant contents of the original image, Paired-D GAN [26] proposes to model the foreground and background distributions with different discriminators. TAGAN [30] introduces a text-adaptive discriminator that pays attention to the regions corresponding to the given text. In this work, we treat the set of images sharing the same text description as an image domain. Thus the domains are countless and each domain contains very few images in the training set. Benefiting from the unified framework and the representation alignment among different domains, we can tackle this problem in our unified multi-mapping framework.

3 Proposed Method

Let X = ∪_{k=1}^{N} X_k ⊂ R^{H×W×3} be an image set that contains all possible images of N different domains. We assume that the images can be disentangled into two latent representations (C, S). 
C is the set of contents, excluded from the variation among domains and styles, and S is the set of styles, which are the renderings of the contents. Our goal is to train a unified model that learns multi-mappings among multiple domains and styles. To achieve this goal, we also define D as the set of domain labels and treat D as another disentangled representation of the images. Then we propose to learn mapping functions between images and the disentangled representations, X ↔ (C, S, D).

As illustrated in Fig. 3(a), we introduce the content encoder Ec : X → C that maps an input image to its content, and the style encoder Es : X → S that extracts the style of the input image. To unify the formulation, we also denote the deterministic mapping between X and D as the domain label encoder Ed : X → D, which is organized as a dictionary¹ and extracts the domain label of the input image. The inverse mapping of the disentanglement is formulated as the generator G : (C, S, D) → X. As a result, with any desired style s ∈ S and domain label d ∈ D, we can translate an input image xi ∈ X to the corresponding target xt ∈ X:

xt = G(Ec(xi), s, d).    (1)

[Figure 2 panels: (a) One-to-one translation; (b) Multi-modal translation; (c) Multi-domain translation; (d) Multi-mapping translation.]

Figure 3: Overview. (a) The disentanglement path learns the bijective mapping between the disentangled representations and the input image. (b) The translation path encourages the generator to produce diverse outputs with possible styles in different domains.

3.1 Network Architecture

Encoder. 
The content encoder Ec is a fully convolutional network that encodes the input image into the spatial feature map c. Owing to the small output stride used in Ec, c retains rich spatial structure information of the input image. The style encoder Es consists of several residual blocks followed by global average pooling and fully connected layers. Through global average pooling, Es removes the structure information of the input and extracts the statistical characteristics that represent the input style [9]. The final style representation s is constructed as a low-dimensional vector via the reparameterization trick [19].

Generator. Motivated by recent style-based methods [8, 14, 17, 15, 42], we adopt a style-based generator G to simultaneously model multi-domain and multi-modal translations. Specifically, the generator G consists of several residual blocks followed by several deconvolutional layers. Each convolution layer in the residual blocks is equipped with CBIN [42, 43] for information injection.

Discriminator. Unlike previous works [22, 15, 42] that apply different discriminators to different image domains, we propose to adopt a unified conditional discriminator for all domains. Because of the large distribution shift between image domains in I2I, using a unified discriminator is challenging. Inspired by the style-based generator, we apply CBIN to the discriminator to extend the capacity of our model. For more details of our network, we refer the reader to our supplementary materials.

3.2 Learning Strategy

Our proposed method encourages a bijective mapping between the image and the latent representations while learning disentanglement. Fig. 
3 presents an overview of our model, whose learning process can be separated into a disentanglement path and a translation path. The disentanglement path can be considered an encoder-decoder architecture that uses conditional adversarial training on the latent space. Here we enforce the encoders to encode the image into the disentangled representations, which can be mapped back to the input image by the conditional generator. The translation path enforces the generator to capture the full distribution of possible outputs through random cross-domain translation.

¹ Since the encoder Ed is a deterministic mapping, there is no need to train it jointly with the other modules in the training stage.

Disentanglement path. To disentangle the latent representations of an image xi, we adopt cVAE [34] as the base structure. To align the style representations across visual domains and constrain the information of the styles [1], we encourage the distribution of styles of all domains to be as close as possible to a prior distribution:

L_cVAE = λ_KL E_{xi∼X}[KL(Es(xi) || q(s))] + λ_rec E_{xi∼X}[ ||G(Ec(xi), Es(xi), Ed(xi)) − xi||_1 ].    (2)

To enable stochastic sampling at test time, we choose the prior distribution q(s) to be a standard Gaussian distribution N(0, I). As for the content representations, we propose to perform conditional adversarial training in the content space to address the distribution shift of the contents among domains. This process encourages Ec to exclude the information of the domain d from the content c:

L^c_GAN = E_{xi∼X}[ log(Dc(Ec(xi), Ed(xi))) + E_{d∼D−{Ed(xi)}}[ log(1 − Dc(Ec(xi), d)) ] ].    (3)

The overall loss of the disentanglement path is

L_{D-Path} = L_cVAE + L^c_GAN.    (4)

Translation path. 
The disentanglement path encourages the model to learn the content c and the style s with a prior distribution, but it leaves two issues to be solved. First, limited by the amount of training data and the optimization of the KL loss, the generator G may sample only a subset of S and generate images with specific domain labels during training [35]. This may lead to poor generations when sampling an s from the prior distribution N and a d that does not match the test image, as discussed in [46]. Second, the above training process lacks an efficient incentive for the use of styles, which would result in low diversity of the generated images. To overcome these issues and encourage our generator to capture a complete distribution of outputs, we first propose to randomly sample domain labels and styles from the prior distributions, in order to cover the whole sampling space at training time. Then we introduce latent regression [2, 46] to force the generator to utilize the style vector. The regression can also be applied to the content c to separate the style s from c. Thus the latent regression can be written as

L_reg = E_{c∼C, s∼N, d∼D}[ ||Es(G(c, s, d)) − s||_1 ] + E_{c∼C, s∼N, d∼D}[ ||Ec(G(c, s, d)) − c||_1 ].    (5)

To match the distribution of generated images to the real data under the sampled domain labels and styles, we employ conditional adversarial training in the pixel space:

L^x_GAN = E_{xi∼X}[ log(Dx(xi, Ed(xi))) + E_{d∼D−{Ed(xi)}}[ (1/2) log(1 − Dx(xi, d)) + (1/2) E_{s∼N}[ log(1 − Dx(G(Ec(xi), s, d), d)) ] ] ].    (6)

Note that we also discriminate the pair of a real image xi and a mismatched target domain label d, in order to encourage the generator to generate images that correspond to the given domain label. 
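The latent regression term in Eq. (5) can be sketched in a few lines: sample a style code from the prior, generate, re-encode, and take the L1 distance. The `generator` and `style_encoder` below are hypothetical stand-ins for the networks G and Es, so this only illustrates the shape of the loss, not the paper's implementation.

```python
import random

def l1(a, b):
    """L1 distance between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical stand-ins: a "generator" that carries the style in its output,
# and a "style encoder" that reads it back (here, perfectly).
def generator(content, style, domain):
    return {"content": list(content), "style": list(style), "domain": domain}

def style_encoder(image):
    return image["style"]

def latent_regression_loss(content, domain, style_dim=8):
    # Sample a style code from the standard Gaussian prior N(0, I).
    s = [random.gauss(0.0, 1.0) for _ in range(style_dim)]
    fake = generator(content, s, domain)
    # Penalize the pair if the sampled style cannot be recovered from the output.
    return l1(style_encoder(fake), s)

loss = latent_regression_loss(content=[0.0] * 4, domain=1)
```

Because the toy encoder recovers the style exactly, the loss is zero here; with real networks, minimizing this term forces G to actually use the style vector instead of ignoring it.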
The final objective of the translation path is

L_{T-Path} = λ_reg L_reg + L^x_GAN.    (7)

By combining both training paths, the full objective function of our model is

min_{G, Ec, Es} max_{Dc, Dx} L_{D-Path} + L_{T-Path}.    (8)

4 Experiments

We compare our approach against recent one-to-many mapping models on two tasks: season transfer and semantic image synthesis. For brevity, we refer to our method, Disentanglement for Multi-mapping Image-to-Image Translation, as DMIT. In the supplementary material, we provide additional visual results and extend our model to facial attribute transfer [23] and sketch-to-photo [41].

Figure 4: Qualitative comparison of season transfer. The first column shows the input image. Each of the remaining columns shows four outputs with the specified season from one method. Each image pair for the specified season reflects the diversity within the domain.

4.1 Datasets

Yosemite summer ↔ winter. This unpaired dataset is provided by Zhu et al. [45] for evaluating unsupervised I2I methods. We use the default image size 256×256 and the default training set in all experiments. The domain label (summer/winter) is organized as a one-hot vector.

CUB. The Caltech-UCSD Birds (CUB) [36] dataset contains 200 bird species with 11,788 images, each with 10 text captions [32]. We preprocess the CUB dataset according to the method in [38]. The captions are encoded as the domain labels by the pretrained text encoder proposed in [38].

4.2 Season Transfer

Season transfer is a coarse-grained translation task that aims to learn the mapping between summer and winter. 
We compare our method against five baselines:

• Multi-domain models: StarGAN [3] and StarGAN*, a variant that adds a noise vector to the generator to encourage diverse outputs.

• Multi-modal models: MUNIT [15], DRIT [22], and version-c of SingleGAN [42].

Among these models, MUNIT, DRIT and SingleGAN each require a separate pair of GANs for summer → winter and winter → summer. The StarGAN-based models and DMIT use a single unified structure to learn the bijective mapping between the two domains. To better evaluate the performance of multi-domain and multi-modal mappings, we propose to test inter-domain and intra-domain translations separately.

As the qualitative comparison in Fig. 4 shows, the synthesis of StarGAN has significant artifacts and suffers from mode collapse caused by the assumption of a deterministic cross-domain mapping. With the noise disturbance, the quality of the images generated by StarGAN* improves, but the results still lack diversity. All of the multi-modal models produce diverse results. However, without utilizing the style information shared between domains, the generated images are monotonous and only differ in simple modes, such as global illumination. We observe that MUNIT struggles to converge and to produce realistic season transfer results given the limited training data. DRIT and SingleGAN produce realistic results, but the images are not vivid enough. In contrast, our DMIT uses only one unified model to produce realistic images with diverse details for different image domains.

To quantify the performance, we first translate each test image into 10 targets by sampling styles from the prior distribution. Then we adopt the Fréchet Inception Distance (FID) [12] to evaluate the quality of the generated images, and LPIPS (official version 0.1) [44] to measure the diversity [15, 22, 46] of the samples generated from the same input image within a specific domain. 
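The diversity protocol described above reduces to an average pairwise perceptual distance over the samples generated from one input. A minimal sketch with a pluggable distance function (a plain per-element L1 stand-in here; the paper uses the learned LPIPS v0.1 metric instead):

```python
from itertools import combinations

def l1_distance(a, b):
    """Stand-in perceptual distance; the evaluation above uses LPIPS instead."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def diversity_score(outputs, distance=l1_distance):
    """Average pairwise distance over all outputs generated from one input image.
    Higher means more diverse translations for the same source image."""
    pairs = list(combinations(outputs, 2))
    return sum(distance(a, b) for a, b in pairs) / len(pairs)

# Three toy "images" flattened to vectors; the paper samples 10 per input.
samples = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
score = diversity_score(samples)  # pairwise distances 1.0, 2.0, 1.0 -> 4/3
```

Averaging over all pairs, rather than comparing against a single reference, keeps the score symmetric in the samples and independent of sampling order.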
The quantitative results shown in Table 2 further confirm our observations above. It is remarkable that our method achieves the best FID score while greatly surpassing the multi-domain and multi-modal models in LPIPS distance.

Table 2: Quantitative comparison of season transfer (FID: lower is better; LPIPS: higher is better).

Method               summer→winter     summer→summer     winter→summer     winter→winter
                     FID     LPIPS     FID     LPIPS     FID     LPIPS     FID     LPIPS
StarGAN              218.78  -         233.61  -         248.29  -         224.37  -
StarGAN*             152.11  0.012     135.25  0.011     153.79  0.013     149.04  0.011
MUNIT                84.43   0.166     58.96   0.141     73.82   0.134     68.92   0.133
DRIT                 58.70   0.205     49.58   0.179     53.79   0.192     57.11   0.166
SingleGAN            63.77   0.184     51.64   0.178     54.24   0.188     57.30   0.186
DMIT w/o T-Path      75.90   0.109     57.24   0.116     72.75   0.124     65.15   0.118
DMIT w/o D-Path      116.71  0.545     85.97   0.544     95.63   0.517     124.96  0.513
DMIT w/o L^c_GAN     60.81   0.268     43.54   0.256     50.33   0.270     58.09   0.260
DMIT w/ VanillaD     63.34   0.259     44.73   0.242     50.79   0.255     60.10   0.239
DMIT w/ ProjectionD  66.50   0.289     46.92   0.299     52.4    0.293     65.66   0.301
DMIT                 58.46   0.302     43.04   0.279     48.02   0.292     55.23   0.275

Ablation study. To analyze the importance of different components in our model, we perform an ablation study with five variants of DMIT. As for the training paths, we observe that both T-Path and D-Path are indispensable. Without T-Path, the model has difficulty performing cross-domain translation, as analyzed in Section 3. In contrast, without D-Path, the generated images are blurry and unrealistic, and their apparent diversity is meaningless, produced by artifacts. 
Combining these two paths results in a trade-off between the quality and the diversity of the images.

As for the training incentive, we observe that L^c_GAN is influential for the diversity score. Without this incentive, the visual styles are similar in summer and winter. This suggests that Dc encourages the model to eliminate the domain bias and to learn well-disentangled representations.

As for the discriminator architecture, we evaluate two other conditional models with different information injection strategies: a vanilla conditional discriminator (VanillaD) [16, 27], which concatenates the input image and the conditional information, and a projection discriminator (ProjectionD) [28, 29], which projects the conditional information onto the hidden activations of the image. The quantitative results in Table 2 indicate that the capacity of VanillaD is limited. The images generated by DMIT with ProjectionD are diverse but prone to artifacts, which hurts its FID score. Our full DMIT, equipped with the style-based discriminator, strikes a balance between diversity and quality.

4.3 Semantic Image Synthesis

To further verify the potential of DMIT in mixed-modality (text and image) translation, we study the task of semantic image synthesis. Existing I2I approaches usually assume the state of the domain is discrete, which prevents them from handling this task. We compare our model with the state-of-the-art models for semantic image synthesis: SISGAN [6], Paired-D GAN [26], and TAGAN [30].

Fig. 5 shows our qualitative comparison with the baselines. Although SISGAN can generate diverse images that match the text, it struggles to generate high-quality images. The structure and background of the images are retained well by Paired-D GAN, but the results do not match the text well. Furthermore, it can be observed that Paired-D GAN cannot produce diverse outputs for a conditional input with different samples. 
TAGAN presents images with acceptable semantic matching, but the quality is unsatisfactory. By encoding the style from the input image, DMIT can well preserve the original background of the input image and generate high-quality images that match the text descriptions. Meanwhile, DMIT can also produce diverse results by sampling other style representations.

Figure 5: Qualitative comparison of semantic image synthesis. In each column, the first row is the input image and the remaining rows are the outputs according to the text description above. In each pair of images generated by DMIT, the image in the first column is generated by encoding the style from the input image and the image in the second column is generated with a random style.

Besides calculating FID to quantify the performance, we perform a human perceptual study on Amazon Mechanical Turk (AMT) to measure the semantic matching score. We randomly sample 2,500 images and mismatched texts to generate questions. For each comparison, five different workers are required to select which image looks more realistic and better fits the given text. As shown in Table 3, DMIT achieves the best of both image quality and semantic matching score. Since retaining the text-irrelevant information of the input image is important for semantic image synthesis, we also evaluate the reconstruction ability of the different methods by transforming the input image with its corresponding text.

Table 3: Quantitative comparison of semantic image synthesis.

Method         FID     Human evaluation   PSNR    SSIM
SISGAN         67.24   15.3%              11.27   0.193
Paired-D GAN   27.62   25.2%              22.34   0.886
TAGAN          34.49   20.4%              19.01   0.736
DMIT           13.85   39.1%              25.49   0.934

The PSNR and SSIM scores further demonstrate the capability of our method in learning efficient representations. 
It suggests that the disentangled representations enable our\nmodel to manipulate the translation in \ufb01ner detail.\n\n4.4 Limitations\n\nAlthough DMIT can perform multi-mapping translation, we observe that the style representations\ntend to model some global properties as discussed in [31]. Besides, we observe that the convergence\nrates of different domains are generally different. Further exploration will allow this work to be a\ngeneral-purpose solution for a variety of multi-mapping translation tasks.\n\n5 Conclusion\n\nIn this paper, we present a novel model for multi-mapping image-to-image translation with unpaired\ndata. By learning disentangled representations, it is able to use the advances of both multi-domain\nand multi-modal translations in a holistic manner. The integration of these two multi-mapping\nproblems encourages our model to learn a more complete distribution of possible outputs, improving\nthe performance of each task. Experiments in various multi-mapping tasks show that our model is\nsuperior to the existing methods in terms of quality and diversity.\n\nAcknowledgments\n\nThis work was supported in part by Shenzhen Municipal Science and Technology Program (No.\nJCYJ20170818141146428), National Engineering Laboratory for Video Technology - Shenzhen\nDivision, and National Natural Science Foundation of China and Guangdong Province Scienti\ufb01c\nResearch on Big Data (No. U1611461). In addition, we would like to thank the anonymous reviewers\nfor their helpful and constructive comments.\n\n8\n\nThis small yellow bird has gray wings and a black bill.An orange bird with green wings and blue head.This black bird has no other colors with a short bill.A black bird with a red head.SISGANPaired-D GANDMITTAGAN\fReferences\n[1] Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational informa-\n\ntion bottleneck. In ICLR, 2017.\n\n[2] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 
Infogan:\nInterpretable representation learning by information maximizing generative adversarial nets. In\nNIPS, 2016.\n\n[3] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo.\nStargan: Uni\ufb01ed generative adversarial networks for multi-domain image-to-image translation. In\nCVPR, 2018.\n\n[4] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. Stargan v2: Diverse image synthesis\n\nfor multiple domains. arXiv preprint arXiv:1912.01865, 2019.\n\n[5] Jeff Donahue, Philipp Kr\u00e4henb\u00fchl, and Trevor Darrell. Adversarial feature learning. In ICLR,\n\n2016.\n\n[6] Hao Dong, Simiao Yu, Chao Wu, and Yike Guo. Semantic image synthesis via adversarial\n\nlearning. In ICCV, 2017.\n\n[7] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin\n\nArjovsky, and Aaron Courville. Adversarially learned inference. In ICLR, 2016.\n\n[8] Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic\n\nstyle. In ICLR, 2017.\n\n[9] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional\n\nneural networks. In CVPR, 2016.\n\n[10] Abel Gonzalez-Garcia, Joost van de Weijer, and Yoshua Bengio. Image-to-image translation for\n\ncross-domain disentanglement. In NIPS, 2018.\n\n[11] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil\n\nOzair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.\n\n[12] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.\nGans trained by a two time-scale update rule converge to a local nash equilibrium. In NIPS, 2017.\n\n[13] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick,\nShakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a\nconstrained variational framework. In ICLR, 2017.\n\n[14] Xun Huang and Serge Belongie. 
Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, 2017.

[15] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In ECCV, 2018.

[16] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.

[17] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019.

[18] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. In ICML, 2017.

[19] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.

[20] Guillaume Lample, Neil Zeghidour, Nicolas Usunier, Antoine Bordes, Ludovic Denoyer, et al. Fader networks: Manipulating images by sliding attributes. In NIPS, 2017.

[21] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. In ICML, 2016.

[22] Hsin-Ying Lee, Hung-Yu Tseng, Jia-Bin Huang, Maneesh Kumar Singh, and Ming-Hsuan Yang. Diverse image-to-image translation via disentangled representations. In ECCV, 2018.

[23] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In ICCV, 2015.

[24] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In NIPS, 2017.

[25] Alexander H Liu, Yen-Cheng Liu, Yu-Ying Yeh, and Yu-Chiang Frank Wang. A unified feature disentangler for multi-domain image translation and manipulation. In NIPS, 2018.

[26] Duc Minh Vo and Akihiro Sugimoto. Paired-D GAN for semantic image synthesis. In ACCV, 2018.

[27] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets.
arXiv preprint arXiv:1411.1784, 2014.

[28] Takeru Miyato and Masanori Koyama. cGANs with projection discriminator. In ICLR, 2018.

[29] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.

[30] Seonghyeon Nam, Yunji Kim, and Seon Joo Kim. Text-adaptive generative adversarial networks: Manipulating images with natural language. In NIPS, 2018.

[31] Ori Press, Tomer Galanti, Sagie Benaim, and Lior Wolf. Emerging disentanglement in autoencoder based unsupervised image content transfer. In ICLR, 2019.

[32] Scott Reed, Zeynep Akata, Bernt Schiele, and Honglak Lee. Learning deep representations of fine-grained visual descriptions. In CVPR, 2016.

[33] Andrés Romero, Pablo Arbeláez, Luc Van Gool, and Radu Timofte. SMIT: Stochastic multi-label image-to-image translation. In ICCV Workshops, 2019.

[34] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In NIPS, 2015.

[35] Yu Tian, Xi Peng, Long Zhao, Shaoting Zhang, and Dimitris N Metaxas. CR-GAN: Learning complete representations for multi-view generation. In IJCAI, 2018.

[36] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011.

[37] Yaxing Wang, Abel Gonzalez-Garcia, Joost van de Weijer, and Luis Herranz. SDIT: Scalable and diverse cross-domain image translation. In ACM MM, 2019.

[38] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In CVPR, 2018.

[39] Dingdong Yang, Seunghoon Hong, Yunseok Jang, Tianchen Zhao, and Honglak Lee. Diversity-sensitive conditional generative adversarial networks. In ICLR, 2019.

[40] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong.
DualGAN: Unsupervised dual learning for image-to-image translation. In ICCV, 2017.

[41] Aron Yu and Kristen Grauman. Fine-grained visual comparisons with local learning. In CVPR, 2014.

[42] Xiaoming Yu, Xing Cai, Zhenqiang Ying, Thomas Li, and Ge Li. SingleGAN: Image-to-image translation by a single-generator network using multiple generative adversarial learning. In ACCV, 2018.

[43] Xiaoming Yu, Zhenqiang Ying, Thomas Li, Shan Liu, and Ge Li. Multi-mapping image-to-image translation with central biasing normalization. arXiv preprint arXiv:1806.10050, 2018.

[44] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.

[45] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.

[46] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. In NIPS, 2017.