{"title": "Multi-marginal Wasserstein GAN", "book": "Advances in Neural Information Processing Systems", "page_first": 1776, "page_last": 1786, "abstract": "The multiple marginal matching problem aims at learning mappings that match a source domain to multiple target domains, and it has attracted great attention in many applications, such as multi-domain image translation. However, addressing this problem has two critical challenges: (i) measuring the multi-marginal distance among different domains is intractable; (ii) it is very difficult to exploit cross-domain correlations to match the target domain distributions. In this paper, we propose a novel Multi-marginal Wasserstein GAN (MWGAN) to minimize the Wasserstein distance among domains. Specifically, with the help of multi-marginal optimal transport theory, we develop a new adversarial objective function with inner- and inter-domain constraints to exploit cross-domain correlations. Moreover, we theoretically analyze the generalization performance of MWGAN, and empirically evaluate it on balanced and imbalanced translation tasks. Extensive experiments on toy and real-world datasets demonstrate the effectiveness of MWGAN.", "full_text": "Multi-marginal Wasserstein GAN\n\nJiezhang Cao∗, Langyuan Mo∗, Yifan Zhang, Kui Jia, Chunhua Shen, Mingkui Tan∗†\nSouth China University of Technology, Peng Cheng Laboratory, The University of Adelaide\n\n{secaojiezhang, selymo, sezyifan}@mail.scut.edu.cn\n\n{mingkuitan, kuijia}@scut.edu.cn, chunhua.shen@adelaide.edu.au\n\nAbstract\n\nThe multiple marginal matching problem aims at learning mappings that match a source domain to multiple target domains, and it has attracted great attention in many applications, such as multi-domain image translation. 
However, addressing this problem has two critical challenges: (i) measuring the multi-marginal distance among different domains is intractable; (ii) it is very difficult to exploit cross-domain correlations to match the target domain distributions. In this paper, we propose a novel Multi-marginal Wasserstein GAN (MWGAN) to minimize the Wasserstein distance among domains. Specifically, with the help of multi-marginal optimal transport theory, we develop a new adversarial objective function with inner- and inter-domain constraints to exploit cross-domain correlations. Moreover, we theoretically analyze the generalization performance of MWGAN, and empirically evaluate it on balanced and imbalanced translation tasks. Extensive experiments on toy and real-world datasets demonstrate the effectiveness of MWGAN.\n\n1 Introduction\n\nThe multiple marginal matching (M3) problem aims to map an input image (source domain) to multiple target domains (see Figure 1(a)), and it has been applied in computer vision, e.g., multi-domain image translation [10, 23, 25]. In practice, unsupervised image translation [30] gains particular interest because of its label-free property. However, due to the lack of corresponding images, it is extremely hard to learn stable mappings that match a source distribution to multiple target distributions. Recently, some methods [10, 30] have addressed the M3 problem; however, they face two main challenges.\nFirst, existing methods often neglect to jointly optimize the multi-marginal distance among domains, which cannot guarantee the generalization performance of the methods and may lead to a distribution mismatching issue. CycleGAN [51] and UNIT [32] repeatedly optimize every pair of different domains separately (see Figure 1(b)). In this sense, they are computationally expensive and may have poor generalization performance. 
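The pairwise strategy criticized above can be made concrete with a small numpy sketch (our illustration, not part of the paper): with N target domains, one evaluates N independent two-marginal distances, here the closed-form empirical 1-D Wasserstein-1 distance between equal-size sample sets.

```python
import numpy as np

def w1_1d(u, v):
    # Empirical 1-D Wasserstein-1 distance for equal-size samples:
    # under the optimal (sorted/quantile) coupling, W1 = mean |sort(u) - sort(v)|.
    return float(np.abs(np.sort(u) - np.sort(v)).mean())

rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=2000)                        # source domain samples
targets = [rng.normal(m, 1.0, size=2000) for m in (3.0, -3.0)]  # two target domains

# Pairwise (CycleGAN-style) strategy: one independent two-marginal
# distance per target domain, with no coupling across target domains.
pairwise = [w1_1d(source, t) for t in targets]
print(pairwise)  # each value is close to the mean shift of 3
```

A multi-marginal formulation instead scores one joint coupling of all N+1 marginals at once, which is what MWGAN approximates adversarially.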
Moreover, UFDN [30] and StarGAN [10] essentially measure the distance between an input distribution and a mixture of all target distributions (see Figure 1(b)). As a result, they may suffer from a distribution mismatching issue. Therefore, it is necessary to explore a new method to measure and optimize the multi-marginal distance.\nSecond, it is very challenging to exploit cross-domain correlations to match target domains. Existing methods [51, 30] only focus on the correlations between the source and target domains, since they measure the distance between two distributions (see Figure 1(b)). However, these methods often ignore the correlations among target domains, and thus can hardly capture enough information to improve the performance. Moreover, when the source and target domains are significantly different, or the number of target domains is large, it becomes difficult for existing methods to exploit the cross-domain correlations.\n\n∗Authors contributed equally.\n†Corresponding author.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n(a) The Edge→CelebA image translation task.\n\n(b) Comparisons of different distribution measures.\nFigure 1: An example of the M3 problem and comparisons of existing methods. (a) For the Edge→CelebA task, we aim to learn mappings to match a source distribution (i.e., Edge images) to the target distributions (i.e., black and blond hair images). (b) Left: we employ CycleGAN multiple times to measure the distance between every generated distribution and its corresponding target distribution. Middle: StarGAN and UFDN measure the distance between Ps and a mixed distribution of Pθ1 and Pθ2; Right: MWGAN jointly measures the Wasserstein distance among Ps, Pθ1 and Pθ2. 
(Dotted circle: the generated distributions; solid circle: the real source or target distributions; double-headed arrow: distribution divergence; different colors represent different domains.)\n\nIn this paper, we seek to use the multi-marginal Wasserstein distance to solve the M3 problem, but directly optimizing it is intractable. Therefore, we develop a new dual formulation to make it tractable and propose a novel multi-marginal Wasserstein GAN (MWGAN) that enforces inner- and inter-domain constraints to exploit the correlations among domains.\nThe contributions of this paper are summarized as follows:\n• We propose a novel GAN method (called MWGAN) to optimize a feasible multi-marginal distance among different domains. MWGAN overcomes the limitations of existing methods by alleviating the distribution mismatching issue and exploiting cross-domain correlations.\n• We define and analyze the generalization of our proposed method for the multiple-domain translation task; this goes beyond existing generalization analyses [13, 36], which study only two domains, and is non-trivial for multiple domains.\n• We empirically show that MWGAN solves the imbalanced image translation task well even when the source and target domains are significantly different. Extensive experiments on toy and real-world datasets demonstrate the effectiveness of our proposed method.\n\n2 Related Work\n\nGenerative adversarial networks (GANs). Deep neural networks have been explored both theoretically and experimentally [7, 21, 48, 49, 53]. In particular, GANs [17] have been successfully applied in computer vision tasks, such as image generation [3, 6, 18, 20], image translation [2, 10, 19] and video prediction [35]. Specifically, a generator tries to produce realistic samples, while a discriminator tries to distinguish between generated data and real data. 
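As a toy illustration of this adversarial mechanism (our sketch, not part of the paper), the original GAN value function V(D, G) = E[log D(x)] + E[log(1 − D(G(z)))] can be evaluated for a fixed one-parameter logistic discriminator on well-separated 1-D "real" and "generated" samples:

```python
import numpy as np

rng = np.random.default_rng(1)
real = rng.normal(3.0, 0.5, size=500)    # samples from the data distribution
fake = rng.normal(-3.0, 0.5, size=500)   # samples from a (bad) generator

def discriminator(x, w=1.0, b=0.0):
    # A toy logistic discriminator D(x) = sigmoid(w*x + b).
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

# GAN value function V(D, G) = E[log D(real)] + E[log(1 - D(fake))];
# the discriminator ascends it, the generator descends it.
value = np.log(discriminator(real)).mean() + np.log(1.0 - discriminator(fake)).mean()
print(value)

# With well-separated modes, even this trivial discriminator
# assigns high scores to real data and low scores to fake data.
print(discriminator(real).mean(), discriminator(fake).mean())
```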
Recently, some studies have tried to improve the quality [5, 9, 26] and diversity [43] of generated images, and to improve the mechanism of GANs [1, 11, 38, 39] to deal with the unstable training and mode collapse problems.\nMulti-domain image translation. The M3 problem arises in domain adaptation [45] and image translation [27, 52]. CycleGAN [51], DiscoGAN [28], DualGAN [47] and UNIT [32] were proposed to address the two-domain image translation task. However, as shown in Figure 1(b), these methods measure the distance between every pair of distributions multiple times, which is computationally expensive when applied to the multi-domain image translation task. Recently, StarGAN [10] and AttGAN [23] use a single model to perform multi-domain image translation. UFDN [30] translates images by learning a domain-invariant representation across domains. Essentially, the above three methods are two-domain image translation methods because they measure the distance between an input distribution and a uniform mixture of the other target distributions (see Figure 1(b)). Therefore, these methods may suffer from a distribution mismatching issue and obtain misleading feedback for updating models when the source and target domains are significantly different. In addition, we discuss the difference between some GAN methods in Section I of the supplementary materials.\n\n3 Problem Definition\nNotation. We use calligraphic letters (e.g., X) to denote spaces, capital letters (e.g., X) to denote random variables, and bold lower case letters (e.g., x) to denote the corresponding values. 
Let D=(X, P) be a domain, P or µ be the marginal distribution over X, and P(X) be the set of all probability measures over X. For convenience, let X=R^d, and let I={0, ..., N} and [N]={1, ..., N}.\nMultiple marginal matching (M3) problem. In this paper, the M3 problem aims to learn mappings that match a source domain to multiple target domains. For simplicity, we consider one source domain Ds={X, Ps} and N target domains Di={X, Pti}, i∈[N], where Ps is the source distribution and Pti is the i-th real target distribution. Let gi, i∈[N] be the generative models parameterized by θi, and Pθi be the generated distribution in the i-th target domain. The goal is to learn multiple generative models such that each generated distribution Pθi in the i-th target domain is close to the corresponding real target distribution Pti (see Figure 1(a)).\nOptimal transport (OT) theory. Recently, OT theory [42] has attracted great attention in many applications [3, 46]. Directly solving the primal formulation of OT [40] may be intractable [16]. To address this, we consider the dual formulation of the multi-marginal OT problem as follows.\nProblem I (Dual problem [40]) Given N+1 marginals µi∈P(X), potential functions fi, i∈I, and a cost function c(X^(0), ..., X^(N)): R^{d(N+1)}→R, the dual Kantorovich problem can be defined as:\n\nW(µ0, ..., µN) = sup_{fi} Σ_{i∈I} ∫ fi(X^(i)) dµi,  s.t.  Σ_{i∈I} fi(X^(i)) ≤ c(X^(0), ..., X^(N)).  (1)\n\nIn practice, we optimize the discrete case of Problem I. Specifically, given samples {x^(0)_j}_{j∈J0} and {x^(i)_j}_{j∈Ji} drawn from the source distribution Ps and the generated target distributions Pθi, i∈[N], respectively, where Ji is an index set and ni=|Ji| is the number of samples, we have:\nProblem II (Discrete dual problem) Let F ={f0, . . . 
, fN} be the set of Kantorovich potentials; then the discrete dual problem ĥ(F) can be defined as:\n\nĥ(F) = max_F Σ_{i∈I} (1/ni) Σ_{j∈Ji} fi(x^(i)_j),  s.t.  Σ_{i∈I} fi(x^(i)_{ki}) ≤ c(x^(0)_{k0}, ..., x^(N)_{kN}), ∀ki∈[ni].  (2)\n\nUnfortunately, it is challenging to optimize Problem II due to the intractable inequality constraints and the multiple potential functions. To address this, we propose a new optimization method.\n\n4 Multi-marginal Wasserstein GAN\n\n4.1 A New Dual Formulation\nFor two domains, WGAN [3] solves Problem II by setting f0=f and f1=−f. However, it is hard to extend WGAN to multiple domains. To address this, we propose a new dual formulation in order to optimize Problem II. To this end, we use a shared potential in Problem II, which is supported by empirical and theoretical evidence. In the multi-domain image translation task, the domains are often correlated, and thus share similar properties and differ only in details (see Figure 1(a)). The cross-domain correlations can be exploited by the shared potential function (see Section J in supplementary materials). More importantly, the optimal objectives of Problem II and the following problem can be equal under some conditions (see Section B in supplementary materials).\nProblem III Let Fλ={λ0f, ..., λN f} be the Kantorovich potentials; then we define the dual problem as:\n\nĥ(Fλ) = max_{Fλ} Σ_{i∈I} (λi/ni) Σ_{j∈Ji} f(x^(i)_j),  s.t.  Σ_{i∈I} λi f(x^(i)_{ki}) ≤ c(x^(0)_{k0}, ..., x^(N)_{kN}), ∀ki∈[ni].  (3)\n\nTo further build the relationship between Problem II and Problem III, we have the following theorem, so that Problem III can be optimized well by GAN-based methods (see Subsection 4.2).\nTheorem 1 Suppose the domains are connected, the cost function c is continuously differentiable, and each µi is absolutely continuous. If (f0, ..., fN) and (λ0f, ..., λN f) are solutions to Problem I, then there exist some constants εi for each i∈I such that Σ_i εi = 0 and fi = λi f + εi.\nRemark 1 From Theorem 1, if we train a shared function f to obtain a solution of Problem I, we have an equivalent Wasserstein distance, i.e., Σ_i fi = Σ_i λi f, regardless of the values εi. Therefore, we are able to optimize Problem III instead of the intractable Problem II in practice.\n\nAlgorithm 1 Multi-marginal WGAN.\nInput: Training data {xj}^{n0}_{j=1} in the initial domain and {x̂^(i)_j}^{ni}_{j=1} in the i-th target domain; batch size mbs; the number of iterations of the discriminator per generator iteration ncritic; Uniform distribution U[0, 1].\nOutput: The discriminator f, the generators {gi}_{i∈[N]} and the classifier φ.\n1: while not converged do\n2:   for t = 0, ..., ncritic do\n3:     Sample x∼P̂s and x̂∼P̂θi, ∀i, and x̃ ← ρx + (1−ρ)x̂, where ρ∼U[0, 1]\n4:     Update f by ascending the gradient: ∇w[E_{x∼P̂s}[f(x)] − Σ_i λ+_i E_{x̂∼P̂θi}[f(x̂)] − Rτ(f)]\n5:     Update the classifier φ by descending the gradient: ∇v[Cα(φ)]\n6:   end for\n7:   Update each generator gi by descending the gradient: ∇θi[−λ+_i E_{x̂∼P̂θi}[f(x̂)] − Mα(gi)]\n8: end while\n\n4.2 Proposed Objective Function\n\nTo minimize the Wasserstein distance among domains, we now present a novel multi-marginal Wasserstein GAN (MWGAN) based on the proposed dual formulation in (3). Specifically, let F={f: R^d→R} be the class of discriminators parameterized by w, and G={g: R^d→R^d} be the class of generators, where gi∈G is parameterized by θi. Motivated by the adversarial mechanism of WGAN, let λ0=1 and λi:=−λ+_i with λ+_i>0, i∈[N]; then Problem III can be rewritten as follows:\nProblem IV (Multi-marginal Wasserstein GAN) Given a discriminator f∈F and generators gi∈G, i∈[N], we define the following multi-marginal Wasserstein distance:\n\nW(P̂s, P̂θ1, ..., P̂θN) = max_f E_{x∼P̂s}[f(x)] − Σ_{i∈[N]} λ+_i E_{x̂∼P̂θi}[f(x̂)],  s.t.  P̂θi∈Di, f∈Ω,  (4)\n\nwhere P̂s is the real source distribution, the distribution P̂θi is generated by gi in the i-th domain, and Ω={f | f(x) − Σ_i λ+_i f(x̂^(i)) ≤ c(x, x̂^(1), ..., x̂^(N)), f∈F} with x∼P̂s and x̂^(i)∼P̂θi, i∈[N].\nIn Problem IV, we refer to P̂θi∈Di, i∈[N] as inner-domain constraints and to f∈Ω as inter-domain constraints (see Subsections 4.3 and 4.4). The influence of these constraints is investigated in Section N of the supplementary materials. Note that λ+_i reflects the importance of the i-th target domain. In practice, we set λ+_i=1/N, i∈[N] when no prior knowledge is available on the target domains. To minimize Problem IV, we optimize the generators with the following update rule.\nTheorem 2 If each generator gi∈G, i∈[N] is locally Lipschitz (see Assumption 1 of [3] for details), then there exists a discriminator f to Problem IV such that the gradient is ∇θi W(P̂s, P̂θ1, ..., P̂θN) = −λ+_i E_{x∼P̂s}[∇θi f(gi(x))] for all θi, i∈[N], whenever all terms are well-defined.\nTheorem 2 provides a good update rule for optimizing MWGAN. Specifically, we first train an optimal discriminator f and then update each generator along the direction of E_{x∼P̂s}[∇θi f(gi(x))]. The detailed algorithm is shown in Algorithm 1. The generators cooperatively exploit multi-domain correlations (see Section J in supplementary materials) and generate samples in their specific target domains to fool the discriminator; the discriminator enforces generated data in the target domains to retain similar features from the source domain.\n\n4.3 Inner-domain Constraints\nIn Problem IV, the distribution Pθi generated by the generator gi should belong to the i-th domain for any i. To this end, we introduce an auxiliary domain classification loss and the mutual information.\nDomain classification loss. 
Given an input x:=x^(0) and a generator gi, we aim to translate the input x to an output x̂^(i) that can be correctly classified to the target domain Di. To achieve this, we introduce an auxiliary classifier φ: X→Y, parameterized by v, to optimize the generators. Specifically, we label real data x∼P̂ti as 1, where P̂ti is the empirical distribution in the i-th target domain, and we label generated data x̂^(i)∼P̂θi as 0. Then, the domain classification loss w.r.t. φ can be defined as:\n\nCα(φ) = α · E_{x′∼P̂ti∪P̂θi}[ℓ(φ(x′), y)],  (5)\n\nwhere α is a hyper-parameter, y is the label corresponding to x′, and ℓ(·, ·) is a binary classification loss, such as the hinge loss [50], mean square loss [34], cross-entropy loss [17] or Wasserstein loss [12].\nMutual information maximization. After learning the classifier φ, we maximize a lower bound of the mutual information [8, 23] between the generated image and the corresponding domain, i.e.,\n\nMα(gi) = α · E_{x∼P̂s}[log φ(y^(i)=1 | gi(x))].  (6)\n\nBy maximizing the mutual information in (6), we correlate the generated image gi(x) with the i-th domain, and are then able to translate the source image to the specified domain.\n\n4.4 Inter-domain Constraints\nNext, we enforce the inter-domain constraints in Problem IV, i.e., the discriminator f∈F∩Ω. One can require the discriminator to be 1-Lipschitz continuous, but this may ignore the dependency among domains (see Section H in supplementary materials). Thus, we relax the constraints via the following lemma.\nLemma 1 (Constraints relaxation) If the cost function c(·) is measured by the ℓ2 norm, then there exists Lf≥1 such that the constraints in Problem IV satisfy Σ_i |f(x)−f(x̂^(i))| / ‖x−x̂^(i)‖ ≤ Lf.\nNote that Lf measures the dependency among domains (see Section G in supplementary materials). In practice, Lf can be calculated from the cost function, or treated as a tuning parameter for simplicity.\nInter-domain gradient penalty. In practice, directly enforcing the inequality constraints in Lemma 1 performs poorly when generated samples are far from the real data. We thus propose the following inter-domain gradient penalty. Specifically, given real data x in the source domain and generated samples x̂^(i), if x̂^(i) is properly close to x, as suggested in [37], we can calculate the gradient and introduce the following regularization term into the objective of MWGAN:\n\nRτ(f) = τ · (Σ_i E_{x̃^(i)∼Q̂i}[‖∇f(x̃^(i))‖] − Lf)_+^2,  (7)\n\nwhere (·)+ = max{0, ·}, τ is a hyper-parameter, x̃^(i) is sampled between x and x̂^(i), and Q̂i, i∈[N] is a distribution constructed by some sampling strategy. In practice, one can construct a distribution whose samples x̃^(i) are interpolated between real data x and generated data x̂^(i) for every domain [18]. Note that the gradient penalty captures the dependency of the domains, since the cost function in Problem IV measures the distance among all domains jointly.\n\n5 Theoretical Analysis\n\nIn this section, we provide a generalization analysis for the proposed method. 
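Before turning to the analysis, the inter-domain gradient penalty Rτ(f) in (7) above can be sketched numerically. This is our illustrative sketch, not the paper's implementation: we assume a linear critic f(x) = wᵀx, whose gradient is w at every interpolate x̃^(i), so the penalty has a closed form.

```python
import numpy as np

def r_tau(w, n_domains, L_f, tau):
    # Inter-domain gradient penalty of Eq. (7) for a linear critic f(x) = w^T x:
    #   R_tau(f) = tau * ( sum_i ||grad f(x_tilde_i)|| - L_f )_+^2,
    # and for a linear critic every gradient norm equals ||w||.
    grad_norm_sum = n_domains * np.linalg.norm(w)
    return tau * max(0.0, grad_norm_sum - L_f) ** 2

w = np.array([1.0, 0.0])                      # ||w|| = 1
print(r_tau(w, n_domains=2, L_f=2, tau=10))   # sum of norms = 2 = L_f -> 0.0
print(r_tau(w, n_domains=3, L_f=2, tau=10))   # sum of norms = 3 -> 10 * 1^2 = 10.0
```

The penalty is one-sided: it only activates once the summed gradient norms exceed Lf, matching the (·)+ in (7).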
Motivated by [4], we give a new definition of generalization for multiple distributions as follows.\nDefinition 1 (Generalization) Let Ps and Pθi be the continuous real and generated distributions, and P̂s and P̂θi be the empirical real and generated distributions. The distribution distance W(·, ..., ·) is said to generalize with n training samples and error ε if, for every true generated distribution Pθi, the following inequality holds with high probability:\n\n|W(P̂s, P̂θ1, ..., P̂θN) − W(Ps, Pθ1, ..., PθN)| ≤ ε.  (8)\n\nIn Definition 1, the generalization bound measures the difference between the expected distance and the empirical distance. In practice, our goal is to train MWGAN to obtain a small empirical distance, so that the expected distance is also small.\nWith the help of Definition 1, we are able to analyze the generalization ability of the proposed method. Let κ be the capacity of the discriminator; if the discriminator is L-Lipschitz continuous and bounded in [−∆, ∆], then we have the following generalization bound.\nTheorem 3 (Generalization bound) Given the continuous real and generated distributions Ps and Pθi, i∈I, and the empirical versions P̂s and P̂θi, i∈I, with at least n samples in each domain, there is a universal constant C such that when n ≥ Cκ∆² log(Lκ/ε)/ε² with error ε, the following generalization bound is satisfied with probability at least 1−e^{−κ}:\n\n|W(P̂s, P̂θ1, ..., P̂θN) − W(Ps, Pθ1, ..., PθN)| ≤ ε.  (9)\n\nTheorem 3 shows that MWGAN has good generalization ability with enough training data in each domain. In practice, if we successfully minimize the multi-domain Wasserstein distance W(P̂s, P̂θ1, ..., P̂θN), the expected distance W(Ps, Pθ1, ..., PθN) can also be small.\n\nFigure 2: Comparisons of distribution matching abilities on the value surface of the discriminator. Each method learns to map from a Gaussian distribution to six other Gaussian (upper row) or Uniform distributions (lower row). (Green: source distribution; Red: target distributions; Orange: generated distributions.)\n\n6 Experiments\n\nImplementation details. All experiments are conducted in PyTorch, with an NVIDIA TITAN X GPU.3 We use Adam [29] with β1=0.5 and β2=0.999 and set the learning rate to 0.0001. We train the model for 100k iterations with batch size 16. We set α=10, τ=10 and Lf to the number of target domains in Loss (7). The details of the loss function and the network architectures of the discriminator, generators and classifier can be found in Section P of the supplementary materials.\nBaselines. We adopt the following methods as baselines: (i) CycleGAN [51], a two-domain image translation method that can be flexibly extended to perform the multi-domain image translation task; (ii) UFDN [30] and (iii) StarGAN [10], which are multi-domain image translation methods.\nDatasets. We conduct experiments on three datasets. Note that all images are resized to 128×128.\n(i) Toy dataset. We generate a Gaussian distribution in the source domain, and six other Gaussian or Uniform distributions in the target domains. More details can be found in the supplementary materials.\n(ii) CelebA [33] contains 202,599 face images, where each image has 40 binary attributes. 
We use the following attributes: hair color (black, blond and brown), eyeglasses, mustache and pale skin. In the first experiment, we use black hair images as the source domain, and the blond hair, eyeglasses, mustache and pale skin images as target domains. In the second experiment, we extract 50k Canny edges from CelebA. We take the edge images as the source domain and hair images as target domains.\n(iii) Style painting [51]. The sizes of the Real scene, Monet, Van Gogh and Ukiyo-e sets are 6287, 1073, 400 and 563, respectively. We take real scene images as the source domain, and the others as target domains.\nEvaluation Metrics. We use the following evaluation metrics: (i) Fréchet Inception Distance (FID) [24] evaluates the quality of the translated images; in general, a lower FID score means better performance. (ii) Classification accuracy, widely used in [10, 23], evaluates the probability that the generated images belong to the corresponding target domains. Specifically, we train a classifier on CelebA (90% for training and 10% for testing) using ResNet-18 [22], resulting in near-perfect accuracy, and then use this classifier to measure the classification accuracy of the generated images.\n\n6.1 Results on Toy Dataset\n\nWe compare MWGAN with UFDN and StarGAN on the toy dataset to verify the limitations mentioned in Section 2. Specifically, we measure the distribution matching ability and plot the value surface of the discriminator. Here, the value surface depicts the outputs of the discriminator [18, 31].\nIn Figure 2, MWGAN matches the target domain distributions very well, as it is able to capture the geometric information of the real distribution using a low-capacity network. Moreover, the value surface shows that the discriminator provides correct gradients for updating the generators. However, the baseline methods are very sensitive to the type of the source and target domain distributions. 
With the same capacity, the baseline methods are able to match the target domain distributions when the source and target distributions are similar (top row of Figure 2). However, they cannot match the target domain distributions well when the source and target distributions are different (bottom row of Figure 2).\n\n(Figure 2 panels: (a) Real distribution, (b) UFDN, (c) StarGAN, (d) MWGAN.)\n\n3The source code of our method is available: https://github.com/caojiezhang/MWGAN.\n\nFigure 3: Comparisons of attribute translation on CelebA. The first column shows the input images, the next four columns show the single-attribute translation results, and the last four columns show the multi-attribute translation results. (B: Blond hair; E: Eyeglasses; M: Mustache; P: Pale skin.)\n\nTable 1: Comparisons of FID and classification accuracy (%) on single facial attribute translation.\nMethod | Hair FID / Acc | Eyeglasses FID / Acc | Mustache FID / Acc | Pale skin FID / Acc\nCycleGAN | 20.45 / 95.07 | 23.69 / 96.94 | 24.94 / 93.89 | 18.09 / 80.75\nUFDN | 65.06 / 92.01 | 69.30 / 79.34 | 76.04 / 97.18 | 53.11 / 83.33\nStarGAN | 23.47 / 96.00 | 25.36 / 99.51 | 23.75 / 99.06 | 18.12 / 92.48\nMWGAN | 19.63 / 97.65 | 22.94 / 99.53 | 23.69 / 98.35 | 15.91 / 93.66\n\nTable 2: Comparisons of classification accuracy (%) on multi-attribute synthesis. (B: Blond hair; E: Eyeglasses; M: Mustache; P: Pale skin.)\nMethod | B+E | B+M | B+M+E | B+M+E+P\nCycleGAN | 66.43 | 33.33 | 2.11 | 11.03\nUFDN | 72.53 | 51.40 | 8.54 | 23.00\nStarGAN | 66.66 | 62.20 | 6.10 | 45.77\nMWGAN | 75.82 | 69.01 | 19.95 | 53.75\n\nTable 3: Comparisons of the FID value for each facial attribute (different colors of hair) on the Edge→CelebA translation task.\nMethod | Black hair | Blond hair | Brown hair\nCycleGAN | 65.79 | 65.10 | 81.59\nUFDN | 88.40 | 131.65 | 144.78\nStarGAN | 57.51 | 53.41 | 81.00\nMWGAN | 35.24 | 33.81 | 51.87\n\n6.2 Results on CelebA\n\nWe compare MWGAN with several baselines on both balanced and imbalanced translation tasks.\n(i) Balanced image translation task. In this experiment, we train the generators to produce single-attribute images, and then synthesize multi-attribute images using the composite generators. We generate attributes in the order {Blond hair, Eyeglasses, Mustache, Pale skin}. Taking two attributes as an example, let g1 and g2 be the generators of Blond hair and Eyeglasses images, respectively; then images with the Blond hair and Eyeglasses attributes are generated by the composite generator g2∘g1.\nQualitative results. In Figure 3, MWGAN has better or comparable performance to the baselines on the single-attribute translation task, and achieves the highest visual quality on the multi-attribute translation results. In other words, MWGAN has good generalization performance. In contrast, CycleGAN can hardly synthesize multi-attribute images, and UFDN cannot guarantee the identity of the translated images and produces images with blurring structures. Moreover, StarGAN depends heavily on the number of transferred domains, and its synthesized images sometimes lack perceptual realism.\nQuantitative results. We further compare FID and classification accuracy for the single-attribute results. 
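FID, used in Table 1, is the Fréchet distance between Gaussian fits of feature statistics. A minimal sketch (our simplification: diagonal covariances, so no matrix square root is needed):

```python
import numpy as np

def fid_diag(mu1, var1, mu2, var2):
    # Frechet distance between two Gaussians with diagonal covariances:
    #   ||mu1 - mu2||^2 + sum(var1 + var2 - 2*sqrt(var1 * var2)).
    mu1, var1, mu2, var2 = map(np.asarray, (mu1, var1, mu2, var2))
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2))
    return float(mean_term + cov_term)

print(fid_diag([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0]))  # identical stats -> 0.0
print(fid_diag([1.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0]))  # unit mean shift -> 1.0
```

In practice FID uses full covariances of Inception features and requires a matrix square root; lower is better, as in Table 1.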
For the multi-attribute results, we only report classification accuracy, because FID is no longer a valid measure and may give misleading results when training data are not sufficient [24]. In Table 1, MWGAN achieves the lowest FID and comparable classification accuracy, indicating that it produces the most realistic single-attribute images. In Table 2, MWGAN achieves the highest classification accuracy and thus synthesizes the most realistic multi-attribute images.\n\nFigure 4: Comparisons of the Edge→CelebA translation results. The first column shows the input images, and the next three columns show the single-attribute translation results.\n\nFigure 5: Comparisons of style transfer results. The first column shows the real-world images, and the last three columns show the translation results, i.e., Monet, Van Gogh and Ukiyo-e.\n\n(ii) Imbalanced image translation task. In this experiment, we compare MWGAN with the baselines on the Edge→CelebA translation task. Note that this task is imbalanced because edge images carry much less information than facial attribute images.\nQualitative results. In Figure 4, MWGAN is able to generate the most natural-looking facial images with the corresponding attributes from edge images. In contrast, UFDN fails to preserve the facial texture of an edge image and generates images with very blurry and distorted structure. In addition, CycleGAN and StarGAN mostly preserve the domain information but cannot maintain the sharpness of the images and the facial structure information. Moreover, this experiment also shows the superiority of our method on the imbalanced image translation task.\nQuantitative results. In Table 3, MWGAN achieves the lowest FID, showing that it is able to produce the most realistic facial attributes from the edge images. 
In contrast, the FID values of the baselines are large because these methods struggle to generate sharp and realistic images. We also perform a perceptual evaluation with AMT for this task (see Section M in the supplementary materials).

6.3 Results on Painting Translation

In this experiment, we train our model on the painting dataset to conduct the style transfer task [41, 44]. As suggested in [14, 15, 51], we only show qualitative results. Note that this translation task is also imbalanced because the input and target distributions are significantly different.
In Figure 5, MWGAN generates painting images with higher visual quality. In contrast, UFDN fails to generate paintings with clear structure because a domain-invariant representation is hard to learn when domains are highly imbalanced. CycleGAN cannot fully exploit the useful information in painting images when translating scene images. When taking a painting image as input, StarGAN may obtain misleading information to update the generator. In this sense, when all domains are significantly different, StarGAN may not learn a good single generator that synthesizes images for multiple domains.

7 Conclusion

In this paper, we have proposed a novel Multi-marginal Wasserstein GAN (MWGAN) for the multiple marginal matching problem. Specifically, with the help of multi-marginal optimal transport theory, we develop a new dual formulation for better adversarial learning on the unsupervised multi-domain image translation task. Moreover, we theoretically define and further analyze the generalization ability of the proposed method.
Extensive experiments on both toy and real-world datasets demonstrate the effectiveness of the proposed method.

Acknowledgements

This work is partially funded by Guangdong Provincial Scientific and Technological Funds under Grants 2018B010107001, National Natural Science Foundation of China (NSFC) 61602185, key project of NSFC (No. 61836003), Fundamental Research Funds for the Central Universities D2191240, Program for Guangdong Introducing Innovative and Entrepreneurial Teams 2017ZT07X183, and Tencent AI Lab Rhino-Bird Focused Research Program (No. JR201902). This work is also partially funded by Microsoft Research Asia (MSRA Collaborative Research Program 2019).

References

[1] J. Adler and S. Lunz. Banach Wasserstein GAN. In Advances in Neural Information Processing Systems, pages 6754–6763, 2018.

[2] A. Almahairi, S. Rajeshwar, A. Sordoni, P. Bachman, and A. Courville. Augmented CycleGAN: Learning many-to-many mappings from unpaired data. In International Conference on Machine Learning, volume 80, pages 195–204, 2018.

[3] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.

[4] S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang. Generalization and equilibrium in generative adversarial nets (GANs). In International Conference on Machine Learning, volume 70, pages 224–232, 2017.

[5] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019.

[6] J. Cao, Y. Guo, Q. Wu, C. Shen, J. Huang, and M. Tan. Adversarial learning with local coordinate coding. In International Conference on Machine Learning, volume 80, pages 707–715, 2018.

[7] J. Cao, Q. Wu, Y.
Yan, L. Wang, and M. Tan. On the flatness of loss surface for two-layered ReLU networks. In Asian Conference on Machine Learning, pages 545–560, 2017.

[8] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, 2016.

[9] X. Chen, C. Xu, X. Yang, and D. Tao. Attention-GAN for object transfiguration in wild images. In The European Conference on Computer Vision, pages 164–180, 2018.

[10] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In The IEEE Conference on Computer Vision and Pattern Recognition, June 2018.

[11] F. Farnia and D. Tse. A convex duality framework for GANs. In Advances in Neural Information Processing Systems, pages 5248–5258, 2018.

[12] C. Frogner, C. Zhang, H. Mobahi, M. Araya, and T. A. Poggio. Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems, 2015.

[13] T. Galanti, S. Benaim, and L. Wolf. Generalization bounds for unsupervised cross-domain mapping with WGANs. arXiv preprint arXiv:1807.08501, 2018.

[14] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.

[15] L. A. Gatys, A. S. Ecker, M. Bethge, A. Hertzmann, and E. Shechtman. Controlling perceptual factors in neural style transfer. In The IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[16] A. Genevay, G. Peyre, and M. Cuturi. Learning generative models with Sinkhorn divergences. In Artificial Intelligence and Statistics, volume 84, pages 1608–1617, 2018.

[17] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S.
Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[18] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.

[19] Y. Guo, Q. Chen, J. Chen, J. Huang, Y. Xu, J. Cao, P. Zhao, and M. Tan. Dual reconstruction nets for image super-resolution with gradient sensitive loss. arXiv preprint arXiv:1809.07099, 2018.

[20] Y. Guo, Q. Chen, J. Chen, Q. Wu, Q. Shi, and M. Tan. Auto-embedding generative adversarial networks for high resolution image synthesis. IEEE Transactions on Multimedia, 2019.

[21] Y. Guo, Y. Zheng, M. Tan, Q. Chen, J. Chen, P. Zhao, and J. Huang. NAT: Neural architecture transformer for accurate and compact architectures. In Advances in Neural Information Processing Systems, 2019.

[22] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[23] Z. He, W. Zuo, M. Kan, S. Shan, and X. Chen. Arbitrary facial attribute editing: Only change what you want. arXiv preprint arXiv:1711.10678, 2017.

[24] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.

[25] L. Hui, X. Li, J. Chen, H. He, and J. Yang. Unsupervised multi-domain image translation with domain-specific encoders/decoders. In International Conference on Pattern Recognition, pages 2044–2049, 2018.

[26] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.

[27] H. Kazemi, S. Soleymani, F. Taherkhani, S.
Iranmanesh, and N. Nasrabadi. Unsupervised image-to-image translation using domain-specific variational information bound. In Advances in Neural Information Processing Systems, pages 10348–10358, 2018.

[28] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. In International Conference on Machine Learning, 2017.

[29] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

[30] A. H. Liu, Y.-C. Liu, Y.-Y. Yeh, and Y.-C. F. Wang. A unified feature disentangler for multi-domain image translation and manipulation. In Advances in Neural Information Processing Systems, pages 2595–2604, 2018.

[31] H. Liu, G. Xianfeng, and D. Samaras. A two-step computation of the exact GAN Wasserstein distance. In International Conference on Machine Learning, pages 3165–3174, 2018.

[32] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, 2017.

[33] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In The IEEE International Conference on Computer Vision, pages 3730–3738, 2015.

[34] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley. Least squares generative adversarial networks. In The IEEE International Conference on Computer Vision, pages 2794–2802, 2017.

[35] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. In International Conference on Learning Representations, 2016.

[36] X. Pan, M. Zhang, and D. Ding. Theoretical analysis of image-to-image translation with adversarial learning. In International Conference on Machine Learning, 2018.

[37] H. Petzka, A. Fischer, and D. Lukovnikov. On the regularization of Wasserstein GANs.
In International Conference on Learning Representations, 2018.

[38] K. Roth, A. Lucchi, S. Nowozin, and T. Hofmann. Stabilizing training of generative adversarial networks through regularization. In Advances in Neural Information Processing Systems, pages 2018–2028, 2017.

[39] M. Sanjabi, J. Ba, M. Razaviyayn, and J. D. Lee. On the convergence and robustness of training GANs with regularized optimal transport. In Advances in Neural Information Processing Systems, pages 7091–7101, 2018.

[40] F. Santambrogio. Optimal transport for applied mathematicians. Birkhäuser, NY, pages 99–102, 2015.

[41] C. Song, Z. Wu, Y. Zhou, M. Gong, and H. Huang. ETNet: Error transition network for arbitrary style transfer. In Advances in Neural Information Processing Systems, pages 668–677, 2019.

[42] C. Villani. Optimal Transport: Old and New. Springer Science & Business Media, 2008.

[43] C. Wang, C. Xu, X. Yao, and D. Tao. Evolutionary generative adversarial networks. IEEE Transactions on Evolutionary Computation, 2019.

[44] Z. Wu, C. Song, Y. Zhou, M. Gong, and H. Huang. EFANet: Exchangeable feature alignment network for arbitrary style transfer. In AAAI Conference on Artificial Intelligence, 2020.

[45] S. Xie, Z. Zheng, L. Chen, and C. Chen. Learning semantic representations for unsupervised domain adaptation. In International Conference on Machine Learning, pages 5419–5428, 2018.

[46] Y. Yan, M. Tan, Y. Xu, J. Cao, M. Ng, H. Min, and Q. Wu. Oversampling for imbalanced data via optimal transport. In AAAI Conference on Artificial Intelligence, volume 33, pages 5605–5612, 2019.

[47] Z. Yi, H. R. Zhang, P. Tan, and M. Gong. DualGAN: Unsupervised dual learning for image-to-image translation. In The IEEE International Conference on Computer Vision, pages 2868–2876, 2017.

[48] R. Zeng, C. Gan, P. Chen, W. Huang, Q. Wu, and M. Tan.
Breaking winner-takes-all: Iterative-winners-out networks for weakly supervised temporal action localization. IEEE Transactions on Image Processing, 28(12):5797–5808, 2019.

[49] R. Zeng, W. Huang, M. Tan, Y. Rong, P. Zhao, J. Huang, and C. Gan. Graph convolutional networks for temporal action localization. In The IEEE International Conference on Computer Vision, 2019.

[50] Y. Zhang, P. Zhao, J. Cao, W. Ma, J. Huang, Q. Wu, and M. Tan. Online adaptive asymmetric active learning for budgeted imbalanced data. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2768–2777, 2018.

[51] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In The IEEE International Conference on Computer Vision, 2017.

[52] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pages 465–476, 2017.

[53] Z. Zhuang, M. Tan, B. Zhuang, J. Liu, Y. Guo, Q. Wu, J. Huang, and J. Zhu. Discrimination-aware channel pruning for deep neural networks. In Advances in Neural Information Processing Systems, pages 875–886, 2018.