{"title": "The Numerics of GANs", "book": "Advances in Neural Information Processing Systems", "page_first": 1825, "page_last": 1835, "abstract": "In this paper, we analyze the numerics of common algorithms for training Generative Adversarial Networks (GANs). Using the formalism of smooth two-player games we analyze the associated gradient vector field of GAN training objectives. Our findings suggest that the convergence of current algorithms suffers due to two factors: i) presence of eigenvalues of the Jacobian of the gradient vector field with zero real-part, and ii) eigenvalues with big imaginary part. Using these findings, we design a new algorithm that overcomes some of these limitations and has better convergence properties. Experimentally, we demonstrate its superiority on training common GAN architectures and show convergence on GAN architectures that are known to be notoriously hard to train.", "full_text": "The Numerics of GANs\n\nLars Mescheder\n\nAutonomous Vision Group\n\nMPI T\u00fcbingen\n\nMachine Intelligence and Perception Group\n\nSebastian Nowozin\n\nMicrosoft Research\n\nlars.mescheder@tuebingen.mpg.de\n\nsebastian.nowozin@microsoft.com\n\nAndreas Geiger\n\nAutonomous Vision Group\n\nMPI T\u00fcbingen\n\nandreas.geiger@tuebingen.mpg.de\n\nAbstract\n\nIn this paper, we analyze the numerics of common algorithms for training Gener-\native Adversarial Networks (GANs). Using the formalism of smooth two-player\ngames we analyze the associated gradient vector \ufb01eld of GAN training objectives.\nOur \ufb01ndings suggest that the convergence of current algorithms suffers due to two\nfactors: i) presence of eigenvalues of the Jacobian of the gradient vector \ufb01eld with\nzero real-part, and ii) eigenvalues with big imaginary part. Using these \ufb01ndings,\nwe design a new algorithm that overcomes some of these limitations and has better\nconvergence properties. 
Experimentally, we demonstrate its superiority on training\ncommon GAN architectures and show convergence on GAN architectures that are\nknown to be notoriously hard to train.\n\n1\n\nIntroduction\n\nGenerative Adversarial Networks (GANs) [10] have been very successful in learning probability\ndistributions. Since their \ufb01rst appearance, GANs have been successfully applied to a variety of\ntasks, including image-to-image translation [12], image super-resolution [13], image in-painting [27]\ndomain adaptation [26], probabilistic inference [14, 9, 8] and many more.\nWhile very powerful, GANs are known to be notoriously hard to train. The standard strategy for\nstabilizing training is to carefully design the model, either by adapting the architecture [21] or by\nselecting an easy-to-optimize objective function [23, 4, 11].\nIn this work, we examine the general problem of \ufb01nding local Nash-equilibria of smooth games. We\nrevisit the de-facto standard algorithm for \ufb01nding such equilibrium points, simultaneous gradient\nascent. We theoretically show that the main factors preventing the algorithm from converging are\nthe presence of eigenvalues of the Jacobian of the associated gradient vector \ufb01eld with zero real-part\nand eigenvalues with a large imaginary part. The presence of the latter is also one of the reasons that\nmake saddle-point problems more dif\ufb01cult than local optimization problems. Utilizing these insights,\nwe design a new algorithm that overcomes some of these problems. Experimentally, we show that\nour algorithm leads to stable training on many GAN architectures, including some that are known to\nbe hard to train.\nOur technique is orthogonal to strategies that try to make the GAN-game well-de\ufb01ned, e.g. 
by adding\ninstance noise [24] or by using the Wasserstein-divergence [4, 11]: while these strategies try to ensure\nthe existence of Nash-equilibria, our paper deals with their computation and the numerical difficulties\nthat can arise in practice.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nIn summary, our contributions are as follows:\n\n\u2022 We identify the main reasons why simultaneous gradient ascent often fails to find local\nNash-equilibria.\n\u2022 By utilizing these insights, we design a new, more robust algorithm for finding Nash-equilibria\nof smooth two-player games.\n\u2022 We empirically demonstrate that our method enables stable training of GANs on a variety of\narchitectures and divergence measures.\n\nThe proofs for the theorems in this paper can be found in the supplementary material.1\n\n2 Background\n\nIn this section we first revisit the concept of Generative Adversarial Networks (GANs) from a\ndivergence minimization point of view. We then introduce the concept of a smooth (non-convex)\ntwo-player game and define the terminology used in the rest of the paper. Finally, we describe\nsimultaneous gradient ascent, the de-facto standard algorithm for finding Nash-equilibria of such\ngames, and derive some of its properties.\n\n2.1 Divergence Measures and GANs\n\nGenerative Adversarial Networks are best understood in the context of divergence minimization:\nassume we are given a divergence function D, i.e. a function that takes a pair of probability\ndistributions as input, outputs an element from [0,\u221e] and satisfies D(p, p) = 0 for all probability\ndistributions p. Moreover, assume we are given some target distribution p0 from which we can draw\ni.i.d. samples and a parametric family of distributions q\u03b8 that also allows us to draw i.i.d. samples. 
In\npractice q\u03b8 is usually implemented as a neural network that acts on a hidden code z sampled from\nsome known distribution and outputs an element from the target space. Our goal is to find \u03b8\u0304 that\nminimizes the divergence D(p0, q\u03b8), i.e. we want to solve the optimization problem\n\n$$\\min_{\\theta} D(p_0, q_\\theta). \\tag{1}$$\n\nMost divergences that are used in practice can be represented in the following form [10, 16, 4]:\n\n$$D(p, q) = \\max_{f \\in F} \\mathbb{E}_{x \\sim q}[g_1(f(x))] - \\mathbb{E}_{x \\sim p}[g_2(f(x))] \\tag{2}$$\n\nfor some function class F \u2286 X \u2192 R and convex functions g1, g2 : R \u2192 R. Together with (1), this\nleads to mini-max problems of the form\n\n$$\\min_{\\theta} \\max_{f \\in F} \\mathbb{E}_{x \\sim q_\\theta}[g_1(f(x))] - \\mathbb{E}_{x \\sim p_0}[g_2(f(x))]. \\tag{3}$$\n\nThese divergences include the Jensen-Shannon divergence [10], all f-divergences [16], the Wasserstein\ndivergence [4] and even the indicator divergence, which is 0 if p = q and \u221e otherwise.\nIn practice, the function class F in (3) is approximated with a parametric family of functions,\ne.g. parameterized by a neural network. Of course, when minimizing the divergence w.r.t. this\napproximated family, we no longer minimize the correct divergence. However, it can be verified that\ntaking any class of functions in (3) leads to a divergence function for appropriate choices of g1 and\ng2. Therefore, some authors call these divergence functions neural network divergences [5].\n\n2.2 Smooth Two-Player Games\n\nA differentiable two-player game is defined by two utility functions f (\u03c6, \u03b8) and g(\u03c6, \u03b8) defined over a\ncommon space (\u03c6, \u03b8) \u2208 \u21261 \u00d7 \u21262. \u21261 corresponds to the possible actions of player 1, \u21262 corresponds\nto the possible actions of player 2. The goal of player 1 is to maximize f, whereas player 2 tries to\nmaximize g. 
In the context of GANs, \u21261 is the set of possible parameter values for the generator,\nwhereas \u21262 is the set of possible parameter values for the discriminator. We call a game a zero-sum\ngame if f = \u2212g. Note that the derivation of the GAN-game in Section 2.1 leads to a zero-sum game,\nwhereas in practice people usually employ a variant of this formulation that is not a zero-sum game\nfor better convergence [10].\n\n1The code for all experiments in this paper is available under https://github.com/LMescheder/TheNumericsOfGANs.\n\nAlgorithm 1 Simultaneous Gradient Ascent (SimGA)\n1: while not converged do\n2:   v\u03c6 \u2190 \u2207\u03c6f (\u03b8, \u03c6)\n3:   v\u03b8 \u2190 \u2207\u03b8g(\u03b8, \u03c6)\n4:   \u03c6 \u2190 \u03c6 + hv\u03c6\n5:   \u03b8 \u2190 \u03b8 + hv\u03b8\n6: end while\n\nOur goal is to find a Nash-equilibrium of the game, i.e. a point x\u0304 = (\u03c6\u0304, \u03b8\u0304) given by the two conditions\n\n$$\\bar{\\phi} \\in \\operatorname{argmax}_{\\phi} f(\\phi, \\bar{\\theta}) \\quad \\text{and} \\quad \\bar{\\theta} \\in \\operatorname{argmax}_{\\theta} g(\\bar{\\phi}, \\theta). \\tag{4}$$\n\nWe call a point (\u03c6\u0304, \u03b8\u0304) a local Nash-equilibrium, if (4) holds in a local neighborhood of (\u03c6\u0304, \u03b8\u0304).\nEvery differentiable two-player game defines a vector field\n\n$$v(\\phi, \\theta) = \\begin{pmatrix} \\nabla_{\\phi} f(\\phi, \\theta) \\\\ \\nabla_{\\theta} g(\\phi, \\theta) \\end{pmatrix}. \\tag{5}$$\n\nWe call v the associated gradient vector field to the game defined by f and g.\nFor the special case of zero-sum two-player games, we have g = \u2212f and thus\n\n$$v'(\\phi, \\theta) = \\begin{pmatrix} \\nabla_{\\phi}^2 f(\\phi, \\theta) & \\nabla_{\\phi,\\theta} f(\\phi, \\theta) \\\\ -\\nabla_{\\phi,\\theta} f(\\phi, \\theta) & -\\nabla_{\\theta}^2 f(\\phi, \\theta) \\end{pmatrix}. \\tag{6}$$\n\nAs a direct consequence, we have the following:\nLemma 1. 
For zero-sum games, v'(x) is negative (semi-)definite if and only if $\\nabla_{\\phi}^2 f(\\phi, \\theta)$ is negative\n(semi-)definite and $\\nabla_{\\theta}^2 f(\\phi, \\theta)$ is positive (semi-)definite.\nCorollary 2. For zero-sum games, v'(x\u0304) is negative semi-definite for any local Nash-equilibrium\nx\u0304. Conversely, if x\u0304 is a stationary point of v(x) and v'(x\u0304) is negative definite, then x\u0304 is a local\nNash-equilibrium.\n\nNote that Corollary 2 is not true for general two-player games.\n\n2.3 Simultaneous Gradient Ascent\n\nThe de-facto standard algorithm for finding Nash-equilibria of general smooth two-player games\nis Simultaneous Gradient Ascent (SimGA), which was described in several works, for example in\n[22] and, more recently also in the context of GANs, in [16]. The idea is simple and is illustrated in\nAlgorithm 1. We iteratively update the parameters of the two players by simultaneously applying\ngradient ascent to the utility functions of the two players. This can also be understood as applying the\nEuler-method to the ordinary differential equation\n\n$$\\frac{d}{dt} x(t) = v(x(t)), \\tag{7}$$\n\nwhere v(x) is the associated gradient vector field of the two-player game.\nIt can be shown that simultaneous gradient ascent converges locally to a Nash-equilibrium for a\nzero-sum game, if the Hessian of both players is negative definite [16, 22] and the learning rate is\nsmall enough. Unfortunately, in the context of GANs the former condition is rarely met. 
We revisit\nthe properties of simultaneous gradient ascent in Section 3 and also show a more subtle property,\nnamely that even if the conditions for the convergence of simultaneous gradient ascent are met, it\nmight require extremely small step sizes for convergence if the Jacobian of the associated gradient\nvector field has eigenvalues with large imaginary part.\n\n[Figure 1, three panels in the complex plane with axes \u211c(z) and \u2111(z): (a) Illustration of how the eigenvalues are projected into the unit ball. (b) Example where h has to be chosen extremely small. (c) Illustration of how our method alleviates the problem.]\n\nFigure 1: Images showing how the eigenvalues of A are projected into the unit circle and what causes\nproblems: when discretizing the gradient flow with step size h, the eigenvalues of the Jacobian at a\nfixed point are projected into the unit ball along rays from 1. However, this is only possible if the\neigenvalues lie in the left half plane and requires extremely small step sizes h if the eigenvalues are\nclose to the imaginary axis. The proposed method moves the eigenvalues to the left in order to make\nthe problem better posed, thus allowing the algorithm to converge for reasonable step sizes.\n\n3 Convergence Theory\n\nIn this section, we analyze the convergence properties of the most common method for training\nGANs, simultaneous gradient ascent2. We show that two major failure causes for this algorithm\nare eigenvalues of the Jacobian of the associated gradient vector field with zero real-part as well as\neigenvalues with large imaginary part.\nFor our theoretical analysis, we start with the following classical theorem about the convergence of\nfixed-point iterations:\nProposition 3. Let F : \u2126 \u2192 \u2126 be a continuously differentiable function on an open subset \u2126 of R^n\nand let x\u0304 \u2208 \u2126 be so that\n\n1. F (x\u0304) = x\u0304, and\n2. 
the absolute values of the eigenvalues of the Jacobian F'(x\u0304) are all smaller than 1.\n\nThen there is an open neighborhood U of x\u0304 so that for all x0 \u2208 U, the iterates F^(k)(x0) converge\nto x\u0304. The rate of convergence is at least linear. More precisely, the error \u2016F^(k)(x0) \u2212 x\u0304\u2016 is in\nO(|\u03bbmax|^k) for k \u2192 \u221e where \u03bbmax is the eigenvalue of F'(x\u0304) with the largest absolute value.\n\nProof. See [6], Proposition 4.4.1.\n\nIn numerics, we often consider functions of the form\n\n$$F(x) = x + h\\,G(x) \\tag{8}$$\n\nfor some h > 0. Finding fixed points of F is then equivalent to finding solutions to the nonlinear\nequation G(x) = 0 for x. For F as in (8), the Jacobian is given by\n\n$$F'(x) = I + h\\,G'(x). \\tag{9}$$\n\nNote that in general neither F'(x) nor G'(x) is symmetric, so both can have complex eigenvalues.\nThe following Lemma gives a simple condition under which a fixed point of F as in (8) satisfies the\nconditions of Proposition 3.\n\n2A similar analysis of alternating gradient ascent, a popular alternative to simultaneous gradient ascent, can\nbe found in the supplementary material.\n\nLemma 4. Assume that $A \\in \\mathbb{R}^{n \\times n}$ only has eigenvalues with negative real-part and let h > 0. Then\nthe eigenvalues of the matrix I + h A lie in the unit ball if and only if\n\n$$h < \\frac{1}{|\\Re(\\lambda)|} \\, \\frac{2}{1 + \\left( \\frac{\\Im(\\lambda)}{\\Re(\\lambda)} \\right)^2} \\tag{10}$$\n\nfor all eigenvalues \u03bb of A.\nCorollary 5. If v'(x\u0304) only has eigenvalues with negative real-part at a stationary point x\u0304, then\nAlgorithm 1 is locally convergent to x\u0304 for h > 0 small enough.\n\nEquation 10 shows that there are two major factors that determine the maximum possible step size h:\n(i) the maximum value of \u211c(\u03bb) and (ii) the maximum value q of |\u2111(\u03bb)/\u211c(\u03bb)|. 
Note that as q goes to\ninfinity, we have to choose h according to O(q\u207b\u00b2) which can quickly become extremely small. This\nis visualized in Figure 1: if G'(x\u0304) has an eigenvalue with small absolute real part but big imaginary\npart, h needs to be chosen extremely small to still achieve convergence. Moreover, even if we make h\nsmall enough, most eigenvalues of F'(x\u0304) will be very close to 1, which leads by Proposition 3 to very\nslow convergence of the algorithm. This is in particular a problem of simultaneous gradient ascent\nfor two-player games (in contrast to gradient ascent for local optimization), where the Jacobian G'(x\u0304)\nis not symmetric and can therefore have non-real eigenvalues.\n\n4 Consensus Optimization\n\nIn this section, we derive the proposed method and analyze its convergence properties.\n\n4.1 Derivation\n\nFinding stationary points of the vector field v(x) is equivalent to solving the equation v(x) = 0. In\nthe context of two-player games this means solving the two equations\n\n$$\\nabla_{\\phi} f(\\phi, \\theta) = 0 \\quad \\text{and} \\quad \\nabla_{\\theta} g(\\phi, \\theta) = 0. \\tag{11}$$\n\nA simple strategy for finding such stationary points is to minimize $L(x) = \\frac{1}{2} \\|v(x)\\|^2$ for x. Unfortunately,\nthis can result in unstable stationary points of v or other local minima of $\\frac{1}{2} \\|v(x)\\|^2$ and in\npractice, we found it did not work well.\nWe therefore consider a modified vector field w(x) that is as close as possible to the original vector\nfield v(x), but at the same time still minimizes L(x) (at least locally). A sensible candidate for such\na vector field is\n\n$$w(x) = v(x) - \\gamma \\nabla L(x) \\tag{12}$$\n\nfor some \u03b3 > 0. A simple calculation shows that the gradient \u2207L(x) is given by\n\n$$\\nabla L(x) = v'(x)^T v(x). \\tag{13}$$\n\nThe vector field w(x) is the gradient vector field associated with the modified two-player game given by the\ntwo modified utility functions\n\n$$\\tilde{f}(\\phi, \\theta) = f(\\phi, \\theta) - \\gamma L(\\phi, \\theta) \\quad \\text{and} \\quad \\tilde{g}(\\phi, \\theta) = g(\\phi, \\theta) - \\gamma L(\\phi, \\theta). \\tag{14}$$\n\nThe regularizer L(\u03c6, \u03b8) encourages agreement between the two players. Therefore we call the\nresulting algorithm Consensus Optimization (Algorithm 2). 3 4\n\n3This algorithm requires backpropagation through the squared norm of the gradient with respect to the\nweights of the network. This is sometimes called double backpropagation and is for example supported by the\ndeep learning frameworks Tensorflow [1] and PyTorch [19].\n\n4As was pointed out by Ferenc Husz\u00e1r in one of his blog posts on www.inference.vc, naively implementing\nthis algorithm in a mini-batch setting leads to biased estimates of L(x). However, the bias goes down linearly\nwith the batch size, which justifies the usage of consensus optimization in a mini-batch setting. Alternatively,\nit is possible to debias the estimate by subtracting a multiple of the sample variance of the gradients; see the\nsupplementary material for details.\n\nAlgorithm 2 Consensus optimization\n1: while not converged do\n2:   v\u03c6 \u2190 \u2207\u03c6(f (\u03b8, \u03c6) \u2212 \u03b3L(\u03b8, \u03c6))\n3:   v\u03b8 \u2190 \u2207\u03b8(g(\u03b8, \u03c6) \u2212 \u03b3L(\u03b8, \u03c6))\n4:   \u03c6 \u2190 \u03c6 + hv\u03c6\n5:   \u03b8 \u2190 \u03b8 + hv\u03b8\n6: end while\n\n4.2 Convergence\n\nFor analyzing convergence, we consider a more general algorithm than in Section 4.1 which is given\nby iteratively applying a function F of the form\n\n$$F(x) = x + h\\,A(x) v(x) \\tag{15}$$\n\nfor some step size h > 0 and an invertible matrix A(x) to x. Consensus optimization is a special\ncase of this algorithm for $A(x) = I - \\gamma v'(x)^T$. We assume that 1/\u03b3 is not an eigenvalue of $v'(x)^T$ for\nany x, so that A(x) is indeed invertible.\nLemma 6. Assume h > 0 and A(x) invertible for all x. Then x\u0304 is a fixed point of (15) if and only if\nit is a stationary point of v. Moreover, if x\u0304 is a stationary point of v, we have\n\n$$F'(\\bar{x}) = I + h\\,A(\\bar{x}) v'(\\bar{x}). \\tag{16}$$\n\nLemma 7. Let $A(x) = I - \\gamma v'(x)^T$ and assume that v'(x\u0304) is negative semi-definite and invertible5.\nThen A(x\u0304)v'(x\u0304) is negative definite.\n\nAs a consequence of Lemma 6 and Lemma 7, we can show local convergence of our algorithm to a\nlocal Nash equilibrium:\nCorollary 8. Let v(x) be the associated gradient vector field of a two-player zero-sum game and\n$A(x) = I - \\gamma v'(x)^T$. If x\u0304 is a local Nash-equilibrium, then there is an open neighborhood U of x\u0304 so\nthat for all x0 \u2208 U, the iterates F^(k)(x0) converge to x\u0304 for h > 0 small enough.\nOur method solves the problem of eigenvalues of the Jacobian with (approximately) zero real-part.\nAs the next Lemma shows, it also alleviates the problem of eigenvalues with a big\nimaginary-to-real-part-quotient:\nLemma 9. 
Assume that $A \\in \\mathbb{R}^{n \\times n}$ is negative semi-definite. Let q(\u03b3) be the maximum of |\u2111(\u03bb)|/|\u211c(\u03bb)|\n(possibly infinite) with respect to \u03bb where \u03bb denotes the eigenvalues of $A - \\gamma A^T A$ and \u211c(\u03bb) and\n\u2111(\u03bb) denote their real and imaginary part respectively. Moreover, assume that A is invertible with\n|Av| \u2265 \u03c1|v| for \u03c1 > 0 and let\n\n$$c = \\min_{v \\in S(\\mathbb{C}^n)} \\frac{|\\bar{v}^T (A + A^T) v|}{|\\bar{v}^T (A - A^T) v|} \\tag{17}$$\n\nwhere $S(\\mathbb{C}^n)$ denotes the unit sphere in $\\mathbb{C}^n$. Then\n\n$$q(\\gamma) \\leq \\frac{1}{c + 2 \\rho^2 \\gamma}. \\tag{18}$$\n\nLemma 9 shows that the imaginary-to-real-part-quotient can be made arbitrarily small for an appropriate\nchoice of \u03b3. According to Proposition 3, this leads to better convergence properties near a local\nNash-equilibrium.\n\n5Note that v'(x\u0304) is usually not symmetric and therefore it is possible that v'(x\u0304) is negative semi-definite and\ninvertible but not negative-definite.\n\n5 Experiments\n\nMixture of Gaussians\nIn our first experiment we evaluate our method on a simple 2D-example\nwhere our goal is to learn a mixture of 8 Gaussians with standard deviations equal to 10\u207b\u00b2 and modes\nuniformly distributed around the unit circle. While this example is simplistic, algorithms for training GANs often fail to\nconverge even on such simple examples without extensive fine-tuning of the architecture and\nhyperparameters [15].\nFor both the generator and critic we use fully connected neural networks with 4 hidden layers and\n16 hidden units in each layer. For all layers, we use RELU-nonlinearities. We use a 16-dimensional\nGaussian prior for the latent code z and set up the game between the generator and critic using the\nutility functions as in [10]. To test our method, we run both SimGA and our method with RMSProp\nand a learning rate of 10\u207b\u2074 for 20000 steps. For our method, we use a regularization parameter of\n\u03b3 = 10.\nThe results produced by SimGA and our method for 0, 5000, 10000 and 20000 iterations are depicted\nin Figure 2. We see that while SimGA jumps around the modes of the distribution and fails to\nconverge, our method converges smoothly to the target distribution (shown in red). Figure 3 shows\nthe empirical distribution of the eigenvalues of the Jacobian of v(x) and the regularized vector field\nw(x). It can be seen that near the Nash-equilibrium most eigenvalues are indeed very close to the\nimaginary axis and that the proposed modification of the vector field used in consensus optimization\nmoves the eigenvalues to the left.\n\n[Figure 2 panels: (a) Simultaneous Gradient Ascent, (b) Consensus optimization]\n\nFigure 2: Comparison of Simultaneous Gradient Ascent and Consensus optimization on a circular\nmixture of Gaussians. The images depict from left to right the resulting densities of the algorithm\nafter 0, 5000, 10000 and 20000 iterations as well as the target density (in red).\n\n[Figure 3 panels: columns v'(x) and w'(x); rows before training and after training]\n\nFigure 3: Empirical distribution of eigenvalues before and after training using consensus optimization.\nThe first column shows the distribution of the eigenvalues of the Jacobian v'(x) of the unmodified\nvector field v(x). The second column shows the eigenvalues of the Jacobian w'(x) of the regularized\nvector field w(x) = v(x) \u2212 \u03b3\u2207L(x) used in consensus optimization. We see that v'(x) has eigenvalues\nclose to the imaginary axis near the Nash-equilibrium. As predicted theoretically, this is not the\ncase for the regularized vector field w(x). For visualization purposes, the real part of the spectrum of\nw'(x) before training was clipped.\n\n[Figure 4 panels: (a) cifar-10, (b) celebA]\n\nFigure 4: Samples generated from a model where both the generator and discriminator are given as in\n[21], but without batch-normalization. For celebA, we also use a constant number of filters in each\nlayer and add additional RESNET-layers.\n\n[Figure 5 panels: (a) Discriminator loss, (b) Generator loss, (c) Inception score]\n\nFigure 5: (a) and (b): Comparison of the generator and discriminator loss on a DC-GAN architecture\nwith 3 convolutional layers trained on cifar-10 for consensus optimization (without\nbatch-normalization) and alternating gradient ascent (with batch-normalization). We observe that while\nalternating gradient ascent leads to highly fluctuating losses, consensus optimization successfully\nstabilizes the training and makes the losses almost constant during training. (c): Comparison of the\ninception score over time which was computed using 6400 samples. We see that on this architecture\nboth methods have comparable rates of convergence and consensus optimization achieves slightly\nbetter end results.\n\nCIFAR-10 and CelebA In our second experiment, we apply our method to the cifar-10 and celebA-\ndatasets, using a DC-GAN-like architecture [21] without batch normalization in the generator or the\ndiscriminator. For celebA, we additionally use a constant number of filters in each layer and add\nadditional RESNET-layers. These architectures are known to be hard to optimize using simultaneous\n(or alternating) gradient ascent [21, 4].\nFigures 4a and 4b depict samples from the model trained with our method. We see that our method\nsuccessfully trains the models and we also observe that unlike when using alternating gradient ascent,\nthe generator and discriminator losses remain almost constant during training. This is illustrated\nin Figure 5. For a quantitative evaluation, we also measured the inception-score [23] over time\n(Figure 5c), showing that our method compares favorably to a DC-GAN trained with alternating\ngradient ascent. 
The improvement of consensus optimization over alternating gradient ascent is even\nmore significant if we use 4 instead of 3 convolutional layers; see Figure 11 in the supplementary\nmaterial for details.\nAdditional experimental results can be found in the supplementary material.\n\n6 Discussion\n\nWhile we could prove local convergence of our method in Section 4, we believe that even more\ninsights can be gained by examining global convergence properties. In particular, our analysis from\nSection 4 cannot explain why the generator and discriminator losses remain almost constant during\ntraining.\nOur theoretical results assume the existence of a Nash-equilibrium. When we are trying to minimize\nan f-divergence and the dimensionality of the generator distribution is misspecified, this might not be\nthe case [3]. Nonetheless, we found that our method works well in practice and we leave a closer\ntheoretical investigation of this fact to future research.\nIn practice, our method can potentially make formerly unstable stationary points of the gradient vector\nfield stable if the regularization parameter is chosen to be high. This may lead to poor solutions. We\nalso found that our method becomes less stable for deeper architectures, which we attribute to the fact\nthat the gradients can have very different scales in such architectures, so that the simple L2-penalty\nfrom Section 4 needs to be rescaled accordingly.\nOur method can be regarded as an approximation to the implicit Euler method for integrating the\ngradient vector field. It can be shown that the implicit Euler method has appealing stability properties\n[7] that can be translated into convergence theorems for local Nash-equilibria. However, the implicit\nEuler method requires the solution of a nonlinear equation in each iteration. 
Nonetheless, we believe\nthat further progress can be made by \ufb01nding better approximations to the implicit Euler method.\nAn alternative interpretation is to view our method as a second order method. We hence believe that\nfurther progress can be made by revisiting second order optimization methods [2, 18] in the context\nof saddle point problems.\n\n7 Related Work\n\nSaddle point problems do not only arise in the context of training GANs. For example, the popular\nactor-critic models [20] in reinforcement learning are also special cases of saddle-point problems.\nFinding a stable algorithm for training GANs is a long standing problem and multiple solutions have\nbeen proposed. Unrolled GANs [15] unroll the optimization with respect to the critic, thereby giving\nthe generator more informative gradients. Though unrolling the optimization was shown to stabilize\ntraining, it can be cumbersome to implement and in addition it also results in a big model. As was\nrecently shown, the stability of GAN-training can be improved by using objectives derived from\nthe Wasserstein-1-distance (induced by the Kantorovich-Rubinstein-norm) instead of f-divergences\n[4, 11]. While Wasserstein-GANs often provide a good solution for the stable training of GANs, they\nrequire keeping the critic optimal, which can be time-consuming and can in practice only be achieved\napproximately, thus violating the conditions for theoretical guarantees. Moreover, some methods\nlike Adversarial Variational Bayes [14] explicitly prescribe the divergence measure to be used, thus\nmaking it impossible to apply Wasserstein-GANs. 
Other approaches to stabilizing training\ndesign an easy-to-optimize architecture [23, 21] or make use of additional labels [23, 17].\nIn contrast to all the approaches described above, our work focuses on stabilizing training on a wide\nrange of architectures and divergence functions.\n\n8 Conclusion\n\nIn this work, starting from GAN objective functions we analyzed the general difficulties of finding\nlocal Nash-equilibria in smooth two-player games. We pinpointed the major numerical difficulties that\narise in the current state-of-the-art algorithms and, using our insights, we presented a new algorithm\nfor training generative adversarial networks. Our novel algorithm has favorable properties in theory\nand practice: from the theoretical viewpoint, we showed that it is locally convergent to a\nNash-equilibrium even if the eigenvalues of the Jacobian are problematic. This is particularly interesting\nfor games that arise in the context of GANs where such problems are common. From the practical\nviewpoint, our algorithm can be used in combination with any GAN-architecture whose objective can\nbe formulated as a two-player game to stabilize the training. We demonstrated experimentally that\nour algorithm stabilizes the training and successfully combats training issues like mode collapse. We\nbelieve our work is a first step towards an understanding of the numerics of GAN training and more\ngeneral deep learning objective functions.\n\nAcknowledgements\n\nThis work was supported by Microsoft Research through its PhD Scholarship Programme.\n\nReferences\n[1] Mart\u00edn Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro,\nGreg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale\nmachine learning on heterogeneous distributed systems. CoRR, abs/1603.04467, 2016.\n\n[2] Shun-ichi Amari. Natural gradient works efficiently in learning. 
Neural Computation, 10(2):251\u2013276, 1998.\n\n[3] Mart\u00edn Arjovsky and L\u00e9on Bottou. Towards principled methods for training generative adversarial networks. CoRR, abs/1701.04862, 2017.\n\n[4] Mart\u00edn Arjovsky, Soumith Chintala, and L\u00e9on Bottou. Wasserstein GAN. CoRR, abs/1701.07875, 2017.\n\n[5] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (gans). In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 224\u2013232, 2017.\n\n[6] Dimitri P Bertsekas. Constrained optimization and Lagrange multiplier methods. Academic press, 2014.\n\n[7] John Charles Butcher. Numerical methods for ordinary differential equations. John Wiley & Sons, 2016.\n\n[8] Jeff Donahue, Philipp Kr\u00e4henb\u00fchl, and Trevor Darrell. Adversarial feature learning. CoRR, abs/1605.09782, 2016.\n\n[9] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Mart\u00edn Arjovsky, Olivier Mastropietro, and Aaron C. Courville. Adversarially learned inference. CoRR, abs/1606.00704, 2016.\n\n[10] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 2672\u20132680, 2014.\n\n[11] Ishaan Gulrajani, Faruk Ahmed, Mart\u00edn Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of wasserstein gans. CoRR, abs/1704.00028, 2017.\n\n[12] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. CoRR, abs/1611.07004, 2016.\n\n[13] Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew P. 
Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-realistic single image super-resolution using a generative adversarial network. CoRR, abs/1609.04802, 2016.\n\n[14] Lars M. Mescheder, Sebastian Nowozin, and Andreas Geiger. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 2391\u20132400, 2017.\n\n[15] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. CoRR, abs/1611.02163, 2016.\n\n[16] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 271\u2013279, 2016.\n\n[17] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 2642\u20132651, 2017.\n\n[18] Razvan Pascanu and Yoshua Bengio. Natural gradient revisited. CoRR, abs/1301.3584, 2013.\n\n[19] Adam Paszke and Soumith Chintala. PyTorch, 2017.\n\n[20] David Pfau and Oriol Vinyals. Connecting generative adversarial networks and actor-critic methods. CoRR, abs/1610.01945, 2016.\n\n[21] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015.\n\n[22] Lillian J. Ratliff, Samuel Burden, and S. Shankar Sastry. Characterization and computation of local Nash equilibria in continuous games. 
In 51st Annual Allerton Conference on Communication, Control, and Computing, Allerton 2013, Allerton Park & Retreat Center, Monticello, IL, USA, October 2-4, 2013, pages 917\u2013924, 2013.\n\n[23] Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 2226\u20132234, 2016.\n\n[24] Casper Kaae S\u00f8nderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Husz\u00e1r. Amortised MAP inference for image super-resolution. CoRR, abs/1610.04490, 2016.\n\n[25] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude, 2012.\n\n[26] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. CoRR, abs/1702.05464, 2017.\n\n[27] Raymond Yeh, Chen Chen, Teck-Yian Lim, Mark Hasegawa-Johnson, and Minh N. Do. Semantic image inpainting with perceptual and contextual losses. CoRR, abs/1607.07539, 2016.\n", "award": [], "sourceid": 1144, "authors": [{"given_name": "Lars", "family_name": "Mescheder", "institution": "Max-Planck Institute Tuebingen"}, {"given_name": "Sebastian", "family_name": "Nowozin", "institution": "Microsoft Research Cambridge"}, {"given_name": "Andreas", "family_name": "Geiger", "institution": "MPI T\u00fcbingen"}]}