{"title": "Learning with a Wasserstein Loss", "book": "Advances in Neural Information Processing Systems", "page_first": 2053, "page_last": 2061, "abstract": "Learning to predict multi-label outputs is challenging, but in many problems there is a natural metric on the outputs that can be used to improve predictions. In this paper we develop a loss function for multi-label learning, based on the Wasserstein distance. The Wasserstein distance provides a natural notion of dissimilarity for probability measures. Although optimizing with respect to the exact Wasserstein distance is costly, recent work has described a regularized approximation that is efficiently computed. We describe an efficient learning algorithm based on this regularization, as well as a novel extension of the Wasserstein distance from probability measures to unnormalized measures. We also describe a statistical learning bound for the loss. The Wasserstein loss can encourage smoothness of the predictions with respect to a chosen metric on the output space. We demonstrate this property on a real-data tag prediction problem, using the Yahoo Flickr Creative Commons dataset, outperforming a baseline that doesn't use the metric.", "full_text": "Learning with a Wasserstein Loss\n\nCharlie Frogner\u21e4 Chiyuan Zhang\u21e4\nCenter for Brains, Minds and Machines\nMassachusetts Institute of Technology\n\nfrogner@mit.edu, chiyuan@mit.edu\n\nMauricio Araya-Polo\n\nShell International E & P, Inc.\n\nMauricio.Araya@shell.com\n\nHossein Mobahi\n\nCSAIL\n\nMassachusetts Institute of Technology\n\nhmobahi@csail.mit.edu\n\nTomaso Poggio\n\nCenter for Brains, Minds and Machines\nMassachusetts Institute of Technology\n\ntp@ai.mit.edu\n\nAbstract\n\nLearning to predict multi-label outputs is challenging, but in many problems there\nis a natural metric on the outputs that can be used to improve predictions. In this\npaper we develop a loss function for multi-label learning, based on the Wasserstein\ndistance. 
The Wasserstein distance provides a natural notion of dissimilarity for probability measures. Although optimizing with respect to the exact Wasserstein distance is costly, recent work has described a regularized approximation that is efficiently computed. We describe an efficient learning algorithm based on this regularization, as well as a novel extension of the Wasserstein distance from probability measures to unnormalized measures. We also describe a statistical learning bound for the loss. The Wasserstein loss can encourage smoothness of the predictions with respect to a chosen metric on the output space. We demonstrate this property on a real-data tag prediction problem, using the Yahoo Flickr Creative Commons dataset, outperforming a baseline that doesn\u2019t use the metric.\n\n1 Introduction\n\nWe consider the problem of learning to predict a non-negative measure over a finite set. This problem includes many common machine learning scenarios. In multiclass classification, for example, one often predicts a vector of scores or probabilities for the classes. And in semantic segmentation [1], one can model the segmentation as being the support of a measure defined over the pixel locations. Many problems in which the output of the learning machine is both non-negative and multi-dimensional might be cast as predicting a measure.\nWe specifically focus on problems in which the output space has a natural metric or similarity structure, which is known (or estimated) a priori. In practice, many learning problems have such structure. In the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [2], for example, the output dimensions correspond to 1000 object categories that have inherent semantic relationships, some of which are captured in the WordNet hierarchy that accompanies the categories. 
Similarly, in the keyword spotting task from the IARPA Babel speech recognition project, the outputs correspond to keywords that likewise have semantic relationships. In what follows, we will call the similarity structure on the label space the ground metric or semantic similarity.\nUsing the ground metric, we can measure prediction performance in a way that is sensitive to relationships between the different output dimensions. For example, confusing dogs with cats might be a more severe error than confusing breeds of dogs. A loss function that incorporates this metric might encourage the learning algorithm to favor predictions that are, if not completely accurate, at least semantically similar to the ground truth.\n\n* Authors contributed equally.\n1 Code and data are available at http://cbcl.mit.edu/wasserstein.\n\nFigure 1: Semantically near-equivalent classes in ILSVRC (Siberian husky and Eskimo dog).\n\nFigure 2: The Wasserstein loss encourages predictions that are similar to ground truth, robustly to incorrect labeling of similar classes (see Appendix E.1). Shown is Euclidean distance between prediction and ground truth vs. (left) number of classes, averaged over different noise levels and (right) noise level, averaged over number of classes. Baseline is the multiclass logistic loss. [Plot panels: distance vs. grid size and distance vs. noise, each comparing Divergence and Wasserstein.]\n\nIn this paper, we develop a loss function for multi-label learning that measures the Wasserstein distance between a prediction and the target label, with respect to a chosen metric on the output space. 
The Wasserstein distance is defined as the cost of the optimal transport plan for moving the mass in the predicted measure to match that in the target, and has been applied to a wide range of problems, including barycenter estimation [3], label propagation [4], and clustering [5]. To our knowledge, this paper represents the first use of the Wasserstein distance as a loss for supervised learning.\nWe briefly describe a case in which the Wasserstein loss improves learning performance. The setting is a multiclass classification problem in which label noise arises from confusion of semantically near-equivalent categories. Figure 1 shows such a case from the ILSVRC, in which the categories Siberian husky and Eskimo dog are nearly indistinguishable. We synthesize a toy version of this problem by identifying categories with points in the Euclidean plane and randomly switching the training labels to nearby classes. The Wasserstein loss yields predictions that are closer to the ground truth, robustly across all noise levels, as shown in Figure 2. The standard multiclass logistic loss is the baseline for comparison. Section E.1 in the Appendix describes the experiment in more detail.\nThe main contributions of this paper are as follows. We formulate the problem of learning with prior knowledge of the ground metric, and propose the Wasserstein loss as an alternative to traditional information divergence-based loss functions. Specifically, we focus on empirical risk minimization (ERM) with the Wasserstein loss, and describe an efficient learning algorithm based on entropic regularization of the optimal transport problem. We also describe a novel extension to unnormalized measures that is similarly efficient to compute. We then justify ERM with the Wasserstein loss by showing a statistical learning bound. 
Finally, we evaluate the proposed loss on both synthetic examples and a real-world image annotation problem, demonstrating benefits of incorporating an output metric into the loss.\n\n2 Related work\n\nDecomposable loss functions like KL divergence and $\\ell_p$ distances are very popular for probabilistic [1] or vector-valued [6] predictions, as each component can be evaluated independently, often leading to simple and efficient algorithms. The idea of exploiting smoothness in the label space according to a prior metric has been explored in many different forms, including regularization [7] and post-processing with graphical models [8]. Optimal transport provides a natural distance for probability distributions over metric spaces. In [3, 9], optimal transport is used to formulate the Wasserstein barycenter as a probability distribution with minimum total Wasserstein distance to a set of given points on the probability simplex. [4] propagates histogram values on a graph by minimizing a Dirichlet energy induced by optimal transport. The Wasserstein distance is also used to formulate a metric for comparing clusters in [5], and is applied to image retrieval [10], contour matching [11], and many other problems [12, 13]. However, to our knowledge, this is the first time it is used as a loss function in a discriminative learning framework. The closest work to this paper is a theoretical study [14] of an estimator that minimizes the optimal transport cost between the empirical distribution and the estimated distribution in the setting of statistical parameter estimation.\n\n3 Learning with a Wasserstein loss\n\n3.1 Problem setup and notation\n\nWe consider the problem of learning a map from $\\mathcal{X} \\subset \\mathbb{R}^D$ into the space $\\mathcal{Y} = \\mathbb{R}_+^K$ of measures over a finite set $\\mathcal{K}$ of size $|\\mathcal{K}| = K$. Assume $\\mathcal{K}$ possesses a metric $d_{\\mathcal{K}}(\\cdot,\\cdot)$, which is called the ground metric. 
$d_{\\mathcal{K}}$ measures semantic similarity between dimensions of the output, which correspond to the elements of $\\mathcal{K}$. We perform learning over a hypothesis space $\\mathcal{H}$ of predictors $h_\\theta : \\mathcal{X} \\to \\mathcal{Y}$, parameterized by $\\theta \\in \\Theta$. These might be linear logistic regression models, for example.\nIn the standard statistical learning setting, we get an i.i.d. sequence of training examples $S = ((x_1, y_1), \\ldots, (x_N, y_N))$, sampled from an unknown joint distribution $P_{\\mathcal{X} \\times \\mathcal{Y}}$. Given a measure of performance (a.k.a. risk) $\\mathcal{E}(\\cdot,\\cdot)$, the goal is to find the predictor $h_\\theta \\in \\mathcal{H}$ that minimizes the expected risk $\\mathbb{E}[\\mathcal{E}(h_\\theta(x), y)]$. Typically $\\mathcal{E}(\\cdot,\\cdot)$ is difficult to optimize directly and the joint distribution $P_{\\mathcal{X} \\times \\mathcal{Y}}$ is unknown, so learning is performed via empirical risk minimization. Specifically, we solve\n\n$$\\min_{h_\\theta \\in \\mathcal{H}} \\left\\{ \\hat{\\mathbb{E}}_S[\\ell(h_\\theta(x), y)] = \\frac{1}{N} \\sum_{i=1}^{N} \\ell(h_\\theta(x_i), y_i) \\right\\} \\qquad (1)$$\n\nwith a loss function $\\ell(\\cdot,\\cdot)$ acting as a surrogate of $\\mathcal{E}(\\cdot,\\cdot)$.\n\n3.2 Optimal transport and the exact Wasserstein loss\n\nInformation divergence-based loss functions are widely used in learning with probability-valued outputs. Along with other popular measures like Hellinger distance and $\\chi^2$ distance, these divergences treat the output dimensions independently, ignoring any metric structure on $\\mathcal{K}$.\nGiven a cost function $c : \\mathcal{K} \\times \\mathcal{K} \\to \\mathbb{R}$, the optimal transport distance [15] measures the cheapest way to transport the mass in probability measure $\\mu_1$ to match that in $\\mu_2$:\n\n$$W_c(\\mu_1, \\mu_2) = \\inf_{\\gamma \\in \\Pi(\\mu_1, \\mu_2)} \\int_{\\mathcal{K} \\times \\mathcal{K}} c(\\kappa_1, \\kappa_2) \\, \\gamma(d\\kappa_1, d\\kappa_2) \\qquad (2)$$\n\nwhere $\\Pi(\\mu_1, \\mu_2)$ is the set of joint probability measures on $\\mathcal{K} \\times \\mathcal{K}$ having $\\mu_1$ and $\\mu_2$ as marginals. An important case is that in which the cost is given by a metric $d_{\\mathcal{K}}(\\cdot,\\cdot)$ or its $p$-th power $d_{\\mathcal{K}}^p(\\cdot,\\cdot)$ with $p \\ge 1$. 
In this case, (2) is called a Wasserstein distance [16], also known as the earth mover\u2019s distance [10]. In this paper, we only work with discrete measures. In the case of probability measures, these are histograms in the simplex $\\Delta^{\\mathcal{K}}$. When the ground truth $y$ and the output of $h$ both lie in the simplex $\\Delta^{\\mathcal{K}}$, we can define a Wasserstein loss.\nDefinition 3.1 (Exact Wasserstein Loss). For any $h_\\theta \\in \\mathcal{H}$, $h_\\theta : \\mathcal{X} \\to \\Delta^{\\mathcal{K}}$, let $h_\\theta(\\kappa|x) = h_\\theta(x)_\\kappa$ be the predicted value at element $\\kappa \\in \\mathcal{K}$, given input $x \\in \\mathcal{X}$. Let $y(\\kappa)$ be the ground truth value for $\\kappa$ given by the corresponding label $y$. Then we define the exact Wasserstein loss as\n\n$$W_p^p(h(\\cdot|x), y(\\cdot)) = \\inf_{T \\in \\Pi(h(x), y)} \\langle T, M \\rangle \\qquad (3)$$\n\nwhere $M \\in \\mathbb{R}_+^{K \\times K}$ is the distance matrix $M_{\\kappa,\\kappa'} = d_{\\mathcal{K}}^p(\\kappa, \\kappa')$, and the set of valid transport plans is\n\n$$\\Pi(h(x), y) = \\{T \\in \\mathbb{R}_+^{K \\times K} : T\\mathbf{1} = h(x),\\ T^\\top \\mathbf{1} = y\\} \\qquad (4)$$\n\nwhere $\\mathbf{1}$ is the all-one vector.\n$W_p^p$ is the cost of the optimal plan for transporting the predicted mass distribution $h(x)$ to match the target distribution $y$. The penalty increases as more mass is transported over longer distances, according to the ground metric $M$.\n\nAlgorithm 1 Gradient of the Wasserstein loss\n\nGiven $h(x)$, $y$, $\\lambda$, $\\mathbf{K}$. ($\\gamma_a$, $\\gamma_b$ if $h(x)$, $y$ unnormalized.)\n$u \\leftarrow \\mathbf{1}$\nwhile $u$ has not converged do\n    if $h(x)$, $y$ normalized: $u \\leftarrow h(x) \\oslash \\left(\\mathbf{K}\\left(y \\oslash \\mathbf{K}^\\top u\\right)\\right)$\n    if $h(x)$, $y$ unnormalized: $u \\leftarrow h(x)^{\\frac{\\gamma_a\\lambda}{\\gamma_a\\lambda+1}} \\oslash \\left(\\mathbf{K}\\left(y \\oslash \\mathbf{K}^\\top u\\right)^{\\frac{\\gamma_b\\lambda}{\\gamma_b\\lambda+1}}\\right)^{\\frac{\\gamma_a\\lambda}{\\gamma_a\\lambda+1}}$\nend while\nIf $h(x)$, $y$ unnormalized: $v \\leftarrow y^{\\frac{\\gamma_b\\lambda}{\\gamma_b\\lambda+1}} \\oslash \\left(\\mathbf{K}^\\top u\\right)^{\\frac{\\gamma_b\\lambda}{\\gamma_b\\lambda+1}}$\nif $h(x)$, $y$ normalized: $\\partial W_p^p / \\partial h(x) \\leftarrow \\frac{\\log u}{\\lambda} - \\frac{\\log u^\\top \\mathbf{1}}{\\lambda K} \\mathbf{1}$\nif $h(x)$, $y$ unnormalized: $\\partial W_p^p / \\partial h(x) \\leftarrow \\gamma_a \\left(\\mathbf{1} - (\\mathrm{diag}(u) \\mathbf{K} v) \\oslash h(x)\\right)$\n\n4 Efficient optimization via entropic regularization\n\nTo do learning, we optimize the empirical risk minimization functional (1) by gradient descent. Doing so requires evaluating a descent direction for the loss, with respect to the predictions $h(x)$. Unfortunately, computing a subgradient of the exact Wasserstein loss (3) is quite costly, as follows.\nThe exact Wasserstein loss (3) is a linear program and a subgradient of its solution can be computed using Lagrange duality. The dual LP of (3) is\n\n$${}^{d}W_p^p(h(x), y) = \\sup_{\\alpha,\\beta \\in C_M} \\alpha^\\top h(x) + \\beta^\\top y, \\quad C_M = \\{(\\alpha, \\beta) \\in \\mathbb{R}^{K \\times K} : \\alpha_\\kappa + \\beta_{\\kappa'} \\le M_{\\kappa,\\kappa'}\\}. \\qquad (5)$$\n\nAs (3) is a linear program, at an optimum the values of the dual and the primal are equal (see, e.g. [17]), hence the dual optimal $\\alpha$ is a subgradient of the loss with respect to its first argument.\nComputing $\\alpha$ is costly, as it entails solving a linear program with $O(K^2)$ constraints, with $K$ being the dimension of the output space. This cost can be prohibitive when optimizing by gradient descent.\n\n4.1 Entropic regularization of optimal transport\n\nCuturi [18] proposes a smoothed transport objective that enables efficient approximation of both the transport matrix in (3) and the subgradient of the loss. 
[18] introduces an entropic regularization term that results in a strictly convex problem:\n\n$${}^{\\lambda}W_p^p(h(\\cdot|x), y(\\cdot)) = \\inf_{T \\in \\Pi(h(x), y)} \\langle T, M \\rangle - \\frac{1}{\\lambda} H(T), \\quad H(T) = -\\sum_{\\kappa,\\kappa'} T_{\\kappa,\\kappa'} \\log T_{\\kappa,\\kappa'}. \\qquad (6)$$\n\nImportantly, the transport matrix that solves (6) is a diagonal scaling of a matrix $\\mathbf{K} = e^{-\\lambda M - 1}$:\n\n$$T^* = \\mathrm{diag}(u) \\mathbf{K} \\mathrm{diag}(v) \\qquad (7)$$\n\nfor $u = e^{\\lambda\\alpha}$ and $v = e^{\\lambda\\beta}$, where $\\alpha$ and $\\beta$ are the Lagrange dual variables for (6).\nIdentifying such a matrix subject to equality constraints on the row and column sums is exactly a matrix balancing problem, which is well-studied in numerical linear algebra and for which efficient iterative algorithms exist [19]. [18] and [3] use the well-known Sinkhorn-Knopp algorithm.\n\n4.2 Extending smoothed transport to the learning setting\n\nWhen the output vectors $h(x)$ and $y$ lie in the simplex, (6) can be used directly in place of (3), as (6) can approximate the exact Wasserstein distance closely for large enough $\\lambda$ [18]. In this case, the gradient $\\alpha$ of the objective can be obtained from the optimal scaling vector $u$ as $\\alpha = \\frac{\\log u}{\\lambda} - \\frac{\\log u^\\top \\mathbf{1}}{\\lambda K} \\mathbf{1}$. 1 A Sinkhorn iteration for the gradient is given in Algorithm 1.\n\n1 Note that $\\alpha$ is only defined up to a constant shift: any upscaling of the vector $u$ can be paired with a corresponding downscaling of the vector $v$ (and vice versa) without altering the matrix $T^*$. The choice $\\alpha = \\frac{\\log u}{\\lambda} - \\frac{\\log u^\\top \\mathbf{1}}{\\lambda K} \\mathbf{1}$ ensures that $\\alpha$ is tangent to the simplex.\n\nFigure 3: The relaxed transport problem (8) for unnormalized measures. (a) Convergence to smoothed transport. (b) Approximation of exact Wasserstein. (c) Convergence of alternating projections ($\\lambda = 50$).\n\nFor many learning problems, however, a normalized output assumption is unnatural. In image segmentation, for example, the target shape is not naturally represented as a histogram. 
And even when the prediction and the ground truth are constrained to the simplex, the observed label can be subject to noise that violates the constraint.\nThere is more than one way to generalize optimal transport to unnormalized measures, and this is a subject of active study [20]. We will develop here a novel objective that deals effectively with the difference in total mass between $h(x)$ and $y$ while still being efficient to optimize.\n\n4.3 Relaxed transport\n\nWe propose a novel relaxation that extends smoothed transport to unnormalized measures. By replacing the equality constraints on the transport marginals in (6) with soft penalties with respect to KL divergence, we get an unconstrained approximate transport problem. The resulting objective is:\n\n$${}^{\\lambda,\\gamma_a,\\gamma_b}W_{\\mathrm{KL}}(h(\\cdot|x), y(\\cdot)) = \\min_{T \\in \\mathbb{R}_+^{K \\times K}} \\langle T, M \\rangle - \\frac{1}{\\lambda} H(T) + \\gamma_a \\widetilde{\\mathrm{KL}}(T\\mathbf{1} \\| h(x)) + \\gamma_b \\widetilde{\\mathrm{KL}}(T^\\top \\mathbf{1} \\| y) \\qquad (8)$$\n\nwhere $\\widetilde{\\mathrm{KL}}(w \\| z) = w^\\top \\log(w \\oslash z) - \\mathbf{1}^\\top w + \\mathbf{1}^\\top z$ is the generalized KL divergence between $w, z \\in \\mathbb{R}_+^K$. Here $\\oslash$ represents element-wise division. As with the previous formulation, the optimal transport matrix with respect to (8) is a diagonal scaling of the matrix $\\mathbf{K}$.\nProposition 4.1. The transport matrix $T^*$ optimizing (8) satisfies $T^* = \\mathrm{diag}(u) \\mathbf{K} \\mathrm{diag}(v)$, where $u = (h(x) \\oslash T^*\\mathbf{1})^{\\gamma_a\\lambda}$, $v = (y \\oslash (T^*)^\\top \\mathbf{1})^{\\gamma_b\\lambda}$, and $\\mathbf{K} = e^{-\\lambda M - 1}$.\nAnd the optimal transport matrix is a fixed point for a Sinkhorn-like iteration. 2\nProposition 4.2. $T^* = \\mathrm{diag}(u) \\mathbf{K} \\mathrm{diag}(v)$ optimizing (8) satisfies: i) $u = h(x)^{\\frac{\\gamma_a\\lambda}{\\gamma_a\\lambda+1}} \\odot (\\mathbf{K}v)^{-\\frac{\\gamma_a\\lambda}{\\gamma_a\\lambda+1}}$, and ii) $v = y^{\\frac{\\gamma_b\\lambda}{\\gamma_b\\lambda+1}} \\odot (\\mathbf{K}^\\top u)^{-\\frac{\\gamma_b\\lambda}{\\gamma_b\\lambda+1}}$, where $\\odot$ represents element-wise multiplication.\nUnlike the previous formulation, (8) is unconstrained with respect to $h(x)$. The gradient is given by $\\nabla_{h(x)} W_{\\mathrm{KL}}(h(\\cdot|x), y(\\cdot)) = \\gamma_a (\\mathbf{1} - T^*\\mathbf{1} \\oslash h(x))$. The iteration is given in Algorithm 1.\nWhen restricted to normalized measures, the relaxed problem (8) approximates smoothed transport (6). Figure 3a shows, for normalized $h(x)$ and $y$, the relative distance between the values of (8) and (6) 3. For $\\lambda$ large enough, (8) converges to (6) as $\\gamma_a$ and $\\gamma_b$ increase.\n(8) also retains two properties of smoothed transport (6). Figure 3b shows that, for normalized outputs, the relaxed loss converges to the unregularized Wasserstein distance as $\\lambda$, $\\gamma_a$ and $\\gamma_b$ increase 4. And Figure 3c shows that convergence of the iterations in Proposition 4.2 is nearly independent of the dimension $K$ of the output space.\n\n2 Note that, although the iteration suggested by Proposition 4.2 is observed empirically to converge (see Figure 3c, for example), we have not proven a guarantee that it will do so.\n3 In figures 3a-c, $h(x)$, $y$ and $M$ are generated as described in [18] section 5. In 3a-b, $h(x)$ and $y$ have dimension 256. In 3c, convergence is defined as in [18]. Shaded regions are 95% intervals.\n4 The unregularized Wasserstein distance was computed using FastEMD [21].\n\nFigure 4: MNIST example. Each curve shows the predicted probability for one digit, for models trained with different $p$ values for the ground metric. (a) Posterior predictions for images of digit 0. (b) Posterior predictions for images of digit 4. [Axes: posterior probability vs. p-th norm.]\n\n5 Statistical Properties of the Wasserstein loss\n\nLet $S = ((x_1, y_1), \\ldots, (x_N, y_N))$ be i.i.d. 
samples and $h_{\\hat{\\theta}}$ be the empirical risk minimizer\n\n$$h_{\\hat{\\theta}} = \\operatorname{argmin}_{h_\\theta \\in \\mathcal{H}} \\left\\{ \\hat{\\mathbb{E}}_S\\left[W_p^p(h_\\theta(\\cdot|x), y)\\right] = \\frac{1}{N} \\sum_{i=1}^{N} W_p^p(h_\\theta(\\cdot|x_i), y_i) \\right\\}.$$\n\nFurther assume $\\mathcal{H} = s \\circ \\mathcal{H}^o$ is the composition of a softmax $s$ and a base hypothesis space $\\mathcal{H}^o$ of functions mapping into $\\mathbb{R}^K$. The softmax layer outputs a prediction that lies in the simplex $\\Delta^{\\mathcal{K}}$.\nTheorem 5.1. For $p = 1$, and any $\\delta > 0$, with probability at least $1 - \\delta$, it holds that\n\n$$\\mathbb{E}\\left[W_1^1(h_{\\hat{\\theta}}(\\cdot|x), y)\\right] \\le \\inf_{h_\\theta \\in \\mathcal{H}} \\mathbb{E}\\left[W_1^1(h_\\theta(\\cdot|x), y)\\right] + 32 K C_M \\mathfrak{R}_N(\\mathcal{H}^o) + 2 C_M \\sqrt{\\frac{\\log(1/\\delta)}{2N}} \\qquad (9)$$\n\nwith the constant $C_M = \\max_{\\kappa,\\kappa'} M_{\\kappa,\\kappa'}$. $\\mathfrak{R}_N(\\mathcal{H}^o)$ is the Rademacher complexity [22] measuring the complexity of the hypothesis space $\\mathcal{H}^o$.\nThe Rademacher complexity $\\mathfrak{R}_N(\\mathcal{H}^o)$ for commonly used models like neural networks and kernel machines [22] decays with the training set size. This theorem guarantees that the expected Wasserstein loss of the empirical risk minimizer approaches the best achievable loss for $\\mathcal{H}$.\nAs an important special case, minimizing the empirical risk with Wasserstein loss is also good for multiclass classification. Let $y = \\mathbb{1}_\\kappa$ be the \u201cone-hot\u201d encoded label vector for the groundtruth class.\nProposition 5.2. 
In the multiclass classification setting, for $p = 1$ and any $\\delta > 0$, with probability at least $1 - \\delta$, it holds that\n\n$$\\mathbb{E}_{x,\\kappa}\\left[d_{\\mathcal{K}}(\\kappa_{\\hat{\\theta}}(x), \\kappa)\\right] \\le \\inf_{h_\\theta \\in \\mathcal{H}} K \\mathbb{E}\\left[W_1^1(h_\\theta(x), y)\\right] + 32 K^2 C_M \\mathfrak{R}_N(\\mathcal{H}^o) + 2 C_M K \\sqrt{\\frac{\\log(1/\\delta)}{2N}} \\qquad (10)$$\n\nwhere the predictor is $\\kappa_{\\hat{\\theta}}(x) = \\operatorname{argmax}_\\kappa h_{\\hat{\\theta}}(\\kappa|x)$, with $h_{\\hat{\\theta}}$ being the empirical risk minimizer.\nNote that instead of the classification error $\\mathbb{E}_{x,\\kappa}[\\mathbb{1}\\{\\kappa_{\\hat{\\theta}}(x) \\ne \\kappa\\}]$, we actually get a bound on the expected semantic distance between the prediction and the groundtruth.\n\n6 Empirical study\n\n6.1 Impact of the ground metric\n\nIn this section, we show that the Wasserstein loss encourages smoothness with respect to an artificial metric on the MNIST handwritten digit dataset. This is a multi-class classification problem with output dimensions corresponding to the 10 digits, and we apply a ground metric $d_p(\\kappa, \\kappa') = |\\kappa - \\kappa'|^p$, where $\\kappa, \\kappa' \\in \\{0, \\ldots, 9\\}$ and $p \\in [0, \\infty)$. This metric encourages the recognized digit to be numerically close to the true one. We train a model independently for each value of $p$ and plot the average predicted probabilities of the different digits on the test set in Figure 4.\nNote that as $p \\to 0$, the metric approaches the 0-1 metric $d_0(\\kappa, \\kappa') = \\mathbb{1}_{\\kappa \\ne \\kappa'}$, which treats all incorrect digits as being equally unfavorable. In this case, as can be seen in the figure, the predicted probability of the true digit goes to 1 while the probability for all other digits goes to 0. As $p$ increases, the predictions become more evenly distributed over the neighboring digits, converging to a uniform distribution as $p \\to \\infty$ 5.\n\nFigure 5: Top-K cost comparison of the proposed loss (Wasserstein) and the baseline (Divergence). (a) Original Flickr tags dataset. (b) Reduced-redundancy Flickr tags dataset. [Axes: top-K cost vs. K (# of proposed tags); curves: Divergence and Wasserstein with $\\alpha$ = 0.5, 0.3, 0.1.]\n\n6.2 Flickr tag prediction\n\nWe apply the Wasserstein loss to a real world multi-label learning problem, using the recently released Yahoo/Flickr Creative Commons 100M dataset [23]. 6 Our goal is tag prediction: we select 1000 descriptive tags along with two random sets of 10,000 images each, associated with these tags, for training and testing. We derive a distance metric between tags by using word2vec [24] to embed the tags as unit vectors, then taking their Euclidean distances. To extract image features we use MatConvNet [25]. Note that the set of tags is highly redundant and often many semantically equivalent or similar tags can apply to an image. The images are also partially tagged, as different users may prefer different tags. We therefore measure the prediction performance by the top-K cost, defined as $C_K = \\frac{1}{K} \\sum_{k=1}^{K} \\min_j d_{\\mathcal{K}}(\\hat{\\kappa}_k, \\kappa_j)$, where $\\{\\kappa_j\\}$ is the set of groundtruth tags, and $\\{\\hat{\\kappa}_k\\}$ are the tags with highest predicted probability. The standard AUC measure is also reported.\nWe find that a linear combination of the Wasserstein loss $W_p^p$ and the standard multiclass logistic loss KL yields the best prediction results. Specifically, we train a linear model by minimizing $W_p^p + \\alpha \\mathrm{KL}$ on the training set, where $\\alpha$ controls the relative weight of KL. Note that KL taken alone is our baseline in these experiments. Figure 5a shows the top-K cost on the test set for the combined loss and the baseline KL loss. 
We additionally create a second dataset by removing redundant labels from the original dataset: this simulates the potentially more difficult case in which a single user tags each image, by selecting one tag to apply from amongst each cluster of applicable, semantically similar tags. Figure 5b shows that performance for both algorithms decreases on the harder dataset, while the combined Wasserstein loss continues to outperform the baseline.\nIn Figure 6, we show the effect on performance of varying the weight $\\alpha$ on the KL loss. We observe that the optimum of the top-K cost is achieved when the Wasserstein loss is weighted more heavily than at the optimum of the AUC. This is consistent with a semantic smoothing effect of Wasserstein, which during training will favor mispredictions that are semantically similar to the ground truth, sometimes at the cost of lower AUC 7. We finally show two selected images from the test set in Figure 7. These illustrate cases in which both algorithms make predictions that are semantically relevant, despite overlapping very little with the ground truth. The image on the left shows errors made by both algorithms. More examples can be found in the appendix.\n\n5 To avoid numerical issues, we scale down the ground metric such that all of the distance values are in the interval [0, 1).\n6 The dataset used here is available at http://cbcl.mit.edu/wasserstein.\n7 The Wasserstein loss can achieve a similar trade-off by choosing the metric parameter $p$, as discussed in Section 6.1. 
However, the relationship between $p$ and the smoothing behavior is complex and it can be simpler to implement the trade-off by combining with the KL loss.\n\nFigure 6: Trade-off between semantic smoothness and maximum likelihood. (a) Original Flickr tags dataset. (b) Reduced-redundancy Flickr tags dataset. [Axes: top-K cost (K = 1, 2, 3, 4) and AUC (Wasserstein AUC, Divergence AUC) vs. $\\alpha$.]\n\nFigure 7: Examples of images in the Flickr dataset. We show the groundtruth tags as well as tags proposed by our algorithm and the baseline. (a) Flickr user tags: street, parade, dragon; our proposals: people, protest, parade; baseline proposals: music, car, band. (b) Flickr user tags: water, boat, reflection, sunshine; our proposals: water, river, lake, summer; baseline proposals: river, water, club, nature.\n\n7 Conclusions and future work\n\nIn this paper we have described a loss function for learning to predict a non-negative measure over a finite set, based on the Wasserstein distance. Although optimizing with respect to the exact Wasserstein loss is computationally costly, an approximation based on entropic regularization is efficiently computed. We described a learning algorithm based on this regularization and we proposed a novel extension of the regularized loss to unnormalized measures that preserves its efficiency. We also described a statistical learning bound for the loss. 
The Wasserstein loss can encourage smoothness of the predictions with respect to a chosen metric on the output space, and we demonstrated this property on a real-data tag prediction problem, showing improved performance over a baseline that doesn\u2019t incorporate the metric.\nAn interesting direction for future work may be to explore the connection between the Wasserstein loss and Markov random fields, as the latter are often used to encourage smoothness of predictions, via inference at prediction time.\n\nReferences\n[1] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. CVPR (to appear), 2015.\n[2] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 2015.\n[3] Marco Cuturi and Arnaud Doucet. Fast Computation of Wasserstein Barycenters. ICML, 2014.\n[4] Justin Solomon, Raif M Rustamov, Leonidas J Guibas, and Adrian Butscher. Wasserstein Propagation for Semi-Supervised Learning. In ICML, pages 306\u2013314, 2014.\n[5] Michael H Coen, M Hidayath Ansari, and Nathanael Fillmore. Comparing Clusterings in Space. ICML, pages 231\u2013238, 2010.\n[6] Mauricio A. Alvarez, Lorenzo Rosasco, and Neil D. Lawrence. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3):195\u2013266, 2011.\n[7] Leonid I Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1):259\u2013268, 1992.\n[8] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, 2015.\n[9] Marco Cuturi, Gabriel Peyr\u00e9, and Antoine Rolet. 
A Smoothed Dual Approach for Variational Wasserstein Problems. arXiv.org, March 2015.\n[10] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. The earth mover\u2019s distance as a metric for image retrieval. IJCV, 40(2):99\u2013121, 2000.\n[11] Kristen Grauman and Trevor Darrell. Fast contour matching using approximate earth mover\u2019s distance. In CVPR, 2004.\n[12] S Shirdhonkar and D W Jacobs. Approximate earth mover\u2019s distance in linear time. In CVPR, 2008.\n[13] Herbert Edelsbrunner and Dmitriy Morozov. Persistent homology: Theory and practice. In Proceedings of the European Congress of Mathematics, 2012.\n[14] Federico Bassetti, Antonella Bodini, and Eugenio Regazzini. On minimum kantorovich distance estimators. Stat. Probab. Lett., 76(12):1298\u20131302, 1 July 2006.\n[15] C\u00e9dric Villani. Optimal Transport: Old and New. Springer Berlin Heidelberg, 2008.\n[16] Vladimir I Bogachev and Aleksandr V Kolesnikov. The Monge-Kantorovich problem: achievements, connections, and perspectives. Russian Math. Surveys, 67(5):785, 10 2012.\n[17] Dimitris Bertsimas and John N. Tsitsiklis. Introduction to Linear Optimization. Athena Scientific, Boston, third printing edition, 1997.\n[18] Marco Cuturi. Sinkhorn Distances: Lightspeed Computation of Optimal Transport. NIPS, 2013.\n[19] Philip A Knight and Daniel Ruiz. A fast algorithm for matrix balancing. IMA Journal of Numerical Analysis, 33(3):drs019\u20131047, October 2012.\n[20] Lenaic Chizat, Gabriel Peyr\u00e9, Bernhard Schmitzer, and Fran\u00e7ois-Xavier Vialard. Unbalanced Optimal Transport: Geometry and Kantorovich Formulation. arXiv.org, August 2015.\n[21] Ofir Pele and Michael Werman. Fast and robust Earth Mover\u2019s Distances. ICCV, pages 460\u2013467, 2009.\n[22] Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. 
JMLR, 3:463\u2013482, March 2003.\n\n[23] Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland,\nDamian Borth, and Li-Jia Li. The new data and new challenges in multimedia research. arXiv preprint\narXiv:1503.01817, 2015.\n\n[24] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of\n\nwords and phrases and their compositionality. In NIPS, 2013.\n\n[25] A. Vedaldi and K. Lenc. MatConvNet \u2013 Convolutional Neural Networks for MATLAB. CoRR,\n\nabs/1412.4564, 2014.\n\n[26] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Classics in\n\nMathematics. Springer Berlin Heidelberg, 2011.\n\n[27] Clark R. Givens and Rae Michael Shortt. A class of wasserstein metrics for probability distributions.\n\nMichigan Math. J., 31(2):231\u2013240, 1984.\n\n9\n\n\f", "award": [], "sourceid": 1237, "authors": [{"given_name": "Charlie", "family_name": "Frogner", "institution": "MIT"}, {"given_name": "Chiyuan", "family_name": "Zhang", "institution": "MIT"}, {"given_name": "Hossein", "family_name": "Mobahi", "institution": "MIT"}, {"given_name": "Mauricio", "family_name": "Araya", "institution": "Shell Intl. E&P Inc."}, {"given_name": "Tomaso", "family_name": "Poggio", "institution": "MIT"}]}