{"title": "Decentralize and Randomize: Faster Algorithm for Wasserstein Barycenters", "book": "Advances in Neural Information Processing Systems", "page_first": 10760, "page_last": 10770, "abstract": "We study the decentralized distributed computation of discrete approximations for the regularized Wasserstein barycenter of a finite set of continuous probability measures distributedly stored over a network. We assume there is a network of agents/machines/computers, and each agent holds a private continuous probability measure and seeks to compute the barycenter of all the measures in the network by getting samples from its local measure and exchanging information with its neighbors. Motivated by this problem, we develop, and analyze, a novel accelerated primal-dual stochastic gradient method for general stochastic convex optimization problems with linear equality constraints. Then, we apply this method to the decen- tralized distributed optimization setting to obtain a new algorithm for the distributed semi-discrete regularized Wasserstein barycenter problem. Moreover, we show explicit non-asymptotic complexity for the proposed algorithm. Finally, we show the effectiveness of our method on the distributed computation of the regularized Wasserstein barycenter of univariate Gaussian and von Mises distributions, as well as some applications to image aggregation.", "full_text": "Decentralize and Randomize: Faster Algorithm for\n\nWasserstein Barycenters\n\nPavel Dvurechensky, Darina Dvinskikh\n\nWeierstrass Institute for Applied Analysis and Stochastics,\n\nInstitute for Information Transmission Problems RAS\n\n{pavel.dvurechensky,darina.dvinskikh}@wias-berlin.de\n\nAlexander Gasnikov\n\nMoscow Institute of Physics and Technology,\n\nInstitute for Information Transmission Problems RAS\n\ngasnikov@yandex.ru\n\nC\u00e9sar A. 
Uribe\n\nMassachusetts Institute of Technology\n\ncauribe@mit.edu\n\nAngelia Nedi\u00b4c\n\nArizona State University,\n\nMoscow Institute of Physics and Technology\n\nangelia.nedich@asu.edu\n\nAbstract\n\nWe study the decentralized distributed computation of discrete approximations\nfor the regularized Wasserstein barycenter of a \ufb01nite set of continuous probability\nmeasures distributedly stored over a network. We assume there is a network of\nagents/machines/computers, and each agent holds a private continuous probability\nmeasure and seeks to compute the barycenter of all the measures in the network\nby getting samples from its local measure and exchanging information with its\nneighbors. Motivated by this problem, we develop, and analyze, a novel accelerated\nprimal-dual stochastic gradient method for general stochastic convex optimization\nproblems with linear equality constraints. Then, we apply this method to the decen-\ntralized distributed optimization setting to obtain a new algorithm for the distributed\nsemi-discrete regularized Wasserstein barycenter problem. Moreover, we show\nexplicit non-asymptotic complexity for the proposed algorithm. Finally, we show\nthe effectiveness of our method on the distributed computation of the regularized\nWasserstein barycenter of univariate Gaussian and von Mises distributions, as well\nas some applications to image aggregation.1\n\n1\n\nIntroduction\n\nOptimal transport (OT) [30, 25] has become increasingly popular in the machine learning and\noptimization community. Given a basis space (e.g., pixel grid) and a transportation cost function (e.g.,\nsquared Euclidean distance), the OT approach de\ufb01nes a distance between two objects (e.g., images),\nmodeled as two probability measures on the basis space, as the minimal cost of transportation of the\n\ufb01rst measure to the second. Besides images, these probability measures or histograms can model other\nreal-world objects like videos, texts, etc. 
The optimal transport distance leads to outstanding results\nin unsupervised learning [4, 7], semi-supervised learning [42], clustering [24], text classi\ufb01cation [27],\nas well as in image retrieval, clustering and classi\ufb01cation [38, 11, 39], statistics [20, 36], economics\n\n1The full version of this paper can be found in the supplementary material and is also available as [15].\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fand \ufb01nance [5], condensed matter physics [8], and other applications [26]. From the computational\npoint of view, the optimal transport distance (or Wasserstein distance) between two histograms of\nsize n requires solving a linear program, which typically requires O(n3 log n) arithmetic operations.\nAn alternative approach is based on entropic regularization of this linear program and application of\neither Sinkhorn\u2019s algorithm [11] or stochastic gradient descent [22], both requiring O(n2) arithmetic\noperations, which can be too costly in the large-scale context.\nGiven a set of objects, the optimal transport distance naturally de\ufb01nes their mean representative. For\nexample, the 2-Wasserstein barycenter [2] is an object minimizing the sum of squared 2-Wasserstein\ndistances to all objects in a set. Wasserstein barycenters capture the geometric structure of objects,\nsuch as images, better than the barycenter with respect to the Euclidean or other distances [12].\nIf the objects in the set are randomly sampled from some distribution, theoretical results such\nas central limit theorem [14] or con\ufb01dence set construction [20] have been proposed, providing\nthe basis for the practical use of Wasserstein barycenter. However, calculating the Wasserstein\nbarycenter of m measures includes repeated computation of m Wasserstein distances. 
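For concreteness, the entropic-regularization route mentioned above can be illustrated with plain Sinkhorn iterations in the discrete-discrete case. This is a minimal sketch (not the implementation used in this paper); each iteration costs O(n^2) matrix-vector work, matching the per-iteration cost quoted in the text:

```python
import numpy as np

def sinkhorn(a, b, C, gamma=0.1, n_iter=500):
    """Entropy-regularized OT between histograms a, b with cost matrix C.

    Alternately rescales rows and columns of the Gibbs kernel so that the
    transport plan matches the two marginals; each iteration is O(n^2).
    """
    K = np.exp(-C / gamma)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)           # scale columns toward marginal b
        u = a / (K @ v)             # scale rows toward marginal a
    return u[:, None] * K * v[None, :]

# Toy example: two histograms on a 1-D grid with squared-Euclidean cost.
x = np.linspace(0, 1, 50)
C = (x[:, None] - x[None, :]) ** 2
a = np.ones(50) / 50
b = np.exp(-(x - 0.7) ** 2 / 0.01); b /= b.sum()
plan = sinkhorn(a, b, C, gamma=0.1)
```

After the final row rescaling, the row marginal is matched exactly and the column marginal approximately; the regularization parameter gamma trades off blur in the plan against the speed and stability of the iterations.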
The entropic\nregularization approach was extended for this case in [6], with the proposed algorithm having a\nO(mn2) complexity, which can be very large if m and n are large. Moreover, in the large-scale\nsetup, storage and processing of transportation plans, required to calculate Wasserstein distances,\ncan be intractable for local computation. On the other hand, recent studies [34, 40, 37, 46, 31] on\naccelerated distributed convex optimization algorithms demonstrated their ef\ufb01ciency for convex\noptimization problems over arbitrary networks with inherently distributed data, i.e., the data is\nproduced by a distributed network of sensors [35, 33, 32] or the transmission of information is limited\nby communication or privacy constraints, i.e., only limited amount of information can be shared\nacross the network.\nMotivated by the limited communication issue and the computational complexity of the Wasserstein\nbarycenter problem for large amounts of data stored in a network of computers, we use the entropy\nregularization of the Wasserstein distance and propose a decentralized algorithm to calculate an\napproximation to the Wasserstein barycenter of a set of probability measures. We solve the problem\nin a distributed manner on a connected and undirected network of agents oblivious to the network\ntopology. Each agent locally holds a possibly continuous probability distribution, can sample from\nit, and seeks to cooperatively compute the barycenter of all probability measures exchanging the\ninformation with its neighbors. We consider the semi-discrete case, which means that we \ufb01x the\ndiscrete support for the barycenter and calculate a discrete approximation for the barycenter.\nRelated work. Unlike [44], we propose a decentralized distributed algorithm for the computation of\nthe regularized Wasserstein barycenter of a set of continuous measures. 
Working with continuous distributions requires the application of stochastic procedures such as the stochastic gradient method in [22], where it is applied to the regularized Wasserstein distance, but not to the Wasserstein barycenter. This idea was extended to the case of the non-regularized barycenter in [43, 10], where parallel algorithms were developed. The critical difference between the parallel and the decentralized setting is that, in the former, the topology of the computational network is fixed to be a star topology and is known in advance by all the machines, forming a master/slave architecture. We seek to further scale up the barycenter computation to a huge number of input measures using arbitrary network topologies. Moreover, unlike [43], we use entropic regularization to take advantage of the problem smoothness and obtain faster convergence rates for the optimization procedure. Unlike [10], we fix the support of the barycenter, which leads to a convex optimization problem and allows us to prove complexity bounds for our algorithm.

The well-developed approach based on Sinkhorn's algorithm [11, 6, 13] naturally leads to parallel algorithms. Nevertheless, its application to continuous distributions requires discretization of these distributions, leading to computational intractability when one desires good accuracy and, hence, has to use a fine discretization with large n, which entails solving an optimization problem of large dimension. Thus, this approach is not directly applicable in our setting of continuous distributions, and it is not clear whether it is applicable in the decentralized distributed setting with arbitrary networks.

Table 1: Summary of literature.

  Paper         Decentr.   Cont.   Baryc.
  [11, 6, 13]      ×         ×       √
  [22]             ×         √       ×
  [43, 10]         ×         √       √
  Our Alg. 2       √         √       √

Recently, an alternative accelerated-gradient-based approach was shown to give better results than Sinkhorn's algorithm for the Wasserstein distance [18, 19]. Moreover, accelerated gradient methods have natural extensions to the decentralized distributed setting [40, 45, 28]. Nevertheless, existing distributed optimization algorithms cannot be applied to the barycenter problem in our setting of continuous distributions, as these algorithms are designed either for deterministic problems or for a stochastic primal problem, whereas in our case the dual problem is stochastic. Table 1 summarizes the existing literature on Wasserstein barycenter calculation and shows our contribution.

Contributions. We propose a novel algorithm for general stochastic optimization problems with linear constraints, namely the Accelerated Primal-Dual Stochastic Gradient Method (APDSGD). Based on this algorithm, we introduce a distributed algorithm for the computation of a discrete approximation of the regularized Wasserstein barycenter of a set of continuous distributions stored distributedly over a (connected and undirected) network with unknown arbitrary topology. For our algorithm, we provide iteration and arithmetic operations complexity in terms of the problem parameters. Finally, we demonstrate the effectiveness of our algorithm on the distributed computation of the regularized Wasserstein barycenter of a set of von Mises distributions for various network topologies and network sizes. Moreover, we show some initial results on the problem of image aggregation for two datasets, namely, a subset of the MNIST digit dataset [29] and a subset of the IXI Magnetic Resonance dataset [1].

Paper organization. 
In Section 2 we present the regularized Wasserstein barycenter problem for the semi-discrete case and its distributed computation over networks. In Section 3 we introduce a new algorithm for general stochastic optimization problems with linear constraints and analyze its convergence rate. Section 4 extends this algorithm and introduces our method for the distributed computation of the regularized Wasserstein barycenter. Section 5 shows the experimental results for the proposed algorithm. The supplementary material contains the full version of the paper, including an appendix with the proofs, as well as additional results of numerical experiments.

Notation. We define $\mathcal{M}^1_+(\mathcal{X})$ as the set of positive Radon probability measures on a metric space $\mathcal{X}$, and $S_1(n) = \{a \in \mathbb{R}^n_+ \mid \sum_{l=1}^n a_l = 1\}$ as the probability simplex. We denote by $\delta(x)$ the Dirac measure at a point $x$, and by $\otimes$ the Kronecker product. We refer to $\lambda_{\max}(W)$ as the maximum eigenvalue of a symmetric matrix $W$. We use bold symbols for stacked vectors $\mathbf{p} = [p_1^T, \dots, p_m^T]^T \in \mathbb{R}^{mn}$, where $p_1, \dots, p_m \in \mathbb{R}^n$; in this case $[\mathbf{p}]_i = p_i$ is the $i$-th block of $\mathbf{p}$. For a vector $\lambda \in \mathbb{R}^n$, we denote by $[\lambda]_l$ its $l$-th component. We refer to the Euclidean norm $\|\mathbf{p}\|_2 := \sqrt{\sum_l ([\mathbf{p}]_l)^2}$ as the 2-norm.

2 The Distributed Wasserstein Barycenter Problem

In this section, we present the problem of decentralized distributed computation of regularized Wasserstein barycenters for a family of possibly continuous probability measures distributed over a network. First, we provide the necessary background on the regularized Wasserstein distance and barycenter. Then, we give the details of the distributed formulation of the optimization problem defining the Wasserstein barycenter, which is a minimization problem with a linear equality constraint. 
To deal with this constraint, we make a transition to the dual problem, which, as we show, is a smooth stochastic optimization problem due to the presence of continuous distributions.

Regularized semi-discrete formulation of the optimal transport problem. We consider entropic regularization for the optimal transport problem and the corresponding regularized Wasserstein distance and barycenter [11]. Let $\mu \in \mathcal{M}^1_+(\mathcal{Y})$ be a measure with density $q(y)$, and let $\nu = \sum_{i=1}^n [p]_i \delta(z_i)$ be a discrete probability measure with weights given by a vector $p \in S_1(n)$ and finite support given by points $z_1, \dots, z_n \in \mathcal{Z}$ from a metric space $\mathcal{Z}$. The regularized Wasserstein distance in the semi-discrete setting between the continuous measure $\mu$ and the discrete measure $\nu$ is defined as²

$$W_\gamma(\mu, \nu) = \min_{\pi \in \Pi(\mu,\nu)} \left\{ \sum_{i=1}^n \int_{\mathcal{Y}} c_i(y)\,\pi_i(y)\,dy + \gamma \sum_{i=1}^n \int_{\mathcal{Y}} \pi_i(y) \log\!\left(\frac{\pi_i(y)}{\xi}\right) dy \right\}, \qquad (1)$$

where $c_i(y) = c(z_i, y)$ is a cost function for transportation of a unit of mass from a point $z_i \in \mathcal{Z}$ to a point $y \in \mathcal{Y}$, $\xi$ is the uniform distribution on $\mathcal{Y} \times \mathcal{Z}$, and the set of admissible coupling measures $\pi$ is defined as

$$\Pi(\mu, \nu) = \left\{ \pi \in \mathcal{M}^1_+(\mathcal{Y}) \times S_1(n) : \sum_{i=1}^n \pi_i(y) = q(y),\ y \in \mathcal{Y}; \ \int_{\mathcal{Y}} \pi_i(y)\,dy = p_i,\ \forall\, i = 1, \dots, n \right\}.$$

²Formally, the $\rho$-Wasserstein distance for $\rho \ge 1$ is $(W_0(\mu,\nu))^{1/\rho}$ if $\mathcal{Y} = \mathcal{Z}$ and $c_i(y) = d^\rho(z_i, y)$, $d$ being a distance on $\mathcal{Y}$. For simplicity, we refer to (1) as the regularized Wasserstein distance in the general situation, since our algorithm does not rely on any specific choice of the cost $c_i(y)$.

For a set of measures $\mu_i \in \mathcal{M}^1_+(\mathcal{Z})$, $i = 1, \dots, m$, we fix the support $z_1, \dots, z_n \in \mathcal{Z}$ of their regularized Wasserstein barycenter $\nu$ and seek it in the form $\nu = \sum_{i=1}^n [p]_i \delta(z_i)$, where $p \in S_1(n)$. The regularized Wasserstein barycenter in the semi-discrete setting is then defined as the solution of the following convex optimization problem³

$$\min_{p \in S_1(n)} \sum_{i=1}^m W_{\gamma,\mu_i}(p) = \min_{\substack{p_1 = \dots = p_m \\ p_1, \dots, p_m \in S_1(n)}} \ \sum_{i=1}^m W_{\gamma,\mu_i}(p_i), \qquad (2)$$

where we use the notation $W_{\gamma,\mu}(p) := W_\gamma(\mu, \nu)$ for a fixed probability measure $\mu$.

Network constraints in the distributed barycenter problem. We now describe the distributed optimization setting for solving the second problem in (2). We assume that each measure $\mu_i$ is held by an agent $i$ on a network, and this agent can sample from the measure. We model the network as a fixed connected undirected graph $G = (V, E)$, where $V$ is the set of $m$ nodes and $E$ is the set of edges; we assume that $G$ has no self-loops. The network structure imposes information constraints: each node $i$ has access to $\mu_i$ only and can exchange information only with its immediate neighbors, i.e., nodes $j$ such that $(i, j) \in E$.

We represent the communication constraints imposed by the network by introducing a single equality constraint instead of $p_1 = \dots = p_m$ in (2). To do so, we define the Laplacian matrix $\bar{W} \in \mathbb{R}^{m \times m}$ of the graph $G$ by a) $[\bar{W}]_{ij} = -1$ if $(i, j) \in E$, b) $[\bar{W}]_{ij} = \deg(i)$ if $i = j$, c) $[\bar{W}]_{ij} = 0$ otherwise. Here $\deg(i)$ is the degree of node $i$, i.e., the number of its neighbors. 
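The role of the Laplacian as a consensus-encoding matrix can be checked numerically. The following small sketch (with illustrative helper names, not code from the paper) builds the Laplacian of a cycle graph, lifts it to block vectors with a Kronecker product, and verifies that it vanishes exactly on consensus vectors:

```python
import numpy as np

def laplacian(m, edges):
    """Graph Laplacian: -1 for each edge (i, j), deg(i) on the diagonal."""
    W = np.zeros((m, m))
    for i, j in edges:
        W[i, j] = W[j, i] = -1.0
        W[i, i] += 1.0
        W[j, j] += 1.0
    return W

# Cycle on m = 4 nodes; the block-lifted communication matrix acts on
# stacked vectors p = [p_1; ...; p_m] with blocks of size n.
m, n = 4, 3
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
W_bar = laplacian(m, edges)
W = np.kron(W_bar, np.eye(n))

p_consensus = np.tile(np.array([0.2, 0.3, 0.5]), m)   # p_1 = ... = p_m
p_disagree = np.random.rand(m * n)                     # generic non-consensus vector
# For a connected graph, W p = 0 holds if and only if all blocks agree.
```

The symmetry and positive semidefiniteness of the Laplacian, and the fact that its kernel is spanned by constant (block-wise equal) vectors, are exactly the properties used in the reformulation that follows.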
Finally, define the communication matrix (also referred to as an interaction matrix) by $W := \bar{W} \otimes I_n$. Assuming that $G$ is undirected and connected, the Laplacian matrix $\bar{W}$ is symmetric and positive semidefinite, and the vector of all ones is the unique (up to a scaling factor) eigenvector associated with the zero eigenvalue. $W$ inherits the properties of $\bar{W}$, i.e., it is symmetric and positive semidefinite. Moreover, $\sqrt{W}\mathbf{p} = 0$ if and only if $p_1 = \dots = p_m$, where $\mathbf{p} = [p_1^T, \dots, p_m^T]^T \in \mathbb{R}^{mn}$ is the stacked column vector. Using this fact, we equivalently rewrite problem (2) as a maximization problem with a linear equality constraint:

$$\max_{\substack{p_1, \dots, p_m \in S_1(n), \\ \sqrt{W}\mathbf{p} = 0}} \ -\sum_{i=1}^m W_{\gamma,\mu_i}(p_i). \qquad (3)$$

Dual formulation of the barycenter problem. Since problem (3) is an optimization problem with linear constraints, we introduce a stacked vector of dual variables $\boldsymbol{\lambda} = [\lambda_1^T, \dots, \lambda_m^T]^T \in \mathbb{R}^{mn}$ for the constraint $\sqrt{W}\mathbf{p} = 0$ in (3). The Lagrangian dual problem for (3) is then

$$\min_{\boldsymbol{\lambda} \in \mathbb{R}^{mn}} \ \max_{p_1, \dots, p_m \in S_1(n)} \left\{ \sum_{i=1}^m \langle \lambda_i, [\sqrt{W}\mathbf{p}]_i \rangle - \sum_{i=1}^m W_{\gamma,\mu_i}(p_i) \right\} = \min_{\boldsymbol{\lambda} \in \mathbb{R}^{mn}} \ \sum_{i=1}^m W^*_{\gamma,\mu_i}([\sqrt{W}\boldsymbol{\lambda}]_i), \qquad (4)$$

where $[\sqrt{W}\mathbf{p}]_i$ and $[\sqrt{W}\boldsymbol{\lambda}]_i$ denote the $i$-th $n$-dimensional blocks of $\sqrt{W}\mathbf{p}$ and $\sqrt{W}\boldsymbol{\lambda}$ respectively, the equality $\sum_{i=1}^m \langle \lambda_i, [\sqrt{W}\mathbf{p}]_i \rangle = \sum_{i=1}^m \langle [\sqrt{W}\boldsymbol{\lambda}]_i, p_i \rangle$ was used, and $W^*_{\gamma,\mu_i}(\cdot)$ is the Fenchel-Legendre transform of $W_{\gamma,\mu_i}(p_i)$. The following lemma states that each $W^*_{\gamma,\mu_i}(\cdot)$ is a smooth function with Lipschitz-continuous gradient and can be expressed as an expectation of a function of an additional random argument.

Lemma 1. Given $\mu \in \mathcal{M}^1_+(\mathcal{Y})$ with density $q(\cdot)$, the Fenchel-Legendre conjugate of $W_{\gamma,\mu}(p)$ is

$$W^*_{\gamma,\mu}(\bar{\lambda}) = \mathbb{E}_{Y \sim \mu}\, \gamma \log\!\left( \frac{1}{q(Y)} \sum_{l=1}^n \exp\!\left( \frac{[\bar{\lambda}]_l - c_l(Y)}{\gamma} \right) \right),$$

and its gradient is $1/\gamma$-Lipschitz-continuous w.r.t. the 2-norm.

³For simplicity, we assume equal weights for each $W_{\gamma,\mu_i}(p)$ and do not normalize the sum by dividing by $m$. Our results can be directly generalized to the case of non-negative weights summing up to 1.

Denote $\bar{\boldsymbol{\lambda}} = \sqrt{W}\boldsymbol{\lambda} = [[\sqrt{W}\boldsymbol{\lambda}]_1^T, \dots, [\sqrt{W}\boldsymbol{\lambda}]_m^T]^T = [\bar{\lambda}_1^T, \dots, \bar{\lambda}_m^T]^T$, and let $W^*_\gamma(\boldsymbol{\lambda}) := \sum_{i=1}^m W^*_{\gamma,\mu_i}([\sqrt{W}\boldsymbol{\lambda}]_i)$ be the dual objective in the r.h.s. of (4). Then, by the chain rule, the $l$-th $n$-dimensional block of $\nabla W^*_\gamma(\boldsymbol{\lambda})$ is

$$[\nabla W^*_\gamma(\boldsymbol{\lambda})]_l = \left[ \nabla \sum_{i=1}^m W^*_{\gamma,\mu_i}([\sqrt{W}\boldsymbol{\lambda}]_i) \right]_l = \sum_{j=1}^m \sqrt{W}_{lj}\, \nabla W^*_{\gamma,\mu_j}(\bar{\lambda}_j), \quad l = 1, \dots, m. \qquad (5)$$

It follows from (5) and Lemma 1 that the dual problem (4) is a smooth stochastic convex optimization problem. This is in contrast to [28], where the primal problem is a stochastic optimization problem. Moreover, as opposed to the existing literature on stochastic convex optimization, we not only need to solve the dual problem but also to reconstruct an approximate solution of the primal problem (3), which is the barycenter. 
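Since Lemma 1 expresses the conjugate as an expectation of a log-sum-exp over the barycenter support, it can be estimated by plain Monte Carlo sampling from the local measure. The sketch below is purely illustrative (a 1-D Gaussian measure, a squared-distance cost, and helper names such as `conjugate_estimate` are assumptions, not from the paper's code):

```python
import numpy as np

def conjugate_estimate(lam_bar, z, cost, sample, log_q, gamma=0.1,
                       n_samples=10_000, rng=None):
    """Monte Carlo estimate of the conjugate in Lemma 1:
    E_{Y~mu} gamma * log( (1/q(Y)) * sum_l exp((lam_bar_l - c_l(Y)) / gamma) ).

    `sample` draws from mu, `log_q` is its log-density, `cost(z_l, y)` is c_l(y).
    """
    rng = rng or np.random.default_rng(0)
    Y = sample(rng, n_samples)                                     # (n_samples,)
    a = (lam_bar[None, :] - cost(z[None, :], Y[:, None])) / gamma  # batch x n
    a_max = a.max(axis=1, keepdims=True)                           # stable log-sum-exp
    lse = a_max[:, 0] + np.log(np.exp(a - a_max).sum(axis=1))
    return gamma * np.mean(lse - log_q(Y))

# Example: mu = N(0, 1) on a fixed support grid z with squared-distance cost.
z = np.linspace(-3, 3, 20)
val = conjugate_estimate(
    lam_bar=np.zeros(20), z=z,
    cost=lambda zl, y: (zl - y) ** 2,
    sample=lambda rng, k: rng.standard_normal(k),
    log_q=lambda y: -0.5 * y**2 - 0.5 * np.log(2 * np.pi),
)
```

A useful sanity check of the formula is its shift property: adding a constant $c$ to every component of $\bar\lambda$ increases the conjugate by exactly $c$, which the log-sum-exp estimator reproduces sample by sample.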
In the next section, we develop a novel accelerated primal-dual stochastic gradient method for a general smooth stochastic optimization problem which is dual to some optimization problem with linear equality constraints. Then, in Section 4, we apply this general algorithm to the particular case of the primal-dual pair of problems (3) and (4).

3 General Primal-Dual Framework for Stochastic Optimization

In this section, we consider a general smooth stochastic convex optimization problem which is dual to some optimization problem with linear equality constraints. Extending our works [16, 21, 9, 17, 19, 3, 18], we develop a novel algorithm for its solution and for the reconstruction of the primal variable, together with a convergence rate analysis. Unlike prior works, we consider a stochastic primal-dual pair of problems, and one of our contributions consists in providing a primal-dual extension of the accelerated stochastic gradient method. We believe that our algorithm can be used for problems other than the regularized Wasserstein barycenter problem; thus, we first provide a general algorithm and then apply it to the barycenter problem. We introduce new notation, since this section is independent of the others and is focused on a general optimization problem.

General setup. For any finite-dimensional real vector space $E$, we denote by $E^*$ its dual, by $\|\cdot\|$ a norm on $E$, and by $\|\cdot\|_*$ the norm on $E^*$ dual to $\|\cdot\|$, i.e., $\|\lambda\|_* = \max_{\|x\| \le 1} \langle \lambda, x \rangle$. For a linear operator $A : E_1 \to E_2$, the adjoint operator $A^T : E_2^* \to E_1^*$ is defined by $\langle u, Ax \rangle = \langle A^T u, x \rangle$, $\forall\, u \in E_2^*,\ x \in E_1$. We say that a function $f : E \to \mathbb{R}$ has an $L$-Lipschitz-continuous gradient w.r.t. the norm $\|\cdot\|_*$ if it is continuously differentiable and its gradient satisfies the Lipschitz condition $\|\nabla f(x) - \nabla f(y)\|_* \le L \|x - y\|$, $\forall\, x, y \in E$.

Our main goal in this section is to provide an algorithm for a primal-dual (up to a sign) pair of problems

$$(P) \quad \min_{x \in Q \subseteq E} \{ f(x) : Ax = b \}, \qquad (D) \quad \min_{\lambda \in \Lambda} \left\{ \langle \lambda, b \rangle + \max_{x \in Q} \left( -f(x) - \langle A^T \lambda, x \rangle \right) \right\},$$

where $Q$ is a simple closed convex set, $A : E \to H$ is a given linear operator, $b \in H$ is given, and $\Lambda = H^*$. We define $\varphi(\lambda) := \langle \lambda, b \rangle + \max_{x \in Q} \left( -f(x) - \langle A^T \lambda, x \rangle \right) = \langle \lambda, b \rangle + f^*(-A^T \lambda)$ and assume it to be smooth with an $L$-Lipschitz-continuous gradient; here $f^*$ is the Fenchel-Legendre dual of $f$. We also assume that $f^*(-A^T \lambda) = \mathbb{E}_\xi F^*(-A^T \lambda, \xi)$, where $\xi$ is a random vector and $F^*$ is the Fenchel-Legendre conjugate of some function $F(x, \xi)$, i.e., $F^*(-A^T \lambda, \xi) = \max_{x \in Q} \{ \langle -A^T \lambda, x \rangle - F(x, \xi) \}$. $F^*(\bar\lambda, \xi)$ is assumed to be smooth, and hence $\nabla_{\bar\lambda} F^*(\bar\lambda, \xi) = x(\bar\lambda, \xi)$, where $x(\bar\lambda, \xi) = \arg\max_{x \in Q} \{ \langle \bar\lambda, x \rangle - F(x, \xi) \}$. Under these assumptions, the dual problem $(D)$ can be accessed by a stochastic oracle $(\Phi(\lambda, \xi), \nabla\Phi(\lambda, \xi))$ satisfying $\mathbb{E}_\xi \Phi(\lambda, \xi) = \varphi(\lambda)$ and $\mathbb{E}_\xi \nabla\Phi(\lambda, \xi) = \nabla\varphi(\lambda)$, which we use in our algorithm.

Accelerated primal-dual stochastic gradient method. Next, we provide an accelerated algorithm for the primal-dual pair of problems $(P)$-$(D)$. The idea is to apply an accelerated stochastic gradient method to the dual problem $(D)$, endow it with a step in the primal space, and show that the new algorithm also approximates the solution of the primal problem. We additionally assume that the variance of the stochastic approximation $\nabla\Phi(\lambda, \xi)$ of the gradient of $\varphi$ can be controlled and made as small as we desire, for example, by mini-batching. Finally, since $\nabla\Phi(\lambda, \xi) = b - A \nabla F^*(-A^T \lambda, \xi) = b - A\, x(-A^T \lambda, \xi)$, on each iteration, to find $\nabla\Phi(\lambda, \xi)$ we find the vector $x(-A^T \lambda, \xi)$ and use it for the primal iterates.

Algorithm 1 Accelerated Primal-Dual Stochastic Gradient Method (APDSGD)
Input: number of iterations $N$.
1: $C_0 = \alpha_0 = 0$, $\eta_0 = \zeta_0 = \lambda_0 = \hat{x}_0 = 0$.
2: for $k = 0, \dots, N-1$ do
3:   Find $\alpha_{k+1} > 0$ from $C_{k+1} := C_k + \alpha_{k+1} = 2L\alpha_{k+1}^2$; set $\tau_{k+1} = \alpha_{k+1}/C_{k+1}$.
4:   $\lambda_{k+1} = \tau_{k+1} \zeta_k + (1 - \tau_{k+1}) \eta_k$.
5:   $\zeta_{k+1} = \zeta_k - \alpha_{k+1} \nabla\Phi(\lambda_{k+1}, \xi_{k+1})$.
6:   $\eta_{k+1} = \tau_{k+1} \zeta_{k+1} + (1 - \tau_{k+1}) \eta_k$.
7:   $\hat{x}_{k+1} = \tau_{k+1}\, x(\lambda_{k+1}, \xi_{k+1}) + (1 - \tau_{k+1}) \hat{x}_k$.
8: end for
Output: the points $\hat{x}_N$, $\eta_N$.

Theorem 1. Let $\varphi$ have an $L$-Lipschitz-continuous gradient w.r.t. the 2-norm and let $\|\lambda^*\|_2 \le R$, where $\lambda^*$ is a solution of the dual problem $(D)$. Given a desired accuracy $\varepsilon$, assume that, at each iteration of Algorithm 1, the stochastic gradient $\nabla\Phi(\lambda_k, \xi_k)$ is chosen in such a way that $\mathbb{E}_\xi \|\nabla\Phi(\lambda_k, \xi_k) - \nabla\varphi(\lambda_k)\|_2^2 \le \frac{\varepsilon L \alpha_k}{C_k}$. Then, for any $\varepsilon > 0$ and $N \ge 0$, with the expectation $\mathbb{E}$ taken w.r.t. all the randomness $\xi_1, \dots, \xi_N$, the output $\hat{x}_N$ generated by Algorithm 1 satisfies

$$f(\mathbb{E}\hat{x}_N) - f^* \le \frac{16LR^2}{N^2} + \frac{\varepsilon}{2} \quad \text{and} \quad \|A\,\mathbb{E}\hat{x}_N - b\|_2 \le \frac{16LR}{N^2} + \frac{\varepsilon}{2R}. \qquad (6)$$

In step 7 of Algorithm 1 we can use a batch of size $M$ and $\frac{1}{M}\sum_{r=1}^M x(\lambda_{k+1}, \xi^r_{k+1})$ to update $\hat{x}_{k+1}$. Then, under reasonable assumptions, $\hat{x}_N$ concentrates around $\mathbb{E}\hat{x}_N$ [23] and, if $f$ is Lipschitz-continuous, we obtain that (6) holds with large probability with $\hat{x}_N$ instead of $\mathbb{E}\hat{x}_N$.

4 Solving the Barycenter Problem

In this section, we apply the general algorithm APDSGD to solve the primal-dual pair of problems (3)-(4) and to approximate the regularized Wasserstein barycenter, which is a solution of (3). First, in Lemma 2, we make several technical steps to take care of the assumptions of Theorem 1. Then, we introduce a change of the dual variable so that step 5 of Algorithm 1 becomes feasible in the decentralized distributed setting. After that, we provide our algorithm for the regularized Wasserstein barycenter problem together with its complexity analysis.

Lemma 2. The gradient of the objective function $W^*_\gamma(\boldsymbol\lambda)$ in the dual problem (4) is $\lambda_{\max}(W)/\gamma$-Lipschitz-continuous w.r.t. the 2-norm. If its stochastic approximation is defined as

$$[\tilde\nabla W^*_\gamma(\boldsymbol\lambda)]_i = \sum_{j=1}^m \sqrt{W}_{ij}\, \tilde\nabla W^*_{\gamma,\mu_j}(\bar\lambda_j), \quad i = 1, \dots, m, \quad \text{with} \quad \tilde\nabla W^*_{\gamma,\mu_j}(\bar\lambda_j) = \frac{1}{M}\sum_{r=1}^M p_j(\bar\lambda_j, Y^j_r),$$

$$[p_j(\bar\lambda_j, Y^j_r)]_l = \frac{\exp\big(([\bar\lambda_j]_l - c_l(Y^j_r))/\gamma\big)}{\sum_{\ell=1}^n \exp\big(([\bar\lambda_j]_\ell - c_\ell(Y^j_r))/\gamma\big)}, \qquad (7)$$

where $M$ is the batch size, $\bar\lambda_j := [\sqrt{W}\boldsymbol\lambda]_j$, and $Y^j_1, \dots, Y^j_M$ are samples from $\mu_j$, $j = 1, \dots, m$, then

$$\mathbb{E}\, \tilde\nabla W^*_\gamma(\boldsymbol\lambda) = \nabla W^*_\gamma(\boldsymbol\lambda) \quad \text{and} \quad \mathbb{E}\, \|\tilde\nabla W^*_\gamma(\boldsymbol\lambda) - \nabla W^*_\gamma(\boldsymbol\lambda)\|_2^2 \le \lambda_{\max}(W)\, m / M, \quad \boldsymbol\lambda \in \mathbb{R}^{mn}. \qquad (8)$$

Based on this lemma, we see that if, on each iteration of Algorithm 1, the mini-batch size $M_k$ satisfies $M_k \ge m\gamma C_k/(\alpha_k \varepsilon)$, the assumptions of Theorem 1 hold. For the particular problem (4), step 5 of Algorithm 1 can be written block-wise as $[\zeta_{k+1}]_i = [\zeta_k]_i - \alpha_{k+1} \sum_{j=1}^m \sqrt{W}_{ij}\, \tilde\nabla W^*_{\gamma,\mu_j}([\sqrt{W}\boldsymbol\lambda_{k+1}]_j)$, $i = 1, \dots, m$. 
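Per agent, the stochastic approximation (7) is simply an average of softmax vectors over a mini-batch of local samples. A minimal numpy sketch (illustrative names; a 1-D support grid and squared-distance cost are assumed for concreteness):

```python
import numpy as np

def softmax_rows(a):
    """Row-wise softmax with max-subtraction for numerical stability."""
    a = a - a.max(axis=1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

def stochastic_grad_block(lam_bar_j, z, samples, gamma=0.1):
    """Mini-batch estimate (7) of the gradient block for agent j:
    average over samples Y_r of softmax((lam_bar_j - c(z, Y_r)) / gamma)."""
    cost = (z[None, :] - samples[:, None]) ** 2            # c_l(Y_r), batch x n
    P = softmax_rows((lam_bar_j[None, :] - cost) / gamma)  # one simplex row per sample
    return P.mean(axis=0)                                  # lies in S_1(n)

rng = np.random.default_rng(0)
z = np.linspace(-5, 5, 100)
g = stochastic_grad_block(np.zeros(100), z, rng.normal(0.0, 0.5, size=64))
```

Note that each softmax row is a point in the probability simplex, so the averaged gradient block is as well; this is the same vector $p_j(\bar\lambda_j, Y^j_r)$ that serves as the primal (barycenter) iterate.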
Unfortunately, this update cannot be made in the decentralized setting, since the sparsity pattern of $\sqrt{W}$ can be different from that of $W$, which would require some agents to get information not only from their neighbors. To overcome this obstacle, we change the variables and denote $\bar{\boldsymbol\lambda} = \sqrt{W}\boldsymbol\lambda$, $\bar{\boldsymbol\eta} = \sqrt{W}\boldsymbol\eta$, $\bar{\boldsymbol\zeta} = \sqrt{W}\boldsymbol\zeta$. Then step 5 of Algorithm 1 becomes $[\bar\zeta_{k+1}]_i = [\bar\zeta_k]_i - \alpha_{k+1} \sum_{j=1}^m W_{ij}\, \tilde\nabla W^*_{\gamma,\mu_j}([\bar{\boldsymbol\lambda}_{k+1}]_j)$, $i = 1, \dots, m$.

Theorem 2. Under the assumptions of Section 2, Algorithm 2 after $N = \sqrt{16\lambda_{\max}(W)R^2/(\varepsilon\gamma)}$ iterations returns an approximation $\hat{\mathbf{p}}_N$ of the barycenter which satisfies

$$\sum_{i=1}^m W_{\gamma,\mu_i}(\mathbb{E}[\hat{\mathbf{p}}_N]_i) - \sum_{i=1}^m W_{\gamma,\mu_i}([\mathbf{p}^*]_i) \le \varepsilon, \qquad \|\sqrt{W}\,\mathbb{E}\hat{\mathbf{p}}_N\|_2 \le \varepsilon/R. \qquad (9)$$

The total complexity is $O\!\left( mn \max\left\{ \sqrt{\frac{\lambda_{\max}(W)R^2}{\varepsilon\gamma}},\ \frac{\lambda_{\max}(W)mR^2}{\varepsilon^2} \right\} \right)$ arithmetic operations.

Algorithm 2 Distributed computation of the Wasserstein barycenter
Input: Each agent $i \in V$ is assigned its measure $\mu_i$.
1: All agents set $[\bar\eta_0]_i = [\bar\zeta_0]_i = [\bar\lambda_0]_i = [\hat{p}_0]_i = 0 \in \mathbb{R}^n$, $C_0 = \alpha_0 = 0$, and $N$.
2: For each agent $i \in V$:
3: for $k = 0, \dots, N-1$ do
4:   Find $\alpha_{k+1} > 0$ from $C_{k+1} := C_k + \alpha_{k+1} = 2L\alpha_{k+1}^2$; set $\tau_{k+1} = \alpha_{k+1}/C_{k+1}$.
5:   Set $M_{k+1} = \max\{1, \lceil m\gamma C_{k+1}/(\alpha_{k+1}\varepsilon) \rceil\}$.
6:   $[\bar\lambda_{k+1}]_i = \tau_{k+1}[\bar\zeta_k]_i + (1 - \tau_{k+1})[\bar\eta_k]_i$.
7:   Generate $M_{k+1}$ samples $\{Y^i_r\}_{r=1}^{M_{k+1}}$ from the measure $\mu_i$ and set $\tilde\nabla W^*_{\gamma,\mu_i}([\bar\lambda_{k+1}]_i)$ as in (7).
8:   Share $\tilde\nabla W^*_{\gamma,\mu_i}([\bar\lambda_{k+1}]_i)$ with the neighbors $\{j \mid (i, j) \in E\}$.
9:   $[\bar\zeta_{k+1}]_i = [\bar\zeta_k]_i - \alpha_{k+1} \sum_{j=1}^m W_{ij}\, \tilde\nabla W^*_{\gamma,\mu_j}([\bar\lambda_{k+1}]_j)$.
10:  $[\bar\eta_{k+1}]_i = \tau_{k+1}[\bar\zeta_{k+1}]_i + (1 - \tau_{k+1})[\bar\eta_k]_i$.
11:  $[\hat{p}_{k+1}]_i = \tau_{k+1}\, p_i([\bar\lambda_{k+1}]_i, Y^i_1) + (1 - \tau_{k+1})[\hat{p}_k]_i$, where $p_i(\cdot,\cdot)$ is defined in (7).⁴
12: end for
Output: $\hat{\mathbf{p}}_N$.

We underline that even if the measures $\mu_i$, $i = 1, \dots, m$, are discrete with a large support size, it can be more efficient to apply our stochastic algorithm than a deterministic one. We now explain this in more detail. If a measure $\mu$ is discrete, then $W^*_{\gamma,\mu}(\bar\lambda)$ in Lemma 1 is represented as a finite expectation, i.e., a sum of functions instead of an integral, and can be found explicitly. In the same way, its gradient and, hence, $\nabla W^*_\gamma(\boldsymbol\lambda)$ in (5) can be found explicitly in a deterministic way. Then a deterministic accelerated decentralized algorithm can be applied to approximate the regularized barycenter. Assume for simplicity that the support of the measure $\mu$ has size $n$. Then the calculation of the exact gradient of $W^*_{\gamma,\mu}(\bar\lambda)$ requires $O(n^2)$ arithmetic operations, and the overall complexity of the deterministic algorithm is $O\big(mn^2\sqrt{\lambda_{\max}(W)R^2/(\gamma\varepsilon)}\big)$. For comparison, the complexity of our randomized approach in Theorem 2 is proportional to $n$, but not to $n^2$. So, our randomized approach is superior in the regime of large $n$.

5 Experimental Results

In this section, we present experimental results for Algorithm 2. 
Initially, we consider a set of agents over a network, where each agent $i$ can sample from a privately held random variable $Y_i \sim \mathcal{N}(\theta_i, v_i^2)$, where $\mathcal{N}(\theta, v^2)$ is a univariate Gaussian distribution with mean $\theta$ and variance $v^2$. Moreover, we set $\theta_i \in [-4, 4]$ and $v_i \in [0.1, 0.6]$. The objective is to compute a discrete distribution $p \in S_1(n)$ that solves (2). We assume $n = 100$ and that the support of $p$ is a set of $100$ equally spaced points on the segment $[-5, 5]$. Figure 1 shows the performance of Algorithm 2 for four classes of networks: complete, cycle, star, and Erdős-Rényi. Moreover, we show the behavior for different network sizes, namely $m = 10, 100, 200, 500$. In particular, we use two metrics: the value of the dual objective $\mathcal{W}^*_\gamma(\bar\lambda)$ and the distance to consensus $C(\hat p) := \|\sqrt{W}\hat p\|_2$. As expected, when the network is a complete graph, convergence to the final value is fast and the distance to consensus decreases rapidly. Nevertheless, the performance on graphs with degree regularity, such as the cycle graph and the Erdős-Rényi random graph, is similar to that on a complete graph, with much less communication overhead. For the star graph, which has the largest gap between the maximum and minimum number of neighbors among all nodes, the algorithm performs poorly. Figure 2 shows the convergence of the local barycenters for a set of von Mises distributions. Each agent over an Erdős-Rényi random graph can access private realizations of a von Mises random variable. In particular, for the von Mises distributions we use the angle between two points as the distance function. Figure 3 shows the computed local barycenters of 9 agents in a network of 500 nodes at different iteration numbers.
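As a side note on the metrics above, the distance to consensus $C(\hat p) = \|\sqrt{W}\hat p\|_2$ can be evaluated through an eigendecomposition of $W$. A small sketch, assuming only that $W$ is symmetric positive semidefinite (e.g. a graph Laplacian); the function name is ours:

```python
import numpy as np

def distance_to_consensus(W, P):
    """C(P) = || sqrt(W) P ||_2 for a symmetric PSD matrix W and stacked
    local iterates P (one row per agent). Uses W = U diag(s) U^T, so
    sqrt(W) = U diag(sqrt(s)) U^T. For the Laplacian of a connected graph,
    C(P) = 0 exactly when all rows of P are equal (consensus), since the
    kernel is spanned by the all-ones vector."""
    s, U = np.linalg.eigh(W)
    s = np.clip(s, 0.0, None)  # guard against tiny negative eigenvalues
    sqrtW = U @ np.diag(np.sqrt(s)) @ U.T
    return np.linalg.norm(sqrtW @ P)
```

For agreeing iterates the metric vanishes, while any disagreement between neighbors makes it strictly positive, which is why it is plotted alongside the dual objective.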
Each agent holds a local copy of a sample of the digit 2 (a 56 × 56 image) from the MNIST dataset [29]. All agents converge to the same image, which structurally represents the aggregation of the 500 original images held over the network. Finally, Figure 4 shows a simple example of an application of the Wasserstein barycenter to medical image aggregation, where we have 4 agents connected over a cycle graph and each agent holds a magnetic resonance image (256 × 256) from the IXI dataset [1].

⁴In the experiments, we use $\frac{1}{M_{k+1}}\sum_{r=1}^{M_{k+1}} p_i([\bar\lambda_{k+1}]_i, Y^i_r)$ instead of $p_i([\bar\lambda_{k+1}]_i, Y^i_1)$, which does not change the statement of Theorem 2 but reduces the variance of $\hat p_N$ in practice. Moreover, under mild assumptions, we can obtain a high-probability analogue of inequalities (9).

Figure 1: Dual function value and distance to consensus for 200, 100, 10, 500 agents, $M_k = 100$ and $\gamma = 0.1$.

Figure 2: Wasserstein barycenter of von Mises distributions for 10 agents at different iteration numbers.

Figure 3: Wasserstein barycenter of digit 2 from the MNIST dataset [29]. Each block shows a subset of 9 randomly selected local barycenters at different time instances.

Figure 4: Wasserstein barycenter for a subset of images from the IXI dataset [1]. Each block shows the local barycenters of 4 agents at different time instances.

6 Conclusions

We propose a novel distributed algorithm for the regularized Wasserstein barycenter problem for a set of continuous measures stored distributedly over a network of agents. Our algorithm is based on a new general algorithm for the solution of stochastic convex optimization problems with linear constraints. In contrast to the recent literature, our algorithm can be executed over arbitrary connected and static networks whose nodes are oblivious to the network topology, which makes it suitable for large-scale network optimization settings.
Additionally, our analysis indicates that the randomization strategy provides faster convergence rates than the deterministic procedure when the support size of the barycenter is large. The implementation of our algorithm on real networks requires further work, as does its extension to the decentralized distributed setting of Sinkhorn-type algorithms [6] for the regularized Wasserstein barycenter and other related algorithms, e.g., Wasserstein propagation [41].

Acknowledgments

The work of A. Nedić and C.A. Uribe in Sect. 5 is supported by the National Science Foundation under grant no. CPS 15-44953. The research by P. Dvurechensky, D. Dvinskikh, and A. Gasnikov in Sect. 3 and Sect. 4 was funded by the Russian Science Foundation (project 18-71-10108).

References

[1] IXI Dataset. http://brain-development.org/ixi-dataset/. Accessed: 2018-05-17.

[2] M. Agueh and G. Carlier. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924, 2011.

[3] A. S. Anikin, A. V. Gasnikov, P. E. Dvurechensky, A. I. Tyurin, and A. V. Chernov. Dual approaches to the minimization of strongly convex functionals with a simple structure under affine constraints. Computational Mathematics and Mathematical Physics, 57(8):1262–1276, Aug 2017.

[4] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN.
arXiv:1701.07875, 2017.

[5] M. Beiglböck, P. Henry-Labordere, and F. Penkner. Model-independent bounds for option prices: a mass transport approach. Finance and Stochastics, 17(3):477–501, 2013.

[6] J.-D. Benamou, G. Carlier, M. Cuturi, L. Nenna, and G. Peyré. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015.

[7] J. Bigot, R. Gouet, T. Klein, and A. López. Geodesic PCA in the Wasserstein space by convex PCA. Ann. Inst. H. Poincaré Probab. Statist., 53(1):1–26, 02 2017.

[8] G. Buttazzo, L. De Pascale, and P. Gori-Giorgi. Optimal-transport formulation of electronic density-functional theory. Physical Review A, 85(6):062502, 2012.

[9] A. Chernov, P. Dvurechensky, and A. Gasnikov. Fast primal-dual gradient method for strongly convex minimization problems with linear constraints. In Y. Kochetov, M. Khachay, V. Beresnev, E. Nurminski, and P. Pardalos, editors, Discrete Optimization and Operations Research: 9th International Conference, DOOR 2016, Vladivostok, Russia, September 19-23, 2016, Proceedings, pages 391–403. Springer International Publishing, 2016.

[10] S. Claici, E. Chien, and J. Solomon. Stochastic Wasserstein barycenters. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 999–1008. PMLR, 2018.

[11] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2292–2300. Curran Associates, Inc., 2013.

[12] M. Cuturi and A. Doucet. Fast computation of Wasserstein barycenters. In International Conference on Machine Learning, pages 685–693, 2014.

[13] M. Cuturi and G. Peyré.
A smoothed dual approach for variational Wasserstein problems. SIAM Journal on Imaging Sciences, 9(1):320–343, 2016.

[14] E. del Barrio, E. Gine, and C. Matran. Central limit theorems for the Wasserstein distance between the empirical and the true distributions. The Annals of Probability, 27(2):1009–1071, 1999.

[15] P. Dvurechensky, D. Dvinskikh, A. Gasnikov, C. A. Uribe, and A. Nedić. Decentralize and randomize: Faster algorithm for Wasserstein barycenters. arXiv:1806.03915, 2018.

[16] P. Dvurechensky and A. Gasnikov. Stochastic intermediate gradient method for convex problems with stochastic inexact oracle. Journal of Optimization Theory and Applications, 171(1):121–145, 2016.

[17] P. Dvurechensky, A. Gasnikov, E. Gasnikova, S. Matsievsky, A. Rodomanov, and I. Usik. Primal-dual method for searching equilibrium in hierarchical congestion population games. In Supplementary Proceedings of the 9th International Conference on Discrete Optimization and Operations Research and Scientific School (DOOR 2016), Vladivostok, Russia, September 19-23, 2016, pages 584–595, 2016. arXiv:1606.08988.

[18] P. Dvurechensky, A. Gasnikov, and A. Kroshnin. Computational optimal transport: Complexity by accelerated gradient descent is better than by Sinkhorn's algorithm. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1367–1376, 2018. arXiv:1802.04367.

[19] P. Dvurechensky, A. Gasnikov, S. Omelchenko, and A. Tiurin. Adaptive similar triangles method: a stable alternative to Sinkhorn's algorithm for regularized optimal transport. arXiv:1706.07622, 2017.

[20] J. Ebert, V. Spokoiny, and A. Suvorikova. Construction of non-asymptotic confidence sets in 2-Wasserstein space. arXiv:1703.03658, 2017.

[21] A. V. Gasnikov, E. V. Gasnikova, Y. E. Nesterov, and A. V. Chernov.
Efficient numerical methods for entropy-linear programming problems. Computational Mathematics and Mathematical Physics, 56(4):514–524, 2016.

[22] A. Genevay, M. Cuturi, G. Peyré, and F. Bach. Stochastic optimization for large-scale optimal transport. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3440–3448. Curran Associates, Inc., 2016.

[23] V. Guigues, A. Juditsky, and A. Nemirovski. Non-asymptotic confidence bounds for the optimal value of a stochastic program. Optimization Methods and Software, 32(5):1033–1058, 2017.

[24] N. Ho, X. Nguyen, M. Yurochkin, H. H. Bui, V. Huynh, and D. Phung. Multilevel clustering via Wasserstein means. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1501–1509, International Convention Centre, Sydney, Australia, 06-11 Aug 2017. PMLR.

[25] L. Kantorovich. On the translocation of masses. (Doklady) Acad. Sci. URSS (N.S.), 37:199–201, 1942.

[26] S. Kolouri, S. R. Park, M. Thorpe, D. Slepcev, and G. K. Rohde. Optimal mass transport: Signal processing and machine-learning applications. IEEE Signal Processing Magazine, 34(4):43–59, July 2017.

[27] M. J. Kusner, Y. Sun, N. I. Kolkin, and K. Q. Weinberger. From word embeddings to document distances. In Proceedings of the 32nd International Conference on Machine Learning, ICML'15, pages 957–966. JMLR.org, 2015.

[28] G. Lan, S. Lee, and Y. Zhou. Communication-efficient algorithms for decentralized and stochastic optimization. Mathematical Programming, pages 1–48, 2018.

[29] Y. LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

[30] G. Monge.
M\u00e9moire sur la th\u00e9orie des d\u00e9blais et des remblais. Histoire de l\u2019Acad\u00e9mie Royale des Sciences\n\nde Paris, 1781.\n\n[31] A. Nedi\u00b4c, A. Olshevsky, W. Shi, and C. A. Uribe. Geometrically convergent distributed optimization with\nuncoordinated step-sizes. In American Control Conference (ACC), 2017, pages 3950\u20133955. IEEE, 2017.\n\n[32] A. Nedi\u00b4c, A. Olshevsky, and C. A. Uribe. Distributed learning for cooperative inference. arXiv preprint\n\narXiv:1704.02718, 2017.\n\n[33] A. Nedi\u00b4c, A. Olshevsky, and C. A. Uribe. Fast convergence rates for distributed non-bayesian learning.\n\nIEEE Transactions on Automatic Control, 62(11):5538\u20135553, 2017.\n\n[34] A. Nedi\u00b4c, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over\n\ntime-varying graphs. SIAM Journal on Optimization, 27(4):2597\u20132633, 2017.\n\n[35] R. Olfati-Saber, E. Franco, E. Frazzoli, and J. S. Shamma. Belief Consensus and Distributed Hypothesis\n\nTesting in Sensor Networks, pages 169\u2013182. Springer Berlin Heidelberg, Berlin, Heidelberg, 2006.\n\n[36] V. M. Panaretos and Y. Zemel. Amplitude and phase variation of point processes. Ann. Statist., 44(2):771\u2013\n\n812, 04 2016.\n\n[37] A. Rogozin, C. A. Uribe, A. Gasnikov, N. Malkovsky, and A. Nedi\u00b4c. Optimal distributed optimization on\n\nslowly time-varying graphs. arXiv preprint arXiv:1805.06045, 2018.\n\n[38] Y. Rubner, C. Tomasi, and L. J. Guibas. The earth mover\u2019s distance as a metric for image retrieval.\n\nInternational journal of computer vision, 40(2):99\u2013121, 2000.\n\n10\n\n\f[39] R. Sandler and M. Lindenbaum. Nonnegative matrix factorization with earth mover\u2019s distance metric for\nimage analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8):1590\u20131602, Aug\n2011.\n\n[40] K. Scaman, F. R. Bach, S. Bubeck, Y. T. Lee, and L. Massouli\u00e9. 
Optimal algorithms for smooth and strongly convex distributed optimization in networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 3027–3036, 2017.

[41] J. Solomon, F. De Goes, G. Peyré, M. Cuturi, A. Butscher, A. Nguyen, T. Du, and L. Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (TOG), 34(4):66, 2015.

[42] J. Solomon, R. M. Rustamov, L. Guibas, and A. Butscher. Wasserstein propagation for semi-supervised learning. In Proceedings of the 31st International Conference on Machine Learning, ICML'14, pages I-306–I-314. JMLR.org, 2014.

[43] M. Staib, S. Claici, J. M. Solomon, and S. Jegelka. Parallel streaming Wasserstein barycenters. In Advances in Neural Information Processing Systems, pages 2644–2655, 2017.

[44] C. A. Uribe, D. Dvinskikh, P. Dvurechensky, A. Gasnikov, and A. Nedić. Distributed computation of Wasserstein barycenters over networks. In 2018 IEEE 57th Annual Conference on Decision and Control (CDC), pages 6544–6549, Dec 2018.

[45] C. A. Uribe, S. Lee, A. Gasnikov, and A. Nedić. Optimal algorithms for distributed optimization. arXiv:1712.00232, 2017.

[46] C. A. Uribe, S. Lee, A. Gasnikov, and A. Nedić. A dual approach for optimal algorithms in distributed optimization over networks. arXiv:1809.00710, 2018.