{"title": "Learning Probability Measures with respect to Optimal Transport Metrics", "book": "Advances in Neural Information Processing Systems", "page_first": 2492, "page_last": 2500, "abstract": "We study the problem of estimating, in the sense of optimal transport metrics, a measure which is assumed supported on a manifold embedded in a Hilbert space. By establishing a precise connection between optimal transport metrics, optimal quantization, and learning theory, we derive new probabilistic bounds for the performance of a classic algorithm in unsupervised learning (k-means), when used to produce a probability measure derived from the data. In the course of the analysis, we arrive at new lower bounds, as well as probabilistic bounds on the convergence rate of the empirical law of large numbers, which, unlike existing bounds, are applicable to a wide class of measures.", "full_text": "Learning Probability Measures with Respect to\n\nOptimal Transport Metrics\n\nGuillermo D. Canas(cid:63),\u2020\nLorenzo A. Rosasco(cid:63),\u2020\n(cid:63) Laboratory for Computational and Statistical Learning - MIT-IIT\n\u2020 CBCL, McGovern Institute - Massachusetts Institute of Technology\n\n{guilledc,lrosasco}@mit.edu\n\nAbstract\n\nWe study the problem of estimating, in the sense of optimal transport metrics, a\nmeasure which is assumed supported on a manifold embedded in a Hilbert space.\nBy establishing a precise connection between optimal transport metrics, optimal\nquantization, and learning theory, we derive new probabilistic bounds for the per-\nformance of a classic algorithm in unsupervised learning (k-means), when used to\nproduce a probability measure derived from the data. 
In the course of the analysis,\nwe arrive at new lower bounds, as well as probabilistic upper bounds on the con-\nvergence rate of empirical to population measures, which, unlike existing bounds,\nare applicable to a wide class of measures.\n\n1\n\nIntroduction and Motivation\n\nIn this paper we study the problem of learning from random samples a probability distribution\nsupported on a manifold, when the learning error is measured using transportation metrics.\nThe problem of learning a probability distribution is classic in statistics, and is typically analyzed\nfor distributions in X = Rd that have a density with respect to the Lebesgue measure, with total\nvariation, and L2 among the common distances used to measure closeness of two densities (see for\ninstance [10, 32] and references therein.) The setting in which the data distribution is supported on\na low dimensional manifold embedded in a high dimensional space has only been considered more\nrecently. In particular, kernel density estimators on manifolds have been described in [36], and their\npointwise consistency, as well as convergence rates, have been studied in [25, 23, 18]. A discussion\non several topics related to statistics on a Riemannian manifold can be found in [26].\nInterestingly, the problem of approximating measures with respect to transportation distances has\ndeep connections with the \ufb01elds of optimal quantization [14, 16], optimal transport [35] and, as\nwe point out in this work, with unsupervised learning (see Sec. 4.)\nIn fact, as described in the\nsequel, some of the most widely-used algorithms for unsupervised learning, such as k-means (but\nalso others such as PCA and k-\ufb02ats), can be shown to be performing exactly the task of estimating\nthe data-generating measure in the sense of the 2-Wasserstein distance. 
This close relation between\nlearning theory, and optimal transport and quantization seems novel and of interest in its own right.\nIndeed, in this work, techniques from the above three \ufb01elds are used to derive the new probabilistic\nbounds described below.\nOur technical contribution can be summarized as follows:\n\n(a) we prove uniform lower bounds for the distance between a measure and estimates based on\ndiscrete sets (such as the empirical measure or measures derived from algorithms such as k-\nmeans);\n\n(b) we provide new probabilistic bounds for the rate of convergence of empirical to population\n\nmeasures which, unlike existing probabilistic bounds, hold for a very large class of measures;\n\n1\n\n\f(c) we provide probabilistic bounds for the rate of convergence of measures derived from k-means\n\nto the data measure.\n\nThe structure of the paper is described at the end of Section 2, where we discuss the exact formula-\ntion of the problem as well as related previous works.\n\n2 Setup and Previous work\nConsider the problem of learning a probability measure \u03c1 supported on a space M, from an i.i.d.\nsample Xn = (x1, . . . , xn) \u223c \u03c1n of size n. We assume M to be a compact, smooth d-dimensional\nmanifold of bounded curvature, with C1 metric and volume measure \u03bbM, embedded in the unit ball\nof a separable Hilbert space X with inner product (cid:104)\u00b7,\u00b7(cid:105), induced norm (cid:107) \u00b7 (cid:107), and distance d (for\ninstance M = Bd\n2 (1) the unit ball in X = Rd.) Following [35, p. 94], let Pp(M) denote the\nWasserstein space of order 1 \u2264 p < \u221e:\n\nPp(M) :=\n\n\u03c1 \u2208 P (M) :\n\n(cid:107)x(cid:107)pd\u03c1(x) < \u221e\n\n(cid:90)\n\nM\n\n(cid:27)\n\n(cid:26)\n\n(cid:110)\n\nof probability measures P (M) supported on M, with \ufb01nite p-th moment. 
The p-Wasserstein dis-\ntance\n\n[E(cid:107)X \u2212 Y (cid:107)p]1/p : Law(X) = \u03c1, Law(Y ) = \u00b5\n\n(1)\n\n(cid:111)\n\nWp(\u03c1, \u00b5) = inf\nX,Y\n\nwhere the random variables X and Y are distributed according to \u03c1 and \u00b5 respectively, is the optimal\nexpected cost of transporting points generated from \u03c1 to those generated from \u00b5, and is guaranteed to\nbe \ufb01nite in Pp(M) [35, p. 95]. The space Pp(M) with the Wp metric is itself a complete separable\nmetric space [35]. We consider here the problem of learning probability measures \u03c1 \u2208 P2(M),\nwhere the performance is measured by the distance W2.\nThere are many possible choices of distances between probability measures [13]. Among them,\nWp metrizes weak convergence (see [35] theorem 6.9), that is, in Pp(M), a sequence (\u00b5i)i\u2208N of\nmeasures converges weakly to \u00b5 iff Wp(\u00b5i, \u00b5) \u2192 0 and their p-th order moments converge to that of\n\u00b5. There are other distances, such as the L\u00b4evy-Prokhorov, or the weak-* distance, that also metrize\nweak convergence. However, as pointed out by Villani in his excellent monograph [35, p. 98],\n\n1. \u201cWasserstein distances are rather strong, [...]a de\ufb01nite advantage over the weak-* distance\u201d.\n2. \u201cIt is not so dif\ufb01cult to combine information on convergence in Wasserstein distance with\n\nsome smoothness bound, in order to get convergence in stronger distances.\u201d\n\nWasserstein distances have been used to study the mixing and convergence of Markov chains [22], as\nwell as concentration of measure phenomena [20]. To this list we would add the important fact that\nexisting and widely-used algorithms for unsupervised learning can be easily extended (see Sec. 
4)\nto compute a measure \u03c1(cid:48) that minimizes the distance W2(\u02c6\u03c1n, \u03c1(cid:48)) to the empirical measure\n\nn(cid:88)\n\ni=1\n\n\u02c6\u03c1n :=\n\n1\nn\n\n\u03b4xi,\n\na fact that will allow us to prove, in Sec. 5, bounds on the convergence of a measure induced by\nk-means to the population measure \u03c1.\nThe most useful versions of Wasserstein distance are p = 1, 2, with p = 1 being the weaker of the\ntwo (by H\u00a8older\u2019s inequality, p \u2264 q \u21d2 Wp \u2264 Wq.) In particular, \u201cresults in W2 distance are usually\nstronger, and more dif\ufb01cult to establish than results in W1 distance\u201d [35, p. 95]. A discussion of\np = \u221e would take us out of topic, since its behavior is markedly different.\n\n2.1 Closeness of Empirical and Population Measures\n\nBy the strong law of large numbers, the empirical measure converges almost surely to the population\nmeasure: \u02c6\u03c1n \u2192 \u03c1 in the sense of the weak topology [34]. Since weak convergence and convergence\nin Wp plus convergence of p-th moments are equivalent in Pp(M), this means that, in the Wp sense,\nthe empirical measure \u02c6\u03c1n converges to \u03c1, as n \u2192 \u221e. 
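As a concrete illustration (not part of the original analysis): for ρ uniform on [0, 1], the distance W1(ρ, ρ̂n) equals the integral of |F̂n(t) − t| over [0, 1], where F̂n is the empirical CDF, and can be approximated on a grid. The helper name `w1_to_uniform` below is ours, introduced only for this sketch.

```python
import numpy as np

def w1_to_uniform(sample, grid_size=100_000):
    """Approximate W1 between the empirical measure of `sample` and the
    uniform measure on [0, 1], via W1 = integral of |F_n(t) - t| dt."""
    xs = np.sort(np.asarray(sample, dtype=float))
    t = np.linspace(0.0, 1.0, grid_size)
    # Empirical CDF: F_n(t) = fraction of sample points <= t.
    F_n = np.searchsorted(xs, t, side="right") / len(xs)
    # Riemann approximation of the integral over [0, 1].
    return np.mean(np.abs(F_n - t))

rng = np.random.default_rng(0)
w_small = w1_to_uniform(rng.uniform(size=10))     # n = 10
w_large = w1_to_uniform(rng.uniform(size=5000))   # n = 5000
# The distance shrinks as n grows, consistent with rho_hat_n -> rho.
```

For a single point at 0.5 the integral evaluates to 1/4, which gives a quick correctness check of the approximation.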
A fundamental question is therefore how fast\nthe rate of convergence of \u02c6\u03c1n \u2192 \u03c1 is.\n\n2\n\n\f2.1.1 Convergence in expectation\nThe rate of convergence of \u02c6\u03c1n \u2192 \u03c1 in expectation has been widely studied in the past, result-\ning in upper bounds of order EW2(\u03c1, \u02c6\u03c1n) = O(n\u22121/(d+2)) [19, 8], and lower bounds of order\nEW2(\u03c1, \u02c6\u03c1n) = \u2126(n\u22121/d) [29] (both assuming that the absolutely continuous part of \u03c1 is \u03c1A (cid:54)= 0,\nwith possibly better rates otherwise).\nMore recently, an upper bound of order EWp(\u03c1, \u02c6\u03c1n) = O(n\u22121/d) has been proposed [2] by proving\na bound for the Optimal Bipartite Matching (OBM) problem [1], and relating this problem to the\nexpected distance EWp(\u03c1, \u02c6\u03c1n).\nIn particular, given two independent samples Xn, Yn, the OBM\n\nproblem is that of \ufb01nding a permutation \u03c3 that minimizes the matching cost n\u22121(cid:80)(cid:107)xi\u2212y\u03c3(i)(cid:107)p [24,\n\n30]. It is not hard to show that the optimal matching cost is Wp(\u02c6\u03c1Xn , \u02c6\u03c1Yn )p, where \u02c6\u03c1Xn , \u02c6\u03c1Yn are\nthe empirical measures associated to Xn, Yn. By Jensen\u2019s inequality, the triangle inequality, and\n(a + b)p \u2264 2p\u22121(ap + bp), it holds\n\nEWp(\u03c1, \u02c6\u03c1n)p \u2264 EWp(\u02c6\u03c1Xn , \u02c6\u03c1Yn )p \u2264 2p\u22121EWp(\u03c1, \u02c6\u03c1n)p,\n\nand therefore a bound of order O(n\u2212p/d) for the OBM problem [2] implies a bound EWp(\u03c1, \u02c6\u03c1n) =\nO(n\u22121/d). The matching lower bound is only known for a special case: \u03c1A constant over a bounded\nset of non-null measure [2] (e.g. \u03c1A uniform.) Similar results, with matching lower bounds are found\nfor W1 in [11].\n\n2.1.2 Convergence in probability\n\nResults for convergence in probability, one of the main results of this work, appear to be considerably\nharder to obtain. 
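As an aside, the optimal-matching identity of the previous subsection can be verified directly: for two samples of equal size with uniform weights, the optimal coupling between the empirical measures is induced by a permutation, so Wp(ρ̂Xn, ρ̂Yn) is exactly a minimum-cost bipartite matching. A minimal sketch (the helper name `wp_matching` is ours), using SciPy's assignment solver:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def wp_matching(X, Y, p=2):
    """W_p between the empirical measures of two same-size samples X, Y
    (rows are points), computed as an optimal bipartite matching."""
    diff = X[:, None, :] - Y[None, :, :]
    cost = np.linalg.norm(diff, axis=2) ** p       # pairwise ||x_i - y_j||^p
    rows, cols = linear_sum_assignment(cost)       # optimal permutation sigma
    # (n^-1 sum_i ||x_i - y_sigma(i)||^p)^(1/p)
    return cost[rows, cols].mean() ** (1.0 / p)

X = np.array([[0.0], [1.0]])
Y = np.array([[1.1], [0.1]])
# The optimal matching pairs 0.0 with 0.1 and 1.0 with 1.1, so W2 = 0.1.
```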
One fruitful avenue of analysis has been the use of so-called transportation, or\nTalagrand inequalities Tp, which can be used to prove concentration inequalities on Wp [20]. In\nparticular, we say that \u03c1 satis\ufb01es a Tp(C) inequality with C > 0 iff Wp(\u03c1, \u00b5)2 \u2264 CH(\u00b5|\u03c1),\u2200\u00b5 \u2208\nPp(M), where H(\u00b7|\u00b7) is the relative entropy [20]. As shown in [6, 5], it is possible to obtain\nprobabilistic upper bounds on Wp(\u03c1, \u02c6\u03c1n), with p = 1, 2, if \u03c1 is known to satisfy a Tp inequality\nof the same order, thereby reducing the problem of bounding Wp(\u03c1, \u02c6\u03c1n) to that of obtaining a Tp\ninequality. Note that, by Jensen\u2019s inequality, and as expected from the behavior of Wp, the inequality\nT2 is stronger than T1 [20].\nWhile it has been shown that \u03c1 satis\ufb01es a T1 inequality iff it has a \ufb01nite square-exponential moment\n(E[e\u03b1(cid:107)x(cid:107)2\n] \ufb01nite for some \u03b1 > 0) [4, 7], no such general conditions have been found for T2. As\nan example, consider that, if M is compact with diameter D then, by theorem 6.15 of [35], and the\ncelebrated Csisz\u00b4ar-Kullback-Pinsker inequality [27], for all \u03c1, \u00b5 \u2208 Pp(M), it is\nTV \u2264 22p\u22121D2pH(\u00b5|\u03c1),\n\nWp(\u03c1, \u00b5)2p \u2264 (2D)2p(cid:107)\u03c1 \u2212 \u00b5(cid:107)2\n\nwhere (cid:107) \u00b7 (cid:107)TV is the total variation norm. Clearly, this implies a Tp=1 inequality, but for p \u2265 2 it\ndoes not.\nThe T2 inequality has been shown by Talagrand to be satis\ufb01ed by the Gaussian distribution [31], and\nthen slightly more generally by strictly log-concave measures (see [20, p. 123], and [3].) 
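In one dimension the Gaussian case of the T2 inequality can be checked in closed form: W2(N(m, s²), N(0, 1))² = m² + (s − 1)², while H(N(m, s²) | N(0, 1)) = (s² − 1 − ln s² + m²)/2, so Talagrand's inequality W2² ≤ 2H reduces to ln s ≤ s − 1. A quick numerical sanity check (purely illustrative):

```python
import numpy as np

def w2_sq_gauss(m, s):
    """W2(N(m, s^2), N(0, 1))^2 in one dimension: m^2 + (s - 1)^2."""
    return m**2 + (s - 1.0) ** 2

def kl_gauss(m, s):
    """Relative entropy H(N(m, s^2) | N(0, 1)) = (s^2 - 1 - log s^2 + m^2) / 2."""
    return 0.5 * (s**2 - 1.0 - np.log(s**2) + m**2)

# T2 with constant 2 for the standard Gaussian: W2^2 <= 2 * H, for all (m, s).
checks = [(0.0, 0.5), (1.0, 1.0), (-2.0, 3.0)]
```

Note that for a pure translation (s = 1) the inequality is saturated, so the constant 2 cannot be improved.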
However, as\nnoted in [6], \u201ccontrary to the T1 case, there is no hope to obtain T2 inequalities from just integrability\nor decay estimates.\u201d\nStructure of this paper.\nIn this work we obtain bounds in probability (learning rates) for the\nproblem of learning a probability measure in the sense of W2. We begin by establishing (lower)\nbounds for the convergence of empirical to population measures, which serve to set up the problem\nand introduce the connection between quantization and measure learning (sec. 3.) We then describe\nhow existing unsupervised learning algorithms that compute a set (k-means, k-\ufb02ats, PCA,. . . ) can\nbe easily extended to produce a measure (sec. 4.) Due to its simplicity and widespread use, we focus\nhere on k-means. Since the two measure estimates that we consider are the empirical measure, and\nthe measure induced by k-means, we next set out to prove upper bounds on their convergence to\nthe data-generating measure (sec. 5.) We arrive at these bounds by means of intermediate measures,\nwhich are related to the problem of optimal quantization. The bounds apply in a very broad setting\n(unlike existing bounds based on transportation inequalities, they are not restricted to log-concave\nmeasures [20, 3].)\n\n3\n\n\f3 Learning probability measures, optimal transport and quantization\n\nWe address the problem of learning a probability measure \u03c1 when the only observations we have at\nour disposal are n i.i.d. samples Xn = (x1, . . . , xn). We begin by establishing some notation and\nuseful intermediate results.\nGiven a closed set S \u2286 X , let {Vq : q \u2208 S} be a Borel Voronoi partition of X composed of sets\nVq closest to each q \u2208 S, that is, such that each Vq \u2286 {x \u2208 X : (cid:107)x \u2212 q(cid:107) = minr\u2208S (cid:107)x \u2212 r(cid:107)} is\nmeasurable (see for instance [15].) Consider the projection function \u03c0S : X \u2192 S mapping each\nx \u2208 Vq to q. 
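For a finite set S, the projection π_S and the masses it induces on the Voronoi cells can be computed directly; a small sketch (function names are ours, and ties are broken by `argmin`, matching one fixed Borel Voronoi partition):

```python
import numpy as np

def project(X, S):
    """Map each row of X to the index of its nearest point in S,
    i.e. the (Borel) Voronoi cell it falls in (ties broken by argmin)."""
    d2 = ((X[:, None, :] - S[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def cell_masses(X, S):
    """Mass assigned by the empirical measure of X to each Voronoi cell of S:
    these are the weights of the pushforward measure pi_S rho_hat."""
    idx = project(X, S)
    return np.bincount(idx, minlength=len(S)) / len(X)

X = np.array([[0.0], [0.4], [0.6], [1.0]])
S = np.array([[0.0], [1.0]])
# The first two points project to S[0], the last two to S[1].
```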
By virtue of {Vq}q\u2208S being a Borel Voronoi partition, the map \u03c0S is measurable [15],\nand it is d (x, \u03c0S (x)) = minq\u2208S (cid:107)x \u2212 q(cid:107) for all x \u2208 X .\nFor any \u03c1 \u2208 Pp(M), let \u03c0S \u03c1 be the pushforward, or image measure of \u03c1 under the mapping \u03c0S ,\n\u22121\nwhich is de\ufb01ned to be (\u03c0S \u03c1)(A) := \u03c1(\u03c0\nS (A)) for all Borel measurable sets A. From its de\ufb01nition,\nit is clear that \u03c0S \u03c1 is supported on S.\nWe now establish a connection between the expected distance to a set S, and the distance between \u03c1\nand the set\u2019s induced pushforward measure. Notice that, for discrete sets S, the expected Lp distance\nto S is exactly the expected quantization error\n\nEp,\u03c1(S) := Ex\u223c\u03c1d(x, S)p = Ex\u223c\u03c1(cid:107)x \u2212 \u03c0S (x)(cid:107)p\n\nincurred when encoding points x drawn from \u03c1 by their closest point \u03c0S (x) in S [14]. This close\nconnection between optimal quantization and Wasserstein distance has been pointed out in the past\nin the statistics [28], optimal quantization [14, p. 33], and approximation theory [16] literatures.\nThe following two lemmas are key tools in the reminder of the paper. The \ufb01rst highlights the close\nlink between quantization and optimal transport.\nLemma 3.1. For closed S \u2286 X , \u03c1 \u2208 Pp(M), 1 \u2264 p < \u221e, it holds Ex\u223c\u03c1d(x, S)p = Wp(\u03c1, \u03c0S \u03c1)p.\nNote that the key element in the above lemma is that the two measures in the expression Wp(\u03c1, \u03c0S \u03c1)\nmust match. When there is a mismatch, the distance can only increase. That is, Wp(\u03c1, \u03c0S \u00b5) \u2265\nWp(\u03c1, \u03c0S \u03c1) for all \u00b5 \u2208 Pp(M). In fact, the following lemma shows that, among all the measures\nwith support in S, \u03c0S \u03c1 is closest to \u03c1.\nLemma 3.2. 
For closed S \u2286 X , and all \u00b5 \u2208 Pp(M) with supp(\u00b5) \u2286 S, 1 \u2264 p < \u221e, it holds\nWp(\u03c1, \u00b5) \u2265 Wp(\u03c1, \u03c0S \u03c1).\nWhen combined, lemmas 3.1 and 3.2 indicate that the behavior of the measure learning problem is\nlimited by the performance of the optimal quantization problem. For instance, Wp(\u03c1, \u02c6\u03c1n) can only\nbe, in the best-case, as low as the optimal quantization cost with codebook of size n. The following\nsection makes this claim precise.\n\ncan be written in this case as \u03c0X4 \u03c1 = (cid:80)4\ndouble-counting would imply(cid:80)\n\n3.1 Lower bounds\nConsider the situation depicted in \ufb01g. 1, in which a sample X4 = {x1, x2, x3, x4} is drawn from\na distribution \u03c1 which we assume here to be absolutely continuous on its support. As shown, the\nprojection map \u03c0X4\nsends points x to their closest point in X4. The resulting Voronoi decomposition\nof supp(\u03c1) is drawn in shades of blue. By lemma 5.2 of [9], the pairwise intersections of Voronoi\nregions have null ambient measure, and since \u03c1 is absolutely continuous, the pushforward measure\nj=1 \u03c1(Vxj )\u03b4xj , where Vxj is the Voronoi region of xj.\nNote that, even for \ufb01nite sets S, this particular decomposition is not always possible if the {Vq}q\u2208S\nform a Borel Voronoi tiling, instead of a Borel Voronoi partition. If, for instance, \u03c1 has an atom\nfalling on two Voronoi regions in a tiling, then both regions would count the atom as theirs, and\nq \u03c1(Vq) > 1. The technicalities required to correctly de\ufb01ne a Borel\nVoronoi partition are such that, in general, it is simpler to write \u03c0S\u03c1, even though (if S is discrete)\nthis measure can clearly be written as a sum of deltas with appropriate masses.\nBy lemma 3.1, the distance Wp(\u03c1, \u03c0X4 \u03c1)p is the (expected) quantization cost of \u03c1 when using X4\nas codebook. 
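Lemma 3.1 can be verified numerically in one dimension with p = 1, where the optimal transport cost between discrete measures is available exactly (here via SciPy; the toy data is ours): the mean distance to S coincides with W1(ρ̂n, π_S ρ̂n).

```python
import numpy as np
from scipy.stats import wasserstein_distance

x = np.array([0.0, 0.2, 0.8, 1.0])   # sample defining rho_hat_n
S = np.array([0.1, 0.9])             # a two-point codebook

d_to_S = np.abs(x[:, None] - S[None, :]).min(axis=1)
lhs = d_to_S.mean()                  # E_{x ~ rho_hat_n} d(x, S), here 0.1

# Pushforward pi_S rho_hat_n: atoms at S with the Voronoi-cell masses.
idx = np.abs(x[:, None] - S[None, :]).argmin(axis=1)
masses = np.bincount(idx, minlength=len(S)) / len(x)
rhs = wasserstein_distance(x, S, np.full(len(x), 1 / len(x)), masses)
# Lemma 3.1 (p = 1): lhs == rhs.
```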
Clearly, this cost can never be lower than the optimal quantization cost of size 4. This\nreasoning leads to the following lower bound between empirical and population measures.\n\n4\n\n\fTheorem 3.3. For \u03c1 \u2208 Pp(M) with absolutely continuous part \u03c1A (cid:54)= 0, and 1 \u2264 p < \u221e, it holds\nWp(\u03c1, \u02c6\u03c1n) = \u2126(n\u22121/d) uniformly over \u02c6\u03c1n, where the constants depend on d and \u03c1A only.\nProof: Let Vn,p(\u03c1) := inf S\u2282M,|S|=n Ex\u223c\u03c1d(x, S)p be the optimal quantization cost of \u03c1 of order\np with n centers. Since \u03c1A (cid:54)= 0, and since \u03c1 has a \ufb01nite (p + \u03b4)-th order moment, for some \u03b4 > 0\n(since it is supported on the unit ball), then it is Vn,p(\u03c1) = \u0398(n\u2212p/d), with constants depending on\nd and \u03c1A (see [14, p. 78] and [16].) Since supp(\u02c6\u03c1n) = Xn, it follows that\n\nWp(\u03c1, \u02c6\u03c1n)p \u2265\n\nlemma 3.2\n\nWp(\u03c1, \u03c0Xn \u03c1)p =\n\nlemma 3.1\n\nEx\u223c\u03c1d(x, Xn)p \u2265 Vn,p(\u03c1) = \u0398(n\n\n\u2212p/d)\n\nNote that the bound of theorem 3.3 holds for \u02c6\u03c1n derived from any sample Xn, and is therefore\nstronger than the existing lower bounds on the convergence rates of EWp(\u03c1, \u02c6\u03c1n) \u2192 0. In particular,\nit trivially induces the known lower bound \u2126(n\u22121/d) on the rate of convergence in expectation.\n\n4 Unsupervised learning algorithms for learning a probability measure\n\netc. Performance is measured by the empirical quantity n\u22121(cid:80)n\n\nAs described in [21], several of the most widely used unsupervised learning algorithms can be\ninterpreted to take as input a sample Xn and output a set \u02c6Sk, where k is typically a free parameter\nof the algorithm, such as the number of means in k-means1, the dimension of af\ufb01ne spaces in PCA,\ni=1 d(xi, \u02c6Sk)2, which is minimized\namong all sets in some class (e.g. sets of size k, af\ufb01ne spaces of dimension k,. . . 
) This formulation is\ngeneral enough to encompass k-means and PCA, but also k-\ufb02ats, non-negative matrix factorization,\nand sparse coding (see [21] and references therein.)\nUsing the discussion of Sec. 3, we can establish a clear connection between unsupervised learning\nand the problem of learning probability measures with respect to W2. Consider as a running example\nthe k-means problem, though the argument is general. Given an input Xn, the k-means problem is\nto \ufb01nd a set | \u02c6Sk| = k minimizing its average distance from points in Xn. By associating to \u02c6Sk the\npushforward measure \u03c0 \u02c6Sk\n\n\u02c6\u03c1n, we \ufb01nd that\n\nn(cid:88)\n\ni=1\n\n1\nn\n\nd(xi, \u02c6Sk)2 = Ex\u223c \u02c6\u03c1n d(x, \u02c6Sk)2 =\n\nlemma 3.1\n\nW2(\u02c6\u03c1n, \u03c0 \u02c6Sk\n\n\u02c6\u03c1n)2.\n\n(2)\n\nSince k-means minimizes equation 2, it also \ufb01nds the measure that is closest to \u02c6\u03c1n, among those\nwith support of size k. This connection between k-means and W2 measure approximation was, to\nthe best of the authors\u2019 knowledge, \ufb01rst suggested by Pollard [28] though, as mentioned earlier, the\nargument carries over to many other unsupervised learning algorithms.\n\nUnsupervised measure learning algorithms. We brie\ufb02y clarify the steps involved in using an\nexisting unsupervised learning algorithm for probability measure learning. Let Uk be a parametrized\nalgorithm (e.g. k-means) that takes a sample Xn and outputs a set Uk(Xn). The measure learning\nalgorithm Ak : Mn \u2192 Pp(M) corresponding to Uk is de\ufb01ned as follows:\n\n1. Ak takes a sample Xn and outputs the measure \u03c0 \u02c6Sk\n2. since \u02c6\u03c1n is discrete, then so must \u03c0 \u02c6Sk\n3. in practice, we can simply store an n-vector\n\n(cid:104)\n\n\u02c6\u03c1n be, and thus Ak(Xn) = 1\n\n\u02c6\u03c1n, supported on \u02c6Sk = Uk(Xn);\n(xi);\n\nn\n\n\u03c0 \u02c6Sk\n\n(x1), . . . 
, \u03c0 \u02c6Sk\n\n(xn)\n\n(cid:80)n\n(cid:105)\n\ni=1 \u03b4\u03c0 \u02c6Sk\n, from which Ak(Xn)\n\ncan be reconstructed by placing atoms of mass 1/n at each point.\n\nIn the case that Uk is the k-means algorithm, only k points and k masses need to be stored.\nNote that any algorithm A(cid:48) that attempts to output a measure A(cid:48)(Xn) close to \u02c6\u03c1n can be cast in the\nabove framework. Indeed, if S(cid:48) is the support of A(cid:48)(Xn) then, by lemma 3.2, \u03c0S(cid:48) \u02c6\u03c1n is the measure\nclosest to \u02c6\u03c1n with support in S(cid:48). This effectively reduces the problem of learning a measure to that of\n1In a slight abuse of notation, we refer to the k-means algorithm here as an ideal algorithm that solves the\n\nk-means problem, even though in practice an approximation algorithm may be used.\n\n5\n\n\f\ufb01nding a set, and is akin to how the fact that every optimal quantizer is a nearest-neighbor quantizer\n(see [15], [12, p. 350], and [14, p. 37\u201338]) reduces the problem of \ufb01nding an optimal quantizer to\nthat of \ufb01nding an optimal quantizing set.\nClearly, the minimum of equation 2 over sets of size k (the output of k-means) is monotonically\n\u02c6\u03c1n = \u02c6\u03c1n, it is Ex\u223c \u02c6\u03c1n d(x, \u02c6Sn)2 =\nnon-increasing with k. In particular, since \u02c6Sn = Xn and \u03c0 \u02c6Sn\n\u02c6\u03c1n)2 = 0. That is, we can always make the learned measure arbitrarily close to \u02c6\u03c1n\nW2(\u02c6\u03c1n, \u03c0 \u02c6Sn\nby increasing k. However, as pointed out in Sec. 2, the problem of measure learning is concerned\nwith minimizing the 2-Wasserstein distance W2(\u03c1, \u03c0 \u02c6Sk\n\u02c6\u03c1n) to the data-generating measure. 
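The measure learning algorithm A_k above can be sketched with a bare-bones Lloyd iteration standing in for the ideal k-means solver of footnote 1 (so in general the centers below are only approximate minimizers; on this well-separated toy input they are exact, and the sketch assumes no cluster empties out):

```python
import numpy as np

def lloyd(X, centers, iters=50):
    """A few Lloyd iterations: a stand-in for an (ideal) k-means solver.
    Assumes no cluster becomes empty along the way."""
    for _ in range(iters):
        idx = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        centers = np.stack([X[idx == j].mean(axis=0) for j in range(len(centers))])
    return centers, idx

X = np.array([[0.0], [0.1], [1.0], [1.1]])          # sample X_n
centers, idx = lloyd(X, np.array([[0.0], [1.0]]))   # S_hat_k with k = 2
masses = np.bincount(idx, minlength=len(centers)) / len(X)
# A_k(X_n) puts mass masses[j] at centers[j]; by equation 2, the k-means
# objective below equals W2(rho_hat_n, pi_{S_hat_k} rho_hat_n)^2.
cost = ((X - centers[idx]) ** 2).sum(axis=1).mean()
```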
The\nactual performance of k-means is thus not necessarily guaranteed to behave in the same way as the\nempirical one, and the question of characterizing its behavior as a function of k and n naturally\narises.\nFinally, we note that, while it is Ex\u223c \u02c6\u03c1nd(x, \u02c6Sk)2 = W2(\u02c6\u03c1n, \u03c0 \u02c6Sk\n\u02c6\u03c1n)2 (the empirical performances\nare the same in the optimal quantization, and measure learning problem formulations), the actual\nperformances satisfy\n\nEx\u223c\u03c1d(x, \u02c6Sk)2 =\n\nlemma 3.1\n\nW2(\u03c1, \u03c0 \u02c6Sk\n\n\u03c1)2 \u2264\n\nlemma 3.2\n\nW2(\u03c1, \u03c0 \u02c6Sk\n\n\u02c6\u03c1n)2,\n\n1 \u2264 k \u2264 n.\n\nConsequently, with the identi\ufb01cation between sets S and measures \u03c0S \u02c6\u03c1n, the measure learning\nproblem is, in general, harder than the set-approximation problem (for example, if M = Rd and \u03c1\nis absolutely continuous over a set of non-null volume, it is not hard to show that the inequality is\nalmost surely strict: Ex\u223c\u03c1d(x, \u02c6Sk)2 < W2(\u03c1, \u03c0 \u02c6Sk\nIn the remainder, we characterize the performance of k-means on the measure learning problem, for\nvarying k, n. Although other unsupervised learning algorithms could have been chosen as basis for\nour analysis, k-means is one of the oldest and most widely used, and the one for which the deep\nconnection between optimal quantization and measure approximation is most clearly manifested.\nNote that, by setting k = n, our analysis includes the problem of characterizing the behavior of\nthe distance W2(\u03c1, \u02c6\u03c1n) between empirical and population measures which, as indicated in Sec. 2.1,\nis a fundamental question in statistics (i.e. 
the speed of convergence of empirical to population\nmeasures.)\n\n\u02c6\u03c1n)2 for 1 < k < n.)\n\n5 Learning rates\n\nIn order to analyze the performance of k-means as a measure learning algorithm, and the conver-\ngence of empirical to population measures, we propose the decomposition shown in \ufb01g. 2. The\ndiagram includes all the measures considered in the paper, and shows the two decompositions used\nto prove upper bounds. The upper arrow (green), illustrates the decomposition used to bound the dis-\ntance W2(\u03c1, \u02c6\u03c1n). This decomposition uses the measures \u03c0Sk \u03c1 and \u03c0Sk \u02c6\u03c1n as intermediates to arrive\nat \u02c6\u03c1n, where Sk is a k-point optimal quantizer of \u03c1, that is, a set Sk minimizing Ex\u223c\u03c1d(x, S)2 over\nall sets of size |S| = k. The lower arrow (blue) corresponds to the decomposition of W2(\u03c1, \u03c0 \u02c6Sk\n\u02c6\u03c1n)\n(the performance of k-means), whereas the labelled black arrows correspond to individual terms in\nthe bounds. We begin with the (slightly) simpler of the two results.\n\n5.1 Convergence rates for the empirical to population measures\n\nLet Sk be the optimal k-point quantizer of \u03c1 of order two [14, p. 31]. By the triangle inequality and\nthe identity (a + b + c)2 \u2264 3(a2 + b2 + c2), it follows that\n\nW2(\u03c1, \u02c6\u03c1n)2 \u2264 3(cid:2)W2(\u03c1, \u03c0Sk \u03c1)2 + W2(\u03c0Sk \u03c1, \u03c0Sk \u02c6\u03c1n)2 + W2(\u03c0Sk \u02c6\u03c1n, \u02c6\u03c1n)2(cid:3) .\n\n(3)\n\nThis is the decomposition depicted in the upper arrow of \ufb01g. 2.\nBy lemma 3.1, the \ufb01rst term in the sum of equation 3 is the optimal k-point quantization error of\n\u03c1 over a d-manifold M which, using recent techniques from [16] (see also [17, p. 491]), is shown\nin the proof of theorem 5.1 (part a) to be of order \u0398(k\u22122/d). 
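For intuition on the Θ(k^{−2/d}) order: with d = 1 and ρ uniform on [0, 1], the optimal k-point quantizer consists of the midpoints of k equal subintervals, and the quantization cost is exactly 1/(12k²). A quick numerical check (the helper name is ours):

```python
import numpy as np

def uniform_quantization_cost(k, grid=200_000):
    """E_{x ~ U[0,1]} d(x, S_k)^2 for the midpoint codebook
    S_k = {(2j - 1) / (2k) : j = 1..k}, via a midpoint Riemann sum."""
    S = (2 * np.arange(1, k + 1) - 1) / (2 * k)
    x = (np.arange(grid) + 0.5) / grid
    d2 = np.abs(x[:, None] - S[None, :]).min(axis=1) ** 2
    return d2.mean()

# Matches the closed form 1/(12 k^2), i.e. Theta(k^(-2/d)) with d = 1.
```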
The remaining terms, b) and c), are\nslightly more technical and are bounded in the proof of theorem 5.1.\nSince equation 3 holds for all 1 \u2264 k \u2264 n, the best bound on W2(\u03c1, \u02c6\u03c1n) can be obtained by optimiz-\ning the right-hand side over all possible values of k, resulting in the following probabilistic bound\nfor the rate of convergence of the empirical to population measures.\n\n6\n\n\fFigure 1: A sample {x1, x2, x3, x4} is\ndrawn from a distribution \u03c1 with support in\nsupp \u03c1. The projection map \u03c0{x1,x2,x3,x4}\nsends points x to their closest one in the sam-\nple. The induced Voronoi tiling is shown in\nshades of blue.\n\nFigure 2: The measures considered in this paper\nare linked by arrows for which upper bounds for\ntheir distance are derived. Bounds for the quan-\ntities of interest W2(\u03c1, \u02c6\u03c1n)2, and W2(\u03c1, \u03c0 \u02c6Sk\n\u02c6\u03c1n)2,\nare decomposed by following the top and bottom\ncolored arrows.\n\nTheorem 5.1. Given \u03c1 \u2208 Pp(M) with absolutely continuous part \u03c1A (cid:54)= 0, suf\ufb01ciently large n, and\n\u03c4 > 0, it holds\n\nW2(\u03c1, \u02c6\u03c1n) \u2264 C \u00b7 m(\u03c1A) \u00b7 n\n\n\u22121/(2d+4) \u00b7 \u03c4,\n\nwith probability 1 \u2212 e\n\n\u2212\u03c4 2\n\n.\n\nwhere m(\u03c1A) :=(cid:82)\n\nM \u03c1A(x)d/(d+2)d\u03bbM(x), and C depends only on d.\n\n5.2 Learning rates of k-means\n\nThe key element in the proof of theorem 5.1 is that the distance between population and empirical\nmeasures can be bounded by choosing an intermediate optimal quantizing measure of an appropriate\nsize k. In the analysis, the best bounds are obtained for k smaller than n. 
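To get a feel for how much smaller than n this choice of k is: Theorem 5.2 below prescribes k of order n^{d/(2d+4)}. Setting the distribution-dependent constant C · m(ρ_A) to 1 purely for illustration (the helper is hypothetical, not from the paper):

```python
def suggested_k(n, d):
    """k = Theta(n^(d / (2d + 4))), the order prescribed by Theorem 5.2,
    with the distribution-dependent constant C * m(rho_A) set to 1."""
    return max(1, round(n ** (d / (2 * d + 4))))

# For d = 2 the exponent is 2/8 = 1/4: a million samples call for
# only a few dozen centers, so k is far smaller than n.
```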
If the output of k-means\nis close to an optimal quantizer (for instance if suf\ufb01cient data is available), then we would similarly\nexpect that the best bounds for k-means correspond to a choice of k < n.\nThe decomposition of the bottom (blue) arrow in \ufb01gure 2 leads to the following bound in probability.\nTheorem 5.2. Given \u03c1 \u2208 Pp(M) with absolutely continuous part \u03c1A (cid:54)= 0, and \u03c4 > 0, then for all\nsuf\ufb01ciently large n, and letting\n\nk = C \u00b7 m(\u03c1A) \u00b7 nd/(2d+4),\n\nit holds\n\nwhere m(\u03c1A) :=(cid:82)\n\nW2(\u03c1, \u03c0 \u02c6Sk\n\n\u02c6\u03c1n) \u2264 C \u00b7 m(\u03c1A) \u00b7 n\n\n\u22121/(2d+4) \u00b7 \u03c4,\n\nwith probability 1 \u2212 e\n\n\u2212\u03c4 2\n\n.\n\nM \u03c1A(x)d/(d+2)d\u03bbM(x), and C depends only on d.\n\nNote that the upper bounds in theorem 5.1 and 5.2 are exactly the same. Although this may appear\nsurprising, it stems from the following fact. Since S = \u02c6Sk is a minimizer of W2(\u03c0S \u02c6\u03c1n, \u02c6\u03c1n)2, the\nbound d) of \ufb01gure 2 satis\ufb01es:\n\nW2(\u03c0 \u02c6Sk\n\n\u02c6\u03c1n, \u02c6\u03c1n)2 \u2264 W2(\u03c0Sk \u02c6\u03c1n, \u02c6\u03c1n)2\n\nand therefore (by the de\ufb01nition of c), the term d) is of the same order as c). It follows then that\nadding term d) to the bound only affects the constants, but otherwise leaves it unchanged. 
Since d) is the term that takes the output measure of k-means to the empirical measure, this implies that the rate of convergence of k-means (for suitably chosen k) cannot be worse than that of ρ̂n → ρ. Conversely, bounds for ρ̂n → ρ are obtained from the best rates of convergence of optimal quantizers, whose convergence to ρ cannot be slower than that of k-means (since the quantizers that k-means produces are suboptimal.)
Since the bounds obtained for the convergence of ρ̂n → ρ are the same as those for k-means with k of order k = Θ(n^{d/(2d+4)}), this suggests that estimates of ρ that are as accurate as those derived from an n point-mass measure ρ̂n can be derived from k point-mass measures with k ≪ n.
Finally, we note that the introduced bounds are currently limited by the statistical bound

sup_{|S|=k} |W2(π_S ρ̂n, ρ̂n)² − W2(π_S ρ, ρ)²| = sup_{|S|=k} |E_{x∼ρ̂n} d(x, S)² − E_{x∼ρ} d(x, S)²|    (4)

(where the equality follows from lemma 3.1; see for instance [21]), for which non-matching lower bounds are known. This means that, if better upper bounds can be obtained for equation 4, then both bounds in theorems 5.1 and 5.2 would automatically improve (would become closer to the lower bound.)

References
[1] M. Ajtai, J. Komlós, and G. Tusnády. On optimal matchings. Combinatorica, 4:259–264, 1984.
[2] Franck Barthe and Charles Bordenave. Combinatorial optimization over two random point sets. Technical Report arXiv:1103.2734, Mar 2011.
[3] Gordon Blower. The Gaussian isoperimetric inequality and transportation. Positivity, 7:203–224, 2003.
[4] S. G. Bobkov and F. Götze. 
Exponential integrability and transportation cost related to logarithmic Sobolev inequalities. Journal of Functional Analysis, 163(1):1–28, April 1999.

[5] Emmanuel Boissard. Simple bounds for the convergence of empirical and occupation measures in 1-Wasserstein distance. Electron. J. Probab., 16(83):2296–2333, 2011.

[6] F. Bolley, A. Guillin, and C. Villani. Quantitative concentration inequalities for empirical measures on non-compact spaces. Probability Theory and Related Fields, 137(3):541–593, 2007.

[7] F. Bolley and C. Villani. Weighted Csiszár-Kullback-Pinsker inequalities and applications to transportation inequalities. Annales de la Faculté des Sciences de Toulouse, 14(3):331–352, 2005.

[8] Claire Caillerie, Frédéric Chazal, Jérôme Dedecker, and Bertrand Michel. Deconvolution for the Wasserstein metric and geometric inference. Rapport de recherche RR-7678, INRIA, July 2011.

[9] Kenneth L. Clarkson. Building triangulations using ε-nets. In Proceedings of the thirty-eighth annual ACM symposium on Theory of computing, STOC '06, pages 326–335, New York, NY, USA, 2006. ACM.

[10] Luc Devroye and Gábor Lugosi. Combinatorial methods in density estimation. Springer Series in Statistics. Springer-Verlag, New York, 2001.

[11] V. Dobrić and J. Yukich. Asymptotics for transportation cost in high dimensions. Journal of Theoretical Probability, 8:97–118, 1995.

[12] A. Gersho and R.M. Gray. Vector Quantization and Signal Compression. Kluwer International Series in Engineering and Computer Science. Kluwer Academic Publishers, 1992.

[13] Alison L. Gibbs and Francis E. Su. On choosing and bounding probability metrics. International Statistical Review, 70:419–435, 2002.

[14] Siegfried Graf and Harald Luschgy. Foundations of quantization for probability distributions.
Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2000.

[15] Siegfried Graf, Harald Luschgy, and Gilles Pagès. Distortion mismatch in the quantization of probability measures. ESAIM: Probability and Statistics, 12:127–153, 2008.

[16] Peter M. Gruber. Optimum quantization and its applications. Adv. Math., 186:456–497, 2004.

[17] P.M. Gruber. Convex and discrete geometry. Grundlehren der mathematischen Wissenschaften. Springer, 2007.

[18] Guillermo Henry and Daniela Rodriguez. Kernel density estimation on Riemannian manifolds: Asymptotic results. J. Math. Imaging Vis., 34(3):235–239, July 2009.

[19] Joseph Horowitz and Rajeeva L. Karandikar. Mean rates of convergence of empirical measures in the Wasserstein metric. J. Comput. Appl. Math., 55(3):261–273, November 1994.

[20] M. Ledoux. The Concentration of Measure Phenomenon. Mathematical Surveys and Monographs. American Mathematical Society, 2001.

[21] A. Maurer and M. Pontil. K-dimensional coding schemes in Hilbert spaces. IEEE Transactions on Information Theory, 56(11):5839–5846, November 2010.

[22] Yann Ollivier. Ricci curvature of Markov chains on metric spaces. J. Funct. Anal., 256(3):810–864, 2009.

[23] Arkadas Ozakin and Alexander Gray. Submanifold density estimation. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1375–1382. 2009.

[24] C. Papadimitriou. The probabilistic analysis of matching heuristics. In Proc. of the 15th Allerton Conf. on Communication, Control and Computing, pages 368–378, 1978.

[25] Bruno Pelletier. Kernel density estimation on Riemannian manifolds. Statist. Probab. Lett., 73(3):297–304, 2005.

[26] Xavier Pennec. Intrinsic statistics on Riemannian manifolds: Basic tools for geometric measurements. J. Math. Imaging Vis., 25(1):127–154, July 2006.

[27] M. S. Pinsker.
Information and information stability of random variables and processes. San Francisco: Holden-Day, 1964.

[28] David Pollard. Quantization and the method of k-means. IEEE Transactions on Information Theory, 28(2):199–204, 1982.

[29] S.T. Rachev. Probability metrics and the stability of stochastic models. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. Wiley, 1991.

[30] J.M. Steele. Probability Theory and Combinatorial Optimization. CBMS-NSF Regional Conference Series in Applied Mathematics. Society for Industrial and Applied Mathematics, 1997.

[31] M. Talagrand. Transportation cost for Gaussian and other product measures. Geometric and Functional Analysis, 6:587–600, 1996.

[32] Alexandre B. Tsybakov. Introduction to nonparametric estimation. Springer Series in Statistics. Springer, New York, 2009. Revised and extended from the 2004 French original, translated by Vladimir Zaiats.

[33] A.W. van der Vaart and J.A. Wellner. Weak Convergence and Empirical Processes. Springer Series in Statistics. Springer, 1996.

[34] V. S. Varadarajan. On the convergence of sample probability distributions. Sankhyā: The Indian Journal of Statistics, 19(1/2):23–26, February 1958.

[35] C. Villani. Optimal Transport: Old and New. Grundlehren der Mathematischen Wissenschaften. Springer, 2009.

[36] P. Vincent and Y. Bengio. Manifold Parzen Windows. In Advances in Neural Information Processing Systems 15, pages 849–856. 2003.