{"title": "Asymptotic Guarantees for Learning Generative Models with the Sliced-Wasserstein Distance", "book": "Advances in Neural Information Processing Systems", "page_first": 250, "page_last": 260, "abstract": "Minimum expected distance estimation (MEDE) algorithms have been widely used for probabilistic models with intractable likelihood functions and they have become increasingly popular due to their use in implicit generative modeling (e.g.\\ Wasserstein generative adversarial networks, Wasserstein autoencoders). Emerging from computational optimal transport, the Sliced-Wasserstein (SW) distance has become a popular choice in MEDE thanks to its simplicity and computational benefits. While several studies have reported empirical success on generative modeling with SW, the theoretical properties of such estimators have not yet been established. In this study, we investigate the asymptotic properties of estimators that are obtained by minimizing SW. We first show that convergence in SW implies weak convergence of probability measures in general Wasserstein spaces. Then we show that estimators obtained by minimizing SW (and also an approximate version of SW) are asymptotically consistent. We finally prove a central limit theorem, which characterizes the asymptotic distribution of the estimators and establish a convergence rate of $\\sqrt{n}$, where $n$ denotes the number of observed data points. 
We illustrate the validity of our theory on both synthetic data and neural networks.", "full_text": "Asymptotic Guarantees for Learning Generative Models with the Sliced-Wasserstein Distance\n\nKimia Nadjahi1, Alain Durmus2, Umut Şimşekli1,3, Roland Badeau1\n\n1: LTCI, Télécom Paris, Institut Polytechnique de Paris, France\n2: CMLA, ENS Cachan, CNRS, Université Paris-Saclay, France\n3: Department of Statistics, University of Oxford, UK\n\n{kimia.nadjahi, umut.simsekli, roland.badeau}@telecom-paris.fr\nalain.durmus@cmla.ens-cachan.fr\n\nAbstract\n\nMinimum expected distance estimation (MEDE) algorithms have been widely used for probabilistic models with intractable likelihood functions and they have become increasingly popular due to their use in implicit generative modeling (e.g. Wasserstein generative adversarial networks, Wasserstein autoencoders). Emerging from computational optimal transport, the Sliced-Wasserstein (SW) distance has become a popular choice in MEDE thanks to its simplicity and computational benefits. While several studies have reported empirical success on generative modeling with SW, the theoretical properties of such estimators have not yet been established. In this study, we investigate the asymptotic properties of estimators that are obtained by minimizing SW. We first show that convergence in SW implies weak convergence of probability measures in general Wasserstein spaces. Then we show that estimators obtained by minimizing SW (and also an approximate version of SW) are asymptotically consistent. We finally prove a central limit theorem, which characterizes the asymptotic distribution of the estimators and establishes a convergence rate of √n, where n denotes the number of observed data points. 
We illustrate the validity of our theory on both synthetic data and neural networks.\n\n1 Introduction\n\nMinimum distance estimation (MDE) is a generalization of maximum-likelihood inference, where the goal is to minimize a distance between the empirical distribution of a set of independent and identically distributed (i.i.d.) observations Y1:n = (Y1, . . . , Yn) and a family of distributions indexed by a parameter θ. The problem is formally defined as follows [1, 2]:\n\nˆθn = argmin_{θ∈Θ} D(ˆµn, µθ) ,  (1)\n\nwhere D denotes a distance (or a divergence in general) between probability measures, µθ denotes a probability measure indexed by θ, Θ denotes the parameter space, and\n\nˆµn = (1/n) ∑_{i=1}^{n} δ_{Yi}  (2)\n\ndenotes the empirical measure of Y1:n, with δY being the Dirac distribution with mass on the point Y. When D is chosen as the Kullback-Leibler divergence, this formulation coincides with maximum likelihood estimation (MLE) [2].\nWhile MDE provides a fruitful framework for statistical inference, when working with generative models, solving the optimization problem in (1) might be intractable since it might be impossible to evaluate the probability density function associated with µθ. Nevertheless, in various settings, even if the density is not available, one can still generate samples from the distribution µθ, and such samples turn out to be useful for making inference.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada. 
More precisely, under such settings, a natural alternative to (1) is the minimum expected distance estimator, which is defined as follows [3]:\n\nˆθn,m = argmin_{θ∈Θ} E[D(ˆµn, ˆµθ,m) | Y1:n] .  (3)\n\nHere,\n\nˆµθ,m = (1/m) ∑_{i=1}^{m} δ_{Zi}  (4)\n\ndenotes the empirical distribution of Z1:m, a sequence of i.i.d. random variables with distribution µθ. This algorithmic framework has computationally favorable properties since one can replace the expectation with a simple Monte Carlo average in practical applications.\nIn the context of MDE, distances that are based on optimal transport (OT) have become increasingly popular due to their computational and theoretical properties [4, 5, 6, 7, 8]. For instance, if we replace the distance D in (3) with the Wasserstein distance (defined in Section 2 below), we obtain the minimum expected Wasserstein estimator [3]. In the classical statistical inference setting, the typical use of such an estimator is to infer the parameters of a measure whose density does not admit an analytical closed-form formula [2]. On the other hand, in the implicit generative modeling (IGM) setting, this estimator forms the basis of two popular IGM strategies: Wasserstein generative adversarial networks (GAN) [4] and Wasserstein variational auto-encoders (VAE) [5] (cf. [9] for their relation). The goal of these two methods is to find the best parametric transport map Tθ, such that Tθ transforms a simple distribution µ (e.g. 
standard Gaussian or uniform) to a potentially complicated data distribution ˆµn by minimizing the Wasserstein distance between the transported distribution µθ = Tθ♯µ and ˆµn, where ♯ denotes the push-forward operator, to be defined in the next section. In practice, Tθ is typically chosen to be a neural network, for which it is often impossible to evaluate the induced density µθ. However, one can easily generate samples from µθ by first generating a sample from µ and then applying Tθ to that sample, making minimum expected distance estimation (3) feasible in this setting. Motivated by its practical success, the theoretical properties of this estimator have recently come under investigation [10, 11], and very recently Bernton et al. [3] have established the consistency (for the general setting) and the asymptotic distribution (for the one-dimensional setting) of this estimator.\nEven though estimation with the Wasserstein distance has served as a fertile ground for many generative modeling applications, except for the case when the measures are supported on R, the computational complexity of minimum Wasserstein estimators rapidly becomes excessive with increasing problem dimension, and developing accurate and efficient approximations is a highly non-trivial task. Therefore, there have been several attempts to use more practical alternatives to the Wasserstein distance [12, 6]. In this context, the Sliced-Wasserstein (SW) distance [13, 14, 15] has become an increasingly popular alternative to the Wasserstein distance: it is defined as an average of one-dimensional Wasserstein distances, which allows it to be computed efficiently. While several studies have reported empirical success on generative modeling with SW [16, 17, 18, 19], the theoretical properties of such estimators have not yet been fully established. 
Bonnotte [14] proved that SW is a proper metric and that, on compact domains, SW is equivalent to the Wasserstein distance; hence convergence in SW implies weak convergence on compact domains. [14] also analyzed the gradient flows based on SW, which then served as a basis for a recently proposed IGM algorithm [18]. Finally, recent studies [16, 20] investigated the sample complexity of SW and established bounds on the SW distance between two measures and their empirical instantiations.\nIn this paper, we investigate the asymptotic properties of the estimators given in (1) and (3) when D is replaced with the SW distance. We first prove that convergence in SW implies weak convergence of probability measures defined on general domains, which generalizes the results given in [14]. Then, by using techniques similar to the ones given in [3], we show that the estimators defined by (1) and (3) are consistent, meaning that, as the number of observations n increases, the estimates get closer to the data-generating parameters. We finally prove a central limit theorem (CLT) in the multidimensional setting, which characterizes the asymptotic distribution of these estimators and establishes a convergence rate of √n. The CLT that we prove is stronger than the one given in [3] in the sense that it is not restricted to the one-dimensional setting.\nWe support our theory with experiments that are conducted on both synthetic and real data. We first consider a more classical statistical inference setting, with a Gaussian model and a multidimensional α-stable model whose density is not available in closed form. In both models, the experiments validate our consistency and CLT results. 
We further observe that, especially for high-dimensional problems, the estimators obtained by minimizing SW have significantly better computational properties than the ones obtained by minimizing the Wasserstein distance, as expected. In the IGM setting, we consider the neural network-based generative modeling algorithm proposed in [16] and show that our results also hold in the real-data setting.\n\n2 Preliminaries and Technical Background\n\nWe consider a probability space (Ω, F, P) with associated expectation operator E, on which all the random variables are defined. Let (Yk)k∈N be a sequence of random variables associated with observations, where each observation takes value in Y ⊂ R^d. We assume that these observations are i.i.d. according to µ⋆ ∈ P(Y), where P(Y) stands for the set of probability measures on Y.\nA statistical model is a family of distributions on Y and is denoted by M = {µθ ∈ P(Y), θ ∈ Θ}, where Θ ⊂ R^{dθ} is the parameter space. In this paper, we focus on parameter inference for purely generative models: for all θ ∈ Θ, we can generate i.i.d. samples (Zk)k∈N∗ ∈ Y^{N∗} from µθ, but the associated likelihood is numerically intractable. In the sequel, (Zk)k∈N∗ denotes an i.i.d. sequence from µθ with θ ∈ Θ, and for any m ∈ N∗, ˆµθ,m = (1/m) ∑_{i=1}^{m} δ_{Zi} denotes the corresponding empirical distribution.\nThroughout our study, we assume that the following conditions hold: (1) Y, endowed with the Euclidean distance ρ, is a Polish space, (2) Θ, endowed with the distance ρΘ, is a Polish space, (3) Θ is a σ-compact space, i.e. the union of countably many compact subspaces, and (4) the parameters are identifiable, i.e. µθ = µθ′ implies θ = θ′. 
We endow P(Y) with the Lévy-Prokhorov distance dP, which metrizes weak convergence by [21, Theorem 6.8] since Y is assumed to be a Polish space. We denote by B(Y) the Borel σ-field of (Y, ρ).\nWasserstein distance. For p ≥ 1, we denote by Pp(Y) the set of probability measures on Y with finite p-th moment: Pp(Y) = {µ ∈ P(Y) : ∫_Y ‖y − y0‖^p dµ(y) < +∞, for some y0 ∈ Y}. The Wasserstein distance of order p between any µ, ν ∈ Pp(Y) is defined by [22]\n\nWp^p(µ, ν) = inf_{γ∈Γ(µ,ν)} ∫_{Y×Y} ‖x − y‖^p dγ(x, y) ,  (5)\n\nwhere Γ(µ, ν) is the set of probability measures γ on (Y × Y, B(Y) ⊗ B(Y)) satisfying γ(A × Y) = µ(A) and γ(Y × A) = ν(A) for any A ∈ B(Y). The space Pp(Y) endowed with the distance Wp is a Polish space by [22, Theorem 6.18] since (Y, ρ) is assumed to be Polish.\nThe one-dimensional case is a favorable scenario for which computing the Wasserstein distance of order p between µ, ν ∈ Pp(R) becomes relatively easy, since it has the closed-form formula [23, Theorem 3.1.2.(a)]:\n\nWp^p(µ, ν) = ∫_0^1 |Fµ^{-1}(t) − Fν^{-1}(t)|^p dt = ∫_R |s − Fν^{-1}(Fµ(s))|^p dµ(s) ,  (6)\n\nwhere Fµ and Fν denote the cumulative distribution functions (CDF) of µ and ν respectively, and Fµ^{-1} and Fν^{-1} are the corresponding quantile functions. For empirical distributions, (6) is calculated by simply sorting the n samples drawn from each distribution and computing the average cost between the sorted samples.\nSliced-Wasserstein distance. 
The analytical form of the Wasserstein distance for one-dimensional distributions is an attractive property that gives rise to an alternative metric referred to as the Sliced-Wasserstein (SW) distance [13, 15]. The idea behind SW is to first obtain a family of one-dimensional representations of a higher-dimensional probability distribution through linear projections, and then compute the average of the Wasserstein distances between these one-dimensional representations.\nMore formally, let S^{d−1} = {u ∈ R^d : ‖u‖ = 1} be the d-dimensional unit sphere, and denote by ⟨·,·⟩ the Euclidean inner product. For any u ∈ S^{d−1}, we define u⋆, the linear form associated with u, by u⋆(y) = ⟨u, y⟩ for any y ∈ Y. The Sliced-Wasserstein distance of order p is defined for any µ, ν ∈ Pp(Y) as\n\nSWp^p(µ, ν) = ∫_{S^{d−1}} Wp^p(u⋆♯µ, u⋆♯ν) dσ(u) ,  (7)\n\nwhere σ is the uniform distribution on S^{d−1} and, for any measurable function f : Y → R and ζ ∈ P(Y), f♯ζ is the push-forward measure of ζ by f, i.e. for any A ∈ B(R), f♯ζ(A) = ζ(f^{-1}(A)), where f^{-1}(A) = {y ∈ Y : f(y) ∈ A}.\nSWp is a distance on Pp(Y) [14] and has important practical implications: in practice, the integration in (7) is approximated using a Monte Carlo scheme that randomly draws a finite set of samples from σ on S^{d−1} and replaces the integral with a finite-sample average. 
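To make this scheme concrete, here is a minimal NumPy sketch (illustrative, not the authors' implementation) of the Monte Carlo approximation of SWp between two empirical measures with the same number of samples, so that each one-dimensional Wasserstein distance reduces to comparing sorted projections:

```python
import numpy as np

def sliced_wasserstein(X, Y, p=2, n_projections=100, seed=0):
    """Monte Carlo approximation of SW_p between two empirical measures.

    X, Y: arrays of shape (n, d) with the same number of samples, so each
    projected 1-D distance is computed by matching sorted projections.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Draw random directions uniformly on the unit sphere S^{d-1}.
    u = rng.normal(size=(n_projections, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    # Project both samples onto each direction (push-forward u*_sharp mu).
    proj_X = X @ u.T  # shape (n, n_projections)
    proj_Y = Y @ u.T
    # 1-D W_p^p per direction: sort projections and average p-th powers.
    wp_p = np.mean(np.abs(np.sort(proj_X, axis=0) - np.sort(proj_Y, axis=0)) ** p, axis=0)
    # Average over projections (Monte Carlo for the sphere integral), p-th root.
    return np.mean(wp_p) ** (1.0 / p)
```

Since each projected problem is solved by sorting, the cost is O(L · n log n) for L projections, instead of solving a full d-dimensional optimal transport problem.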
Therefore, the evaluation of the SW distance between µ, ν ∈ Pp(Y) has significantly lower computational requirements than the Wasserstein distance, since it amounts to solving several one-dimensional optimal transport problems, which have closed-form solutions.\n\n3 Asymptotic Guarantees for Minimum Sliced-Wasserstein Estimators\n\nWe define the minimum Sliced-Wasserstein estimator (MSWE) of order p as the estimator obtained by plugging SWp in place of D in (1). Similarly, we define the minimum expected Sliced-Wasserstein estimator (MESWE) of order p as the estimator obtained by plugging SWp in place of D in (3). In the rest of the paper, MSWE and MESWE will be denoted by ˆθn and ˆθn,m respectively.\nWe present the asymptotic properties that we derived for MSWE and MESWE, namely their existence and consistency. We study their measurability in Section 2.2 of the supplementary document. We also formulate a CLT that characterizes the asymptotic distribution of MSWE and establishes a convergence rate for any dimension. We provide all the proofs in Section 3 of the supplementary document. Note that, since the Sliced-Wasserstein distance is an average of one-dimensional Wasserstein distances, some proofs are, inevitably, similar to the proofs in [3]. However, the adaptation of these techniques to the SW case is made possible by the identification of novel properties of the topology induced by the SW distance, which, to the best of our knowledge, we establish for the first time in this study.\n\n3.1 Topology induced by the Sliced-Wasserstein distance\n\nWe begin this section with a useful result, interesting in its own right, which implies that the topology induced by SWp on Pp(R^d) is finer than the weak topology induced by the Lévy-Prokhorov metric dP.\nTheorem 1. Let p ∈ [1, +∞). Convergence in SWp implies weak convergence in P(R^d). 
In other words, if (µk)k∈N is a sequence of measures in Pp(R^d) satisfying lim_{k→+∞} SWp(µk, µ) = 0 with µ ∈ Pp(R^d), then (µk)k∈N converges weakly to µ.\nThe property that convergence in SWp implies weak convergence has already been proven in [14], but only for compact domains. While implying weak convergence is one of the most crucial requirements that a distance metric should satisfy, to the best of our knowledge, this implication has not been proved for general domains before. In [14], the main proof technique was based on showing that SWp is equivalent to Wp on compact domains, whereas we follow a different path and use the Lévy characterization.\n\n3.2 Existence and consistency of MSWE and MESWE\n\nIn our next set of results, we show that both MSWE and MESWE are consistent, in the sense that, as the number of observations n increases, the estimators converge to a parameter θ⋆ that minimizes the ideal problem θ ↦ SWp(µ⋆, µθ). Before we make this argument precise, let us first present the assumptions that imply our results.\nA1. The map θ ↦ µθ is continuous from (Θ, ρΘ) to (P(Y), dP), i.e. for any sequence (θn)n∈N in Θ satisfying lim_{n→+∞} ρΘ(θn, θ) = 0, (µθn)n∈N converges weakly to µθ.\nA2. The data-generating process is such that lim_{n→+∞} SWp(ˆµn, µ⋆) = 0, P-almost surely.\nA3. There exists ε > 0 such that, setting ε⋆ = inf_{θ∈Θ} SWp(µ⋆, µθ), the set Θ⋆_ε = {θ ∈ Θ : SWp(µ⋆, µθ) ≤ ε⋆ + ε} is bounded.\nThese assumptions are mostly related to the identifiability of the statistical model and the regularity of the data-generating process. They are arguably mild assumptions, analogous to those that have already been considered in the literature [3]. Note that, without Theorem 1, the formulation and use of A2 in our proofs in the supplementary document would not be possible. In the next result, we establish the consistency of MSWE.\nTheorem 2 (Existence and consistency of MSWE). Assume A1, A2 and A3. There exists E ∈ F with P(E) = 1 such that, for all ω ∈ E,\n\nlim_{n→+∞} inf_{θ∈Θ} SWp(ˆµn(ω), µθ) = inf_{θ∈Θ} SWp(µ⋆, µθ) ,  (8)\n\nlim sup_{n→+∞} argmin_{θ∈Θ} SWp(ˆµn(ω), µθ) ⊂ argmin_{θ∈Θ} SWp(µ⋆, µθ) ,  (9)\n\nwhere ˆµn is defined by (2). Besides, for all ω ∈ E, there exists n(ω) such that, for all n ≥ n(ω), the set argmin_{θ∈Θ} SWp(ˆµn(ω), µθ) is non-empty.\nOur proof technique is similar to the one given in [3]. This result shows that, as the number of observations goes to infinity, the estimate ˆθn converges to a global minimizer of the problem min_{θ∈Θ} SWp(µ⋆, µθ).\nIn our next result, we prove a similar property for MESWEs as min(m, n) goes to infinity. To increase clarity, and without loss of generality, we consider m as a function of n such that lim_{n→+∞} m(n) = +∞, and we derive an analogous version of Theorem 2 for MESWE. For this result, we need to introduce another continuity assumption.\nA4. If lim_{n→+∞} ρΘ(θn, θ) = 0, then lim_{n→+∞} E[SWp(µθn, ˆµθn,n) | Y1:n] = 0.\nThe next theorem establishes the consistency of MESWE.\nTheorem 3 (Existence and consistency of MESWE). Assume A1, A2, A3 and A4. Let (m(n))n∈N∗ be an increasing sequence satisfying lim_{n→+∞} m(n) = +∞. There exists a set E ⊂ Ω with P(E) = 1 such that, for all ω ∈ E,\n\nlim_{n→+∞} inf_{θ∈Θ} E[SWp(ˆµn, ˆµθ,m(n)) | Y1:n] = inf_{θ∈Θ} SWp(µ⋆, µθ) ,  (10)\n\nlim sup_{n→+∞} argmin_{θ∈Θ} E[SWp(ˆµn, ˆµθ,m(n)) | Y1:n] ⊂ argmin_{θ∈Θ} SWp(µ⋆, µθ) ,  (11)\n\nwhere ˆµn and ˆµθ,m(n) are defined by (2) and (4) respectively. Besides, for all ω ∈ E, there exists n(ω) such that, for all n ≥ n(ω), the set argmin_{θ∈Θ} E[SWp(ˆµn, ˆµθ,m(n)) | Y1:n] is non-empty.\nSimilar to Theorem 2, this theorem shows that, as the number of observations goes to infinity, the estimator obtained with the expected distance converges to a global minimizer.\n\n3.3 Convergence of MESWE to MSWE\n\nIn practical applications, we can only use a finite number of generated samples Z1:m. In this subsection, we analyze the case where the observations Y1:n are kept fixed while the number of generated samples increases, i.e. m → +∞, and we show that in this scenario MESWE converges to MSWE, assuming the latter exists. Before deriving this result, we formulate a technical assumption.\nA5. For some ε > 0 and εn = inf_{θ∈Θ} SWp(ˆµn, µθ), the set Θ_{ε,n} = {θ ∈ Θ : SWp(ˆµn, µθ) ≤ εn + ε} is bounded almost surely.\nTheorem 4 (MESWE converges to MSWE as m → +∞). Assume A1, A4 and A5. 
Then,\n\nlim_{m→+∞} inf_{θ∈Θ} E[SWp(ˆµn, ˆµθ,m) | Y1:n] = inf_{θ∈Θ} SWp(ˆµn, µθ) ,  (12)\n\nlim sup_{m→+∞} argmin_{θ∈Θ} E[SWp(ˆµn, ˆµθ,m) | Y1:n] ⊂ argmin_{θ∈Θ} SWp(ˆµn, µθ) .  (13)\n\nBesides, there exists m∗ such that, for any m ≥ m∗, the set argmin_{θ∈Θ} E[SWp(ˆµn, ˆµθ,m) | Y1:n] is non-empty.\nThis result shows that MESWE is indeed promising in practice, as one can get more accurate estimates by increasing m.\n\n3.4 Rate of convergence and the asymptotic distribution\n\nIn our last set of theoretical results, we investigate the asymptotic distribution of MSWE and establish a rate of convergence. We now suppose that we are in the well-specified setting, i.e. there exists θ⋆ in the interior of Θ such that µθ⋆ = µ⋆, and we consider the following assumptions. For any u ∈ S^{d−1} and t ∈ R, we define Fθ(u, t) = ∫_Y 1_{(−∞,t]}(⟨u, y⟩) dµθ(y). Note that for any u ∈ S^{d−1}, Fθ(u, ·) is the cumulative distribution function (CDF) associated with the measure u⋆♯µθ.\nA6. For all ε > 0, there exists δ > 0 such that inf_{θ∈Θ: ρΘ(θ,θ⋆)≥ε} SW1(µθ⋆, µθ) > δ.\nLet L1(S^{d−1} × R) denote the class of functions that are absolutely integrable on the domain S^{d−1} × R with respect to the measure dσ ⊗ Leb, where Leb denotes the Lebesgue measure.\nA7. Assume that there exists a measurable function D⋆ = (D⋆,1, . . . , D⋆,dθ) : S^{d−1} × R → R^{dθ} such that, for each i = 1, . . . , dθ, D⋆,i ∈ L1(S^{d−1} × R) and\n\n∫_{S^{d−1}} ∫_R |Fθ(u, t) − Fθ⋆(u, t) − ⟨θ − θ⋆, D⋆(u, t)⟩| dt dσ(u) = ε(ρΘ(θ, θ⋆)) ,\n\nwhere ε : R+ → R+ satisfies lim_{t→0} ε(t) = 0. Besides, {D⋆,i}_{i=1}^{dθ} are linearly independent in L1(S^{d−1} × R).\nFor any u ∈ S^{d−1} and t ∈ R, define ˆFn(u, t) = n^{−1} card{i ∈ {1, . . . , n} : ⟨u, Yi⟩ ≤ t}, where card denotes the cardinality of a set; for any u ∈ S^{d−1}, ˆFn(u, ·) is the CDF associated with the measure u⋆♯ˆµn.\nA8. There exists a random element G⋆ : S^{d−1} × R → R such that the stochastic process √n(ˆFn − Fθ⋆) converges weakly in L1(S^{d−1} × R) to G⋆.¹\nTheorem 5. Assume A1, A2, A3, A6, A7 and A8. Then, the asymptotic distribution of the goodness-of-fit statistic is given by\n\n√n inf_{θ∈Θ} SW1(ˆµn, µθ) converges weakly, as n → +∞, to inf_{θ∈Θ} ∫_{S^{d−1}} ∫_R |G⋆(u, t) − ⟨θ, D⋆(u, t)⟩| dt dσ(u) ,\n\nwhere ˆµn is defined by (2).\nTheorem 6. Assume A1, A2, A3, A6, A7 and A8. Suppose also that the random map θ ↦ ∫_{S^{d−1}} ∫_R |G⋆(u, t) − ⟨θ, D⋆(u, t)⟩| dt dσ(u) has a unique infimum almost surely. 
Then, MSWE with p = 1 satisfies\n\n√n(ˆθn − θ⋆) converges weakly, as n → +∞, to argmin_{θ∈Θ} ∫_{S^{d−1}} ∫_R |G⋆(u, t) − ⟨θ, D⋆(u, t)⟩| dt dσ(u) ,\n\nwhere ˆθn is defined by (1) with SW1 in place of D.\nThese results show that the estimator and the associated goodness-of-fit statistic converge in distribution to a random variable, with a convergence rate of √n. Note that G⋆ is defined as a random element (see A8); therefore, we cannot claim that the convergence in distribution derived in Theorems 5 and 6 implies convergence in probability.\nThis CLT is also inspired by [3], where the authors identified the asymptotic distribution associated with the minimum Wasserstein estimator. However, since Wp admits an analytical form only when d = 1, their result is restricted to the scalar case, and in their conclusion, [3] conjecture that the rate of minimum Wasserstein estimators would depend negatively on the dimension of the observation space. On the contrary, since SWp is defined in terms of one-dimensional Wp distances, we circumvent the curse of dimensionality and our result holds for any finite dimension. While the perceived computational burden has created pessimism in the machine learning community about the use of Wasserstein-based methods in high-dimensional settings, which motivated the rise of regularized optimal transport [26], we believe that our findings provide an interesting counter-example to this conception.\n\n¹Under mild assumptions on the tails of u⋆♯µ⋆ for any u ∈ S^{d−1}, we believe that one can prove that A8 holds in general by extending [24, Proposition 3.5] and [25, Theorem 2.1a].\n\n(a) MSWE vs. n  (b) MESWE vs. n = m  (c) MESWE with n = 2000 vs. m\nFigure 2: Min. SW estimation on Gaussians in R^{10}. Figure 2a and Figure 2b show the mean squared error between (m⋆, σ²⋆) = (0, 1) and MSWE (ˆmn, ˆσ²n) (resp. MESWE (ˆmn,n, ˆσ²n,n)) for n from 10 to 10 000, illustrating Theorems 2 and 3. Figure 2c shows the error between (ˆmn, ˆσ²n) and (ˆmn,m, ˆσ²n,m) for 2000 observations and m from 10 to 10 000, to illustrate Theorem 4. Results are averaged over 100 runs; the shaded areas represent the standard deviation.\n\n4 Experiments\n\nWe conduct experiments on synthetic and real data to empirically confirm our theorems. We explain in Section 4 of the supplementary document the optimization methods used to find the estimators. Specifically, we can use stochastic iterative optimization algorithms (e.g., stochastic gradient descent). Note that, since we calculate (expected) SW with Monte Carlo approximations over a finite set of projections (and a finite number of 'generated datasets'), MSWE and MESWE fall into the category of doubly stochastic algorithms. Our experiments on synthetic data actually show that using only one random projection and one randomly generated dataset at each iteration of the optimization process is enough to illustrate our theorems. We provide the code to reproduce the experiments.²\nMultivariate Gaussian distributions: We consider the task of estimating the parameters of a 10-dimensional Gaussian distribution using our SW estimators: we are interested in the model M = {N(m, σ²I) : m ∈ R^{10}, σ² > 0} and we draw i.i.d. observations with (m⋆, σ²⋆) = (0, 1). The advantage of this simple setting is that the density of the generated data has a closed-form expression, which makes MSWE tractable. 
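In this Gaussian setting, the one-dimensional projections of N(m, σ²I) are themselves Gaussian, so the inner Wasserstein distances in SW can be computed against closed-form Gaussian quantiles. The sketch below is a simplified illustration of this tractable-MSWE idea (function names, the midpoint quantile discretization, and the gradient-free optimizer are our own choices, not the paper's code):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def sw2_to_gaussian(Y, m, sigma2, n_projections=50, seed=0):
    """Approximate SW_2^2 between the empirical measure of Y and N(m, sigma2*I).

    The projection of N(m, sigma2*I) onto a unit vector u is the 1-D Gaussian
    N(<u, m>, sigma2), so each 1-D W_2 compares empirical quantiles (sorted
    projections) with Gaussian quantiles evaluated at midpoint levels.
    """
    rng = np.random.default_rng(seed)
    n, d = Y.shape
    u = rng.normal(size=(n_projections, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    t = (np.arange(1, n + 1) - 0.5) / n               # midpoint quantile levels
    gauss_q = np.sqrt(sigma2) * norm.ppf(t)           # centered Gaussian quantiles
    proj = np.sort(Y @ u.T, axis=0)                   # empirical quantiles, (n, L)
    model_q = gauss_q[:, None] + (u @ m)[None, :]     # model quantiles per direction
    return np.mean((proj - model_q) ** 2)

def mswe_gaussian(Y, seed=0):
    """Approximate MSWE of (m, sigma2) with gradient-free Nelder-Mead."""
    d = Y.shape[1]
    x0 = np.concatenate([Y.mean(axis=0), [Y.var()]])  # start at moment estimates
    # Clip sigma2 away from zero so the Gaussian quantiles stay well defined.
    obj = lambda x: sw2_to_gaussian(Y, x[:d], max(x[d], 1e-6), seed=seed)
    res = minimize(obj, x0, method="Nelder-Mead",
                   options={"maxiter": 2000, "xatol": 1e-6, "fatol": 1e-9})
    return res.x[:d], res.x[d]
```

Fixing the projection seed makes the objective deterministic, which keeps Nelder-Mead well behaved; drawing fresh projections at each evaluation would instead give the doubly stochastic variant described above.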
We empirically verify our central limit theorem: for different values of n, we compute 500 times the MSWE of order 1 using one random projection, then we estimate the density of ˆσ²n with a kernel density estimator. Figure 1 shows the distributions centered and rescaled by √n for each n, and confirms the convergence rate that we derived (Theorem 6). To illustrate the consistency property in Theorem 2, we approximate the MSWE of order 2 for different numbers of observed data n using one random projection, and we report for each n the mean squared error between the estimated mean and variance and the data-generating parameters (m⋆, σ²⋆). We proceed in the same way to study the consistency of MESWE (Theorem 3), which we approximate using one random projection and one generated dataset z1:m of size m = n for different values of n. We also verify the convergence of MESWE to MSWE (Theorem 4): we compute these estimators on a fixed set of n = 2000 observations for different m, and we measure the error between them for each m. Results are shown in Figure 2. We see that our estimators indeed converge to (m⋆, σ²⋆) as the number of observations increases (Figures 2a, 2b), and on a fixed observed dataset, MESWE converges to MSWE as we generate more samples (Figure 2c).\n\nFigure 1: Probability density estimates of the MSWE ˆσ²n of order 1, centered and rescaled by √n, on the 10-dimensional Gaussian model for different values of n.\n\n²See https://github.com/kimiandj/min_swe.\n\n(a) Comparison of MESWE and MEWE  (b) MESWE  (c) MESWE, n∗ = 100\nFigure 3: Min. SW estimation for the location parameter of multivariate elliptically contoured stable distributions. Figure 3a compares the quality of the estimation provided by SW- and Wasserstein-based estimators, as well as their average computational time, for different values of the dimension d. Figure 3b and Figure 3c illustrate, for d = 10, the consistency of MESWE ˆmn,m and its convergence to the MSWE ˆmn. Results are averaged over 100 runs; the shaded areas represent the standard deviation.\n\nMultivariate elliptically contoured stable distributions: We focus on parameter inference for a subclass of multivariate stable distributions, called elliptically contoured stable distributions and denoted by EαSc [27]. Stable distributions refer to a family of heavy-tailed probability distributions that generalize Gaussian laws and appear as the limit distributions in the generalized central limit theorem [28]. These distributions have many attractive theoretical properties and have been proven useful in modeling financial data [29] or audio signals [30, 31]. 
While special univariate cases include the Gaussian, Lévy and Cauchy distributions, the density of stable distributions has no general analytic form, which restricts their practical application, especially in the multivariate case. If Y ∈ Rᵈ follows EαSc(Σ, m), then its joint characteristic function is defined for any t ∈ Rᵈ as E[exp(i tᵀY)] = exp(−(tᵀΣt)^(α/2) + i tᵀm), where Σ is a positive definite matrix (akin to a correlation matrix), m ∈ Rᵈ is a location vector (equal to the mean if it exists) and α ∈ (0, 2) controls the thickness of the tails. Even though their densities cannot be evaluated easily, it is straightforward to sample from EαSc [27]; it is therefore particularly relevant here to apply MESWE instead of MLE.

To demonstrate the computational advantage of MESWE over the minimum expected Wasserstein estimator [3, MEWE], we consider observations in Rᵈ drawn i.i.d. from EαSc(I, m⋆), where each component of m⋆ is 2 and α = 1.8, and M = {EαSc(I, m) : m ∈ Rᵈ}. The Wasserstein distance on multivariate data is either computed exactly by solving the linear program in (5), or approximated by solving a regularized version of this problem with Sinkhorn's algorithm [12]. The MESWE is approximated using 10 random projections and 10 sets of generated samples. Then, following the approach in [3], we use the gradient-free optimization method Nelder–Mead to minimize the Wasserstein and SW distances. We report in Figure 3a the mean squared error between each estimate and m⋆, as well as the average computational time, for different values of the dimension d. We see that MESWE provides the same quality of estimation as its Wasserstein-based counterparts while considerably reducing the computational time, especially in higher dimensions.
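The MESWE pipeline just described (average the SW distance over several generated datasets and projections, then minimize with Nelder–Mead) can be sketched end-to-end. The snippet below is our own illustration, not the paper's code: to keep it self-contained it fits the location m of a Gaussian model N(m, I) rather than an EαSc model, and it freezes the random seed inside the objective so that the gradient-free optimizer sees a deterministic function.

```python
import numpy as np
from scipy.optimize import minimize

def sw2_squared(x, y, thetas):
    # Average squared 1D Wasserstein-2 over the given projection directions:
    # sorting each projected sample yields the optimal 1D coupling.
    xs = np.sort(x @ thetas.T, axis=0)
    ys = np.sort(y @ thetas.T, axis=0)
    return np.mean((xs - ys) ** 2)

def meswe_location(obs, n_proj=10, n_sets=10, seed=0):
    """Minimum expected SW estimator of a location parameter, via Nelder-Mead."""
    n, d = obs.shape

    def objective(m):
        rng = np.random.default_rng(seed)  # fixed seed: deterministic objective
        val = 0.0
        for _ in range(n_sets):
            z = rng.normal(size=(n, d)) + m        # one generated dataset from N(m, I)
            thetas = rng.normal(size=(n_proj, d))  # random directions on S^{d-1}
            thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
            val += sw2_squared(obs, z, thetas)
        return val / n_sets

    return minimize(objective, x0=np.zeros(d), method="Nelder-Mead").x
```

Replacing the Gaussian sampler with an EαSc sampler (e.g., the construction in [27]) gives the estimator used in this experiment; only the line generating `z` changes.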
We focus on this model in R¹⁰ and we illustrate the consistency of the MESWE m̂_{n,m}, approximated with one random projection and one generated dataset, in the same way as for the Gaussian model: see Figure 3b. To confirm the convergence of m̂_{n,m} to the MSWE m̂_n, we fix n = 100 observations and we compute the mean squared error between the two approximate estimators (using one random projection and one generated dataset) for different values of m (Figure 3c). Note that the MSWE is approximated with the MESWE obtained for a large enough value of m: m̂_n ≈ m̂_{n,10 000}.

High-dimensional real data using GANs: Finally, we run experiments on image generation using the Sliced-Wasserstein Generator (SWG), an alternative GAN formulation based on the minimization of the SW distance [16]. Specifically, the generative modeling approach consists in introducing a random variable Z which takes values in Z with a fixed distribution, and then transforming Z through a neural network. This defines a parametric function T_θ : Z → Y that is able to produce images from a distribution μ_θ. The goal is to optimize the neural network parameters such that the generated images are close to the observed ones. [16] proposes to minimize the SW distance between μ_θ and the real data distribution over θ as the generator objective, and trains on MESWE in practice. For our experiments, we design a neural network with the fully-connected configuration given in [16, Appendix D] and we use the MNIST dataset, made of 60 000 training images and 10 000 test images of size 28 × 28.
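The doubly stochastic training loop behind such a generator (one random projection and one generated batch per iteration) can be illustrated with a deliberately tiny example. The sketch below is our own numpy illustration with a *linear* generator g(z) = zW + b, not the neural SWG architecture of [16]; because the 1D optimal coupling is given by sorting, the gradient of the projected squared W2 loss can be written in closed form.

```python
import numpy as np

def swg_step(x, W, b, lr=0.1, rng=None):
    """One doubly stochastic SGD step on the SW2^2 generator objective.

    Uses one random projection and one generated batch; updates W, b in place.
    """
    rng = np.random.default_rng(rng)
    n, d = x.shape
    k = W.shape[0]
    z = rng.normal(size=(n, k))              # latent batch
    theta = rng.normal(size=d)
    theta /= np.linalg.norm(theta)           # random direction on S^{d-1}
    gen_proj = z @ (W @ theta) + b @ theta   # projected generated batch
    data_proj = x @ theta                    # projected observed batch
    idx, jdx = np.argsort(gen_proj), np.argsort(data_proj)
    diff = gen_proj[idx] - data_proj[jdx]    # sorted samples = 1D optimal coupling
    loss = np.mean(diff ** 2)                # squared W2 along this projection
    g_proj = np.zeros(n)
    g_proj[idx] = 2.0 * diff / n             # d(loss)/d(gen_proj)
    # Chain rule through gen_proj = z (W theta) + b.theta:
    W -= lr * np.outer(z.T @ g_proj, theta)
    b -= lr * g_proj.sum() * theta
    return loss
```

Iterating this step drives the generated distribution toward the data along random one-dimensional projections; in the SWG, the same loss is backpropagated through a neural network instead of a linear map.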
Our training objective is MESWE of order 2, approximated with 20 random projections and 20 different generated datasets. We study the consistent behavior of the MESWE by training the neural network on different sizes n of training data and different numbers m of generated samples, and by comparing the final training loss and test loss to the ones obtained when learning on the whole training dataset (n = 60 000) with m = 200. Results are averaged over 10 runs and shown in Figure 4, where the shaded areas correspond to the standard deviation over the runs. We observe that our results confirm Theorem 3.

We would like to point out that, in all of our experiments, the random projections used in the Monte Carlo average that estimates the integral in (7) were picked uniformly on S^{d−1} (see Section 4 in the supplementary document for more details). The sampling on S^{d−1} directly impacts the quality of the resulting approximation of SW, and might induce variance in practice when learning generative models. On the theoretical side, studying the asymptotic properties of SW-based estimators obtained with a finite number of projections is an interesting question (e.g., their behavior might depend on the sampling method or the number of projections used). We leave this study for future research.

Figure 4: Mean squared error between the training (test) loss for (n, m) ∈ {(1, 1), (100, 20), (1000, 40), (10 000, 60)} and the training (test) loss for (n, m) = (60 000, 200) on MNIST using the SW generator. We trained for 20 000 iterations with the ADAM optimizer [32].

5 Conclusion

The Sliced-Wasserstein distance has been an attractive metric choice for learning generative models, where the densities cannot be computed directly. In this study, we investigated the asymptotic properties of estimators that are obtained by minimizing SW and the expected SW.
We showed that (i) convergence in SW implies weak convergence of probability measures in general Wasserstein spaces, (ii) the estimators are consistent, and (iii) the estimators converge in distribution to a random variable at a rate of √n. We validated our mathematical results on both synthetic data and neural networks. We believe that our techniques can be further extended to the extensions of SW such as [20, 33, 34].

Acknowledgements

The authors are grateful to Pierre Jacob for his valuable comments on an earlier version of this manuscript. This work is partly supported by the French National Research Agency (ANR) as a part of the FBIMATRIX project (ANR-16-CE23-0014) and by the industrial chair Machine Learning for Big Data from Télécom ParisTech. Alain Durmus acknowledges support from Polish National Science Center grant NCN UMO-2018/31/B/ST1/00253.

References

[1] J. Wolfowitz. The minimum distance method. Ann. Math. Statist., 28(1):75–88, 03 1957.

[2] A. Basu, H. Shioya, and C. Park. Statistical Inference: The Minimum Distance Approach. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. CRC Press, 2011.

[3] E. Bernton, P. E. Jacob, M. Gerber, and C. P. Robert. On parameter estimation with the Wasserstein distance. Information and Inference: A Journal of the IMA, Jan 2019.

[4] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.

[5] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.

[6] Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with Sinkhorn divergences.
arXiv preprint arXiv:1706.00292, 2017.

[7] Giorgio Patrini, Marcello Carioni, Patrick Forre, Samarth Bhargav, Max Welling, Rianne van den Berg, Tim Genewein, and Frank Nielsen. Sinkhorn autoencoders. arXiv preprint arXiv:1810.01118, 2018.

[8] Jonas Adler and Sebastian Lunz. Banach Wasserstein GAN. In Advances in Neural Information Processing Systems, pages 6754–6763, 2018.

[9] Aude Genevay, Gabriel Peyré, and Marco Cuturi. GAN and VAE from an optimal transport point of view. arXiv preprint arXiv:1706.01807, 2017.

[10] Olivier Bousquet, Sylvain Gelly, Ilya Tolstikhin, Carl-Johann Simon-Gabriel, and Bernhard Schoelkopf. From optimal transport to generative modeling: the VEGAN cookbook. arXiv preprint arXiv:1705.07642, 2017.

[11] Shuang Liu, Olivier Bousquet, and Kamalika Chaudhuri. Approximation and convergence properties of generative adversarial learning. In Advances in Neural Information Processing Systems, pages 5545–5553, 2017.

[12] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.

[13] J. Rabin, G. Peyré, J. Delon, and M. Bernot. Wasserstein barycenter and its application to texture mixing. In Alfred M. Bruckstein, Bart M. ter Haar Romeny, Alexander M. Bronstein, and Michael M. Bronstein, editors, Scale Space and Variational Methods in Computer Vision, pages 435–446, 2012.

[14] Nicolas Bonnotte. Unidimensional and Evolution Methods for Optimal Transportation. PhD thesis, Paris 11, 2013.

[15] N. Bonneel, J. Rabin, G. Peyré, and H. Pfister. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45, 2015.

[16] Ishan Deshpande, Ziyu Zhang, and Alexander G. Schwing. Generative modeling using the sliced Wasserstein distance.
In IEEE Conference on Computer Vision and Pattern Recognition, pages 3483–3491, 2018.

[17] Soheil Kolouri, Phillip E. Pope, Charles E. Martin, and Gustavo K. Rohde. Sliced Wasserstein auto-encoders. In International Conference on Learning Representations, 2019.

[18] Antoine Liutkus, Umut Şimşekli, Szymon Majewski, Alain Durmus, and Fabian-Robert Stöter. Sliced-Wasserstein flows: Nonparametric generative modeling via optimal transport and diffusions. In International Conference on Machine Learning, 2019.

[19] Jiqing Wu, Zhiwu Huang, Wen Li, Janine Thoma, and Luc Van Gool. Sliced Wasserstein generative models. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[20] Ishan Deshpande, Yuan-Ting Hu, Ruoyu Sun, Ayis Pyrros, Nasir Siddiqui, Sanmi Koyejo, Zhizhen Zhao, David Forsyth, and Alexander Schwing. Max-Sliced Wasserstein distance and its use for GANs. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[21] Patrick Billingsley. Convergence of Probability Measures. Wiley Series in Probability and Statistics: Probability and Statistics. John Wiley & Sons Inc., New York, second edition, 1999. A Wiley-Interscience Publication.

[22] Cédric Villani. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer, 2009 edition, September 2008.

[23] S. T. Rachev and L. Rüschendorf. Mass Transportation Problems. Vol. I: Theory. Probability and its Applications (New York). Springer-Verlag, New York, 1998.

[24] Sophie Dede. An empirical central limit theorem in L1 for stationary sequences. Stochastic Processes and their Applications, 119(10):3494–3515, 2009.

[25] Eustasio del Barrio, Evarist Giné, and Carlos Matrán. Central limit theorems for the Wasserstein distance between the empirical and the true distributions. Ann. Probab., 27(2):1009–1071, 04 1999.

[26] G. Peyré, M. Cuturi, et al.
Computational optimal transport. Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019.

[27] John P. Nolan. Multivariate elliptically contoured stable distributions: theory and estimation. Computational Statistics, 28(5):2067–2089, Oct 2013.

[28] G. Samorodnitsky and M. S. Taqqu. Stable Non-Gaussian Random Processes: Stochastic Models with Infinite Variance. Stochastic Modeling Series. Taylor & Francis, 1994.

[29] B. B. Mandelbrot. Fractals and Scaling in Finance: Discontinuity, Concentration, Risk. Selecta Volume E. Springer Science & Business Media, 2013.

[30] U. Şimşekli, A. Liutkus, and A. T. Cemgil. Alpha-stable matrix factorization. IEEE Signal Processing Letters, 22(12):2289–2293, 2015.

[31] Simon Leglaive, Umut Şimşekli, Antoine Liutkus, Roland Badeau, and Gaël Richard. Alpha-stable multichannel audio source separation. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 576–580. IEEE, 2017.

[32] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.

[33] François-Pierre Paty and Marco Cuturi. Subspace robust Wasserstein distances. In International Conference on Machine Learning, 2019.

[34] S. Kolouri, K. Nadjahi, U. Simsekli, R. Badeau, and G. K. Rohde. Generalized Sliced Wasserstein Distances.
In Advances in Neural Information Processing Systems, 2019.