{"title": "Minimax Estimation of Maximum Mean Discrepancy with Radial Kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 1930, "page_last": 1938, "abstract": "Maximum Mean Discrepancy (MMD) is a distance on the space of probability measures which has found numerous applications in machine learning and nonparametric testing. This distance is based on the notion of embedding probabilities in a reproducing kernel Hilbert space. In this paper, we present the first known lower bounds for the estimation of MMD based on finite samples. Our lower bounds hold for any radial universal kernel on $\\R^d$ and match the existing upper bounds up to constants that depend only on the properties of the kernel. Using these lower bounds, we establish the minimax rate optimality of the empirical estimator and its $U$-statistic variant, which are usually employed in applications.", "full_text": "Minimax Estimation of Maximum Mean Discrepancy\n\nwith Radial Kernels\n\nIlya Tolstikhin\n\nDepartment of Empirical Inference\n\nMPI for Intelligent Systems\nT\u00fcbingen 72076, Germany\nilya@tuebingen.mpg.de\n\nBharath K. Sriperumbudur\n\nDepartment of Statistics\n\nPennsylvania State University\nUniversity Park, PA 16802, USA\n\nbks18@psu.edu\n\nBernhard Sch\u00f6lkopf\n\nDepartment of Empirical Inference\n\nMPI for Intelligent Systems\nT\u00fcbingen 72076, Germany\nbs@tuebingen.mpg.de\n\nAbstract\n\nMaximum Mean Discrepancy (MMD) is a distance on the space of probability\nmeasures which has found numerous applications in machine learning and nonpara-\nmetric testing. This distance is based on the notion of embedding probabilities in a\nreproducing kernel Hilbert space. In this paper, we present the \ufb01rst known lower\nbounds for the estimation of MMD based on \ufb01nite samples. Our lower bounds\nhold for any radial universal kernel on Rd and match the existing upper bounds up\nto constants that depend only on the properties of the kernel. 
Using these lower bounds, we establish the minimax rate optimality of the empirical estimator and its U-statistic variant, which are usually employed in applications.\n\n
1 Introduction\n\n
Over the past decade, the notion of embedding probability measures in a Reproducing Kernel Hilbert Space (RKHS) [1, 13, 18, 17] has gained a lot of attention in machine learning, owing to its wide applicability. Some popular applications of RKHS embeddings of probabilities include two-sample testing [5, 6], independence [7] and conditional independence testing [3], feature selection [14], covariate-shift [13], causal discovery [9], density estimation [15], kernel Bayes' rule [4], and distribution regression [20]. This notion of embedding probability measures can be seen as a generalization of classical kernel methods, which deal with embedding points of an input space as elements of an RKHS. Formally, given a probability measure $P$ and a continuous positive definite real-valued kernel $k$ (we denote by $\mathcal{H}$ the corresponding RKHS) defined on a separable topological space $\mathcal{X}$, $P$ is embedded into $\mathcal{H}$ as $\mu_P := \int k(\cdot, x)\, dP(x)$, called the mean element or the kernel mean, assuming $k$ and $P$ satisfy $\int_{\mathcal{X}} \sqrt{k(x,x)}\, dP(x) < \infty$. Based on the above embedding of $P$, [5] defined a distance\u2014called the Maximum Mean Discrepancy (MMD)\u2014on the space of probability measures as the distance between the corresponding mean elements, i.e.,\n\n
$$\mathrm{MMD}_k(P, Q) = \|\mu_P - \mu_Q\|_{\mathcal{H}}.$$\n\n
We refer the reader to [18, 17] for a detailed study of the properties of MMD and its relation to other distances on probabilities.\n\n
Estimation of kernel mean. In all the above mentioned applications, since the only knowledge of the underlying distribution is through random samples drawn from it, an estimate of $\mu_P$ is employed in practice.\n\n
30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n
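Since the kernel mean is the object every downstream application accesses through data, a minimal sketch of its Monte Carlo approximation may help fix ideas. This is not code from the paper; the Gaussian kernel, bandwidth, and three-point sample below are illustrative choices only:

```python
import math

def gaussian_kernel(x, y, eta2=1.0):
    """Radial kernel k(x, y) = exp(-||x - y||^2 / (2 * eta2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * eta2))

def mean_embedding(sample, kernel=gaussian_kernel):
    """Empirical kernel mean mu_Pn: the function t -> (1/n) sum_i k(t, X_i),
    a Monte Carlo approximation of mu_P = E_{X~P}[k(., X)]."""
    n = len(sample)
    return lambda t: sum(kernel(t, xi) for xi in sample) / n

# Illustrative three-point sample from some P on R^2.
sample = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
mu_hat = mean_embedding(sample)
# For a Gaussian kernel, mu_hat takes values in (0, 1].
assert 0.0 < mu_hat((0.5, 0.5)) <= 1.0
```

The returned function is the RKHS element $\mu_{P_n}$ evaluated pointwise via the reproducing property, which is all that MMD-based statistics ever need.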
In applications such as the two-sample test [5, 6] and independence test [7] that involve MMD, an estimate of MMD is constructed based on the estimates of $\mu_P$ and $\mu_Q$ respectively. The simplest and most popular estimator of $\mu_P$ is the empirical estimator, $\mu_{P_n} := \frac{1}{n}\sum_{i=1}^n k(\cdot, X_i)$, which is a Monte Carlo approximation of $\mu_P$ based on random samples $(X_i)_{i=1}^n$ drawn i.i.d. from $P$. Recently, [10] proposed a shrinkage estimator of $\mu_P$ based on the idea of James-Stein shrinkage, which is demonstrated to empirically outperform $\mu_{P_n}$. While both these estimators are shown to be $\sqrt{n}$-consistent [13, 5, 10], it was not clear until the recent work of [21] whether any of these estimators is minimax rate optimal, i.e., is there an estimator of $\mu_P$ that yields a convergence rate faster than $n^{-1/2}$? Based on the minimax optimality of the sample mean (i.e., $\bar X := \frac{1}{n}\sum_{i=1}^n X_i$) for the estimation of the finite dimensional mean of a normal distribution at a minimax rate of $n^{-1/2}$ [8, Chapter 5, Example 1.14], one can intuitively argue that the empirical and shrinkage estimators of $\mu_P$ are minimax rate optimal; however, it is difficult to extend the finite dimensional argument in a rigorous manner to the estimation of the infinite dimensional object $\mu_P$. Note that $\mathcal{H}$ is infinite dimensional if $k$ is universal [19, Chapter 4], e.g., the Gaussian kernel. By establishing a remarkable relation between the MMD of two Gaussian distributions and the Euclidean distance between their means for any bounded continuous translation invariant universal kernel on $\mathcal{X} = \mathbb{R}^d$, [21] rigorously showed that the estimation of $\mu_P$ is only as hard as the estimation of the finite dimensional mean of a normal distribution, and thereby established the minimax rate of estimating $\mu_P$ to be $n^{-1/2}$. 
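For concreteness, the plug-in MMD estimator built from two empirical kernel means, and the unbiased U-statistic variant that the paper also discusses, can be sketched as below. This is a schematic illustration with a Gaussian kernel and toy samples, not the authors' implementation:

```python
import math

def k(x, y, eta2=1.0):
    # One illustrative choice of radial universal kernel: the Gaussian kernel.
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2.0 * eta2))

def mmd2_plugin(X, Y):
    """Squared plug-in estimator ||mu_Pn - mu_Qm||_H^2, expanded with the
    reproducing property into three Gram-matrix averages."""
    n, m = len(X), len(Y)
    kxx = sum(k(a, b) for a in X for b in X) / (n * n)
    kyy = sum(k(a, b) for a in Y for b in Y) / (m * m)
    kxy = sum(k(a, b) for a in X for b in Y) / (n * m)
    return kxx + kyy - 2.0 * kxy

def mmd2_ustat(X, Y):
    """Unbiased U-statistic variant: the diagonal terms k(X_i, X_i) and
    k(Y_j, Y_j) are dropped, so the estimate may be negative."""
    n, m = len(X), len(Y)
    kxx = sum(k(X[i], X[j]) for i in range(n) for j in range(n) if i != j)
    kyy = sum(k(Y[i], Y[j]) for i in range(m) for j in range(m) if i != j)
    kxy = sum(k(a, b) for a in X for b in Y)
    return kxx / (n * (n - 1)) + kyy / (m * (m - 1)) - 2.0 * kxy / (n * m)

X, Y = [(0.0,), (0.2,)], [(1.0,), (1.1,)]
assert abs(mmd2_plugin(X, X)) < 1e-12   # identical samples give exactly 0
assert mmd2_ustat(X, X) < 0.0           # the unbiased variant can dip below 0
# Single points: MMD^2 = k(x,x) + k(y,y) - 2 k(x,y) = 2 - 2 exp(-1/2).
assert abs(mmd2_plugin([(0.0,)], [(1.0,)]) - (2 - 2 * math.exp(-0.5))) < 1e-12
```

Taking a square root of `mmd2_plugin` gives the estimator $\mathrm{MMD}_{n,m}$ whose minimax optimality is the subject of this paper; the negative values possible for the U-statistic are the usual price of unbiasedness.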
This in turn demonstrates the minimax rate optimality of the empirical and shrinkage estimators of $\mu_P$.\n\n
Estimation of MMD. In this paper, we are interested in the minimax optimal estimation of $\mathrm{MMD}_k(P, Q)$. The question of finding optimal estimators of MMD is of interest in applications such as kernel-based two-sample [5] and independence tests [7], as the test statistic is indeed an estimate of MMD and it is important to use statistically optimal estimators in the construction of these kernel based tests. An estimator of MMD that is currently employed in these applications is based on the empirical estimators of $\mu_P$ and $\mu_Q$, i.e.,\n\n
$$\mathrm{MMD}_{n,m} := \|\mu_{P_n} - \mu_{Q_m}\|_{\mathcal{H}},$$\n\n
which is constructed from samples $(X_i)_{i=1}^n \overset{i.i.d.}{\sim} P$ and $(Y_i)_{i=1}^m \overset{i.i.d.}{\sim} Q$. [5, 7] also considered a U-statistic variant of $\mathrm{MMD}_{n,m}$ as a test statistic in these applications. As discussed above, while $\mu_{P_n}$ and $\mu_{Q_m}$ are minimax rate optimal estimators of $\mu_P$ and $\mu_Q$ respectively, they need not guarantee that $\mathrm{MMD}_{n,m}$ is minimax rate optimal. Using the fact that $\|\mu_{P_n} - \mu_P\|_{\mathcal{H}} = O_p(n^{-1/2})$ and\n\n
$$|\mathrm{MMD}_k(P, Q) - \mathrm{MMD}_{n,m}| \le \|\mu_P - \mu_{P_n}\|_{\mathcal{H}} + \|\mu_{Q_m} - \mu_Q\|_{\mathcal{H}},$$\n\n
it is easy to see that\n\n
$$|\mathrm{MMD}_k(P, Q) - \mathrm{MMD}_{n,m}| = O_p(n^{-1/2} + m^{-1/2}). \quad (1)$$\n\n
In fact, if $k$ is a bounded kernel, it can be shown that the constants (which are hidden in the order notation in (1)) depend only on the bound on the kernel and are independent of $\mathcal{X}$, $P$ and $Q$. The goal of this work is to find the minimax rate $r_{n,m,k}(\mathcal{P})$ and a positive constant $c_k(\mathcal{P})$ (independent of $m$ and $n$) such that\n\n
$$\inf_{\hat F_{n,m}} \sup_{P,Q \in \mathcal{P}} P^n \times Q^m\left\{ r_{n,m,k}^{-1}(\mathcal{P})\, |\hat F_{n,m} - \mathrm{MMD}_k(P, Q)| \ge c_k(\mathcal{P}) \right\} > 0, \quad (2)$$\n\n
where $\mathcal{P}$ is a suitable subset of Borel probability measures on $\mathcal{X}$, the infimum is taken over all estimators $\hat F_{n,m}$ mapping the i.i.d. samples $\{(X_i)_{i=1}^n, (Y_i)_{i=1}^m\}$ to $\mathbb{R}_+$, and $P^n \times Q^m$ denotes the probability measure associated with the sample when $(X_i)_{i=1}^n \overset{i.i.d.}{\sim} P$ and $(Y_i)_{i=1}^m \overset{i.i.d.}{\sim} Q$. In addition to the rate, we are also interested in the behavior of $c_k(\mathcal{P})$ in terms of its dependence on $k$, $\mathcal{X}$ and $\mathcal{P}$.\n\n
Contributions. The main contribution of the paper is in establishing $m^{-1/2} + n^{-1/2}$, i.e., $r_{n,m,k}(\mathcal{P}) = \sqrt{(m+n)/(mn)}$, as the minimax rate for estimating $\mathrm{MMD}_k(P, Q)$ when $k$ is a radial universal kernel (examples include the Gaussian, Mat\u00e9rn and inverse multiquadric kernels) on $\mathbb{R}^d$ and $\mathcal{P}$ is the set of all Borel probability measures on $\mathbb{R}^d$ with infinitely differentiable densities. This result guarantees that $\mathrm{MMD}_{n,m}$ and its U-statistic variant are minimax rate optimal estimators of $\mathrm{MMD}_k(P, Q)$, which thereby ensures the minimax optimality of the test statistics used in kernel two-sample and independence tests. We would like to highlight the fact that our minimax lower bound on $\mathrm{MMD}_k(P, Q)$ implies part of the results of [21] related to the minimax estimation of $\mu_P$, as any $\epsilon$-accurate estimators $\hat\mu_P$ and $\hat\mu_Q$ of $\mu_P$ and $\mu_Q$ (in the RKHS norm) lead to the $2\epsilon$-accurate estimator $\hat F_{n,m} := \|\hat\mu_P - \hat\mu_Q\|_{\mathcal{H}}$ of $\mathrm{MMD}_k(P, Q)$, i.e.,\n\n
$$c_k(\mathcal{P})(n^{-1/2} + m^{-1/2}) \le |\mathrm{MMD}_k(P, Q) - \hat F_{n,m}| \le \|\mu_P - \hat\mu_P\|_{\mathcal{H}} + \|\mu_Q - \hat\mu_Q\|_{\mathcal{H}}.$$\n\n
In Section 2, we present the main results of our work, wherein Theorem 1 is developed by employing the ideas of [21] involving Le Cam's method (see Theorem 3) [22, Sections 2.3 and 2.6]. However, we show that while the minimax rate is $m^{-1/2} + n^{-1/2}$, there is a sub-optimal dependence on $d$ in the constant $c_k(\mathcal{P})$, which makes the result uninteresting in high dimensional scenarios. To alleviate this issue, we present a refined result in Theorem 2 based on the method of two fuzzy hypotheses (see Theorem 4) [22, Section 2.7.4], which shows that $c_k(\mathcal{P})$ in (2) is independent of $d$ (i.e., $\mathcal{X}$). 
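The radial kernels mentioned above (Gaussian, Matérn, inverse multiquadric) all arise from the Schoenberg representation $k(x, y) = \int_0^\infty e^{-t\|x-y\|^2} d\nu(t)$ recalled in the Notation paragraph of this paper. A small numerical sketch, with a hypothetical discrete measure $\nu$ chosen purely for illustration, spot-checks that such a mixture of Gaussians is indeed a positive definite kernel:

```python
import math

def radial_kernel_from_nu(nu):
    """Radial kernel from the Schoenberg representation with a *discrete*
    measure nu on [0, inf): k(x, y) = sum_t nu[t] * exp(-t * ||x - y||^2)."""
    def k(x, y):
        sq = sum((a - b) ** 2 for a, b in zip(x, y))
        return sum(w * math.exp(-t * sq) for t, w in nu.items())
    return k

# Hypothetical discrete measure: nu = 0.7 * delta_{0.5} + 0.3 * delta_{2.0}.
# Its support is not {0}, so the resulting kernel is universal.
nu = {0.5: 0.7, 2.0: 0.3}
k = radial_kernel_from_nu(nu)

pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
G = [[k(p, q) for q in pts] for p in pts]

# Spot-check positive semi-definiteness: z^T G z >= 0 for a few vectors z.
for z in [(1.0, -1.0, 0.0, 0.0), (1.0, 1.0, 1.0, 1.0), (2.0, -1.0, -1.0, 0.5)]:
    quad = sum(z[i] * G[i][j] * z[j] for i in range(4) for j in range(4))
    assert quad >= -1e-12
# k(x, x) equals the total mass nu([0, inf)) = 0.7 + 0.3 = 1.
assert abs(k(pts[0], pts[0]) - 1.0) < 1e-12
```

Any non-negative weights and locations would do here; the point is that positive definiteness for every $d$ is inherited from the Gaussian components, exactly as Schoenberg's theorem asserts.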
This result provides a sharp lower bound for MMD estimation both in terms of the rate and the constant (which is independent of $\mathcal{X}$) that matches the behavior of the upper bound for $\mathrm{MMD}_{n,m}$. The proofs of these results are provided in Section 3, while supplementary results are collected in an appendix.\n\n
Notation. In this work we focus on radial kernels, i.e., $k(x, y) = \varphi(\|x - y\|^2)$ for all $x, y \in \mathbb{R}^d$. Schoenberg's theorem [12] states that a radial kernel $k$ is positive definite for every $d$ if and only if there exists a non-negative finite Borel measure $\nu$ on $[0, \infty)$ such that\n\n
$$k(x, y) = \int_0^\infty e^{-t\|x - y\|^2}\, d\nu(t) \quad (3)$$\n\n
for all $x, y \in \mathbb{R}^d$. An important example of a radial kernel is the Gaussian kernel $k(x, y) = \exp\{-\|x - y\|^2/(2\eta^2)\}$ for $\eta^2 > 0$. [17, Proposition 5] showed that $k$ in (3) is universal if and only if $\mathrm{supp}(\nu) \ne \{0\}$, where for a finite non-negative Borel measure $\mu$ on $\mathbb{R}^d$ we define $\mathrm{supp}(\mu) = \{x \in \mathbb{R}^d \mid \text{if } x \in U \text{ and } U \text{ is open, then } \mu(U) > 0\}$.\n\n
2 Main results\n\n
In this section, we present the main results of our work, wherein we develop minimax lower bounds for the estimation of $\mathrm{MMD}_k(P, Q)$ when $k$ is a radial universal kernel on $\mathbb{R}^d$. We show that the minimax rate for estimating $\mathrm{MMD}_k(P, Q)$ based on random samples $(X_i)_{i=1}^n \overset{i.i.d.}{\sim} P$ and $(Y_i)_{i=1}^m \overset{i.i.d.}{\sim} Q$ is $m^{-1/2} + n^{-1/2}$, thereby establishing the minimax rate optimality of the empirical estimator $\mathrm{MMD}_{n,m}$ of $\mathrm{MMD}_k(P, Q)$. First, we present the following result (proved in Section 3.1) for Gaussian kernels, which is based on an argument similar to the one used in [21] to obtain a minimax lower bound for the estimation of $\mu_P$.\n\n
Theorem 1. Let $\mathcal{P}$ be the set of all Borel probability measures over $\mathbb{R}^d$ with infinitely differentiable densities. Let $k$ be a Gaussian kernel with bandwidth parameter $\eta^2 > 0$. Then the following holds:\n\n
$$\inf_{\hat F_{n,m}} \sup_{P,Q \in \mathcal{P}} P^n \times Q^m\left\{ \mathrm{MMD}_k(P, Q) - \hat F_{n,m} \ge \frac{1}{8}\sqrt{\frac{1}{d+1}}\, \max\left\{\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{m}}\right\} \right\} \ge \frac{1}{5}. \quad (4)$$\n\n
The following remarks can be made about Theorem 1.\n\n
(a) Theorem 1 shows that $\mathrm{MMD}_k(P, Q)$ cannot be estimated at a rate faster than $\max\{n^{-1/2}, m^{-1/2}\}$ by any estimator $\hat F_{n,m}$ for all $P, Q \in \mathcal{P}$. Since $\max\{m^{-1/2}, n^{-1/2}\} \ge \frac{1}{2}(m^{-1/2} + n^{-1/2})$, the result combined with (1) therefore establishes the minimax rate optimality of the empirical estimator $\mathrm{MMD}_{n,m}$.\n\n
(b) While Theorem 1 shows the right order of dependence on $m$ and $n$, the dependence on $d$ seems to be sub-optimal, as the upper bound on $|\mathrm{MMD}_{n,m} - \mathrm{MMD}_k(P, Q)|$ depends only on the bound on the kernel and is independent of $d$. This sub-optimal dependence on $d$ may be due to the fact that the proof of Theorem 1 (see Section 3.1), as mentioned above, is closely based on the arguments applied in [21] for the minimax estimation of $\mu_P$. While the lower bounding technique used in [21]\u2014commonly known as Le Cam's method based on many hypotheses [22, Chapter 2]\u2014provides optimal results in the problem of estimation of functions (e.g., estimation of $\mu_P$ in the norm of $\mathcal{H}$), it often fails to do so in the case of estimation of real-valued functionals, which is precisely the focus of our work. Even though Theorem 1 is sub-optimal, we present the result to highlight the fact that minimax lower bounds for the estimation of $\mu_P$ may not yield optimal results for $\mathrm{MMD}_k(P, Q)$. In Theorem 2, we will develop a new argument based on two fuzzy hypotheses, which is a method of choice for nonparametric estimation of functionals [22, Section 2.7.4]. 
This will allow us to get rid of the superfluous dependence on the dimensionality $d$ in the lower bound.\n\n
(c) While Theorem 1 holds only for Gaussian kernels, we would like to mention that, using the analysis of [21], Theorem 1 can be straightforwardly improved in various ways: (i) it can be generalized to hold for a wide class of radial universal kernels, and (ii) the factor $d^{-1/2}$ in (4) can be removed altogether when $\mathcal{P}$ consists of all Borel discrete distributions on $\mathbb{R}^d$. However, these improvements do not involve any ideas beyond those captured by the proof of Theorem 1 and so will not be discussed in this work. For details, we refer the interested reader to Theorems 2 and 6 of [21] for the extensions to radial universal kernels and discrete measures, respectively.\n\n
(d) Finally, it is worth mentioning that any lower bound on the minimax probability (including the bounds of Theorems 1 and 2) leads to a lower bound on the minimax risk, based on a simple application of Markov's inequality: $\mathbb{E}_{P^n \times Q^m}\big[s_{n,m}^{-1}\, |A_{n,m}|\big] \ge P^n \times Q^m\{|A_{n,m}| \ge s_{n,m}\}$.\n\n
The following result (proved in Section 3.2) is the main contribution of this work. It provides a minimax lower bound for the problem of MMD estimation which holds for general radial universal kernels. In contrast to Theorem 1, it avoids the superfluous dependence on $d$ and depends only on the properties of $k$ while exhibiting the correct rate.\n\n
Theorem 2. Let $\mathcal{P}$ be the set of all Borel probability measures over $\mathbb{R}^d$ with infinitely differentiable densities. Let $k$ be a radial kernel on $\mathbb{R}^d$ of the form (3), where $\nu$ is a bounded non-negative measure on $[0, \infty)$. Assume that there exist $0 < t_0 \le t_1 < \infty$ and $0 < \beta \le 1$ such that $\nu([t_0, t_1]) \ge \beta$. Then the following holds:\n\n
$$\inf_{\hat F_{n,m}} \sup_{P,Q \in \mathcal{P}} P^n \times Q^m\left\{ \mathrm{MMD}_k(P, Q) - \hat F_{n,m} \ge \frac{1}{20}\sqrt{\frac{t_0 \beta}{t_1 e}}\, \max\left\{\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{m}}\right\} \right\} \ge \frac{1}{14}. \quad (5)$$\n\n
Note that the existence of $0 < t_0 \le t_1 < \infty$ and $0 < \beta \le 1$ such that $\nu([t_0, t_1]) \ge \beta$ ensures that $\mathrm{supp}(\nu) \ne \{0\}$ (i.e., the kernel is not a constant function), which implies $k$ is universal. If $k$ is a Gaussian kernel with bandwidth parameter $\eta^2 > 0$, it is easy to verify that $t_0 = t_1 = (2\eta^2)^{-1}$ and $\beta = 1$ satisfy $\nu([t_0, t_1]) \ge \beta$, as the Gaussian kernel is generated by $\nu = \delta_{1/(2\eta^2)}$ in (3), where $\delta_x$ is a Dirac measure supported at $x$. Therefore we obtain a dimension-independent constant in (5) for Gaussian kernels, in contrast to the bound in (4).\n\n
3 Proofs\n\n
In this section, we present the proofs of Theorems 1 and 2. Before doing so, we first introduce the setting of nonparametric estimation. Let $F \colon \Theta \to \mathbb{R}$ be a functional defined on a measurable space $\Theta$ and $\mathcal{P}_\Theta = \{P_\theta : \theta \in \Theta\}$ be a family of probability distributions indexed by $\Theta$ and defined over a measurable space $\mathcal{X}$ associated with data. We observe the data $D \in \mathcal{X}$ distributed according to an unknown element $P_\theta \in \mathcal{P}_\Theta$, and the goal is to estimate $F(\theta)$. Usually $\mathcal{X}$, $D$, and $P_\theta$ will depend on the sample size $n$. Let $\hat F_n := \hat F_n(D)$ be an estimator of $F(\theta)$ based on $D$. The following well-known result [22, Theorem 2.2] provides a lower bound on the minimax probability of this problem. We refer the reader to Appendix A for a proof of its more general version.\n\n
Theorem 3. Assume there exist $\theta_0, \theta_1 \in \Theta$ such that $|F(\theta_0) - F(\theta_1)| \ge 2s > 0$ and $\mathrm{KL}(P_{\theta_1} \| P_{\theta_0}) \le \alpha$ with $0 < \alpha < \infty$. Then\n\n
$$\inf_{\hat F_n} \sup_{\theta \in \Theta} P_\theta\left\{ |\hat F_n(D) - F(\theta)| \ge s \right\} \ge \max\left( \frac{1}{4} e^{-\alpha},\ \frac{1 - \sqrt{\alpha/2}}{2} \right),$$\n\n
where $\mathrm{KL}(P_{\theta_1} \| P_{\theta_0}) := \int \log\big(\frac{dP_{\theta_1}}{dP_{\theta_0}}\big)\, dP_{\theta_1}$ denotes the Kullback-Leibler divergence between $P_{\theta_1}$ and $P_{\theta_0}$.\n\n
The above result (also called Le Cam's method) provides the recipe for obtaining minimax lower bounds, where the goal is to construct two hypotheses $\theta_0, \theta_1 \in \Theta$ such that (i) $F(\theta_0)$ and $F(\theta_1)$ are far apart, while (ii) the corresponding distributions $P_{\theta_0}$ and $P_{\theta_1}$ are close enough. Requirement (i) can be relaxed by introducing two random (fuzzy) hypotheses $\theta_0, \theta_1 \in \Theta$ and requiring $F(\theta_0)$ and $F(\theta_1)$ to be far apart with high probability. This weaker requirement leads to a lower bounding technique called the method of two fuzzy hypotheses. This method is captured by the following theorem [22, Theorem 2.14] and is commonly used to derive lower bounds on the minimax risk in the problem of estimation of functionals [22, Section 2.7.4].\n\n
Theorem 4. Let $\mu_0$ and $\mu_1$ be any probability distributions over $\Theta$. Assume that\n\n
1. There exist $c \in \mathbb{R}$, $s > 0$, $0 \le \beta_0, \beta_1 < 1$ such that $\mu_0\{\theta : F(\theta) \le c\} \ge 1 - \beta_0$ and $\mu_1\{\theta : F(\theta) \ge c + 2s\} \ge 1 - \beta_1$.\n\n
2. There exist $\tau > 0$ and $0 < \alpha < 1$ such that $P_1\big( \frac{dP_0^a}{dP_1} \ge \tau \big) \ge 1 - \alpha$, where\n\n
$$P_i(D) = \int P_\theta(D)\, \mu_i(d\theta), \qquad i \in \{0, 1\},$$\n\n
and $P_0^a$ is the absolutely continuous component of $P_0$ with respect to $P_1$.\n\n
Then\n\n
$$\inf_{\hat F_n} \sup_{\theta \in \Theta} P_\theta\left\{ |\hat F_n(D) - F(\theta)| \ge s \right\} \ge \frac{\tau(1 - \alpha - \beta_1) - \beta_0}{1 + \tau}.$$\n\n
With this setup and background, we are ready to prove Theorems 1 and 2.\n\n
3.1 Proof of Theorem 1\n\n
The proof is based on Theorem 3 and treats the two cases $m \ge n$ and $m < n$ separately. We consider only the case $m \ge n$, as the second one follows the same steps. 
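As a quick numeric sanity check (not part of the original argument), Theorem 3's lower bound can be evaluated at the constants $c_1 = 0.16$, $c_2 = 0.23$ that are fixed at the end of this proof, where the KL divergence between the two product hypotheses is bounded by $\alpha = c_2/(2c_1)$:

```python
import math

# Theorem 3 lower bound: max(e^{-alpha}/4, (1 - sqrt(alpha/2))/2),
# where alpha is an upper bound on KL(P_{theta_1} || P_{theta_0}).
def le_cam_bound(alpha):
    return max(0.25 * math.exp(-alpha), 0.5 * (1.0 - math.sqrt(alpha / 2.0)))

# Constants chosen at the end of this proof; there the KL divergence is
# bounded by alpha = c2 / (2 * c1).
c1, c2 = 0.16, 0.23
alpha = c2 / (2.0 * c1)
assert le_cam_bound(alpha) > 1.0 / 5.0   # consistent with the 1/5 in (4)
```

With these constants the second term of the max is the binding one, and it exceeds $1/5$ by a thin margin, which is why the probability on the right-hand side of (4) is stated as $1/5$.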
Let Gd denote a class of multivariate\nGaussian distributions over Rd with covariance matrices proportional to identity matrix Id 2 Rd\u21e5d.\nIn our case Gd \u2713P , which leads to the following lower bound for any s > 0:\nsup\nP,Q2P\nNote that every element G(\u00b5, 2Id) 2G d is indexed by a pair (\u00b5, 2) 2 Rd \u21e5 (0,1) =: \u02dc\u21e5. Given\ntwo elements P, Q 2G d, the data is distributed according to P n \u21e5 Qm. This brings us into the\n2 for \u2713 = (\u02dc\u27131, \u02dc\u27132) 2 \u21e5\ncontext of Theorem 3 with \u21e5:= \u02dc\u21e5 \u21e5 \u02dc\u21e5, X := (Rd)n+m, P\u2713 := Gn\nwith Gaussian distributions G1 and G2 corresponding to parameters \u02dc\u27131, \u02dc\u27132 2 \u02dc\u21e5 respectively, and\nF (\u2713) = MMDk(G1, G2).\nIn order to apply Theorem 3 we need to choose two probability distributions P\u27130 and P\u27131. We de\ufb01ne\nfour different d-dimensional Gaussian distributions:\n0 , 2Id), Q0 = G(\u00b5Q\n\nP n\u21e5QmnMMDk(P, Q) \u02c6Fn,m so .\n\n0 , 2Id), P1 = Q1 = G(0, 2Id)\n\n1 \u21e5 Gm\n\nP0 = G(\u00b5P\n\nP,Q2Gd\n\nwith\n\n1\n\nn\n\nc2\u23182\n\nc1\u23182\n\n2 =\n\nk\u00b5P\n\nm\u2318 ,\n\nd \u21e32 +\n\nm\u25c6 ,\nd \u2713 1\nn \uf8ffqc2 1\nthis construction is possible as long asp c3\n\n0 k2 =\n\n+\n\nn\n\nwhere c1, c2, c3 > 0 are positive constants independent of m and n to be speci\ufb01ed later. Note that\nm , which is clearly satis\ufb01ed if\n\nn + 1\n\nc3 \uf8ff c2.\nFirst we will check the upper bound on the KL divergence between the distributions. Using the chain\nrule of KL divergence and its closed form expression for Gaussian distributions we write\n\nk\u00b5Q\n0 k2 =\nm +p c2\n\nc2\u23182\ndm\n\n,\n\nk\u00b5P\n\n0 \u00b5Q\n\n0 k2 =\n\nc3\u23182\ndn\n\n,\n\nNext we need to lower bound an absolute value between MMDk(P0, Q0) and MMDk(P1, Q1). 
Note\nthat\n\n|MMDk(P0, Q0) MMDk(P1, Q1)| = MMDk(P0, Q0).\n\n(6)\n\n5\n\nKL(P n\n\n1 \u21e5 Qm\n\n1 kP n\n\n0 \u21e5 Qm\n\n22 + m \u00b7 k\u00b5Q\n0 k2\n0 k2\n0 ) = n \u00b7 k\u00b5P\n22 = n \u00b7\n\nn + 1\n\nm\nc2\u23182 1\n2c1\u231822 + n\nm + m \u00b7\n2c12 + n\nm =\n\n2 + n\nm\n\nc2\n2c1\n\n.\n\n= c2\n\nc2\u23182 1\nm\n\n2c1\u231822 + n\nm\n\n\fUsing a closed-form expression for the MMD between Gaussian distribution [21, Eq. 25] we write\n\nMMD2\n\nk(P0, Q0) = 2\u2713\n\n\u23182\n\n\u23182 + 22\u25c6d/2 1 exp k\u00b5P\n\n2\u23182 + 42 !! .\n0 \u00b5Q\n0 k2\n\n(7)\n\nAssume\n\n0 \u00b5Q\nk\u00b5P\n0 k2\n2\u23182 + 42 \uf8ff 1.\nUsing 1 ex x/2, which holds for x 2 [0, 1], we write\n\n|MMDk(P0, Q0) MMDk(P1, Q1)| \n\nSince m n and (1 1\nm!d\n \nd + 2c12 + n\n\nd\n\n4\n\nUsing this and setting c3 = c2 we get\n\n0 \u00b5Q\n0 k2\n2\u23182 + 42 .\n\nd\n\nx )x1 monotonically decreases to e1 for x 1, we have\n\u2713 d\n\nm!d/4sk\u00b5P\nd + 2c12 + n\n= \u27131 \n1 + d/(6c1)\u25c6(1+d/(6c1)1)! 6c1\n2 r\n\nd + 6c1\u25c6d\n\n1\npn\n\n1\npn\n\ne 3c1\n\nc2\n\n1\n\n4\n\ne 3c1\n\nd \u00b7 d\n\n4\n\n2 .\n\n e 3c1\n2 r c2\n\n2d + 12c1\n\n.\n\n2d + 4c12 + n\nm \n2 r c2\n\nand\n\n1\n8\n\n>\n\n2\n\n1\n\nd + 6c1\n\n>\n\n1\n\nd + 1\n\n|MMDk(P0, Q0) MMDk(P1, Q1)|\nNow we set c1 = 0.16, c2 = 0.23. Checking that Condition (7) is satis\ufb01ed and noting that\n\nmax 1\n\n4\n\ne c2\n2c1 ,\n\n1 pc2/(4c1)\n\n2\n\n! >\n\n1\n5\n\n,\n\n1\n2\n\ne 3c1\n\nwe conclude the proof with an application of Theorem 3.\n\n3.2 Proof of Theorem 2\nFirst, we repeat the argument presented in the proof of Theorem 1 to bring ourselves into the context\nof minimax estimation, introduced in the beginning of Section 3.1. Namely, we reduce the class of\ndistributions P to its subset Gd containing all the multivariate Gaussian distributions over Rd with\ncovariance matrices proportional to identity matrix Id 2 Rd\u21e5d. The proof is based on Theorem 4 and\ntreats two cases m n and m < n separately. 
We consider only the case m n as the second one\nfollows the same steps.\nIn order to apply Theorem 4 we need to choose two \u201cfuzzy hypotheses\u201d, that is two probability\ndistributions \u00b50 and \u00b51 over \u21e5.\nIn our setting there is a one-to-one correspondence between\nparameters \u2713 2 \u21e5 and pairs of Gaussian distributions (G1, G2) 2G d \u21e5G d. Throughout the proof it\nwill be more convenient to treat \u00b50 and \u00b51 as distributions over Gd \u21e5G d. We will set \u00b50 to be a Dirac\nmeasure supported on (P0, Q0) with P0 = Q0 = G(0, 2Id). Clearly, MMDk(P0, Q0) = 0. This\ngives\n\nand the \ufb01rst inequality of Condition 1 in Theorem 4 holds with c = 0 and 0 = 0. Next we set \u00b51 to\nbe a distribution of a random pair (P, Q) with\n\n\u00b50\u2713 : F (\u2713) = 0 = 1\n\nQ = Gd(0, 2Id), P = Gd(\u00b5, 2Id),\n\n2 =\n\n1\n\n2t1d\n\n,\n\nwhere \u00b5 \u21e0 P\u00b5 for some probability distribution P\u00b5 over Rd to be speci\ufb01ed later. Next we are going to\ncheck Condition 2 of Theorem 4. For D = (x1, . . . , xn, y1, . . . , ym) de\ufb01ne \u201cposterior\u201d distributions\n\nas in Theorem 4. Using Markov\u2019s inequality we write\n\nPi(D) =Z P\u2713(D)\u00b5i(d\u2713),\n<\u2327 \u25c6 = P1\u2713 dP1\n\ndP0\n\nP1\u2713 dP0\n\ndP1\n\ni 2{ 0, 1}\n\n>\u2327 1\u25c6 \uf8ff \u2327E1\uf8ff dP1\ndP0 .\n\n(8)\n\n6\n\n\fWe have\n\n2\n\n22\n\ndP0\n\nhPn\n\nj=1 xj ,\u00b5i\n\ndP\u00b5(\u00b5).\n\ndP1\ndP0\n\ne nk\u00b5k2\n\n22 dP\u00b5(\u00b5)\n\ne nk\u00b5k2\n22 e\n\nk=1 e kykk2\nk=1 e kykk2\nNow we compute the expected value appearing in (8):\n\n(D) = RRdQn\n22 Qm\nj=1 e kxj\u00b5k2\nQn\n22 Qm\nj=1 e kxjk2\nED\u21e0P1\uf8ff dP1\n(D) =ZRd\n=ZRd\n\n=ZRd\n22 ED\u21e0P1hehPn\nj=1 xj , \u00b5i/2i dP\u00b5(\u00b5)\nE\uf8ffe\n22 \u2713ZRd\n2DPn\nj , \u00b5E \u21e0 Gnh\u00b50, \u00b5i, n 2k\u00b5k2. 
Using\nj \u21e0 Gd(n\u00b50, n 2Id) and as a resultDPn\nj , \u00b5E = e\n\nPn\nthe closed form for the moment generating function of a Gaussian distribution Z \u21e0 G(\u00b5, 2),\nE\u21e5etZ\u21e4 = e\u00b5te 1\n\nn are independent and distributed according to Gd(\u00b50, 2Id). Note that\n\nj , \u00b5E dP\u00b5(\u00b50)\u25c6 dP\u00b5(\u00b5),\n\nwhere X \u00b50\nj=1 X \u00b50\n\n1 , . . . , X \u00b50\n\n2 2t2, we get\n\nj=1 X \u00b50\n\ne nk\u00b5k2\n\nnk\u00b5k2\n22 .\n\nj=1 X \u00b50\n\n(9)\n\ne\n\n1\n\nTogether with (9) this gives\n\nED\u21e0P1\uf8ff dP1\n\ndP0\n\n(D) =ZRd\n\n22 dP\u00b5(\u00b50)\u25c6 dP\u00b5(\u00b5) = E\uf8ffe\n\nnh\u00b50,\u00b5i\n\n2 ,\n\n(10)\n\nwhere \u00b5 and \u00b50 are independent random variables both distributed according to P\u00b5. Now we set P\u00b5\nto be a uniform distribution in the d-dimensional cube of appropriate size\n\nIn this case, using Lemma B.1 presented in Appendix B we get\n\nE\uf8ffe\n\nnh\u00b50,\u00b5i\n\n2 =\n\ndYi=1\n\nUsing (10) and also assuming\n\ndnt1\u25c6 =\u2713 1\n\n4c2\n1\n\n1\u25c6d\nShi2c2\n\n.\n\n(11)\n\n1\n\ne\n\ne\n\n2\n\n2\n\nnk\u00b5k2\n\nnh\u00b50,\u00b5i\n\nnh\u00b50,\u00b5i\n\nj=1 X \u00b50\n\ne nk\u00b5k2\n\nE\uf8ffe\n2DPn\n22 \u2713ZRd\nP\u00b5 := Uhc1/pdnt1, c1/pdnt1id\nE\uf8ffe\n2 =\n\ndn2t1\n2nc2\n1\n\nn\u00b5i\u00b50i\n\nc2\n1\n\n2\n\n.\n\nShi\u2713 n\n1 \uf8ff 1\n\n1\n4c2\n1\n\ndYi=1\nShi2c2\n(D) \uf8ff\n<\u2327 \u2318 \uf8ff \u2327\n\nwe get\n\nED\u21e0P1\uf8ff dP1\nCombining with (8) we \ufb01nally get P1\u21e3 dP0\n1 \u2327\nFinally, we need to check the second inequality of Condition 1 in Theorem 4. Take two Gaussian\ndistributions P = Gd(\u00b5, 2Id) and Q = Gd(0, 2Id). Using [21, Eq. 30] we have\n\nShi2c2\n1 .\ndP1 \u2327\u2318 \n1 or equivalently P1\u21e3 dP0\nShi2c2\n1 . 
This shows that Condition 2 of Theorem 4 is satis\ufb01ed with \u21b5 = \u2327\nShi2c2\n1.\n\nShi2c2\n\n1\n4c2\n1\n\ndP0\n\ndP1\n\n4c2\n1\n\n4c2\n1\n\n4c2\n1\n\nMMD2\n\nk(P, Q) \n\n2\n\nt0\n\ne \u27131 \n\n2 + d\u25c6k\u00b5k2\nand t1k\u00b5k2 \uf8ff 1 + 4t12.\n\n1\n\ngiven\n\n2t1d\n\n2 =\n\n(12)\nNotice that the largest diagonal of a d-dimensional cube scales as pd. Using this we conclude that\nfor \u00b5 \u21e0 P\u00b5 with probability 1 it holds that k\u00b5k2 \uf8ff c2\nt1n and the second condition in (12) holds as\nlong as c2\nt1en) P\n\n(P,Q)\u21e0\u00b51(MMDk(P, Q) c2r t0\n\n1 \uf8ff n. Using this we get for any c2 > 0\n\n\u00b5\u21e0P\u00b5\u21e2k\u00b5k2 \n\nt1n\u2713 2 + d\n\nd \u25c6 .\n\n(13)\n\nc2\n2\n\nP\n\n1\n\n7\n\n\fNote that for \u00b5 \u21e0 P\u00b5, k\u00b5k2 =Pd\n\ncomputations show that\n\ni=1 \u00b52\n\ni is a sum of d i.i.d. bounded random variables. Also simple\n\nEk\u00b5k2 =\n\ndXi=1\n\nE\u00b52\n\ni = d\n\nc2\n1\n\n3dnt1\n\n=\n\nc2\n1\n3nt1\n\nand\n\nVk\u00b5k2 =\n\nV\u00b52\n\ni =\n\n4c4\n1\n\n45dn2t2\n1\n\n.\n\ndXi=1\n\nUsing Chebyshev-Cantelli\u2019s inequality of Theorem B.2 (Appendix B) we get for any \u270f> 0\n\nor equivalently for any \u270f> 0,\n\nP\n\n\u00b5\u21e0P\u00b5k\u00b5k2 > Ek\u00b5k2 + \u270f 1 \n\u00b5\u21e0P\u00b5k\u00b5k2 Ek\u00b5k2 \u270f = 1 P\n\u00b5\u21e0P\u00b5\u21e2k\u00b5k2 c2\n1\u2713 1\nP\n2 \u21e3 c2\nc1\u23182\n2 9p5\np5\n\nnt1 1 \n\n, we can further lower bound (13):\n\n3p5d\u25c6 1\n\n1 + \u270f2 .\n\n3 \n\n2\u270f\n\n1\n\nChoosing \u270f \uf8ff\n\nP\n\nt1en) P\n\n(P,Q)\u21e0\u00b51(MMDk(P, Q) c2r t0\n\n\u00b5\u21e0P\u00b5\u21e2k\u00b5k2 c2\n1\u2713 1\n2 \u21e3 c2\nc1\u23182\np5\n2 9p5\nWe \ufb01nally set \u2327 = 0.4, c1 = 0.8, c2 = 0.1,\u270f =\nand the second condition of (12) are satis\ufb01ed, while\n1+\u270f2\u2318\n1 1\n\n\u2327\u21e31 \u2327\n\nShi2c2\n\n1 + \u2327\n\n4c2\n1\n\n>\n\n1\n14\n\n3 \n\n2\u270f\n\n3p5d\u25c6 1\n\nnt1 1\n\n, and check that inequality (11)\n\n.\n\n1\n\n1 + 
45dn2t2\n\n1\n\n4c4\n1\n\n\u270f2\n\n1\n\n1 + \u270f2 .\n\nWe complete the proof by application of Theorem 4.\n\n4 Discussion\n\nIn this paper, we provided the \ufb01rst known lower bounds for the estimation of maximum mean\ndiscrepancy (MMD) based on \ufb01nite random samples. Based on this result, we established the minimax\nrate optimality of the empirical estimator. Interestingly, we showed that for radial kernels on Rd, the\noptimal speed of convergence depends only on the properties of the kernel and is independent of d.\nHowever, the paper does not address an important question about the minimax rates for MMD based\ntests. We believe that the minimax rates of testing with MMD matches with that of the minimax rates\nfor MMD estimation and we intend to build on this work in future to establish minimax testing results\ninvolving MMD.\nSince MMD is an integral probability metric (IPM) [11], a related problem of interest is the minimax\nestimation of IPMs.\nIPM is a class of distances on probability measures, which is de\ufb01ned as\n(P, Q) := sup{R f (x) d(P Q)(x) : f 2F} , where F is a class of bounded measurable\nfunctions on a topological space X with P and Q being Borel probability measures. It is well known\n[16] that the choice of F = {f 2H : kfkH \uf8ff 1} yields MMDk(P, Q) where H is a reproducing\nkernel Hilbert space with a bounded reproducing kernel k. [16] studied the empirical estimation\nof (P, Q) for various choices of F and established the consistency and convergence rates for the\nempirical estimator. However, it remains an open question as to whether these rates are minimax\noptimal.\n\nReferences\n[1] A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and\n\nStatistics. Kluwer Academic Publishers, London, UK, 2004.\n\n[2] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory\n\nof Independence. Oxford University Press, 2013.\n\n8\n\n\f[3] K. Fukumizu, A. Gretton, X. 
Sun, and B. Sch\u00f6lkopf. Kernel measures of conditional dependence.\nIn J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information\nProcessing Systems 20, pages 489\u2013496, Cambridge, MA, 2008. MIT Press.\n\n[4] K. Fukumizu, L. Song, and A. Gretton. Kernel Bayes\u2019 rule: Bayesian inference with positive\n\nde\ufb01nite kernels. J. Mach. Learn. Res., 14:3753\u20133783, 2013.\n\n[5] A. Gretton, K. M. Borgwardt, M. Rasch, B. Sch\u00f6lkopf, and A. Smola. A kernel method for the\ntwo sample problem. In B. Sch\u00f6lkopf, J. Platt, and T. Hoffman, editors, Advances in Neural\nInformation Processing Systems 19, pages 513\u2013520, Cambridge, MA, 2007. MIT Press.\n\n[6] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Sch\u00f6lkopf, and A. J. Smola. A kernel two-sample\n\ntest. Journal of Machine Learning Research, 13:723\u2013773, 2012.\n\n[7] A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Sch\u00f6lkopf, and A. J. Smola. A kernel statistical\ntest of independence. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in\nNeural Information Processing Systems 20, pages 585\u2013592. MIT Press, 2008.\n\n[8] E. L. Lehmann and G. Casella. Theory of Point Estimation. Springer-Verlag, New York, 2008.\n[9] D. Lopez-Paz, K. Muandet, B. Sch\u00f6lkopf, and I. Tolstikhin. Towards a learning theory of cause-\neffect inference. In Proceedings of the 32nd International Conference on Machine Learning,\nICML 2015, Lille, France, 6-11 July 2015, 2015.\n\n[10] K. Muandet, B. Sriperumbudur, K. Fukumizu, A. Gretton, and B. Sch\u00f6lkopf. Kernel mean\n\nshrinkage estimators. Journal of Machine Learning Research, 2016. To appear.\n\n[11] A. M\u00fcller. Integral probability metrics and their generating classes of functions. Advances in\n\nApplied Probability, 29:429\u2013443, 1997.\n\n[12] I. J. Schoenberg. Metric spaces and completely monotone functions. The Annals of Mathematics,\n\n39(4):811\u2013841, 1938.\n\n[13] A. J. Smola, A. 
Gretton, L. Song, and B. Sch\u00f6lkopf. A Hilbert space embedding for distributions.\nIn Proceedings of the 18th International Conference on Algorithmic Learning Theory (ALT),\npages 13\u201331. Springer-Verlag, 2007.\n\n[14] L. Song, A. Smola, A. Gretton, J. Bedo, and K. Borgwardt. Feature selection via dependence\n\nmaximization. Journal of Machine Learning Research, 13:1393\u20131434, 2012.\n\n[15] L. Song, X. Zhang, A. Smola, A. Gretton, and B. Sch\u00f6lkopf. Tailoring density estimation via\nreproducing kernel moment matching. In Proceedings of the 25th International Conference on\nMachine Learning, ICML 2008, pages 992\u2013999, 2008.\n\n[16] B. K. Sriperumbudur, K. Fukumizu, A. Gretton, B. Sch\u00f6lkopf, and G. R. G. Lanckriet. On the\nempirical estimation of integral probability metrics. Electronic Journal of Statistics, 6:1550\u2013\n1599, 2012.\n\n[17] B. K. Sriperumbudur, K. Fukumizu, and G. R. G. Lanckriet. Universality, characteristic kernels\n\nand RKHS embedding of measures. J. Mach. Learn. Res., 12:2389\u20132410, 2011.\n\n[18] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Sch\u00f6lkopf, and G. R. G. Lanckriet. Hilbert\nspace embeddings and metrics on probability measures. J. Mach. Learn. Res., 11:1517\u20131561,\n2010.\n\n[19] I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.\n[20] Z. Szab\u00f3, A. Gretton, B. P\u00f3czos, and B. K. Sriperumbudur. Two-stage sampled learning\ntheory on distributions. In Proceedings of the Eighteenth International Conference on Arti\ufb01cial\nIntelligence and Statistics, volume 38, pages 948\u2013957. JMLR Workshop and Conference\nProceedings, 2015.\n\n[21] I. Tolstikhin, B. Sriperumbudur, and K. Muandet. Minimax estimation of kernel mean embed-\n\ndings. arXiv:1602.04361 [math.ST], 2016.\n\n[22] A. B. Tsybakov. Introduction to Nonparametric Estimation. 
Springer, NY, 2008.\n", "award": [], "sourceid": 1050, "authors": [{"given_name": "Ilya", "family_name": "Tolstikhin", "institution": "MPI for Intelligent Systems"}, {"given_name": "Bharath", "family_name": "Sriperumbudur", "institution": "Penn State University"}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": "MPI for Intelligent Systems T\u00fcbingen, Germany"}]}