{"title": "Testing for Homogeneity with Kernel Fisher Discriminant Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 609, "page_last": 616, "abstract": "We propose to test for the homogeneity of two samples by using kernel Fisher discriminant analysis. This provides us with a consistent nonparametric test statistic, for which we derive the asymptotic distribution under the null hypothesis. We give experimental evidence of the relevance of our method on both artificial and real datasets.", "full_text": "Testing for Homogeneity with Kernel Fisher Discriminant Analysis

Zaïd Harchaoui
LTCI, TELECOM ParisTech and CNRS
46, rue Barrault, 75634 Paris cedex 13, France
zaid.harchaoui@enst.fr

Francis Bach
Willow Project, INRIA-ENS
45, rue d'Ulm, 75230 Paris, France
francis.bach@mines.org

Éric Moulines
LTCI, TELECOM ParisTech and CNRS
46, rue Barrault, 75634 Paris cedex 13, France
eric.moulines@enst.fr

Abstract

We propose to investigate test statistics for testing homogeneity based on kernel Fisher discriminant analysis. Asymptotic null distributions under the null hypothesis are derived, and consistency against fixed alternatives is established. Finally, experimental evidence of the performance of the proposed approach on both artificial and real datasets is provided.

1 Introduction

An important problem in statistics and machine learning consists in testing whether the distributions of two random variables are identical, under the alternative that they may differ in some way. More precisely, let $\{X^{(1)}_1, \ldots, X^{(1)}_{n_1}\}$ and $\{X^{(2)}_1, \ldots, X^{(2)}_{n_2}\}$ be independent random variables taking values in the input space $(\mathcal{X}, d)$, with distributions $\mathbb{P}_1$ and $\mathbb{P}_2$, respectively. The problem consists in testing the null hypothesis $H_0: \mathbb{P}_1 = \mathbb{P}_2$ against the alternative $H_A: \mathbb{P}_1 \neq \mathbb{P}_2$. This problem arises in many applications, ranging from computational anatomy [10] to process monitoring [7]. We shall allow the input space $\mathcal{X}$ to be quite general, including for example finite-dimensional Euclidean spaces or more sophisticated structures such as strings or graphs (see [17]) arising in applications such as bioinformatics [4].

Traditional approaches to this problem are based on distribution functions and use a certain distance between the empirical distributions obtained from the two samples. The most popular procedures are the two-sample Kolmogorov-Smirnov and Cramér-von Mises tests, which have been the standard for addressing these issues (at least when the dimension of the input space is small, and most often when $\mathcal{X} = \mathbb{R}$). Although these tests are popular due to their simplicity, they are known to be insensitive to certain characteristics of the distribution, such as densities containing high-frequency components or local features such as bumps. The low power of the traditional density-based statistics can be improved upon using test statistics based on kernel density estimators [2, 1] and wavelet estimators [6]. Recent work [11] has shown that one can use the difference in means in RKHSs in order to consistently test for homogeneity. In this paper, we show that taking into account the covariance structure in the RKHS allows us to obtain simple limiting distributions.

The paper is organized as follows: in Section 2 and Section 3, we state the main definitions and construct the test statistics. In Section 4, we give the asymptotic distribution of our test statistic under the null hypothesis, and investigate the consistency and the power of the test for fixed alternatives. In Section 5 we provide experimental evidence of the performance of our test statistic on both artificial and real datasets.
Detailed proofs are presented in the last section.

2 Mean and covariance in reproducing kernel Hilbert spaces

We first highlight the main assumptions made on the reproducing kernel, then introduce the operator-theoretic tools needed for working with distributions in infinite-dimensional spaces.

2.1 Reproducing kernel Hilbert spaces

Let $(\mathcal{X}, d)$ be a separable metric space, and denote by $\mathcal{X}$ the associated $\sigma$-algebra. Let $X$ be a random variable taking values in $\mathcal{X}$, with probability measure $\mathbb{P}$; the corresponding expectation is denoted $\mathbb{E}$. Consider a Hilbert space $(\mathcal{H}, \langle\cdot,\cdot\rangle_{\mathcal{H}})$ of functions from $\mathcal{X}$ to $\mathbb{R}$. The Hilbert space $\mathcal{H}$ is an RKHS if at each $x \in \mathcal{X}$ the point evaluation operator $\delta_x: \mathcal{H} \to \mathbb{R}$, which maps $f \in \mathcal{H}$ to $f(x) \in \mathbb{R}$, is a bounded linear functional. To each point $x \in \mathcal{X}$ there corresponds an element $\Phi(x) \in \mathcal{H}$ (we call $\Phi$ the feature map) such that $\langle\Phi(x), f\rangle_{\mathcal{H}} = f(x)$ for all $f \in \mathcal{H}$, and $\langle\Phi(x), \Phi(y)\rangle_{\mathcal{H}} = k(x, y)$, where $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a positive definite kernel. We denote by $\|f\|_{\mathcal{H}} = \langle f, f\rangle_{\mathcal{H}}^{1/2}$ the associated norm. It is assumed in the remainder that $\mathcal{H}$ is a separable Hilbert space; this is always the case if $\mathcal{X}$ is a separable metric space and the kernel is continuous (see [18]). Throughout this paper, we make the following two assumptions on the kernel:

(A1) The kernel $k$ is bounded, that is, $|k|_\infty = \sup_{(x,y)\in\mathcal{X}\times\mathcal{X}} k(x, y) < \infty$.

(A2) For all probability measures $\mathbb{P}$ on $(\mathcal{X}, \mathcal{X})$, the RKHS associated with $k(\cdot,\cdot)$ is dense in $L^2(\mathbb{P})$.

The asymptotic normality of our test statistic is valid without assumption (A2), while the consistency results against fixed alternatives do need (A2). Assumption (A2) holds for translation-invariant kernels [8], and in particular for the Gaussian kernel on $\mathbb{R}^d$ [18]. Note that we do not require the compactness of $\mathcal{X}$ as in [18].

2.2 Mean element and covariance operator

We shall need some operator-theoretic tools to define mean elements and covariance operators in an RKHS. A linear operator $T$ is said to be bounded if there is a number $C$ such that $\|Tf\|_{\mathcal{H}} \leq C\|f\|_{\mathcal{H}}$ for all $f \in \mathcal{H}$. The operator norm of $T$ is then defined as the infimum of such numbers $C$, that is, $\|T\| = \sup_{\|f\|_{\mathcal{H}} \leq 1} \|Tf\|_{\mathcal{H}}$ (see [9]).

We recall below some basic facts about first and second-order moments of RKHS-valued random variables. If $\int k^{1/2}(x,x)\,\mathbb{P}(dx) < \infty$, the mean element $\mu_{\mathbb{P}}$ is defined as the unique element of $\mathcal{H}$ satisfying, for all $f \in \mathcal{H}$,

$\langle \mu_{\mathbb{P}}, f\rangle_{\mathcal{H}} = \mathbb{P}f \overset{\mathrm{def}}{=} \int f\,d\mathbb{P}$ .   (1)

If furthermore $\int k(x,x)\,\mathbb{P}(dx) < \infty$, then the covariance operator $\Sigma_{\mathbb{P}}$ is defined as the unique linear operator on $\mathcal{H}$ satisfying, for all $f, g \in \mathcal{H}$,

$\langle f, \Sigma_{\mathbb{P}}\, g\rangle_{\mathcal{H}} \overset{\mathrm{def}}{=} \int (f - \mathbb{P}f)(g - \mathbb{P}g)\,d\mathbb{P}$ .   (2)

Note that when assumption (A2) is satisfied, the map $\mathbb{P} \mapsto \mu_{\mathbb{P}}$ is injective. The operator $\Sigma_{\mathbb{P}}$ is a self-adjoint nonnegative trace-class operator. In the sequel, the dependence of $\mu_{\mathbb{P}}$ and $\Sigma_{\mathbb{P}}$ on $\mathbb{P}$ is omitted whenever there is no risk of confusion. Given a sample $\{X_1, \ldots, X_n\}$, the empirical estimates of the mean element and the covariance operator are defined using empirical moments:

$\hat\mu = n^{-1}\sum_{i=1}^n k(X_i,\cdot)$ ,  $\hat\Sigma = n^{-1}\sum_{i=1}^n k(X_i,\cdot)\otimes k(X_i,\cdot) - \hat\mu\otimes\hat\mu$ .   (3)

The operator $\Sigma$ is a self-adjoint nonnegative trace-class operator.
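To make the estimates in (3) concrete, here is a minimal NumPy sketch (our own illustration, not from the paper; the Gaussian kernel choice and all names are ours): the mean element $\hat\mu$ is evaluated at a point through a kernel average, and the strictly positive eigenvalues of $\hat\Sigma$ coincide with those of the doubly centered Gram matrix divided by $n$.

```python
import numpy as np

def gaussian_gram(X, Y, sigma=1.0):
    """Gram matrix k(x_i, y_j) for the Gaussian kernel (bounded, so (A1) holds)."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
n = len(X)
K = gaussian_gram(X, X)

# Mean element of (3): evaluating mu_hat at x0 amounts to averaging k(X_i, x0)
x0 = np.zeros((1, 2))
mu_at_x0 = gaussian_gram(X, x0).mean()

# Covariance operator of (3): its strictly positive eigenvalues equal those of
# H K H / n, where H = I - (1/n) 11^T is the centering matrix
H = np.eye(n) - np.ones((n, n)) / n
spec = np.linalg.eigvalsh(H @ K @ H / n)[::-1]   # decreasing order
```

The decreasing array `spec` is the empirical counterpart of the spectrum $\{\lambda_p\}$ discussed next.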
Hence, it can be diagonalized in an orthonormal basis, with a spectrum composed of a strictly decreasing sequence $\lambda_p > 0$ tending to zero and potentially a null space $N(\Sigma)$ composed of functions $f \in \mathcal{H}$ such that $\int \{f - \mathbb{P}f\}^2\,d\mathbb{P} = 0$ [5], i.e., functions which are constant on the support of $\mathbb{P}$. The null space may be reduced to the null element (in particular for the Gaussian kernel), or may be infinite-dimensional. Similarly, there may be infinitely many strictly positive eigenvalues (true nonparametric case) or finitely many (underlying finite-dimensional problems).

3 KFDA-based test statistic

In the feature space, the two-sample homogeneity test can be formulated as follows. Given two independent identically distributed samples $\{X^{(1)}_1, \ldots, X^{(1)}_{n_1}\}$ and $\{X^{(2)}_1, \ldots, X^{(2)}_{n_2}\}$ from $\mathbb{P}_1$ and $\mathbb{P}_2$, with mean elements and covariance operators $(\mu_1, \Sigma_1)$ and $(\mu_2, \Sigma_2)$ respectively, we wish to test the null hypothesis $H_0: \mu_1 = \mu_2$ and $\Sigma_1 = \Sigma_2$ against the alternative hypothesis $H_A: \mu_1 \neq \mu_2$.

In this paper, we tackle the problem by using a (regularized) kernelized version of Fisher discriminant analysis. Denote by $\Sigma_W \overset{\mathrm{def}}{=} (n_1/n)\Sigma_1 + (n_2/n)\Sigma_2$ the pooled covariance operator, where $n \overset{\mathrm{def}}{=} n_1 + n_2$, corresponding to the within-class covariance matrix in the finite-dimensional setting (see [14]), and by $\Sigma_B \overset{\mathrm{def}}{=} (n_1 n_2/n^2)(\mu_2 - \mu_1)\otimes(\mu_2 - \mu_1)$ the between-class covariance operator. For $a = 1, 2$, denote by $(\hat\mu_a, \hat\Sigma_a)$ the empirical estimates of the mean element and the covariance operator, defined as previously stated in (3).
Denote by $\hat\Sigma_W \overset{\mathrm{def}}{=} (n_1/n)\hat\Sigma_1 + (n_2/n)\hat\Sigma_2$ the empirical pooled covariance estimator, and by $\hat\Sigma_B \overset{\mathrm{def}}{=} (n_1 n_2/n^2)(\hat\mu_2 - \hat\mu_1)\otimes(\hat\mu_2 - \hat\mu_1)$ the empirical between-class covariance operator. Let $\{\gamma_n\}_{n\geq 0}$ be a sequence of strictly positive numbers. The maximum Fisher discriminant ratio serves as a basis for our test statistic:

$n \max_{f\in\mathcal{H}} \frac{\langle f, \hat\Sigma_B f\rangle_{\mathcal{H}}}{\langle f, (\hat\Sigma_W + \gamma_n I) f\rangle_{\mathcal{H}}} = \frac{n_1 n_2}{n}\, \big\|(\hat\Sigma_W + \gamma_n I)^{-1/2}\hat\delta\big\|_{\mathcal{H}}^2$ ,   (4)

where $I$ denotes the identity operator and $\hat\delta \overset{\mathrm{def}}{=} \hat\mu_2 - \hat\mu_1$. Note that if the input space is Euclidean, e.g. $\mathcal{X} = \mathbb{R}^d$, the kernel is linear $k(x,y) = x^\top y$ and $\gamma_n = 0$, this quantity matches the so-called Hotelling $T^2$-statistic of the two-sample case [15]. Moreover, in practice it may be computed thanks to the kernel trick, adapted to kernel Fisher discriminant analysis and outlined in [17, Chapter 6]. We shall make the following assumptions on $\Sigma_1$ and $\Sigma_2$:

(B1) For $u = 1, 2$, the eigenvalues $\{\lambda_p(\Sigma_u)\}_{p\geq 1}$ satisfy $\sum_{p=1}^\infty \lambda_p^{1/2}(\Sigma_u) < \infty$.

(B2) For $u = 1, 2$, there are infinitely many strictly positive eigenvalues $\{\lambda_p(\Sigma_u)\}_{p\geq 1}$ of $\Sigma_u$.

The statistical analysis conducted in Section 4 shall demonstrate, as $\gamma_n \to 0$ at an appropriate rate, the need to recenter and rescale the maximum Fisher discriminant ratio (a standard statistical transformation known as studentization) in order to get a theoretically well-calibrated test statistic.
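For intuition, the right-hand side of (4) can be written down directly in the finite-dimensional linear-kernel case mentioned above, where it is a regularized Hotelling $T^2$-statistic. The sketch below is our own illustration, not code from the paper; `kfda_stat` and the toy data are hypothetical names.

```python
import numpy as np

def kfda_stat(X1, X2, gamma):
    """(n1*n2/n) * ||(Sigma_W_hat + gamma*I)^(-1/2) (mu2_hat - mu1_hat)||^2
    for the linear kernel k(x, y) = x^T y (a regularized Hotelling T^2)."""
    n1, n2 = len(X1), len(X2)
    n = n1 + n2
    delta = X2.mean(axis=0) - X1.mean(axis=0)
    S1 = np.cov(X1, rowvar=False, bias=True)      # empirical covariance, sample 1
    S2 = np.cov(X2, rowvar=False, bias=True)      # empirical covariance, sample 2
    SW = (n1 / n) * S1 + (n2 / n) * S2            # pooled within-class covariance
    sol = np.linalg.solve(SW + gamma * np.eye(len(delta)), delta)
    return (n1 * n2 / n) * float(delta @ sol)

rng = np.random.default_rng(1)
X1 = rng.normal(size=(500, 3))
X2_null = rng.normal(size=(500, 3))                    # same distribution as X1
X2_alt = rng.normal(size=(500, 3)) + [1.0, 0.0, 0.0]   # mean-shifted alternative
t_null = kfda_stat(X1, X2_null, gamma=1e-3)
t_alt = kfda_stat(X1, X2_alt, gamma=1e-3)
```

Under the mean shift the statistic is much larger than under the null, which is the behavior the studentization below turns into a calibrated test.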
The recentering and rescaling roles will be played respectively by $d_1(\hat\Sigma_W, \gamma_n)$ and $d_2(\hat\Sigma_W, \gamma_n)$, where, for a given compact operator $\Sigma$ with decreasing eigenvalues $\lambda_p(\Sigma)$, the quantity $d_r(\Sigma, \gamma)$ is defined for all $r \geq 1$ as

$d_r(\Sigma, \gamma) \overset{\mathrm{def}}{=} \Big(\sum_{p=1}^\infty (\lambda_p + \gamma)^{-r}\lambda_p^r\Big)^{1/r}$ .   (5)

4 Theoretical results

We consider in the sequel the following studentized test statistic:

$\hat T_n(\gamma_n) = \frac{\frac{n_1 n_2}{n}\big\|(\hat\Sigma_W + \gamma_n I)^{-1/2}\hat\delta\big\|_{\mathcal{H}}^2 - d_1(\hat\Sigma_W, \gamma_n)}{\sqrt{2}\, d_2(\hat\Sigma_W, \gamma_n)}$ .   (6)

In this paper, we first consider the asymptotic behavior of $\hat T_n$ under the null hypothesis, and then against a fixed alternative. This will establish that our nonparametric test procedure is consistent in power.

4.1 Asymptotic normality under the null hypothesis

In this section, we derive the distribution of the test statistic under the null hypothesis $H_0: \mathbb{P}_1 = \mathbb{P}_2$ of homogeneity, i.e. $\mu_1 = \mu_2$ and $\Sigma_1 = \Sigma_2 = \Sigma$.

Theorem 1. Assume (A1) and (B1). If $\mathbb{P}_1 = \mathbb{P}_2 = \mathbb{P}$ and if $\gamma_n + \gamma_n^{-1} n^{-1/2} \to 0$, then

$\hat T_n(\gamma_n) \xrightarrow{D} \mathcal{N}(0, 1)$ .   (7)

The proof is postponed to Section 7. Under the assumptions of Theorem 1, the sequence of tests that rejects the null hypothesis when $\hat T_n(\gamma_n) \geq z_{1-\alpha}$, where $z_{1-\alpha}$ is the $(1-\alpha)$-quantile of the standard normal distribution, is asymptotically of level $\alpha$. Note that the limiting distribution depends neither on the kernel nor on the regularization parameter.

4.2 Power consistency

We study the power of the test based on $\hat T_n(\gamma_n)$ under alternative hypotheses. The minimal requirement is to prove that this sequence of tests is consistent in power.
A sequence of tests of constant level $\alpha$ is said to be consistent in power if the probability of accepting the null hypothesis of homogeneity goes to zero as the sample size goes to infinity under a fixed alternative.

The following result gives some useful insights on $\|\Sigma_W^{-1/2}\delta\|_{\mathcal{H}}$, where $\delta \overset{\mathrm{def}}{=} \mu_2 - \mu_1$, the population counterpart of $\|(\hat\Sigma_W + \gamma_n I)^{-1/2}\hat\delta\|_{\mathcal{H}}$, upon which our test statistic is based: it shows that the limit is finite, strictly positive and independent of the kernel whenever $\mathbb{P}_1 \neq \mathbb{P}_2$ (see [8] for similar results for canonical correlation analysis).

Proposition 2. Assume (A1) and (A2). If $\gamma_n + \gamma_n^{-1} n^{-1/2} \to 0$, then for any probability distributions $\mathbb{P}_1$ and $\mathbb{P}_2$,

$\big\|(\hat\Sigma_W + \gamma_n I)^{-1/2}\hat\delta\big\|_{\mathcal{H}}^2 \xrightarrow{P} \big\|\Sigma_W^{-1/2}\delta\big\|_{\mathcal{H}}^2 = \rho_1\rho_2\Big(1 - \int \frac{p_1 p_2}{\rho_1 p_1 + \rho_2 p_2}\,d\nu\Big)\Big(\int \frac{p_1 p_2}{\rho_1 p_1 + \rho_2 p_2}\,d\nu\Big)^{-1}$ ,

where $\rho_a = \lim_{n\to\infty} n_a/n$ for $a = 1, 2$, and $\nu$ is any probability measure such that $\mathbb{P}_1$ and $\mathbb{P}_2$ are absolutely continuous w.r.t. $\nu$, with densities $p_1$ and $p_2$.

The norm $\|\Sigma_W^{-1/2}\delta\|_{\mathcal{H}}^2$ is finite when the $\chi^2$-divergence $\int p_1^{-1}(p_2 - p_1)^2\,d\nu$ is finite. It is equal to zero if and only if the $\chi^2$-divergence is null, that is, if and only if $\mathbb{P}_1 = \mathbb{P}_2$.

By combining the two previous results, we therefore obtain the following consistency theorem.

Theorem 3. Assume (A1) and (A2). Let $\mathbb{P}_1$ and $\mathbb{P}_2$ be two distributions over $(\mathcal{X}, \mathcal{X})$ such that $\mathbb{P}_1 \neq \mathbb{P}_2$.
If $\gamma_n + \gamma_n^{-1} n^{-1/2} \to 0$, then

$\mathbb{P}_{H_A}\big(\hat T_n(\gamma_n) > z_{1-\alpha}\big) \to 1$ .   (8)

5 Experiments

In this section, we investigate the experimental performance of our test statistic KFDA, and compare it in terms of power against other nonparametric test statistics.

5.1 Artificial data

We focus here on a particularly simple setting, in order to analyze the major issues arising when applying our approach in practice. Indeed, we consider the periodic smoothing spline kernel (see [19] for a detailed derivation), for which explicit formulae are available for the eigenvalues of the corresponding covariance operator when the underlying distribution is uniform. This allows us to alleviate the issue of estimating the spectrum of the covariance operator, and to weigh up the practical impact of the regularization on the power of our test statistic.

Table 1: Evolution of the power of KFDA and MMD, respectively, as $\gamma$ goes to 0.

gamma | 10^-1         | 10^-4         | 10^-7         | 10^-10
KFDA  | 0.01 ± 0.0032 | 0.11 ± 0.0062 | 0.98 ± 0.0031 | 0.99 ± 0.0001
MMD   | 0.01 ± 0.0023 | id.           | id.           | id.

Periodic smoothing spline kernel. Consider $\mathcal{X}$ as the circle, identified with the interval $[0,1]$ (with periodicity conditions). We consider the strictly positive sequence $K_\nu = (2\pi\nu)^{-2m}$ and the following norm:

$\|f\|_{\mathcal{H}}^2 = \frac{\langle f, c_0\rangle^2}{K_0} + \sum_{\nu>0} \frac{\langle f, c_\nu\rangle^2 + \langle f, s_\nu\rangle^2}{K_\nu}$ ,

where $c_\nu(t) = \sqrt{2}\cos 2\pi\nu t$ and $s_\nu(t) = \sqrt{2}\sin 2\pi\nu t$ for $\nu \geq 1$, and $c_0(t) = 1_{\mathcal{X}}$. This is an RKHS norm, associated with the kernel

$K(s, t) = \frac{(-1)^{m-1}}{(2m)!}\, B_{2m}\big((s-t) - \lfloor s-t\rfloor\big)$ ,

where $B_{2m}$ is the $2m$-th Bernoulli polynomial.
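As a numerical sanity check on this construction (our own sketch, not from the paper), one can evaluate the kernel for $m = 1$, where $B_2(x) = x^2 - x + 1/6$, and verify that the Gram matrix on a uniform grid of the circle is positive semidefinite with eigenvalues approximating $K_\nu = (2\pi\nu)^{-2m}$, each with multiplicity two:

```python
import numpy as np

def B2(x):
    """Second Bernoulli polynomial."""
    return x ** 2 - x + 1.0 / 6.0

def spline_kernel(s, t):
    """Periodic smoothing spline kernel for m = 1:
    K(s,t) = (-1)^(m-1)/(2m)! * B_2m((s-t) - floor(s-t)) = (1/2) B2({s-t})."""
    u = (s - t) % 1.0      # fractional part of s - t
    return 0.5 * B2(u)

# Gram matrix on a uniform grid of the circle [0, 1)
N = 256
x = np.arange(N) / N
G = spline_kernel(x[:, None], x[None, :])

# Eigenvalues of G/N approximate K_nu = (2*pi*nu)^(-2), each appearing twice
# (one cosine and one sine eigenfunction per frequency nu)
lam = np.sort(np.linalg.eigvalsh(G / N))[::-1]
```

The leading pair of eigenvalues approximates $K_1 = (2\pi)^{-2}$, the next pair $K_2 = (4\pi)^{-2}$, matching the multiplicity-two spectrum described above.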
We have $B_2(x) = x^2 - x + 1/6$. We consider the following testing problem:

$H_0: p_1 = p_2$ versus $H_A: p_1 \neq p_2$ ,

with $p_1$ the uniform density (i.e., the density with respect to the Lebesgue measure is equal to $c_0$), and $p_2 = p_1(c_0 + 0.25\, c_4)$. The covariance operator $\Sigma(p_1)$ has eigenvectors $c_0, c_\nu, s_\nu$, with eigenvalue $0$ for $c_0$ and eigenvalues $K_\nu$ for the others.

Comparison with MMD. We conducted an experimental comparison in terms of power, for $m = 2$, $n = 10^4$ and $\varepsilon = 0.5$. All quantities involving the eigenvalues of the covariance operator were computed from their population counterparts instead of being estimated. The sampling from $p_2$ was performed by inverting the cumulative distribution function. Table 1 displays the results, averaged over 10 Monte-Carlo runs.

5.2 Speaker verification

We conducted experiments on a speaker verification task [3], on a subset of 8 female speakers, using data from the NIST 2004 Speaker Recognition Evaluation. We refer the reader to [16] for details on the pre-processing of the data. Figure 1 shows results averaged over all pairs of speakers. For each pair of speakers, at each run we took 3000 samples from each speaker, launched our KFDA test to decide whether the samples come from the same speaker or not, and computed the type II error by comparing the prediction to the ground truth. We averaged the results over 100 runs for each pair, and over all pairs of speakers. The level was set to $\alpha = 0.05$, since the empirical level seemed to match the prescribed level for this value, as noticed in the previous subsection. We performed the same experiments for the Maximum Mean Discrepancy and the Tajvidi-Hall test statistic (TH, [13]). We summarize the results by plotting the ROC curves for all competing methods. Our method reaches good empirical power for a small value of the prescribed level ($1 - \beta = 90\%$ for $\alpha = 0.05$).
Maximum Mean Discrepancy also yields good empirical performance on this task.

Figure 1: Comparison of ROC curves (power against level) in a speaker verification task, for KFDA, MMD and TH.

6 Conclusion

We proposed a well-calibrated test statistic, built on kernel Fisher discriminant analysis, for which we proved that the asymptotic limit distribution under the null hypothesis is the standard normal distribution. Our test statistic can be readily computed from Gram matrices once a kernel is defined, and allows us to perform nonparametric hypothesis testing for homogeneity for high-dimensional data. The KFDA test statistic yields competitive performance for speaker verification.

7 Sketch of proof of asymptotic normality under the null hypothesis

Outline. The proof of the asymptotic normality of the test statistic under the null hypothesis follows three steps. As a first step, we derive an asymptotic approximation of the test statistic as $\gamma_n + \gamma_n^{-1} n^{-1/2} \to 0$, where the only remaining stochastic term is $\hat\delta$. The test statistic is then expanded onto the eigenbasis of $\Sigma$ and decomposed into two terms $B_n$ and $C_n$. The second step proves the asymptotic negligibility of $B_n$, while the third step establishes the asymptotic normality of $C_n$ by a martingale central limit theorem (MCLT).

Step 1: $\hat T_n(\gamma_n) = \tilde T_n(\gamma_n) + o_P(1)$. First, we may prove, using perturbation results for covariance operators, that, as $\gamma_n + \gamma_n^{-1} n^{-1/2} \to 0$,

$\hat T_n(\gamma_n) = \frac{(n_1 n_2/n)\,\big\|(\Sigma + \gamma_n I)^{-1/2}\hat\delta\big\|_{\mathcal{H}}^2 - d_1(\Sigma, \gamma_n)}{\sqrt{2}\, d_2(\Sigma, \gamma_n)} + o_P(1)$ .   (9)

For ease of notation, in the following we shall often omit $\Sigma$ in quantities involving it.
Hence, from now on, $\lambda_p$, $\lambda_q$, $d_{1,n}$, $d_{2,n}$ stand for $\lambda_p(\Sigma)$, $\lambda_q(\Sigma)$, $d_1(\Sigma, \gamma_n)$, $d_2(\Sigma, \gamma_n)$. Let $\{e_p\}_{p\geq 1}$ be an orthonormal basis of eigenfunctions of $\Sigma$ associated with $\{\lambda_p\}_{p\geq 1}$, and define

$Y_{n,p,i} \overset{\mathrm{def}}{=} \begin{cases} \big(\tfrac{n_2}{n_1 n}\big)^{1/2}\big(e_p(X^{(1)}_i) - \mathbb{E}[e_p(X^{(1)}_1)]\big) , & 1 \leq i \leq n_1 ,\\[2pt] -\big(\tfrac{n_1}{n_2 n}\big)^{1/2}\big(e_p(X^{(2)}_{i-n_1}) - \mathbb{E}[e_p(X^{(2)}_1)]\big) , & n_1 + 1 \leq i \leq n .\end{cases}$   (10)

We now give formulas for the moments of $\{Y_{n,p,i}\}_{1\leq i\leq n,\, p\geq 1}$, often used in the proof. Straightforward calculations give

$\sum_{i=1}^n \mathbb{E}[Y_{n,p,i} Y_{n,q,i}] = \lambda_p^{1/2}\lambda_q^{1/2}\,\delta_{p,q}$ ,   (11)

while the Cauchy-Schwarz inequality and the reproducing property give

$\mathrm{Cov}(Y_{n,p,i}^2, Y_{n,q,i}^2) \leq C\, n^{-2}|k|_\infty\, \lambda_p^{1/2}\lambda_q^{1/2}$ .   (12)

Denote $S_{n,p} \overset{\mathrm{def}}{=} \sum_{i=1}^n Y_{n,p,i}$. Using Eq. (11), our test statistic now writes as $\tilde T_n = (\sqrt{2}\, d_{2,n})^{-1} A_n$, with

$A_n \overset{\mathrm{def}}{=} \frac{n_1 n_2}{n}\big\|(\Sigma + \gamma_n I)^{-1/2}\hat\delta\big\|_{\mathcal{H}}^2 - d_{1,n} = \sum_{p=1}^\infty (\lambda_p + \gamma_n)^{-1}\big\{S_{n,p}^2 - \mathbb{E} S_{n,p}^2\big\} = B_n + 2 C_n$ ,   (13)

where $B_n$ and $C_n$ are defined as follows:

$B_n \overset{\mathrm{def}}{=} \sum_{p=1}^\infty (\lambda_p + \gamma_n)^{-1}\sum_{i=1}^n \big\{Y_{n,p,i}^2 - \mathbb{E} Y_{n,p,i}^2\big\}$ ,   (14)

$C_n \overset{\mathrm{def}}{=} \sum_{p=1}^\infty (\lambda_p + \gamma_n)^{-1}\sum_{i=1}^n Y_{n,p,i}\Big\{\sum_{j=1}^{i-1} Y_{n,p,j}\Big\}$ .   (15)

Step 2: $B_n = o_P(1)$. The proof consists in computing the variance of this term. Since the variables $Y_{n,p,i}$ and $Y_{n,q,j}$ are independent if $i \neq j$, we have $\mathrm{Var}(B_n) = \sum_{i=1}^n v_{n,i}$, where

$v_{n,i} \overset{\mathrm{def}}{=} \mathrm{Var}\Big(\sum_{p=1}^\infty (\lambda_p + \gamma_n)^{-1}\{Y_{n,p,i}^2 - \mathbb{E}[Y_{n,p,i}^2]\}\Big) = \sum_{p,q=1}^\infty (\lambda_p + \gamma_n)^{-1}(\lambda_q + \gamma_n)^{-1}\,\mathrm{Cov}(Y_{n,p,i}^2, Y_{n,q,i}^2)$ .

Using Eq. (12), we get $\sum_{i=1}^n v_{n,i} \leq C\, n^{-1}\gamma_n^{-2}\big(\sum_{p=1}^\infty \lambda_p^{1/2}\big)^2$, where the right-hand side is indeed negligible, since by assumption we have $\gamma_n^{-1} n^{-1/2} \to 0$ and $\sum_{p=1}^\infty \lambda_p^{1/2} < \infty$.

Step 3: $d_{2,n}^{-1} C_n \xrightarrow{D} \mathcal{N}(0, 1/2)$. We use the central limit theorem for triangular arrays of martingale differences (see e.g. [12, Theorem 3.2]). For $i = 1, \ldots, n$, denote

$\xi_{n,i} \overset{\mathrm{def}}{=} d_{2,n}^{-1}\sum_{p=1}^\infty (\lambda_p + \gamma_n)^{-1}\, Y_{n,p,i}\, M_{n,p,i-1}$ , where $M_{n,p,i} \overset{\mathrm{def}}{=} \sum_{j=1}^i Y_{n,p,j}$ ,   (16)

and let $\mathcal{F}_{n,i} = \sigma(Y_{n,p,j},\ p \geq 1,\ j \in \{0, \ldots, i\})$. Note that, by construction, $\xi_{n,i}$ is a martingale increment, i.e. $\mathbb{E}[\xi_{n,i} \mid \mathcal{F}_{n,i-1}] = 0$. The first step in the proof of the CLT is to establish that

$s_n^2 = \sum_{i=1}^n \mathbb{E}\big[\xi_{n,i}^2 \mid \mathcal{F}_{n,i-1}\big] \xrightarrow{P} 1/2$ .   (17)

The second step of the proof is to establish the negligibility condition. We use [12, Theorem 3.2], which requires to establish that $\max_{1\leq i\leq n} |\xi_{n,i}| \xrightarrow{P} 0$ (smallness) and that $\mathbb{E}(\max_{1\leq i\leq n} \xi_{n,i}^2)$ is bounded in $n$ (tightness), where $\xi_{n,i}$ is defined in (16). We establish the two conditions simultaneously by checking that

$\mathbb{E}\Big(\max_{1\leq i\leq n} \xi_{n,i}^2\Big) = o(1)$ .   (18)

Splitting the sum $s_n^2$ between diagonal terms $D_n$ and off-diagonal terms $E_n$, we have

$D_n = d_{2,n}^{-2}\sum_{p=1}^\infty (\lambda_p + \gamma_n)^{-2}\sum_{i=1}^n M_{n,p,i-1}^2\, \mathbb{E}[Y_{n,p,i}^2]$ ,   (19)

$E_n = d_{2,n}^{-2}\sum_{p\neq q} (\lambda_p + \gamma_n)^{-1}(\lambda_q + \gamma_n)^{-1}\sum_{i=1}^n M_{n,p,i-1}\, M_{n,q,i-1}\, \mathbb{E}[Y_{n,p,i} Y_{n,q,i}]$ .   (20)

Consider first the diagonal terms $D_n$. We first compute their mean.
Note that $\mathbb{E}[M_{n,p,i}^2] = \sum_{j=1}^i \mathbb{E}[Y_{n,p,j}^2]$. Using Eq. (11), we get

$\sum_{p=1}^\infty (\lambda_p + \gamma_n)^{-2}\sum_{i=1}^n\sum_{j=1}^{i-1}\mathbb{E}[Y_{n,p,j}^2]\,\mathbb{E}[Y_{n,p,i}^2] = \frac{1}{2}\sum_{p=1}^\infty (\lambda_p + \gamma_n)^{-2}\bigg\{\Big[\sum_{i=1}^n \mathbb{E}[Y_{n,p,i}^2]\Big]^2 - \sum_{i=1}^n \mathbb{E}^2[Y_{n,p,i}^2]\bigg\} = \frac{1}{2}\, d_{2,n}^2\,\{1 + O(n^{-1})\}$ .

Therefore, $\mathbb{E}[D_n] = 1/2 + o(1)$. Next, we may prove that $D_n - \mathbb{E}[D_n] = o_P(1)$ is negligible, by checking that $\mathrm{Var}[D_n] = o(1)$. We finally consider $E_n$ defined in (20), and prove that $E_n = o_P(1)$ using Eq. (11). This concludes the proof of Eq. (17).

We finally show Eq. (18). Since $|Y_{n,p,i}| \leq n^{-1/2}|k|_\infty^{1/2}$ $\mathbb{P}$-a.s., we may bound

$\max_{1\leq i\leq n} |\xi_{n,i}| \leq C\, d_{2,n}^{-1} n^{-1/2}\sum_{p=1}^\infty (\lambda_p + \gamma_n)^{-1}\max_{1\leq i\leq n} |M_{n,p,i-1}|$ .   (21)

Then, the Doob inequality implies that $\mathbb{E}^{1/2}[\max_{1\leq i\leq n} |M_{n,p,i-1}|^2] \leq C\, \mathbb{E}^{1/2}[M_{n,p,n-1}^2] \leq C\lambda_p^{1/2}$. Plugging this bound into (21), the Minkowski inequality gives

$\mathbb{E}^{1/2}\Big(\max_{1\leq i\leq n}\xi_{n,i}^2\Big) \leq C\, d_{2,n}^{-1}\gamma_n^{-1} n^{-1/2}\sum_{p=1}^\infty \lambda_p^{1/2}$ ,

and the proof is concluded using the fact that $\gamma_n + \gamma_n^{-1} n^{-1/2} \to 0$ and Assumption (B1).

References

[1] D. L. Allen. Hypothesis testing using an L1-distance bootstrap. The American Statistician, 51(2):145-150, 1997.

[2] N. H. Anderson, P. Hall, and D. M. Titterington. Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. Journal of Multivariate Analysis, 50(1):41-54, 1994.

[3] F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier, T. Merlin, J. Ortega-Garcia, D. Petrovska-Delacretaz, and D. A.
Reynolds. A tutorial on text-independent speaker verification. EURASIP Journal on Applied Signal Processing, 4:430-451, 2004.

[4] K. Borgwardt, A. Gretton, M. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):49-57, 2006.

[5] H. Brezis. Analyse Fonctionnelle. Masson, 1980.

[6] C. Butucea and K. Tribouley. Nonparametric homogeneity tests. Journal of Statistical Planning and Inference, 136(3):597-639, 2006.

[7] E. Carlstein, H. Müller, and D. Siegmund, editors. Change-point Problems, number 23 in IMS Monograph. Institute of Mathematical Statistics, Hayward, CA, 1994.

[8] K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In Adv. NIPS, 2008.

[9] I. Gohberg, S. Goldberg, and M. A. Kaashoek. Classes of Linear Operators Vol. I. Birkhäuser, 1990.

[10] U. Grenander and M. Miller. Pattern Theory: From Representation to Inference. Oxford Univ. Press, 2007.

[11] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two-sample problem. In Adv. NIPS, 2006.

[12] P. Hall and C. Heyde. Martingale Limit Theory and Its Application. Academic Press, 1980.

[13] P. Hall and N. Tajvidi. Permutation tests for equality of distributions in high-dimensional settings. Biometrika, 89(2):359-374, 2002.

[14] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer, 2001.

[15] E. Lehmann and J. Romano. Testing Statistical Hypotheses (3rd ed.). Springer, 2005.

[16] J. Louradour, K. Daoudi, and F. Bach. Feature space Mahalanobis sequence kernels: Application to SVM speaker verification. IEEE Transactions on Audio, Speech and Language Processing, 2007. To appear.

[17] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge Univ. Press, 2004.

[18] I. Steinwart, D.
Hush, and C. Scovel. An explicit description of the reproducing kernel Hilbert spaces of Gaussian RBF kernels. IEEE Transactions on Information Theory, 52:4635-4643, 2006.

[19] G. Wahba. Spline Models for Observational Data. SIAM, 1990.
", "award": [], "sourceid": 1056, "authors": [{"given_name": "Éric", "family_name": "Moulines", "institution": null}, {"given_name": "Francis", "family_name": "Bach", "institution": null}, {"given_name": "Zaïd", "family_name": "Harchaoui", "institution": null}]}