{"title": "B-test: A Non-parametric, Low Variance Kernel Two-sample Test", "book": "Advances in Neural Information Processing Systems", "page_first": 755, "page_last": 763, "abstract": "We propose a family of maximum mean discrepancy (MMD) kernel two-sample tests that have low sample complexity and are consistent. The tests have a hyperparameter that controls the tradeoff between sample complexity and computation time. Our family of tests, which we denote B-tests, is both computationally and statistically efficient, combining favorable properties of previously proposed MMD two-sample tests: it better leverages the samples to produce low variance estimates in the finite sample case, uses a smaller-than-quadratic number of kernel evaluations, and entirely avoids the computational burden of the complex null-hypothesis approximation required by tests relying on one-sample U-statistics, while maintaining consistency and probabilistically conservative thresholds on Type I error. Finally, recent results on combining multiple kernels transfer seamlessly to our hypothesis test, allowing a further increase in discriminative power and decrease in sample complexity.", "full_text": "B-tests: Low Variance Kernel Two-Sample Tests

Wojciech Zaremba
Center for Visual Computing
École Centrale Paris
Châtenay-Malabry, France

Matthew Blaschko
Équipe GALEN
Inria Saclay
Châtenay-Malabry, France

Arthur Gretton
Gatsby Unit
University College London
United Kingdom

{woj.zaremba,arthur.gretton}@gmail.com, matthew.blaschko@inria.fr

Abstract

A family of maximum mean discrepancy (MMD) kernel two-sample tests is introduced. Members of the test family are called Block-tests or B-tests, since the test statistic is an average over MMDs computed on subsets of the samples.
The choice of block size allows control over the tradeoff between test power and computation time. In this respect, the B-test family combines favorable properties of previously proposed MMD two-sample tests: B-tests are more powerful than a linear time test where blocks are just pairs of samples, yet they are more computationally efficient than a quadratic time test where a single large block incorporating all the samples is used to compute a U-statistic. A further important advantage of the B-tests is their asymptotically Normal null distribution: this is by contrast with the U-statistic, which is degenerate under the null hypothesis, and for which estimates of the null distribution are computationally demanding. Recent results on kernel selection for hypothesis testing transfer seamlessly to the B-tests, yielding a means to optimize test power via kernel choice.

1 Introduction

Given two samples {x_i}_{i=1}^n, where x_i ∼ P i.i.d., and {y_i}_{i=1}^n, where y_i ∼ Q i.i.d., the two sample problem consists in testing whether to accept or reject the null hypothesis H0 that P = Q, vs the alternative hypothesis HA that P and Q are different. This problem has recently been addressed using measures of similarity computed in a reproducing kernel Hilbert space (RKHS), which apply in very general settings where P and Q might be distributions over high dimensional data or structured objects. Kernel test statistics include the maximum mean discrepancy [10, 6] (of which the energy distance is an example [18, 2, 22]), which is the distance between expected features of P and Q in the RKHS; the kernel Fisher discriminant [12], which is the distance between expected feature maps normalized by the feature space covariance; and density ratio estimates [24].
When used in testing, it is necessary to determine whether the empirical estimate of the relevant similarity measure is sufficiently large as to give the hypothesis P = Q low probability; i.e., below a user-defined threshold α, denoted the test level. The test power denotes the probability of correctly rejecting the null hypothesis, given that P ≠ Q.

The minimum variance unbiased estimator MMDu of the maximum mean discrepancy, on the basis of n samples observed from each of P and Q, is a U-statistic, costing O(n²) to compute. Unfortunately, this statistic is degenerate under the null hypothesis H0 that P = Q, and its asymptotic distribution takes the form of an infinite weighted sum of independent χ² variables (it is asymptotically Gaussian under the alternative hypothesis HA that P ≠ Q). Two methods for empirically estimating the null distribution in a consistent way have been proposed: the bootstrap [10], and a method requiring an eigendecomposition of the kernel matrices computed on the merged samples from P and Q [7]. Unfortunately, both procedures are computationally demanding: the former costs O(n²), with a large constant (the MMD must be computed repeatedly over random assignments of the pooled data); the latter costs O(n³), but with a smaller constant, hence can in practice be faster than the bootstrap. Another approach is to approximate the null distribution by a member of a simpler parametric family (for instance, a Pearson curve approximation), however this has no consistency guarantees.

More recently, an O(n) unbiased estimate MMDl of the maximum mean discrepancy has been proposed [10, Section 6], which is simply a running average over independent pairs of samples from P and Q. While this has much greater variance than the U-statistic, it also has a simpler null distribution: being an average over i.i.d.
terms, the central limit theorem gives an asymptotically Normal distribution, under both H0 and HA. It is shown in [9] that this simple asymptotic distribution makes it easy to optimize the Hodges and Lehmann asymptotic relative efficiency [19] over the family of kernels that define the statistic: in other words, to choose the kernel which gives the lowest Type II error (probability of wrongly accepting H0) for a given Type I error (probability of wrongly rejecting H0). Kernel selection for the U-statistic is a much harder question due to the complex form of the null distribution, and remains an open problem.

It appears that MMDu and MMDl fall at two extremes of a spectrum: the former has the lowest variance of any n-sample estimator, and should be used in limited data regimes; the latter is the estimator requiring the least computation while still looking at each of the samples, and usually achieves better Type II error than MMDu at a given computational cost, albeit by looking at much more data (the "limited time, unlimited data" scenario). A major reason MMDl is faster is that its null distribution is straightforward to compute, since it is Gaussian and its variance can be calculated at the same cost as the test statistic. A reasonable next step would be to find a compromise between these two extremes: to construct a statistic with a lower variance than MMDl, while retaining an asymptotically Gaussian null distribution (hence remaining faster than tests based on MMDu). We study a family of such test statistics, where we split the data into blocks of size B, compute the quadratic-time MMDu on each block, and then average the resulting statistics. We call the resulting tests B-tests. As long as we choose the size B of blocks such that n/B → ∞, we are still guaranteed asymptotic Normality by the central limit theorem, and the null distribution can be computed at the same cost as the test statistic.
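As a concrete illustration, the block scheme just described can be sketched as follows. This is an illustrative re-implementation with a Gaussian kernel, not the authors' released code; the function names are ours, and the null variance is estimated empirically from the i.i.d. block statistics.

```python
import numpy as np
from scipy.stats import norm


def gaussian_kernel(A, C, sigma=1.0):
    """Gaussian kernel matrix between the rows of A and C."""
    d2 = ((A[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))


def mmd_u(X, Y, sigma=1.0):
    """Unbiased quadratic-time MMD_u on one block of paired samples."""
    B = X.shape[0]
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    # h(z_a, z_b) is summed over a != b only, so drop the diagonals.
    for K in (Kxx, Kyy, Kxy):
        np.fill_diagonal(K, 0.0)
    return (Kxx.sum() + Kyy.sum() - 2.0 * Kxy.sum()) / (B * (B - 1))


def b_test(X, Y, B, sigma=1.0):
    """Block-averaged MMD statistic and a one-sided Gaussian p-value.

    The null variance is estimated from the i.i.d. block statistics
    themselves, at the same cost as the statistic.
    """
    n = X.shape[0]
    m = n // B  # number of blocks
    stats = np.array([mmd_u(X[i * B:(i + 1) * B], Y[i * B:(i + 1) * B], sigma)
                      for i in range(m)])
    eta = stats.mean()
    se = stats.std(ddof=1) / np.sqrt(m)
    return eta, norm.sf(eta / se)
```

For example, with n = 500 samples per distribution one might take B = 22 ≈ √n, leaving roughly 22 block statistics for the central limit theorem.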
For a given sample size n, however, the power of the test can increase dramatically over the MMDl test, even for moderate block sizes B, making much better use of the available data with only a small increase in computation.

The block averaging scheme was originally proposed in [13], as an instance of a two-stage U-statistic, to be applied when the degree of degeneracy of the U-statistic is indeterminate. Differences with respect to our method are that Ho and Shieh compute the block statistics by sampling with replacement [13, (b) p. 863], and propose to obtain the variance of the test statistic via Monte Carlo, jackknife, or bootstrap techniques, whereas we use closed form expressions. Ho and Shieh further suggest an alternative two-stage U-statistic in the event that the degree of degeneracy is known; we return to this point in the discussion. While we confine ourselves to the MMD in this paper, we emphasize that the block approach applies to a much broader variety of test situations where the null distribution cannot easily be computed, including the energy distance and distance covariance [18, 2, 22] and Fisher statistic [12] in the case of two-sample testing, and the Hilbert-Schmidt Independence Criterion [8] and distance covariance [23] for independence testing. Finally, the kernel learning approach of [9] applies straightforwardly, allowing us to maximize test power over a given kernel family. Code is available at http://github.com/wojzaremba/btest.

2 Theory

In this section we describe the mathematical foundations of the B-test. We begin with a brief review of kernel methods, and of the maximum mean discrepancy. We then present our block-based average MMD statistic, and derive its distribution under the H0 (P = Q) and HA (P ≠ Q) hypotheses.
The central idea employed in the construction of the B-test is to generate a low variance MMD estimate by averaging multiple low variance kernel statistics computed over blocks of samples. We show simple sufficient conditions on the block size for consistency of the estimator. Furthermore, we analyze the properties of the finite sample estimate, and propose a consistent strategy for setting the block size as a function of the number of samples.

2.1 Definition and asymptotics of the block-MMD

Let Fk be an RKHS defined on a topological space X with reproducing kernel k, and P a Borel probability measure on X. The mean embedding of P in Fk, written µk(P) ∈ Fk, is defined such that E_{x∼P} f(x) = ⟨f, µk(P)⟩_{Fk} for all f ∈ Fk, and exists for all Borel probability measures when k is bounded and continuous [3, 10].

Figure 1: Empirical distributions under H0 and HA for different regimes of B for the music experiment (Section 3.2): (a) B = 2, the setting corresponding to the MMDl statistic [10]; (b) B = 250. In both plots, the number of samples is fixed at 500. As we vary B, we trade off the quality of the finite sample Gaussian approximation to the null distribution, as in Theorem 2.3, with the variances of the H0 and HA distributions, as outlined in Section 2.1. In (b) the distribution under H0 does not resemble a Gaussian (it does not pass a level 0.05 Kolmogorov-Smirnov (KS) normality test [16, 20]), and a Gaussian approximation results in a conservative test threshold (vertical green line). The remaining empirical distributions all pass a KS normality test.
The maximum mean discrepancy (MMD) between a Borel probability measure P and a second Borel probability measure Q is the squared RKHS distance between their respective mean embeddings,

ηk(P, Q) = ‖µk(P) − µk(Q)‖²_{Fk} = E_{xx'} k(x, x') + E_{yy'} k(y, y') − 2 E_{xy} k(x, y),   (1)

where x' denotes an independent copy of x [11]. Introducing the notation z = (x, y), we write

ηk(P, Q) = E_{zz'} h(z, z'),   h(z, z') = k(x, x') + k(y, y') − k(x, y') − k(x', y).   (2)

When the kernel k is characteristic, then ηk(P, Q) = 0 iff P = Q [21]. Clearly, the minimum variance unbiased estimate MMDu of ηk(P, Q) is a U-statistic.

By analogy with MMDu, we make use of averages of h(z, z') to construct our two-sample test. We denote by η̂k(i) the ith empirical estimate MMDu based on a subsample of size B, where 1 ≤ i ≤ n/B (for notational purposes, we will index samples as though they are presented in a random fixed order). More precisely,

η̂k(i) = (1 / (B(B − 1))) Σ_{a=(i−1)B+1}^{iB} Σ_{b=(i−1)B+1, b≠a}^{iB} h(za, zb).   (3)

The B-test statistic is an MMD estimate obtained by averaging the η̂k(i). Each η̂k(i) under H0 converges to an infinite sum of weighted χ² variables [7]. Although setting B = n would lead to the lowest variance estimate of the MMD, computing sound thresholds for a given p-value is expensive, involving repeated bootstrap sampling [5, 14], or computing the eigenvalues of a Gram matrix [7]. In contrast, we note that the η̂k(i), i = 1, ..., n/B, are i.i.d. variables, and averaging them allows us to apply the central limit theorem in order to estimate p-values from a normal distribution.
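Whether the Normal approximation is adequate for a given (n, B) can be checked empirically, in the spirit of the Kolmogorov-Smirnov normality checks of Figure 1: simulate the test statistic many times under H0 and compare its standardized distribution to N(0, 1). Below is a sketch under a toy H0 (P = Q = N(0, I)); the block statistic re-implements Eq. (3) with a Gaussian kernel and is our own illustrative code.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(1)


def block_mmd(X, Y, sigma=1.0):
    # Unbiased MMD_u on one block (Gaussian kernel), diagonals excluded.
    def K(A, C):
        d2 = ((A[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    B = len(X)
    Kxx, Kyy, Kxy = K(X, X), K(Y, Y), K(X, Y)
    for M in (Kxx, Kyy, Kxy):
        np.fill_diagonal(M, 0.0)
    return (Kxx.sum() + Kyy.sum() - 2.0 * Kxy.sum()) / (B * (B - 1))


def b_stat(n, B):
    # One draw of the B-test statistic under H0: P = Q = N(0, I_2).
    X, Y = rng.standard_normal((n, 2)), rng.standard_normal((n, 2))
    return np.mean([block_mmd(X[i * B:(i + 1) * B], Y[i * B:(i + 1) * B])
                    for i in range(n // B)])


draws = np.array([b_stat(n=200, B=10) for _ in range(300)])
standardized = (draws - draws.mean()) / draws.std(ddof=1)
ks_pvalue = kstest(standardized, "norm").pvalue  # small value: Normality rejected
```

A small KS p-value signals that n/B is too small for the Gaussian approximation at that block size, exactly the tradeoff discussed below.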
We denote the average of the η̂k(i) by η̂k,

η̂k = (B/n) Σ_{i=1}^{n/B} η̂k(i).   (4)

We would like to apply the central limit theorem to the variables η̂k(i), i = 1, ..., n/B. It remains for us to derive the distribution of η̂k under H0 and under HA. We rely on the result from [11, Theorem 8] for HA. According to our notation, for every i,

Theorem 2.1 Assume 0 < E(h²) < ∞. Then under HA, η̂k(i) converges in distribution to a Gaussian according to

B^{1/2} (η̂k(i) − MMD²) →D N(0, σu²),   (5)

where σu² = 4 ( E_z[(E_{z'} h(z, z'))²] − [E_{z,z'} h(z, z')]² ).

This in turn implies that

η̂k(i) →D N(MMD², σu² B⁻¹).   (6)

For an average of {η̂k(i)}, i = 1, ..., n/B, the central limit theorem implies that under HA,

η̂k →D N(MMD², σu² (B · n/B)⁻¹) = N(MMD², σu² n⁻¹).   (7)

This result shows that the distribution of η̂k under HA is asymptotically independent of the block size, B. Turning to the null hypothesis, [11, Theorem 8] additionally implies that under H0, for every i,

Theorem 2.2

B η̂k(i) →D Σ_{l=1}^{∞} λl [zl² − 2],   (8)

where zl ∼ N(0, 2) i.i.d., the λl are the solutions to the eigenvalue equation

∫_X k̄(x, x') ψl(x) dp(x) = λl ψl(x'),   (9)

and k̄(xi, xj) := k(xi, xj) − E_x k(xi, x) − E_x k(x, xj) + E_{x,x'} k(x, x') is the centered RKHS kernel.

As a consequence, under H0, η̂k(i) has expected variance 2 B⁻² Σ_{l=1}^{∞} λl². We will denote this variance by C B⁻². The central limit theorem implies that under H0,

η̂k →D N(0, C (B² · n/B)⁻¹) = N(0, C (nB)⁻¹).   (10)

The asymptotic distributions for η̂k under H0 and HA are Gaussian, and consequently it is easy to calculate the distribution quantiles and test thresholds. Asymptotically, it is always beneficial to increase B, as the distributions for η̂k under H0 and HA will be better separated. For consistency, it is sufficient to ensure that n/B → ∞.

A related strategy of averaging over data blocks to deal with large sample sizes has recently been developed in [15], with the goal of efficiently computing bootstrapped estimates of statistics of interest (e.g. quantiles or biases). Briefly, the approach splits the data (of size n) into s subsamples each of size B, computes an estimate of the n-fold bootstrap on each block, and averages these estimates. The difference with respect to our approach is that we use the asymptotic distribution of the average over block statistics to determine a threshold for a hypothesis test, whereas [15] is concerned with proving the consistency of a statistic obtained by averaging over bootstrap estimates on blocks.

2.2 Convergence of Moments

In this section, we analyze the convergence of the moments of the B-test statistic, and comment on potential sources of bias. The central limit theorem implies that the empirical mean of {η̂k(i)}, i = 1, ..., n/B, converges to E(η̂k(i)). Moreover, it states that the variance of {η̂k(i)}, i = 1, ..., n/B, converges to E(η̂k(i)²) − [E(η̂k(i))]². Finally, all remaining moments tend to zero, where the rate of convergence for the jth moment is of the order (n/B)^{−(j+1)/2} [1].
This indicates that the skewness dominates the difference of the distribution from a Gaussian.

Under both H0 and HA, thresholds computed from normal distribution tables are asymptotically unbiased. For finite sample sizes, however, the bias under H0 can be more severe. From Equation (8) we have that under H0, the summands η̂k(i) converge in distribution to infinite weighted sums of χ² distributions. Every unweighted term of this infinite sum has distribution N(0, 2)², which has finite skewness equal to 8. The skewness for the entire sum is finite and positive,

Σ_{l=1}^{∞} 8 λl³,   (11)

as λl ≥ 0 for all l due to the positive definiteness of the kernel k. The skew of the mean of the η̂k(i) converges to 0, but remains positively biased at finite sample sizes. Test thresholds obtained from the standard Normal table may therefore be inaccurate at smaller sample sizes, as they do not account for this skew. In our experiments, this bias caused the tests to be overly conservative, with lower Type I error than the design level required (Figures 2 and 5).

2.3 Finite Sample Case

In the finite sample case, we apply the Berry-Esseen theorem, which gives conservative bounds on the ℓ∞ convergence of a series of finite sample random variables to a Gaussian distribution [4].

Theorem 2.3 Let X1, X2, ..., Xn be i.i.d. variables with E(X1) = 0, E(X1²) = σ² > 0, and E(|X1|³) = ρ < ∞. Let Fn be the cumulative distribution of Σ_{i=1}^{n} Xi / (√n σ), and let Φ denote the standard normal distribution. Then for every x,

|Fn(x) − Φ(x)| ≤ C ρ σ⁻³ n^{−1/2},   (12)

where C < 1.

This result allows us to ensure fast point-wise convergence of the B-test.
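The conservative behaviour described above can be reproduced in miniature: average m i.i.d. positively skewed "block statistics" (here centred χ²₁ variables standing in for a single term of the weighted sum in Equation (8), a deliberate simplification rather than the actual MMD null), studentize by their empirical standard error as the B-test does, and count rejections at a Normal threshold.

```python
import numpy as np

rng = np.random.default_rng(0)


def rejection_rate(m, n_trials=10000, z_alpha=1.645):
    """Empirical Type I error of a one-sided z-test at nominal level 5%,
    when the m averaged variables are right-skewed rather than Gaussian."""
    s = rng.chisquare(1, size=(n_trials, m)) - 1.0  # centred, skewed null
    mean = s.mean(axis=1)
    se = s.std(axis=1, ddof=1) / np.sqrt(m)
    return float((mean / se > z_alpha).mean())
```

With few averaged statistics (small m, i.e. large B at fixed n) the empirical rate falls below the nominal 5%, and it approaches 5% as m grows, mirroring the trend in Figures 2 and 5.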
We have that ρ(η̂k(i)) = O(1), i.e., it is dependent only on the underlying distributions of the samples and not on the sample size. The number of i.i.d. block statistics is n B⁻¹. Based on Theorem 2.3, with σ² = O(B⁻¹) under HA, the point-wise error can be upper bounded by O(1) / (O(B⁻¹)^{3/2} √(n/B)) = O(B²/√n). Under H0, where σ² = O(B⁻²), the error can be bounded by O(1) / (O(B⁻²)^{3/2} √(n/B)) = O(B^{3.5}/√n).

While the asymptotic results indicate that convergence to an optimal predictor is fastest for larger B, the finite sample results support decreasing the size of B in order to have a sufficient number of samples for application of the central limit theorem. As long as B → ∞ and n/B → ∞, the assumptions of the B-test are fulfilled.

By varying B, we make a fundamental tradeoff in the construction of our two sample test. When B is small, we have many samples, hence the null distribution is close to the asymptotic limit provided by the central limit theorem, and the Type I error is estimated accurately. The disadvantage of a small B is a lower test power for a given sample size. Conversely, if we increase B, we will have a lower variance empirical distribution for H0, hence higher test power, but we may have a poor estimate of the number of Type I errors (Figure 1). A sensible family of heuristics therefore is to set

B = [n^γ]   (13)

for some 0 < γ < 1, where we round to the nearest integer. In this setting the number of blocks available for application of the central limit theorem will be [n^{1−γ}]. For a given γ, the computational complexity of the B-test is O(n^{1+γ}). We note that any value of γ ∈ (0, 1) yields a consistent estimator. We have chosen γ = 1/2 in the experimental results section, with resulting complexity O(n^{1.5}): we emphasize that this is a heuristic, and just one choice that fulfils our assumptions.
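The heuristic of Equation (13) and the quantities it controls are straightforward to compute; a small sketch (the helper name is ours):

```python
def heuristic_block_size(n, gamma=0.5):
    """B = [n^gamma], rounded to the nearest integer (Equation (13))."""
    return max(2, round(n ** gamma))


n = 2000
B = heuristic_block_size(n)        # ~ sqrt(n) for gamma = 1/2
num_blocks = n // B                # ~ n^(1-gamma) statistics for the CLT
kernel_evals = num_blocks * B * B  # ~ n^(1+gamma) kernel evaluations
```

For n = 2000 and γ = 1/2 this gives B = 45, roughly 44 blocks, and about n^{1.5} ≈ 9 × 10⁴ kernel evaluations, far fewer than the 4 × 10⁶ required by the full quadratic-time statistic.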
3 Experiments

We have conducted experiments on challenging synthetic and real datasets in order to empirically measure (i) sample complexity, (ii) computation time, and (iii) Type I / Type II errors. We evaluate B-test performance in comparison to the MMDl and MMDu estimators, where for the latter we compare across different strategies for null distribution quantile estimation.

Method | Kernel parameters | Additional parameters | Minimum number of samples | Computation time (s) | Consistent
B-test | σ = 1 | B = 2 | 26400 | 0.0012 | ✓
B-test | σ = 1 | B = 8 | 3850 | 0.0039 | ✓
B-test | σ = 1 | B = √n | 886 | 0.0572 | ✓
B-test | σ = median | B = 2 | > 60000 | – | ✓
B-test | σ = median | B = 8 | 37000 | 0.0700 | ✓
B-test | σ = median | B = √n | 5400 | 0.1295 | ✓
B-test | multiple kernels | any B (here B = √(n/2)) | 1700 | 0.8332 | ✓
Pearson curves | σ = 1 | B = n | 186 | 387.4649 | ×
Gamma approximation | σ = 1 | B = n | 183 | 0.2667 | ×
Gram matrix spectrum | σ = 1 | B = n | 186 | 407.3447 | ✓
Bootstrap | σ = 1 | B = n | 190 | 129.4094 | ✓
Pearson curves | σ = median | B = n | > 60000, or 2h per iteration | – | ×
Gamma approximation | σ = median | B = n | > 60000, or 2h per iteration | – | ×
Gram matrix spectrum | σ = median | B = n | timeout | – | ✓
Bootstrap | σ = median | B = n | timeout | – | ✓

Table 1: Sample complexity for tests on the distributions described in Figure 3. The fourth column indicates the minimum number of samples necessary to achieve Type I and Type II errors of 5%. The fifth column is the computation time required for 2000 samples, and is not presented for settings that have unsatisfactory sample complexity.

Figure 2: Type I errors on the distributions shown in Figure 3 for α = 5%: (a) MMD, single kernel, σ = 1, (b) MMD, single kernel, σ set to the median pairwise distance, and (c) MMD, non-negative linear combination of multiple kernels. The experiment was repeated 30000 times.
Error bars are not visible at this scale.

3.1 Synthetic data

Following previous work on kernel hypothesis testing [9], our synthetic distributions are 5 × 5 grids of 2D Gaussians. We specify two distributions, P and Q. For distribution P each Gaussian has identity covariance matrix, while for distribution Q the covariance is non-spherical. Samples drawn from P and Q are presented in Figure 3. These distributions have proved to be very challenging for existing non-parametric two-sample tests [9].

We employed three different kernel selection strategies in the hypothesis test. First, we used a Gaussian kernel with σ = 1, which approximately matches the scale of the variance of each Gaussian in mixture P. While this is a somewhat arbitrary default choice, we selected it as it performs well in practice (given the lengthscale of the data), and we treat it as a baseline. Next, we set σ equal to the median pairwise distance over the training data, which is a standard way to choose the Gaussian kernel bandwidth [17], although it is likewise arbitrary in this context. Finally, we applied a kernel learning strategy, in which the kernel was optimized to maximize the test power for the alternative P ≠ Q [9]. This approach returned a non-negative linear combination of base kernels, where half the data were used in learning the kernel weights (these data were excluded from the testing phase). The base kernels in our experiments were chosen to be Gaussian, with bandwidths in the set σ ∈ {2⁻¹⁵, 2⁻¹⁴, ..., 2¹⁰}. Testing was conducted using the remaining half of the data.

Figure 3: Synthetic data distributions P and Q.
Samples belonging to these classes are difficult to distinguish ((a) distribution P; (b) distribution Q).

Figure 4: Synthetic experiment: number of Type II errors vs. B, given a fixed probability α of Type I errors. As B grows, the Type II error drops quickly when the kernel is appropriately chosen. The kernel selection method is described in [9], and closely approximates the baseline performance of the well-informed user choice of σ = 1.

For comparison with the quadratic time U-statistic MMDu [7, 10], we evaluated four null distribution estimates: (i) Pearson curves, (ii) gamma approximation, (iii) Gram matrix spectrum, and (iv) bootstrap. For the methods using Pearson curves and the Gram matrix spectrum, we drew 500 samples from the null distribution estimates to obtain the 1 − α quantiles, for a test of level α. For the bootstrap, we fixed the number of shuffles to 1000. We note that Pearson curves and the gamma approximation are not statistically consistent. We considered only the settings with σ = 1 and σ set to the median pairwise distance, as kernel selection is not yet solved for tests using MMDu [9].

In the first experiment we set the Type I error to be 5%, and we recorded the Type II error. We conducted these experiments on 2000 samples over 1000 repetitions, with varying block size B. Figure 4 presents results for different kernel choice strategies, as a function of B. The median heuristic performs extremely poorly in this experiment.
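For reference, the median heuristic used above can be sketched as follows (here computed on the pooled sample, one common variant; the function name is ours):

```python
import numpy as np


def median_heuristic_sigma(X, Y):
    """Gaussian kernel bandwidth set to the median pairwise distance
    over the pooled sample (one common form of the heuristic)."""
    Z = np.vstack([X, Y])
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    upper = d2[np.triu_indices_from(d2, k=1)]  # distinct pairs only
    return float(np.sqrt(np.median(upper)))
```

Note that the result scales with the overall spread of the data, which is exactly why it can miss the (possibly much smaller) lengthscale on which P and Q differ.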
As discussed in [9, Section 5], the reason for this failure is that the lengthscale of the difference between the distributions P and Q differs from the lengthscale of the main data variation as captured by the median, which gives too broad a kernel for the data.

In the second experiment, our aim was to compare the empirical sample complexity of the various methods. We again fixed the same Type I error for all methods, but this time we also fixed a Type II error of 5%, increasing the number of samples until the latter error rate was achieved. Column four of Table 1 shows the number of samples required in each setting to achieve these error rates. We additionally compared the computational efficiency of the various methods. The computation time for each method with a fixed sample size of 2000 is presented in column five of Table 1. All experiments were run on a single 2.4 GHz core.

Finally, we evaluated the empirical Type I error for α = 5% and increasing B. Figure 2 displays the empirical Type I error, where we note the location of the γ = 0.5 heuristic in Equation (13). For the user-chosen kernel (σ = 1, Figure 2(a)), the number of Type I errors closely matches the targeted test level. When the median heuristic is used, however, the test is overly conservative, and makes fewer Type I errors than required (Figure 2(b)). This indicates that for this choice of σ, we are not in the asymptotic regime, and our Gaussian null distribution approximation is inaccurate. Kernel selection via the strategy of [9] alleviates this problem (Figure 2(c)). This setting coincides with a block size substantially larger than 2 (MMDl), and therefore achieves lower Type II errors while retaining the targeted Type I error.

3.2 Musical experiments

In this set of experiments, two amplitude modulated Rammstein songs were compared (Sehnsucht vs. Engel, from the album Sehnsucht).
Following the experimental setting in [9, Section 5], samples from P and Q were extracts from AM signals of time duration 8.3 × 10⁻³ seconds in the original audio. Feature extraction was identical to [9], except that the amplitude scaling parameter was set to 0.3 instead of 0.5. As the feature vector had size 1000, we set the block size B = ⌈√1000⌉ = 32. Table 2 summarizes the empirical Type I and Type II errors over 1000 repetitions, and the average computation times. Figure 5 shows the average number of Type I errors as a function of B: in this case, all kernel selection strategies result in conservative tests (lower Type I error than required), indicating that more samples are needed to reach the asymptotic regime. Figure 1 shows the empirical H0 and HA distributions for different B.

4 Discussion

We have presented experimental results both on a difficult synthetic problem, and on real-world data from amplitude modulated audio recordings. The results show that the B-test has a much better

Method | Kernel parameters | Additional parameters | Type I error | Type II error | Computational time (s)
B-test | σ = 1 | B = 2 | 0.038 | 0.927 | 0.039
B-test | σ = 1 | B = √n | 0.006 | 0.597 | 1.276
B-test | σ = median | B = 2 | 0.043 | 0.786 | 0.047
B-test | σ = median | B = √n | 0.026 | 0.867 | 1.259
B-test | multiple kernels | B = 2 | 0.0481 | 0.012 | 0.607
B-test | multiple kernels | B = √(n/2) | 0.025 | 0 | 18.285
Gram matrix spectrum | σ = 1 | B = 2000 | 0 | 0 | 160.1356
Bootstrap | σ = 1 | B = 2000 | 0.01 | 0 | 121.2570
Gram matrix spectrum | σ = median | B = 2000 | 0 | 0 | 286.8649
Bootstrap | σ = median | B = 2000 | 0.01 | 0 | 122.8297

Table 2: A comparison of consistent tests on the music experiment described in
Section 3.2. Here computation time is reported for the test achieving the stated error rates.

Figure 5: Empirical Type I error rate for α = 5% on the music data (Section 3.2). (a) A single kernel test with σ = 1, (b) a single kernel test with σ = median, and (c) multiple kernels. Error bars are not visible at this scale. The results broadly follow the trend visible from the synthetic experiments.

sample complexity than MMDl over all tested kernel selection strategies. Moreover, it is an order of magnitude faster than any test that consistently estimates the null distribution for MMDu (i.e., the Gram matrix eigenspectrum and bootstrap estimates): these estimates are impractical at large sample sizes, due to their computational complexity. Additionally, the B-test remains statistically consistent, with the best convergence rates achieved for large B. The B-test combines the best features of MMDl and MMDu based two-sample tests: consistency, high statistical efficiency, and high computational efficiency.

A number of further interesting experimental trends may be seen in these results. First, we have observed that the empirical Type I error rate is often conservative, and is less than the 5% targeted by the threshold based on a Gaussian null distribution assumption (Figures 2 and 5). In spite of this conservatism, the Type II performance remains strong (Tables 1 and 2), as the gains in statistical power of the B-tests improve the testing performance (cf. Figure 1). Equation (7) implies that the size of B does not influence the asymptotic variance under HA; however, we observe in Figure 1 that the empirical variance of HA drops with larger B. This is because, for these P and Q and small B, the null and alternative distributions have considerable overlap.
Hence, given the distributions are effectively indistinguishable at these sample sizes n, the variance of the alternative distribution as a function of B behaves more like that of H0 (cf. Equation (10)). This effect will vanish as n grows.
Finally, [13] propose an alternative approach to U-statistic based testing when the degree of degeneracy is known: a new U-statistic (the TU-statistic) is written in terms of products of centred U-statistics computed on the individual blocks, and a test is formulated using this TU-statistic. Ho and Shieh show that a TU-statistic based test can be asymptotically more powerful than a test using a single U-statistic on the whole sample, when the latter is degenerate under H0 and nondegenerate under HA. It is of interest to apply this technique to MMD-based two-sample testing.

Acknowledgments We thank Mladen Kolar for helpful discussions. This work is partially funded by ERC Grant 259112, and by the Royal Academy of Engineering through the Newton Alumni Scheme.

References
[1] Bengt von Bahr. On the convergence of moments in the central limit theorem. The Annals of Mathematical Statistics, 36(3):808–818, 1965.
[2] L. Baringhaus and C. Franz. On a new multivariate two-sample test. J. Multivariate Anal., 88:190–206, 2004.
[3] A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer, 2004.
[4] Andrew C. Berry. The accuracy of the Gaussian approximation to the sum of independent variates. Transactions of the American Mathematical Society, 49(1):122–136, 1941.
[5] B.
Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, 1993.
[6] M. Fromont, B. Laurent, M. Lerasle, and P. Reynaud-Bouret. Kernels based tests with non-asymptotic bootstrap approaches for two-sample problems. In COLT, 2012.
[7] A. Gretton, K. Fukumizu, Z. Harchaoui, and B. K. Sriperumbudur. A fast, consistent kernel two-sample test. In Advances in Neural Information Processing Systems 22, pages 673–681, 2009.
[8] A. Gretton, K. Fukumizu, C.-H. Teo, L. Song, B. Schölkopf, and A. J. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems 20, pages 585–592, Cambridge, MA, 2008. MIT Press.
[9] A. Gretton, B. Sriperumbudur, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, and K. Fukumizu. Optimal kernel choice for large-scale two-sample tests. In Advances in Neural Information Processing Systems 25, pages 1214–1222, 2012.
[10] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. J. Mach. Learn. Res., 13:723–773, March 2012.
[11] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander J. Smola. A kernel method for the two-sample-problem. In NIPS, pages 513–520, 2006.
[12] Z. Harchaoui, F. Bach, and E. Moulines. Testing for homogeneity with kernel Fisher discriminant analysis. In NIPS, pages 609–616. MIT Press, Cambridge, MA, 2008.
[13] H.-C. Ho and G. Shieh. Two-stage U-statistics for hypothesis testing. Scandinavian Journal of Statistics, 33(4):861–873, 2006.
[14] Norman Lloyd Johnson, Samuel Kotz, and Narayanaswamy Balakrishnan. Continuous Univariate Distributions. Distributions in Statistics. Wiley, 2nd edition, 1994.
[15] A. Kleiner, A. Talwalkar, P. Sarkar, and M. I. Jordan.
A scalable bootstrap for massive data. Journal of the Royal Statistical Society, Series B, in press.
[16] Andrey N. Kolmogorov. Sulla determinazione empirica di una legge di distribuzione. Giornale dell'Istituto Italiano degli Attuari, 4(1):83–91, 1933.
[17] B. Schölkopf. Support Vector Learning. Oldenbourg, München, Germany, 1997.
[18] D. Sejdinovic, A. Gretton, B. Sriperumbudur, and K. Fukumizu. Hypothesis testing using pairwise distances and associated kernels. In ICML, 2012.
[19] R. Serfling. Approximation Theorems of Mathematical Statistics. Wiley, New York, 1980.
[20] Nickolay Smirnov. Table for estimating the goodness of fit of empirical distributions. The Annals of Mathematical Statistics, 19(2):279–281, 1948.
[21] B. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Schölkopf. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561, 2010.
[22] G. Székely and M. Rizzo. Testing for equal distributions in high dimension. InterStat, (5), November 2004.
[23] G. Székely, M. Rizzo, and N. Bakirov. Measuring and testing dependence by correlation of distances. Ann. Stat., 35(6):2769–2794, 2007.
[24] M. Yamada, T. Suzuki, T. Kanamori, H. Hachiya, and M. Sugiyama. Relative density-ratio estimation for robust distribution comparison. Neural Computation, 25(5):1324–1370, 2013.