{"title": "A More Powerful Two-Sample Test in High Dimensions using Random Projection", "book": "Advances in Neural Information Processing Systems", "page_first": 1206, "page_last": 1214, "abstract": "We consider the hypothesis testing problem of detecting a shift between the means of two multivariate normal distributions in the high-dimensional setting, allowing for the data dimension p to exceed the sample size n. Our contribution is a new test statistic for the two-sample test of means that integrates a random projection with the classical Hotelling T squared statistic. Working within a high- dimensional framework that allows (p,n) to tend to infinity, we first derive an asymptotic power function for our test, and then provide sufficient conditions for it to achieve greater power than other state-of-the-art tests. Using ROC curves generated from simulated data, we demonstrate superior performance against competing tests in the parameter regimes anticipated by our theoretical results. Lastly, we illustrate an advantage of our procedure with comparisons on a high-dimensional gene expression dataset involving the discrimination of different types of cancer.", "full_text": "A More Powerful Two-Sample Test in High\n\nDimensions using Random Projection\n\nMiles E. Lopes1\n\nLaurent Jacob1\n\nMartin J. Wainwright1,2\n\nDepartments of Statistics1 and EECS2\n\nUniversity of California, Berkeley\n\n{mlopes,laurent,wainwrig}@stat.berkeley.edu\n\nBerkeley, CA 94720-3860\n\nAbstract\n\nWe consider the hypothesis testing problem of detecting a shift between the means\nof two multivariate normal distributions in the high-dimensional setting, allowing\nfor the data dimension p to exceed the sample size n. Our contribution is a new test\nstatistic for the two-sample test of means that integrates a random projection with\nthe classical Hotelling T 2 statistic. 
Working within a high-dimensional framework that allows (p, n) → ∞, we first derive an asymptotic power function for our test, and then provide sufficient conditions for it to achieve greater power than other state-of-the-art tests. Using ROC curves generated from simulated data, we demonstrate superior performance against competing tests in the parameter regimes anticipated by our theoretical results. Lastly, we illustrate an advantage of our procedure with comparisons on a high-dimensional gene expression dataset involving the discrimination of different types of cancer.

1 Introduction

Two-sample hypothesis tests are concerned with the question of whether two samples of data are generated from the same distribution. Such tests are among the most widely used inference procedures in treatment-control studies in science and engineering [1]. Application domains such as molecular biology and fMRI have stimulated considerable interest in detecting shifts between distributions in the high-dimensional setting, where the two samples of data {X1, . . . , Xn1} and {Y1, . . . , Yn2} are subsets of R^p, and n1, n2 ≪ p [e.g., 2–5]. In transcriptomics, for instance, p gene expression measures on the order of hundreds or thousands may be used to investigate differences between two biological conditions, and it is often difficult to obtain sample sizes n1 and n2 larger than several dozen in each condition. In high-dimensional situations such as these, classical methods may be ineffective, or not applicable at all. Accordingly, there has been growing interest in developing testing procedures that are better suited to deal with the effects of dimension [e.g., 6–10].

A fundamental instance of the general two-sample problem is the two-sample test of means with Gaussian data. In this case, two independent sets of samples {X1, . . . , Xn1} and {Y1, . . . , Yn2} are generated in an i.i.d.
manner from p-dimensional multivariate normal distributions N(μ1, Σ) and N(μ2, Σ) respectively, where the mean vectors μ1, μ2 ∈ R^p and the covariance matrix Σ ≻ 0 are all fixed and unknown. The hypothesis testing problem of interest is

    H0 : μ1 = μ2  versus  H1 : μ1 ≠ μ2.    (1)

The most well-known test statistic for this problem is the Hotelling T² statistic, defined by

    T² := (n1 n2 / (n1 + n2)) (X̄ − Ȳ)ᵀ Σ̂⁻¹ (X̄ − Ȳ),    (2)

where X̄ := (1/n1) Σ_{j=1}^{n1} Xj and Ȳ := (1/n2) Σ_{j=1}^{n2} Yj are the sample means, and Σ̂ is the pooled sample covariance matrix, given by Σ̂ := (1/n) Σ_{j=1}^{n1} (Xj − X̄)(Xj − X̄)ᵀ + (1/n) Σ_{j=1}^{n2} (Yj − Ȳ)(Yj − Ȳ)ᵀ, with n := n1 + n2 − 2.

When p > n, the matrix Σ̂ is singular, and the Hotelling test is not well-defined. Even when p ≤ n, the Hotelling test is known to perform poorly if p is nearly as large as n. This behavior was demonstrated in a seminal paper of Bai and Saranadasa [6] (or BS for short), who studied the performance of the Hotelling test under (p, n) → ∞ with p/n → 1 − ε, and showed that the asymptotic power of the test suffers for small values of ε > 0. In subsequent years, a number of improvements on the Hotelling test in the high-dimensional setting have been proposed [e.g., 6–9].

In this paper, we propose a new test statistic for the two-sample test of means with multivariate normal data, applicable when p ≥ n/2. We provide an explicit asymptotic power function for our test with (p, n) → ∞, and show that under certain conditions, our test has greater asymptotic power than other state-of-the-art tests.
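As a point of reference, the classical statistic (2) can be computed in a few lines. The sketch below is our own illustration (not code from the paper); it also makes the p > n failure mode tangible, since the pooled covariance then has rank at most n and cannot be inverted.

```python
import numpy as np

def hotelling_t2(X, Y):
    """Classical two-sample Hotelling T^2 statistic, as in equation (2).

    X is n1 x p and Y is n2 x p. Requires p <= n = n1 + n2 - 2; otherwise
    the pooled sample covariance is singular and T^2 is undefined.
    """
    n1, p = X.shape
    n2 = Y.shape[0]
    n = n1 + n2 - 2
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    S = (Xc.T @ Xc + Yc.T @ Yc) / n          # pooled sample covariance
    diff = X.mean(axis=0) - Y.mean(axis=0)
    return (n1 * n2 / (n1 + n2)) * diff @ np.linalg.solve(S, diff)
```

For p > n the same construction yields a pooled covariance of rank at most n, which is (at least numerically) singular, so the statistic in (2) breaks down; this is the starting point for the high-dimensional substitutes discussed later in the paper.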
These comparison results are valid with p/n tending to a positive constant or infinity. In addition to its advantage in terms of asymptotic power, our procedure specifies exact level-α critical values for multivariate normal data, whereas competing procedures offer only approximate level-α critical values. Furthermore, our experiments in Section 4 suggest that the critical values of our test may also be more robust than those of competing tests. Lastly, the computational cost of our procedure is modest in the n < p setting, being of order O(n²p).

The remainder of this paper is organized as follows. In Section 2, we provide background on hypothesis testing and describe our testing procedure. Section 3 is devoted to a number of theoretical results about its performance. Theorem 1 in Section 3.1 provides an asymptotic power function, and Theorems 2 and 3 in Sections 3.3 and 3.4 give sufficient conditions for achieving greater power than state-of-the-art tests in the sense of asymptotic relative efficiency. In Section 4 we provide performance comparisons with ROC curves on synthetic data against a broader collection of methods, including some recent kernel-based and non-parametric approaches such as MMD [11], KFDA [12], and TreeRank [10]. Lastly, we study a high-dimensional gene expression dataset involving the discrimination of different cancer types, demonstrating that our test's false positive rate is reliable in practice. We refer the reader to the preprint [13] for proofs of our theoretical results.

Notation. Let δ := μ1 − μ2 denote the shift vector between the distributions N(μ1, Σ) and N(μ2, Σ), and define the ordered pair of parameters θ := (δ, Σ). Let z_{1−α} denote the 1 − α quantile of the standard normal distribution, and let Φ be its cumulative distribution function.
If A is a matrix in R^{p×p}, let |||A|||_2 denote its spectral norm (maximum singular value), and define the Frobenius norm |||A|||_F := (Σ_{i,j} A_{ij}²)^{1/2}. When all the eigenvalues of A are real, we denote them by λ_min(A) = λ_p(A) ≤ · · · ≤ λ_1(A) = λ_max(A). For a positive-definite covariance matrix Σ, let D_σ := diag(Σ), and define the associated correlation matrix R := D_σ^{−1/2} Σ D_σ^{−1/2}. We use the notation f(n) ≲ g(n) if there is some absolute constant c such that the inequality f(n) ≤ c g(n) holds for all large n. If both f(n) ≲ g(n) and g(n) ≲ f(n) hold, then we write f(n) ≍ g(n). The notation f(n) = o(g(n)) means f(n)/g(n) → 0 as n → ∞.

2 Background and random projection method

For the remainder of the paper, we retain the set-up for the two-sample test of means (1) with Gaussian data, assuming throughout that p ≥ n/2, and n = n1 + n2 − 2.

Review of hypothesis testing terminology. The primary focus of our results will be on the comparison of power between test statistics, and here we give precise meaning to this notion. When testing a null hypothesis H0 versus an alternative hypothesis H1, a procedure based on a test statistic T specifies a critical value, such that H0 is rejected if T exceeds that critical value, and H0 is accepted otherwise. The chosen critical value fixes a trade-off between the risk of rejecting H0 when H0 actually holds, and the risk of accepting H0 when H1 holds. The former error is referred to as a type I error and the latter as a type II error. A test is said to have level α if the probability of committing a type I error is at most α. Finally, at a given level α, the power of a test is the probability of rejecting H0 under H1, i.e., 1 minus the probability of a type II error.
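As a toy illustration of these definitions (ours, not the paper's), the following Monte Carlo experiment estimates the level and the power of a one-dimensional two-sample z-test with known unit variance; the estimated level should sit near the nominal α = 0.05.

```python
import numpy as np
from scipy.stats import norm

def z_test_reject(x, y, alpha=0.05):
    """Two-sample z-test (known unit variance) for H0: mu1 == mu2.

    Rejects when the standardized mean difference exceeds the
    two-sided level-alpha critical value.
    """
    n1, n2 = len(x), len(y)
    z = (x.mean() - y.mean()) / np.sqrt(1.0 / n1 + 1.0 / n2)
    return abs(z) > norm.ppf(1 - alpha / 2)

rng = np.random.default_rng(0)
trials = 20000
# Estimated level: rejection frequency when H0 holds (should be close to 0.05).
level = np.mean([z_test_reject(rng.normal(0, 1, 50), rng.normal(0, 1, 50))
                 for _ in range(trials)])
# Estimated power: rejection frequency under a mean shift of 0.5.
power = np.mean([z_test_reject(rng.normal(0, 1, 50), rng.normal(0.5, 1, 50))
                 for _ in range(trials)])
```

Here the shift and sample sizes are arbitrary choices for illustration; the point is only that "level" bounds the type I error rate while "power" is the rejection probability under H1.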
When evaluating testing procedures at a given level α, we seek to identify the one with the greatest power.

Past work. The Hotelling T² statistic (2) discriminates between the hypotheses H0 and H1 by providing an estimate of the "statistical distance" separating the distributions N(μ1, Σ) and N(μ2, Σ). More specifically, the Hotelling statistic is essentially an estimate of the Kullback-Leibler (KL) divergence D_KL(N(μ1, Σ) ‖ N(μ2, Σ)) = (1/2) δᵀΣ⁻¹δ, where δ := μ1 − μ2. Due to the fact that the pooled sample covariance matrix Σ̂ in the definition of T² is not invertible when p > n, several recent procedures have offered substitutes for the Hotelling statistic in the high-dimensional setting: Bai and Saranadasa [6], Srivastava and Du [7, 8], Chen and Qin [9], hereafter BS, SD and CQ respectively. Up to now, the route toward circumventing this difficulty has been to form an estimate of Σ that is diagonal, and hence easily invertible. We shall see later that this limited use of covariance structure sacrifices power when the data exhibit non-trivial correlation. In this regard, our procedure is motivated by the idea that covariance structure may be used more effectively by testing with projected samples in a space of lower dimension.

Intuition for random projection. To provide some further intuition for our method, it is possible to consider the problem (1) in terms of a competition between the dimension p, and the statistical distance separating H0 and H1. On one hand, the accumulation of variance from a large number of variables makes it difficult to discriminate between the hypotheses, and thus, it is desirable to reduce the dimension of the data.
On the other hand, most methods for reducing dimension will also bring H0 and H1 "closer together," making them harder to distinguish. Mindful of the fact that the Hotelling test measures the separation of H0 and H1 in terms of δᵀΣ⁻¹δ, we see that the statistical distance is driven by the Euclidean length of δ. Consequently, we seek to transform the data in such a way that the dimension is reduced, while the length of the shift δ is mostly preserved upon passing to the transformed distributions. From this geometric point of view, it is natural to exploit the fact that random projections can simultaneously reduce dimension and approximately preserve lengths with high probability [14]. The use of projection-based test statistics has been considered previously in Jacob et al. [15], Clémençon et al. [10], and Cuesta-Albertos et al. [16].

At a high level, our method can be viewed as a two step procedure. First, a single random projection is drawn, and is used to map the samples from the high-dimensional space R^p to a low-dimensional space¹ R^k, with k := ⌊n/2⌋. Second, the Hotelling T² test is applied to a new hypothesis testing problem, H0,proj versus H1,proj, in the projected space. A decision is then pulled back to the original problem by simply rejecting H0 whenever the Hotelling test rejects H0,proj.

Formal testing procedure. Let Pᵀ_k ∈ R^{k×p} denote a random projection with i.i.d. N(0, 1) entries, drawn independently of the data, where k = ⌊n/2⌋. Conditioning on the drawn matrix Pᵀ_k, the projected samples {Pᵀ_k X1, . . . , Pᵀ_k Xn1} and {Pᵀ_k Y1, . . . , Pᵀ_k Yn2} are distributed i.i.d. according to N(Pᵀ_k μi, Pᵀ_k Σ P_k), with i = 1, 2, respectively. Since n ≥ k, the projected data satisfy the usual conditions [17, p. 211] for applying the Hotelling T² procedure to the following new two-sample problem in the projected space R^k:

    H0,proj : Pᵀ_k μ1 = Pᵀ_k μ2  versus  H1,proj : Pᵀ_k μ1 ≠ Pᵀ_k μ2.    (3)

For this projected problem, the Hotelling test statistic takes the form²

    T²_k := (n1 n2 / (n1 + n2)) (X̄ − Ȳ)ᵀ P_k (Pᵀ_k Σ̂ P_k)⁻¹ Pᵀ_k (X̄ − Ȳ),

where X̄, Ȳ, and Σ̂ are as defined in Section 1. Lastly, define the critical value t_α := (k n / (n − k + 1)) F*_{k, n−k+1}(α), where F*_{k, n−k+1}(α) is the upper α quantile of the F_{k, n−k+1} distribution [17]. It is a basic fact about the classical Hotelling test that rejecting H0,proj when T²_k ≥ t_α is a level-α test for the projected problem (3) (e.g., see Muirhead [17, p. 217]). Inspection of the formula for T²_k shows that its distribution is the same under both H0 and H0,proj. Therefore, rejecting the original H0 when T²_k ≥ t_α is also a level-α test for the original problem (1). Likewise, we define this as the condition for rejecting H0 at level α in our procedure for (1). We summarize our procedure below.

Projected Hotelling test at level α for problem (1).    (★)
1. Generate a single random matrix Pᵀ_k with i.i.d. N(0, 1) entries.
2. Compute T²_k, using Pᵀ_k and the two sets of samples.
3. If T²_k ≥ t_α, reject H0; otherwise accept H0.

¹The choice of projected dimension k = ⌊n/2⌋ is explained in the preprint [13].
²Note that Pᵀ_k Σ̂ P_k is invertible with probability 1 when Pᵀ_k has i.i.d. N(0, 1) entries.

3 Main results and their consequences

This section is devoted to the statement and discussion of our main theoretical results, including a characterization of the asymptotic power function of our test (Theorem 1), and comparisons of asymptotic relative efficiency with state-of-the-art tests proposed in past work (Theorems 2 and 3).

3.1 Asymptotic power function

As is standard in high-dimensional asymptotics, we will consider a sequence of hypothesis testing problems indexed by n, allowing the dimension p, mean vectors μ1 and μ2 and covariance matrix Σ to implicitly vary as functions of n, with n → ∞. We also make another type of asymptotic assumption, known as a local alternative [18, p. 193], which is commonplace in hypothesis testing. The idea lying behind a local alternative assumption is that if the difficulty of discriminating between H0 and H1 is "held fixed" with respect to n, then it is often the case that most testing procedures have power tending to 1 under H1 as n → ∞. In such a situation, it is not possible to tell if one test has greater asymptotic power than another. Consequently, it is standard to derive asymptotic power results under the extra condition that H0 and H1 become harder to distinguish as n grows. This theoretical device aids in identifying the conditions under which one test is more powerful than another. The following local alternative (A1), and balancing assumption (A2), are similar to those used in previous works [6–9] on problem (1).
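Procedure (★) translates directly into code. The following is a minimal numpy/scipy sketch under our own naming conventions (the paper does not provide an implementation); it draws one Gaussian projection, applies the classical Hotelling test in R^k, and uses the exact F critical value t_α.

```python
import numpy as np
from scipy.stats import f as f_dist

def projected_hotelling_test(X, Y, alpha=0.05, rng=None):
    """Level-alpha projected Hotelling test, following procedure (*).

    X is n1 x p and Y is n2 x p; requires p >= n/2 so that k = n // 2 <= p.
    Returns (reject, p_value).
    """
    rng = np.random.default_rng(rng)
    n1, p = X.shape
    n2 = Y.shape[0]
    n = n1 + n2 - 2
    k = n // 2                                   # projected dimension k = floor(n/2)
    P = rng.standard_normal((p, k))              # projection with i.i.d. N(0,1) entries
    Xp, Yp = X @ P, Y @ P                        # samples mapped to R^k
    Xc, Yc = Xp - Xp.mean(axis=0), Yp - Yp.mean(axis=0)
    S = (Xc.T @ Xc + Yc.T @ Yc) / n              # pooled covariance of projected data
    diff = Xp.mean(axis=0) - Yp.mean(axis=0)
    T2k = (n1 * n2 / (n1 + n2)) * diff @ np.linalg.solve(S, diff)
    # Under H0, T2k * (n - k + 1) / (k * n) follows an F_{k, n-k+1} distribution.
    t_alpha = (k * n / (n - k + 1)) * f_dist.ppf(1 - alpha, k, n - k + 1)
    p_value = f_dist.sf(T2k * (n - k + 1) / (k * n), k, n - k + 1)
    return T2k >= t_alpha, p_value
```

Conditional on the drawn projection, the critical value comes from an exact F distribution, which is the exact level-α guarantee for multivariate normal data claimed for procedure (★).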
In particular, condition (A1) means that the KL-divergence between N(μ1, Σ) and N(μ2, Σ) tends to 0 as n → ∞.

(A1) Suppose that δᵀΣ⁻¹δ = o(1).
(A2) Let there be a constant b ∈ (0, 1) such that n1/n → b.

To set the notation for Theorem 1, it is important to notice that each time the procedure (★) is implemented, a draw of Pᵀ_k induces a new test statistic T²_k. To make this dependence clear, recall θ := (δ, Σ), and let β(θ; Pᵀ_k) denote the exact (non-asymptotic) power function of our level-α test for problem (1), induced by a draw of Pᵀ_k, as in (★). Another key quantity that depends on Pᵀ_k is the KL-divergence between the projected sampling distributions N(Pᵀ_k μ1, Pᵀ_k Σ P_k) and N(Pᵀ_k μ2, Pᵀ_k Σ P_k). We denote this divergence by (1/2)Δ²_k, and a simple calculation shows that Δ²_k = δᵀ P_k (Pᵀ_k Σ P_k)⁻¹ Pᵀ_k δ.

Theorem 1. Under conditions (A1) and (A2), for almost all sequences of projections Pᵀ_k,

    β(θ; Pᵀ_k) − Φ( −z_{1−α} + (b(1−b)/√2) √n Δ²_k ) → 0  as n → ∞.    (4)

Remarks. Note that if Δ²_k = 0, e.g. under H0, then Φ(−z_{1−α} + 0) = α, which corresponds to blind guessing at level α. Consequently, the second term (b(1−b)/√2) √n Δ²_k determines the advantage of our procedure over blind guessing. Since Δ²_k is proportional to the KL-divergence between the projected sampling distributions, these observations conform to the intuition from Section 2 that the KL-divergence measures the discrepancy between H0 and H1.

3.2 Asymptotic relative efficiency (ARE)

Having derived an asymptotic power function for our test in Theorem 1, we are now in position to provide sufficient conditions for achieving greater power than two other recent procedures for problem (1): Srivastava and Du [7, 8] (SD), and Chen and Qin [9] (CQ). To the best of our knowledge, these works represent the state of the art³ among tests for problem (1) with a known asymptotic power function under (p, n) → ∞.

From Theorem 1, the asymptotic power function of our random projection-based test at level α is

    β_RP(θ; Pᵀ_k) := Φ( −z_{1−α} + (b(1−b)/√2) √n Δ²_k ).    (5)

The asymptotic power functions for the CQ and SD testing procedures at level α are

    β_CQ(θ) := Φ( −z_{1−α} + (b(1−b)/√2) n ‖δ‖₂² / |||Σ|||_F ),  and  β_SD(θ) := Φ( −z_{1−α} + (b(1−b)/√2) n δᵀD_σ⁻¹δ / |||R|||_F ).

Recall that D_σ := diag(Σ), and R denotes the correlation matrix associated with Σ. The functions β_CQ and β_SD are derived under local alternatives and asymptotic assumptions that are similar to the ones used here to obtain β_RP. In particular, all three functions can be obtained allowing p/n to tend to an arbitrary positive constant or infinity.

A standard method of comparing asymptotic power functions under local alternatives is through the concept of asymptotic relative efficiency (ARE) (e.g., see van der Vaart [18, p. 192]).
Since Φ is monotone increasing, the term added to −z_{1−α} inside the Φ functions above controls the power. To compare power between tests, the ARE is simply defined via the ratio of such terms. More explicitly, we define

    ARE(β_CQ; β_RP) := [ (n ‖δ‖₂² / |||Σ|||_F) / (√n Δ²_k) ]²,  and  ARE(β_SD; β_RP) := [ (n δᵀD_σ⁻¹δ / |||R|||_F) / (√n Δ²_k) ]².

Whenever the ARE is less than 1, our procedure is considered to have greater asymptotic power than the competing test, with our advantage being greater for smaller values of the ARE. Consequently, we seek sufficient conditions in Theorems 2 and 3 for ensuring that the ARE is small.

In the present context, the analysis of ARE is complicated by the fact that the ARE varies with n and depends on a random draw of Pᵀ_k through Δ²_k. Moreover, the quantity Δ²_k, and hence the ARE, are affected by the orientation of δ with respect to the eigenvectors of Σ. In order to consider an average-case scenario, where no single orientation of δ is of particular importance, we place a prior on the unit vector δ/‖δ‖₂, and assume that it is uniformly distributed on the unit sphere of R^p. We emphasize that our procedure (★) does not rely on this assumption, and that it is only a device for making an average-case comparison. Therefore, to be clear about the meaning of Theorems 2 and 3, we regard the ARE as a function of two random objects, Pᵀ_k and δ/‖δ‖₂, and our probability statements are made with this understanding.
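For a given θ = (δ, Σ), the asymptotic power functions and the AREs defined above can be evaluated by plug-in. The sketch below is illustrative code of our own (the function name and toy parameters are not from the paper): it draws one projection, computes Δ²_k, and returns the three asymptotic powers together with both AREs.

```python
import numpy as np
from scipy.stats import norm

def asymptotic_powers(delta, Sigma, n, b=0.5, alpha=0.05, rng=None):
    """Plug-in evaluation of beta_RP, beta_CQ, beta_SD and the two AREs."""
    rng = np.random.default_rng(rng)
    p = len(delta)
    k = n // 2
    P = rng.standard_normal((p, k))                  # random projection P_k
    Pd = P.T @ delta
    Dk2 = Pd @ np.linalg.solve(P.T @ Sigma @ P, Pd)  # Delta_k^2
    z = norm.ppf(1 - alpha)
    c = b * (1 - b) / np.sqrt(2)
    t_rp = np.sqrt(n) * Dk2
    t_cq = n * (delta @ delta) / np.linalg.norm(Sigma, 'fro')
    D_inv = 1.0 / np.diag(Sigma)
    R = Sigma * np.sqrt(np.outer(D_inv, D_inv))      # correlation matrix
    t_sd = n * (delta @ (D_inv * delta)) / np.linalg.norm(R, 'fro')
    powers = {name: norm.cdf(-z + c * t) for name, t in
              [('RP', t_rp), ('CQ', t_cq), ('SD', t_sd)]}
    are = {'CQ_vs_RP': (t_cq / t_rp) ** 2, 'SD_vs_RP': (t_sd / t_rp) ** 2}
    return powers, are
```

Averaging the AREs over many draws of P_k and of δ/‖δ‖₂ on the sphere gives the average-case comparison that Theorems 2 and 3 formalize.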
We complete the preparation for our comparison theorems by isolating four assumptions with n → ∞.

(A3) The vector δ/‖δ‖₂ is uniformly distributed on the p-dimensional unit sphere, independent of Pᵀ_k.
(A4) There is a constant a ∈ [0, 1) such that k/p → a.
(A5) The ratio (1/√k) · tr(Σ)/(p λ_min(Σ)) = o(1).
(A6) The matrix D_σ = diag(Σ) satisfies |||D_σ⁻¹|||_2 / tr(D_σ⁻¹) = o(1).

3.3 Comparison with Chen and Qin [9]

The next result compares the asymptotic power of our projection-based test with that of Chen and Qin [9]. The choice of ε₁ = 1 below (and in Theorem 3) is the reference for equal asymptotic performance, with smaller values of ε₁ corresponding to better performance of random projection.

Theorem 2. Assume conditions (A3), (A4), and (A5). Fix a number ε₁ > 0, and let c(ε₁) be any constant strictly greater than 4 / (ε₁ (1 − √a)⁴). If the inequality

    n ≥ c(ε₁) tr(Σ)² / |||Σ|||_F²    (6)

holds for all large n, then P[ARE(β_CQ; β_RP) ≤ ε₁] → 1 as n → ∞.

³Two other high-dimensional tests have been proposed in older works [6, 19, 20] that lead to the asymptotic power function β_CQ, but under more restrictive assumptions.

Interpretation. To interpret the result, note that Jensen's inequality implies that for any choice of Σ, we have 1 ≤ tr(Σ)²/|||Σ|||_F² ≤ p. As such, it is reasonable to interpret this ratio as a measure of the effective dimension of the covariance structure. The message of Theorem 2 is that as long as the sample size n exceeds the effective dimension, then our projection-based test is asymptotically superior to CQ. The ratio tr(Σ)²/|||Σ|||_F² can also be viewed as measuring the decay rate of the spectrum of Σ, with tr(Σ)²/|||Σ|||_F² ≪ p indicating rapid decay. This condition means that the data has low variance in "most" directions in R^p, and so projecting onto a random set of k directions will likely map the data into a low-variance subspace in which it is harder for chance variation to explain away the correct hypothesis, thereby resulting in greater power.

3.4 Comparison with Srivastava and Du [7, 8]

We now turn to comparison of asymptotic power with the test of Srivastava and Du (SD).

Theorem 3. In addition to the conditions of Theorem 2, assume that condition (A6) holds. Fix a number ε₁ > 0, and let c(ε₁) be any constant strictly greater than 4 / (ε₁ (1 − √a)⁴). If the inequality

    n ≥ c(ε₁) ( tr(Σ)/p )² ( tr(D_σ⁻¹)/|||R|||_F )²    (7)

holds for all large n, then P[ARE(β_SD; β_RP) ≤ ε₁] → 1 as n → ∞.

Interpretation. Unlike the comparison with the CQ test, the correlation matrix R plays a large role in determining the relative efficiency between our procedure and the SD test. The correlation matrix enters in two different ways. First, the Frobenius norm |||R|||_F is larger when the data variables are more correlated. Second, correlation mitigates the growth of tr(D_σ⁻¹), since this trace is largest when Σ is nearly diagonal and has a large number of small eigenvalues. Inspection of the SD test statistic in [7] shows that it does not make any essential use of correlation. By contrast, our T²_k statistic does take correlation into account, and so it is understandable that correlated data enhance the performance of our test relative to SD.

As a simple example, let ρ ∈ (0, 1) and consider a highly correlated situation where all variables have ρ correlation with all other variables.
Then, R = (1 − ρ) I_{p×p} + ρ 11ᵀ, where 1 ∈ R^p is the all-ones vector. We may also let Σ = R for simplicity. In this case, we see that |||R|||_F² = p + 2 (p choose 2) ρ² ≳ p², tr(D_σ⁻¹)/|||R|||_F ≲ 1, and tr(Σ)/p = 1, and then the sufficient condition (7) for outperforming SD is easily satisfied in terms of rates. We could even let the correlation ρ decay at a rate of n^{−q} with q ∈ (0, 1/2), and (7) would still be satisfied for large enough n. More generally, it is not necessary to use specially constructed covariance matrices Σ to demonstrate the superior performance of our method. Section 4 illustrates simulations involving randomly selected covariance matrices where T²_k is more powerful than SD.

Conversely, it is possible to show that condition (7) requires non-trivial correlation. To see this, first note that in the complete absence of correlation, we have |||R|||_F² = |||I_{p×p}|||_F² = p. Jensen's inequality implies that tr(D_σ⁻¹) ≥ p²/tr(D_σ) = p²/tr(Σ), and so

    ( tr(Σ)/p )² ( tr(D_σ⁻¹)/|||R|||_F )² ≥ p.

Altogether, this shows that if the data exhibits very low correlation, then (7) cannot hold when p grows faster than n. This will be illustrated in the simulations of Section 4.

4 Performance comparisons on real and synthetic data

In this section, we compare our procedure to state-of-the-art methods on real and synthetic data, illustrating the effects of the different factors involved in Theorems 2 and 3.

Comparison on synthetic data. In order to validate the consequences of our theory and compare against other methods in a controlled fashion, we performed simulations in four settings: slow/fast spectrum decay, and diagonal/random covariance structure. To consider two rates of spectrum decay, we selected p equally spaced values between 0.01 and 1, and raised them to the power 20 for fast decay and the power 5 for slow decay. Random covariance structure was generated by specifying the eigenvectors of Σ as the column vectors of the orthogonal component of a QR decomposition of a p × p matrix with i.i.d. N(0, 1) entries. In all cases, we sampled n1 = n2 = 50 data points from two multivariate normal distributions in p = 200 dimensions, and repeated the process 500 times with δ = 0 for H0, and 500 times with ‖δ‖₂ = 1 for H1. In the case of H1, δ was drawn uniformly from the unit sphere, as in Theorems 2 and 3. We fixed the total amount of variance by setting |||Σ|||_F = 50 in all cases. In addition to our random projection (RP)-based test, we implemented the methods of BS [6], SD [7], and CQ [9], all of which are designed specifically for problem (1) in the high-dimensional setting.
For the sake of completeness, we also compare against recent non-parametric procedures for the general two-sample problem that are based on kernel methods (MMD) [11] and (KFDA) [12], as well as area-under-curve maximization (TreeRank) [10].

The ROC curves from our simulations are displayed in the left block of four panels in Figure 1. These curves bear out the results of Theorems 2 and 3 in several ways. First notice that fast spectral decay improves the performance of our test relative to CQ, as expected from Theorem 2. If we set a = 0 and ε₁ = 1 in Theorem 2, then condition (6) for outperforming CQ is approximately n ≥ 75 in the case of fast decay. Given that n = 50 + 50 − 2 = 98, the advantage of our method over CQ in panels (b) and (d) is consistent with condition (6) being satisfied. In the case of slow decay, the same settings of a and ε₁ indicate that n ≥ 246 is sufficient for outperforming CQ. Since the ROC curve of our method is roughly the same as that of CQ in panels (a) and (c) (where again n = 98), our condition (6) is somewhat conservative for slow decay at the finite sample level.

To study the consequences of Theorem 3, observe that when the covariance matrix Σ is generated randomly, the amount of correlation is much larger than in the idealized case that Σ is diagonal. Specifically, for a fixed value of tr(Σ), the quantity tr(D_σ⁻¹)/|||R|||_F is much smaller in the presence of correlation. Consequently, when comparing (a) with (c), and (b) with (d), we see that correlation improves the performance of our test relative to SD, as expected from the bound in Theorem 3. More generally, the ROC curves illustrate that our method has an overall advantage over BS, CQ, KFDA, and MMD. Note that KFDA and MMD are not designed specifically for the n ≪ p regime.
In the case of zero correlation, it is notable that the TreeRank procedure displays a\nsuperior ROC curve to our method, given that it also employs a dimension reduction strategy.\n\n(a) diagonal \u03a3, slow decay\n\n(b) diagonal \u03a3, fast decay\n\n(e) FPR for genomic data\n\n(c) random \u03a3, slow decay\n\n(d) random \u03a3, fast decay\n\n(f) FPR for genomic data (zoom)\n\nFigure 1: Left and middle panels: ROC curves of several test statistics for two different choices of\ncorrelation structure and decay rate. (a) Diagonal covariance slow decay, (b) Diagonal covariance\nfast decay, (c) Random covariance slow decay, (d) Random covariance fast decay. Right panels: (e)\nFalse positive rate against p-value threshold on the gene expression experiment of Section 4 for RP\n((cid:63)), BS, CQ, SD and enrichment test, (f) zoom on the p-value < 0.1 region.\n\n7\n\nFalse positive rateTrue positive rate0.00.20.40.60.81.00.00.20.40.60.81.0RPSDCQBSKFDAMMDTreeRankFalse positive rateTrue positive rate0.00.20.40.60.81.00.00.20.40.60.81.0RPSDCQBSKFDAMMDTreeRank0.00.20.40.60.81.00.00.20.40.60.81.0Nominal level \u03b1False positive rateRPSDCQBSHGFalse positive rateTrue positive rate0.00.20.40.60.81.00.00.20.40.60.81.0RPSDCQBSKFDAMMDTreeRankFalse positive rateTrue positive rate0.00.20.40.60.81.00.00.20.40.60.81.0RPSDCQBSKFDAMMDTreeRank0.000.020.040.060.080.100.000.020.040.060.080.10Nominal level \u03b1False positive rateRPSDCQBSHG\fsample problems is genuinely high-dimensional. Speci\ufb01cally, we have 14\u00d7 ((cid:0)6\n\nComparison on high-dimensional gene expression data. The ability to identify gene sets having\ndifferent expression between two types of conditions, e.g., benign and malignant forms of a disease,\nis of great value in many areas of biomedical research. 
Likewise, there is considerable motivation to study our procedure in the context of detecting differential expression of p genes between two small groups of patients of sizes n1 and n2.

To compare the performance of our T_k^2 statistic against the competitors CQ and SD in this type of application, we constructed a collection of 1680 distinct two-sample problems in the following manner, using data from three genomic studies of ovarian [21], myeloma [22], and colorectal [23] cancers. First, we randomly split the 3 datasets respectively into 6, 4, and 6 groups of approximately 50 patients. Next, we considered pairwise comparisons between all sets of patients on each of 14 biologically meaningful gene sets from the canonical pathways of MSigDB [24], with each gene set containing between 75 and 128 genes. Since n1 ≈ n2 ≈ 50 for all patient sets, our collection of two-sample problems is genuinely high-dimensional. Specifically, we have 14 × ((6 choose 2) + (4 choose 2) + (6 choose 2)) = 504 problems under H0 and 14 × (6·4 + 6·4 + 6·6) = 1176 problems under H1, assuming that every gene set was differentially expressed between two sets of patients with different cancers, and that no gene set was differentially expressed between two sets of patients with the same cancer type.4

A natural performance measure for comparing test statistics is the actual false positive rate (FPR) as a function of the nominal level α. When testing at level α, the actual FPR should be as close to α as possible, but differences may occur if the distribution of the test statistic under H0 is not known exactly (as is the case in practice). Figure 1 (e) shows that the curve for our procedure is closer to the optimal diagonal line for most values of α than the competing curves. Furthermore, the lower-left corner of Figure 1 (e) is of particular importance, as practitioners are usually only interested in p-values lower than 10⁻¹.
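The FPR-versus-α curve plotted in Figure 1 (e) is simply the empirical CDF of the p-values collected under H0, evaluated at each nominal level. A minimal sketch of this computation (function and variable names are ours):

```python
import numpy as np

def fpr_curve(h0_pvalues, alphas):
    """Empirical false positive rate at each nominal level alpha: the
    fraction of H0 p-values at or below alpha (the empirical CDF of the
    null p-values). A well-calibrated test tracks FPR(alpha) = alpha."""
    h0_pvalues = np.asarray(h0_pvalues)
    return np.array([np.mean(h0_pvalues <= a) for a in alphas])
```

For a test whose null distribution is known exactly, the H0 p-values are uniform on [0, 1] and the curve concentrates around the diagonal; deviations above the diagonal correspond to excess false positives.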
Figure 1 (f) is a zoomed plot of this region and shows that the SD and CQ tests commit too many false positives at low thresholds. Again, in this regime, our procedure is closer to the diagonal and safely commits fewer than the allowed number of false positives. For example, when thresholding p-values at 0.01, SD has an actual FPR of 0.03, and an even more excessive FPR of 0.02 when thresholding at 0.001. The tests of CQ and BS are no better. The same thresholds on the p-values of our test lead to false positive rates of 0.008 and 0, respectively.

With regard to ROC curves, the samples arising from different cancer types are dissimilar enough that BS, CQ, SD, and our method all obtain perfect ROC curves (no H1 case has a larger p-value than any H0 case). We also note that the hypergeometric test-based (HG) enrichment analysis often used by experimentalists on this problem [25] gives a suboptimal area under the curve of 0.989.

5 Conclusion

We have proposed a novel testing procedure for the two-sample test of means in high dimensions. This procedure can be implemented in a simple manner by first projecting a dataset with a single randomly drawn matrix, and then applying the standard Hotelling T^2 test in the projected space. In addition to obtaining the asymptotic power of this test, we have provided interpretable conditions on the covariance matrix Σ for achieving greater power than competing tests in the sense of asymptotic relative efficiency. Specifically, our theoretical comparisons show that our test is well suited to interesting regimes where most of the variance in the data can be captured in a relatively small number of variables, or where the variables are highly correlated.
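The two-step implementation just described (project once with a single random matrix, then apply the classical Hotelling T^2 test in the projected space) can be sketched as follows. This is a minimal illustration rather than the authors' code: the Gaussian projection, the classical F reference distribution in the projected k-dimensional space, and all names are our own choices.

```python
import numpy as np
from scipy import stats

def rp_hotelling_test(X, Y, k, seed=None):
    """Project both samples with one random Gaussian matrix, then apply
    the classical two-sample Hotelling T^2 test in R^k (a sketch)."""
    rng = np.random.default_rng(seed)
    n1, p = X.shape
    n2 = Y.shape[0]
    P = rng.standard_normal((p, k))        # single random projection
    Xk, Yk = X @ P, Y @ P                  # projected samples
    diff = Xk.mean(axis=0) - Yk.mean(axis=0)
    # Pooled sample covariance in the projected space.
    S = ((n1 - 1) * np.cov(Xk, rowvar=False)
         + (n2 - 1) * np.cov(Yk, rowvar=False)) / (n1 + n2 - 2)
    T2 = (n1 * n2) / (n1 + n2) * diff @ np.linalg.solve(S, diff)
    # Classical null distribution: scaled T^2 follows F(k, n1 + n2 - k - 1).
    F = T2 * (n1 + n2 - k - 1) / ((n1 + n2 - 2) * k)
    pval = stats.f.sf(F, k, n1 + n2 - k - 1)
    return T2, pval
```

Note that the pooled covariance in the projected space is k × k, so it is invertible whenever k < n1 + n2 − 2, which is exactly what makes the classical test usable even when p exceeds the sample size.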
Furthermore, in the realistic case of (n, p) = (98, 200), these regimes were shown to correspond to favorable performance of our test against several competitors in ROC curve comparisons on simulated data. Finally, we showed on real gene expression data that our procedure was more reliable than competitors in terms of its false positive rate. Extensions of this work may include more refined applications of random projection to high-dimensional testing problems.

Acknowledgements. The authors thank Sandrine Dudoit, Anne Biton, and Peter Bickel for helpful discussions. MEL gratefully acknowledges the support of the DOE CSGF Fellowship under grant number DE-FG02-97ER25308, and LJ the support of Stand Up to Cancer. MJW was partially supported by NSF grant DMS-0907632.

4 Although this assumption could be violated by the existence of various cancer subtypes, or technical differences between original tissue samples, our initial step of randomly splitting the three cancer datasets into subsets guards against these effects.

References

[1] E. L. Lehmann and J. P. Romano. Testing Statistical Hypotheses. Springer Texts in Statistics. Springer, New York, third edition, 2005.

[2] Y. Lu, P. Liu, P. Xiao, and H. Deng. Hotelling's T^2 multivariate profiling for detecting differential expression in microarrays. Bioinformatics, 21(14):3105–3113, Jul 2005.

[3] J. J. Goeman and P. Bühlmann. Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics, 23(8):980–987, Apr 2007.

[4] D. Van De Ville, T. Blu, and M. Unser. Integrated wavelet processing and spatial statistical testing of fMRI data. Neuroimage, 23(4):1472–1485, 2004.

[5] U. Ruttimann et al. Statistical analysis of functional MRI data in the wavelet domain. IEEE Transactions on Medical Imaging, 17(2):142–154, 1998.

[6] Z. Bai and H. Saranadasa.
Effect of high dimension: by an example of a two sample problem. Statistica Sinica, 6:311–329, 1996.

[7] M. S. Srivastava and M. Du. A test for the mean vector with fewer observations than the dimension. Journal of Multivariate Analysis, 99:386–402, 2008.

[8] M. S. Srivastava. A test for the mean with fewer observations than the dimension under non-normality. Journal of Multivariate Analysis, 100:518–532, 2009.

[9] S. X. Chen and Y. L. Qin. A two-sample test for high-dimensional data with applications to gene-set testing. Annals of Statistics, 38(2):808–835, Feb 2010.

[10] S. Clémençon, M. Depecker, and N. Vayatis. AUC optimization and the two-sample problem. In Advances in Neural Information Processing Systems (NIPS 2009), 2009.

[11] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola. A kernel method for the two-sample-problem. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 513–520. MIT Press, Cambridge, MA, 2007.

[12] Z. Harchaoui, F. Bach, and E. Moulines. Testing for homogeneity with kernel Fisher discriminant analysis. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, NIPS. MIT Press, 2007.

[13] M. E. Lopes, L. Jacob, and M. J. Wainwright. A more powerful two-sample test in high dimensions using random projection. Technical Report arXiv:1108.2401, 2011.

[14] S. S. Vempala. The Random Projection Method. DIMACS Series in Discrete Mathematics and Theoretical Computer Science. American Mathematical Society, 2004.

[15] L. Jacob, P. Neuvial, and S. Dudoit. Gains in power from structured two-sample tests of means on graphs. Technical Report arXiv:1009.5173, 2010.

[16] J. A. Cuesta-Albertos, E. Del Barrio, R. Fraiman, and C. Matrán. The random projection method in goodness of fit for functional data.
Computational Statistics & Data Analysis, 51(10):4814–4831, 2007.

[17] R. J. Muirhead. Aspects of Multivariate Statistical Theory. John Wiley & Sons, Inc., 1982.

[18] A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 2007.

[19] A. P. Dempster. A high dimensional two sample significance test. Annals of Mathematical Statistics, 29(4):995–1010, 1958.

[20] A. P. Dempster. A significance test for the separation of two highly multivariate small samples. Biometrics, 16(1):41–50, 1960.

[21] R. W. Tothill et al. Novel molecular subtypes of serous and endometrioid ovarian cancer linked to clinical outcome. Clin Cancer Res, 14(16):5198–5208, Aug 2008.

[22] J. Moreaux et al. A high-risk signature for patients with multiple myeloma established from the molecular classification of human myeloma cell lines. Haematologica, 96(4):574–582, Apr 2011.

[23] R. N. Jorissen et al. Metastasis-associated gene expression changes predict poor outcomes in patients with Dukes stage B and C colorectal cancer. Clin Cancer Res, 15(24):7642–7651, Dec 2009.

[24] A. Subramanian et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA, 102(43):15545–15550, Oct 2005.

[25] T. Beissbarth and T. P. Speed. GOstat: find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics, 20(9):1464–1465, Jun 2004.