{"title": "Power analysis of knockoff filters for correlated designs", "book": "Advances in Neural Information Processing Systems", "page_first": 15446, "page_last": 15455, "abstract": "The knockoff filter introduced by Barber and Cand\\`es 2016 is an elegant framework for controlling the false discovery rate in variable selection. \nWhile empirical results indicate that this methodology is not too conservative,\nthere is no conclusive theoretical result on its power. When the predictors are i.i.d.\\ Gaussian, it is known that as the signal to noise ratio tend to infinity, the knockoff filter is consistent in the sense that one can make FDR go to 0 and power go to 1 simultaneously. In this work we study the case where the predictors have a general covariance matrix $\\bsigma$. We introduce a simple functional called \\emph{effective signal deficiency (ESD)} of the covariance matrix of the predictors\nthat predicts consistency of various variable selection methods. \nIn particular,\nESD reveals that the structure of the precision matrix \nplays a central role in consistency and therefore, so does the conditional independence structure of the predictors. To leverage this connection, we introduce \\emph{Conditional Independence knockoff}, a simple procedure that is able to compete with the more sophisticated knockoff filters and that is defined when the predictors obey a Gaussian tree graphical models (or when the graph is sufficiently sparse). 
Our theoretical results are supported by numerical evidence on synthetic data.", "full_text": "Power analysis of knockoff filters for correlated designs

Jingbo Liu
Institute for Data, Systems, and Society
Massachusetts Institute of Technology
Cambridge, MA 02139
jingbo@mit.edu

Philippe Rigollet
Department of Mathematics
Massachusetts Institute of Technology
Cambridge, MA 02139
rigollet@math.mit.edu

Abstract

The knockoff filter introduced by Barber and Candès 2016 is an elegant framework for controlling the false discovery rate in variable selection. While empirical results indicate that this methodology is not too conservative, there is no conclusive theoretical result on its power. When the predictors are i.i.d. Gaussian, it is known that as the signal-to-noise ratio tends to infinity, the knockoff filter is consistent in the sense that one can make FDR go to 0 and power go to 1 simultaneously. In this work we study the case where the predictors have a general covariance matrix Σ. We introduce a simple functional called effective signal deficiency (ESD) of the covariance matrix of the predictors that predicts consistency of various variable selection methods. In particular, ESD reveals that the structure of the precision matrix plays a central role in consistency and therefore, so does the conditional independence structure of the predictors. To leverage this connection, we introduce Conditional Independence knockoffs, a simple procedure that is able to compete with the more sophisticated knockoff filters and that is defined when the predictors obey a Gaussian tree graphical model (or when the graph is sufficiently sparse). Our theoretical results are supported by numerical evidence on synthetic data.

1 Introduction

Variable selection is a cornerstone of modern high-dimensional statistics and, more generally, of data-driven scientific discovery. 
Examples include selecting a few genes correlated to the incidence of a certain disease, or discovering a number of demographic attributes correlated to crime rates. A fruitful theoretical framework to study this question is the linear regression model in which we observe n independent copies of the pair (X, Y) ∈ R^p × R such that

Y = X^⊤θ + ξ,

where θ ∈ R^p is an unknown vector of coefficients, and ξ ~ N(0, nσ²) is a noise random variable. Throughout this work we assume that X ~ N(0, Σ) for some known covariance matrix Σ. Note that for notational simplicity our linear regression model is multiplied by √n compared to the standard scaling in high-dimensional linear regression [BRT09]. Clearly, this scaling, also employed in [JM14], has no effect on our results. In this work, we consider asymptotics where n/p → δ is fixed.

In this model, a variable selection procedure is a sequence of test statistics ψ1, . . . , ψp ∈ {0, 1}, one for each of the hypothesis testing problems

H0^(j) : θj = 0   vs.   H1^(j) : θj ≠ 0,   j = 1, . . . , p.   (1)

When p is large, a simultaneous control of all the type I errors leads to overly conservative procedures that impede statistically significant variables and, ultimately, scientific discovery.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

The False Discovery Rate (FDR) is a less conservative alternative to the global type I error. The FDR of a procedure (ψ1, . . . , ψp) is the expected proportion of erroneously rejected tests. Formally,

FDR := E[ #{j : ψj = 1, θj = 0} / (#{j : ψj = 1} ∨ 1) ].

Since its introduction more than two decades ago, various procedures have been developed to provably control this quantity under various assumptions. 
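As a quick illustration of these two error metrics, here is a minimal sketch (helper names such as `fdp_and_tpp` are ours, not from the paper) computing the empirical false discovery proportion, whose expectation is the FDR above, together with the true positive proportion used later as the power:

```python
def fdp_and_tpp(psi, theta):
    """Empirical error metrics for a selection psi against truth theta.

    psi   : list of 0/1 decisions (psi_j = 1 means H0^(j) is rejected)
    theta : list of true coefficients (theta_j = 0 means H0^(j) is true)
    Returns (FDP, TPP) where
      FDP = #{j : psi_j = 1, theta_j = 0} / (#{j : psi_j = 1} v 1)
      TPP = #{j : psi_j = 1, theta_j != 0} / #{j : theta_j != 0}
    """
    selected   = sum(psi)
    false_disc = sum(1 for s, t in zip(psi, theta) if s == 1 and t == 0)
    true_disc  = sum(1 for s, t in zip(psi, theta) if s == 1 and t != 0)
    nonnull    = sum(1 for t in theta if t != 0)
    return false_disc / max(selected, 1), true_disc / max(nonnull, 1)

# one false discovery among three selections, both non-nulls recovered
print(fdp_and_tpp([1, 1, 0, 1], [0.0, 2.0, 0.0, 3.0]))
```

Averaging the first coordinate over repeated experiments estimates the FDR; the "∨ 1" in the denominator avoids division by zero when nothing is selected.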
Central among these is the Benjamini-Hochberg procedure, which is guaranteed to lead to a desired FDR control under the assumption that the design matrix X = (X1, . . . , Xn)^⊤ ∈ R^{n×p}, formed by the concatenation of the n vectors X1, . . . , Xn, is deterministic and orthogonal [BH95, STS04].

In the presence of correlation between the variables, that is, when the design matrix fails to be orthogonal, the problem becomes much more difficult. Indeed, if the variables Xj and Xk are highly correlated, any standard procedure will tend to output a similar coefficient for both or, in the case of the Lasso for example, simply choose one of the two variables rather than both.

Recently, the knockoff filter of Barber and Candès [BC15, CFJL18] has emerged as a competitive alternative to the Benjamini-Hochberg procedure for FDR control in the presence of correlated variables, and has demonstrated great empirical success [KS19, SKB+]. The terminology "knockoffs" refers to a vector X̃ ∈ R^p that is easy to mistake for the original vector X but is crucially independent of Y given X. Formally, X̃ is a knockoff of X if (i) X̃ is independent of Y given X and (ii) for any S ⊂ {1, . . . , p}, it holds that

(X, X̃)_swap(S) d= (X, X̃),   (2)

where d= denotes equality in distribution and (X, X̃)_swap(S) is the vector Z ∈ R^{2p} with jth coordinate given by

Zj = Xj   if j ∈ ({1, . . . , p} \ S) ∪ (S + {p}),
Zj = X̃j  if j ∈ S ∪ ({p + 1, . . . , 2p} \ (S + {p})).

In words, for any vector in R^{2p}, the operator (·)_swap(S) swaps each coordinate j ∈ S with the coordinate j + p and leaves the other coordinates unchanged. We call a knockoff mechanism any family of probability distributions (Px, x ∈ R^p) over R^p such that X̃ ~ PX is a knockoff of X. 
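The swap operator can be made concrete in a few lines; a minimal sketch (0-based indices, whereas the paper is 1-based; the function name `swap` is ours):

```python
def swap(z, S, p):
    """Apply the coordinate-swap operator (.)_swap(S) to a 2p-vector z:
    for each j in S, coordinates j and j + p are exchanged;
    all other coordinates are left untouched."""
    z = list(z)
    for j in S:
        z[j], z[j + p] = z[j + p], z[j]
    return z

# z represents (X, X~) = (x1, x2, x1~, x2~); swapping S = {0} exchanges x1 and x1~
print(swap(["x1", "x2", "x1~", "x2~"], S={0}, p=2))  # -> ['x1~', 'x2', 'x1', 'x2~']
```

Note that the operator is an involution: applying it twice with the same S recovers the original vector, consistent with the exchangeability property (2) being symmetric in X and X̃.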
Since the knockoff is constructed independently of Y, it serves as a benchmark to evaluate how much of the coefficient of a certain variable is due to its correlation with Y and how much of it is due to its correlation with the other variables.

With this idea in mind, the knockoff filter is then constructed from the following four steps:

1. Generate knockoffs. For i = 1, . . . , n, given Xi ∈ R^p, generate a knockoff X̃i ~ P_{Xi} and form the n × 2p design matrix [X, X̃], where X̃ = (X̃1, . . . , X̃n)^⊤ ∈ R^{n×p} is obtained by concatenating the knockoff vectors.

2. Collect scores for each variable. Define the 2p-dimensional vector¹ θ̂ as the Lasso estimator

θ̂ = argmin_{θ ∈ R^{2p}} (1/(2n))‖Y − [X, X̃]θ‖²₂ + λ‖θ‖₁,   (3)

where Y = (Y1, . . . , Yn)^⊤ is the response vector, and collect the differences of absolute coefficients between variables and knockoffs into a set D = {|Δj|, j = 1, . . . , p} \ {0}, where the Δj's are any constructed statistics satisfying certain symmetry conditions [BC15]. A frequent choice is

Δj := |θ̂j| − |θ̂_{j+p}|,   j = 1, . . . , p.

In this work we replace θ̂ by the debiased version θ̂^u (see (7) ahead) in the above definition.

3. Threshold. Given a desired FDR bound q ∈ (0, 1), define the threshold

T := min{ t ∈ D : #{j : Δj ≤ −t} / (#{j : Δj ≥ t} ∨ 1) ≤ q }.

¹Regression problems with knockoffs are 2p-dimensional rather than p-dimensional. To keep track of this fact, we use · to denote a 2p-dimensional vector.

4. Test. For all j = 1, . . . , p, answer the hypothesis testing problem (1) with the test

ψj = 1{Δj ≥ T}.

This procedure is guaranteed to satisfy FDR ≤ q [BC15, Theorem 1] no matter the choice of knockoffs. Clearly, X̃ = X is a valid choice of knockoffs but it will inevitably lead to no discoveries. The ability of a variable selection procedure (ψ1, . . . , ψp) to discover true positives is captured by its power (or true positive proportion), defined as

PWR := E[ #{j : ψj = 1, θj ≠ 0} / #{j : θj ≠ 0} ].

Intuitively, to maximize power, knockoffs should be as uncorrelated with X as possible while satisfying the exchangeability property (2). Following this principle, various knockoff mechanisms have been proposed in different settings, which typically involve solving an optimization problem to minimize a heuristic notion of correlation [BC15, CFJL18, RSC18]. Because of this optimization problem, knockoff mechanisms with analytical expressions are rare, with the exception of the equi-knockoff [BC15] and metropolized knockoff sampling [BCJW19]. Partly due to this, the theoretical analysis of the power of the knockoff filter has been very limited, even in the Gaussian setting. In the special case where X ~ N(0, D) for some diagonal matrix D, i.e. when the variables are independent, one can simply take X̃ ~ N(0, D) independent of X. In this case, the power of the knockoff filter tends to 1 as the signal-to-noise ratio tends to infinity [WBC17].

When predictors are correlated, [FDLL19] proved a lower bound on the power, where the limiting power as n → ∞ is bounded below in terms of the number p of predictors and the extremal eigenvalues of the covariance matrix of the true and knockoff variables. While this lower bound provides a sufficient condition for situations where the power tends to 1, it is loose in certain scenarios. 
For example, if all predictors are independent except that two of them are almost surely equal, then the minimum eigenvalue of the covariance matrix is zero and yet, experimental results indicate that the FDR and the power of the knockoff filter are almost unchanged.

Our contribution. In this paper, we revisit the statistical performance of the knockoff filter for X ~ N(0, Σ) and characterize the situations where the knockoff filter is consistent, that is, when its FDR tends to 0 and its power tends to 1 simultaneously. More specifically, under suitable limit assumptions, we show that the knockoff filter is consistent if and only if the empirical distribution of the diagonal elements of the precision matrix P := Σ⁻¹, where Σ denotes the covariance matrix of [X, X̃] ∈ R^{2p}, converges to a point mass at 0. In turn, we propose an explicit criterion, called effective signal deficiency and defined formally in (8), to practically evaluate consistency or lack thereof. Here the term "signal" refers to the covariance structure Σ of X, and the effective signal deficiency essentially measures how weak such a signal can be for a knockoff mechanism to remain consistent.

A second contribution is to propose a new knockoff mechanism, called Conditionally Independent Knockoffs (CIK), which possesses both simple analytic expressions and excellent experimental performance. CIK does not exist for all Σ, but we show its existence for tree graphical models or other sufficiently sparse graphs. Note that in practice, the so-called model-X knockoff filter requires the knowledge of Σ, an estimation of which is often prohibitive except when the graph has sparse or tree structure. CIK has simple explicit expressions of the effective signal deficiency for tree models, since the empirical distribution of the diagonals of Σ⁻¹ is the same as that of (P²jj Σjj)_{j=1}^p. We remark that CIK is different from metropolized knockoff sampling studied in [BCJW19] (which originally appeared in [CFJL18, Section 3.4.1]), even in the case of Gaussian Markov chains. The latter exists for generic distributions and is computationally efficient for Markov chains.

Notation. We write [n] := {1, . . . , n} and 1 to denote the all-ones vector. For any vector θ, let ‖θ‖₀ and ‖θ‖₁ denote its ℓ₀ and ℓ₁ norms. Given a vector x, we denote by diag(x) the diagonal matrix whose diagonal elements are given by the entries of x and, for a matrix M, we denote by diag(M) the vector whose entries are given by the diagonal entries of M. For a standard Gaussian random variable ξ ~ N(0, 1) and any real number r, we denote by Q(r) = P[ξ > r] the Gaussian tail probability. Finally, we use the notation A ⪯ B to indicate the Loewner order: B − A is positive semidefinite.

2 Existing work

We focus this discussion on the case of a Gaussian design X. In this case, the exchangeability condition (2) implies that [X, X̃] has a covariance matrix of the form

Σ = [ Σ            Σ − diag(s) ;
      Σ − diag(s)  Σ           ].   (4)

As observed in [BC15], positive semidefiniteness of this matrix is equivalent to

0 ⪯ diag(s) ⪯ 2Σ   (5)

for some s ∈ R^p. As a result, finding a knockoff mechanism consists in finding s.

The seminal works [BC15, CFJL18] introduce the following knockoff mechanisms:

EQUI-KNOCKOFFS: The vector s is chosen of the form s = s1 for some s ≥ 0. In light of (5), the largest value possible for s is 2λmin(Σ). 
Assuming the normalization diag(Σ) = 1, [CFJL18] recommend choosing

s = 2λmin(Σ) ∧ 1,   (6)

with the goal of minimizing the correlation between Xj and X̃j.

SDP-KNOCKOFFS: The vector s is chosen to solve the following semidefinite program:

min ‖diag(Σ) − s‖₁   s.t.   0 ⪯ diag(s) ⪯ diag(Σ),   diag(s) ⪯ 2Σ.

ASDP-KNOCKOFFS: Assume the normalization diag(Σ) = 1. Choose an approximation Σa of Σ (see [CFJL18]) and solve:

minimize ‖1 − ŝ‖₁   subject to   ŝ ≥ 0,   diag(ŝ) ⪯ 2Σa,

and then solve:

minimize γ   subject to   diag(γŝ) ⪯ 2Σ,

and put s = γŝ.

We do not discuss other knockoff constructions, such as the exact construction [CFJL18, Section 3.4.1] and deep knockoffs [RSC18], which mostly target general non-Gaussian distributions.

As alluded to previously, [WBC17] performed a power analysis in the linear (fixed n/p) regime for Σ = Ip, in which case all the above knockoff mechanisms give the same answer s = 1. For a general Σ, [FDLL19] derived lower bounds on the power in terms of the minimum eigenvalue of the extended covariance matrix Σ (no specific knockoff mechanism is assumed).

3 Overview of the main results

In this paper, we focus on the so-called linear regime where the sampling rate n/p converges to a constant δ. We allow for general Σ and, for simplicity, rather than using the Lasso estimator θ̂ defined in (3), we employ a debiased version [ZZ14, vdGBRD14, JM14]

θ̂^u := θ̂ + (d/n) Σ⁻¹X^⊤(Y − Xθ̂),   (7)

where 1/d = 1 − ‖θ̂‖₀/n. 
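Returning to the constructions above, the equi-knockoff is simple enough to sketch directly: build the joint covariance (4) with the choice (6) and confirm the feasibility condition (5) numerically. This is a minimal illustration assuming NumPy; the helper name `equi_knockoff_joint_cov` is ours:

```python
import numpy as np

def equi_knockoff_joint_cov(Sigma):
    """Joint covariance (4) of [X, X~] under the equi-knockoff choice (6):
    s = (2 * lambda_min(Sigma) ^ 1) * 1, assuming diag(Sigma) = 1."""
    p = Sigma.shape[0]
    lam_min = np.linalg.eigvalsh(Sigma)[0]      # eigvalsh returns ascending order
    s = min(2 * lam_min, 1.0) * np.ones(p)
    off = Sigma - np.diag(s)
    return np.block([[Sigma, off], [off, Sigma]])

# AR(1) design with correlation 0.5 and unit variances
p = 4
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
G = equi_knockoff_joint_cov(Sigma)
# (5) guarantees positive semidefiniteness (up to floating-point error)
print(np.linalg.eigvalsh(G)[0])
```

When s = 2λmin(Σ) is the binding choice, the joint matrix is singular by construction, so the smallest eigenvalue printed is numerically zero rather than strictly positive.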
To allow for asymptotic results, we consider a sequence {(Σ(p), θ(p))}_{p≥1} where Σ(p) are covariance matrices of size m(p) × m(p) and θ(p) ∈ R^{m(p)} are vectors of coefficients. Note that we will only consider the cases where m(p) = p or m(p) = 2p, depending on whether we consider predictors with or without knockoffs.

At first glance, it is unclear that for such general sequences, any meaningful result can be said about the debiased Lasso estimator θ̂^u defined in (7). To overcome this obvious limitation, we consider the asymptotic setting where a standard distributional limit exists in the sense of [JM14, Definition 4.1].

Definition 1 (Standard distributional limit). Assume a constant sampling rate n(p) = δm(p). A sequence {(Σ(p), θ(p))}_{p≥1} is said to have a standard distributional limit with sparsity (α, β), if (i) there exist τ ≠ 0 deterministic and d, possibly random, such that the empirical measure

(1/m(p)) Σ_{j=1}^{m(p)} δ_{(θj, (θ̂^u_j − θj)/τ, (Σ⁻¹)jj)}

converges almost surely weakly to a probability measure ν on R³ as p → ∞. Here, ν is the probability distribution of (Θ, Υ^{1/2}Z, Υ), where Z ~ N(0, 1), and Θ and Υ are some random variables independent of Z. Moreover, we ask that (ii) as p → ∞, it holds almost surely that

(1/p)‖θ(p)‖₀ → α := P[|Θ| > 0]   and   (1/p)‖θ(p)‖₁ → β := E[|Θ|].

Note that (i) implies that lim inf_{p→∞} ‖θ(p)‖₁/p ≥ E[|Θ|] and lim inf_{p→∞} ‖θ(p)‖₀/p ≥ P[|Θ| > 0], almost surely. 
We further impose that equalities are achieved in (ii).

As mentioned in [JM14], characterizing instances having a standard distributional limit is highly nontrivial. Yet, at least, the definition is non-empty since it contains the case of a standard Gaussian design. Moreover, a non-rigorous replica argument indicates that the standard distributional limit exists as long as a certain functional defined on R² has a differentiable limit [JM14, Replica Method Claim 4.6], which is always satisfied for block diagonal Σ where the empirical distribution of the blocks converges.

We remark that in the sparse regime where ‖θ‖₀ = o(p), rigorous results that do not appeal to the replica method show that the weak convergence of the distribution of {(θj, Pjj)}_{j=1}^p is essentially sufficient for the existence of a standard distributional limit ([JM14, Theorem 4.5]), although the present paper does not concern that regime.

We now introduce the key criterion to characterize consistency of a knockoff mechanism and, more generally, of a variable selection procedure.

Definition 2 (Effective signal deficiency). For a given variable selection procedure, ESD(p) ≥ 0 is a function of Σ(p) with the following property: for the class of sequences (θ(p), Σ(p))_{p≥1} satisfying suitable distributional limit conditions, vanishing ESD is equivalent to consistency of the test:

ESD := lim sup_{p→∞} ESD(p) → 0  ⇐⇒  lim sup_{p→∞} {FDR(p) + (1 − PWR(p))} → 0.

When we consider knockoff filters, ESD is frequently expressed in terms of the extended covariance matrix Σ, which is in turn a function of Σ for a given knockoff mechanism. 
In that setting, the "suitable distributional limit conditions" in the above definition require that the sequence of extended instances (θ(p), Σ(p))_{p≥1} has a standard distributional limit.

Note that by definition, ESD is not unique, and our goal is to find simple representations of its equivalence class. ESD is a potentially useful concept in comparing or evaluating different ways of generating knockoff matrices. As an analogy, think of the various notions of convergence of probability measures. A sequence of probability measures may converge in one topology but not in another. Similarly, one may cook up different functionals of the covariance matrix, such as lim_{p→∞} p Tr⁻¹(Σ) and lim_{p→∞} p Tr(Σ⁻¹), which both intuitively characterize some sort of signal deficiency since they tend to be small when the signal gets stronger. However, they are not equivalent, and convergence of the second to 0 is stronger in the sense that the first must vanish when the second vanishes. ESD is intended to be the correct notion of "convergence" that characterizes FDR tending to 0 and power tending to 1.

Of course, by definition it is not obvious that a succinct expression of such an effective signal deficiency exists. Remarkably, we find that the effective signal deficiency can be characterized by the convergence of a certain empirical distribution derived from Σ. The effective signal deficiency for various (old and new) variable selection procedures is as follows:

LASSO: The debiased Lasso [JM14] is a popular method for high-dimensional statistical inference. It is implemented by first computing a Lasso estimator

θ̂ = argmin_{θ ∈ R^p} { (1/(2n))‖Y − Xθ‖²₂ + λ‖θ‖₁ },

where λ > 0 can be chosen as any fixed positive number independent of p. 
Instead of a direct threshold test on θ̂, we first compute an "unbiased version" θ̂^u defined in (7), as in [JM14], and pass a threshold to select non-nulls. We show in Theorem 3 and Proposition 4 that we may choose

ESD = lim_{p→∞} dLP( (1/p) Σ_{j=1}^p δ_{P(p)jj}, δ₀ ),

where dLP denotes the Lévy-Prokhorov distance, defined for any two measures μ and ν over a metric space as

dLP(μ, ν) := inf{ε > 0 : μ(A) ≤ ν(A^ε) + ε, ν(A) ≤ μ(A^ε) + ε, ∀A},

where A^ε denotes the ε-neighborhood of A. In particular, we have

dLP( (1/p) Σ_{j=1}^p δ_{P(p)jj}, δ₀ ) = inf{ ε > 0 : #{j : P(p)jj ≥ ε}/p ≤ ε }.   (8)

The assumption of the standard distributional limit ensures the weak convergence of the empirical distribution of (P(p)jj)_{j=1}^p, and hence the convergence of (8). Hereafter, for any vector x ∈ R^m, we use the shorthand (abusive) notation

‖(xj)j‖LP := dLP( (1/m) Σ_{j=1}^m δ_{xj}, δ₀ ).

This characterization of ESD is, in fact, tight: ESD → 0 is a necessary and sufficient condition for consistency of the thresholded Lasso as a variable selection procedure (see Proposition 4).

GENERAL KNOCKOFF: for a general knockoff construction, including variational formulations such as SDP-knockoffs, it seems hopeless to find simple expressions of ESD in terms of Σ. 
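The right-hand side of (8) is easy to evaluate numerically: ε → #{j : xj ≥ ε}/m − ε is nonincreasing, so the infimum can be located by bisection. A minimal sketch (the function name `lp_to_delta0` is ours):

```python
def lp_to_delta0(x, tol=1e-9):
    """||(x_j)_j||_LP = inf{eps > 0 : #{j : x_j >= eps}/m <= eps},
    i.e. the Levy-Prokhorov distance (8) between the empirical law of x
    and the point mass delta_0, found by bisection on [0, 1]."""
    m = len(x)
    lo, hi = 0.0, 1.0  # the distance always lies in [0, 1]
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if sum(1 for v in x if v >= mid) / m <= mid:
            hi = mid   # feasible: the infimum is at most mid
        else:
            lo = mid   # infeasible: the infimum exceeds mid
    return hi

print(round(lp_to_delta0([5.0] * 10), 6))          # all mass far from 0
print(round(lp_to_delta0([0.0] * 99 + [5.0]), 6))  # a 1% fraction of outliers
```

The second example shows the robustness property used later to compare knockoff mechanisms: a small fraction of large entries moves the distance only by that fraction, whereas a functional like the maximum entry would blow up.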
Nevertheless, if (θ(p), Σ(p)) has a standard distributional limit, we can choose ESD = lim_{p→∞} ‖(P(p)jj)j‖LP, where we recall that P is the extended precision matrix of [X, X̃].

EQUI-KNOCKOFF: Specializing the above result to the equi-knockoff case, we see that we can choose ESD = lim_{p→∞} λmax(P(p)), achieved when s = aλmin(Σ) for any a ∈ (0, 2). Note that this is slightly different from the choice (6) prescribed in [BC15, CFJL18], where s := min{1, 2λmin(Σ)}.

CI-KNOCKOFF: We introduce a new method for generating the knockoff matrix, called conditional independence knockoff, or CI-knockoff for short. If the Gaussian graphical model associated to X is a tree, i.e. if the sparsity pattern of Σ⁻¹ corresponds to the adjacency matrix of a tree, then the conditional independence knockoff always exists and ESD = lim_{p→∞} ‖(P(p)jj Σjj)j‖LP. For example, in the independent case where Σ is diagonal, we get ESD = 1 which readily yields consistency.

The last knockoff construction, conditional independence knockoff, appears to be new. It is both analytically simple and empirically competitive. Comparing equi- and CI-knockoffs: the latter is more robust, since having a small fraction of j with large P²jj Σjj does not increase its ESD much. For example, if two predictors are identical, then the ESD for the conditional independence knockoff almost does not change, but the equi-knockoff completely fails. 
Compared to other previous knockoffs, we find that CI-knockoff usually shows similar or improved performance empirically, while being easier to compute and to manipulate.

4 Baseline: Lasso with oracle threshold

Consider a variable selection algorithm in which the Lasso parameters with absolute values above a threshold are selected, and suppose that the threshold which controls the FDR is given by an oracle. Note that the knockoff filter is based on the Lasso estimator but it must choose its threshold in a data-driven fashion. As a result, the Lasso with oracle threshold presents a strong baseline against which the performance of a given knockoff filter should be compared. Not surprisingly, and as also noted in [FDLL19], although the knockoff filter has the advantage of controlling FDR, it usually has lower power than the Lasso with oracle threshold. This fact will become more transparent as we determine their ESD.

Theorem 3. Let λ > 0 be arbitrary, let {(Σ(p), θ(p))}_{p≥1} admit a standard distributional limit, and denote the distributional limit by (Θ, Υ^{1/2}Z, Υ), where Z ~ N(0, 1), and Θ and Υ are some random variables independent of Z. Let L := lim_{p→∞} ‖(P(p)jj)j‖LP, where the limit exists almost surely by the standard distributional limit assumption. Consider the algorithm which selects those j for which |θ̂^u_j| ≥ t, where θ̂^u is defined in (7). Then, with the choice t = L^{1/4},

lim sup_{p→∞} {FDR(p) + (1 − PWR(p))} ≤ C_{L,μΘ,τ},

where lim_{L→0} C_{L,μΘ,τ} = 0 for any μΘ with P[|Θ| > 0] > 0 and τ as in the definition of the standard distributional limit. 
In particular, if δ > 1, then τ can be bounded in terms of σ, λ, δ and μΘ only (independently of μΥ), and hence C_{L,μΘ,τ} in the above inequality can be replaced by C_{L,μΘ,σ,λ,δ}, where lim_{L→0} C_{L,μΘ,σ,λ,δ} = 0.

The above theorem implies that L → 0 is a sufficient condition for consistency; this is in fact also necessary, as indicated by the following complementary lower bound.

Proposition 4 (Lower bound). In the previous theorem, assume further that Υ is independent of Θ. Then for any t > 0,

lim inf_{p→∞} {FDR(p) + (1 − PWR(p))} ≥ c_{L,σ,μΘ},

where c_{L,σ,μΘ} is increasing in L and strictly positive as long as L > 0.

Combining the above two results, we get the following interpretation. Suppose that the distribution of Θ and the value of σ are fixed, and suppose that the parameters λ and t in the algorithm are optimally tuned (i.e. minimize lim sup_{p→∞}{FDR(p) + (1 − PWR(p))} for the given distributions). If δ > 1 then, remarkably, the variable selection procedure is consistent if and only if L is small, as long as Υ is independent of Θ, while other characteristics of the law of Υ need not be known. In other words, we have proved that ESD = L := lim_{p→∞} ‖(P(p)jj)j‖LP. If δ ≤ 1, small L may not be sufficient for consistency since C_{L,μΘ,σ,λ,δ} also depends on μΥ through τ.

5 Results for general knockoff mechanisms

Given Σ, let Σ be the extended 2p × 2p covariance matrix of the true predictors and their knockoffs. Let θ = [θ, 0] ∈ R^{2p}. 
Consider the procedure of the knockoff filter described in Section 2, with a slight tweak: define Δj := |θ̂^u_j| − |θ̂^u_{j+p}|, where

θ̂^u = θ̂ + (d/n) Σ⁻¹[X, X̃]^⊤(Y − [X, X̃]θ̂)

and θ̂ is defined in (3). This modification still fulfills the sufficiency and antisymmetry conditions in [BC15, Section 2.2], so its FDR can still be controlled. This change allows us to perform the analysis using results in [JM14]. We also assume that the Lasso parameter λ is an arbitrary number independent of p.

Theorem 5. Let {(Σ(p), θ(p))}_{p≥1} admit a standard distributional limit for a given λ ≥ 0, and denote the distributional limit by (Θ, Υ^{1/2}Z, Υ), where Z ~ N(0, 1), and Θ and Υ are some random variables independent of Z. Let L := lim_{p→∞} ‖(P(p)jj)j‖LP, where the limit exists almost surely under the standard distributional limit assumption. Then the knockoff filter with FDR budget q ∈ (0, 1) satisfies

lim inf_{p→∞} PWR(p) ≥ 1 − C_{L,q,τ,μΘ},

where lim_{L→0} C_{L,q,τ,μΘ} = 0 for any given q, τ, μΘ. Further, if δ > 2, then C_{L,q,τ,μΘ} in the above inequality can be replaced by C_{L,q,λ,σ,δ,μΘ}.

Taking q → 0 in the above theorem implies that L → 0 is sufficient for consistency; the following result shows the necessity in a representative setting:

Proposition 6. In the previous theorem, further assume that θj = 1{j ∈ H1}, where |H1| = αp (α > 0) is selected uniformly at random. 
Then, under a suitable distributional limit assumption, the knockoff filter with FDR budget q ∈ (0, αL Q²(1/(σ√L))) satisfies

lim sup_{p→∞} PWR(p) ≤ 3/4.

The “suitable distributional limit assumption” in Proposition 6 postulates a Gaussian limit for the empirical distribution of the pairs (θ̂^u_j − θ_j, θ̂^u_{j+p} − θ_{j+p})_{j=1}^p, which is stronger than the marginal Gaussian limit assumption in Definition 1, but nevertheless supported by the replica heuristics. Moreover, this condition can be rigorously shown in the case δ > 2, λ = 0 (least squares), and block-diagonal Σ. The assumption that θ_j = 1 under H1 in Proposition 6 facilitates the proof, but we expect that a similar inconsistency result holds for general μ_Θ. The assumption that H1 is selected uniformly at random is a counterpart of the independence of Θ and Υ in Proposition 4.

Together, Theorem 5 and Proposition 6 show that, for the knockoff filter, ESD = lim_{p→∞} ‖(P^{(p)}_{jj})_j‖_{LP} in the regime δ > 1. This suggests that one should construct the knockoff variables so that the empirical distribution of (P_jj)_{j=1}^{2p} converges to 0 weakly.

6 Conditional independence knockoff and ESD

We introduce the conditional independence knockoff, in which X_j and X̃_j are independent conditionally on X_¬j := {X_k, k ∈ [p] \ {j}}, for each j = 1, . . . , p. This condition implies that

E[X_j X̃_j] = E[E[X_j X̃_j | X_¬j]] = E[(E[X_j | X_¬j])²].

Therefore, recalling that s_1, . . . , s_p are as defined in (4), we get

s_j = Σ_jj − E[X_j X̃_j] = E[E[X_j² | X_¬j]] − E[(E[X_j | X_¬j])²] = E[Var(X_j | X_¬j)] = P_jj^{−1}.   (9)

However, such an s may violate the positive semidefiniteness of the joint covariance matrix (examples exist already in the case p = 3). Yet, interestingly, we find that in the case of tree graphical models this construction always exists. In many practical scenarios, the predictors X^p come from a tree graphical model, and we can estimate the underlying graph using the Chow–Liu algorithm [CL68].

Theorem 7. The covariance matrix Σ defined in (4) is positive semidefinite with s defined in (9) if either 1) Σ is the covariance matrix of a tree graphical model, or 2) P is diagonally dominant.

Either condition in the theorem intuitively imposes that the graph is sparse. In practice, Σ needs to be estimated, which is generally only feasible under some sparsity structure (e.g., via the graphical lasso).

Assuming the existence of a standard distributional limit and δ > 1, we have the following results:

Theorem 8. For tree graphical models, ESD = lim_{p→∞} ‖(P^{(p)}_{jj} Σ_jj)_j‖_{LP} for CI-KNOCKOFF.

Theorem 9. ESD = λ_max(Σ) for EQUI-KNOCKOFF if s_j = aλ_min(Σ), a ∈ (0, 2), j = 1, . . . , p.

Figure 1: Comparisons of EQUI-KNOCKOFF, ASDP-KNOCKOFF, and CI-KNOCKOFF. Left: binary tree, equal correlations. Right: Markov chain, randomly chosen correlation strengths.

7 Experimental results

First consider the setting where X_1, . . . , X_p ∼ N(0, 1) and the conditional independence graph forms a binary tree. The correlations between adjacent nodes are all equal to 0.5.
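To make the construction of Section 6 concrete in this tree setting, the following sketch (ours, not from the paper; it uses a depth-3 binary tree with p = 7 nodes in place of the full p = 1000 experiment) builds the tree covariance, sets s_j = 1/P_jj as in (9), and numerically checks the positive semidefiniteness guaranteed by Theorem 7:

```python
import numpy as np

# Hypothetical small instance: depth-3 binary tree (p = 7), edge correlation 0.5.
p = 7
parent = [-1, 0, 0, 1, 1, 2, 2]  # node 0 is the root; children of i are 2i+1, 2i+2
rho = 0.5

def path_to_root(i):
    """Return the list [i, parent(i), ..., root]."""
    path = []
    while i != -1:
        path.append(i)
        i = parent[i]
    return path

depth = [len(path_to_root(i)) - 1 for i in range(p)]

# For a Gaussian tree model with unit variances, Corr(X_i, X_j) is the product
# of edge correlations along the tree path, i.e. rho ** dist(i, j).
Sigma = np.empty((p, p))
for i in range(p):
    for j in range(p):
        pj = path_to_root(j)
        lca = next(v for v in path_to_root(i) if v in pj)  # lowest common ancestor
        dist = depth[i] + depth[j] - 2 * depth[lca]
        Sigma[i, j] = rho ** dist

# Conditional independence knockoff: s_j = 1 / P_jj, as in equation (9).
P = np.linalg.inv(Sigma)
s = 1.0 / np.diag(P)

# Joint covariance of (X, X~); Theorem 7 asserts it is PSD for a tree model.
S = np.diag(s)
G = np.block([[Sigma, Sigma - S], [Sigma - S, Sigma]])
eigmin = np.linalg.eigvalsh(G).min()
print("s =", np.round(s, 3), " min eigenvalue of joint covariance:", eigmin)
```

The minimum eigenvalue printed at the end is nonnegative (up to floating-point error), as Theorem 7 predicts for a tree model.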
Choose k = 100 out of p = 1000 indices uniformly at random as the support of θ, and set θ_j = 4.5 for j in the support. Generate n = 1000 independent copies of (X, Y) in Y = X^⊤θ + ξ, where ξ ∼ N(0, n).

Figure 1, left, shows box plots of the power and FDR for EQUI-KNOCKOFF, ASDP-KNOCKOFF, and CI-KNOCKOFF, where s is defined as in (6) for CI-KNOCKOFF. The FDR is controlled at the target q = 0.1 in all three cases. The powers are not statistically significantly different, but the rough trend is PWR_e < PWR_a < PWR_c. We then compare the effective signal deficiency. Note that in the current setting, Var(X_j | X_¬j) ≤ 1, and hence P_jj ≥ 1, for each j = 1, . . . , 2p, so we always have ‖(P_jj)_{j=1}^{2p}‖_{LP} = 1 by definition (8), which cannot reveal any useful information for comparison. To resolve this, we can scale down P_jj by a common factor before computing the LP distances, noting that this still yields a valid effective signal deficiency. Lacking a systematic way of choosing such a scaling factor, we heuristically set it to 2000, so that the LP distances of the three algorithms are all “bounded away from 0 and 1”. We find that d_LP,e ≃ 0.501, d_LP,a ≃ 0.048, and d_LP,c ≃ 0.002, and their ordering matches the ordering of the powers.

In the previous example, the simplest EQUI-KNOCKOFF is highly competitive. However, this is an artifact of the fact that the data covariance is highly structured (i.e., the correlations are all the same). If the correlations fluctuate substantially, and in particular if a small number of node pairs are highly correlated, then the equi-knockoff performs much worse. This is demonstrated in the next example. Consider the setting where X_1, . . . , X_p forms a Markov chain, with X_1, . . . , X_p ∼ N(0, 1). In other words, the Gaussian graphical model is a path graph.
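The LP distances reported above can be computed directly. Definition (8) is not reproduced in this excerpt; assuming ‖·‖_LP denotes the Lévy–Prokhorov distance between the empirical distribution of the (scaled) values and the point mass at 0, it reduces for a one-dimensional sample to inf{ε > 0 : fraction(|x| > ε) ≤ ε}, which is monotone in ε and can be found by bisection. A minimal sketch under that assumption (function name ours):

```python
import numpy as np

def lp_dist_to_point_mass(vals, tol=1e-9):
    """Levy-Prokhorov distance between the empirical distribution of `vals`
    and the point mass at 0: inf{eps > 0 : P(|x| > eps) <= eps}."""
    a = np.abs(np.asarray(vals, dtype=float))
    lo, hi = 0.0, max(1.0, float(a.max()))
    # The condition P(|x| > eps) <= eps, once true, stays true as eps grows,
    # so the infimum can be located by bisection.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.mean(a > mid) <= mid:
            hi = mid
        else:
            lo = mid
    return hi

# Example: 10% of the scaled values are far from 0, the rest are 0.
vals = np.array([0.0] * 90 + [5.0] * 10)
print(lp_dist_to_point_mass(vals))  # ≈ 0.1
```

In this example the distance equals the fraction of outlying values, 0.1, which matches the role the LP distance plays above: it is small exactly when most of the (scaled) P_jj concentrate near 0.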
The correlation between X_j and X_{j+1} is ρ_j := G_j 1{|G_j| ≤ 1}, where G_j ∼ N(0, 0.25), j = 1, . . . , p − 1, are chosen independently. Choose k = 100 out of p = 1000 indices uniformly at random as the support of θ, and set θ_j = 4.5 for j in the support. Generate n = 1200 independent copies of (X, Y) in Y = X^⊤θ + ξ, where ξ ∼ N(0, 0.49n).

Figure 1, right, shows box plots of the power and FDR for the knockoff filter with the three different knockoff constructions. The target FDR is q = 0.1. Since the correlations are now chosen randomly, with high probability there exist highly correlated nodes, and hence λ_min(Σ) can be very small, in which case the equi-knockoff performs poorly. However, PWR_c is similar to PWR_a, with the median of the former slightly higher. To compare the ESD, first scale down P_jj by a heuristically chosen factor of 100. We find d_LP,e ≃ 0.9995, d_LP,a ≃ 0.8660, and d_LP,c ≃ 0.1075, and their ordering matches the ordering of the powers of the three knockoff constructions.

Acknowledgments

JL was supported by the IDSS Wiener Fellowship. PR was supported by NSF awards IIS-BIGDATA-1838071, DMS-1712596, and CCF-TRIPODS-1740751, and by ONR grant N00014-17-1-2147.

References

[BC15] Rina Foygel Barber and Emmanuel J. Candès. Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5):2055–2085, 2015.

[BCJW19] Stephen Bates, Emmanuel Candès, Lucas Janson, and Wenshuo Wang. Metropolized knockoff sampling. arXiv preprint arXiv:1903.00434, 2019.

[BH95] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing.
Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300, 1995.

[BRT09] Peter J. Bickel, Ya'acov Ritov, and Alexandre B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.

[CFJL18] Emmanuel Candès, Yingying Fan, Lucas Janson, and Jinchi Lv. Panning for gold: 'model-X' knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(3):551–577, 2018.

[CL68] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462–467, 1968.

[FDLL19] Yingying Fan, Emre Demirkaya, Gaorong Li, and Jinchi Lv. RANK: large-scale inference with graphical nonlinear knockoffs. Journal of the American Statistical Association, pages 1–43, 2019.

[JM14] Adel Javanmard and Andrea Montanari. Hypothesis testing in high-dimensional regression under the Gaussian random design model: asymptotic theory. IEEE Transactions on Information Theory, 60(10):6522–6554, 2014.

[KS19] Eugene Katsevich and Chiara Sabatti. Multilayer knockoff filter: controlled variable selection at multiple resolutions. The Annals of Applied Statistics, 13(1):1–33, 2019.

[RSC18] Yaniv Romano, Matteo Sesia, and Emmanuel Candès. Deep knockoffs. Journal of the American Statistical Association (to appear), 2018.

[SKB+] Matteo Sesia, Eugene Katsevich, Stephen Bates, Emmanuel Candès, and Chiara Sabatti. Multi-resolution localization of causal variants across the genome. bioRxiv, 2019.

[STS04] John D. Storey, Jonathan E. Taylor, and David Siegmund. Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(1):187–205, 2004.

[vdGBRD14] Sara van de Geer, Peter Bühlmann, Ya'acov Ritov, and Ruben Dezeure. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202, 2014.

[WBC17] Asaf Weinstein, Rina Barber, and Emmanuel Candès. A power and prediction analysis for knockoffs with lasso statistics. arXiv preprint arXiv:1712.06465, 2017.

[ZZ14] Cun-Hui Zhang and Stephanie S. Zhang. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):217–242, 2014.