{"title": "Variable Importance Using Decision Trees", "book": "Advances in Neural Information Processing Systems", "page_first": 426, "page_last": 435, "abstract": "Decision trees and random forests are well established models that not only offer good predictive performance, but also provide rich feature importance information. While practitioners often employ variable importance methods that rely on this impurity-based information, these methods remain poorly characterized from a theoretical perspective. We provide novel insights into the performance of these methods by deriving finite sample performance guarantees in a high-dimensional setting under various modeling assumptions. We further demonstrate the effectiveness of these impurity-based methods via an extensive set of simulations.", "full_text": "Variable Importance using Decision Trees\n\nS. Jalil Kazemitabar\n\nUCLA\n\nsjalilk@ucla.edu\n\nArash A. Amini\n\nUCLA\n\naaamini@ucla.edu\n\nAdam Bloniarz\nUC Berkeley\u2217\n\nadam@stat.berkeley.edu\n\nAmeet Talwalkar\n\nCMU\n\ntalwalkar@cmu.edu\n\nAbstract\n\nDecision trees and random forests are well established models that not only offer\ngood predictive performance, but also provide rich feature importance information.\nWhile practitioners often employ variable importance methods that rely on this\nimpurity-based information, these methods remain poorly characterized from a\ntheoretical perspective. We provide novel insights into the performance of these\nmethods by deriving finite sample performance guarantees in a high-dimensional\nsetting under various modeling assumptions. We further demonstrate the effectiveness\nof these impurity-based methods via an extensive set of simulations.\n\n1\n\nIntroduction\n\nKnown for their accuracy and robustness, decision trees and random forests have long been a\nworkhorse in machine learning [1]. 
In addition to their strong predictive accuracy, they are equipped\nwith measures of variable importance that are widely used in applications where model interpretability\nis paramount. Importance scores are used for model selection: predictors with high-ranking scores\nmay be chosen for further investigation, or for building a more parsimonious model.\nOne common approach naturally couples the model training process with feature selection [2, 5].\nThis approach, which we call TREEWEIGHT, calculates the feature importance score for a variable\nby summing the impurity reductions over all nodes in the tree where a split was made on that\nvariable, with impurity reductions weighted to account for the size of the node. For ensembles, these\nquantities are averaged over constituent trees. TREEWEIGHT is particularly attractive because it can\nbe calculated without any additional computational expense above the standard training procedure.\nHowever, as the training procedure in random forests combines several complex ingredients\u2014bagging,\nrandom selection of predictor subsets at nodes, line search for optimal impurity reduction, recursive\npartitioning\u2014theoretical investigation into TREEWEIGHT is extremely challenging. We propose a\nnew method called DSTUMP that is inspired by TREEWEIGHT but is more amenable to analysis.\nDSTUMP assigns variable importance as the impurity reduction at the root node of a single tree.\nIn this work we characterize the \ufb01nite sample performance of DSTUMP under an additive regression\nmodel, which also yields novel results for variable selection under a linear model, both with correlated\nand uncorrelated design. We corroborate our theoretical analyses with extensive simulations in which\nwe evaluate DSTUMP and TREEWEIGHT on the task of feature selection under various modeling\nassumptions. 
We also compare the performance of these techniques against established methods\nwhose behaviors have been theoretically characterized, including Lasso, SIS, and SpAM [12, 3, 9].\n\n\u2217Now at Google\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nOur work provides the first finite-sample high-dimensional analyses of tree-based variable selection\ntechniques, which are commonly used in practice but lacking in theoretical grounding. Although\nwe focus on DSTUMP, which is a relatively simple tree-based variable selection approach, our\nnovel proof techniques are highly non-trivial and suggest a path forward for studying more general\nmulti-level tree-based techniques such as TREEWEIGHT. Moreover, our simulations demonstrate\nthat such algorithmic generalizations exhibit impressive performance relative to competing methods\nunder more realistic models, e.g., non-linear models with interaction terms and correlated design.\n\n2 Related Work\n\nOur analysis is distinct from existing work in analyzing variable importance measures of trees and\nforests in several ways. To our knowledge, ours is the first analysis to consider the high-dimensional\nsetting, where the number of variables, p, and the size of the active set s, grow with the sample size\nn, and potentially p \u226b n.\nThe closest related work is the analysis of [8], which considers a fixed set of variables, in the limit of\ninfinite data (n = \u221e). Unlike DSTUMP\u2019s use of the root node only, [8] does consider importance\nscores derived from the full set of splits in a tree as in TREEWEIGHT. 
However, they make crucial\nsimplifying (and unrealistic) assumptions that are distinct from those of our analysis: (1) each variable\nis split on only once in any given path from the root to a leaf of the tree; (2) at each node a variable is\npicked uniformly at random among those not yet used at the parent nodes, i.e., the splits themselves\nare not driven by impurity reduction; and (3) all predictors are categorical, with splits being made on\nall possible levels of a variable, i.e., the number of child nodes equals the cardinality of the variable\nbeing split. Our analysis instead considers continuous-valued predictors, the split is based on actual\nimpurity reduction, and our results are nonasymptotic, i.e. they give high-probability bounds on\nimpurity measures for active and inactive variables that hold in \ufb01nite samples.\nA second line of related work is motivated by a permutation-based importance method [1] for feature\nselection. In practice, this method is computationally expensive as it determines variable importance\nby comparing the predictive accuracy of a forest before and after random permutation of a predictor.\nAdditionally, due to the algorithmic complexity of the procedure, it is not immediately amenable to\ntheoretical analysis, though the asymptotic properties of a simpli\ufb01ed variant of the procedure have\nbeen studied in [6].\nWhile our work is the \ufb01rst investigation of \ufb01nite-sample model selection performance of tree-based\nregression methods, alternative methods performing both linear and nonparametric regression in high\ndimensions have been studied in the literature. Considering model selection consistency results, most\nof the attention has been focused on the linear setting, whereas the nonparametric (nonlinear) setup\nhas been mostly studied in terms of the prediction consistency. 
Under a high-dimensional linear\nregression model, LASSO has been extensively studied and is shown to be minimax optimal for variable\nselection under appropriate regularity conditions, including the uncorrelated design with a moderate\n\u03b2min condition. Remarkably, while not tailored to the linear setting, we show that DSTUMP is nearly\nminimax optimal for variable selection in the same uncorrelated design setting (cf. Corollary 1). In\nfact, DSTUMP can be considered a nonlinear version of SIS [4], itself a simplified form of the LASSO\nwhen one ignores correlation among features (cf. Section 3 for more details).\nThe Rodeo framework [7] performs automatic bandwidth selection and variable selection for local\nlinear smoothers, and is tailored to a more general nonparametric model with arbitrary interactions.\nIt was shown to possess model selection consistency in high dimensions; however, the results are\nasymptotic and focus on achieving optimal prediction rate. In particular, there is no clear \u03b2min\nthreshold as a function of n, s, and p. RODEO is also computationally burdensome for even modest-\nsized problems (we thus omit it from our experimental results in Section 4).\nAmong the nonlinear methods, SPAM is perhaps the most well-understood in terms of model\nselection properties. Under a general high-dimensional sparse additive model, SPAM possesses the\nsparsistency property (a term for model selection consistency); the analysis is reduced to a linear\nsetting by considering expansions in basis functions, and selection consistency is proved under an\nirrepresentable condition on the coefficients in those bases. We show that DSTUMP is model selection\nconsistent in the sparse additive model with uncorrelated design. Compared to SPAM results, our\nconditions are stated directly in terms of underlying functions and are not tied to a particular basis;\nhence our proof technique is quite different. 
There is no implicit reduction to a linear setting via basis\nexpansions. Empirically, we show that DSTUMP indeed succeeds in the settings our theory predicts.\n\n3 Selection consistency\n\nThe general model selection problem for non-parametric regression can be stated as follows: we\nobserve noisy samples $y_i = f(x_{i1}, \\dots, x_{ip}) + w_i$, $i = 1, \\dots, n$, where $\\{w_i\\}$ is an i.i.d. noise\nsequence. Here, p is the total number of features (or covariates) and n is the total number of\nobservations (or the sample size). In general, f belongs to a class $\\mathcal{F}$ of functions from $\\mathbb{R}^p$ to $\\mathbb{R}$.\nOne further assumes that the functions in $\\mathcal{F}$ depend on at most s of the features, usually with s \u226a p.\nThat is, for every $f \\in \\mathcal{F}$, there is some $f_0 : \\mathbb{R}^s \\to \\mathbb{R}$ and a subset $S \\subset [p]$ with $|S| \\le s$ such\nthat $f(z_1, \\dots, z_p) = f_0(z_S)$ where $z_S = (z_i, i \\in S)$. The subset S, i.e., the set of active features,\nis unknown in advance and the goal of model selection is to recover it given $\\{(y_i, x_i)\\}_{i=1}^n$. The\nproblem is especially challenging in the high-dimensional setting where p \u226b n. We will consider\nvarious special cases of this general model when we analyze DSTUMP. For theoretical analysis it is\ncommon to assume s to be known and we will make this assumption throughout. In practice, one\noften considers s to be a tunable parameter that can be selected, e.g., via cross-validation or greedy\nforward selection.\nWe characterize the model selection performance of DSTUMP by establishing its sample complexity:\nthat is, the scaling of n, p, and s that is sufficient to guarantee that DSTUMP identifies the active\nset of features with probability converging to 1. Our general results, proved in the technical report,\nallow for a correlated design matrix and additive nonlinearities in the true regression function. 
Our\nresults for the linear case, derived as a special case of the general theory, allow us to compare the\nperformance of DSTUMP to the information theoretic limits for sample complexity established in\n[11], and to the performance of existing methods more tailored to this setting, such as the Lasso [12].\nGiven a generative model and the restriction of DSTUMP to using root-level impurity reduction, the\ngeneral thrust of our result is straightforward: impurity reduction due to active variables concentrates\nat a significantly higher level than that of inactive variables. However, there are significant technical\nchallenges in establishing this result, mainly deriving from the fact that the splitting procedure\nrenders the data in the child nodes non-i.i.d., and hence standard concentration inequalities do not\nimmediately apply. We leverage the fact that the DSTUMP procedure considers splits at the median\nof a predictor. Given this median point, the data in each child node is i.i.d., and hence we can\napply standard concentration inequalities in this conditional distribution. Removing this conditioning\npresents an additional technical subtlety. For ease of exposition, we first present our results for the\nlinear setting in Section 3.1, and subsequently summarize our general results in Section 3.2. We\nprovide a proof of our result in the linear setting in Section 3.3, and defer the proof of our general\nresult to the supplementary material.\n\nAlgorithm 1 DSTUMP\ninput $\\{x_k \\in \\mathbb{R}^n\\}_{k=1}^{p}$, $y \\in \\mathbb{R}^n$, # top features s\nm = n/2\nfor k = 1, . . . , p do\n  $I(x_k)$ = SortFeatureValues($x_k$)\n  $y^k$ = SortLabelByFeature($y$, $I(x_k)$)\n  $y^k_{[m]}, y^k_{[n]\\setminus[m]}$ = SplitAtMidpoint($y^k$)\n  $i_k$ = ComputeImpurity($y^k_{[m]}, y^k_{[n]\\setminus[m]}$)\nend for\nS = FindTopImpurityReductions($\\{i_k\\}$, s)\noutput top s features sorted by impurity reduction\n\nThe DSTUMP algorithm. In order to describe DSTUMP more precisely, let us introduce\nsome notation. We write $[n] := \\{1, \\dots, n\\}$. Throughout, $y = (y_i, i \\in [n]) \\in \\mathbb{R}^n$ will be the\nresponse vector observed for a sample of size n. For an ordered index set $I = (i_1, i_2, \\dots, i_r)$,\nwe set $y_I = (y_{i_1}, y_{i_2}, \\dots, y_{i_r})$. A similar notation is used for unordered index sets. We write\n$x_j = (x_{1j}, x_{2j}, \\dots, x_{nj}) \\in \\mathbb{R}^n$ for the vector collecting values of the jth feature; $x_j$ forms the\njth column of the design matrix $X \\in \\mathbb{R}^{n \\times p}$. Let $I(x_j) := (i_1, i_2, \\dots, i_n)$ be an ordering of\n$[n]$ such that $x_{i_1 j} \\le x_{i_2 j} \\le \\cdots \\le x_{i_n j}$ and let $\\mathrm{sor}(y, x_j) := y_{I(x_j)} \\in \\mathbb{R}^n$; this is an operator\nthat sorts y relative to $x_j$. DSTUMP proceeds as follows: Evaluate\n$y^k := \\mathrm{sor}(y, x_k) = \\mathrm{sor}\\big(\\sum_{j \\in S} \\beta_j x_j + w,\\; x_k\\big)$, for k = 1, . . . , p. Let m := n/2.\nFor each k, consider the midpoint split of $y^k$ into $y^k_{[m]}$ and $y^k_{[n]\\setminus[m]}$ and evaluate the impurity of the\nleft-half, using empirical variance as impurity:\n\n$\\mathrm{imp}(y^k_{[m]}) := \\binom{m}{2}^{-1} \\sum_{1 \\le i < j \\le m} \\frac{1}{2} (y^k_i - y^k_j)^2. \\qquad (1)$\n\nLet $\\mathrm{imp}(y^k_{[m]})$ be the score of feature k, and output the s features with the smallest scores (corresponding\nto maximal reduction in impurity). If the generative model is linear, the choice of the\nmidpoint is justified by our assumption of the uniform distribution for the features ($Z_i$), and we\nfurther show that this simple choice is effective even under a nonlinear model. 
The choice of the\nleft-half in our analysis is for convenience; a similar analysis applies if we take the impurity to be that\nof the sum of both halves (or their maximum). DSTUMP is summarized in Algorithm 1. Impurity\nreduction $\\mathrm{imp}(y_{[m]}) - \\mathrm{imp}(y^k_{[m]})$ can be considered a form of nonlinear correlation between y and\nfeature $x_k$. The SIS algorithm is equivalent to replacing this nonlinear correlation with the (absolute)\nlinear correlation $|\\frac{1}{n} x_k^T y|$. That is, both procedures assign a score to each feature by considering it\nagainst the response separately, ignoring other features. In the uncorrelated (i.e. orthogonal design)\nsetting, this is more or less optimal, and as is the case with SIS, we show that DSTUMP also retains\nsome model selection performance even under correlated designs. In contrast to SIS, we show that\nDSTUMP also enjoys performance guarantees in non-linear settings.\n\nThe models. We present our consistency results for models of various complexity. We start with the\nwell-known and extensively studied setting of a linear model with IID design. This basic setup serves\nas the benchmark for comparison of model selection procedures. As will become clear in the course\nof the proof, analyzing DSTUMP (or impurity-based feature selection in general) is challenging even\nin this case, in contrast to linear model based approaches such as SIS or Lasso. Once we have a good\nunderstanding of DSTUMP under the baseline model, we extend the analysis to correlated design and\nnonlinear additive models. The structure of our proof is also most clearly seen in this simple case, as\noutlined in Section 3.3. We now introduce our general models:\nModel 1 (Sparse linear model with ICA-type design). A linear regression model $y = X\\beta + w$ with\nICA-type (random) design $X \\in \\mathbb{R}^{n \\times p}$ has the following properties: (i) $X = \\tilde{X} M$ where $\\tilde{X} \\in \\mathbb{R}^{n \\times p}$\nand each row of $\\tilde{X}$ is an independent draw from a (column) vector $Z = (Z_1, \\dots, Z_p)$ with IID\nentries drawn uniformly from [0, 1]. (ii) The noise vector $w = (w_1, \\dots, w_n)$ has IID sub-Gaussian\nentries with variance $\\mathrm{var}(w_i) = v_w^2$ and sub-Gaussian norm $\\|w_i\\|_{\\psi_2} \\le \\sigma_w$. (iii) The\n$\\beta \\in \\mathbb{R}^p$ is s-sparse, namely, $\\beta_j \\ne 0$ for $j \\in S = \\{1, \\dots, s\\}$ and zero otherwise.\nModel 1 serves both the correlated and uncorrelated design cases. Each row of the design matrix\nX is a draw from the vector $M^T Z$, which has covariance $c\\, M^T M$ for some constant c. Thus, the\nchoice of M = I leads to an uncorrelated design. The choice of the interval [0, 1] for covariates\nis for convenience; it can be replaced with any other compact interval, in the linear setting, since\nvariance impurity is invariant to a shift. Similarly the choice of the (active) support indices, S, is for\nconvenience. For simplicity, we often assume $v_w^2 = \\sigma^2$ and $\\sigma_w \\le C\\sigma$ (only $\\sigma_w$ would affect the\nresults, as an examination of our proofs shows).\nModel 2 (Sparse additive model with uncorrelated design). An additive regression model\n$y_i = \\sum_{j=1}^{p} f_j(x_{ij}) + w_i$ is one with random design $X = (x_{ij})$ and the noise $(w_i)$ as in Model 1, with\nM = I (uncorrelated design). We assume $(f_k)$ to be s-sparse, namely, $f_j \\ne 0$ for $j \\in S = \\{1, \\dots, s\\}$\nand zero otherwise.\n\n3.1 Linear Setting\n\nUncorrelated design. Our baseline result is the following feature selection consistency guarantee\nfor DSTUMP, for the case M = I of Model 1. Throughout, we let $\\check{p} := p - s$, and $C, C_1, \\dots, c, c_1, \\dots$\nare absolute positive constants which can be different in each occurrence. For any vector x, let\n$|x|_{\\min} := \\min_i |x_i|$, the minimum absolute value of its entries. The quantity $|\\beta_S|_{\\min}^2 = \\min_{k \\in S} \\beta_k^2$\nappearing in Theorem 1 is a well-known parameter controlling hardness of subset recovery. 
All our\nresults are stated in terms of constants \u03b4, \u03b1 and \u03be that are related as:\n\n$\\delta \\in (0, 1/8), \\quad \\alpha = \\log(1/(8\\delta)), \\quad \\xi = 1 - (1 - \\delta)^2. \\qquad (2)$\n\nTheorem 1. Assume Model 1 with M = I, and (2). The DSTUMP algorithm, which selects\nthe \u201cs\u201d least impure features at the root, succeeds in feature selection, with probability at least\n$1 - \\check{p}^{-c} - 2e^{-\\alpha n/2}$ if $\\log \\check{p}/n \\le C_1$ and\n\n$|\\beta_S|_{\\min}^2 \\ge \\frac{C}{\\xi} (\\|\\beta\\|_2^2 + \\sigma^2) \\sqrt{\\frac{\\log \\check{p}}{n}}. \\qquad (3)$\n\nThe result can be read by setting, e.g., \u03b4 = 1/16 leading to numerical constants for \u03b1 and \u03be. The\ncurrent form allows the flexibility to trade off the constant (\u03b1) in the probability bound with the\nconstant (\u03be) in the gap condition (3). Although Theorem 1 applies to a general \u03b2, it is worthwhile to\nsee its consequence in a special regime of interest where $|\\beta_S|_{\\min}^2 \\asymp 1/s$, corresponding to $\\|\\beta\\|_2 \\asymp 1$.\nWe get the following immediate corollary:\nCorollary 1. Assume $|\\beta_S|_{\\min}^2 \\asymp 1/s$, $\\sigma^2 \\asymp 1$ and $\\log \\check{p}/n = O(1)$. Then DSTUMP succeeds with\nhigh probability if $n \\gtrsim s^2 \\log \\check{p}$.\n\nThe minimax optimal threshold for support recovery in the regime of Corollary 1 is known to be\n$n \\asymp s \\log \\check{p}$ [11], and achieved by LASSO [12]. Although this result is obtained for Gaussian design,\nthe same argument goes through for our uniform ensemble. Compared to the optimal threshold, using\nDSTUMP we pay a small factor of s in the sample complexity. However, DSTUMP is not tied to the\nlinear model and as we discuss in Section 3.2, we can generalize the performance of DSTUMP to\nnonlinear settings.\n\nCorrelated design. We take the following approach to generalize our result to the correlated case:\n(1) We show a version of Theorem 1, which holds for an \u201capproximately sparse\u201d parameter $\\tilde{\\beta}$ with\nuncorrelated design. 
(2) We derive conditions on M such that the correlated case can be turned into\nthe uncorrelated case with approximate sparsity. The following theorem details Step 1:\n\nTheorem 2. Assume Model 1(i)-(ii) with M = I, but instead of (iii) let $\\beta = \\tilde{\\beta}$, a general vector in\n$\\mathbb{R}^p$. Let S be any subset of [p] of cardinality s. The DSTUMP algorithm, which selects the \u201cs\u201d least\nimpure features at the root, recovers S, with probability at least $1 - \\check{p}^{-c} - 2e^{-\\alpha n/2}$ if $\\log \\check{p}/n \\le C_1$\nand $\\xi |\\tilde{\\beta}_S|_{\\min}^2 - \\|\\tilde{\\beta}_{S^c}\\|_\\infty^2 > C (\\|\\tilde{\\beta}\\|_2^2 + \\sigma^2) \\sqrt{(\\log \\check{p})/n}$.\n\nThe theorem holds for any $\\tilde{\\beta}$ and S, but the gap condition required is likely to be violated unless $\\tilde{\\beta}$\nis approximately sparse w.r.t. S. Going back to Model 1, we see that setting $\\tilde{\\beta} = M\\beta$ transforms\nthe model with correlated design X, and exact sparsity on \u03b2, to the model with uncorrelated design\n$\\tilde{X}$, and approximate sparsity on $\\tilde{\\beta}$. The following corollary gives sufficient conditions on M, so\nthat Theorem 2 is applicable. Recall the usual (vector) $\\ell_\\infty$ norm, $\\|x\\|_\\infty = \\max_i |x_i|$, the matrix\n$\\ell_\\infty \\to \\ell_\\infty$ operator norm $|||A|||_\\infty = \\max_i \\sum_j |A_{ij}|$, and the $\\ell_2 \\to \\ell_2$ operator norm $|||A|||_2$.\nCorollary 2. Consider a general ICA-type Model 1 with \u03b2 and M satisfying\n\n$\\|\\beta_S\\|_\\infty \\le \\gamma |\\beta_S|_{\\min}, \\quad |||M_{SS} - I|||_\\infty \\le \\frac{1 - \\rho}{\\gamma}, \\quad |||M_{S^c S}|||_\\infty \\le \\frac{\\rho}{\\gamma} \\sqrt{\\xi (1 - \\kappa)}, \\qquad (4)$\n\nfor some \u03c1, \u03ba \u2208 (0, 1] and \u03b3 \u2265 1. Then, the conclusion of Theorem 1 holds, for DSTUMP applied to\ninput $(y, \\tilde{X})$, under the gap condition (3) with $C/\\xi$ replaced with $C |||M_{SS}|||_2^2 / (\\kappa\\, \\xi\\, \\rho^2)$.\n\nAccess to decorrelated features, $\\tilde{X}$, is reasonable in cases where one can perform consistent ICA.\nThis assumption is practically plausible, especially in the low-dimensional regimes, though it would\nbe desirable if this assumption can be removed theoretically. 
Moreover, we note that the response y\nis based on the correlated features.\nIn this result, $C |||M_{SS}|||_2^2 / (\\kappa\\, \\xi\\, \\rho^2)$ plays the role of a new constant. There is a hard bound on how big\n\u03be can be, which via (4) controls how much correlation between off-support and on-support features\nis tolerated. For example, taking \u03b4 = 1/9, we have \u03b1 = log(9/8) \u2248 0.1, \u03be = 17/81 \u2248 0.2 and\n$\\sqrt{\\xi} \\approx 0.45$ and this is about as big as it can get (the maximum we can allow is \u2248 0.48). \u03ba can be\narbitrarily close to 0, relaxing the assumption (4), at the expense of increasing the constant in the\nthreshold. \u03b3 controls deviation of $|\\beta_j|$, $j \\in S$ from uniform: in case of equal weights on the support,\ni.e., $|\\beta_j| = 1/\\sqrt{s}$ for $j \\in S$, we have \u03b3 = 1. Theorem 1 for the uncorrelated design is recovered, by\ntaking \u03c1 = \u03ba = 1.\n\n3.2 General Additive Model Setting\nTo prove results in this more general setting, we need some further regularity conditions on $(f_k)$: Fix\nsome \u03b4 \u2208 (0, 1), let U \u223c unif(0, 1) and assume the following about the underlying functions $(f_k)$:\n(F1) $\\|f_k(\\alpha U)\\|_{\\psi_2}^2 \\le \\sigma_{f,k}^2$, $\\forall \\alpha \\in [0, 1]$. (F2) $\\mathrm{var}[f_k(\\alpha U)] \\le \\mathrm{var}[f_k((1 - \\delta) U)]$, $\\forall \\alpha \\le 1 - \\delta$.\nNext, we define $\\sigma_{f,*}^2 := \\sum_{k=1}^{p} \\sigma_{f,k}^2 = \\sum_{k \\in S} \\sigma_{f,k}^2$ along with the following key gap quantities:\n\n$g_{f,k}(\\delta) := \\mathrm{var}[f_k(U)] - \\mathrm{var}[f_k((1 - \\delta) U)].$\n\nTheorem 3. Assume additive Model 2 with (F1) and (F2). 
Let $\\alpha = \\log \\frac{1}{8\\delta}$ for \u03b4 \u2208 (0, 1/8). The\nDSTUMP algorithm, which selects the \u201cs\u201d least impure features at the root, succeeds in model\nselection, with probability at least $1 - \\check{p}^{-c} - 2e^{-\\alpha n/2}$ if $\\log \\check{p}/n \\le C_1$ and\n\n$\\min_{k \\in S} g_{f,k}(\\delta) \\ge C (\\sigma_{f,*}^2 + \\sigma^2) \\sqrt{\\frac{\\log \\check{p}}{n}}. \\qquad (5)$\n\nIn the supplementary material, we explore in detail the class of functions that satisfy conditions (F1)\nand (F2), as well as the gap condition in (5). (F1) is relatively mild and satisfied if f is Lipschitz\nor bounded. (F2) is more stringent and we show that it is satisfied for convex nondecreasing and\nconcave nonincreasing functions.2 The gap condition is less restrictive than (F2) and is related to the\nslope of the function near the endpoint, i.e., x = 1. Notably, we study one such function that satisfies\nall of these conditions, i.e., exp(\u00b7) on [\u22121, 1], in our simulations in Section 4.\n\n2We also observe that this condition holds for functions beyond these two categories.\n\n3.3 Proof of Theorem 1\n\nWe provide the high-level proof of Theorem 1. For brevity, the proofs of the lemmas have been\nomitted and can be found in the supplement, where we in fact prove them for the more general setup\nof Theorem 3. The analysis boils down to understanding the behavior of $y^k = \\mathrm{sor}(y, x_k)$ as defined\nearlier. Let $\\tilde{y}^k$ be obtained from $y^k$ by random reshuffling of its left half $y^k_{[m]}$ (i.e., rearranging the\nentries according to a random permutation). This reshuffling has no effect on the impurity, that is,\n$\\mathrm{imp}(\\tilde{y}^k_{[m]}) = \\mathrm{imp}(y^k_{[m]})$, and the reason for it becomes clear when we analyze the case $k \\in S$.\n\nUnderstanding the distribution of $y^k$. If $k \\notin S$, the ordering according to which we sort y is\nindependent of y (since $x_k$ is independent of y), hence the sorted version, before and after reshuffling,\nhas the same distribution as y. Thus, each entry of $\\tilde{y}^k$ is an IID draw from the same distribution as\nthe pre-sort version:\n\n$\\tilde{y}^k_i \\overset{iid}{\\sim} W_0 := \\sum_{j \\in S} \\beta_j Z_j + w_1, \\quad i = 1, \\dots, n. \\qquad (6)$\n\nOn the other hand, if $k \\in S$, then for i = 1, . . . , n\n\n$y^k_i = \\beta_k x_{(i)k} + r^k_i$, where $r^k_i \\overset{iid}{\\sim} W_k := \\sum_{j \\in S \\setminus \\{k\\}} \\beta_j Z_j + w_1.$\n\nHere $x_{(i)k}$ is the ith order statistic of $x_k$, that is, $x_{(1)k} \\le x_{(2)k} \\le \\cdots \\le x_{(n)k}$. Note that the residual\nterms are still IID since they gather the covariates (and the noise) that are independent of the kth one\nand hence its ordering. Note also that $r^k_i$ is independent of the first term $\\beta_k x_{(i)k}$.\n\nRecall that we split at the midpoint and focus on the left split, i.e., we look at\n$y^k_{[n/2]} = (y^k_1, y^k_2, \\dots, y^k_{n/2})$, and its reshuffled version $\\tilde{y}^k_{[n/2]} = (\\tilde{y}^k_1, \\tilde{y}^k_2, \\dots, \\tilde{y}^k_{n/2})$. Intuitively, we would\nlike to claim that the \u201csignal part\u201d of the $\\tilde{y}^k_{[n/2]}$ are approximately IID draws from $\\beta_k \\mathrm{Unif}(0, 1/2)$.\nUnfortunately this is not true, in the sense that the distribution cannot be accurately approximated by\nUnif(0, 1 \u2212 \u03b4) for any \u03b4 (Lemma 1). However, we show that the distribution can be approximated by\nan infinite mixture of IID uniforms of reduced range (Lemma 2).\nLet $U_{(1)} \\le U_{(2)} \\le \\cdots \\le U_{(n)}$ be the order statistics obtained by ordering an IID sample $U_i \\sim \\mathrm{Unif}(0, 1)$,\ni = 1, . . . , n. Recall that m := n/2 and let $\\tilde{U} := (\\tilde{U}_1, \\tilde{U}_2, \\dots, \\tilde{U}_m)$ be obtained from\n$(U_{(1)}, \\dots, U_{(m)})$ by random permutation. Then, $\\tilde{U}$ has an exchangeable distribution. 
We can write\n\n$\\tilde{y}^k_i = \\beta_k \\tilde{u}^k_i + \\tilde{r}^k_i, \\quad \\tilde{u}^k \\sim \\tilde{U}, \\quad \\tilde{r}^k_i \\overset{iid}{\\sim} W_k, \\quad i \\in [m], \\quad$ for $k \\in S$,\n\nwhere the m-vectors $\\tilde{u}^k = (\\tilde{u}^k_i, i \\in [m])$ and $\\tilde{r}^k = (\\tilde{r}^k_i, i \\in [m])$ are also independent.\nWe have the following result regarding the distribution of $\\tilde{U}$:\nLemma 1. The distribution of $\\tilde{U}$ is a mixture of IID unif(0, \u03b3) m-vectors with mixing variable\n\u03b3 \u223c Beta(m, m + 1).\nNote that Beta(m, m + 1) has mean = m/(2m + 1) = (1 + o(1))/2 as m \u2192 \u221e, and variance\n$= O(m^{-1})$. Thus, Lemma 1 makes our intuition precise in the sense that the distribution of $\\tilde{U}$\nis a \u201crange mixture\u201d of IID uniform distributions, with the range concentrating around 1/2. We\nnow provide a reduced range, finite sample approximation in terms of the total variation distance\n$d_{TV}(\\tilde{U}, \\hat{U})$ between the distributions of random vectors $\\tilde{U}$ and $\\hat{U}$.\nLemma 2. Let $\\hat{U}$ be distributed according to a mixture of IID $\\mathrm{Unif}(0, \\hat{\\gamma})$ m-vectors with $\\hat{\\gamma}$ distributed\nas a Beta(m, m + 1) truncated to (0, 1 \u2212 \u03b4) for $\\delta = e^{-\\alpha}/8$ and \u03b1 > 0. With $\\tilde{U}$ as in Lemma 1, we\nhave $d_{TV}(\\tilde{U}, \\hat{U}) \\le 2 \\exp(-\\alpha m)$.\nThe approximation of the distribution of the $\\tilde{U}$ by a truncated version, $\\hat{U}$, is an essential technique in\nour proof. As will become clear in the proof of Lemma 3, we will need to condition on the mixing\nvariable $\\tilde{U}$, or its truncated approximation $\\hat{U}$, to allow for the use of concentration inequalities for\nindependent variables. The resulting bounds should be devoid of randomness so that by taking\nexpectation, we can get similar bounds for the exchangeable case. 
The truncation allows us to\nmaintain a positive gap in impurities (between on and off support features) throughout this process.\nWe expect the loss due to truncation to be minimal, only impacting the constants.\n\nFor $k \\in S$, let $\\hat{u}^k = (\\hat{u}^k_i, i \\in [m])$ be drawn from the distribution of $\\hat{U}$ described in Lemma 2,\nindependently of anything else in the model, and let $\\hat{\\gamma}_k$ be its corresponding mixing variable, which\nhas a Beta distribution truncated to (0, 1 \u2212 \u03b4). Let us define $\\hat{y}^k_i = \\beta_k \\hat{u}^k_i + \\tilde{r}^k_i$, $i \\in [m]$ where\n$\\tilde{r}^k = (\\tilde{r}^k_i)$ is as before. This construction provides a simple coupling between $\\tilde{y}^k_{[m]}$ and $\\hat{y}^k_{[m]}$ giving\nthe same bound on their total variation distance. Hence, we can safely work with $\\hat{y}^k_{[m]}$ instead\nof $\\tilde{y}^k_{[m]}$, and pay a price of at most $2 \\exp(-\\alpha m)$ in probability bounds. To simplify discussion, let\n$\\hat{y}^k_i = \\tilde{y}^k_i$ for $k \\notin S$.\nConcentration of empirical impurity. We will focus on $\\hat{y}^k_{[m]}$ due to the discussion above. We would\nlike to control $\\mathrm{imp}(\\hat{y}^k_{[m]})$, the empirical variance impurity of $\\hat{y}^k_{[m]}$ which is defined as in (1) with $y^k_{[m]}$\nreplaced with $\\hat{y}^k_{[m]}$. The idea is to analyze $E[\\mathrm{imp}(\\hat{y}^k_{[m]})]$, or proper bounds on it, and then show that\n$\\mathrm{imp}(\\hat{y}^k_{[m]})$ concentrates around its mean. Let us consider the concentration first. Equation (1) is a U-statistic of\norder 2 with kernel $h(u, v) = \\frac{1}{2}(u - v)^2$. The classical Hoeffding inequality guarantees concentration\nif h is uniformly bounded and the underlying variables are IID. Instead, we use a version of the Hanson\u2013Wright\nconcentration inequality derived in [10], which allows us to derive a concentration bound for\nthe empirical variance, for general sub-Gaussian vectors, avoiding the boundedness assumption:\nCorollary 3. Let $w = (w_1, \\dots, w_m) \\in \\mathbb{R}^m$ be a random vector with independent components $w_i$\nwhich satisfy $E w_i = \\mu$ and $\\|w_i - \\mu\\|_{\\psi_2} \\le K$. 
Let $\\mathrm{imp}(w) := \\binom{m}{2}^{-1} \\sum_{1 \\le i < j \\le m} (w_i - w_j)^2$ be the\nempirical variance of w. Then, for u \u2265 0,\n\n$P\\big( |\\mathrm{imp}(w) - E\\, \\mathrm{imp}(w)| > K^2 u \\big) \\le 2 \\exp\\{ -c\\, (m - 1) \\min(u, u^2) \\}. \\qquad (7)$\n\nWe can immediately apply this result when $k \\notin S$. However, for $k \\in S$, a more careful application is\nneeded since we can only guarantee an exchangeable distribution for $\\hat{y}^k_{[m]}$ in this case. The following\nlemma summarizes the conclusions:\n\nLemma 3. Let $\\hat{I}_{m,k} = \\mathrm{imp}(\\hat{y}^k_{[m]})$ and recall that \u03b4 was introduced in the definition of $\\hat{y}^k_i$. Let\n$\\kappa_1^2 := \\frac{1}{12}$ be the variance of Unif(0, 1). Recall that $\\check{p} := p - s$. Let $L = \\|\\beta\\|_2$. There exist absolute\nconstants $C_1, C_2, c$ such that if $\\log \\check{p}/m \\le C_1$, then with probability at least $1 - \\check{p}^{-c}$,\n\n$\\hat{I}_{m,k} \\le I^1_k + \\varepsilon_m, \\ \\forall k \\in S, \\quad$ and $\\quad \\hat{I}_{m,k} \\ge I^0 - \\varepsilon_m, \\ \\forall k \\notin S,$\n\nwhere, letting $\\xi := 1 - (1 - \\delta)^2$,\n\n$I^1_k := \\kappa_1^2 (-\\xi \\beta_k^2 + L^2) + \\sigma^2, \\quad I^0 := \\kappa_1^2 L^2 + \\sigma^2, \\quad$ and $\\quad \\varepsilon_m := C_2 (L^2 + \\sigma^2) \\sqrt{\\log \\check{p}/m}.$\n\nThe key outcome of Lemma 3 is that, on average, there is a positive gap $I^0 - I^1_k = \\kappa_1^2 \\xi \\beta_k^2$ in\nimpurities between a feature on the support and those off of it, and that due to concentration, the\nfluctuations in impurities will be less than this gap for large m. Combined with Lemma 2, we can\ntransfer the results to $\\tilde{I}_{m,k} := \\mathrm{imp}(\\tilde{y}^k_{[m]})$.\nCorollary 4. The conclusion of Lemma 3 holds for $\\tilde{I}_{m,k}$ in place of $\\hat{I}_{m,k}$, with probability at least\n$1 - \\check{p}^{-c} - 2e^{-\\alpha m}$ for $\\alpha = \\log \\frac{1}{8\\delta}$.\nNote that for \u03b4 < 1/8, the bound holds with high probability. Thus, as long as $I^0 - I^1_k > 2\\varepsilon_m$, the\nselection algorithm correctly favors the kth feature in S, over the inactive ones (recall that lower\nimpurity is better). 
We have our main result after substituting n/2 for m.\n\n4 Simulations\n\nFigure 1: Support recovery performance in a linear regression model augmented with possible\nnonlinearities for n = 1024. (a) Linear case with uncorrelated design. (b) Linear case with correlated\ndesign. (c) Nonlinear additive model with exponentials of covariates and uncorrelated design. (d)\nNonlinear model with interaction terms and uncorrelated design. (e) Nonlinear additive model with\nexponentials of covariates, interaction terms, and uncorrelated design. (f) Nonlinear additive model\nwith exponentials of covariates, interaction terms, and correlated design.\n\nIn order to corroborate the theoretical analysis, we next present various simulation results. We\nconsider the following model: $y = X\\beta + f(X_S) + w$, where $f(X_S)$ is a potential nonlinearity, and\nS is the true support of \u03b2. We generate the training data as $X = \\tilde{X} M$ where $\\tilde{X} \\in \\mathbb{R}^{n \\times p}$ is a random\nmatrix with IID Unif(\u22121, 1) entries, and $M \\in \\mathbb{R}^{p \\times p}$ is an upper-triangular matrix that determines\nwhether the design is IID or correlated. In the IID case we set M = I. To achieve a correlated design\nwe randomly assign values from {0, \u2212\u03c1, +\u03c1} to the upper triangular cells of M, with probabilities\n(1 \u2212 2\u03b1, \u03b1, \u03b1). We observed qualitatively similar results for various values of \u03c1 and \u03b1 and here we\npresent results with \u03b1 = 0.04, and \u03c1 = 0.1. The noise is generated as $w \\sim N(0, \\sigma^2 I_n)$. We fix\np = 200, \u03c3 = 0.1, and let $\\beta_i = \\pm 1/\\sqrt{s}$ over its support $i \\in S$, where |S| = s. That is, only s of the\np = 200 variables are predictive of the response. The nonlinearity, f, optionally contains additive\nterms in the form of exponentials of on-support covariates. 
It can also contain interaction terms across on-support covariates, i.e., terms of the form $\\frac{2}{\\sqrt{s}} x_i x_j$ for some randomly selected pairs of $i, j \\in S$. Notably, the choice of $f$ is unknown to the variable selection methods. We vary $s \\in [5, 100]$ and note that $\\|\\beta\\|_2 = 1$ remains fixed.\n\nThe plots in Figure 1 show the fraction of the true support recovered$^3$ as a function of $s$, for various methods under different modeling setups: $f = 0$ (linear), $f = 2\\exp(\\cdot)$ (additive), $f = \\text{interaction}$ (interactions), and $f = \\text{interaction} + 2\\exp(\\cdot)$ (interactions+additive) with IID or correlated designs. Each data point is an average over 100 trials (see supplementary material for results with 95% confidence intervals). In addition to DSTUMP, we evaluate TREEWEIGHT, SPAM, LASSO, SIS and random guessing for comparison. SIS refers to picking the indices of the $s$ largest values of $X^T y$ in absolute value. When $X$ is orthogonal and the generative model is linear, this approach is optimal, and we use it as a surrogate for the optimal approach in our nearly orthogonal setup (i.e., the IID linear case), due to its lack of any tuning parameters. Random guessing is used as a benchmark, and as expected, on average recovers the fraction $s/p = s/200$ of the support.\n\nThe plots show that, in the linear setting, the performance of DSTUMP is comparable to, and only slightly worse than, that of SIS or LASSO, which are considered optimal in this case. Figure 1(b) shows that under mildly correlated design the gap between DSTUMP and LASSO widens. In this case, SIS loses its optimality and performs at the same level as DSTUMP. This matches our intuition, as SIS and DSTUMP are both greedy methods that consider covariates independently.\n\nDSTUMP is more robust to nonlinearities, as characterized theoretically in Theorem 3 and evidenced in Figure 1(c). 
In contrast, in the presence of exponential nonlinearities, SIS and LASSO are effective in the very sparse regime of $s \\ll p$, but quickly approach random guessing as $s$ grows. In the presence of interaction terms, TREEWEIGHT and to a lesser extent SPAM outperform all other methods, as shown in Figures 1(d), 1(e), and 1(f). We also note that the permutation-based importance method [1], denoted by TREEWEIGHTPERMUTATION in the plots in Figure 1, performs substantially worse than TREEWEIGHT across the various modeling settings.\n\nOverall, these simulations illustrate the promise of multi-level tree-based methods like TREEWEIGHT under more challenging and realistic modeling settings. Future work involves generalizing our theoretical analyses to extend to these more complex multi-level tree-based approaches.\n\n5 Discussion\n\nWe presented a simple model selection algorithm for decision trees, which we called DSTUMP, and analyzed its finite-sample performance in a variety of settings, including the high-dimensional, nonlinear additive model setting. Our theoretical and experimental results show that even a simple tree-based algorithm that selects at the root can achieve high-dimensional selection consistency. We hope these results pave the way for the finite-sample analysis of more refined tree-based model selection procedures. Inspired by the empirical success of TREEWEIGHT in nonlinear settings, we are actively looking at extensions of DSTUMP to a multi-stage algorithm capable of handling interactions with high-dimensional guarantees.\n\nMoreover, while we mainly focused on the regression problem, our proof technique based on concentration of impurity reductions is quite general. We expect analogous results to hold, for example, for classification. However, aspects of the proof would be different, since impurity measures used for classification are different from those used for regression. 
One major hurdle involves deriving concentration inequalities for the empirical versions of these measures, which are currently unavailable, and would be of independent interest.\n\n$^3$In the supplementary material we report analogous results using a more stringent performance metric, namely the probability of exact support recovery. The results are qualitatively similar.\n\nReferences\n\n[1] L. Breiman. Random forests. Machine Learning, 45(1):5\u201332, 2001.\n\n[2] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen. Classification and regression trees. CRC Press, 1984.\n\n[3] J. Fan and J. Lv. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B, 70(5), 2008.\n\n[4] J. Fan and J. Lv. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849\u2013911, 2008.\n\n[5] J. H. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189\u20131232, 2001.\n\n[6] H. Ishwaran. Variable importance in binary regression trees and forests. Electronic Journal of Statistics, 1:519\u2013537, 2007.\n\n[7] J. Lafferty and L. Wasserman. Rodeo: Sparse, greedy nonparametric regression. Annals of Statistics, 36(1):28\u201363, 2008.\n\n[8] G. Louppe, L. Wehenkel, A. Sutera, and P. Geurts. Understanding variable importances in forests of randomized trees. In Advances in Neural Information Processing Systems 26. 2013.\n\n[9] P. Ravikumar, J. Lafferty, H. Liu, and L. Wasserman. Sparse additive models. Journal of the Royal Statistical Society: Series B, 71(5):1009\u20131030, 2009.\n\n[10] M. Rudelson and R. Vershynin. Hanson-Wright inequality and sub-gaussian concentration. Electron. Commun. Probab., pages 1\u201310, 2013.\n\n[11] M. J. Wainwright. Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting. IEEE Transactions on Information Theory, 55(12):5728\u20135741, 2009.\n\n[12] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using \u21131-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183\u20132202, 2009.\n", "award": [], "sourceid": 322, "authors": [{"given_name": "Jalil", "family_name": "Kazemitabar", "institution": "University of California, Los Angeles"}, {"given_name": "Arash", "family_name": "Amini", "institution": "UCLA"}, {"given_name": "Adam", "family_name": "Bloniarz", "institution": "Google"}, {"given_name": "Ameet", "family_name": "Talwalkar", "institution": "CMU"}]}