# Variable Importance using Decision Trees

S. Jalil Kazemitabar (UCLA, sjalilk@ucla.edu), Arash A. Amini (UCLA, aaamini@ucla.edu), Adam Bloniarz (UC Berkeley, adam@stat.berkeley.edu), Ameet Talwalkar (CMU, talwalkar@cmu.edu). Now at Google.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Decision trees and random forests are well-established models that not only offer good predictive performance, but also provide rich feature importance information. While practitioners often employ variable importance methods that rely on this impurity-based information, these methods remain poorly characterized from a theoretical perspective. We provide novel insights into the performance of these methods by deriving finite-sample performance guarantees in a high-dimensional setting under various modeling assumptions. We further demonstrate the effectiveness of these impurity-based methods via an extensive set of simulations.

## 1 Introduction

Known for their accuracy and robustness, decision trees and random forests have long been a workhorse in machine learning [1]. In addition to their strong predictive accuracy, they are equipped with measures of variable importance that are widely used in applications where model interpretability is paramount. Importance scores are used for model selection: predictors with high-ranking scores may be chosen for further investigation, or for building a more parsimonious model.

One common approach naturally couples the model training process with feature selection [2, 5]. This approach, which we call TREEWEIGHT, calculates the feature importance score for a variable by summing the impurity reductions over all nodes in the tree where a split was made on that variable, with impurity reductions weighted to account for the size of the node. For ensembles, these quantities are averaged over the constituent trees. TREEWEIGHT is particularly attractive because it can be calculated without any additional computational expense beyond the standard training procedure. However, since the training procedure in random forests combines several complex ingredients (bagging, random selection of predictor subsets at nodes, line search for optimal impurity reduction, and recursive partitioning), theoretical investigation of TREEWEIGHT is extremely challenging.

We propose a new method called DSTUMP that is inspired by TREEWEIGHT but is more amenable to analysis. DSTUMP assigns variable importance as the impurity reduction at the root node of a single tree. In this work we characterize the finite-sample performance of DSTUMP under an additive regression model, which also yields novel results for variable selection under a linear model, both with correlated and uncorrelated design. We corroborate our theoretical analyses with extensive simulations in which we evaluate DSTUMP and TREEWEIGHT on the task of feature selection under various modeling assumptions. We also compare the performance of these techniques against established methods whose behaviors have been theoretically characterized, including the Lasso, SIS, and SpAM [12, 3, 9].

Our work provides the first finite-sample, high-dimensional analyses of tree-based variable selection techniques, which are commonly used in practice but lack theoretical grounding.
Although we focus on DSTUMP, which is a relatively simple tree-based variable selection approach, our novel proof techniques are highly non-trivial and suggest a path forward for studying more general multi-level tree-based techniques such as TREEWEIGHT. Moreover, our simulations demonstrate that such algorithmic generalizations exhibit impressive performance relative to competing methods under more realistic models, e.g., non-linear models with interaction terms and correlated design.

## 2 Related Work

Our analysis is distinct from existing work on variable importance measures of trees and forests in several ways. To our knowledge, ours is the first analysis to consider the high-dimensional setting, where the number of variables, $p$, and the size of the active set, $s$, grow with the sample size $n$, and potentially $p \gg n$. The closest related work is the analysis of [8], which considers a fixed set of variables in the limit of infinite data ($n = \infty$). Unlike DSTUMP's use of the root node only, [8] does consider importance scores derived from the full set of splits in a tree, as in TREEWEIGHT. However, they make crucial simplifying (and unrealistic) assumptions that are distinct from those of our analysis: (1) each variable is split on only once in any given path from the root to a leaf of the tree; (2) at each node a variable is picked uniformly at random among those not yet used at the parent nodes, i.e., the splits themselves are not driven by impurity reduction; and (3) all predictors are categorical, with splits being made on all possible levels of a variable, i.e., the number of child nodes equals the cardinality of the variable being split. Our analysis instead considers continuous-valued predictors, splits based on actual impurity reduction, and our results are non-asymptotic, i.e., they give high-probability bounds on impurity measures for active and inactive variables that hold in finite samples.

A second line of related work is motivated by a permutation-based importance method [1] for feature selection. In practice, this method is computationally expensive, as it determines variable importance by comparing the predictive accuracy of a forest before and after random permutation of a predictor. Additionally, due to the algorithmic complexity of the procedure, it is not immediately amenable to theoretical analysis, though the asymptotic properties of a simplified variant of the procedure have been studied in [6].

While our work is the first investigation of finite-sample model selection performance of tree-based regression methods, alternative methods performing both linear and nonparametric regression in high dimensions have been studied in the literature. In terms of model selection consistency, most of the attention has focused on the linear setting, whereas the nonparametric (nonlinear) setup has mostly been studied in terms of prediction consistency. Under a high-dimensional linear regression model, the Lasso has been extensively studied and shown to be minimax optimal for variable selection under appropriate regularity conditions, including uncorrelated design with a moderate $\beta_{\min}$ condition. Remarkably, while not tailored to the linear setting, we show that DSTUMP is nearly minimax optimal for variable selection in the same uncorrelated design setting (cf. Corollary 1). In fact, DSTUMP can be considered a nonlinear version of SIS [4], itself a simplified form of the Lasso when one ignores correlation among features (cf. Section 3 for more details).
The Rodeo framework [7] performs automatic bandwidth selection and variable selection for local linear smoothers, and is tailored to a more general nonparametric model with arbitrary interactions. It was shown to possess model selection consistency in high dimensions; however, the results are asymptotic and focus on achieving the optimal prediction rate. In particular, there is no clear $\beta_{\min}$ threshold as a function of $n$, $s$, and $p$. Rodeo is also computationally burdensome for even modest-sized problems (we thus omit it from our experimental results in Section 4). Among the nonlinear methods, SpAM is perhaps the most well-understood in terms of model selection properties. Under a general high-dimensional sparse additive model, SpAM possesses the sparsistency property (a term for model selection consistency); the analysis is reduced to a linear setting by considering expansions in basis functions, and selection consistency is proved under an irrepresentable condition on the coefficients in those bases. We show that DSTUMP is model selection consistent in the sparse additive model with uncorrelated design. Compared to the SpAM results, our conditions are stated directly in terms of the underlying functions and are not tied to a particular basis; hence our proof technique is quite different. There is no implicit reduction to a linear setting via basis expansions. Empirically, we show that DSTUMP indeed succeeds in the settings our theory predicts.

## 3 Selection consistency

The general model selection problem for non-parametric regression can be stated as follows: we observe noisy samples
$$y_i = f(x_{i1}, \dots, x_{ip}) + w_i, \qquad i = 1, \dots, n,$$
where $\{w_i\}$ is an i.i.d. noise sequence. Here, $p$ is the total number of features (or covariates) and $n$ is the total number of observations (or the sample size). In general, $f$ belongs to a class $\mathcal{F}$ of functions from $\mathbb{R}^p$ to $\mathbb{R}$. One further assumes that the functions in $\mathcal{F}$ depend on at most $s$ of the features, usually with $s \ll p$. That is, for every $f \in \mathcal{F}$, there is some $f_0 : \mathbb{R}^s \to \mathbb{R}$ and a subset $S \subset [p]$ with $|S| \le s$ such that $f(z_1, \dots, z_p) = f_0(z_S)$, where $z_S = (z_i,\, i \in S)$. The subset $S$, i.e., the set of active features, is unknown in advance, and the goal of model selection is to recover it given $\{(y_i, x_i)\}_{i=1}^n$. The problem is especially challenging in the high-dimensional setting where $p \gg n$. We will consider various special cases of this general model when we analyze DSTUMP. For theoretical analysis it is common to assume $s$ to be known, and we will make this assumption throughout. In practice, one often considers $s$ to be a tunable parameter that can be selected, e.g., via cross-validation or greedy forward selection.

We characterize the model selection performance of DSTUMP by establishing its sample complexity: that is, the scaling of $n$, $p$, and $s$ that is sufficient to guarantee that DSTUMP identifies the active set of features with probability converging to 1. Our general results, proved in the technical report, allow for a correlated design matrix and additive nonlinearities in the true regression function. Our results for the linear case, derived as a special case of the general theory, allow us to compare the performance of DSTUMP to the information-theoretic limits for sample complexity established in [11], and to the performance of existing methods more tailored to this setting, such as the Lasso [12].
Given a generative model and the restriction of DSTUMP to using root-level impurity reduction, the general thrust of our result is straightforward: impurity reduction due to active variables concentrates at a significantly higher level than that of inactive variables. However, there are significant technical challenges in establishing this result, mainly deriving from the fact that the splitting procedure renders the data in the child nodes non-i.i.d., and hence standard concentration inequalities do not immediately apply. We leverage the fact that the DSTUMP procedure considers splits at the median of a predictor. Given this median point, the data in each child node is i.i.d., and hence we can apply standard concentration inequalities in this conditional distribution. Removing this conditioning presents an additional technical subtlety. For ease of exposition, we first present our results for the linear setting in Section 3.1 and subsequently summarize our general results in Section 3.2. We provide a proof of our result in the linear setting in Section 3.3, and defer the proof of our general result to the supplementary material.

The DSTUMP algorithm. In order to describe DSTUMP more precisely, let us introduce some notation. We write $[n] := \{1, \dots, n\}$. Throughout, $y = (y_i,\, i \in [n]) \in \mathbb{R}^n$ is the response vector observed for a sample of size $n$. For an ordered index set $I = (i_1, i_2, \dots, i_r)$, we set $y_I = (y_{i_1}, y_{i_2}, \dots, y_{i_r})$. A similar notation is used for unordered index sets. We write $x_j = (x_{1j}, x_{2j}, \dots, x_{nj}) \in \mathbb{R}^n$ for the vector collecting the values of the $j$th feature; $x_j$ forms the $j$th column of the design matrix $X \in \mathbb{R}^{n \times p}$. Let $I(x_j) := (i_1, i_2, \dots, i_n)$ be an ordering of $[n]$ such that $x_{i_1 j} \le x_{i_2 j} \le \dots \le x_{i_n j}$, and let $\mathrm{sor}(y, x_j) := y_{I(x_j)} \in \mathbb{R}^n$; this is an operator that sorts $y$ relative to $x_j$. DSTUMP proceeds as follows:

- Evaluate $y^k := \mathrm{sor}(y, x_k) = \mathrm{sor}\big(\sum_{j \in S} \beta_j x_j + w,\ x_k\big)$, for $k = 1, \dots, p$.
- Let $m := n/2$. For each $k$, consider the midpoint split of $y^k$ into $y^k_{[m]}$ and $y^k_{[n] \setminus [m]}$ and evaluate the impurity of the left half, using the empirical variance as impurity:
$$\mathrm{imp}(y^k_{[m]}) := \binom{m}{2}^{-1} \sum_{1 \le i < j \le m} \frac{1}{2}\big(y^k_i - y^k_j\big)^2. \qquad (1)$$
- Let $\mathrm{imp}(y^k_{[m]})$ be the score of feature $k$, and output the $s$ features with the smallest scores (corresponding to maximal reduction in impurity).

If the generative model is linear, the choice of the midpoint is justified by our assumption of the uniform distribution for the features $(Z_i)$, and we further show that this simple choice is effective even under a nonlinear model. The choice of the left half in our analysis is for convenience; a similar analysis applies if we take the impurity to be that of the sum of both halves (or their maximum). DSTUMP is summarized in Algorithm 1.

Algorithm 1 DSTUMP
    input: $\{x_k \in \mathbb{R}^n\}_{k=1}^{p}$, $y \in \mathbb{R}^n$, number of top features $s$
    $m = n/2$
    for $k = 1, \dots, p$ do
        $I(x_k)$ = SortFeatureValues($x_k$)
        $y^k$ = SortLabelByFeature($y$, $I(x_k)$)
        $y^k_{[m]}, y^k_{[n]\setminus[m]}$ = SplitAtMidpoint($y^k$)
        $i_k$ = ComputeImpurity($y^k_{[m]}$, $y^k_{[n]\setminus[m]}$)
    end for
    $S$ = FindTopImpurityReductions($\{i_k\}$, $s$)
    output: top $s$ features sorted by impurity reduction

Impurity reduction $\mathrm{imp}(y_{[m]}) - \mathrm{imp}(y^k_{[m]})$ can be considered a form of nonlinear correlation between $y$ and feature $x_k$. The SIS algorithm is equivalent to replacing this nonlinear correlation with the (absolute) linear correlation $|\frac{1}{n} x_k^T y|$. That is, both procedures assign a score to each feature by considering it against the response separately, ignoring other features. In the uncorrelated (i.e., orthogonal design) setting, this is more or less optimal, and, as is the case with SIS, we show that DSTUMP retains some model selection performance even under correlated designs. In contrast to SIS, we show that DSTUMP also enjoys performance guarantees in non-linear settings.
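To make the scoring step concrete, here is a minimal NumPy sketch of Algorithm 1. This is our own illustration rather than the authors' implementation; the helper names and the toy data are ours, and the only substantive fact it relies on is that the U-statistic in (1) equals the unbiased sample variance.

```python
import numpy as np

def variance_impurity(v):
    # The U-statistic in Eq. (1), (1 / C(m,2)) * sum_{i<j} (v_i - v_j)^2 / 2,
    # equals the unbiased sample variance.
    return np.var(v, ddof=1)

def dstump(X, y, s):
    """Return the indices of the s features with the smallest left-half
    impurity after sorting y by each feature (Algorithm 1)."""
    n, p = X.shape
    m = n // 2
    scores = np.empty(p)
    for k in range(p):
        order = np.argsort(X[:, k])                   # I(x_k): ordering of [n] by feature k
        y_sorted = y[order]                           # y^k = sor(y, x_k)
        scores[k] = variance_impurity(y_sorted[:m])   # impurity of the left half
    return np.argsort(scores)[:s]                     # smallest impurity = largest reduction

# Toy usage: linear model with s = 3 active features out of p = 50.
rng = np.random.default_rng(0)
n, p, s = 2048, 50, 3
X = rng.uniform(0, 1, size=(n, p))
beta = np.zeros(p)
beta[:s] = 1 / np.sqrt(s)
y = X @ beta + 0.1 * rng.standard_normal(n)
print(sorted(dstump(X, y, s)))                        # typically [0, 1, 2]
```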
The models. We present our consistency results for models of various complexity. We start with the well-known and extensively studied setting of a linear model with IID design. This basic setup serves as the benchmark for comparing model selection procedures. As will become clear in the course of the proof, analyzing DSTUMP (or impurity-based feature selection in general) is challenging even in this case, in contrast to linear-model-based approaches such as SIS or the Lasso. Once we have a good understanding of DSTUMP under the baseline model, we extend the analysis to correlated design and nonlinear additive models. The structure of our proof is also most clearly seen in this simple case, as outlined in Section 3.3. We now introduce our general models:

Model 1 (Sparse linear model with ICA-type design). A linear regression model $y = X\beta + w$ with ICA-type (random) design $X \in \mathbb{R}^{n \times p}$ has the following properties: (i) $X = \tilde X M$, where $\tilde X \in \mathbb{R}^{n \times p}$ and each row of $\tilde X$ is an independent draw from a (column) vector $Z = (Z_1, \dots, Z_p)$ with IID entries drawn uniformly from $[0, 1]$. (ii) The noise vector $w = (w_1, \dots, w_n)$ has IID sub-Gaussian entries with variance $\mathrm{var}(w_i) = v_w^2$ and sub-Gaussian norm $\|w_i\|_{\psi_2} \le \sigma_w$. (iii) The vector $\beta \in \mathbb{R}^p$ is $s$-sparse, namely, $\beta_j \ne 0$ for $j \in S = \{1, \dots, s\}$ and $\beta_j = 0$ otherwise.

Model 1 serves both the correlated and uncorrelated design cases. Each row of the design matrix $X$ is a draw from the vector $M^T Z$, which has covariance $c\, M^T M$ for some constant $c$. Thus, the choice of $M = I$ leads to an uncorrelated design. The choice of the interval $[0, 1]$ for the covariates is for convenience; it can be replaced with any other compact interval in the linear setting, since the variance impurity is shift-invariant. Similarly, the choice of the (active) support indices, $S$, is for convenience. For simplicity, we often assume $v_w^2 = \sigma^2$ and $\sigma_w \le C\sigma$ (only $\sigma_w$ affects the results, as an examination of our proofs shows).

Model 2 (Sparse additive model with uncorrelated design). An additive regression model $y_i = \sum_{j=1}^p f_j(x_{ij}) + w_i$ is one with random design $X = (x_{ij})$ and noise $(w_i)$ as in Model 1, with $M = I$ (uncorrelated design). We assume $(f_k)$ to be $s$-sparse, namely, $f_j \ne 0$ for $j \in S = \{1, \dots, s\}$ and $f_j = 0$ otherwise.

### 3.1 Linear Setting

Uncorrelated design. Our baseline result is the following feature selection consistency guarantee for DSTUMP, for the case $M = I$ of Model 1. Throughout, we let $\check p := p - s$, and $C, C_1, \dots, c, c_1, \dots$ are absolute positive constants which can differ in each occurrence. For any vector $x$, let $|x|_{\min} := \min_i |x_i|$, the minimum absolute value of its entries. The quantity $|\beta_S|^2_{\min} = \min_{k \in S} \beta_k^2$ appearing in Theorem 1 is a well-known parameter controlling the hardness of subset recovery. All our results are stated in terms of constants $\delta$, $\alpha$ and $\xi$ that are related as:
$$\delta \in (0, 1/8), \qquad \alpha = \log\frac{1}{8\delta}, \qquad \xi = 1 - (1-\delta)^2. \qquad (2)$$

Theorem 1. Assume Model 1 with $M = I$, and (2). The DSTUMP algorithm, which selects the $s$ least impure features at the root, succeeds in feature selection, with probability at least $1 - \check p^{-c} - 2e^{-\alpha n/2}$, if $\log \check p / n \le C_1$ and
$$|\beta_S|^2_{\min} \ \ge\ \frac{C}{\xi}\,\big(\|\beta\|_2^2 + \sigma^2\big)\sqrt{\frac{\log \check p}{n}}. \qquad (3)$$

The result can be read by setting, e.g., $\delta = 1/16$, leading to numerical constants for $\alpha$ and $\xi$.
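For concreteness, substituting $\delta = 1/16$ into (2) gives
$$\alpha = \log\frac{1}{8\delta} = \log 2 \approx 0.693, \qquad \xi = 1 - \Big(\tfrac{15}{16}\Big)^2 = \tfrac{31}{256} \approx 0.121.$$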
The current form allows the flexibility to trade off the constant ($\alpha$) in the probability bound with the constant ($\xi$) in the gap condition (3). Although Theorem 1 applies to a general $\beta$, it is worthwhile to see its consequence in a special regime of interest where $|\beta_S|^2_{\min} \asymp 1/s$, corresponding to $\|\beta\|_2 \asymp 1$. We get the following immediate corollary:

Corollary 1. Assume $|\beta_S|^2_{\min} \asymp 1/s$, $\sigma^2 \asymp 1$ and $\log \check p / n = O(1)$. Then DSTUMP succeeds with high probability if $n \gtrsim s^2 \log \check p$.

The minimax optimal threshold for support recovery in the regime of Corollary 1 is known to be $n \asymp s \log \check p$ [11], and is achieved by the Lasso [12]. Although this result was obtained for Gaussian design, the same argument goes through for our uniform ensemble. Compared to the optimal threshold, using DSTUMP we pay a small factor of $s$ in the sample complexity. However, DSTUMP is not tied to the linear model, and as we discuss in Section 3.2, we can generalize the performance of DSTUMP to nonlinear settings.

Correlated design. We take the following approach to generalize our result to the correlated case: (1) we show a version of Theorem 1 which holds for an approximately sparse parameter $\tilde\beta$ with uncorrelated design; (2) we derive conditions on $M$ such that the correlated case can be turned into the uncorrelated case with approximate sparsity. The following theorem details Step 1:

Theorem 2. Assume Model 1(i)-(ii) with $M = I$, but instead of (iii) let $\beta = \tilde\beta$, a general vector in $\mathbb{R}^p$. Let $S$ be any subset of $[p]$ of cardinality $s$. The DSTUMP algorithm, which selects the $s$ least impure features at the root, recovers $S$, with probability at least $1 - \check p^{-c} - 2e^{-\alpha n/2}$, if $\log \check p / n \le C_1$ and
$$\xi\, |\tilde\beta_S|^2_{\min} - \|\tilde\beta_{S^c}\|_\infty^2 \ >\ C\big(\|\tilde\beta\|_2^2 + \sigma^2\big)\sqrt{\frac{\log \check p}{n}}.$$

The theorem holds for any $\tilde\beta$ and $S$, but the required gap condition is likely to be violated unless $\tilde\beta$ is approximately sparse w.r.t. $S$. Going back to Model 1, we see that setting $\tilde\beta = M\beta$ transforms the model with correlated design $X$ and exact sparsity on $\beta$ into the model with uncorrelated design $\tilde X$ and approximate sparsity on $\tilde\beta$. The following corollary gives sufficient conditions on $M$ so that Theorem 2 is applicable. Recall the usual (vector) $\ell_\infty$ norm, $\|x\|_\infty = \max_i |x_i|$, the matrix $\ell_\infty \to \ell_\infty$ operator norm $|||A|||_\infty = \max_i \sum_j |A_{ij}|$, and the $\ell_2 \to \ell_2$ operator norm $|||A|||_2$.

Corollary 2. Consider a general ICA-type Model 1 with $\beta$ and $M$ satisfying
$$\|\beta_S\|_\infty \le \gamma\, |\beta_S|_{\min}, \qquad |||M_{SS} - I|||_\infty \le \frac{1-\rho}{\gamma}, \qquad |||M_{S^c S}|||_\infty \le \rho, \qquad (4)$$
for some $\rho, \kappa \in (0, 1]$ and $\gamma \ge 1$. Then the conclusion of Theorem 1 holds, for DSTUMP applied to the input $(y, \tilde X)$, under the gap condition (3) with $C/\xi$ replaced with $C\,|||M_{SS}|||_2^2 / (\kappa\, \xi\, \rho^2)$.

Access to the decorrelated features, $\tilde X$, is reasonable in cases where one can perform consistent ICA. This assumption is practically plausible, especially in low-dimensional regimes, though it would be desirable if it could be removed theoretically. Moreover, we note that the response $y$ is based on the correlated features. In this result, $C\,|||M_{SS}|||_2^2 / (\kappa\, \xi\, \rho^2)$ plays the role of a new constant. There is a hard bound on how big $\xi$ can be, which via (4) controls how much correlation between off-support and on-support features is tolerated. For example, taking $\delta = 1/9$, we have $\alpha = \log(9/8) \approx 0.1$, $\xi = 17/81 \approx 0.2$ and $\sqrt{\xi} \approx 0.45$, and this is about as big as it can get (the maximum we can allow is $\approx 0.48$). $\kappa$ can be arbitrarily close to 0, relaxing the assumption (4), at the expense of increasing the constant in the threshold. $\gamma$ controls the deviation of $|\beta_j|$, $j \in S$, from uniform: in the case of equal weights on the support, i.e., $|\beta_j| = 1/\sqrt{s}$ for $j \in S$, we have $\gamma = 1$.
Theorem 1 for the uncorrelated design is recovered by taking $\rho = \kappa = 1$.

### 3.2 General Additive Model Setting

To prove results in this more general setting, we need some further regularity conditions on $(f_k)$. Fix some $\delta \in (0, 1)$, let $U \sim \mathrm{Unif}(0, 1)$, and assume the following about the underlying functions $(f_k)$:

(F1) $\|f_k(\alpha U)\|_{\psi_2}^2 \le \sigma^2_{f,k}$ for all $\alpha \in [0, 1]$.
(F2) $\mathrm{var}[f_k(\alpha U)] \le \mathrm{var}[f_k((1-\delta) U)]$ for all $\alpha \le 1 - \delta$.

Next, we define $\sigma^2_{f,\cdot} := \sum_{k=1}^p \sigma^2_{f,k} = \sum_{k \in S} \sigma^2_{f,k}$, along with the following key gap quantities:
$$g_{f,k}(\delta) := \mathrm{var}[f_k(U)] - \mathrm{var}[f_k((1-\delta) U)].$$

Theorem 3. Assume additive Model 2 with (F1) and (F2). Let $\alpha = \log\frac{1}{8\delta}$ for $\delta \in (0, 1/8)$. The DSTUMP algorithm, which selects the $s$ least impure features at the root, succeeds in model selection, with probability at least $1 - \check p^{-c} - 2e^{-\alpha n/2}$, if $\log \check p / n \le C_1$ and
$$\min_{k \in S} g_{f,k}(\delta) \ \ge\ C\big(\sigma^2_{f,\cdot} + \sigma^2\big)\sqrt{\frac{\log \check p}{n}}. \qquad (5)$$

In the supplementary material, we explore in detail the class of functions that satisfy conditions (F1) and (F2), as well as the gap condition in (5). (F1) is relatively mild and is satisfied if $f$ is Lipschitz or bounded. (F2) is more stringent, and we show that it is satisfied for convex nondecreasing and concave nonincreasing functions (we also observe that it holds for functions beyond these two categories). The gap condition is less restrictive than (F2) and is related to the slope of the function near the endpoint, i.e., $x = 1$. Notably, we study one such function that satisfies all of these conditions, namely $\exp(\cdot)$ on $[-1, 1]$, in our simulations in Section 4.
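As a quick numerical sanity check (our own, not part of the paper's analysis), one can estimate the gap quantity $g_{f,k}(\delta)$ and spot-check (F2) for $f = \exp$ by Monte Carlo, using the $U \sim \mathrm{Unif}(0, 1)$ convention in which the conditions are stated:

```python
import numpy as np

rng = np.random.default_rng(1)
U = rng.uniform(0, 1, size=1_000_000)
delta = 1 / 9

def var_f_scaled(alpha, f=np.exp):
    # Monte Carlo estimate of var[f(alpha * U)]
    return np.var(f(alpha * U))

# Gap quantity g_{f,k}(delta) = var[f(U)] - var[f((1 - delta) U)]
gap = var_f_scaled(1.0) - var_f_scaled(1 - delta)
print(f"g_exp({delta:.3f}) ~= {gap:.4f}")   # positive for exp, consistent with (F2)

# (F2): var[f(alpha U)] <= var[f((1 - delta) U)] for all alpha <= 1 - delta
for a in np.linspace(0.05, 1 - delta, 20):
    assert var_f_scaled(a) <= var_f_scaled(1 - delta) + 1e-3
```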
### 3.3 Proof of Theorem 1

We provide the high-level proof of Theorem 1. For brevity, the proofs of the lemmas are omitted and can be found in the supplement, where we in fact prove them for the more general setup of Theorem 3. The analysis boils down to understanding the behavior of $y^k = \mathrm{sor}(y, x_k)$ as defined earlier. Let $\tilde y^k$ be obtained from $y^k$ by random reshuffling of its left half $y^k_{[m]}$ (i.e., rearranging the entries according to a random permutation). This reshuffling has no effect on the impurity, that is, $\mathrm{imp}(\tilde y^k_{[m]}) = \mathrm{imp}(y^k_{[m]})$, and the reason for it becomes clear when we analyze the case $k \in S$.

Understanding the distribution of $y^k$. If $k \notin S$, the ordering according to which we sort $y$ is independent of $y$ (since $x_k$ is independent of $y$); hence the sorted version, before and after reshuffling, has the same distribution as $y$. Thus, each entry of $\tilde y^k$ is an IID draw from the same distribution as the pre-sort version:
$$\tilde y^k_i \overset{\text{iid}}{\sim} W_0 := \sum_{j \in S} \beta_j Z_j + w_1, \qquad i = 1, \dots, n. \qquad (6)$$
On the other hand, if $k \in S$, then for $i = 1, \dots, n$,
$$y^k_i = \beta_k x_{(i)k} + r^k_i, \qquad \text{where } r^k_i \overset{\text{iid}}{\sim} W_k := \sum_{j \in S \setminus \{k\}} \beta_j Z_j + w_1.$$
Here $x_{(i)k}$ is the $i$th order statistic of $x_k$, that is, $x_{(1)k} \le x_{(2)k} \le \dots \le x_{(n)k}$. Note that the residual terms are still IID, since they gather the covariates (and the noise) that are independent of the $k$th covariate and hence of its ordering. Note also that $r^k_i$ is independent of the first term $\beta_k x_{(i)k}$.

Recall that we split at the midpoint and focus on the left split, i.e., we look at $y^k_{[n/2]} = (y^k_1, y^k_2, \dots, y^k_{n/2})$ and its reshuffled version $\tilde y^k_{[n/2]} = (\tilde y^k_1, \tilde y^k_2, \dots, \tilde y^k_{n/2})$. Intuitively, we would like to claim that the signal part of $\tilde y^k_{[n/2]}$ consists of approximately IID draws from $\beta_k\,\mathrm{Unif}(0, 1/2)$. Unfortunately this is not true, in the sense that the distribution cannot be accurately approximated by $\mathrm{Unif}(0, 1-\delta)$ for any $\delta$ (Lemma 1). However, we show that the distribution can be approximated by an infinite mixture of IID uniforms of reduced range (Lemma 2). Let $U_{(1)} \le U_{(2)} \le \dots \le U_{(n)}$ be the order statistics obtained by ordering an IID sample $U_i \sim \mathrm{Unif}(0, 1)$, $i = 1, \dots, n$. Recall that $m := n/2$ and let $\tilde U := (\tilde U_1, \tilde U_2, \dots, \tilde U_m)$ be obtained from $(U_{(1)}, \dots, U_{(m)})$ by random permutation. Then $\tilde U$ has an exchangeable distribution. We can write, for $k \in S$,
$$\tilde y^k_i = \beta_k \tilde u^k_i + \tilde r^k_i, \qquad \tilde u^k \sim \tilde U, \quad \tilde r^k_i \overset{\text{iid}}{\sim} W_k, \quad i \in [m],$$
where the $m$-vectors $\tilde u^k = (\tilde u^k_i,\, i \in [m])$ and $\tilde r^k = (\tilde r^k_i,\, i \in [m])$ are independent. We have the following result regarding the distribution of $\tilde U$:

Lemma 1. The distribution of $\tilde U$ is a mixture of IID $\mathrm{Unif}(0, \gamma)$ $m$-vectors with mixing variable $\gamma \sim \mathrm{Beta}(m, m+1)$.

Note that $\mathrm{Beta}(m, m+1)$ has mean $m/(2m+1) = (1+o(1))/2$ as $m \to \infty$, and variance $O(m^{-1})$. Thus, Lemma 1 makes our intuition precise in the sense that the distribution of $\tilde U$ is a range mixture of IID uniform distributions, with the range concentrating around $1/2$. We now provide a reduced-range, finite-sample approximation in terms of the total variation distance $d_{\mathrm{TV}}(\tilde U, \hat U)$ between the distributions of the random vectors $\tilde U$ and $\hat U$.

Lemma 2. Let $\hat U$ be distributed according to a mixture of IID $\mathrm{Unif}(0, \hat\gamma)$ $m$-vectors with $\hat\gamma$ distributed as a $\mathrm{Beta}(m, m+1)$ truncated to $(0, 1-\delta)$, for $\delta = e^{-\alpha}/8$ and $\alpha > 0$. With $\tilde U$ as in Lemma 1, we have $d_{\mathrm{TV}}(\tilde U, \hat U) \le 2\exp(-\alpha m)$.

The approximation of the distribution of $\tilde U$ by a truncated version, $\hat U$, is an essential technique in our proof. As will become clear in the proof of Lemma 3, we need to condition on the mixing variable of $\tilde U$, or of its truncated approximation $\hat U$, to allow for the use of concentration inequalities for independent variables. The resulting bounds should be devoid of randomness so that, by taking expectations, we can get similar bounds for the exchangeable case. The truncation allows us to maintain a positive gap in impurities (between on- and off-support features) throughout this process. We expect the loss due to truncation to be minimal, only impacting the constants.

For $k \in S$, let $\hat u^k = (\hat u^k_i,\, i \in [m])$ be drawn from the distribution of $\hat U$ described in Lemma 2, independently of everything else in the model, and let $\hat\gamma_k$ be its corresponding mixing variable, which has a Beta distribution truncated to $(0, 1-\delta)$. Let us define
$$\hat y^k_i = \beta_k \hat u^k_i + \tilde r^k_i, \qquad i \in [m],$$
where $\tilde r^k = (\tilde r^k_i)$ is as before. This construction provides a simple coupling between $\tilde y^k_{[m]}$ and $\hat y^k_{[m]}$, giving the same bound on their total variation distance. Hence, we can safely work with $\hat y^k_{[m]}$ instead of $\tilde y^k_{[m]}$, and pay a price of at most $2\exp(-\alpha m)$ in the probability bounds. To simplify the discussion, let $\hat y^k_i = \tilde y^k_i$ for $k \notin S$.

Concentration of the empirical impurity. We focus on $\hat y^k_{[m]}$ due to the discussion above. We would like to control $\mathrm{imp}(\hat y^k_{[m]})$, the empirical variance impurity of $\hat y^k_{[m]}$, defined as in (1) with $y^k_{[m]}$ replaced by $\hat y^k_{[m]}$. The idea is to analyze $\mathbb{E}[\mathrm{imp}(\hat y^k_{[m]})]$, or proper bounds on it, and then show that $\mathrm{imp}(\hat y^k_{[m]})$ concentrates around its mean. Let us consider the concentration first. (1) is a U-statistic of order 2 with kernel $h(u, v) = \frac{1}{2}(u - v)^2$. The classical Hoeffding inequality guarantees concentration if $h$ is uniformly bounded and the underlying variables are IID. Instead, we use a version of the Hanson-Wright concentration inequality derived in [10], which allows us to derive a concentration bound for the empirical variance of general sub-Gaussian vectors, avoiding the boundedness assumption:

Corollary 3. Let $w = (w_1, \dots, w_m) \in \mathbb{R}^m$ be a random vector with independent components $w_i$ which satisfy $\mathbb{E} w_i = \mu$ and $\|w_i - \mu\|_{\psi_2} \le K$. Let $\mathrm{imp}(w) := \binom{m}{2}^{-1} \sum_{1 \le i < j \le m} \frac{1}{2}(w_i - w_j)^2$. Then
$$\mathbb{P}\Big( \big|\mathrm{imp}(w) - \mathbb{E}\,\mathrm{imp}(w)\big| \ge K^2 u \Big) \ \le\ 2 \exp\big( -c\,(m-1)\min(u, u^2) \big). \qquad (7)$$
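The following small simulation (ours; Gaussian entries are used purely for illustration, Gaussians being sub-Gaussian) shows the behavior that Corollary 3 describes: the impurity of a vector with independent sub-Gaussian entries concentrates around its mean, with fluctuations shrinking roughly like $1/\sqrt{m}$:

```python
import numpy as np

rng = np.random.default_rng(2)

def impurity(w):
    # (1 / C(m,2)) * sum_{i<j} (w_i - w_j)^2 / 2  ==  unbiased sample variance
    return np.var(w, ddof=1)

for m in (64, 256, 1024):
    imps = np.array([impurity(rng.standard_normal(m)) for _ in range(2000)])
    # E[imp(w)] = Var(w_1) = 1; the spread shrinks roughly like 1/sqrt(m)
    print(f"m={m:5d}  mean={imps.mean():.3f}  sd={imps.std():.3f}")
```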
We can immediately apply Corollary 3 when $k \notin S$. However, for $k \in S$, a more careful application is needed, since we can only guarantee an exchangeable distribution for $\hat y^k_{[m]}$ in this case. The following lemma summarizes the conclusions:

Lemma 3. Let $\hat I_{m,k} = \mathrm{imp}(\hat y^k_{[m]})$ and recall that $\delta$ was introduced in the definition of $\hat y^k_i$. Let $\kappa_1^2 := \frac{1}{12}$ be the variance of $\mathrm{Unif}(0, 1)$. Recall that $\check p := p - s$. Let $L = \|\beta\|_2$. There exist absolute constants $C_1, C_2, c$ such that if $\log \check p / m \le C_1$, then with probability at least $1 - \check p^{-c}$,
$$\hat I_{m,k} \le I^1_k + \varepsilon_m, \ \ k \in S, \qquad \text{and} \qquad \hat I_{m,k} \ge I^0 - \varepsilon_m, \ \ k \notin S,$$
where, letting $\xi := 1 - (1-\delta)^2$,
$$I^1_k := \kappa_1^2\big( -\xi \beta_k^2 + L^2 \big) + \sigma^2, \qquad I^0 := \kappa_1^2 L^2 + \sigma^2, \qquad \varepsilon_m := C_2 (L^2 + \sigma^2)\sqrt{\frac{\log \check p}{m}}.$$

The key outcome of Lemma 3 is that, on average, there is a positive gap $I^0 - I^1_k = \kappa_1^2\, \xi\, \beta_k^2$ in impurities between a feature on the support and those off of it, and that, due to concentration, the fluctuations in the impurities will be smaller than this gap for large $m$. Combined with Lemma 2, we can transfer the results to $\tilde I_{m,k} := \mathrm{imp}(\tilde y^k_{[m]})$.

Corollary 4. The conclusion of Lemma 3 holds for $\tilde I_{m,k}$ in place of $\hat I_{m,k}$, with probability at least $1 - \check p^{-c} - 2e^{-\alpha m}$ for $\alpha = \log\frac{1}{8\delta}$.

Note that for $\delta < 1/8$, the bound holds with high probability. Thus, as long as $I^0 - I^1_k > 2\varepsilon_m$, the selection algorithm correctly favors the $k$th feature in $S$ over the inactive ones (recall that lower impurity is better). We obtain our main result after substituting $n/2$ for $m$.

## 4 Simulations

Figure 1: Support recovery performance in a linear regression model augmented with possible nonlinearities, for n = 1024. (a) Linear case with uncorrelated design. (b) Linear case with correlated design. (c) Nonlinear additive model with exponentials of covariates and uncorrelated design. (d) Nonlinear model with interaction terms and uncorrelated design. (e) Nonlinear additive model with exponentials of covariates, interaction terms, and uncorrelated design. (f) Nonlinear additive model with exponentials of covariates, interaction terms, and correlated design.

In order to corroborate the theoretical analysis, we next present various simulation results. We consider the following model: $y = X\beta + f(X_S) + w$, where $f(X_S)$ is a potential nonlinearity and $S$ is the true support of $\beta$. We generate the training data as $X = \tilde X M$, where $\tilde X \in \mathbb{R}^{n \times p}$ is a random matrix with IID $\mathrm{Unif}(-1, 1)$ entries, and $M \in \mathbb{R}^{p \times p}$ is an upper-triangular matrix that determines whether the design is IID or correlated. In the IID case we set $M = I$. To achieve a correlated design, we randomly assign values from $\{0, -\rho, +\rho\}$ to the upper-triangular cells of $M$, with probabilities $(1 - 2\alpha, \alpha, \alpha)$. We observed qualitatively similar results for various values of $\rho$ and $\alpha$, and here we present results with $\alpha = 0.04$ and $\rho = 0.1$. The noise is generated as $w \sim N(0, \sigma^2 I_n)$. We fix $p = 200$, $\sigma = 0.1$, and let $\beta_i = 1/\sqrt{s}$ over the support $i \in S$, where $|S| = s$. That is, only $s$ of the $p = 200$ variables are predictive of the response. The nonlinearity, $f$, optionally contains additive terms in the form of exponentials of on-support covariates. It can also contain interaction terms across on-support covariates, i.e., scaled products $x_i x_j$ for some randomly selected pairs $i, j \in S$. Notably, the choice of $f$ is unknown to the variable selection methods. We vary $s \in [5, 100]$ and note that $\|\beta\|_2 = 1$ remains fixed.
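A sketch of this data-generating process is given below. This is our own illustrative code: the exact scaling and form of the nonlinearity are assumptions, the sign convention for M's off-diagonal entries follows the reconstruction above, and `dstump` refers to the earlier sketch.

```python
import numpy as np

def make_design(n, p, correlated=False, rho=0.1, alpha=0.04, rng=None):
    """X = X_tilde @ M with X_tilde ~ Unif(-1, 1) entries; M is upper-triangular
    with off-diagonal entries in {0, -rho, +rho} drawn with probabilities
    (1 - 2*alpha, alpha, alpha) in the correlated case, else M = I."""
    rng = rng or np.random.default_rng()
    X_tilde = rng.uniform(-1, 1, size=(n, p))
    M = np.eye(p)
    if correlated:
        rows, cols = np.triu_indices(p, k=1)
        M[rows, cols] = rng.choice([0.0, -rho, rho], size=len(rows),
                                   p=[1 - 2 * alpha, alpha, alpha])
    return X_tilde @ M

def make_response(X, s, sigma=0.1, nonlinearity=None, rng=None):
    """y = X beta + f(X_S) + w with beta_i = 1/sqrt(s) on S = {0,...,s-1};
    `nonlinearity` is an optional callable f applied to the active columns."""
    rng = rng or np.random.default_rng()
    n, p = X.shape
    beta = np.zeros(p)
    beta[:s] = 1 / np.sqrt(s)
    y = X @ beta + sigma * rng.standard_normal(n)
    if nonlinearity is not None:
        y = y + nonlinearity(X[:, :s])
    return y

# Example: additive-exponential setting; report the fraction of the true
# support recovered by DSTUMP in a single draw.
rng = np.random.default_rng(3)
n, p, s = 1024, 200, 20
X = make_design(n, p, correlated=False, rng=rng)
y = make_response(X, s, nonlinearity=lambda Z: 2 * np.exp(Z).sum(axis=1), rng=rng)
selected = dstump(X, y, s)   # dstump() from the earlier sketch
print(np.isin(selected, np.arange(s)).mean())
```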
The plots in Figure 1 show the fraction of the true support recovered, as a function of $s$, for various methods under different modeling setups: f = 0 (linear), f = 2 exp(·) (additive), f = interaction (interactions), and f = interaction + 2 exp(·) (interactions + additive), with IID or correlated designs. (In the supplementary material we report analogous results using a more stringent performance metric, namely the probability of exact support recovery; the results are qualitatively similar.) Each data point is an average over 100 trials (see the supplementary material for results with 95% confidence intervals). In addition to DSTUMP, we evaluate TREEWEIGHT, SpAM, Lasso, SIS, and random guessing for comparison. SIS refers to picking the indices of the top $s$ largest values of $X^T y$ in absolute value. When $X$ is orthogonal and the generative model is linear, this approach is optimal, and we use it as a surrogate for the optimal approach in our nearly orthogonal setup (i.e., the IID linear case), due to its lack of any tuning parameters. Random guessing is used as a benchmark and, as expected, on average recovers the fraction $s/p = s/200$ of the support.

The plots show that, in the linear setting, the performance of DSTUMP is comparable to, and only slightly worse than, that of SIS or the Lasso, which are considered optimal in this case. Figure 1(b) shows that under mildly correlated design the gap between DSTUMP and the Lasso widens. In this case, SIS loses its optimality and performs at the same level as DSTUMP. This matches our intuition, as both SIS and DSTUMP are greedy methods that consider covariates independently. DSTUMP is more robust to nonlinearities, as characterized theoretically in Theorem 3 and evidenced in Figure 1(c). In contrast, in the presence of exponential nonlinearities, SIS and the Lasso are effective in the very sparse regime of $s \ll p$, but quickly approach random guessing as $s$ grows. In the presence of interaction terms, TREEWEIGHT, and to a lesser extent SpAM, outperform all other methods, as shown in Figures 1(d), 1(e), and 1(f). We also note that the permutation-based importance method [1], denoted by TREEWEIGHTPERMUTATION in the plots in Figure 1, performs substantially worse than TREEWEIGHT across the various modeling settings. Overall, these simulations illustrate the promise of multi-level tree-based methods like TREEWEIGHT under more challenging and realistic modeling settings. Future work involves extending our theoretical analyses to these more complex multi-level tree-based approaches.

## 5 Discussion

We presented a simple model selection algorithm for decision trees, which we call DSTUMP, and analyzed its finite-sample performance in a variety of settings, including the high-dimensional nonlinear additive model setting. Our theoretical and experimental results show that even a simple tree-based algorithm that selects features at the root can achieve high-dimensional selection consistency. We hope these results pave the way for the finite-sample analysis of more refined tree-based model selection procedures. Inspired by the empirical success of TREEWEIGHT in nonlinear settings, we are actively looking at extensions of DSTUMP to a multi-stage algorithm capable of handling interactions with high-dimensional guarantees. Moreover, while we mainly focused on the regression problem, our proof technique based on concentration of impurity reductions is quite general. We expect analogous results to hold, for example, for classification. However, aspects of the proof would differ, since the impurity measures used for classification are different from those used for regression.
One major hurdle involves deriving concentration inequalities for the empirical versions of these measures, which are currently unavailable and would be of independent interest.

## References

[1] L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.
[2] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen. Classification and Regression Trees. CRC Press, 1984.
[3] J. Fan and J. Lv. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B, 70(5), 2008.
[4] J. Fan and J. Lv. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849-911, 2008.
[5] J. H. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189-1232, 2001.
[6] H. Ishwaran. Variable importance in binary regression trees and forests. Electronic Journal of Statistics, 1:519-537, 2007.
[7] J. Lafferty and L. Wasserman. Rodeo: Sparse, greedy nonparametric regression. The Annals of Statistics, 36(1):28-63, 2008.
[8] G. Louppe, L. Wehenkel, A. Sutera, and P. Geurts. Understanding variable importances in forests of randomized trees. In Advances in Neural Information Processing Systems 26, 2013.
[9] P. Ravikumar, J. Lafferty, H. Liu, and L. Wasserman. Sparse additive models. Journal of the Royal Statistical Society: Series B, 71(5):1009-1030, 2009.
[10] M. Rudelson and R. Vershynin. Hanson-Wright inequality and sub-Gaussian concentration. Electronic Communications in Probability, pages 1-10, 2013.
[11] M. J. Wainwright. Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting. IEEE Transactions on Information Theory, 55(12):5728-5741, 2009.
[12] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183-2202, 2009.