# Optimal Sparse Recovery with Decision Stumps

Kiarash Banihashem, Mohammad Taghi Hajiaghayi, Max Springer
University of Maryland
kiarash@umd.edu, hajiagha@cs.umd.edu, mss423@umd.edu

Decision trees are widely used for their low computational cost, good predictive performance, and ability to assess the importance of features. Though often used in practice for feature selection, the theoretical guarantees of these methods are not well understood. We here obtain a tight finite sample bound for the feature selection problem in linear regression using single-depth decision trees. We examine the statistical properties of these decision stumps for the recovery of the $s$ active features from $p$ total features, where $s \ll p$. Our analysis provides tight sample performance guarantees on high-dimensional sparse systems which align with the finite sample bound of $O(s \log p)$ as obtained by Lasso, improving upon previous bounds for both the median and optimal splitting criteria. Our results extend to the non-linear regime as well as arbitrary sub-Gaussian distributions, demonstrating that tree-based methods attain strong feature selection properties under a wide variety of settings and further shedding light on the success of these methods in practice. As a byproduct of our analysis, we show that we can provably guarantee recovery even when the number of active features $s$ is unknown. We further validate our theoretical results and proof methodology using computational experiments.

## Introduction

Decision trees are one of the most popular tools used in machine learning. Due to their simplicity and interpretability, trees are widely implemented by data scientists, both individually and in aggregation with ensemble methods such as random forests and gradient boosting (Friedman 2001). In addition to their predictive accuracy, tree-based methods are an important tool for the variable selection problem: identifying a relevant small subset of a high-dimensional feature space of the input variables that can accurately predict the output. When the relationship between the variables is linear, it has long been known that LASSO achieves the optimal sample complexity rate for this problem (Wainwright 2009a). In practice, however, tree-based methods have been shown to be preferable as they scale linearly with the size of the data and can capture non-linear relationships between the variables (Xu et al. 2014). Notably, various tree-structured systems are implemented for this variable selection task across wide-spanning domains. For example, tree-based importance measures have been used for the prediction of financial distress (Qian et al. 2022), wireless signal recognition (Li and Wang 2018), credit scoring (Xia et al. 2017), and the discovery of genes in the field of bioinformatics (Breiman 2001; Bureau et al. 2005; Huynh-Thu et al. 2012; Lunetta et al. 2004), to name a small fraction. Despite this empirical success, however, the theoretical properties of these tree-based methods for feature selection are not well understood. While several papers have considered variants of this problem (see the Related Work section for an overview), even in the simple linear case, the sample complexity of the decision tree is not well-characterized.
In this paper, we attempt to bridge this gap and analyze the variable selection properties of single-level decision trees, commonly referred to as decision stumps (DSTUMP). Considering both linear and non-linear settings, we show that DSTUMP achieves nearly-tight sample complexity rates for a variety of practical sample distributions. Compared to prior work, our analysis is simpler and applies to different variants of the decision tree, as well as more general function classes.

The remainder of the paper is organized as follows: in the next section we introduce our main results on the finite sample guarantees of DSTUMP, in the Related Work section we discuss important prior results in the field of sparse recovery and where they fall short, in the Algorithm Description section we present the recovery algorithm, and in the subsequent section we provide theoretical guarantees for the procedure under progressively more general problem assumptions. The proofs of these results are provided in the so-labeled section. We supplement the theoretical results with computational simulations in the Experimental Results section and provide concluding remarks in the final section.

## Our Results

We assume that we are given a dataset $D = \{(X_{i,:}, Y_i)\}_{i=1}^n$ consisting of $n$ samples from the non-parametric regression model $Y_i = \sum_k f_k(X_{i,k}) + W_i$, where $i$ denotes the sample number, $X_{i,:} \in \mathbb{R}^p$ is the input vector with corresponding output $Y_i \in \mathbb{R}$, the $W_i \in \mathbb{R}$ are i.i.d. noise values, and the $f_k : \mathbb{R} \to \mathbb{R}$ are univariate functions that are $s$-sparse: a set of univariate functions $\{f_k\}_{k \in [p]}$ is $s$-sparse on the feature set $[p]$ if there exists a set $S \subseteq [p]$ with size $s = |S| \ll p$ such that $f_k = 0$ for all $k \notin S$. Given access to $\{(X_{i,:}, Y_i)\}_{i=1}^n$, we consider the sparse recovery problem, where we attempt to reveal the set $S$ with only a minimal number of samples. Note that this is different from the prediction problem, where the goal is to learn the functions $f_k$. In accordance with prior work (Han et al. 2020; Kazemitabar et al. 2017; Klusowski 2020), we assume the $X_{i,j}$ are i.i.d. draws from the uniform distribution on $[0, 1]$ with Gaussian noise, $W_i \sim N(0, 1)$. In the Main Results section, we will discuss how this assumption can be relaxed using our non-parametric results to consider more general distributions.
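As a concrete illustration of the data-generating model above (our sketch, not taken from the paper), the following snippet draws one dataset from the $s$-sparse linear special case, with $\beta$ chosen so that $\min_{k \in S} \beta_k^2 = 1/s$ and $\|\beta\|_2 = 1$; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s = 500, 200, 10           # samples, ambient dimension, sparsity
sigma_w = 0.1                    # noise standard deviation

X = rng.uniform(0.0, 1.0, size=(n, p))           # X_{i,k} ~ Unif(0, 1), i.i.d.
support = rng.choice(p, size=s, replace=False)   # active set S with |S| = s
beta = np.zeros(p)
beta[support] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)

# Linear special case of the additive model: Y_i = sum_k f_k(X_{i,k}) + W_i with f_k(x) = beta_k * x.
Y = X @ beta + sigma_w * rng.normal(size=n)
```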
For the recovery problem, we consider the DSTUMP algorithm as first proposed by Kazemitabar et al. (2017). The algorithm, shown in Algorithm 1, fits a single-level decision tree (stump) on each feature using either the median split or the optimal split strategy and ranks the features by the error of the corresponding trees. The median split yields a substantially simplified implementation, as we do not need to optimize the stump construction, further providing an improved run time over the comparable optimal splitting procedure. Indeed, the median and optimal split have time complexity at most $O(np)$ and $O(np \log(n))$, respectively. In spite of this simplification, we show that in the widely considered case of linear design, where the $f_k$ are linear, DSTUMP can correctly recover $S$ with a sample complexity bound of $O(s \log p)$, matching the minimax optimal lower bound for the problem as achieved by LASSO (Wainwright 2009a,b). This result is noteworthy and surprising since the DSTUMP algorithm (and decision trees in general) is not designed with a linearity assumption, as is the case with LASSO. For this reason, trees are generally utilized for their predictive power in non-linear models, yet our work proves their value in the linear setting as well. We further extend these results to non-linear models and general sub-Gaussian distributions, improving previously known results using simpler analysis. In addition, our results do not require the sparsity level $s$ to be known in advance.

We summarize our main technical results as follows:

- We obtain a sample complexity bound of $O(s \log p)$ for the DSTUMP algorithm in the linear design case, matching the optimal rate of LASSO and improving prior bounds in the literature for both the median and optimal split. This is the first tight bound on the sample complexity of single-depth decision trees used for sparse recovery and significantly improves on the existing results.
- We extend our results to the case of non-linear $f_k$, obtaining tighter results for a wider class of functions compared to the existing literature.
- We further use these results to generalize our analysis to sub-Gaussian distributions via the extension to non-linear $f_k$.
- As a byproduct of our improved analysis, we show that our results hold for the case where the number of active features $s$ is not known. This is the first theoretical guarantee on decision stumps that does not require $s$ to be known.
- We validate our theoretical results using numerical simulations that show the necessity of our analytic techniques.

## Related Work

While our model framework and the sparsity problem as a whole have been studied extensively (Fan and Lv 2006; Lafferty and Wasserman 2008; Wainwright 2009a,b), none have replicated the well-known optimal lower bound for the classification problem under the given set of assumptions. Our work provides improved finite sample guarantees on DSTUMP for the regression problem that nearly match those of LASSO by means of weak learners.

Most closely related to our work is that of Kazemitabar et al. (2017), which formulates the DSTUMP algorithm and the theoretical approach for finite sample guarantees of these weak learners. Unlike our nearly tight result on the number of samples required for recovery, Kazemitabar et al. (2017) provide a weaker $O(s^2 \log p)$ bound when using the median splitting criterion. We here demonstrate that the procedure can obtain near-optimal finite sample guarantees by highlighting a subtle nuance in the analysis of the stump splitting (potentially of independent interest to the reader): instead of using the variance of one sub-tree as an impurity measure, we use the variance of both sub-trees. As we will show both theoretically and experimentally, this consideration is vital for obtaining tight bounds. Our analysis is also more general than that of Kazemitabar et al. (2017), with applications to both median and optimal splits, a wider class of functions $f_k$, and more general distributions.

In a recent work, Klusowski and Tian (2021) provide an indirect analysis of the DSTUMP formulation with the optimal splitting criterion by studying its relation to the SIS algorithm of Fan and Lv (2006). Designed for linear models specifically, SIS sorts the features based on their Pearson correlation with the output and has optimal sample complexity for the linear setting. Klusowski and Tian (2021) show that when using the optimal splitting criterion, DSTUMP is equivalent to SIS up to logarithmic factors, which leads to a sample complexity of $n \gtrsim s \log(p)\,(\log(s) + \log(\log(p)))$. This improves the results of Kazemitabar et al. (2017), though the analysis is more involved.
A similar technique is also used to study the non-linear case, but the conditions assumed on $f_k$ are hard to interpret and the bounds are weakened. Indeed, instantiating the non-parametric results for the linear case implies a sub-optimal sample complexity rate of $O(s^2 \log p)$. In addition, Klusowski and Tian (2021) describe a heuristic algorithm for the case of unknown $|S|$, though they do not prove any guarantees on its performance. In contrast, we provide a direct analysis of DSTUMP. This allows us to obtain optimal bounds for both the median and optimal split in the linear case, as well as improved and generalized bounds in the non-linear case. Our novel approach further allows us to analyze the case of unknown sparsity level, as we provide the first formal proof for the heuristic algorithm suggested in Klusowski and Tian (2021) and Fan, Feng, and Song (2011). Compared to prior work, our analysis is considerably simpler and applies to more general settings.

Additionally, various studies have leveraged the simplicity of median splits in decision trees to produce sharp bounds on the mean-squared error for the regression problem with random forests (Duroux and Scornet 2018; Klusowski 2021). In each of these studies, analysis under the median split assumption allows for improvements in both asymptotic and finite sample bounds on prediction error for these ensemble weak learners. In the present work, we extend this intuition to the feature selection problem for high-dimensional sparse systems, and further emphasize the utility of the median split even in the singular decision stump case.

## Algorithm Description

### Notation and Problem Setup

For mathematical convenience, we adopt matrix notation and use $X = (X_{ij}) \in \mathbb{R}^{n \times p}$ and $Y = (Y_i) \in \mathbb{R}^n$ to denote the input matrix and the output vector, respectively. We use $X_{i,:}$ and $X^k$ to refer to the $i$-th row and $k$-th column of the matrix $X$. We will also extend the definition of the univariate functions $f_k$ to vectors by assuming that the function is applied to each coordinate separately: for $v \in \mathbb{R}^d$, we define $f_k(v) \in \mathbb{R}^d$ as $(f_k(v))_i = f_k(v_i)$. We let $\mathbb{E}[\cdot]$ and $\mathrm{Var}(\cdot)$ denote the expectation and variance of random variables, with $\widehat{\mathbb{E}}[\cdot]$ and $\widehat{\mathrm{Var}}(\cdot)$ denoting their empirical counterparts, i.e., for a generic vector $v \in \mathbb{R}^d$, $\widehat{\mathbb{E}}[v] = \frac{1}{d}\sum_{i=1}^d v_i$ and $\widehat{\mathrm{Var}}(v) = \frac{1}{d}\sum_{i=1}^d (v_i - \widehat{\mathbb{E}}[v])^2$. We will also use $[d]$ to denote the set $\{1, \dots, d\}$. Finally, we let $\mathrm{Unif}(a, b)$ denote the uniform distribution over $[a, b]$ and use $C, c > 0$ to denote generic universal constants.

Throughout the paper, we will use the concept of sub-Gaussian random variables for stating and proving our results. A random variable $Z \in \mathbb{R}$ is called sub-Gaussian if there exists a $t$ for which $\mathbb{E}\big[e^{(Z/t)^2}\big] \le 2$, and its sub-Gaussian norm is defined as $\|Z\|_{\psi_2} = \inf\big\{t > 0 : \mathbb{E}\big[e^{(Z/t)^2}\big] \le 2\big\}$. Sub-Gaussian random variables are well-studied and have many desirable properties (see (Vershynin 2018) for a comprehensive overview), some of which we outline below as they are leveraged throughout our analysis.

- (P1) (Hoeffding's inequality) If $Z$ is a sub-Gaussian random variable, then for any $t > 0$, $\Pr\big(|Z - \mathbb{E}[Z]| \ge t\big) \le 2e^{-c(t/\|Z\|_{\psi_2})^2}$.
- (P2) If $Z_1, \dots, Z_n$ are independent sub-Gaussian random variables, then $\sum_i Z_i$ is also sub-Gaussian with norm satisfying $\big\|\sum_i Z_i\big\|_{\psi_2}^2 \le \sum_i \|Z_i\|_{\psi_2}^2$.
- (P3) If $Z$ is a sub-Gaussian random variable, then so is $Z - \mathbb{E}[Z]$, and $\|Z - \mathbb{E}[Z]\|_{\psi_2} \le c\,\|Z\|_{\psi_2}$.
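As a quick numerical aside (our illustration, not part of the paper), the snippet below evaluates the sub-Gaussian norm of a standard Gaussian directly from the definition above, bisecting on $t$ in the condition $\mathbb{E}[e^{(Z/t)^2}] \le 2$ and using the closed form $\mathbb{E}[e^{(Z/t)^2}] = (1 - 2/t^2)^{-1/2}$, valid for $t > \sqrt{2}$.

```python
import numpy as np

def mgf_sq_standard_normal(t: float) -> float:
    """E[exp((Z/t)^2)] for Z ~ N(0, 1); finite only when t > sqrt(2)."""
    return (1.0 - 2.0 / t**2) ** -0.5

def subgaussian_norm_standard_normal(tol: float = 1e-10) -> float:
    """Smallest t with E[exp((Z/t)^2)] <= 2, found by bisection."""
    lo, hi = np.sqrt(2.0) + 1e-9, 10.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mgf_sq_standard_normal(mid) <= 2.0:
            hi = mid            # defining condition holds, try a smaller t
        else:
            lo = mid
    return hi

print(subgaussian_norm_standard_normal())   # ~1.633, i.e., sqrt(8/3)
```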
### DSTUMP Algorithm

We here present the recovery algorithm DSTUMP. For each feature $k \in [p]$, the algorithm fits a single-level decision tree, or stump, on the given feature and defines the impurity of the feature as the error of this decision tree. Intuitively, the active features are expected to be better predictors of $Y$ and therefore have lower impurity values. Thus, the algorithm sorts the features based on these values and outputs the $|S|$ features with the smallest impurity. A formal pseudo-code of our approach is given in Algorithm 1.

**Algorithm 1: Scoring using DSTUMP**
Input: $X \in \mathbb{R}^{n \times p}$, $Y \in \mathbb{R}^n$, $s \in \mathbb{N}$. Output: estimate of $S$.
1: for $k \in \{1, \dots, p\}$ do
2:   $\tau^k = \operatorname{argsort}(X^k)$
3:   for $n_L \in \{1, \dots, n\}$ do
4:     $n_R = n - n_L$
5:     $Y_L^k = (Y_{\tau^k(1)}, \dots, Y_{\tau^k(n_L)})^T$
6:     $Y_R^k = (Y_{\tau^k(n_L+1)}, \dots, Y_{\tau^k(n)})^T$
7:     $\mathrm{imp}_{k, n_L} = \frac{n_L}{n}\widehat{\mathrm{Var}}(Y_L^k) + \frac{n_R}{n}\widehat{\mathrm{Var}}(Y_R^k)$
8:   if median split then
9:     $\mathrm{imp}_k = \mathrm{imp}_{k, \lfloor n/2 \rfloor}$
10:  else
11:    $\mathrm{imp}_k = \min_{\ell} \mathrm{imp}_{k, \ell}$
12: return $\tau = \operatorname{argsort}(\mathrm{imp})$

Formally, for each feature $k$, the $k$-th column is sorted in increasing order such that $X^k_{\tau^k(1)} \le X^k_{\tau^k(2)} \le \dots \le X^k_{\tau^k(n)}$, with ties broken randomly. The samples are then split into two groups: the left half, consisting of $X^k_i \le X^k_{\tau^k(n_L)}$, and the right half, consisting of $X^k_i > X^k_{\tau^k(n_L)}$. Conceptually, this is the same as splitting a single-depth tree on the $k$-th column with an $n_L$ to $n_R$ ratio and collecting the samples in each group. The algorithm then calculates the variance of the output in each group, which represents the optimal prediction error for this group with a single value, as $\widehat{\mathrm{Var}}(Y_L^k) = \min_t \frac{1}{n_L}\sum_i (Y^k_{L,i} - t)^2$. The weighted average of these two variances is taken as the impurity. Formally,

$$\mathrm{imp}_{k, n_L} = \frac{n_L}{n}\,\widehat{\mathrm{Var}}(Y_L^k) + \frac{n_R}{n}\,\widehat{\mathrm{Var}}(Y_R^k). \tag{1}$$

For the median split algorithm, the value of $n_L$ is simply chosen as $\lfloor n/2 \rfloor$, whereas for the optimal split, the value is chosen in order to minimize the impurity of the split. Lastly, the features are sorted by their impurity values and the $|S|$ features with the lowest impurity are predicted as $S$. In the Unknown Sparsity Level section, we discuss the case of unknown $|S|$ and obtain an algorithm with nearly matching guarantees.
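The following NumPy sketch mirrors the impurity computation of Algorithm 1; it is our illustration, not the authors' reference implementation. The hypothetical helper `dstump_scores` plays the role of the STUMPSCORE variant below and returns the vector $\mathrm{imp}$, from which the estimate of $S$ is obtained by keeping the $s$ features with the smallest scores.

```python
import numpy as np

def _variance(v: np.ndarray) -> float:
    """Empirical variance (1/d) * sum_i (v_i - mean(v))^2; 0 for an empty split."""
    return float(np.var(v)) if v.size else 0.0

def dstump_scores(X: np.ndarray, Y: np.ndarray, split: str = "median") -> np.ndarray:
    """Impurity score of each feature from a single-level decision tree (stump)."""
    n, p = X.shape
    scores = np.empty(p)
    for k in range(p):
        order = np.argsort(X[:, k])            # tau^k = argsort(X^k)
        Y_sorted = Y[order]
        if split == "median":
            n_left_choices = [n // 2]          # median split: n_L = floor(n / 2)
        else:
            n_left_choices = range(1, n)       # optimal split: search over split points
        best = np.inf
        for n_left in n_left_choices:
            y_left, y_right = Y_sorted[:n_left], Y_sorted[n_left:]
            imp = (n_left / n) * _variance(y_left) + ((n - n_left) / n) * _variance(y_right)
            best = min(best, imp)
        scores[k] = best
    return scores

# Recovery when s is known: the s features with the lowest impurity.
# scores = dstump_scores(X, Y, split="median"); S_hat = np.argsort(scores)[:s]
```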
## Main Results

### Linear Design

For our first results, we consider the simplest setting of linear models with a uniform distribution on the inputs. Formally, we assume that there is a vector $\beta \in \mathbb{R}^p$ such that $f_k(x) = \beta_k x$ for all $k$. This is equivalent to considering the linear regression model $Y = \sum_k \beta_k X^k + W$. We further assume that each entry of the matrix $X$ is an i.i.d. draw from the uniform distribution on $[0, 1]$. This basic setting is important from a theoretical perspective as it allows us to compare with existing results from the literature before extending to more general contexts. Our initial result, Theorem 0.1, provides an upper bound on the failure probability of DSTUMP in this setting.

**Theorem 0.1.** Assume that each entry of the input matrix $X$ is sampled i.i.d. from a uniform distribution on $[0, 1]$. Assume further that the output vector satisfies the linear regression model $Y = \sum_k \beta_k X^k + W$, where the $W_i$ are sampled i.i.d. from $N(0, \sigma_w^2)$. Algorithm 1 correctly recovers the active feature set $S$ with probability at least
$$1 - 4se^{-cn} - 4pe^{-cn\,\frac{\min_{k \in S} \beta_k^2}{\|\beta\|_2^2 + \sigma_w^2}}$$
for the median split, and with probability at least
$$1 - 4se^{-cn} - 4npe^{-cn\,\frac{\min_{k \in S} \beta_k^2}{\|\beta\|_2^2 + \sigma_w^2}}$$
for the optimal split.

The above theorem provides a finite sample guarantee for the DSTUMP algorithm and does not make any assumptions on the parameters or their asymptotic relationship. In order to obtain a comparison with the existing literature (Kazemitabar et al. 2017; Klusowski and Tian 2021; Wainwright 2009a,b), we consider these bounds in the asymptotic regime $\min_{k \in S} \beta_k^2 \in \Omega(\frac{1}{s})$ and $\|\beta\|_2^2 \in O(1)$.

**Corollary 0.2.** In the same setting as Theorem 0.1, assume that $\|\beta\|_2^2 \in O(1)$ and $\min_{k \in S} \beta_k^2 \in \Omega(\frac{1}{s})$. Then Algorithm 1 correctly recovers the active feature set $S$ with high probability as long as $n \gtrsim s \log p$.

The proof of the corollary is presented in the appendix. The above result shows that DSTUMP is optimal for recovery when the data obeys a linear model. This is interesting considering that tree-based methods are known for their strength in capturing non-linear relationships and are not designed with a linearity assumption like LASSO. In the next section, we further extend our finite sample bound analysis to non-linear models.

### Additive Design

We here consider the case of non-linear $f_k$ and obtain theoretical guarantees for the original DSTUMP algorithm. Our main result is Theorem 0.3, stated below.

**Theorem 0.3.** Assume that each entry of the input matrix $X$ is sampled i.i.d. from a uniform distribution on $[0, 1]$ and $Y = \sum_k f_k(X^k) + W$, where the $W_i$ are sampled i.i.d. from $N(0, \sigma_w^2)$. Assume further that each $f_k$ is monotone and $f_k(\mathrm{Unif}(0, 1))$ is sub-Gaussian with sub-Gaussian norm $\|f_k(\mathrm{Unif}(0, 1))\|_{\psi_2}^2 \le \sigma_k^2$. For $k \in S$, define $g_k$ as
$$g_k := \mathbb{E}\left[f_k\left(\mathrm{Unif}\left(\tfrac{1}{2}, 1\right)\right)\right] - \mathbb{E}\left[f_k\left(\mathrm{Unif}\left(0, \tfrac{1}{2}\right)\right)\right],$$
and define $\sigma^2$ as $\sigma^2 = \sigma_w^2 + \sum_k \sigma_k^2$. Algorithm 1 correctly recovers the set $S$ with probability at least
$$1 - 4se^{-cn} - 4pe^{-cn\,\frac{\min_{k \in S} g_k^2}{\sigma^2}}$$
for the median split and
$$1 - 4se^{-cn} - 4npe^{-cn\,\frac{\min_{k \in S} g_k^2}{\sigma^2}}$$
for the optimal split.

Note that, by instantiating Theorem 0.3 for linear models, we obtain the same bound as in Theorem 0.1, implying that the above bounds are tight in the linear setting. The extension to all monotone functions in Theorem 0.3 has an important theoretical consequence: since the DSTUMP algorithm is invariant under monotone transformations of the input, we can obtain the same results for any distribution of $X_{i,:}$. As a simple example, consider $X_{ij} \sim N(0, 1)$ and assume that we are interested in bounds for linear models. Define the matrix $Z$ as $Z_{ij} = F_N(X_{ij})$, where $F_N(\cdot)$ denotes the CDF of the Gaussian distribution. Since the CDF is an increasing function, running the DSTUMP algorithm with $(Z, Y)$ produces the same output as running it with $(X, Y)$. Furthermore, applying the CDF of a random variable to itself yields a uniform random variable. Therefore, the $Z_{ij}$ are i.i.d. draws from the $\mathrm{Unif}(0, 1)$ distribution. Setting $f_k(t) = \beta_k F_N^{-1}(t)$, the results of Theorem 0.3 for $(Z, Y)$ imply the same bound as Theorem 0.1. Notably, we can obtain the same sample complexity bound of $O(s \log p)$ for the Gaussian distribution as well. In the appendix, we discuss a generalization of this idea, which effectively allows us to remove the uniform distribution condition in Theorem 0.3.
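The monotone-transformation argument above can be checked directly in code. The sketch below is our illustration (reusing the hypothetical `dstump_scores` helper from earlier and SciPy only for the Gaussian CDF $F_N$); it verifies that DSTUMP produces identical scores on Gaussian inputs and on their CDF-transformed, uniform counterparts.

```python
import numpy as np
from scipy.stats import norm   # used only for the Gaussian CDF F_N

rng = np.random.default_rng(1)
n, p = 300, 50
X_gauss = rng.normal(size=(n, p))       # X_ij ~ N(0, 1)
Z = norm.cdf(X_gauss)                   # Z_ij = F_N(X_ij) ~ Unif(0, 1)

# A 2-sparse linear model on the Gaussian features plus small Gaussian noise.
Y = X_gauss[:, 0] - 2.0 * X_gauss[:, 1] + 0.1 * rng.normal(size=n)

# DSTUMP depends on each column only through its ordering, and F_N is strictly increasing,
# so the impurity scores (and hence the recovered set) coincide for (X_gauss, Y) and (Z, Y).
assert np.allclose(dstump_scores(X_gauss, Y), dstump_scores(Z, Y))
```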
### Unknown Sparsity Level

A drawback of the previous results is that they assume $|S|$ is given when, in general, this is not the case. Even if $|S|$ is not known, however, Theorem 0.3 guarantees that the active features are ranked higher than the non-active ones in $\tau$, i.e., $\tau(k) < \tau(k')$ for all $k \in S$, $k' \notin S$. In order to recover $S$, it suffices to find a threshold $\gamma$ such that $\max_{k \in S} \mathrm{imp}_k \le \gamma \le \min_{k \notin S} \mathrm{imp}_k$. To solve this, we use the so-called permutation algorithm, which is a well-known heuristic in the statistics literature (Barber and Candès 2015; Chung and Romano 2013, 2016; Fan, Feng, and Song 2011) and was discussed (without proof) by Klusowski and Tian (2021) as well.

Formally, we apply a random permutation $\sigma$ to the rows of $X$, obtaining the matrix $\sigma(X)$ where $\sigma(X)_{ij} = X_{\sigma(i),j}$. We then rerun Algorithm 1 with $\sigma(X)$ and $Y$ as input. The random permutation means that $X$ and $Y$ are decoupled from each other and, effectively, all of the features are now inactive. We therefore expect $\min_{i,t} \mathrm{imp}_i(\sigma(X), Y)$ to be close to $\min_{k \notin S} \mathrm{imp}_k(X, Y)$. Since this estimate may be somewhat conservative, we repeat the sampling and take the minimum value across these repetitions. A formal pseudocode is provided in Algorithm 2. The STUMPSCORE method is the same algorithm as Algorithm 1, with the distinction that it returns $\mathrm{imp}$ in Line 12.

**Algorithm 2: Unknown $|S|$**
Input: $X \in \mathbb{R}^{n \times p}$, $Y \in \mathbb{R}^n$. Output: $\gamma \in [\max_{k \in S} \mathrm{imp}_k, \min_{k \notin S} \mathrm{imp}_k]$.
1: for $t \in \{1, \dots, T - 1\}$ do
2:   $\sigma^{(t)} \leftarrow$ random permutation on $[n]$
3:   $\mathrm{imp}^{(t)} \leftarrow \mathrm{STUMPSCORE}(\sigma^{(t)}(X), Y)$
4: return $\min_{i,t} \mathrm{imp}^{(t)}_i$

Assuming we have used $T$ repetitions in the algorithm, the probability that $\min_{k \notin S} \mathrm{imp}_k(X, Y) \le \gamma$ is at most $\frac{1}{T}$. While we provide a formal proof in the appendix, the main intuition behind the result is that the $\mathrm{imp}^{(t)}_i$ and the $\mathrm{imp}_{k'}$ for $k' \notin S$ are all impurities corresponding to inactive features. Thus, the probability that the minimum across all of these impurities falls in $[p] \setminus S$ is at most $\frac{p - s}{Tp - s} \le \frac{1}{T}$. Ensuring that $\max_{k \in S} \mathrm{imp}_k(X, Y) \le \gamma$ involves treating the extra $(T-1)p$ impurity scores as $(T-1)p$ extra inactive features. This means that we can use the same results as in Theorem 0.3 with $p$ set to $Tp$, since our sample complexity bound is logarithmic in $p$. The formal result follows, with its proof in the appendix.

**Theorem 0.4.** In the same setting as Theorem 0.3, let $\gamma$ be the output of Algorithm 2 and set $\widehat{S}$ to be $\{k : \mathrm{imp}_k \le \gamma\}$. The probability that $\widehat{S} = S$ is at least
$$1 - \frac{1}{T} - 4se^{-cn} - 4Tpe^{-cn\,\frac{\min_{k \in S} g_k^2}{\sigma^2}}$$
for the median split and at least
$$1 - \frac{1}{T} - 4se^{-cn} - 4nTpe^{-cn\,\frac{\min_{k \in S} g_k^2}{\sigma^2}}$$
for the optimal split.

We note that if we set $T = n^c$ for some constant $c > 0$, we obtain the same $O(s \log p)$ sample complexity as before.
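A minimal sketch of the permutation threshold of Algorithm 2 is given below; this is our illustration rather than the authors' code, and it again reuses the hypothetical `dstump_scores` helper from above. The returned $\gamma$ is compared against the unpermuted impurity scores to form $\widehat{S} = \{k : \mathrm{imp}_k \le \gamma\}$.

```python
import numpy as np

def permutation_threshold(X: np.ndarray, Y: np.ndarray, T: int,
                          split: str = "median", seed: int = 0) -> float:
    """Impurity threshold gamma from T - 1 row-permuted (decoupled) copies of X."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    gamma = np.inf
    for _ in range(T - 1):
        perm = rng.permutation(n)                        # sigma^(t): random permutation of rows
        scores = dstump_scores(X[perm], Y, split=split)  # every feature is now inactive
        gamma = min(gamma, float(scores.min()))
    return gamma

# Estimate S without knowing s:
# gamma = permutation_threshold(X, Y, T=20)
# S_hat = np.flatnonzero(dstump_scores(X, Y) <= gamma)
```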
## Proofs

In this section, we prove Theorem 0.3, as it is the more general version of Theorem 0.1, and defer the remainder of the proofs to the appendix. To prove that the algorithm succeeds, we need to show that $\mathrm{imp}_k < \mathrm{imp}_{k'}$ for all $k \in S$, $k' \notin S$. We proceed by first proving an upper bound on $\mathrm{imp}_k$ for all $k \in S$.

**Lemma 0.5.** In the setting of Theorem 0.3, for any active feature $k \in S$,
$$\mathrm{imp}_k \le \widehat{\mathrm{Var}}(Y) - \frac{\min_{k \in S} g_k^2}{720}$$
with probability at least $1 - 4e^{-cn} - 4e^{-cn\,\frac{\min_{k \in S} g_k^2}{\sigma^2}}$.

Subsequently, we need to prove an analogous lower bound on $\mathrm{imp}_{k'}$ for all $k' \notin S$.

**Lemma 0.6.** In the setting of Theorem 0.3, for any inactive feature $k' \notin S$,
$$\mathrm{imp}_{k'} > \widehat{\mathrm{Var}}(Y) - \frac{\min_{k \in S} g_k^2}{720}$$
with probability at least $1 - 4e^{-cn\,\frac{\min_{k \in S} g_k^2}{\sigma^2}}$ for the median split and $1 - 4ne^{-cn\,\frac{\min_{k \in S} g_k^2}{\sigma^2}}$ for the optimal split.

Taking the union bound over all $k$, $k'$, Lemmas 0.5 and 0.6 prove the theorem, as they show that $\mathrm{imp}_k < \mathrm{imp}_{k'}$ for all $k \in S$, $k' \notin S$ with the desired probabilities.

We now focus on proving Lemma 0.5. We assume without loss of generality that $f_k$ is increasing (the case of decreasing $f_k$ follows by a symmetric argument or by mapping $X^k \to -X^k$). We further assume that $\mathbb{E}[f_k(X^k_i)] = 0$, as DSTUMP is invariant under constant shifts of the output. Finally, we assume that $n > 5$, as for $n \le 5$ the theorem's statement can be made vacuous by choosing a large $c$. We will assume $\{n_L, n_R\} = \{\lfloor n/2 \rfloor, \lceil n/2 \rceil\}$ throughout the proof; as such, our results will hold for both the optimal and median splitting criteria. As noted before, a key point for obtaining a tight bound is considering both sub-trees in the analysis instead of considering them individually. Formally, while the impurity is usually defined via variance as in (1), it has the following equivalent definition:

$$\mathrm{imp}_k = \widehat{\mathrm{Var}}(Y) - \frac{n_L n_R}{n^2}\left(\widehat{\mathbb{E}}\left[Y_L^k\right] - \widehat{\mathbb{E}}\left[Y_R^k\right]\right)^2. \tag{3}$$

The above identity is commonly used for the analysis of decision trees and their properties (Breiman et al. 1983; Li et al. 2019; Klusowski 2020; Klusowski and Tian 2021). From an analytic perspective, the key difference between (3) and (1) is that the empirical averaging is calculated before taking the square, allowing us to more simply analyze the first moment rather than the second.

Intuitively, we want to use concentration inequalities to show that $\widehat{\mathbb{E}}[Y_L^k]$ and $\widehat{\mathbb{E}}[Y_R^k]$ concentrate around their expectations and to lower bound $|\mathbb{E}[Y_L^k] - \mathbb{E}[Y_R^k]|$. This is challenging, however, as concentration results typically require an i.i.d. assumption but, as we will see, the $Y^k_{L,i}$ are not i.i.d. More formally, for each $k$, define the random variable $X_L^k \in \mathbb{R}^{n_L}$ as $(X^k_{\tau^k(1)}, \dots, X^k_{\tau^k(n_L)})^T$, and thus
$$Y^k_{L,i} = f_k(X^k_{L,i}) + \sum_{j \ne k} f_j\left(X^j_{\tau^k(i)}\right) + W_{\tau^k(i)}.$$
While the random vectors $X^{j \ne k}$ and $W$ have i.i.d. entries, $X_L^k$ was obtained by sorting the coordinates of $X^k$. Thus, its coordinates are non-identically distributed and dependent. To solve the first problem, observe that the empirical mean is invariant under permutation, and we can thus randomly shuffle the elements of $Y_L^k$ in order to obtain a vector with identically distributed coordinates. Furthermore, by De Finetti's theorem, any random vector with coordinates that are identically distributed (but not necessarily independent) can be viewed as a mixture of i.i.d. vectors, effectively solving the second problem. Formally, the following result holds.

**Lemma 0.7** (Lemma 1 in (Kazemitabar et al. 2017)). Let $\tilde{\tau} : [n_L] \to [n_L]$ be a random permutation on $[n_L]$ independent of $(X, W)$ and define $\widetilde{X}_L^k \in \mathbb{R}^{n_L}$ as $\widetilde{X}^k_{L,i} := X^k_{L,\tilde{\tau}(i)}$. The random vector $\widetilde{X}_L^k$ is distributed as a mixture of i.i.d. uniform vectors of size $n_L$ on $[0, \Theta]$, with $\Theta$ sampled from $\mathrm{Beta}(n_L + 1, n_R)$.

Defining $\widetilde{Y}_L^k \in \mathbb{R}^{n_L}$ as $\widetilde{Y}^k_{L,i} := Y^k_{L,\tilde{\tau}(i)}$, it is clear that $\widehat{\mathbb{E}}[\widetilde{Y}_L^k] = \widehat{\mathbb{E}}[Y_L^k]$, and we can therefore analyze $\widetilde{Y}_L^k$ instead of $Y_L^k$ as
$$\widetilde{Y}^k_{L,i} := f_k(\widetilde{X}^k_{L,i}) + \sum_{j \ne k} f_j\left(X^j_{\tau^k(\tilde{\tau}(i))}\right) + W_{\tau^k(\tilde{\tau}(i))},$$
which, given Lemma 0.7, can be seen as a mixture of i.i.d. random variables.

Lemma 0.7 shows that there are two sources of randomness in the distribution of $\widetilde{Y}^k_{L,i}$: the mixing variable $\Theta$ and the sub-Gaussian randomness of $\widetilde{X}_L^k \mid \Theta$ and $(X^{j \ne k}, W)$. For the second source, it is possible to use standard concentration inequalities to show that, conditioned on $\Theta$, $\widehat{\mathbb{E}}[\widetilde{Y}_L^k]$ concentrates around $\mathbb{E}[\widetilde{Y}^k_{L,1} \mid \Theta]$. We will formally do this in Lemma 0.9. Before doing so, however, we focus on the first source and how $\Theta$ affects the distribution of $\widetilde{Y}^k_{L,i}$. Since $\Theta$ is sampled from $\mathrm{Beta}(n_L + 1, n_R)$, we can use standard techniques to show that it concentrates around $\frac{1}{2}$. More formally, we can use the following lemma, the proof of which is in the appendix.

**Lemma 0.8.** If $n \ge 5$, we have $\Theta \in [\frac{1}{4}, \frac{3}{4}]$ with probability at least $1 - 2e^{-cn}$.

Given the above result, we can analyze $\widetilde{Y}_L^k$ assuming $\Theta \in [1/4, 3/4]$. In this case, we can use concentration inequalities to show that, with high probability, $\widehat{\mathbb{E}}[\widetilde{Y}_L^k]$ concentrates around $\mathbb{E}[f_k(\mathrm{Unif}(0, \Theta))]$. Since $f_k$ was assumed to be increasing, this can be further bounded by $\mathbb{E}[f_k(\mathrm{Unif}(0, \frac{3}{4}))]$. Formally, we obtain the following result.
**Lemma 0.9.** Let $k \in S$ be an active feature. For any $\theta \in [\frac{1}{4}, \frac{3}{4}]$,
$$\Pr\left[\left|\widehat{\mathbb{E}}\left[\widetilde{Y}_L^k\right] - \mathbb{E}\left[\widetilde{Y}^k_{L,1} \mid \Theta = \theta\right]\right| \ge t \,\Big|\, \Theta = \theta\right] \le 2e^{-cn\,\frac{t^2}{\sigma^2}}.$$
Furthermore, letting $f^k_{a,b}$ denote $\mathbb{E}[f_k(\mathrm{Unif}(a, b))]$,
$$\Pr\left[\widehat{\mathbb{E}}\left[Y_L^k\right] \ge f^k_{0,\frac{3}{4}} + g_k/8\right] \le 2e^{-cn} + 2e^{-cn\,\frac{g_k^2}{\sigma^2}}. \tag{4}$$

*Proof.* For ease of notation, we will fix $\theta \in [\frac{1}{4}, \frac{3}{4}]$ and let the random variables $\widehat{X}_L^k$ and $\widehat{Y}_L^k$ denote $\widetilde{X}_L^k \mid \Theta = \theta$ and $\widetilde{Y}_L^k \mid \Theta = \theta$, respectively. Recall that for all $j$, $f_j(X^j_i)$ was sub-Gaussian with parameter $\sigma_j$ by assumption. It is straightforward to show (see the appendix) that this means $f_k(\widehat{X}^k_{L,i}) - \mathbb{E}[f_k(\widehat{X}^k_{L,i})]$ is also sub-Gaussian with squared norm at most $C\sigma_k^2$. Thus,
$$\left\|\widehat{Y}^k_{L,i}\right\|_{\psi_2}^2 = \left\|f_k(\widehat{X}^k_{L,i}) + \sum_{j \ne k} f_j\left(X^j_{\tau^k(\tilde{\tau}(i))}\right) + W_{\tau^k(\tilde{\tau}(i))}\right\|_{\psi_2}^2 \overset{(i)}{=} \left\|f_k(\widehat{X}^k_{L,i}) + \sum_{j \ne k} f_j(X^j_i) + W_i\right\|_{\psi_2}^2 \overset{(ii)}{\le} \left\|f_k(\widehat{X}^k_{L,i})\right\|_{\psi_2}^2 + \sum_{j \ne k} \left\|f_j(X^j_i)\right\|_{\psi_2}^2 + \left\|W_i\right\|_{\psi_2}^2 \le C\sigma_k^2 + \sum_{j \ne k} \sigma_j^2 + \sigma_w^2 \le C\sigma^2.$$
In the above analysis, $(i)$ follows from the independence assumption on $(\widehat{X}_L^k, X^{j \ne k}, W)$ together with the i.i.d. assumption on $(X^{j \ne k}_i, W_i)$. As for $(ii)$, it follows from (P2) together with the independence assumption on $(\widehat{X}_L^k, X^{j \ne k}, W)$. Property (P3) further implies that $\|\widehat{Y}^k_{L,i} - \mathbb{E}[\widehat{Y}^k_{L,i}]\|_{\psi_2}^2$ is upper bounded by $C\sigma^2$, proving the first equation in the lemma. Now, using Hoeffding's inequality, we obtain
$$\Pr\left[\left|\widehat{\mathbb{E}}\left[\widehat{Y}_L^k\right] - \mathbb{E}\left[\widehat{Y}^k_{L,i}\right]\right| \ge g_k/8\right] \le 2e^{-cn\,\frac{g_k^2}{\sigma^2}}.$$
Using Lemma 0.8 with $\Pr(A) \le \Pr(B) + \Pr(A \mid B^C)$ for any two events $A, B$, we obtain
$$\Pr\left[\left|\widehat{\mathbb{E}}\left[Y_L^k\right] - \mathbb{E}\left[\widehat{Y}^k_{L,i}\right]\right| \ge g_k/8\right] \le 2e^{-cn} + 2e^{-cn\,\frac{g_k^2}{\sigma^2}}.$$
Note, however, that $\mathbb{E}[\widehat{Y}^k_{L,i}] = f^k_{0,\theta}$, which, as we show in the appendix, can further be upper bounded by $f^k_{0,\frac{3}{4}}$, concluding the proof.

Using the symmetry of the decision tree algorithm, we can further obtain
$$\Pr\left[\widehat{\mathbb{E}}\left[Y_R^k\right] \le f^k_{\frac{1}{4},1} - g_k/8\right] \le 2e^{-cn} + 2e^{-cn\,\frac{g_k^2}{\sigma^2}} \tag{5}$$
from (4) with the change of variables $X^k \to -X^k$ and $f_k \to -f_k$. Taking a union bound over (4) and (5), it follows that with probability at least $1 - 4e^{-cn} - 4e^{-cn\,\frac{g_k^2}{\sigma^2}}$,
$$\widehat{\mathbb{E}}\left[Y_R^k\right] - \widehat{\mathbb{E}}\left[Y_L^k\right] \ge f^k_{\frac{1}{4},1} - f^k_{0,\frac{3}{4}} - g_k/4.$$
As we show in the appendix, however, a simple application of conditional expectations implies $f^k_{\frac{1}{4},1} - f^k_{0,\frac{3}{4}} \ge \frac{g_k}{3}$. Therefore, with probability at least $1 - 4e^{-cn} - 4e^{-cn\,\frac{g_k^2}{\sigma^2}}$, we have $\widehat{\mathbb{E}}[Y_R^k] - \widehat{\mathbb{E}}[Y_L^k] \ge \frac{g_k}{12}$. Assuming $n \ge 5$, we can further conclude that $\frac{n_L n_R}{n^2} \ge \frac{1}{5}$, which together with (3) proves the lemma.

## Experimental Results

In this section, we provide further justification of our theoretical results in the form of simulations of the finite sample count for active feature recovery under different regimes, as well as the predictive power of a single sub-tree as compared to the full tree. We additionally contrast DSTUMP with the widely studied, optimal LASSO.

We first validate the result of Theorem 0.1 and consider the linear design with $p = 200$ and design matrix entries sampled i.i.d. from $U(0, 1)$ with additive Gaussian noise $N(0, 0.1)$. Concretely, we examine the optimal number of samples required to recover approximately 95% of the active features $s$. This is achieved by conducting a binary search on the number of samples to find the minimal value that recovers the desired fraction of the active feature set, averaged across 25 independent replications.
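A sketch of this simulation protocol is given below. It is illustrative only: it reuses the hypothetical `dstump_scores` helper from earlier, takes the noise scale $0.1$ as a standard deviation, and uses $\mathrm{Unif}(0, 1)$ inputs as in the text; it is not the authors' experiment code.

```python
import numpy as np

def recovery_rate(n: int, p: int, s: int, split: str, reps: int = 25, seed: int = 0) -> float:
    """Average fraction of S recovered by DSTUMP with n samples (linear design, Unif(0,1) inputs)."""
    rng = np.random.default_rng(seed)
    hits = []
    for _ in range(reps):
        X = rng.uniform(size=(n, p))
        support = rng.choice(p, size=s, replace=False)
        beta = np.zeros(p)
        beta[support] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
        Y = X @ beta + 0.1 * rng.normal(size=n)
        S_hat = np.argsort(dstump_scores(X, Y, split=split))[:s]
        hits.append(np.isin(support, S_hat).mean())
    return float(np.mean(hits))

def minimal_sample_count(p: int, s: int, split: str, target: float = 0.95,
                         lo: int = 10, hi: int = 5000) -> int:
    """Binary search for the smallest n whose average recovery rate reaches the target."""
    while lo < hi:
        mid = (lo + hi) // 2
        if recovery_rate(mid, p, s, split) >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo
```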
*Figure 1: Optimal sample count to recover 95% of the active features, where the design matrix is sampled i.i.d. from $U(-1, 1)$ or $N(0, 1)$ with additive Gaussian noise $N(0, 0.1)$, comparing three methods: DSTUMP with the optimal split, DSTUMP with the median split, and LASSO.*

In the leftmost plot of Figure 1, we plot the sample count as a function of varying sparsity levels $s \in [5, 100]$ for DSTUMP with a median split, DSTUMP with the optimal split, as well as LASSO for benchmarking (with the penalty parameter selected through standard cross-validation). By fixing $p$, we are evaluating the dependence of $n$ on the sparsity level $s$ alone. The results here validate the theoretical $O(s \log p)$ bound that nearly matches the optimal LASSO. Also of note, the number of samples required by the median split is less than that of the optimal split. Thus, in the linear setting, we see that DSTUMP with median splitting is both simpler and computationally inexpensive. This optimal bound result is repeated with Gaussian data samples in the rightmost plot of Figure 1. Notably, in this setting the optimal-split decision stumps perform better than the median-split ones, demonstrating their varied utility under different problem contexts.

We additionally reiterate that the prior work of Kazemitabar et al. (2017) attempted to simplify the analysis of the sparse recovery problem using DSTUMP by examining only the left sub-tree, which produced the non-optimal $O(s^2 \log p)$ finite sample bound. To analyze the effect of this choice, the middle plot of Figure 1 presents the optimal sample recovery count when using only the left sub-tree, subject to the additive model of Theorem 0.1. In accordance with our expectation and the previous literature's analysis, we see a clear quadratic relationship between $n$ and $s$ when fixing the feature count $p$. Overall, these simulations further validate the practicality and predictive power of decision stumps. Benchmarked against the optimal LASSO, we see a slight decrease in performance but a computational reduction and analytic simplification.

## Conclusion

In this paper, we presented a simple and consistent feature selection algorithm for the regression case with single-depth decision trees, and derived finite-sample performance guarantees in a high-dimensional sparse system. Our theoretical results demonstrate that this very simple class of weak learners is nearly optimal compared to the gold standard LASSO. We have provided strong theoretical evidence for the success of binary decision tree based methods in practice and provided a framework on which to extend the analysis of these structures to arbitrary height, a potential direction for future work.

## Acknowledgements

This work is partially supported by DARPA QuICC, NSF AF:Small #2218678, and NSF AF:Small #2114269. Max Springer was supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE 1840340. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

## References

Barber, R. F.; and Candès, E. J. 2015. Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5): 2055–2085.
Breiman, L. 2001. Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author). Statistical Science, 16(3): 199–231.
Breiman, L.; Friedman, J. H.; Olshen, R. A.; and Stone, C. J. 1983. Classification and Regression Trees.
Bureau, A.; Dupuis, J.; Falls, K.; Lunetta, K.; Hayward, B.; Keith, T.; and Eerdewegh, P. 2005. Identifying SNPs predictive of phenotype using random forests. Genetic Epidemiology, 28: 171–182.
Chung, E.; and Romano, J. P. 2013. Exact and asymptotically robust permutation tests. The Annals of Statistics, 41(2): 484–507.
Chung, E.; and Romano, J. P. 2016. Multivariate and multiple permutation tests. Journal of Econometrics, 193(1): 76–91.
Duroux, R.; and Scornet, E. 2018. Impact of subsampling and tree depth on random forests. ESAIM: PS, 22: 96–128.
Fan, J.; Feng, Y.; and Song, R. 2011. Nonparametric independence screening in sparse ultra-high-dimensional additive models. Journal of the American Statistical Association, 106(494): 544–557.
Fan, J.; and Lv, J. 2006. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society Series B: Statistical Methodology, 70: 849–911.
Friedman, J. H. 2001. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29: 1189–1232.
Han, C.; Rao, N. S.; Sorokina, D.; and Subbian, K. 2020. Scalable Feature Selection for (Multitask) Gradient Boosted Trees. arXiv, abs/2109.01965.
Huynh-Thu, V. A.; Irrthum, A.; Wehenkel, L.; and Geurts, P. 2012. Inferring Regulatory Networks from Expression Data Using Tree-Based Methods. PLOS ONE, 5(9): 1–10.
Kazemitabar, S. J.; Amini, A. A.; Bloniarz, A.; and Talwalkar, A. S. 2017. Variable Importance Using Decision Trees. In NIPS.
Klusowski, J. 2021. Sharp Analysis of a Simple Model for Random Forests. In Banerjee, A.; and Fukumizu, K., eds., Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, 757–765. PMLR.
Klusowski, J. M. 2020. Sparse learning with CART. arXiv, abs/2006.04266.
Klusowski, J. M.; and Tian, P. M. 2021. Nonparametric Variable Screening with Optimal Decision Stumps. In AISTATS.
Lafferty, J. D.; and Wasserman, L. A. 2008. Rodeo: Sparse, greedy nonparametric regression. Annals of Statistics, 36: 28–63.
Li, L.; and Wang, J. 2018. Research on feature importance evaluation of wireless signal recognition based on decision tree algorithm in cognitive computing. Cognitive Systems Research, 52: 882–890.
Li, X.; Wang, Y.; Basu, S.; Kumbier, K.; and Yu, B. 2019. A Debiased MDI Feature Importance Measure for Random Forests. arXiv, abs/1906.10845.
Lunetta, K. L.; Hayward, L. B.; Segal, J.; and Eerdewegh, P. V. 2004. Screening large-scale association study data: exploiting interactions using random forests. BMC Genetics, 5(1): 32.
Qian, H.; Wang, B.; Yuan, M.; Gao, S.; and Song, Y. 2022. Financial distress prediction using a corrected feature selection measure and gradient boosted decision tree. Expert Systems with Applications, 190: 116202.
Vershynin, R. 2018. High-Dimensional Probability: An Introduction with Applications in Data Science.
Wainwright, M. J. 2009a. Information-Theoretic Limits on Sparsity Recovery in the High-Dimensional and Noisy Setting. IEEE Transactions on Information Theory, 55: 5728–5741.
Wainwright, M. J. 2009b. Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory.
Xia, Y.; Liu, C.; Li, Y.; and Liu, N. 2017. A boosted decision tree approach using Bayesian hyper-parameter optimization for credit scoring. Expert Systems with Applications, 78: 225–241.
Xu, Z.; Huang, G.; Weinberger, K. Q.; and Zheng, A. X. 2014. Gradient boosted feature selection. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 522–531.