# Sparse Learning with CART

Jason M. Klusowski
Department of Operations Research & Financial Engineering, Princeton University, Princeton, New Jersey 08544
jason.klusowski@princeton.edu

Decision trees with binary splits are popularly constructed using Classification and Regression Trees (CART) methodology. For regression models, this approach recursively divides the data into two near-homogeneous daughter nodes according to a split point that maximizes the reduction in sum of squares error (the impurity) along a particular variable. This paper aims to study the statistical properties of regression trees constructed with CART methodology. In doing so, we find that the training error is governed by the Pearson correlation between the optimal decision stump and the response data in each node, which we bound by constructing a prior distribution on the split points and solving a nonlinear optimization problem. We leverage this connection between the training error and Pearson correlation to show that CART with cost-complexity pruning achieves an optimal complexity/goodness-of-fit tradeoff when the depth scales with the logarithm of the sample size. Data dependent quantities, which adapt to the dimensionality and latent structure of the regression model, are seen to govern the convergence rates of the prediction error.

## 1 Introduction

Decision trees are the building blocks of some of the most important and powerful algorithms in statistical learning. For example, ensembles of decision trees are used for some bootstrap aggregated prediction rules (e.g., bagging [2] and random forests [3]). In addition, each iteration of gradient tree boosting (e.g., TreeBoost [7]) fits the pseudo-residuals with decision trees as base learners. From an applied perspective, decision trees have an appealing interpretability and are accompanied by a rich set of analytic and visual diagnostic tools. These attributes make tree-based learning particularly well-suited for the applied sciences and related disciplines, which may rely heavily on understanding and interpreting the output from the training algorithm. However, as with many aspects of statistical learning, good empirical performance often comes at the expense of rigor, and tree-structured learning with decision trees is no exception: statistical guarantees for popular variants, i.e., those that are actually used in practice, are hard to find. Indeed, the recursive manner in which decision trees are constructed makes them unamenable to analysis, especially when the split protocol involves both the input and output data. Despite these challenges, we take a step forward in advancing the theory of decision trees and aim to tackle the following fundamental question: When do decision trees adapt to the sparsity of a predictive model?

To make our work informative to the applied user of decision trees, we strive to make the least departure from practice and therefore focus specifically on Classification and Regression Tree (CART) [4] methodology, by far the most popular for regression and classification problems. With this methodology, the tree construction importantly depends on both the input and output data and is therefore data dependent. This aspect lends itself favorably to the empirical performance of CART, but poses unique mathematical challenges.
It is perhaps not surprising then that, despite the widespread use of CART, only a small handful of papers have studied its theoretical properties. For example, [15] study the asymptotic properties of CART in a fixed dimensional regime, en route to establishing consistency of Breiman's random forests for additive regression models. Another notable paper [8] provides oracle-type inequalities for the CART pruning algorithm proposed by [4], though the theory does not imply guarantees for out-of-sample prediction. What the existing literature currently lacks, however, is a more fine-grained analysis that reveals the unique advantages of tree learning with CART over other unstructured regression procedures, like vanilla k-NN or other kernel based estimators. Filling this theoretical gap, our main message is that, in certain settings, CART can identify low dimensional, latent structure in the data and adapt accordingly. We illustrate the adaptive properties of CART when the model is sparse, namely, when the output depends only on a small, unknown subset of the input variables, thereby circumventing the curse of dimensionality.

Arguably the most difficult technical aspect of studying decision trees (and, for that matter, any adaptive partitioning-based predictor) is understanding their approximation error, or at least pinning down conditions on the data that enable such an endeavor. Indeed, most existing convergence results for decision trees or ensembles thereof bound the expected (over the training data) prediction error by the size (i.e., the diameter) of the terminal nodes and show that the diameters vanish with the depth of the tree, ensuring that the approximation error does so also [6, 16]. Others [15] control the variation of the regression function inside the terminal nodes, without explicitly controlling their diameters, though the effect is similar. While these techniques can be useful for proving consistency statements, they are not generally delicate enough to capture the adaptive properties of the tree, and they often require strong assumptions about the tree construction. To address this shortcoming, we instead use the fact that the prediction error is, with high probability (over the training data), bounded by the training error plus a complexity term. One of our crucial insights is that we can avoid using the node diameters as a proxy for the approximation error and, instead, directly bound the training error in terms of data dependent quantities (like the Pearson correlation coefficient) that are more transparent and interpretable, thereby facilitating our analysis and allowing us to prove more fine-grained results.

### 1.1 Learning setting

Let us now describe the learning setting and framework that we will operate under for the rest of the paper. For clarity and ease of exposition, we focus specifically on regression trees, where the target outcome is a continuous real value. We assume the training data is $\mathcal{D}_n = \{(\mathbf{X}_1, Y_1), \dots, (\mathbf{X}_n, Y_n)\}$, where $(\mathbf{X}_i, Y_i)$, $1 \le i \le n$, are i.i.d. with common joint distribution $\mathbb{P}_{\mathbf{X}, Y}$. Here, $\mathbf{X}_i \in [0,1]^d$ is the input and $Y_i \in \mathbb{R}$ is a continuous response (or output) variable. A generic pair of variables will be denoted by $(\mathbf{X}, Y)$. A generic coordinate of $\mathbf{X}$ will be denoted by $X$, unless there is a need to highlight the dependence on the $j$th coordinate index, denoted by $X_j$, or additionally on the $i$th data point, denoted by $X_{ij}$.
Using the squared error loss $L(Y, Y') = (Y - Y')^2$ as the performance metric, our goal is to predict $Y$ at a new point $\mathbf{X} = \mathbf{x}$ via a tree structured prediction rule $\hat{Y}(\mathbf{x}) = \hat{Y}(\mathbf{x}; \mathcal{D}_n)$. The training error and mean squared prediction error are, respectively,
$$\mathrm{err}(\hat{Y}) := \frac{1}{n}\sum_{i=1}^n (Y_i - \hat{Y}(\mathbf{X}_i))^2 \quad\text{and}\quad \mathrm{Err}(\hat{Y}) := \mathbb{E}_{(\mathbf{X}', Y')}\big[(Y' - \hat{Y}(\mathbf{X}'))^2\big],$$
where $(\mathbf{X}', Y')$ denotes an independent copy of $(\mathbf{X}, Y)$. For data $\{(\mathbf{X}_1, U_1, V_1), \dots, (\mathbf{X}_n, U_n, V_n)\}$, we let
$$\hat{\rho}(U, V \mid \mathbf{X} \in A) := \frac{\sum_{\mathbf{X}_i \in A}(U_i - \bar{U})(V_i - \bar{V})}{\sqrt{\sum_{\mathbf{X}_i \in A}(U_i - \bar{U})^2}\,\sqrt{\sum_{\mathbf{X}_i \in A}(V_i - \bar{V})^2}}$$
($A$ is a subset of the input space, $N = \#\{\mathbf{X}_i \in A\}$, $\bar{U} = \frac{1}{N}\sum_{\mathbf{X}_i \in A} U_i$, and $\bar{V} = \frac{1}{N}\sum_{\mathbf{X}_i \in A} V_i$) denote the empirical Pearson product-moment correlation coefficient, given $\mathbf{X} \in A$, and let $\rho(U, V \mid \mathbf{X} \in A)$ be its infinite sample counterpart. If $U_i = g(X_{ij})$ for a univariate function $g(\cdot)$ of a coordinate $X_j$, we write $\hat{\rho}(g(X_j), V \mid \mathbf{X} \in A)$ or $\rho(g(X_j), V \mid \mathbf{X} \in A)$. For brevity, we let $\hat{\sigma}^2_Y$ denote the sample variance of the response values $Y_1, Y_2, \dots, Y_n$ in the training data. The $r$th derivative of a real valued function $g(\cdot)$ is denoted by $g^{(r)}(\cdot)$. Proofs of all forthcoming results are given in the supplement.

## 2 Preliminaries

As mentioned earlier, regression trees are commonly constructed with Classification and Regression Tree (CART) [4] methodology. The primary objective of CART is to find partitions of the input variables that produce minimal variance of the response values (i.e., minimal sum of squares error with respect to the average response values). Because choosing the best overall partition is computationally infeasible, CART decision trees are greedily grown with a procedure in which binary splits recursively partition the tree into near-homogeneous terminal nodes. An effective binary split partitions the data from the parent tree node into two daughter nodes so that the data in each daughter node become more homogeneous, as measured by a reduction in impurity. The CART algorithm comprises two elements: a growing procedure and a pruning procedure. The growing procedure constructs from the data a maximal binary tree $T_{\max}$ by the recursive partitioning scheme; the pruning procedure selects, among all the subtrees of $T_{\max}$, a sequence of subtrees that greedily optimize a cost function.

### 2.1 Growing the tree

Let us now describe the tree construction algorithm in additional detail. Consider splitting a regression tree $T$ at a node $t$. Let $s$ be a candidate split point for a generic variable $X$ that divides $t$ into left and right daughter nodes $t_L$ and $t_R$ according to whether $X \le s$ or $X > s$. These two nodes will be denoted by $t_L = \{\mathbf{X} \in t : X \le s\}$ and $t_R = \{\mathbf{X} \in t : X > s\}$. As mentioned previously, a tree is grown by recursively reducing node impurity. Impurity for regression trees is determined by the within node sample variance
$$\hat{\Delta}(t) := \widehat{\mathrm{VAR}}(Y \mid \mathbf{X} \in t) = \frac{1}{N(t)}\sum_{\mathbf{X}_i \in t}(Y_i - \bar{Y}_t)^2, \qquad (1)$$
where $\bar{Y}_t = \frac{1}{N(t)}\sum_{\mathbf{X}_i \in t} Y_i$ is the sample mean for $t$ and $N(t) = \#\{\mathbf{X}_i \in t\}$ is the number of data points in $t$. Similarly, the within node sample variance for a daughter node is
$$\hat{\Delta}(t_L) = \frac{1}{N(t_L)}\sum_{\mathbf{X}_i \in t_L}(Y_i - \bar{Y}_{t_L})^2, \qquad \hat{\Delta}(t_R) = \frac{1}{N(t_R)}\sum_{\mathbf{X}_i \in t_R}(Y_i - \bar{Y}_{t_R})^2,$$
where $\bar{Y}_{t_L}$ is the sample mean for $t_L$ and $N(t_L)$ is the sample size of $t_L$ (similar definitions apply to $t_R$). The parent node $t$ is split into two daughter nodes using the variable and split point producing the largest decrease in impurity.
For a candidate split $s$ for $X$, this decrease in impurity equals [4, Definition 8.13]
$$\hat{\Delta}(s, t) := \hat{\Delta}(t) - \big[\hat{P}(t_L)\hat{\Delta}(t_L) + \hat{P}(t_R)\hat{\Delta}(t_R)\big], \qquad (2)$$
where $\hat{P}(t_L) = N(t_L)/N(t)$ and $\hat{P}(t_R) = N(t_R)/N(t)$ are the proportions of data points in $t$ that are contained in $t_L$ and $t_R$, respectively. The tree $T$ is grown recursively by finding the variable $\hat{\jmath}$ and split point $\hat{s} = \hat{s}_{\hat{\jmath}}$ that maximize $\hat{\Delta}(s, t)$. Note that, for notational brevity, we suppress the dependence on the input coordinate index $j$. The output $\hat{Y}(T)$ of the tree at a terminal node $t$ is the least squares predictor, namely, $\hat{Y}(T, \mathbf{x}) = \bar{Y}_t$ for all $\mathbf{x} \in t$.

### 2.2 Pruning the tree

The CART growing procedure stops once a maximal binary tree $T_{\max}$ is grown (i.e., when each terminal node contains a single data point). However, $\hat{Y}(T_{\max})$ is generally not a good predictor, since it will tend to overfit the data and therefore generalize poorly to unseen data. This effect can be mitigated by complexity regularization. Removing portions of the overly complex tree (i.e., pruning) is one way of reducing its complexity and improving performance. We will now describe such a procedure. We say that $T$ is a pruned subtree of $T'$, written $T \preceq T'$, if $T$ can be obtained from $T'$ by iteratively merging any number of its internal nodes. A pruned subtree of $T_{\max}$ is defined as any binary subtree of $T_{\max}$ having the same root node as $T_{\max}$. The number of terminal nodes in a tree $T$ is denoted by $|T|$. Given a subtree $T$ and temperature $\alpha > 0$, we define the penalized cost function
$$R_\alpha(\hat{Y}(T)) := \mathrm{err}(\hat{Y}(T)) + \alpha|T|. \qquad (3)$$
As shown in [4, Section 10.2], the smallest minimizing subtree for the temperature $\alpha$,
$$\hat{T} \in \arg\min_{T \preceq T_{\max}} R_\alpha(\hat{Y}(T)),$$
exists and is unique (smallest in the sense that if $T' \in \arg\min_{T \preceq T_{\max}} R_\alpha(\hat{Y}(T))$, then $\hat{T} \preceq T'$). For a fixed $\alpha$, the optimal subtree $\hat{T}$ can be found efficiently by weakest link pruning [4, 9], i.e., by successively collapsing the internal node that increases $\mathrm{err}(\hat{Y}(T))$ the least, until we arrive at the single-node tree consisting of the root node. Good values of $\alpha$ can be selected using cross-validation, for example, though analyzing the effect of such a procedure is outside the scope of the present paper. Our first result shows that, with high probability, the test error of the pruned tree $\hat{T}$ on new data is bounded by a multiple of $\min_{T \preceq T_{\max}} R_\alpha(\hat{Y}(T))$ plus lower order terms.

Theorem 1. Let $\hat{T}$ be the smallest minimizer of (3). Suppose $Y = f(\mathbf{X})$, $B = \sup_{\mathbf{x}}|f(\mathbf{x})| < \infty$, $n > (d+1)/2$, and $\alpha \ge 27B^2(d+1)\log(2en/(d+1))/n$. Then, with probability at least $1 - \delta$ over the training sample $\mathcal{D}_n$,
$$\mathrm{Err}(\hat{Y}(\hat{T})) \le 4\min_{T \preceq T_{\max}} R_\alpha(\hat{Y}(T)) + \frac{54B^2\log(2/\delta)}{n}.$$
Similar bounds hold for the excess risk in binary classification, i.e., $Y \in \{0, 1\}$, since, in this case, the squared error impurity (1) equals one-half of the so-called Gini impurity used for classification trees (which output the majority vote in each terminal node). See also [14] for results of a similar flavor under a related complexity penalty.

In what follows, we let $T_K \preceq T_{\max}$ denote a fully grown binary tree of depth $K = \Theta(\log_2(n))$; i.e., we stop splitting a node if (1) the node contains a single data point, (2) all response values in the node are the same, or (3) a depth of $K$ is reached, whichever occurs sooner. We also let $\hat{T}$ be the smallest minimizer of the cost function (3) with temperature $\alpha = \Theta((d/n)\log(n/d))$.
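To make the greedy growing step of Section 2.1 concrete, here is a minimal sketch (not the author's code; the function names and toy data are illustrative) that scores every candidate split with the impurity decrease (2) and returns its maximizer. In practice, the same construction, together with cost-complexity pruning as in (3), is available in standard libraries; for instance, scikit-learn's `DecisionTreeRegressor` exposes the temperature $\alpha$ through its `ccp_alpha` parameter.

```python
import numpy as np

def impurity(y):
    """Within-node sample variance, eq. (1)."""
    return np.var(y) if len(y) > 0 else 0.0

def impurity_decrease(x, y, s):
    """Decrease in impurity, eq. (2), for splitting at s along one coordinate."""
    left, right = y[x <= s], y[x > s]
    p_left, p_right = len(left) / len(y), len(right) / len(y)
    return impurity(y) - (p_left * impurity(left) + p_right * impurity(right))

def best_split(X, y):
    """Greedy CART step: the coordinate j and split s maximizing the impurity decrease."""
    best = (None, None, -np.inf)
    for j in range(X.shape[1]):
        # candidate splits: midpoints between consecutive order statistics of X_j
        xs = np.sort(np.unique(X[:, j]))
        for s in (xs[:-1] + xs[1:]) / 2:
            dec = impurity_decrease(X[:, j], y, s)
            if dec > best[2]:
                best = (j, s, dec)
    return best  # (j_hat, s_hat, Delta_hat)

# toy usage
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 5))
y = (X[:, 0] > 0.5).astype(float) + 0.1 * rng.normal(size=200)
print(best_split(X, y))  # should pick coordinate 0 with a split near 0.5
```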
## 3 Bounding the Training Error

In the previous section, Theorem 1 showed that, with high probability, the test error is bounded by a multiple of the cost function (3) at its minimum (plus lower order terms). Since the cost function is defined as the training error plus a penalty term, the next step in our course of study is to understand how the training error of CART behaves.

### 3.1 Splitting criterion and Pearson correlation

Before we begin our analysis of the training error, we first digress back to the tree construction algorithm and give an alternative characterization of the objective. The use of the sum of squares impurity criterion $\hat{\Delta}(s, t)$ with averages in the terminal nodes permits further simplifications of the formula (2) above. For example, using the sum of squares decomposition, $\hat{\Delta}(s, t)$ can equivalently be expressed as [4, Section 9.3]
$$\hat{\Delta}(s, t) = \hat{P}(t_L)\hat{P}(t_R)\big(\bar{Y}_{t_L} - \bar{Y}_{t_R}\big)^2, \qquad (4)$$
which is commonly used for its computational appeal; that is, one can find the best split for a continuous variable with just a single pass over the data, without the need to calculate multiple averages and sums of squared differences about these averages, as required with (2). Yet another way to view $\hat{\Delta}(s, t)$, which does not appear to have been considered in past literature and will prove to be useful for our purposes, is via its equivalent representation as $\hat{\Delta}(t)\,\hat{\rho}^2(\widetilde{Y}, Y \mid \mathbf{X} \in t)$, where
$$\hat{\rho}(\widetilde{Y}, Y \mid \mathbf{X} \in t) := \frac{\sum_{\mathbf{X}_i \in t}(\widetilde{Y}_i - \bar{Y}_t)(Y_i - \bar{Y}_t)}{\sqrt{\sum_{\mathbf{X}_i \in t}(\widetilde{Y}_i - \bar{Y}_t)^2}\,\sqrt{\sum_{\mathbf{X}_i \in t}(Y_i - \bar{Y}_t)^2}} \qquad (5)$$
is the Pearson product-moment correlation coefficient between the decision stump
$$\widetilde{Y} := \bar{Y}_{t_L}\mathbf{1}(X \le s) + \bar{Y}_{t_R}\mathbf{1}(X > s) \qquad (6)$$
and the response variable $Y$ within $t$ (for the proof, see Lemma A.1 in the supplement). It should be stressed that this alternative characterization of the splitting criterion (2) in terms of a correlation is unique to the squared error impurity with (constant) averages in the terminal nodes. Hence, at each node, CART seeks the decision stump most correlated in magnitude with the response variable along a particular variable, i.e.,
$$\hat{s} \in \arg\max_s \hat{\Delta}(s, t) = \arg\max_s \hat{\rho}(\widetilde{Y}, Y \mid \mathbf{X} \in t). \qquad (7)$$
Equivalently, CART splits along variables with decision stumps that are most correlated with the residuals $Y_i - \bar{Y}_t$ of the current fit $\bar{Y}_t$. As with $r^2$ for simple linear regression, the squared correlation $\hat{\rho}^2(\widetilde{Y}, Y \mid \mathbf{X} \in t)$ equals the coefficient of determination $R^2$, in the sense that it describes the fraction of the variance in $Y$ that is explained by a decision stump $\widetilde{Y}$ in $X$, since
$$\hat{\rho}^2(\widetilde{Y}, Y \mid \mathbf{X} \in t) = \hat{\Delta}(s, t)/\hat{\Delta}(t) = 1 - \frac{\frac{1}{N(t)}\sum_{\mathbf{X}_i \in t}(Y_i - \widetilde{Y}_i)^2}{\frac{1}{N(t)}\sum_{\mathbf{X}_i \in t}(Y_i - \bar{Y}_t)^2}.$$

Definition 1. We let $\hat{Y}$ denote a decision stump $\widetilde{Y}$ with an optimal direction $\hat{\jmath} \in \arg\max_{j=1,2,\dots,d} \hat{\Delta}(\hat{s}_j, t)$ and corresponding optimal split $\hat{s} = \hat{s}_{\hat{\jmath}}$.

We now introduce a data dependent quantity that will play a central role in determining the rates of convergence of the prediction error. For a univariate function class $\mathcal{G}$, we let $\hat{\rho}_{\mathcal{G}}$ be the largest Pearson correlation between the response data $Y$ and a function in $\mathcal{G}$ of a single input coordinate for a worst-case node, i.e.,
$$\hat{\rho}_{\mathcal{G}} := \min_t\ \sup_{g(\cdot) \in \mathcal{G},\ j=1,2,\dots,d} |\hat{\rho}(g(X_j), Y \mid \mathbf{X} \in t)|, \qquad (8)$$
where the minimum runs over all internal nodes $t$ in $T_K$. We will specifically focus on classes $\mathcal{G}$ that consist of decision stumps and, more generally, monotone functions.
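Before moving on, here is a quick numerical check of the identity $\hat{\Delta}(s,t) = \hat{\Delta}(t)\,\hat{\rho}^2(\widetilde{Y}, Y \mid \mathbf{X} \in t)$ from Lemma A.1 (a minimal, self-contained sketch on synthetic data; the candidate splits, signal, and noise level are arbitrary choices).

```python
import numpy as np

def stump_correlation(x, y, s):
    """Empirical Pearson correlation between the decision stump (6) and the response."""
    left = x <= s
    stump = np.where(left, y[left].mean(), y[~left].mean())
    return np.corrcoef(stump, y)[0, 1]

def impurity_decrease(x, y, s):
    """Decrease in impurity (2) for a split at s."""
    left, right = y[x <= s], y[x > s]
    return np.var(y) - (len(left) * np.var(left) + len(right) * np.var(right)) / len(y)

rng = np.random.default_rng(1)
x = rng.uniform(size=500)
y = np.sin(3 * x) + 0.2 * rng.normal(size=500)
for s in (0.2, 0.5, 0.8):
    lhs = impurity_decrease(x, y, s)                    # Delta_hat(s, t)
    rhs = np.var(y) * stump_correlation(x, y, s) ** 2   # Delta_hat(t) * rho_hat^2
    print(np.isclose(lhs, rhs))                         # True for every candidate split
```

The two sides agree for every candidate split, so maximizing the impurity decrease and maximizing the squared stump/response correlation are the same criterion, which is how (7) is used in the sequel.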
### 3.2 Location of splits and Pearson correlation

Having already revealed the intimate role that the correlation (5) between the decision stump and the response values plays in the tree construction, it is instructive to explore its relationship with the location of the splits. In order to study this cleanly, let us for the moment work in an asymptotic data setting to determine the coordinates to split and their split points, i.e.,
$$\hat{\Delta}(s, t) \xrightarrow[\;n\to\infty\;]{} \Delta(s, t) := \Delta(t) - \big[P(t_L)\Delta(t_L) + P(t_R)\Delta(t_R)\big], \qquad (9)$$
where quantities without hats are the population level counterparts of the empirical quantities defined previously in (2). A decision stump (6) with an optimal theoretical direction $j^*$ and corresponding optimal theoretical split $s^* = s^*_{j^*}$ is denoted by $\hat{Y}^*$. Now, if the number of data points within $t$ is large and $\Delta(s, t)$ has a unique global maximum, then we can expect $\hat{s} \approx s^*$ (via an empirical process argument), and hence the infinite sample setting is a good approximation to CART with empirical splits, giving us some insights into its dynamics. Indeed, if $s^*$ is unique, [10, Theorem 2] shows that $\hat{s}$ converges in probability to $s^*$. With additional assumptions, one can go even further and characterize the rate of convergence. For example, [5, Section 3.4.2] and [1] provide cube root asymptotics for $\hat{s}$, i.e., $n^{1/3}(\hat{s} - s^*)$ converges in distribution.

Each node $t$ is a Cartesian product of intervals. As such, the interval along variable $X$ in $t$ is denoted by $[a, b]$, where $a < b$. The next theorem characterizes the relationship between an optimal theoretical split $s^*$ and the infinite sample correlation $\rho(\hat{Y}^*, Y \mid \mathbf{X} \in t) \overset{\mathrm{a.s.}}{:=} \lim_n \hat{\rho}(\hat{Y}^*, Y \mid \mathbf{X} \in t)$ for a deterministic node $t$ (the limit exists by the uniform law of large numbers). The proof is based on the first-order optimality condition, namely, $\frac{\partial}{\partial s}\Delta(s, t)\big|_{s = s^*} = 0$.

Theorem 2. Suppose $\mathbf{X}$ is uniformly distributed on $[0,1]^d$ and $\Delta(s^*, t) > 0$. For a deterministic parent node $t$, an optimal theoretical split $s^* \in [a, b]$ along variable $X$ has the form
$$s^* = \frac{a+b}{2} \pm \frac{b-a}{2}\sqrt{\frac{v}{v + \rho^2(\hat{Y}^*, Y \mid \mathbf{X} \in t)}}, \qquad (10)$$
where $v = \dfrac{(\mathbb{E}[Y \mid \mathbf{X} \in t,\ X = s^*] - \mathbb{E}[Y \mid \mathbf{X} \in t])^2}{\mathrm{VAR}(Y \mid \mathbf{X} \in t)}$.

Expression (10) in Theorem 2 reveals that an optimal theoretical split $s^*$ is a perturbation of the median $(a+b)/2$ of the conditional distribution $X \mid \mathbf{X} \in t$, where the gap is governed by the correlation $\rho(\hat{Y}^*, Y \mid \mathbf{X} \in t)$. These correlations control the local and directional granularity of the partition of the input domain. Splits along input coordinates that contain a strong signal, i.e., $\rho(\hat{Y}^*, Y \mid \mathbf{X} \in t)$ bounded away from zero, tend to be further away from the parent node edges, thereby producing node side lengths that are on average narrower. At the other extreme, the correlation is weakest when there is no signal in the splitting direction or when the response values in the node are not fit well by a decision stump, yielding either
$$s^* \approx a + (b-a)\,\rho^2(\hat{Y}^*, Y \mid \mathbf{X} \in t)/(4v) \quad\text{or}\quad s^* \approx b - (b-a)\,\rho^2(\hat{Y}^*, Y \mid \mathbf{X} \in t)/(4v),$$
and hence the predicted output in one of the daughter nodes does not change by much. For example, if $Y = g(X)$ is a sinusoidal waveform with large frequency $w$ (not fit well by a single decision stump) and $t$ is the root node $[0,1]^d$, then $v = \Theta(1)$ and $\rho(\hat{Y}^*, Y \mid \mathbf{X} \in t) = \Theta(1/\sqrt{w})$, and hence by (10), either $s^* = \Theta(1/w)$ or $s^* = 1 - \Theta(1/w)$ (see Lemma A.2 in the supplement). This phenomenon, where optimal splits concentrate at the endpoints of the node along noisy directions, has been dubbed the end-cut preference in the literature and has been known empirically since the inception of CART [10], [4, Section 11.8]. The theory above is also consistent with empirical studies on the local adaptivity of Breiman's random forests, which use CART [12, Section 4].
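The end-cut preference is easy to see numerically by maximizing the population criterion (9) on the root node. The sketch below (a quadrature-grid approximation of the population quantities; the frequency 20 and the grid resolutions are arbitrary choices) recovers a central split for a monotone signal and a near-endpoint split for a high-frequency sinusoid, as Theorem 2 predicts.

```python
import numpy as np

def population_split(g, grid_size=20_000, candidates=None):
    """Approximate the population-optimal split of the root node [0, 1] for Y = g(X),
    X ~ Uniform[0, 1], by maximizing Delta(s) = P(t_L) P(t_R) (E[Y|X<=s] - E[Y|X>s])^2
    over a fine grid of candidate splits (a quadrature stand-in for (9))."""
    x = (np.arange(grid_size) + 0.5) / grid_size  # dense grid standing in for X
    y = g(x)
    if candidates is None:
        candidates = np.linspace(0.01, 0.99, 981)
    best_s, best_delta = None, -np.inf
    for s in candidates:
        left = x <= s
        p = left.mean()
        delta = p * (1 - p) * (y[left].mean() - y[~left].mean()) ** 2
        if delta > best_delta:
            best_s, best_delta = s, delta
    return best_s

print(population_split(lambda x: x))                           # monotone signal: split near 0.5
print(population_split(lambda x: np.sin(2 * np.pi * 20 * x)))  # frequency w = 20: split near an endpoint
```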
### 3.3 Training error and Pearson correlation

In addition to determining the location of the splits, the correlation is also directly connected to the training error. Intuitively, the training error should be small when CART finds decision stumps that have strong correlation with the response values in each node. More precisely, the following lemma reveals the importance of the correlation (5) in controlling the training error. It shows that each time a node $t$ is split, the training error in $t$ is reduced by a constant factor, namely, $\exp(-\hat{\rho}^2(\hat{Y}, Y \mid \mathbf{X} \in t))$ or, uniformly, by $\exp(-\hat{\rho}^2_{\mathcal{H}})$, where $\mathcal{H}$ is the collection of all decision stumps and $\hat{\rho}_{\mathcal{H}}$ is the quantity defined in (8). Recursing this contraction inequality over the nodes at each level of the tree leads to the conclusion that the training error should be exponentially small in the depth $K$, provided the correlation at each node is large.

Lemma 1. Almost surely,
$$\frac{1}{N(t)}\sum_{\mathbf{X}_i \in t}(Y_i - \hat{Y}_i)^2 \le \frac{1}{N(t)}\sum_{\mathbf{X}_i \in t}(Y_i - \bar{Y}_t)^2\,\exp\!\big(-\hat{\rho}^2(\hat{Y}, Y \mid \mathbf{X} \in t)\big), \qquad (11)$$
and
$$\mathrm{err}(\hat{Y}(T_K)) \le \hat{\sigma}^2_Y\,\exp\!\big(-K\hat{\rho}^2_{\mathcal{H}}\big), \qquad (12)$$
where $\mathcal{H}$ is the collection of all decision stumps.

### 3.4 Size of Pearson correlation via comparison inequalities

Due to the importance of the correlation in controlling the training error, it is natural to ask when it will be large. We accomplish this by studying its size relative to the correlation between the data and another, more flexible model. That is, we fit an arbitrary univariate function $g(X)$ of a generic coordinate $X$ to the data in the node and ask how large $\hat{\rho}(\hat{Y}, Y \mid \mathbf{X} \in t)$ is relative to $|\hat{\rho}(g(X), Y \mid \mathbf{X} \in t)|$. Such a relationship will enable us to conclude that if $Y$ is locally correlated with $g(X)$ in the node, then so is $Y$ with an optimal decision stump $\hat{Y}$.

Before we continue, let us mention that studying $\hat{\rho}(\hat{Y}, Y \mid \mathbf{X} \in t)$ directly is hopeless, since it is not at all straightforward to disentangle the dependence on the data. Even if this could be done and a target population level quantity could be identified, it is difficult to rely on concentration of measure when $t$ contains very few data points, a likely situation among deep nodes. Nevertheless, by the definition of $\hat{Y}$ via (7), we can construct a prior $\pi(j, s)$ on coordinates $j$ and splits $s$, and lower bound $\hat{\rho}(\hat{Y}, Y \mid \mathbf{X} \in t)$ by
$$\int \hat{\rho}(\widetilde{Y}, Y \mid \mathbf{X} \in t)\, d\pi(j, s), \qquad (13)$$
which is much less burdensome to analyze. Importantly, the prior can involve unknown quantities from the distribution of $(\mathbf{X}, Y)$. For a special choice of prior $\pi$, (13) can be further lower bounded so that
$$\hat{\rho}(\hat{Y}, Y \mid \mathbf{X} \in t) \ge \mathrm{constant} \times |\hat{\rho}(g(X), Y \mid \mathbf{X} \in t)|. \qquad (14)$$
The constant in (14) depends on $g(\cdot)$, though importantly it is invariant to the scale of $g(\cdot)$. If $g(\cdot)$ belongs to a univariate model class $\mathcal{G}$, this constant can either be studied directly for the specific $g(\cdot)$ or minimized over $g(\cdot) \in \mathcal{G}$ to yield a more insightful lower bound. For certain model classes $\mathcal{G}$, the minimization problem turns out to be equivalent to a quadratic program, and the solution can be obtained explicitly and used to prove the next set of results. Our first application of this technique shows that, despite fitting the data to a decision stump with one degree of freedom (i.e., the location of the split), CART behaves almost as if it fit the data to a monotone function with $N(t) - 1$ degrees of freedom, at the expense of a sub-logarithmic factor in $N(t)$. For example, the correlation between the response variable and the decision stump is, up to a sub-logarithmic factor, at least as strong as the correlation between the response variable and a linear or isotonic fit.

Fact 1. Almost surely, uniformly over all monotone functions $g(\cdot)$ of $X$ in the node, we have
$$\hat{\rho}(\hat{Y}, Y \mid \mathbf{X} \in t) \ge \frac{1}{\sqrt{1 + \log(2N(t))}}\,|\hat{\rho}(g(X), Y \mid \mathbf{X} \in t)|. \qquad (15)$$
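The comparison in Fact 1 can be checked on synthetic data (a minimal sketch; the monotone signal, the noise level, and the use of scikit-learn's `IsotonicRegression` as a stand-in for the best monotone fit are illustrative choices, not part of the paper):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(2)
n = 400
x = rng.uniform(size=n)
y = np.log1p(5 * x) + 0.5 * rng.normal(size=n)   # monotone signal plus noise

def best_stump_corr(x, y):
    """Correlation of the response with the best single decision stump in x."""
    best = 0.0
    for s in np.unique(x)[:-1]:
        left = x <= s
        stump = np.where(left, y[left].mean(), y[~left].mean())
        best = max(best, abs(np.corrcoef(stump, y)[0, 1]))
    return best

# correlation of the response with an isotonic (monotone) fit, a proxy for the
# supremum over monotone g in Fact 1
iso = IsotonicRegression().fit(x, y)
rho_monotone = abs(np.corrcoef(iso.predict(x), y)[0, 1])

lhs = best_stump_corr(x, y)
rhs = rho_monotone / np.sqrt(1 + np.log(2 * n))
print(lhs, rhs, lhs >= rhs)   # the stump correlation exceeds the Fact 1 lower bound
```

Here the best-stump correlation comfortably exceeds the lower bound $|\hat{\rho}(g(X), Y)|/\sqrt{1+\log(2N(t))}$, even though the stump has only one degree of freedom.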
The previous fact also suggests that CART is quite good at fitting response values that have a local, low-dimensional, monotone relationship with the input variables. Note that because correlation is merely a measure of linear association, $|\hat{\rho}(g(X), Y \mid \mathbf{X} \in t)|$ can still be large for some monotone $g(\cdot)$, even if $Y$ is not approximately monotone in one coordinate. That is, $Y$ need only be locally correlated with such a function. On the other hand, if $Y$ has no signal in $\mathbf{X}$, then [11, Lemma 1, Supplement] shows that, with high probability, $\hat{\rho}(\hat{Y}, Y \mid \mathbf{X} \in t)$ is $O(\sqrt{\log(N(t))/N(t)})$.

## 4 Main Results

In this section, we use the training error bound (12) and the device (13) for obtaining correlation comparison inequalities (à la Fact 1) to give bounds on the prediction error of CART. We first outline the high-level strategy. By Theorem 1, with high probability, the leading behavior of the prediction error $\mathrm{Err}(\hat{Y}(\hat{T}))$ is governed by $\inf_{T \preceq T_{\max}} R_\alpha(\hat{Y}(T))$, which is smaller than the minimum of $R_\alpha(\hat{Y}(T_K)) = \mathrm{err}(\hat{Y}(T_K)) + \alpha|T_K|$ over all fully grown trees $T_K$ of depth $K$ with $|T_K| \le 2^K$, i.e.,
$$\inf_{K \ge 1}\big\{\mathrm{err}(\hat{Y}(T_K)) + \alpha 2^K\big\}. \qquad (16)$$
Coupled with an informative bound on $\mathrm{err}(\hat{Y}(T_K))$, (16) can then be further bounded and solved. The proofs reveal that a good balance between the tree size and its goodness of fit occurs when $K$ is logarithmic in the sample size.

### 4.1 Asymptotic consistency rates for sparse additive models

Applying the training error bound (12) to (16) with $K = \lceil(\hat{\rho}^2_{\mathcal{H}} + \log 2)^{-1}\log(\hat{\sigma}^2_Y/\alpha)\rceil$, we have from Theorem 1 that, with probability at least $1 - \delta$,
$$\mathrm{Err}(\hat{Y}(\hat{T})) = O\Big((\hat{\sigma}^2_Y)^{\frac{\log 2}{\hat{\rho}^2_{\mathcal{H}} + \log 2}}\,\alpha^{\frac{\hat{\rho}^2_{\mathcal{H}}}{\hat{\rho}^2_{\mathcal{H}} + \log 2}} + \frac{\log(1/\delta)}{n}\Big). \qquad (17)$$
It turns out that if $\mathbf{X}$ is uniformly distributed and $Y$ is a sparse additive model with $d_0$ component functions $g_j(\cdot)$, then $\hat{\rho}^2_{\mathcal{H}}$ is asymptotically lower bounded by a constant multiple of $1/d_0$. Thus, we find from (17) that if $d_0$ is fixed, then $\lim_n \mathrm{Err}(\hat{Y}(\hat{T})) \stackrel{\mathrm{a.s.}}{=} 0$ even when the ambient dimension grows as $d = o(n)$. Note that such a statement is not possible for vanilla k-NN or other kernel based regression methods with nonadaptive weights, unless feature selection is performed beforehand. In fact, we show next that the prediction error rate that CART achieves is the same as what would be achieved by a standard kernel predictor if one had a priori knowledge of the locations of the $d_0$ relevant input variables that determine the output.

A routine computer experiment on synthetic or real data easily confirms this theory. In Fig. 1a and Fig. 1c, we generate 1000 samples from the model $Y = \sum_{j=1}^{d_0} g_j(X_j)$, where each $g_j(X_j)$ equals $\pm X_j^2$ (with alternating signs) and $\mathbf{X} \sim \mathrm{Uniform}([0,1]^d)$. In Fig. 1a, we plot the test error, averaged over 10 independent replications, of pruned CART vs. k-NN (with cross-validated $k$) as $d$ ranges from 5 to 100 with $d_0 = 5$ fixed. A similar experiment is performed in Fig. 1b on the Boston housing dataset [4, Section 8.2] ($d_0 = 10$ and $n = 506$), where we scale the inputs to be in $[0,1]$ and add $d - d_0$ noisy $\mathrm{Uniform}([0,1])$ input variables.

Figure 1: (a) Synthetic data: prediction error of CART vs. k-NN as $d$ varies. (b) Boston housing data: prediction error of CART vs. k-NN as $d$ varies. (c) Minimum correlation $\hat{\rho}_{\mathcal{H}}$ (averaged over 10 independent replications) over all nodes of the pruned CART tree as the sparsity $d_0$ varies.

According to Theorem 3, the convergence rate of CART depends primarily on the sparsity $d_0$, and therefore its performance should not be adversely affected by growing $d$.
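A minimal version of the synthetic experiment can be run with scikit-learn (a sketch under simplifying assumptions: the sample size, the `ccp_alpha` and `k` grids, and the train/test split are illustrative and differ from the exact setup behind Fig. 1):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

def simulate(n, d, d0=5, rng=None):
    """Sparse additive model: Y = sum_{j < d0} (-1)^j X_j^2 with d - d0 irrelevant inputs."""
    rng = rng or np.random.default_rng(0)
    X = rng.uniform(size=(n, d))
    y = sum((-1) ** j * X[:, j] ** 2 for j in range(d0))
    return X, y

for d in (5, 20, 50, 100):
    X, y = simulate(2000, d)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

    # pruned CART: cross-validate the cost-complexity temperature alpha (ccp_alpha)
    cart = GridSearchCV(DecisionTreeRegressor(random_state=0),
                        {"ccp_alpha": np.logspace(-5, -1, 9)}, cv=5)
    cart.fit(X_tr, y_tr)

    # k-NN with cross-validated k
    knn = GridSearchCV(KNeighborsRegressor(), {"n_neighbors": [1, 5, 10, 25, 50]}, cv=5)
    knn.fit(X_tr, y_tr)

    err = lambda m: np.mean((m.predict(X_te) - y_te) ** 2)
    print(f"d={d:3d}  CART: {err(cart):.4f}  k-NN: {err(knn):.4f}")
```

As $d$ grows with $d_0$ fixed, the CART column should stay roughly flat while the k-NN column deteriorates, which is the behavior discussed next.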
Consistent with our theory, the prediction error of CART remains stable as $d$ increases, whereas k-NN does not adapt to the sparsity. Furthermore, Fig. 1c illustrates how $\hat{\rho}^2_{\mathcal{H}}$ decays with $d_0$ when $d = 20$, thus corroborating the aforementioned $\Theta(1/d_0)$ asymptotic behavior.

Theorem 3. Suppose $\mathbf{X}$ is uniformly distributed on $[0,1]^d$ and $Y = \sum_j g_j(X_j)$ is a sparse additive model with $d_0 \le d$ smooth component functions $g_j(\cdot)$, where each function is not too locally flat in the sense that
$$\sup_x\,\inf\{r \ge 1 : g^{(r)}_j(\cdot)\ \text{exists and is continuous and nonzero at}\ x\} < \infty. \qquad (18)$$
Then there exists a constant $C > 0$ that is independent of $d_0$ such that, almost surely, $\liminf_n \hat{\rho}^2_{\mathcal{H}} \ge C/d_0$, and
$$\frac{\mathrm{Err}(\hat{Y}(\hat{T}))}{\big((d/n)\log(n/d)\big)^{\Theta(1/d_0)}} \stackrel{\mathrm{a.s.}}{=} O(1). \qquad (19)$$

Remark 1. For independent, continuous marginal input variables $X_j$, there is no loss of generality in assuming uniform distributions in Theorem 3. Indeed, CART decision trees are invariant to strictly monotone transformations of $X_j$. One such transformation is the marginal cumulative distribution function $F_{X_j}(\cdot)$, for which $F_{X_j}(X_j) \sim \mathrm{Uniform}([0,1])$, and so the problem can immediately be reduced to the uniform case.

Any nonconstant component function $g_j(\cdot)$ that admits a power series representation satisfies the hypothesis of Theorem 3, though, in general, the condition (18) accommodates functions that are not analytic or infinitely differentiable. In fact, even differentiability is not necessary: similar results hold if the $g_j(\cdot)$ are step functions, as we now show. To this end, assume that $Y = \sum_j g_j(X_j)$ is an additive model, where each component function $g_j(\cdot)$ is a bounded step function and the total number of constant pieces of $Y$ is $V$. We show in the supplement that each optimal split $\hat{s}$ in a node $t$ satisfies
$$\max\{X_{ij} \in I : \mathbf{X}_i \in t\} \le \hat{s} \le \min\{X_{ij} \in I' : \mathbf{X}_i \in t\}, \qquad (20)$$
for some direction $j$ and successive intervals $I$ and $I'$ on which $g_j(\cdot)$ is constant. For example, if $Y = c_1\mathbf{1}(X_1 < s_1) + c_2\mathbf{1}(X_2 < s_2)$, then the first split separates the data along $X_1$ at $s_1$ (resp. $X_2$ at $s_2$), and at the next level down, CART separates the data in both daughter nodes along $X_2$ at $s_2$ (resp. $X_1$ at $s_1$). Thus, in general, each empirical split perfectly separates the data in the node between two adjacent constant values of a component function. A CART decision tree grown in this way will achieve zero training error once it has at least $V$ terminal nodes, i.e., $|T| \ge V$. This is in fact the same training error that would result from the global least squares projection onto the space of all step functions with $V$ constant pieces. From Theorem 1, we immediately obtain the following performance bound, which is the optimal $O(1/n)$ parametric rate for prediction, up to a logarithmic factor in the sample size [17]. Notice that we do not make any assumptions on the input distribution.

Theorem 4. Suppose $Y = \sum_j g_j(X_j)$, where each component function $g_j(\cdot)$ is a bounded step function and the total number of constant pieces of $Y$ is $V$. Then, with probability at least $1 - \delta$,
$$\mathrm{Err}(\hat{Y}(\hat{T})) = O\Big(\frac{V d\log(n/d)}{n} + \frac{\log(1/\delta)}{n}\Big).$$
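The step-function case is easy to reproduce (a sketch on synthetic, noiseless data; the thresholds 0.3 and 0.6 and the use of scikit-learn's CART implementation are illustrative choices): the grown tree recovers the $V = 4$ constant pieces and attains zero training error, in line with (20).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
n, d = 1000, 10
X = rng.uniform(size=(n, d))
# additive step-function model; Y takes 4 values, so V = 4 constant pieces in total
y = 1.0 * (X[:, 0] < 0.3) + 2.0 * (X[:, 1] < 0.6)

tree = DecisionTreeRegressor(random_state=0).fit(X, y)
train_mse = np.mean((tree.predict(X) - y) ** 2)
print(tree.get_n_leaves(), train_mse)  # expect 4 leaves and zero training error
```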
### 4.2 Finite sample consistency rates for general sparse models

Using Fact 1, we now provide results of a similar flavor for more general regression models, under a mild assumption on the largest number of data points in a node at level $k$ in $T_K$, denoted by $N_k$. Importantly, our theory only requires that each $N(t)$ be upper bounded at every level of the tree. This condition still allows for nodes that have very few data points, which is typical of trees trained in practice. Contrast this assumption with past work on tree learning (including tree ensembles like random forests) that requires each $N(t)$ to be lower bounded [4, Section 12.2], [13, 16, 6].

Assumption 1. For some constants $a \ge 0$ and $A > 0$, the largest number of data points in a node at level $k$ in $T_K$ satisfies $N_k \le A n k^a/2^k$, for $k = 1, 2, \dots, K = \Theta(\log_2(n))$.

Recall the quantity $\hat{\rho}_{\mathcal{G}}$ defined in (8), namely, the largest correlation between the response data $Y$ and a function in $\mathcal{G}$ for a worst-case node. Our next theorem shows that if $\mathcal{M}$ is the collection of all monotone (i.e., increasing or decreasing) functions, then
$$\hat{\rho}_{\mathcal{M}} = \min_t\ \sup_{g(\cdot)\ \text{monotone},\ j=1,2,\dots,d} |\hat{\rho}(g(X_j), Y \mid \mathbf{X} \in t)|$$
governs the rate at which the training error and prediction error decrease. Both errors are small if the local monotone dependence between $\mathbf{X}$ and $Y$ is high; that is, if CART partitions the input domain into pieces where the response variable is locally monotone in a few of the input coordinates.

Theorem 5. Let $Y = f(\mathbf{X})$, where $f(\cdot)$ is a bounded function. Under Assumption 1, almost surely,
$$\mathrm{err}(\hat{Y}(T_K)) \le \hat{\sigma}^2_Y\big(1 - \hat{\rho}^2_{\mathcal{M}}\big)^{K/\log_2(4K^a A n)}.$$
Furthermore, with probability at least $1 - \delta$,
$$\mathrm{Err}(\hat{Y}(\hat{T})) = O\Big(\hat{\sigma}^2_Y\Big(\frac{\log(d\log^{2+a}(n))}{\log(n)}\Big)^{\Theta(\hat{\rho}^2_{\mathcal{M}})} + \frac{\log(1/\delta)}{n}\Big).$$

We will now argue that $\hat{\rho}_{\mathcal{M}}$ is an empirical measure of the local dimensionality of $Y$. More specifically, we argue that if CART effectively partitions the input domain so that, in each node, $Y$ is locally correlated with sparse additive models with $d_0 \le d$ monotone component functions, then $\hat{\rho}^2_{\mathcal{M}} = \Omega(1/d_0)$. To see why this assertion is true, suppose $g_1(X_1), g_2(X_2), \dots, g_d(X_d)$ is an arbitrary collection of $d$ univariate functions from $\mathcal{M}$. However, suppose that only a subset of $d_0$ of the input variables $X_1, X_2, \dots, X_d$ locally affect $Y$ in each node. Then, it can be shown (see Lemma A.3 in the supplement) that there is some node $t$ and sparse additive model $Y_0$ with $d_0$ component functions of the form $g_j(X_j)$, corresponding to the $d_0$ input variables that locally affect $Y$, such that
$$\hat{\rho}^2_{\mathcal{M}} \ge \hat{\rho}^2(Y_0, Y \mid \mathbf{X} \in t) = \Omega(1/d_0), \qquad (24)$$
almost surely. The above statement is reminiscent of Theorem 3, in which $\hat{\rho}^2_{\mathcal{H}} = \Omega(1/d_0)$ controls the convergence rate of the prediction error when the underlying regression model is additive. In contrast, however, note that (24) holds regardless of the dependence structure between the $d_0$ input coordinates that matter and the $d - d_0$ input coordinates that do not. Thus, (24) and Theorem 5 together suggest that it is possible to achieve rates of the form $(\log(d)/\log(n))^{\Theta(1/d_0)}$ in fairly general settings.

## 5 Extensions to Tree Ensembles

Key to our analysis of CART was the ability to connect the training error to the objective function of the growing procedure, as in Lemma 1. Establishing similar relationships is not as easy with trees that are constructed from bootstrap samples or random subsets of input variables. Nevertheless, we mention a few ideas for future work. Suppose $\hat{Y}(\mathcal{T}) = (1/m)\sum_{T \in \mathcal{T}} \hat{Y}(T)$ is the output of an ensemble $\mathcal{T}$ of $m$ decision trees from a random forest. By convexity of the squared error loss [3, Section 11] or [2, Section 4.1], we have $\mathrm{Err}(\hat{Y}(\mathcal{T})) \le (1/m)\sum_{T \in \mathcal{T}} \mathrm{Err}(\hat{Y}(T))$, where the prediction error is averaged with respect to the tree randomization mechanism. Using the law of total expectation, by conditioning on each realization of the (random) tree, $\mathrm{Err}(\hat{Y}(T))$ can be decomposed into quantities that involve the prediction error of a fixed (non-random) tree, to which our previous results can be applied. We will leave the exact details of these extensions for future work.
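The convexity bound quoted above is easy to check empirically for a simple bagged ensemble (a minimal sketch on synthetic data; the bootstrap scheme, tree settings, and model are illustrative and do not reproduce a full random forest):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)

def sample(n, d=10, d0=5):
    X = rng.uniform(size=(n, d))
    y = sum((-1) ** j * X[:, j] ** 2 for j in range(d0)) + 0.1 * rng.normal(size=n)
    return X, y

X, y = sample(1000)
X_test, y_test = sample(5000)

m = 25
preds = []
for _ in range(m):
    idx = rng.integers(0, len(y), len(y))          # bootstrap resample
    tree = DecisionTreeRegressor().fit(X[idx], y[idx])
    preds.append(tree.predict(X_test))
preds = np.array(preds)

err_ensemble = np.mean((preds.mean(axis=0) - y_test) ** 2)  # Err of the averaged predictor
err_average = np.mean((preds - y_test) ** 2)                # average of individual tree Errs
print(err_ensemble <= err_average)  # True, by convexity of the squared error loss
```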
## 6 Conclusion

A key strength of CART decision trees is that they can exploit local, low dimensionality of the model via a built-in, automatic dimension reduction mechanism. This is particularly useful since many real-world input/output systems are locally approximated by simple model forms with only a few variables. Adaptivity with CART is made possible by the recursive partitioning of the input space, in which optimal splits are increasingly affected by local qualities of the data as the tree is grown. To illustrate this ability, we identified settings where CART adapts to the unknown sparsity of the model. To the best of our knowledge, the consistency rates given here are the first of their kind for CART decision trees.

## Broader Impact

Who may benefit from this research. There are at least two groups of people who will benefit either directly or indirectly from this research.

1. Decision makers across a variety of domains, especially those with limited training in statistics. CART has enabled data-driven decision making in multiple high-stakes domains (e.g., business, medicine, and policy) over the past three decades. In particular, those who do not have a formal quantitative background will benefit from the intuitive and interpretable nature of CART and its quick and easy implementation.
2. Members of society who may have faced ethical/fairness concerns associated with their data and its use. As this paper has demonstrated, CART forms predictions by emphasizing variables that are more relevant to the output. In a social science context, this suggests that CART may focus more on key diagnostic information (e.g., education, income) without being influenced by potentially non-diagnostic variables that other methods may have falsely deemed relevant (e.g., gender, race).

Who may be put at a disadvantage from this research. There is no foreseeable population who may be put at a disadvantage from this research.

What are the consequences of failure of the system. Overreliance on any prediction method can have obvious, negative real-world consequences, particularly when the prediction method is prone to failure. CART suffers from a couple of pitfalls: instability (i.e., small perturbations in the training samples may significantly change the structure of an optimal tree and the consequent predictions) and difficulty in accurately approximating certain simple models, such as linear or, more generally, additive models, if given insufficient or low quality data.

Whether the task/method leverages biases in the data. While CART is not impervious to all pre-existing biases in the data (e.g., those arising from systematic measurement errors at the data collection stage), as we have shown, it is less susceptible to the presence of additional, non-diagnostic variables in the data. Consequently, CART has the potential to mitigate the negative consequences of biasing information that is inevitable with most datasets.

## Acknowledgments and Disclosure of Funding

The author is indebted to Min Xu, Minge Xie, Samory Kpotufe, Robert McCulloch, Andy Liaw, Richard Baumgartner, Regina Liu, Cun-Hui Zhang, Michael Kosorok, Jianqing Fan, and Matias Cattaneo for helpful discussions and feedback. This research was supported in part by NSF DMS #1915932 and NSF TRIPODS DATA-INSPIRE CCF #1934924.

## References

[1] Moulinath Banerjee and Ian W. McKeague. Confidence sets for split points in decision trees. The Annals of Statistics, 35(2):543–574, 2007.

[2] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

[3] Leo Breiman.
Random forests. Machine Learning, 45(1):5–32, 2001.

[4] Leo Breiman, Jerome Friedman, R. A. Olshen, and Charles J. Stone. Classification and Regression Trees. Chapman and Hall/CRC, 1984.

[5] Peter Bühlmann and Bin Yu. Analyzing bagging. The Annals of Statistics, 30(4):927–961, 2002.

[6] Misha Denil, David Matheson, and Nando De Freitas. Narrowing the gap: Random forests in theory and in practice. In International Conference on Machine Learning (ICML), 2014.

[7] Jerome H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.

[8] Servane Gey and Elodie Nedelec. Model selection for CART regression trees. IEEE Transactions on Information Theory, 51(2):658–670, 2005.

[9] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer Series in Statistics. Springer New York, 2009. ISBN 9780387848587. URL https://books.google.com/books?id=tVIjmNS3Ob8C.

[10] Hemant Ishwaran. The effect of splitting on random forests. Machine Learning, 99(1):75–118, 2015.

[11] Xiao Li, Yu Wang, Sumanta Basu, Karl Kumbier, and Bin Yu. A debiased MDI feature importance measure for random forests. In Advances in Neural Information Processing Systems 32, pages 8049–8059. Curran Associates, Inc., 2019.

[12] Yi Lin and Yongho Jeon. Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101(474):578–590, 2006.

[13] Nicolai Meinshausen. Quantile regression forests. Journal of Machine Learning Research, 7(Jun):983–999, 2006.

[14] Andrew B. Nobel. Analysis of a complexity-based pruning scheme for classification trees. IEEE Transactions on Information Theory, 48(8):2362–2368, 2002.

[15] Erwan Scornet, Gérard Biau, and Jean-Philippe Vert. Consistency of random forests. The Annals of Statistics, 43(4):1716–1741, 2015.

[16] Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, pages 1–15, 2018.

[17] Yi-Ching Yao and Siu-Tong Au. Least-squares estimation of a step function. Sankhyā: The Indian Journal of Statistics, Series A, pages 370–381, 1989.