# Tree-Values: Selective Inference for Regression Trees

Journal of Machine Learning Research 23 (2022) 1-43. Submitted 6/21; Revised 10/22; Published 11/22. Editor: Garvesh Raskutti.

Anna C. Neufeld (aneufeld@uw.edu), Department of Statistics, University of Washington, Seattle, WA 98195, USA
Lucy L. Gao (lucy.gao@stat.ubc.ca), Department of Statistics, University of British Columbia, Vancouver, British Columbia, V6T 1Z4, Canada
Daniela M. Witten (dwitten@uw.edu), Departments of Statistics and Biostatistics, University of Washington, Seattle, WA 98195, USA

©2022 Anna C. Neufeld, Lucy L. Gao, Daniela M. Witten. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/21-0722.html.

Abstract

We consider conducting inference on the output of the Classification and Regression Tree (CART) (Breiman et al., 1984) algorithm. A naive approach to inference that does not account for the fact that the tree was estimated from the data will not achieve standard guarantees, such as Type 1 error rate control and nominal coverage. Thus, we propose a selective inference framework for conducting inference on a fitted CART tree. In a nutshell, we condition on the fact that the tree was estimated from the data. We propose a test for the difference in the mean response between a pair of terminal nodes that controls the selective Type 1 error rate, and a confidence interval for the mean response within a single terminal node that attains the nominal selective coverage. Efficient algorithms for computing the necessary conditioning sets are provided. We apply these methods in simulation and to a dataset involving the association between portion control interventions and caloric intake.

Keywords: Regression trees, CART, selective inference, post-selection inference, hypothesis testing.

1. Introduction

Regression tree algorithms recursively partition the covariate space using binary splits to obtain regions that are maximally homogeneous with respect to a continuous response. The Classification and Regression Tree (CART; Breiman et al. 1984) proposal, which involves growing a large tree and then pruning it back, is by far the most popular of these algorithms. The regions defined by the splits in a fitted CART tree induce a piecewise constant regression model in which the predicted response within each region is the mean of the observations in that region. CART is popular in large part because it is highly interpretable: someone without technical expertise can easily read the tree to make predictions, and to understand why a certain prediction is made. However, its interpretability belies the fact that CART trees are highly unstable: a small change to the training dataset can drastically change the structure of the fitted tree. In the absence of an established notion of statistical significance associated with a given split in the tree, it is hard for a practitioner to know whether they are interpreting signal or noise. In this paper, we use the framework of selective inference to fill this gap by providing a toolkit for conducting inference on hypotheses motivated by the output of the CART algorithm. Given a CART tree, consider testing for a difference in the mean response of the regions resulting from a binary split.
A very naive approach, such as a two-sample Z-test, that does not account for the fact that the regions were themselves estimated from the data will fail to control the selective Type 1 error rate: the probability of rejecting a true null hypothesis, given that we decided to test it (Fithian et al., 2014). Similarly, a naive Z-interval for the mean response in a region will not attain nominal selective coverage: the probability that the interval covers the parameter, given that we chose to construct it. In fact, approaches for conducting inference on the output of a regression tree are quite limited. Sample splitting involves fitting a CART tree using a subset of the observations, which will naturally lead to an inferior tree to the one resulting from all of the observations, and thus is unsatisfactory in many applied settings; see Athey and Imbens (2016). Wager and Walther (2015) develop convergence guarantees for unpruned CART trees that can be leveraged to build confidence intervals for the mean response within a region; however, they do not provide finite-sample results and cannot accommodate pruning. Loh et al. (2016) and Loh et al. (2019) develop bootstrap calibration procedures that attempt to provide confidence intervals for the regions of a regression tree. In Appendix A, we show that this bootstrap calibration approach fails to provide intervals that achieve nominal coverage for the parameters of interest in this paper. As an alternative to performing inference on a CART tree, one could turn to the conditional inference tree (CTree) framework of Hothorn et al. (2006). This framework uses a different tree-growing algorithm than CART, and at each split tests for linear association between the split covariate and the response. As summarized in Loh (2014), the CTree framework alleviates issues with instability and variable selection bias associated with CART. Despite these advantages, CTree remains far less widely-used than CART. Furthermore, while CTree attaches a notion of statistical significance to each split in a tree, it does not directly allow for inference on the mean response within a region or the difference in mean response between two regions. Finally, while the CTree framework requires few assumptions, its inference is based on asymptotics. In this paper, we introduce a finite-sample selective inference (Fithian et al., 2014) framework for the difference between the mean responses in two regions, and for the mean response in a single region, in a pruned or unpruned CART tree. We condition on the event that CART yields a particular set of regions, and thereby achieve selective Type 1 error rate control as well as nominal selective coverage. The rest of this paper is organized as follows. In Section 2, we review the CART algorithm, and briefly define some key ideas in selective inference. In Section 3, we present our proposal for selective inference on the regions estimated via CART. We show that the necessary conditioning sets can be efficiently computed in Section 4. In Section 5 we compare our framework to sample splitting and CTree via simulation. In Section 6 we Selective Inference for Regression Trees compare our framework to CTree on data from the Box Lunch Study. The discussion is in Section 7. Technical details are relegated to the supplementary materials. 2. Background 2.1 Notation for Regression Trees Given p covariates (X1, . . . , Xp) measured on each of n observations (x1, . . . 
, xn), let xj,(s) denote the sth order statistic of the jth covariate, and define the half-spaces χj,s,1 = z Rp : zj xj,(s) , χj,s,0 = z Rp : zj > xj,(s) . (1) The following definitions are illustrated in Figure 1. Definition 1 (Tree and Region) Consider a set S such that R Rp for all R S. Then S is a tree if and only if (i) Rp S; (ii) every element of S \ Rp equals R χj,s,e for some R S, j {1, . . . , p}, s {1, . . . , n 1}, e {0, 1}; (iii) R χj,s,e S implies that R χj,s,1 e S for e {0, 1}; and (iv) for any R, R S, R R { , R, R }. If R S and S is a tree, then we refer to R as a region. We use the notation tree to refer to a particular tree. Definition 1 implies that any region R tree \{Rp} is of the form R = L l=1χjl,sl,el, where for each l = 1, . . . , L, we have that jl {1, . . . , p}, sl {1, . . . , n 1}, and el {0, 1}. We call L the level of the region, and use the convention that the level of Rp is 0. Definition 2 (Siblings and Children) Suppose that {R, R χj,s,1, R χj,s,0} tree. Then R χj,s,1 and R χj,s,0 are siblings. Furthermore, they are the children of R. Definition 3 (Descendant and Ancestor) If R, R tree and R R , then R is a descendant of R , and R is an ancestor of R. Definition 4 (Terminal Region) A region R tree without descendants is a terminal region. We let desc(R, tree) denote the set of descendants of region R in tree, and we let term(R, tree) denote the subset of desc(R, tree) that are terminal regions. Given a response vector y Rn, let y R = P i:xi R yi / Pn i=1 1(xi R) , where 1(A) is an indicator variable that equals 1 if the event A holds, and 0 otherwise. Then, a tree tree induces the regression model ˆµ(x) = P R term(Rp,tree) y R1(x R). In other words, it predicts the response within each terminal region to be the mean of the observations in that region. 2.2 A Review of the CART Algorithm (Breiman et al., 1984) The CART algorithm (Breiman et al., 1984) greedily searches for a tree that minimizes the sum of squared errors P R term(Rp,tree) P i:xi R(yi y R)2. It first grows a very large tree via recursive binary splits, starting with the full covariate space Rp. To split a region R, it selects the covariate xj and the split point xj,(s) to maximize the gain, defined as gain R (y, j, s) X i R (yi y R)2 yi y R χj,s,1 2 + X yi y R χj,s,0 2 Neufeld, Gao, Witten Figure 1: The regression tree takes the form tree = {Rp, χ1,s1,1, χ1,s1,0, χ1,s1,1 χ2,s2,1, χ1,s1,1 χ2,s2,0}. The regions RA = χ1,s1,1 χ2,s2,1 and RB = χ1,s1,1 χ2,s2,0 are siblings, and are children, and therefore descendants, of the region χ1,s1,1. The ancestors of RA and RB are Rp and χ1,s1,1. Furthermore, RA, RB, and χ1,s1,0 are terminal regions. Details are provided in Algorithm A1. Once a very large tree has been grown, cost-complexity pruning is applied. We define the average per-region gain in sum-of-squared errors provided by the descendants of a region R, g(R, tree, y) = i:xi R (yi y R)2 P r term(R,tree) i:xi r (yi yr)2 | term(R, tree)| 1 . (3) Given a complexity parameter λ 0, if g(R, tree, y) < λ for some R tree, then costcomplexity pruning removes R s descendants from tree, turning R into a terminal region. Details are in Algorithm A2, which involves the notion of a bottom-up ordering. Definition 5 (Bottom-up ordering) Let tree = {R1, . . . , RK}. Let π be a permutation of the integers (1, . . . , K). Then O = Rπ(1), . . . , Rπ(K) is a bottom-up ordering of the regions in tree if, for all k = 1, . . . , K, π(k) π(j) if Rk desc(Rj, tree). 
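As an illustration of the gain in (2) and the greedy search in Step 2 of Algorithm A1, the following minimal R sketch computes the gain of every candidate split of a region and returns the best one. This is not the rpart implementation used later in the paper; the function names `sse`, `gain_R`, and `best_split` are ours, and stopping rules, tie-breaking, and pruning are omitted.

```r
# Sum of squared errors of y around its mean.
sse <- function(y) sum((y - mean(y))^2)

# gain_R(y, j, s): decrease in SSE from splitting on covariate j at the
# s-th order statistic, as in equation (2). X and y are already restricted
# to the observations in the region R being split. (Illustrative sketch.)
gain_R <- function(X, y, j, s) {
  cut <- sort(X[, j])[s]
  left <- X[, j] <= cut
  if (all(left) || !any(left)) return(0)   # an empty child is not a valid split
  sse(y) - sse(y[left]) - sse(y[!left])
}

# Greedy search over all (j, s), as in Step 2 of Algorithm A1.
best_split <- function(X, y) {
  n <- nrow(X); p <- ncol(X)
  best <- list(gain = -Inf, j = NA, s = NA)
  for (j in seq_len(p)) {
    for (s in seq_len(n - 1)) {
      g <- gain_R(X, y, j, s)
      if (g > best$gain) best <- list(gain = g, j = j, s = s)
    }
  }
  best
}

set.seed(1)
X <- matrix(rnorm(100 * 2), 100, 2)
y <- 3 * (X[, 1] <= 0) + rnorm(100)
best_split(X, y)   # should recover a split on covariate 1 near 0
```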
There are other equivalent formulations for cost-complexity pruning (see Proposition 7.2 in Ripley (1996)); the formulation in Algorithm A2 is convenient for establishing the results in this paper. To summarize, the CART algorithm first applies Algorithm A1 to the initial region Rp and the data y to obtain an unpruned tree, which we call tree0(y). It then applies Algorithm A2 to tree0(y) to obtain an optimally-pruned tree using complexity parameter λ, which we call treeλ(y). Algorithm A1 (Growing a tree) Grow(R, y) 1. If a stopping condition is met, return R. 2. Else return {R, Grow(R χ j, s,1, y), Grow(R χ j, s,0, y)}, where ( j, s) arg max(j,s):s {1,...,n 1},j {1,...,p} gain R (y, j, s) . Selective Inference for Regression Trees Algorithm A2 (Cost-complexity pruning) Parameter O is a bottom-up ordering of the K regions in tree. Prune(tree, y, λ, O) 1. Let tree0 = tree. Let K be the number of regions in tree0. 2. For k = 1, . . . , K: (a) Let R be the kth region in O. (b) Update treek as follows, where g( ) is defined in (3): ( treek 1 \ desc(R, treek 1) if g(R, treek 1, y) < λ, treek 1 otherwise. 3. Return tree K. 2.3 A Brief Overview of Selective Inference Here, we provide a very brief overview of selective inference; see Fithian et al. (2014) or Taylor and Tibshirani (2015) for a more detailed treatment. Consider conducting inference on a parameter θ. Classical approaches assume that we were already interested in conducting inference on θ before looking at our data. If, instead, our interest in θ was sparked by looking at our data, then inference must be performed with care: we must account for the fact that we selected θ based on the data (Fithian et al., 2014). In this setting, interest focuses on a p-value p(Y ) such that the test for H0 : θ = θ0 based on p(Y ) controls the selective Type 1 error rate, in the sense that pr H0:θ=θ0 {p(Y ) α | θ selected} α, for all 0 α 1. (4) Also of interest are confidence intervals [L(Y ), U(Y )] that achieve (1 α)-selective coverage for the parameter θ, meaning that pr {θ [L(Y ), U(Y )] | θ selected} 1 α. (5) Roughly speaking, the inferential guarantees in (4) and (5) can be achieved by defining p-values and confidence intervals that condition on the aspect of the data that led to the selection of θ. In recent years, a number of papers have taken this approach to perform selective inference on parameters selected from the data in the regression (Lee et al., 2016; Liu et al., 2018; Tian and Taylor, 2018; Tibshirani et al., 2016), clustering (Gao et al., 2020), and changepoint detection (Hyun et al., 2021; Jewell et al., 2022) settings. In the next section, we propose p-values that satisfy (4) and confidence intervals that satisfy (5) in the setting of CART, where the parameter of interest is either the mean response within a region, or the difference between the mean responses of two sibling regions. 3. The Selective Inference Framework for CART 3.1 Inference on a Pair of Sibling Regions Throughout this paper, we assume that Y Nn(µ, σ2In) with σ > 0 known. We let X Rn p denote a fixed covariate matrix. Suppose that we apply CART with complexity parameter λ to a realization y = (y1, . . . , yn)T from Y to obtain treeλ(y). Given Neufeld, Gao, Witten sibling regions RA and RB in treeλ(y), we define a contrast vector νsib Rn such that (νsib)i = 1(xi RA) Pn i =1 1(xi RA) 1(xi RB) Pn i =1 1(xi RB) , (6) and νT sibµ = P i:xi RA µi / Pn i=1 1(xi RA) P i:xi RB µi / Pn i=1 1(xi RB) . 
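To make the contrast in (6) concrete, the sketch below builds the vector from the two region-membership indicators and evaluates the naive two-sample Z-test mentioned in the introduction, which compares the difference in sample means to a fixed normal reference and therefore ignores the fact that RA and RB were chosen by CART; the selective approach developed below replaces that reference with a truncated normal distribution. The variable names (`in_A`, `in_B`) are ours.

```r
# Contrast vector nu_sib of equation (6), given logical membership
# indicators for the two sibling regions R_A and R_B.
nu_sib <- function(in_A, in_B) {
  as.numeric(in_A) / sum(in_A) - as.numeric(in_B) / sum(in_B)
}

# Naive two-sample Z-test for H0: nu^T mu = 0 with sigma known. This treats
# nu as fixed in advance, so it does not control the selective Type 1 error
# rate when R_A and R_B were selected by CART.
naive_z_test <- function(y, in_A, in_B, sigma) {
  nu <- nu_sib(in_A, in_B)
  stat <- sum(nu * y)              # nu^T y = difference in sample means
  se <- sigma * sqrt(sum(nu^2))    # standard deviation of nu^T Y for fixed nu
  2 * pnorm(abs(stat) / se, lower.tail = FALSE)
}

set.seed(2)
y <- rnorm(50)
in_A <- rep(c(TRUE, FALSE), each = 25)   # a hypothetical pair of sibling regions
in_B <- !in_A
naive_z_test(y, in_A, in_B, sigma = 1)
```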
Now, consider testing the null hypothesis of no difference in means between RA and RB, i.e. H0 : νT sibµ = 0 versus H1 : νT sibµ = 0. This null hypothesis is of interest because RA and RB appeared as siblings in treeλ(y). A test based on a p-value of the form pr H0 (|νT sib Y | |νT siby|) that does not account for this will not control the selective Type 1 error rate in (4). To control the selective Type 1 error rate, we propose a p-value that conditions on the aspect of the data that led us to select νT sibµ, pr H0 n |νT sib Y | |νT siby| | RA, RB are siblings in treeλ(Y ) o . (7) But (7) depends on a nuisance parameter, the portion of µ that is orthogonal to νsib. To remove the dependence on this nuisance parameter, we condition on its sufficient statistic P νsib Y , where P ν = I ννT/ ν 2 2. The resulting p-value, or tree-value , is defined as psib(y) = pr H0 n |νT sib Y | |νT siby| | RA, RB are siblings in treeλ(Y ), P νsib Y = P νsiby o . (8) Results similar to Theorem 6 can be found in Jewell et al. (2022); Lee et al. (2016); Liu et al. (2018), and Tibshirani et al. (2016). Theorem 6 The test based on the p-value psib(y) in (8) controls the selective Type 1 error rate for H0 : νT sibµ = 0, where νsib is defined in (6), in the sense that pr H0 n psib(Y ) α | RA, RB are siblings in treeλ(Y ) o = α, for all 0 α 1. (9) Furthermore, psib(y) = pr |φ| |νT siby| | φ Sλ sib(νsib) , where φ N(0, νsib 2 2σ2), y (φ, ν) = P ν y + φ(ν/ ν 2 2), and Sλ sib(νsib) = {φ : RA, RB are siblings in treeλ{y (φ, νsib)}}. (10) Proofs of all theoretical results are provided in the appendix. Theorem 6 says that given the set Sλ sib(νsib), we can compute the p-value in (8) using psib(y) = 1 F n |νT siby|; 0, νsib 2 2σ2, Sλ sib(νsib) o +F n |νT siby|; 0, νsib 2 2σ2, Sλ sib(νsib) o , (11) where F ; 0, ν 2σ2, S denotes the cumulative distribution function of the N(0, ν 2 2σ2) distribution truncated to the set S. In Section 4, we provide an efficient approach for analytically characterizing the truncation set Sλ sib(νsib). To avoid numerical issues associated with the truncated normal distribution, we compute (11) using methods described in the supplement of Chen and Bien (2020). Note that the proof of Theorem 6, and consequently the efficient computation of psib(y) discussed in Section 4, relies on the assumption that Y Nn(µ, σ2In). Selective Inference for Regression Trees We now consider inverting the test proposed in (8) to construct an equitailed confidence interval for νT sibµ that has (1 α)-selective coverage (5), in the sense that pr n νT sibµ [L(Y ), U(Y )] | RA, RB are siblings in treeλ(Y ) o = 1 α. (12) Proposition 7 For any 0 α 1 and any realization y Rn, the values L(y) and U(y) that satisfy F{ν T siby; L(y), σ2 νsib 2 2, Sλ sib(νsib)} = 1 α/2, F{ν T siby; U(y), σ2 νsib 2 2, Sλ sib(νsib)} = α/2, (13) are unique, and [L(Y ), U(Y )] achieves (1 α)-selective coverage for νT sibµ. 3.2 Inference on a Single Region Given a single region RA in a CART tree, we define the contrast vector νreg such that (νreg)i = 1(xi RA)/ i =1 1(xi RA) Then, νT regµ = P i:xi RA µi / Pn i=1 1(xi RA) . We now consider testing the null hypothesis H0 : νT regµ = c for some fixed c. Because our interest in this null hypothesis results from the fact that RA treeλ(y), we must condition on this event in defining the p-value. We define preg(y) = pr H0 n |νT reg Y c| |νT regy c| | RA treeλ(Y ), P νreg Y = P νregy o , (15) and introduce the following theorem. 
Theorem 8 The test based on the p-value preg(y) in (15) controls the selective Type 1 error rate for H0 : νT regµ = c, where νreg is defined in (14). Furthermore, preg(y) = pr |φ c| |νT regy c| | φ Sreg(νreg) , where φ N(c, νreg 2 2σ2) and, for y (φ, ν) = P ν y + φ(ν/ ν 2 2), Sλ reg(νreg) = {φ : RA treeλ{y (φ, νreg)}}. (16) Theorem 2 and the resulting efficient computations in Section 4 rely on the assumption that Y Nn(µ, σ2In). We can also define a confidence interval for νT regµ that attains nominal selective coverage. Proposition 9 For any 0 α 1 and any realization y Rn, the values L(y) and U(y) that satisfy F{ν T regy; L(y), σ2 νreg 2 2, Sλ reg(νreg)} = 1 α/2, F{ν T regy; U(y), σ2 νreg 2 2, Sλ reg(νreg)} = α/2, (17) are unique, and [L(Y ), U(Y )] achieves (1 α)-selective coverage for νT regµ. In Section 4, we propose an approach to analytically characterize the set Sλ reg(νreg) in (16). Neufeld, Gao, Witten 3.3 Intuition for the Conditioning Sets Sλ sib(νsib) and Sλ reg(νreg) We first develop intuition for the set Sλ sib(νsib) defined in (10). From Theorem 6, y (φ, νsib) i = yi+(φ νT siby) ( Pn i =1 1(xi RB) Pn i =1 1(xi RA RB) 1(xi RA) Pn i =1 1(xi RA) Pn i =1 1(xi RA RB) 1(xi RB) Thus, y (φ, νsib) is a perturbation of y that exaggerates the difference between the observed sample mean responses of RA and RB if |φ| > |νT siby|, and shrinks that difference if |φ| < |νT siby|. The set Sλ sib(νsib) quantifies the amount that we can shift the difference in sample mean responses between RA and RB while still producing a tree containing these sibling regions. The top row of Figure 2 displays tree0{y (φ, νsib)}, as a function of φ, in an example where S0 sib(νsib) = ( 19.8, 1.8) (0.9, 34.9). Figure 2: Data with n = 100 and p = 2. Regions resulting from CART (λ = 0) are delineated using solid lines. Here, RA = χ1,26,0 χ2,72,1 and RB = χ1,26,0 χ2,72,0. Top: Output of CART applied to y (φ, νsib), where νsib in (6) encodes the contrast between RA and RB, for various values of φ. The left-most panel displays y = y (νT siby, νsib). By inspection, we see that 14.9 S0 sib(νsib) and 5 S0 sib(νsib), but 0 S0 sib(νsib) and 40 S0 sib(νsib). In fact, S0 sib(νsib) = ( 19.8, 1.8) (0.9, 34.9). Bottom: Output of CART applied to y (φ, νreg), where νreg in (14) encodes membership in RA. The left-most panel displays y = y (νT regy, νreg). Here, S0 reg(νreg) = ( , 3.1) (5.8, 8.8) (14.1, ). We next develop intuition for Sλ reg(νreg), defined in (16). Note that {y (φ, νreg)}i = yi + (φ νT regy)1(xi RA), where y (φ, νreg) is defined in Theorem 8. Thus, y (φ, νreg) shifts the responses of the observations in RA so that their sample mean equals φ, and leaves the others unchanged. The set Sλ reg(νreg) quantifies the amount that we can exaggerate or shrink the sample mean response in region RA while still producing a tree that contains RA. The bottom row of Figure 2 displays y (φ, νreg) as φ is varied, in an example with S0 reg(νreg) = ( , 3.1) (5.8, 8.8) (14.1, ). Selective Inference for Regression Trees 4. Computing the conditioning sets Sλ sib(νsib) and Sλ reg(νreg) 4.1 Recharacterizing the conditioning sets in terms of branches We begin by introducing the concept of a branch. Definition 10 (Branch) A branch is an ordered sequence of triples B = ((j1, s1, e1), . . . , (j L, s L, e L)) such that jl {1, . . . , p}, sl {1, . . . , n 1}, and el {0, 1} for l = 1, . . . , L. The branch B induces a nested set of regions R(B) = {R(0), R(1), . . . , R(L)}, where R(l) = Tl l =1 χjl ,sl ,el for l = 1, . . . , L, and R(0) = Rp. 
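A branch in the sense of Definition 10 is convenient to represent directly as an ordered list of (j, s, e) triples. The minimal sketch below builds the membership indicators of the nested regions R(0), R(1), ..., R(L) induced by a branch, using the half-spaces in (1); the representation and function names are ours, and the example triples mirror the regions of Figure 2 applied to freshly simulated covariates.

```r
# A branch is an ordered list of triples (j, s, e): covariate index, order
# statistic index, and side (1 for <=, 0 for >), as in Definition 10.
make_branch <- function(j, s, e) {
  Map(function(a, b, c) list(j = a, s = b, e = c), j, s, e)
}

# Logical membership indicators for the nested regions R(0), R(1), ..., R(L):
# R(l) is the intersection of the first l half-spaces of the branch.
branch_regions <- function(X, branch) {
  in_region <- rep(TRUE, nrow(X))          # R(0) = R^p contains every observation
  regions <- list(in_region)
  for (tr in branch) {
    cut <- sort(X[, tr$j])[tr$s]           # the s-th order statistic of covariate j
    halfspace <- if (tr$e == 1) X[, tr$j] <= cut else X[, tr$j] > cut
    in_region <- in_region & halfspace
    regions <- c(regions, list(in_region))
  }
  regions
}

set.seed(4)
X <- matrix(rnorm(100 * 2), 100, 2)
# Triples of the branch of R_A in Figure 2; X here is simulated, so the
# actual cut points differ from the figure.
b <- make_branch(j = c(1, 2), s = c(26, 72), e = c(0, 1))
regions <- branch_regions(X, b)
sapply(regions, sum)   # number of observations in R(0), R(1), R(2)
```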
For a branch B and a vector ν, we define Sλ(B, ν) = n φ : R(B) treeλ{y (φ, ν)} o . (18) For R tree, we let branch(R, tree) denote the branch such that R{branch(R, tree)} contains R and all of its ancestors in tree. Lemma 11 Suppose that RA and RB are siblings in treeλ(y). Then RA and RB are siblings in treeλ{y (φ, νsib)} if and only if R[branch{RA, treeλ(y)}] treeλ{y (φ, νsib)}. Therefore, Sλ sib(νsib) = Sλ[branch{RA, treeλ(y)}, νsib], defined in (10) and (18). Lemma 11 says that treeλ{y (φ, νsib)} contains siblings RA and RB if and only if it contains the entire branch associated with RA in treeλ(y). However, Lemma 11 does not apply in the single region case: for νreg defined in (14) and some RA treeλ(y), the fact that RA treeλ{y (φ, νreg)} does not imply that R[branch{RA, treeλ(y)}] treeλ{y (φ, νreg)}. Instead, a result similar to Lemma 11 holds, involving permutations of the branch. Definition 12 (Permutation of a branch) Let Π denote the set of all L! permutations of (1, 2, . . . , L). Given π Π and a branch B = ((j1, s1, e1), . . . , (j L, s L, e L)), we say that π (B) =((jπ(1), sπ(1), eπ(1)), . . . , (jπ(L), sπ(L), eπ(L))) is a permutation of the branch B. Branch B and its permutation π (B) induce the same region R(L), but R{π (B)} = R(B). Lemma 13 Let RA treeλ(y). Then RA treeλ{y (φ, νreg)} if and only if there exists a π Π such that R[π {branch RA(y)}] treeλ{y (φ, νreg)}. Thus, for Sλ reg(νreg) in (16), Sλ reg(νreg) = [ π Π Sλ π h branch{RA, treeλ(y)} i , νreg . (19) Lemmas 11 and 13 reveal that computing Sλ sib(νsib) and Sλ reg(νreg) requires characterizing sets of the form Sλ(B, ν), defined in (18). To compute Sλ sib(νsib) we will only need to consider Sλ(B, ν) where R(B) treeλ(y). However, to compute Sλ reg(νreg), we will need to consider Sλ{π(B), ν} where R(B) treeλ(y) but R{π(B)} treeλ(y). Neufeld, Gao, Witten 4.2 Computing Sλ(B, ν) in (18) Throughout this section, we consider a vector ν Rn and a branch B = ((j1, s1, e1), . . . , (j L, s L, e L)), where R(B) may or may not be in treeλ(y). Recall from Definition 10 that B induces the nested regions R(l) = Tl l =1 χjl ,sl ,el for l = 1, . . . , L, and R(0) = Rp. Throughout this section, our only requirement on B and ν is the following condition. Condition 1 For y (φ, ν) defined in Theorem 6, B and ν satisfy {y (φ, ν)}i = yi+c11{xi R(L)}+ c21[xi {R(L 1) χj L,s L,1 e L}] for i = 1, . . . , n and for some constants c1 and c2. To characterize Sλ(B, ν) in (18), recall that the CART algorithm in Section 2.2 involves growing a very large tree tree0(y), and then pruning it. We first characterize the set Sgrow(B, ν) = {φ : R(B) tree0{y (φ, ν)}}. (20) Proposition 14 Recall the definition of gain R(l){y (φ, ν), j, s} in (2), and let Sl,j,s = {φ : gain R(l 1){y (φ, ν), j, s} gain R(l 1){y (φ, ν), jl, sl}}. Then, Sgrow(B, ν) = TL l=1 Tp j=1 Tn 1 s=1 Sl,j,s. Proposition 15 says that we can compute Sgrow(B, ν) efficiently. Proposition 15 The set Sl,j,s is defined by a quadratic inequality in φ. Furthermore, we can evaluate all of the sets Sl,j,s, for l = 1, . . . , L, j = 1, . . . , p, s = 1, . . . , n 1, in O {np L + np log(n)} operations. Intersecting these sets to obtain Sgrow(B, ν) requires at most O {np L log(np L)} operations, and only O(np L) operations if B = branch{RA, treeλ(y)} and ν is of the form νsib in (6). Noting that Sλ(B, ν) = φ Sgrow(B, ν) : R(L) treeλ{y (φ, ν)} , it remains to characterize the set of φ Sgrow(B, ν) such that R(L) is not removed during pruning. Recall that g( ) was defined in (3). 
Proposition 16 There exists a tree tree(B, ν, λ) such that Sλ(B, ν) = Sgrow(B, ν) n φ : g n R(l), tree(B, ν, λ), y (φ, ν) o λ o! If R(B) treeλ(y), then tree(B, ν, λ) = treeλ(y) satisfies (21). Otherwise, given the set Sgrow(B, ν), computing a tree(B, ν, λ) that satisfies (21) has a worst-case computational cost of O(n2p). We explain how to compute a tree(B, ν, λ) satisfying (21) when R(B) treeλ(y) in the supplementary materials. Proposition 17 The set TL 1 l=0 φ : g R(l), tree(B, ν, λ), y (φ, ν) λ in (21) is the intersection of the solution sets of L quadratic inequalities in φ. Given tree(B, ν, λ), the coefficients of these quadratics can be obtained in O(n L) operations. After Sgrow(B, ν) has been computed, intersecting it with these quadratic sets to obtain Sλ(B, ν) from (21) requires O{np L log(np L)} operations in general, and only O(L) operations if B = branch{RA, treeλ(y)} and ν = νsib from (6). Selective Inference for Regression Trees The results in this section have relied upon Condition 1. Indeed, this condition holds for branches B and vectors ν that arise in characterizing the sets Sλ sib(νsib) and Sλ reg(νreg). Proposition 18 If either (i) B = branch{RA, treeλ(y)} and ν = νsib (6), where RA and RB are siblings in treeλ(y), or (ii) B is a permutation of branch{RA, treeλ(y)} and ν = νreg (14), where RA treeλ(y), then Condition 1 holds. Combining Lemma 11 with Propositions 14 18, we see that Sλ sib(νsib) can be computed in O{np L + np log(n)} operations. However, computing Sλ reg(νreg) is much more computationally intensive: by Lemma 13 and Propositions 14 18, it requires computing Sλ(π branch{RA, treeλ(y)} , νreg) for all L! permutations π Π, for a total of O L! n2p Llog(p L) operations. In Section 4.3, we discuss ways to avoid these calculations. 4.3 A Computationally-Efficient Alternative to Sλ reg(νreg) Lemma 13 suggests that carrying out inference on a single region requires computing Sλ(π branch{RA, treeλ(y)} , νreg) for every π Π. We now present a less computationally demanding alternative. Proposition 19 Let Q be a subset of the L! permutations in Π, i.e. Q Π. Define p Q reg(y) = pr H0 |ν T reg Y c| |ν T regy c| | [ R(π[branch{RA, treeλ(y)}]) treeλ(Y ) , P νreg Y = P νregy The test based on p Q reg(y) controls the selective Type 1 error rate (4) for H0 : νT regµ = c. Fur- thermore, p Q reg(y) = pr n |φ c| |νT regy c| | φ S π Q Sλ π[branch{RA, treeλ(y)}], νreg o , where φ N(c, νreg 2 2σ2). Using the notation in Proposition 19, preg(y) introduced in (15) equals pΠ reg(y). If we take Q = {I}, where I is the identity permutation, then we arrive at p I reg(y) = P |φ c| |νT regy c| | φ Sλ[branch{RA, treeλ(y)}, νreg] , (22) where φ N(c, νreg 2 2σ2). The set Sλ[branch{RA, treeλ(y)}, νreg] can be easily computed by Proposition 16. Compared to (15), (22) conditions on an extra piece of information: the ancestors of RA. Thus, while (22) controls the selective Type 1 error rate, it may have lower power than (15) (Fithian et al., 2014). Similarly, inverting (22) to form a confidence interval provides correct selective coverage, but may yield intervals that are wider than those in Proposition 9. Proposition 19 is motivated by a proposal by Lee et al. (2016) to condition on both the selected model (necessary information) and the signs of the selected variables (extra information) in the lasso setting, to gain computational efficiency at the possible expense of precision and power. 
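Propositions 14-17 reduce the conditioning sets to intersections of solution sets of quadratic inequalities in φ; once that intersection (a union of intervals) is in hand, it is exactly the truncation set that enters the p-value in (11). The sketch below solves and intersects a few quadratics and then evaluates the corresponding truncated-normal p-value. The coefficient bookkeeping of Propositions 15 and 17 is omitted, the function names are ours, and the example coefficients and contrast norm are illustrative rather than taken from a fitted tree.

```r
# Solution set of {phi : a*phi^2 + b*phi + c <= 0}, as a matrix of disjoint
# intervals (rows are (lo, hi); zero rows means the empty set).
solve_quad <- function(a, b, c, tol = 1e-12) {
  if (abs(a) < tol) {                             # linear or constant case
    if (abs(b) < tol) return(if (c <= 0) cbind(-Inf, Inf) else matrix(numeric(0), 0, 2))
    return(if (b > 0) cbind(-Inf, -c / b) else cbind(-c / b, Inf))
  }
  disc <- b^2 - 4 * a * c
  if (disc <= 0) return(if (a < 0) cbind(-Inf, Inf) else matrix(numeric(0), 0, 2))
  r <- sort(c((-b - sqrt(disc)) / (2 * a), (-b + sqrt(disc)) / (2 * a)))
  if (a > 0) cbind(r[1], r[2]) else rbind(cbind(-Inf, r[1]), cbind(r[2], Inf))
}

# Intersection of two unions of disjoint intervals.
intersect_sets <- function(A, B) {
  out <- matrix(numeric(0), 0, 2)
  for (i in seq_len(nrow(A))) for (j in seq_len(nrow(B))) {
    lo <- max(A[i, 1], B[j, 1]); hi <- min(A[i, 2], B[j, 2])
    if (lo < hi) out <- rbind(out, c(lo, hi))
  }
  out
}

# Probability that a N(mean, sd^2) variable falls in a union of intervals.
pnorm_set <- function(S, mean, sd) sum(pnorm(S[, 2], mean, sd) - pnorm(S[, 1], mean, sd))

# Selective p-value of equation (11): P(|phi| >= |nu^T y|) for
# phi ~ N(0, ||nu||^2 sigma^2) truncated to the set S.
selective_p <- function(stat, nu_norm2, sigma, S) {
  sd <- sigma * sqrt(nu_norm2)
  clip <- function(lo, hi) {
    out <- cbind(pmax(S[, 1], lo), pmin(S[, 2], hi))
    out[out[, 1] < out[, 2], , drop = FALSE]
  }
  tails <- rbind(clip(-Inf, -abs(stat)), clip(abs(stat), Inf))
  pnorm_set(tails, 0, sd) / pnorm_set(S, 0, sd)
}

# Illustrative quadratics (a, b, c); their intersection is a union of intervals.
quads <- list(c(1, -4, -21), c(-1, 0, 4), c(0, 1, -6))
S <- cbind(-Inf, Inf)
for (q in quads) S <- intersect_sets(S, solve_quad(q[1], q[2], q[3]))
S                                                 # (-3, -2) and (2, 6)
selective_p(stat = 2.5, nu_norm2 = 0.08, sigma = 5, S = S)
```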
In Appendix F, we show through simulation that the loss in power associated with using (22) rather than (15) is negligible. Thus, in practice, we suggest using (22) for its computational efficiency. We use (22) for the remainder of this paper. Neufeld, Gao, Witten Figure 3: The true mean model in Section 5, for a = 0.5 (left), a = 1 (center), and a = 2 (right). The difference in means between the sibling nodes at level two in the tree is ab, while the difference in means between the sibling nodes at level three is b. Furthermore, we can consider computing confidence intervals of the form [LSIreg(y), USIreg(y)] rather than (17), where LSIreg(y) and USIreg(y) satisfy F νT regy; LSIreg(y), σ2 νreg 2 2, Sλ h branch{RA, treeλ(y)}, νreg i = 1 α F νT regy; USIreg(y), σ2 νreg 2 2, Sλ h branch{RA, treeλ(y)}, νreg i = α In Appendix F, we show that the confidence intervals resulting from (23) are not much wider than those resulting from (17). We therefore make use of confidence intervals of the form (23) in the remainder of this paper. 5. Simulation Study 5.1 Data Generating Mechanism We simulate X Rn p with n = 200, p = 10, Xij i.i.d. N(0, 1), and y Nn(µ, σ2In) with σ = 5 and µi = b h 1(xi,1 0) {1 + a1(xi,2>0) + 1(xi,3 xi,2>0)} i . This µ vector defines a three-level tree, shown in Figure 3 for three values of a R. 5.2 Methods for Comparison All CART trees are fit using the R package rpart (Therneau and Atkinson, 2019) with λ = 200, a maximum level of three, and a minimum node size of one. We compare three approaches for conducting inference. (i) Selective Z-methods: Fit a CART tree to the data. For each split, test for a difference in means between the two sibling regions using (8), and compute the corresponding confidence interval in (13). Compute the confidence interval for the mean of each region using (23). (ii) Naive Z-methods: Fit a CART tree to the data. For each split, conduct a naive Z-test for the difference in means between the two sibling regions, and compute the corresponding naive Z-interval. Compute a naive Z-interval for each region s mean. (iii) Sample splitting: Split the data into equally-sized training and test sets. Fit a CART tree to the training set. On the test set, conduct a naive Z-test for each Selective Inference for Regression Trees split and compute a naive Z-interval for each split and each region. If a region has no test set observations, then we fail to reject the null hypothesis and fail to cover the parameter. The conditional inference tree (CTree) framework of Hothorn et al. (2006) uses a different criterion than CART to perform binary splits. Within a region, it tests for linear association between each covariate and the response. The covariate with the smallest p-value for this linear association is selected as the split variable, and a Bonferroni corrected p-value that accounts for the number of covariates is reported in the final tree. Then, the split point is selected. If, after accounting for multiple testing, no variable has a p-value below a prespecified significance level α, then the recursion stops. While CTree s p-values assess linear association and thus are not directly comparable to the p-values in (i) (iii) above, it is the most popular framework currently available for determining if a regression tree split is statistically significant. Thus, we also evaluate the performance of (iv) CTree: Fit a CTree to all of the data using the R package partykit (Hothorn and Zeileis, 2015) with α = 0.05. For each split, record the p-value reported by partykit. 
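The tree-fitting side of this comparison can be reproduced at a sketch level with rpart and partykit, as below, here under the global null of Section 5.3 (a = b = 0, so µ = 0n). Note that rpart's cp argument is a rescaled complexity parameter (relative to the root sum of squares), so it does not equal λ directly; the settings shown are our reading of the control options named above, and the selective p-values themselves require the conditioning-set computations of Section 4.

```r
library(rpart)
library(partykit)

set.seed(3)
n <- 200; p <- 10; sigma <- 5
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n, mean = 0, sd = sigma)            # global null: mu = 0
dat <- data.frame(y = y, X)

# (i)/(ii) CART: depth at most three, minimum node size one. cp = 0 grows the
# tree without cost-complexity pruning; the paper prunes with lambda = 200,
# which would be imposed on rpart's rescaled cp scale.
cart_fit <- rpart(y ~ ., data = dat,
                  control = rpart.control(maxdepth = 3, minsplit = 2,
                                          minbucket = 1, cp = 0, xval = 0))

# (iii) Sample splitting: fit on a random half, reserve the other half for testing.
train <- sample(n, n / 2)
cart_split <- rpart(y ~ ., data = dat[train, ],
                    control = rpart.control(maxdepth = 3, minsplit = 2,
                                            minbucket = 1, cp = 0, xval = 0))

# (iv) CTree with significance level alpha = 0.05.
ctree_fit <- ctree(y ~ ., data = dat, control = ctree_control(alpha = 0.05))

cart_fit
ctree_fit
```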
In Sections 5.3 5.6, we assume that σ is known. We consider the case of unknown σ in Section 5.7. 5.3 Uniform p-values under a Global Null We generate 5, 000 datasets with a = b = 0, so that H0 : νT sibµ = 0 holds for all splits in all trees. Figure 4 displays the distributions of p-values across all splits in all fitted trees for the naive Z-test, sample splitting, and the selective Z-test. The selective Z-test and sample splitting achieve uniform p-values under the null, while the naive Z-test (which does not account for the fact that νsib was obtained by applying CART to the same data used for testing) does not. CTree is omitted from the comparison: it creates a split only if the p-value is less than α = 0.05, and thus its p-values over the splits do not follow a Uniform(0,1) distribution. Figure 4: Quantile-quantile plots of the p-values for testing H0 : νT sibµ = 0, as described in Section 5.3. A naive Z-test (green), sample splitting (blue), and selective Z-test (pink) were performed; see Section 5.2. The p-values are stratified by the level of the regions in the fitted tree. Neufeld, Gao, Witten Table 1: A 3 3 contingency table indicating an observation s involvement in a given true split and estimated split. The adjusted Rand index is computed using only the shaded cells Estimated Split In left region In right region In neither True Split In left region t1 t2 t3 In right region u1 u2 u3 In neither v1 v2 v3 We generate 500 datasets for each (a, b) {0.5, 1, 2} {1, . . . , 10}, and evaluate the power of selective Z-tests, sample splitting, and CTree to reject the null hypothesis H0 : νT sibµ = 0. As naive Z-tests do not control the Type 1 error rate (Figure 4), we do not evaluate their power. We consider two aspects of power: the probability that we detect a true split, and the probability that we reject the null hypothesis corresponding to a true split. Given a true split in Figure 3 and an estimated split, we construct the 3 3 contingency table in Table 1, which indicates whether an observation is on the left-hand side, right-hand side, or not involved in the true split (rows) and the estimated split (columns). To quantify the agreement between the true and estimated splits, we compute the adjusted Rand index (Hubert and Arabie, 1985) associated with the 2 3 contingency table corresponding to the shaded region in Table 1. For each true split, we identify the estimated split for which the adjusted Rand index is largest; if this index exceeds 0.75 then this true split is detected . Given that a true split is detected, the associated null hypothesis is rejected if the corresponding p-value is below 0.05. Figure 5 displays the proportion of true splits that are detected and rejected by each method. As sample splitting fits a tree using only half of the data, it detects fewer true splits, and thus rejects the null hypothesis for fewer true splits, than the selective Z-test. When a is small, the difference in means between sibling regions at level two is small. Because CTree makes a split only if there is strong evidence of association at that level, it tends to build one-level trees, and thus fails to detect many true splits; by contrast, the selective Z-test (based on CART) successfully builds more three-level trees. Thus, when a is small, the selective Z-test detects (and rejects) more true differences than CTree between regions at levels two and three. 5.5 Coverage of Confidence Intervals for νT sibµ and νT regµ We generate 500 datasets for each (a, b) {0.5, 1, 2} {0, . . . 
, 10} to evaluate the coverage of 95% confidence intervals constructed using naive Z-methods, selective Z-methods, and sample splitting. CTree is omitted from these comparisons because it does not provide confidence intervals. We say that the interval covers the truth if it contains νTµ, where ν is defined as in (6) (for a particular split) or (14) (for a particular region). Table 2 shows the proportion of each type of interval that covers the truth, aggregated across values of a Selective Inference for Regression Trees Figure 5: Proportion of true splits detected (solid lines) and rejected (dotted lines) for CART with selective Z-tests (pink), CTree (black), and CART with sample splitting (blue) across different settings of the data generating mechanism, stratified by level in tree. As CTree only makes a split if the p-value is less than 0.05, the proportion of detections equals the proportion of rejections. and b. The selective Z-intervals attain correct coverage of 95%, while the naive Z-intervals do not. It may come as a surprise that sample splitting does not attain correct coverage. Recall that ν from (6) or (14) is an n-vector that contains entries for all observations in both the training set and the test set. Thus, νTµ involves the true mean among both training and test set observations in a given region or pair of regions. By contrast, sample splitting attains correct coverage for a different parameter involving the true means of only the test observations that fall within a given region or pair of regions. 5.6 Width of Confidence Intervals Figure 6(a) illustrates that our selective Z-intervals for νT regµ can be extremely wide when b is small, particularly for regions located at deeper levels in the tree. For each tree that we build and for levels 1, 2, and 3, we compute the adjusted Rand Index (Hubert and Arabie, 1985) between the true tree (truncated at the appropriate level) and the estimated tree (truncated at the same level). Figure 6(b) shows that our selective confidence intervals can be extremely wide when this adjusted Rand Index is small, particularly at deeper levels of the tree. When b is small and the adjusted Rand Index is small, the trees built by CART tend to be unstable, in the sense that small perturbations to the data affect the fitted tree. In this Neufeld, Gao, Witten Table 2: Proportion of 95% confidence intervals containing the true parameter, aggregated over all trees fit to the 5,500 datasets generated with (a, b) {0.5, 1, 2} {1, . . . , 10} Parameter νT regµ Parameter νT sibµ Level Selective Z Naive Z Sample Splitting Selective Z Naive Z Sample Splitting 1 0.951 0.889 0.918 0.948 0.834 0.915 2 0.950 0.645 0.921 0.951 0.410 0.917 3 0.951 0.711 0.921 0.950 0.550 0.921 setting, the sample statistics νT regy fall very close to the boundary of the truncation set. See Kivaranovic and Leeb (2021) for a discussion of why wide confidence intervals can arise in these settings. The great width of our confidence intervals reflects the uncertainty about the mean response within each region due to the instability of the tree-fitting procedure. Figure 6: The median width of the selective Z-intervals for parameter νT regµ for regions at levels one (solid), two (dashed), and three (dotted) of the tree. Similar results hold for parameter νT sibµ. Panel (a) breaks results down by the parameters a and b, whereas panel (b) aggregates results across values of parameters a and b, and displays them as a function of the adjusted Rand Index between the true and estimated trees. 
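The adjusted Rand index (Hubert and Arabie, 1985) used as the detection criterion in Section 5.4, and again above to compare true and estimated trees, can be computed directly from a contingency table of counts. A minimal sketch follows; the example counts are hypothetical and correspond to the shaded 2 × 3 portion of Table 1.

```r
# Adjusted Rand index computed from a contingency table of counts.
ari_from_table <- function(tab) {
  n <- sum(tab)
  sum_ij <- sum(choose(tab, 2))
  sum_a  <- sum(choose(rowSums(tab), 2))
  sum_b  <- sum(choose(colSums(tab), 2))
  expected <- sum_a * sum_b / choose(n, 2)
  (sum_ij - expected) / (0.5 * (sum_a + sum_b) - expected)
}

# Hypothetical counts: rows = true split (left, right); columns = estimated
# split (left, right, neither). An estimated split that nearly recovers the
# true one gives an index above the 0.75 detection threshold of Section 5.4.
tab <- rbind(c(48, 1, 1),
             c(2, 45, 3))
ari_from_table(tab)   # roughly 0.81 for these counts
```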
5.7 Results with Unknown σ Thus far, we have assumed that σ is known. In this section, we compare the following three versions of the selective Z-methods that plug different values of σ into the truncated normal CDF when computing p-values and confidence intervals: 1. σ: We plug in the true value of σ, as in Sections 5.3 5.6. 2. ˆσcons: We plug in ˆσcons = (n 1) 1 n P i=1 (yi y)2, where y = n 1 Pn i=1 yi. Selective Inference for Regression Trees Figure 7: QQ plots of the p-values from testing H0 : νT sibµ = 0 when µ = 0n using the selective Z-test with three different values plugged in to the truncated normal CDF for σ. The p-values are stratified by the level of the regions in the fitted tree. 3. ˆσSSE : Let T = term Rp, treeλ(y) be the number of terminal regions in treeλ(y). We plug in ˆσSSE = q (n T ) 1 Pn i=1(yi ˆyi)2, where ˆyi is the predicted value for the ith observation given by Treeλ(y). It is straightforward to show that E[ˆσ2 cons] σ2, for any value of E[y] = µ. Thus, we expect this estimate to lead to conservative inference. On the other hand, ˆσ2 SSE can be made arbitrarily small by making the fitted tree arbitrarily deep, and so we expect inference based on this estimate to be anti-conservative if the fitted CART tree is large. Figure 7 shows the distribution of p-values from testing H0 : νT sibµ = 0 with the three versions of the selective Z-test under the data generating mechanism described in Section 5.1, with a = b = 0. In this setting, νT sibµ = 0 holds for all splits in all trees. We see almost no difference between the three versions of the selective Z-test. In this global null setting, E[ˆσ2 cons] = σ2. Furthermore, the empirical bias of ˆσ2 SSE is small because the trees we grow are not particularly large; as in the rest of Section 5, we build trees to a maximum depth of 3 and prune with λ = 200. Figure 8 displays the proportion of true splits detected and the proportion of true splits detected and rejected, as defined in Section 5.4, for the three versions of the selective Z-test when data is generated as in Section 5.4. For simplicity, we only show the setting where a = 1. All three methods detect the same proportion of true splits, because they all perform inference on the same CART trees. The proportion of splits detected and rejected is very similar for σ and ˆσSSE because ˆσSSE is a very good estimator for σ in this setting. While ˆσcons performs reasonably when b is small, it severely overestimates σ and thus has low power when b is large. Neufeld, Gao, Witten Figure 8: Proportion of true splits detected (solid lines) and rejected (dotted lines) for CART with the three versions of the selective Z-test. The results are stratified by level in tree. Table 3 displays confidence intervals for νT sibµ and νT regµ for the three versions of the selective Z-intervals, where data is generated as in Section 5.5. As expected, ˆσcons leads to slight over-coverage and ˆσSSE leads to slight under-coverage. Parameter νT regµ Parameter νT sibµ Level σ ˆσcons ˆσSSE σ ˆσcons ˆσSSE 1 0.95 0.98 0.95 0.95 0.98 0.94 2 0.95 0.97 0.94 0.95 0.97 0.94 3 0.95 0.96 0.94 0.95 0.96 0.94 Table 3: Proportion of 95% confidence intervals containing the true parameter, aggregated over all trees fit to the 5,500 datasets generated with (a, b) {0.5, 1, 2} {1, . . . , 10} In this section, we have seen that when trees are not grown overly large, plugging in ˆσSSE leads to approximate selective Type 1 error control, approximately correct selective coverage, and good power. 
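Both plug-in estimators of σ compared above can be computed directly from the response vector and a fitted rpart tree; a minimal sketch is below, assuming a fitted object such as cart_fit from the earlier simulation sketch.

```r
# Conservative estimator: sample standard deviation of y. Its square has
# expectation at least sigma^2 whenever mu is not constant.
sigma_cons <- function(y) sqrt(sum((y - mean(y))^2) / (length(y) - 1))

# SSE-based estimator: residual standard deviation of the fitted tree, with
# degrees of freedom n - T, where T is the number of terminal regions.
sigma_sse <- function(y, fit) {
  T_terminal <- sum(fit$frame$var == "<leaf>")   # terminal nodes of an rpart object
  sqrt(sum((y - predict(fit))^2) / (length(y) - T_terminal))
}

# Example (cart_fit and dat as in the earlier sketch):
# sigma_cons(dat$y)
# sigma_sse(dat$y, cart_fit)
```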
Unfortunately, providing theoretical guarantees for our procedures when using ˆσSSE would be quite difficult, as the estimator is anti-conservative and depends on the output of CART. Providing theoretical guarantees for our procedures under ˆσcons is more straightforward, using ideas from Gao et al. (2020), Chen and Witten (2022), and Tibshirani et al. (2018). However, as shown in Figure 8, selective Z-tests based on ˆσcons can have very low power. One promising avenue of future work involves providing theoretical guarantees in the regression tree setting for estimators that are less conservative than ˆσcons. Selective Inference for Regression Trees Figure 9: Left: A CART tree fit to the Box Lunch Study data. Each split has been labeled with a p-value (8), and each region has been labeled with a confidence interval (23). The shading of the nodes indicates the average response values (white indicates a very small value and dark blue a very large value). Top right: A CTree fit to the Box Lunch Study data. Bottom right: A scatterplot showing the relationship between the covariate hunger and the response. 6. An Application to the Box Lunch Study Venkatasubramaniam et al. (2017) compare CART and CTree (Hothorn et al., 2006) within the context of epidemiological studies. They conclude that CTree is preferable to CART because it provides p-values for each split, even though CART has higher predictive accuracy. Since our framework provides p-values for each split in a CART tree, we revisit their analysis of the Box Lunch Study, a clinical trial studying the impact of portion control interventions on 24-hour caloric intake. We consider identifying subgroups of study participants with baseline differences in 24-hour caloric intake on the basis of scores from an assessment that quantifies constructs such as hunger, liking, the relative reinforcement of food (rrvfood), and restraint (resteating). We exactly reproduce the trees presented in Figures 1 and 2 of Venkatasubramaniam et al. (2017) by building a CTree using partykit and a CART tree using rpart on the Box Lunch Study data provided in the R package vis Tree (Venkatasubramaniam and Wolfson, 2018). We apply our selective inference framework to compute p-values (8) for each split in CART, and confidence intervals (23) for each region. In this section, we use ˆσSSE, defined in Section 5.7, to estimate the error variance. The results are shown in Figure 9. Neufeld, Gao, Witten Both CART and CTree choose hunger<1.8 as the first split. For this split, our selective Z-test reports a large p-value of 0.44, while CTree reports a p-value less than 0.001. The conflicting p-values are explained by the difference in null hypotheses. CTree finds strong evidence against the null of no linear association between hunger and caloric intake. By contrast, our selective framework for CART does not find strong evidence for a difference between mean caloric intake of participants with hunger<1.8 and those with hunger 1.8. We see from the bottom right of Figure 9 that while there is evidence of a linear relationship between hunger and caloric intake, there is less evidence of a difference in means across the particular split hunger=1.8. Given that the goal of Venkatasubramaniam et al. 
(2017) is to identify "population subgroups that are relatively homogeneous with respect to an outcome", the p-value resulting from our selective framework is more natural than the p-value output by CTree, since the former relates directly to the subgroups formed by the split, whereas the latter does not take into account the location of the split point. In general, the left-hand panel of Figure 9 shows that the subgroups of patients identified by CART are not significantly different from one another. This is an important finding that would be missed without our selective inference framework. Furthermore, unlike CTree, our framework provides confidence intervals for the mean response in each subgroup. An alternative analysis using $\hat\sigma_{\mathrm{cons}}$, defined in Section 5.7, is provided in Appendix H, and leads to similar findings.

7. Discussion

Our framework relies on the assumption that $Y \sim N_n(\mu, \sigma^2 I_n)$, with $\sigma^2$ known. In Section 5.7, we showed strong empirical performance when the variance is unknown and $\sigma^2$ is estimated. In this section, we briefly comment on the assumptions of spherical variance and normally distributed data.

It is natural to wonder whether the assumption that $Y \sim N_n(\mu, \sigma^2 I_n)$ can be relaxed to the assumption that $Y \sim N_n(\mu, \Sigma)$, with $\Sigma$ known. Following the work of Lee et al. (2016), the results in Section 3 extend to the setting where $Y \sim N_n(\mu, \Sigma)$ if we:

1. Modify (8) and (15) to condition on the event $\left\{ \left( I_n - \frac{\Sigma \nu \nu^T}{\nu^T \Sigma \nu} \right) Y = \left( I_n - \frac{\Sigma \nu \nu^T}{\nu^T \Sigma \nu} \right) y \right\}$ rather than the event $\{P^{\perp}_{\nu} Y = P^{\perp}_{\nu} y\}$, where $\nu = \nu_{\mathrm{sib}}$ in the case of (8) and $\nu = \nu_{\mathrm{reg}}$ in the case of (15).

2. Replace all instances of the perturbation $y'(\phi, \nu)$, defined in Theorem 6, with the perturbation $y'(\phi, \nu) = \left( I_n - \frac{\Sigma \nu \nu^T}{\nu^T \Sigma \nu} \right) y + \frac{\Sigma \nu}{\nu^T \Sigma \nu} \phi$.

Unfortunately, the modified perturbation $y'(\phi, \nu)$ does not satisfy Condition 1 in Section 4.2 when $\Sigma \neq \sigma^2 I_n$, and so many of the results of Section 4 do not extend to this non-spherical setting. Future work could explore how to efficiently compute the conditioning set in this non-spherical setting.

Furthermore, our framework assumes a normally distributed response variable. CART is commonly used for classification, survival (Segal, 1988), and treatment effect estimation in causal inference (Athey and Imbens, 2016). While the idea of conditioning on a selection event to control the selective Type 1 error rate applies regardless of the distribution of the response, our Theorem 6 and Theorem 8, and the resulting computational results, relied on normality of $Y$. In the absence of this assumption, exactly characterizing the conditioning set and the distribution of the test statistic requires further investigation. We show in Appendix G that our selective Z-tests approximately control the selective Type 1 error rate when the normality assumption is violated. Tian and Taylor (2017) and Tibshirani et al. (2018) establish conditions under which selective p-values for linear regression (derived under the assumption of normality) will be asymptotically uniformly distributed under non-normality. This suggests the possibility of developing asymptotic theory for our proposed selective Z-tests under violations of normality.

A reviewer pointed out similarities between the problem of testing significance of the first split in the tree and significance testing for a single changepoint, as in Bhattacharya (1994). Building on this connection may provide an avenue for future work.

A software implementation of the methods in this paper is available in the R package treevalues, at https://github.com/anna-neufeld/treevalues.
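For completeness, the modified perturbation in item 2 above is straightforward to write down; a minimal sketch (ours, not part of the treevalues package) is below.

```r
# Perturbation y'(phi, nu) for a general known covariance Sigma:
#   y'(phi, nu) = (I - Sigma nu nu^T / (nu^T Sigma nu)) y + (Sigma nu / (nu^T Sigma nu)) phi.
# With Sigma = sigma^2 * I this reduces to the perturbation used in Theorem 6,
# namely P_nu_perp y + phi * nu / ||nu||_2^2.
perturb_general <- function(y, nu, Sigma, phi) {
  Snu <- Sigma %*% nu
  denom <- as.numeric(t(nu) %*% Snu)
  as.numeric(y - Snu %*% (t(nu) %*% y) / denom + Snu * phi / denom)
}
```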
Acknowledgments Daniela Witten and Anna Neufeld were supported by the National Institutes of Health (NIH R01 GM12399) and the Simons Foundation (Simons Investigator Award in Mathematical Modeling of Living Systems). Lucy Gao was supported by the Natural Sciences and Engineering Research Council of Canada (Discovery Grants). Appendix A. Comparison to Loh et al. (2019) Loh et al. (2016) and Loh et al. (2019) use regression trees to find subgroups of patients with similar treatment effects in clinical trials. They grow trees based on patient characteristics using a different algorithm than CART. Furthermore, they are interested in the mean treatment effect (which is a linear regression coefficient) within each terminal region of the tree, rather than the mean response within each region. One of the goals of Loh et al. (2019) is to construct valid post-selection confidence intervals for the treatment effect within each terminal node. In this appendix, we show that their approach, when adapted to the setting of this paper, does not yield confidence intervals with nominal coverage. The basic idea of Loh et al. (2019), instantiated to our setting, is as follows. Suppose that RA Treeλ(y), and define the vector νreg such that (νreg)i = 1(xi RA)/ Pn i =1 1(xi RA) , as in (14). We know that the naive Z-interval does not achieve nominal coverage, meaning that ν regy zα/2 σ q Pn i=1 1(xi RA) , ν regy + zα/2 σ q Pn i=1 1(xi RA) < 1 α. (24) This is because the multiplier for the naive confidence interval, zα/2, is derived under the assumption that the region RA (or, equivalently, the vector νreg), is fixed, rather than a function of the data. Neufeld, Gao, Witten Loh et al. (2019) observe that there exists some α < α such that Pr ν regµ ν regy zα /2 σ Pn i=1 1(xi RA) , ν regy + zα /2 σ Pn i=1 1(xi RA) = 1 α. (25) The value for α for a given tree will depend on the number of split covariates p, the number of data points n, the depth of the tree, and the value of λ used for tree pruning, among other considerations. If we know how the data was generated, then we can check whether some value α satisfies (25) as follows: 1. Draw B different simulated datasets {(Xb, yb)}B b=1 from the same distribution (call this F) as the original data. For b = 1, . . . , B: (a) Build a tree using the simulated data (Xb, yb), using the same procedure and the same settings as in Step 1, and denote it treeλ(yb). (b) For each terminal region R term treeλ(yb), Rp in the tree: i. Construct a (1 α ) naive Z-interval for the mean response in the region using (Xb, yb). ii. Check if each interval contains µR, the true mean for this region R. 2. Compute the fraction of intervals in 1(b) that contain µR. 3. If this value is 1 α, then we have found the correct value of α . If not, then we try a larger or smaller value of α . We test this procedure in a very simple simulation study. We generate data Xij N(0, 1) and yi N(0, 1) for i = 1, . . . , 100 and j = 1, . . . , p. In this simple setting, the true mean response for every region in every fitted tree is 0. We carry out the procedure outlined above with α = 0.1. For two values of p and for CART trees with 1, 2, and 3 levels, we create 1000 datasets and 1000 trees and report the empirical coverage of the intervals obtained using this ideal method, averaged over all nodes in all trees. The results, shown in Table 4, show that this ideal procedure procedure leads to intervals that achieve nominal coverage. 
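The calibration check in steps 1-3 can be sketched compactly for depth-one trees under a known data-generating distribution F. The helper below returns the empirical coverage of (1 − α′) naive Z-intervals over B simulated datasets, so that α′ can be adjusted until the coverage reaches 1 − α; the function name and settings are illustrative, not the implementation of Loh et al. (2019).

```r
library(rpart)

# Empirical coverage of naive (1 - alpha_prime) Z-intervals for the terminal
# regions of depth-one CART trees, when data are drawn from the known null
# distribution F: y_i ~ N(0, 1), X_ij ~ N(0, 1), so every region has mean 0.
coverage_at <- function(alpha_prime, B = 200, n = 100, p = 2) {
  z <- qnorm(1 - alpha_prime / 2)
  covered <- c()
  for (b in seq_len(B)) {
    X <- matrix(rnorm(n * p), n, p)
    y <- rnorm(n)
    dat <- data.frame(y = y, X)
    fit <- rpart(y ~ ., data = dat,
                 control = rpart.control(maxdepth = 1, minsplit = 2,
                                         minbucket = 1, cp = 0, xval = 0))
    leaves <- fit$where                    # terminal node of each observation
    for (node in unique(leaves)) {
      yR <- y[leaves == node]
      half <- z / sqrt(length(yR))         # sigma = 1 is known here
      covered <- c(covered, abs(mean(yR) - 0) <= half)   # true region mean is 0
    }
  }
  mean(covered)
}

# Adjust alpha_prime until coverage_at(alpha_prime) is about 1 - alpha,
# e.g. compare coverage_at(0.01) and coverage_at(0.005) for a target of 0.90.
```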
Unfortunately, this ideal procedure is practically infeasible, as it requires the user to know the true distribution of the data F. Thus, in practice, Loh et al. (2019) propose replacing F by ˆF, the empirical distribution of the original data. This amounts to replacing the simulated datasets in Step 2 with bootstrapped datasets, and checking whether the naive Z-intervals in Step 1(b) contain y R rather than µR. We can see why this is problematic in a very simple setting where we fit a tree with depth 1. If all observations have mean 0, then a CART tree fit to a bootstrap sample of the data will nevertheless find regions RL and RR such that the sample mean value of yb within RL is negative and the sample mean value of yb within RR is positive. As yb and y contain many overlapping observations, it is likely that the sample mean value of y within RL is also negative and the sample mean value of y within RR is also positive. In other words, because of the overlap between y and yb, the within-region sample means of yb are closer to the within-region sample means of y than they are to the within-region population Selective Inference for Regression Trees means. Thus, when we calibrate α to cover the mean values of y within various regions, we end up with under-coverage of the true population mean, as shown in Table 4. We see in Table 4 that our selective inference framework approach enables valid inference in this setting, whereas a bootstrap procedure modeled after Loh et al. (2019) does not. Loh (ideal) Loh (bootstrap) Selective CIs p Tree depth Coverage Average α Coverage Average α Coverage 1 0.902 0.008 0.749 0.037 0.890 2 2 0.905 0.004 0.695 0.038 0.904 3 0.900 0.005 0.660 0.047 0.895 1 0.883 0.001 0.601 0.016 0.908 20 2 0.900 0.00025 0.549 0.016 0.901 3 0.904 0.00015 0.543 0.022 0.905 Table 4: Coverage of 90% confidence intervals computed using three methods for the simple setting where yi N(0, 1) and Xij N(0, 1) for for i = 1, . . . , 100 and j = 1, . . . , p. Note that the Loh (ideal) method can never be used in practice, as it requires knowledge of the true parameter. Appendix B. Proofs for Section 3 B.1 Proof of Theorem 6 Let 0 α 1. We start by proving the first statement in Theorem 6: pr H0 n psib(Y ) α | RA, RB are siblings in treeλ(Y ) o = α. This is a special case of Proposition 3 from Fithian et al. (2014). It follows from the definition of psib(Y ) in (8) that pr H0 n psib(Y ) α | RA, RB are siblings in treeλ(Y ), P νsib Y = P νsiby o = α. Therefore, applying the law of total expectation yields pr H0 n psib(Y ) α | RA, RB siblings in treeλ(Y ) o = EH0 h 1{psib(Y )<α} | RA, RB are siblings in treeλ(Y ) i EH0 h 1{psib(Y )<α} | RA, RB are siblings in treeλ(Y ), P νsib Y = P νsiby i | RA, RB are siblings in treeλ(Y ) = EH0 n α | RA, RB are siblings in treeλ(Y ) o = α. The second statement of Theorem 6 follows directly from the following result. Neufeld, Gao, Witten Lemma 20 If Y Nn µ, σ2In , then the random variable νT sib Y has the following conditional distribution: ν T sib Y | {RA, RB are siblings in treeλ(Y ), P νsib Y = P νsiby} T N n ν T sibµ, σ2 νsib 2 2; Sλ sib(νsib) o , (26) where Sλ sib(νsib) is defined in (10) and T N (µ, σ, S) denotes the N(µ, σ2) distribution truncated to the set S. Proof The following holds for any ν Rn. 
pr n νTY > c | RA, RB are siblings in treeλ (Y ) , P ν Y = P ν y o = pr νTY > c | RA, RB are siblings in treeλ P ν Y + ννT ν 2 2 Y , P ν Y = P ν y = pr νTY > c | RA, RB are siblings in treeλ P ν y + ν ν 2 2 νTY , P ν Y = P ν y = pr νTY > c | RA, RB are siblings in treeλ P ν y + ν ν 2 2 νTY = pr n φ > c | φ Sλ sib (ν) o , where φ = νTY. In the fourth line, the condition P ν Y = P ν y can be dropped because when Y Nn µ, σ2In , P ν Y is independent of ν Y . Finally, since φ N νTµ, σ2 ν 2 2 , (26) holds. B.2 Proof of Proposition 7 Theorem 6.1 from Lee et al. (2016) says that the truncated normal distribution has monotone likelihood ratio in the mean parameter. This guarantees that L(y) and U(y) in (13) are unique. Then, for L( ) and U( ) in (13), (26) in Lemma 20 guarantees that pr ν Tµ [L(Y ), U(Y )] | RA, RB are siblings in treeλ(Y ), P ν Y = P ν y = 1 α. (27) Finally, we need to prove that (27) implies (1 α) selective coverage as defined in (12). Following Proposition 3 from Fithian et al. (2014), let η be the random variable P ν Y and let f( ) be its density. Then, pr n νTµ [L(Y ), U(Y )] | RA, RB are siblings in treeλ(Y ) o = Z pr n νTµ [L(Y ), U(Y )] | RA, RB are siblings in treeλ(Y ), P ν Y = P ν y o f(η)dη = Z (1 α)f(η)dη = 1 α. B.3 Proof of Theorem 8 We omit the proof of the first statement of Theorem 8, as it is similar to the proof of the first statement of Theorem 6 in Appendix B.1. The second statement in Theorem 8 follows directly from the following result. Selective Inference for Regression Trees Lemma 21 The random variable νT reg Y has the conditional distribution νT reg Y | {RA treeλ(Y ), P νreg Y = P νregy} T N n νT regµ, σ2 νreg 2 2; Sλ reg(νreg) o , (28) where Sλ reg(νreg) was defined in (16). We omit the proof of Lemma 21, as it is similar to the proof of Lemma 20. B.4 Proof of Proposition 9 The proof largely follows the proof of Proposition 7. The fact that the truncated normal distribution has monotone likelihood ratio (Theorem 6.1 of Lee et al. 2016) ensures that L(y) and U(y) defined in (17) are unique, and (28) in Lemma 21 implies that pr n νT regµ [L(Y ), U(Y )] | RA treeλ(Y ), P νreg Y = P νregy o = 1 α. The rest of the argument is as in the proof of Proposition 7. Appendix C. Proofs for Section 4.1 C.1 Proof of Lemma 11 We first state and prove the following lemma. Lemma 22 Let RA and RB be the regions in the definition of νsib in (6). For an arbitrary region R and for any j {1, . . . , p} and s {1, . . . , n 1}, recall that the potential children of R (the ones that CART will consider adding to the tree when applying Algorithm A1 to region R) are given by R χj,s,0 and R χj,s,1, where χj,s,0 and χj,s,l were defined in (1). If (RA RB) R χj,s,0 or (RA RB) R χj,s,1, then Gain R{y (φ, νsib), j, s} = Gain R{y, j, s} for all φ. Proof It follows from algebra that for Gain R(y, j, s) defined in (2), Gain R(y, j, s) = i=1 1(xi R) i=1 1(xi R χj,s,0) i=1 1(xi R χj,s,1) where y R = P i R yi / Pn i=1 1(xi R) . It follows from (29) that to prove Lemma 22, it suffices to show that y T = y (φ, νsib)T for T {R, R χj,s,0, R χj,s,1}. Recall from Section 3.3 that {y (φ, νsib)}i = yi + i, where (φ νT siby) Pn i =1 1(x i RB) Pn i =1 1(x i RA RB) if i RA (φ νT siby) Pn i =1 1(x i RA) Pn i =1 1(x i RA RB) if i RB 0 otherwise. Neufeld, Gao, Witten Without loss of generality, assume that (RA RB) R χj,s,0. For any T {R, R χj,s,0}, RA RB T. 
Appendix C. Proofs for Section 4.1

C.1 Proof of Lemma 11

We first state and prove the following lemma.

Lemma 22 Let $R_A$ and $R_B$ be the regions in the definition of $\nu_{\text{sib}}$ in (6). For an arbitrary region $R$ and for any $j \in \{1, \ldots, p\}$ and $s \in \{1, \ldots, n-1\}$, recall that the potential children of $R$ (the ones that CART will consider adding to the tree when applying Algorithm A1 to region $R$) are given by $R \cap \chi_{j,s,0}$ and $R \cap \chi_{j,s,1}$, where $\chi_{j,s,0}$ and $\chi_{j,s,1}$ were defined in (1). If $(R_A \cup R_B) \subseteq R \cap \chi_{j,s,0}$ or $(R_A \cup R_B) \subseteq R \cap \chi_{j,s,1}$, then $\text{gain}_R\{y'(\phi, \nu_{\text{sib}}), j, s\} = \text{gain}_R(y, j, s)$ for all $\phi$.

Proof It follows from algebra that, for $\text{gain}_R(y, j, s)$ defined in (2),
$$\text{gain}_R(y, j, s) = \sum_{i=1}^{n} 1(x_i \in R \cap \chi_{j,s,1})\, \bar{y}_{R \cap \chi_{j,s,1}}^2 + \sum_{i=1}^{n} 1(x_i \in R \cap \chi_{j,s,0})\, \bar{y}_{R \cap \chi_{j,s,0}}^2 - \sum_{i=1}^{n} 1(x_i \in R)\, \bar{y}_R^2, \quad (29)$$
where $\bar{y}_R = \left( \sum_{i \in R} y_i \right) / \left( \sum_{i=1}^{n} 1(x_i \in R) \right)$. It follows from (29) that to prove Lemma 22, it suffices to show that $\bar{y}_T = \bar{y}'(\phi, \nu_{\text{sib}})_T$ for $T \in \{R,\ R \cap \chi_{j,s,0},\ R \cap \chi_{j,s,1}\}$. Recall from Section 3.3 that $\{y'(\phi, \nu_{\text{sib}})\}_i = y_i + \Delta_i$, where
$$\Delta_i = \begin{cases} (\phi - \nu_{\text{sib}}^T y)\, \dfrac{\sum_{i'=1}^{n} 1(x_{i'} \in R_B)}{\sum_{i'=1}^{n} 1(x_{i'} \in R_A \cup R_B)} & \text{if } i \in R_A, \\[6pt] -(\phi - \nu_{\text{sib}}^T y)\, \dfrac{\sum_{i'=1}^{n} 1(x_{i'} \in R_A)}{\sum_{i'=1}^{n} 1(x_{i'} \in R_A \cup R_B)} & \text{if } i \in R_B, \\[6pt] 0 & \text{otherwise.} \end{cases}$$
Without loss of generality, assume that $(R_A \cup R_B) \subseteq R \cap \chi_{j,s,0}$. For any $T \in \{R,\ R \cap \chi_{j,s,0}\}$, we have $R_A \cup R_B \subseteq T$. Thus,
$$\bar{y}'(\phi, \nu_{\text{sib}})_T = \frac{1}{\sum_{i=1}^{n} 1(x_i \in T)} \left[ \sum_{i \in T \setminus (R_A \cup R_B)} y_i + \sum_{i \in R_A} (y_i + \Delta_i) + \sum_{i \in R_B} (y_i + \Delta_i) \right] = \bar{y}_T + \frac{\sum_{i \in R_A} \Delta_i + \sum_{i \in R_B} \Delta_i}{\sum_{i=1}^{n} 1(x_i \in T)} = \bar{y}_T + 0 = \bar{y}_T,$$
since $\sum_{i \in R_A} \Delta_i$ and $\sum_{i \in R_B} \Delta_i$ are equal in magnitude and opposite in sign. Furthermore,
$$\bar{y}'(\phi, \nu_{\text{sib}})_{R \cap \chi_{j,s,1}} = \frac{\sum_{i \in R \cap \chi_{j,s,1}} (y_i + \Delta_i)}{\sum_{i=1}^{n} 1(x_i \in R \cap \chi_{j,s,1})} = \frac{\sum_{i \in R \cap \chi_{j,s,1}} (y_i + 0)}{\sum_{i=1}^{n} 1(x_i \in R \cap \chi_{j,s,1})} = \bar{y}_{R \cap \chi_{j,s,1}}.$$

We will now prove Lemma 11. It follows from Definition 10 that if $R\left( \text{branch}\{R_A, \text{tree}_\lambda(y)\} \right) \subseteq \text{tree}_\lambda\{y'(\phi, \nu_{\text{sib}})\}$, then $R_A$ and $R_B$ are siblings in $\text{tree}_\lambda\{y'(\phi, \nu_{\text{sib}})\}$. This establishes the ($\Leftarrow$) direction.

We will prove the ($\Rightarrow$) direction by contradiction. Suppose that $R_A$ and $R_B$ are siblings in $\text{tree}_\lambda\{y'(\phi, \nu_{\text{sib}})\}$. Define $\text{branch}\{R_A, \text{tree}_\lambda(y)\} = ((j_1, s_1, e_1), \ldots, (j_L, s_L, e_L))$, and define $R^{(l')} = \bigcap_{l=1}^{l'} \chi_{j_l, s_l, e_l}$ for $l' = 1, \ldots, L$. Assume that there exists $l \in \{0, \ldots, L-2\}$ such that $R^{(l)} \in \text{tree}_\lambda\{y'(\phi, \nu_{\text{sib}})\}$ and $R^{(l+1)} \notin \text{tree}_\lambda\{y'(\phi, \nu_{\text{sib}})\}$. We assume that any ties between splits that occur at Step 2 of Algorithm A1 are broken in the same way for $y$ and $y'(\phi, \nu_{\text{sib}})$, and so this implies that there exists $(\tilde{j}, \tilde{s}) \neq (j_{l+1}, s_{l+1})$ such that $(\tilde{j}, \tilde{s}) \in \arg\max_{j,s} \text{gain}_{R^{(l)}}\{y'(\phi, \nu_{\text{sib}}), j, s\}$ and
$$\text{gain}_{R^{(l)}}\{y'(\phi, \nu_{\text{sib}}), j_{l+1}, s_{l+1}\} < \text{gain}_{R^{(l)}}\{y'(\phi, \nu_{\text{sib}}), \tilde{j}, \tilde{s}\}. \quad (30)$$
Since $R_A$ and $R_B$ are siblings in $\text{tree}_\lambda\{y'(\phi, \nu_{\text{sib}})\}$, it follows from Lemma 22 that $\text{gain}_{R^{(l)}}\{y'(\phi, \nu_{\text{sib}}), \tilde{j}, \tilde{s}\} = \text{gain}_{R^{(l)}}(y, \tilde{j}, \tilde{s})$. Also, since $R_A$ and $R_B$ are siblings in $\text{tree}_\lambda(y)$, it follows from Lemma 22 that $\text{gain}_{R^{(l)}}\{y'(\phi, \nu_{\text{sib}}), j_{l+1}, s_{l+1}\} = \text{gain}_{R^{(l)}}(y, j_{l+1}, s_{l+1})$. Applying these facts to (30) yields
$$\text{gain}_{R^{(l)}}(y, j_{l+1}, s_{l+1}) < \text{gain}_{R^{(l)}}(y, \tilde{j}, \tilde{s}). \quad (31)$$
But since $R^{(l)}$ and $R^{(l+1)}$ both appeared in $\text{tree}_\lambda(y)$, $(j_{l+1}, s_{l+1}) \in \arg\max_{j,s} \text{gain}_{R^{(l)}}(y, j, s)$. This contradicts (31). Therefore, for any $l \in \{0, \ldots, L-2\}$, if $R^{(l)} \in \text{tree}_\lambda\{y'(\phi, \nu_{\text{sib}})\}$, then $R^{(l+1)} \in \text{tree}_\lambda\{y'(\phi, \nu_{\text{sib}})\}$. Since $R^{(0)} \in \text{tree}_\lambda\{y'(\phi, \nu_{\text{sib}})\}$, the proof follows by induction.

C.2 Proof of Lemma 13

Let $R_A \in \text{tree}_\lambda(y)$ with $\text{branch}\{R_A, \text{tree}_\lambda(y)\} = ((j_1, s_1, e_1), \ldots, (j_L, s_L, e_L))$, so that $R_A = \bigcap_{l=1}^{L} \chi_{j_l, s_l, e_l}$. Since Algorithm A1 creates regions by intersecting halfspaces, and set intersections are invariant to the order of intersection, it follows that $R_A = \bigcap_{l=1}^{L} \chi_{j_l, s_l, e_l} \in \text{tree}_\lambda\{y'(\phi, \nu_{\text{reg}})\}$ if and only if there exists $\pi \in \Pi$ such that
$$\left\{ \bigcap_{l=1}^{l'} \chi_{j_{\pi(l)}, s_{\pi(l)}, e_{\pi(l)}} \right\}_{l'=1}^{L} \subseteq \text{tree}_\lambda\{y'(\phi, \nu_{\text{reg}})\}.$$
By Definitions 10 and 12, $R\left( \pi\left[ \text{branch}\{R_A, \text{tree}_\lambda(y)\} \right] \right) = \left\{ \bigcap_{l=1}^{l'} \chi_{j_{\pi(l)}, s_{\pi(l)}, e_{\pi(l)}} \right\}_{l'=1}^{L}$. Thus,
$$S_{\text{reg}}^{\lambda} = \left\{ \phi : R_A \in \text{tree}_\lambda\{y'(\phi, \nu_{\text{reg}})\} \right\} = \bigcup_{\pi \in \Pi} \left\{ \phi : R\left( \pi\left[ \text{branch}\{R_A, \text{tree}_\lambda(y)\} \right] \right) \subseteq \text{tree}_\lambda\{y'(\phi, \nu_{\text{reg}})\} \right\} = \bigcup_{\pi \in \Pi} S^{\lambda}\left( \pi\left[ \text{branch}\{R_A, \text{tree}_\lambda(y)\} \right],\ \nu_{\text{reg}} \right),$$
where the third equality follows from the definition of $S^{\lambda}(B, \nu)$ in (18).

Appendix D. Proofs for Section 4.2

D.1 Proof of Proposition 14

Recall that $B = ((j_1, s_1, e_1), \ldots, (j_L, s_L, e_L))$ and $R(B) = \{R^{(0)}, \ldots, R^{(L)}\}$. Recall from (20) that $S_{\text{grow}}(B, \nu) = \{\phi : R(B) \subseteq \text{tree}_0\{y'(\phi, \nu)\}\}$, and that we define
$$S_{l,j,s} = \left\{ \phi : \text{gain}_{R^{(l-1)}}\{y'(\phi, \nu), j, s\} \leq \text{gain}_{R^{(l-1)}}\{y'(\phi, \nu), j_l, s_l\} \right\}.$$
For $l = 1, \ldots, L$,
$$R^{(l-1)} \in \text{tree}_0\{y'(\phi, \nu)\} \ \text{ and } \ \phi \in \bigcap_{s=1}^{n-1} \bigcap_{j=1}^{p} S_{l,j,s} \iff \{R^{(l-1)}, R^{(l)}\} \subseteq \text{tree}_0\{y'(\phi, \nu)\}, \quad (32)$$
because, given that $R^{(l-1)} \in \text{tree}_0\{y'(\phi, \nu)\}$, we have $R^{(l)} \in \text{tree}_0\{y'(\phi, \nu)\}$ if and only if $(j_l, s_l) \in \arg\max_{(j,s)\,:\, s \in \{1,\ldots,n-1\},\ j \in \{1,\ldots,p\}} \text{gain}_{R^{(l-1)}}\{y'(\phi, \nu), j, s\}$. Combining (32) with the fact that $\{\phi : R^{(0)} \in \text{tree}_0\{y'(\phi, \nu)\}\} = \mathbb{R}$ yields
$$\bigcap_{l=1}^{L} \bigcap_{j=1}^{p} \bigcap_{s=1}^{n-1} S_{l,j,s} = \bigcap_{l=1}^{L} \left\{ \phi : \{R^{(l-1)}, R^{(l)}\} \subseteq \text{tree}_0\{y'(\phi, \nu)\} \right\} = \left\{ \phi : R(B) \subseteq \text{tree}_0\{y'(\phi, \nu)\} \right\}.$$
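The perturbation $y'(\phi, \nu_{\text{sib}})$ and the gain invariance of Lemma 22 are easy to check numerically. The sketch below is ours (function names and the specific regions are illustrative, assuming the contrast $\nu_{\text{sib}}$ places weight $1/|R_A|$ on $R_A$ and $-1/|R_B|$ on $R_B$); it verifies that a split whose child contains all of $R_A \cup R_B$ has unchanged gain for any $\phi$.

```python
import numpy as np

def gain(y, in_R, in_child1):
    """gain_R(y, j, s) as in (2): SSE(parent) - SSE(child1) - SSE(child0),
    with membership encoded by boolean masks."""
    def sse(mask):
        return ((y[mask] - y[mask].mean()) ** 2).sum() if mask.any() else 0.0
    return sse(in_R) - sse(in_R & in_child1) - sse(in_R & ~in_child1)

def perturb_sib(y, phi, in_A, in_B):
    """y'(phi, nu_sib): shift RA and RB in opposite directions so that
    nu_sib^T y'(phi, nu_sib) = phi, while all coordinates outside
    RA u RB and the overall mean of RA u RB are unchanged (Section 3.3)."""
    nA, nB = in_A.sum(), in_B.sum()
    nu = in_A / nA - in_B / nB              # contrast vector nu_sib as in (6)
    delta = phi - nu @ y
    out = y.copy()
    out[in_A] += delta * nB / (nA + nB)
    out[in_B] -= delta * nA / (nA + nB)
    return out

rng = np.random.default_rng(0)
n = 40
y = rng.normal(size=n)
in_R = np.ones(n, dtype=bool)                          # parent region
in_A = np.zeros(n, dtype=bool); in_A[:10] = True       # RA
in_B = np.zeros(n, dtype=bool); in_B[10:20] = True     # RB
child1 = np.zeros(n, dtype=bool); child1[:20] = True   # contains RA u RB

yp = perturb_sib(y, phi=2.5, in_A=in_A, in_B=in_B)
# Lemma 22: the gain of this split is unchanged by the perturbation.
print(np.isclose(gain(y, in_R, child1), gain(yp, in_R, child1)))  # True
```

The check succeeds because, by (29), the gain depends on $y$ only through the three within-region means, all of which the perturbation leaves unchanged.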
D.2 Proof of Proposition 15

Given a region $R$, let $1(R)$ denote the vector in $\mathbb{R}^n$ whose $i$th element is $1(x_i \in R)$, and let $P_{1(R)} = 1(R)\left( 1(R)^T 1(R) \right)^{-1} 1(R)^T$ denote the orthogonal projection matrix onto the vector $1(R)$.

Lemma 23 For any region $R$, $\text{gain}_R(y, j, s) = y^T M_{R,j,s}\, y$, where
$$M_{R,j,s} = P_{1(R \cap \chi_{j,s,1})} + P_{1(R \cap \chi_{j,s,0})} - P_{1(R)}. \quad (33)$$
Furthermore, the matrix $M_{R,j,s}$ is positive semidefinite.

Proof For any region $R$, $\sum_{i \in R} (y_i - \bar{y}_R)^2 = \sum_{i \in R} y_i^2 - y^T P_{1(R)}\, y$. Thus, from (2),
$$\begin{aligned}
\text{gain}_R(y, j, s) &= \sum_{i \in R} (y_i - \bar{y}_R)^2 - \sum_{i \in R \cap \chi_{j,s,1}} (y_i - \bar{y}_{R \cap \chi_{j,s,1}})^2 - \sum_{i \in R \cap \chi_{j,s,0}} (y_i - \bar{y}_{R \cap \chi_{j,s,0}})^2 \\
&= \sum_{i \in R} y_i^2 - y^T P_{1(R)}\, y - \sum_{i \in R \cap \chi_{j,s,1}} y_i^2 + y^T P_{1(R \cap \chi_{j,s,1})}\, y - \sum_{i \in R \cap \chi_{j,s,0}} y_i^2 + y^T P_{1(R \cap \chi_{j,s,0})}\, y \\
&= y^T \left( P_{1(R \cap \chi_{j,s,1})} + P_{1(R \cap \chi_{j,s,0})} - P_{1(R)} \right) y = y^T M_{R,j,s}\, y.
\end{aligned}$$
To see that $M_{R,j,s}$ is positive semidefinite, observe that, for any vector $v$,
$$v^T M_{R,j,s}\, v = \text{gain}_R(v, j, s) = \sum_{i \in R} (v_i - \bar{v}_R)^2 - \min_{a_1, a_2} \left[ \sum_{i \in R \cap \chi_{j,s,1}} (v_i - a_1)^2 + \sum_{i \in R \cap \chi_{j,s,0}} (v_i - a_2)^2 \right] \geq \sum_{i \in R} (v_i - \bar{v}_R)^2 - \left[ \sum_{i \in R \cap \chi_{j,s,1}} (v_i - \bar{v}_R)^2 + \sum_{i \in R \cap \chi_{j,s,0}} (v_i - \bar{v}_R)^2 \right] = 0.$$

It follows from Lemma 23 that we can express each set $S_{l,j,s}$ from Proposition 14 as
$$S_{l,j,s} = \left\{ \phi : y'(\phi, \nu)^T M_{R^{(l-1)},j,s}\, y'(\phi, \nu) \leq y'(\phi, \nu)^T M_{R^{(l-1)},j_l,s_l}\, y'(\phi, \nu) \right\}. \quad (34)$$
We now use (34) to prove the first statement of Proposition 15.

Lemma 24 Each set $S_{l,j,s}$ is defined by a quadratic inequality in $\phi$.

Proof The definition of $y'(\phi, \nu)$ in (10), namely $y'(\phi, \nu) = P_{\nu}^{\perp} y + \frac{\nu}{\|\nu\|_2^2}\, \phi$, implies that
$$y'(\phi, \nu)^T M_{R,j,s}\, y'(\phi, \nu) = \frac{\nu^T M_{R,j,s}\, \nu}{\|\nu\|_2^4}\, \phi^2 + \frac{2\, \nu^T M_{R,j,s}\, P_{\nu}^{\perp} y}{\|\nu\|_2^2}\, \phi + y^T P_{\nu}^{\perp} M_{R,j,s}\, P_{\nu}^{\perp} y \equiv a(R, j, s)\, \phi^2 + b(R, j, s)\, \phi + c(R, j, s). \quad (35)$$
Therefore, by (34),
$$S_{l,j,s} = \left\{ \phi : \left[ a\{R^{(l-1)}, j, s\} - a\{R^{(l-1)}, j_l, s_l\} \right] \phi^2 + \left[ b\{R^{(l-1)}, j, s\} - b\{R^{(l-1)}, j_l, s_l\} \right] \phi + \left[ c\{R^{(l-1)}, j, s\} - c\{R^{(l-1)}, j_l, s_l\} \right] \leq 0 \right\}. \quad (36)$$

Proposition 14 indicates that to compute $S_{\text{grow}}(B, \nu)$ from (20), we need to compute the coefficients of the quadratic for each $S_{l,j,s}$, where $l = 1, \ldots, L$, $j = 1, \ldots, p$, and $s = 1, \ldots, n-1$.

Lemma 25 We can compute the coefficients $a\{R^{(l-1)}, j, s\}$, $b\{R^{(l-1)}, j, s\}$, and $c\{R^{(l-1)}, j, s\}$, defined in Lemma 24, for $l = 1, \ldots, L$, $j = 1, \ldots, p$, and $s = 1, \ldots, n-1$, in $O\{np\log(n) + npL\}$ operations.

Proof Using the definitions in Lemmas 23 and 24 and algebra, we have that
$$\|\nu\|_2^4\, a\{R^{(l-1)}, j, s\} = \frac{\left[ \nu^T 1\{R^{(l-1)} \cap \chi_{j,s,1}\} \right]^2}{\sum_{i=1}^{n} 1\{i \in R^{(l-1)} \cap \chi_{j,s,1}\}} + \frac{\left[ \nu^T 1\{R^{(l-1)} \cap \chi_{j,s,0}\} \right]^2}{\sum_{i=1}^{n} 1\{i \in R^{(l-1)} \cap \chi_{j,s,0}\}} - \frac{\left[ \nu^T 1\{R^{(l-1)}\} \right]^2}{\sum_{i=1}^{n} 1\{i \in R^{(l-1)}\}}, \quad (37)$$
$$\frac{\|\nu\|_2^2}{2}\, b\{R^{(l-1)}, j, s\} = \frac{\left[ \nu^T 1\{R^{(l-1)} \cap \chi_{j,s,1}\} \right]\left[ (P_{\nu}^{\perp} y)^T 1\{R^{(l-1)} \cap \chi_{j,s,1}\} \right]}{\sum_{i=1}^{n} 1\{i \in R^{(l-1)} \cap \chi_{j,s,1}\}} + \frac{\left[ \nu^T 1\{R^{(l-1)} \cap \chi_{j,s,0}\} \right]\left[ (P_{\nu}^{\perp} y)^T 1\{R^{(l-1)} \cap \chi_{j,s,0}\} \right]}{\sum_{i=1}^{n} 1\{i \in R^{(l-1)} \cap \chi_{j,s,0}\}} - \frac{\left[ \nu^T 1\{R^{(l-1)}\} \right]\left[ (P_{\nu}^{\perp} y)^T 1\{R^{(l-1)}\} \right]}{\sum_{i=1}^{n} 1\{i \in R^{(l-1)}\}}, \quad (38)$$
$$c\{R^{(l-1)}, j, s\} = \frac{\left[ (P_{\nu}^{\perp} y)^T 1\{R^{(l-1)} \cap \chi_{j,s,1}\} \right]^2}{\sum_{i=1}^{n} 1\{i \in R^{(l-1)} \cap \chi_{j,s,1}\}} + \frac{\left[ (P_{\nu}^{\perp} y)^T 1\{R^{(l-1)} \cap \chi_{j,s,0}\} \right]^2}{\sum_{i=1}^{n} 1\{i \in R^{(l-1)} \cap \chi_{j,s,0}\}} - \frac{\left[ (P_{\nu}^{\perp} y)^T 1\{R^{(l-1)}\} \right]^2}{\sum_{i=1}^{n} 1\{i \in R^{(l-1)}\}}. \quad (39)$$
We compute the scalar $\|\nu\|_2^2$ and the vector $P_{\nu}^{\perp} y$ in $O(n)$ operations once at the start of the algorithm. We also sort each feature in $O\{n\log(n)\}$ operations per feature.
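Lemma 23 is straightforward to verify numerically. The following sketch is ours (the regions are arbitrary illustrative masks); it builds $M_{R,j,s}$ from (33) and confirms both the quadratic-form identity and positive semidefiniteness.

```python
import numpy as np

def proj_indicator(mask):
    """Orthogonal projection matrix onto the indicator vector 1(R)."""
    one = mask.astype(float)[:, None]
    return one @ one.T / one.sum()

rng = np.random.default_rng(3)
n = 30
y = rng.normal(size=n)
in_R = np.arange(n) < 24          # parent region R (first 24 points)
in_c1 = np.arange(n) < 10         # child R ∩ chi_{j,s,1}
in_c0 = in_R & ~in_c1             # child R ∩ chi_{j,s,0}

# M_{R,j,s} as in (33)
M = proj_indicator(in_c1) + proj_indicator(in_c0) - proj_indicator(in_R)

def sse(mask):
    return ((y[mask] - y[mask].mean()) ** 2).sum()

# Lemma 23: gain computed from SSEs equals the quadratic form y^T M y,
print(np.isclose(sse(in_R) - sse(in_c1) - sse(in_c0), y @ M @ y))  # True
# and M is positive semidefinite.
print(np.linalg.eigvalsh(M).min() > -1e-10)                        # True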
We will now show that, for the $l$th level and the $j$th feature, we can compute $a\{R^{(l-1)}, j, s\}$, $b\{R^{(l-1)}, j, s\}$, and $c\{R^{(l-1)}, j, s\}$ for all $n-1$ values of $s$ in $O(n)$ operations. The index $s$ appears in (37)–(39) only through $\sum_{i=1}^{n} 1\{i \in R^{(l-1)} \cap \chi_{j,s,1}\}$ and $\sum_{i=1}^{n} 1\{i \in R^{(l-1)} \cap \chi_{j,s,0}\}$, and through the inner products of the vectors $\nu$ and $P_{\nu}^{\perp} y$ with the indicator vectors $1\{R^{(l-1)} \cap \chi_{j,s,1}\}$ and $1\{R^{(l-1)} \cap \chi_{j,s,0}\}$. For simplicity, we assume that covariate $x_j$ is continuous, and thus the order statistics are unique. Let $m_1$ be the index corresponding to the smallest value of $x_j$. Then $\nu^T 1\{R^{(l-1)} \cap \chi_{j,1,1}\} = \nu_{m_1}$ if observation $m_1$ is in $R^{(l-1)}$, and is 0 otherwise. Similarly, $\sum_{i=1}^{n} 1\{i \in R^{(l-1)} \cap \chi_{j,1,1}\} = 1$ if observation $m_1$ is in $R^{(l-1)}$, and is 0 otherwise. Next, let $m_2$ be the index corresponding to the second smallest value of $x_j$. Then $\nu^T 1\{R^{(l-1)} \cap \chi_{j,2,1}\} = \nu^T 1\{R^{(l-1)} \cap \chi_{j,1,1}\} + \nu_{m_2}$ if observation $m_2$ is in $R^{(l-1)}$, and equals $\nu^T 1\{R^{(l-1)} \cap \chi_{j,1,1}\}$ otherwise. We compute $\sum_{i=1}^{n} 1\{i \in R^{(l-1)} \cap \chi_{j,2,1}\}$ in the same manner. Each update is done in constant time. Continuing in this manner, computing the full set of $n-1$ quantities $\nu^T 1\{R^{(l-1)} \cap \chi_{j,s,1}\}$ and $\sum_{i=1}^{n} 1\{i \in R^{(l-1)} \cap \chi_{j,s,1}\}$ for $s = 1, \ldots, n-1$ requires a single forward pass through the sorted values of $x_j$, which takes $O(n)$ operations. The same ideas can be applied to compute $(P_{\nu}^{\perp} y)^T 1\{R^{(l-1)} \cap \chi_{j,s,1}\}$, $\nu^T 1\{R^{(l-1)} \cap \chi_{j,s,0}\}$, $(P_{\nu}^{\perp} y)^T 1\{R^{(l-1)} \cap \chi_{j,s,0}\}$, and $\sum_{i=1}^{n} 1\{i \in R^{(l-1)} \cap \chi_{j,s,0}\}$ using constant-time updates for each value of $s$.

Thus, we can obtain all components of the coefficients $a\{R^{(l-1)}, j, s\}$, $b\{R^{(l-1)}, j, s\}$, and $c\{R^{(l-1)}, j, s\}$ for a fixed $j$ and $l$, and for all $s = 1, \ldots, n-1$, in $O(n)$ operations. These scalar components can be combined to obtain the coefficients in $O(n)$ operations. Therefore, given the sorted features, we compute the $(n-1)pL$ coefficients in $O(npL)$ operations.

Once the coefficients on the right-hand side of (36) have been computed, we can compute $S_{l,j,s}$ in constant time via the quadratic formula: it is either a single interval or the union of two intervals. Finally, in general we can intersect $(n-1)pL$ intervals in $O\{npL \log(npL)\}$ operations (Bourgon, 2009).
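The single forward pass in the proof of Lemma 25 amounts to prefix sums over one sorted feature. The sketch below is ours (names are illustrative; entries of the output where a potential child of $R^{(l-1)}$ is empty are not genuine splits of that region and come out as NaN under the guarded division).

```python
import numpy as np

def split_stats_one_feature(xj, nu, y_perp, in_R):
    """For one feature and one region R = R^(l-1), compute, for every
    split index s = 1, ..., n-1, the quantities needed by (37)-(39):
    nu^T 1{R ∩ chi_{j,s,1}}, (P_perp y)^T 1{R ∩ chi_{j,s,1}}, and
    |R ∩ chi_{j,s,1}|, in a single O(n) pass over the sorted feature.
    The chi_{j,s,0} versions follow by subtracting from the R totals."""
    order = np.argsort(xj)                 # features are sorted once, up front
    member = in_R[order].astype(float)
    nu_cum = np.cumsum(nu[order] * member)[:-1]
    yp_cum = np.cumsum(y_perp[order] * member)[:-1]
    n_cum = np.cumsum(member)[:-1]
    return nu_cum, yp_cum, n_cum

def a_coef(nu_cum, n_cum, nu_tot, n_tot, nu_norm2):
    """a{R, j, s} as in (37), for all s at once; nu_norm2 = ||nu||_2^2."""
    with np.errstate(divide="ignore", invalid="ignore"):
        return (nu_cum**2 / n_cum
                + (nu_tot - nu_cum)**2 / (n_tot - n_cum)
                - nu_tot**2 / n_tot) / nu_norm2**2

rng = np.random.default_rng(5)
n = 50
xj, nu, y = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
nu_norm2 = nu @ nu
y_perp = y - (nu @ y) * nu / nu_norm2      # P_perp y
in_R = np.ones(n, dtype=bool)
nu_cum, yp_cum, n_cum = split_stats_one_feature(xj, nu, y_perp, in_R)
a_all = a_coef(nu_cum, n_cum, nu[in_R].sum(), in_R.sum(), nu_norm2)
print(a_all.shape)                          # (n-1,): one coefficient per split
```

The $b$ and $c$ coefficients of (38) and (39) are assembled from the same cumulative quantities, so the cost per (level, feature) pair stays $O(n)$.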
The final claim of Proposition 15 involves the special case where $\nu = \nu_{\text{sib}}$ and $B = \text{branch}\{R_A, \text{tree}_\lambda(y)\}$.

Lemma 26 Suppose that $\nu = \nu_{\text{sib}}$ from (6) and $B = \text{branch}\{R_A, \text{tree}_\lambda(y)\}$. (i) If $l < L$, then for all $j$ and $s$, there exist $a, b \in [-\infty, \infty]$ such that $a \leq \nu^T y \leq b$ and $S_{l,j,s} = (a, b)$. (ii) If $l = L$, then for all $j$ and $s$, there exist $c, d \in \mathbb{R}$ such that $c \leq 0 \leq d$ and $S_{l,j,s} = (-\infty, c] \cup [d, \infty)$. (iii) We can intersect all $(n-1)pL$ sets of the form $S_{l,j,s}$ in $O(npL)$ operations.

Proof This proof relies on the form of $S_{l,j,s}$ given in (36). To prove (i), note that when $l < L$,
$$\|\nu_{\text{sib}}\|_2^4 \left[ a\{R^{(l-1)}, j, s\} - a\{R^{(l-1)}, j_l, s_l\} \right] = \nu_{\text{sib}}^T M_{R^{(l-1)},j,s}\, \nu_{\text{sib}} - \nu_{\text{sib}}^T M_{R^{(l-1)},j_l,s_l}\, \nu_{\text{sib}} = \nu_{\text{sib}}^T M_{R^{(l-1)},j,s}\, \nu_{\text{sib}} \geq 0.$$
The first equality follows directly from the definition of $a(R, j, s)$ in (35). To see why the second equality holds, observe that $R_A \cup R_B \subseteq R^{(l-1)}$, and without loss of generality assume that $R_A \cup R_B \subseteq R^{(l-1)} \cap \chi_{j_l,s_l,1}$. Recall that the $i$th element of $\nu_{\text{sib}}$ is non-zero if and only if $i \in R_A \cup R_B$, and that the non-zero elements of $\nu_{\text{sib}}$ sum to 0. Thus, $1\{R^{(l-1)}\}^T \nu_{\text{sib}} = 0$ and $1\{R^{(l-1)} \cap \chi_{j_l,s_l,1}\}^T \nu_{\text{sib}} = 0$. Furthermore, the supports of $1\{R^{(l-1)} \cap \chi_{j_l,s_l,0}\}$ and $\nu_{\text{sib}}$ are non-overlapping, and so $1\{R^{(l-1)} \cap \chi_{j_l,s_l,0}\}^T \nu_{\text{sib}} = 0$. Thus,
$$M_{R^{(l-1)},j_l,s_l}\, \nu_{\text{sib}} = \left[ P_{1\{R^{(l-1)} \cap \chi_{j_l,s_l,1}\}} + P_{1\{R^{(l-1)} \cap \chi_{j_l,s_l,0}\}} - P_{1\{R^{(l-1)}\}} \right] \nu_{\text{sib}} = 0.$$
The final inequality follows because $M_{R^{(l-1)},j,s}$ is positive semidefinite (Lemma 23). Thus, when $l < L$, $S_{l,j,s}$ is defined in (36) by a quadratic inequality with a non-negative quadratic coefficient, and so $S_{l,j,s}$ must be a single interval of the form $(a, b)$. Furthermore, since $B = \text{branch}\{R_A, \text{tree}_\lambda(y)\}$, we know that $R(B) \subseteq \text{tree}_0(y) = \text{tree}_0\{y'(\nu^T y, \nu)\}$. Therefore, $\nu^T y \in S_{\text{grow}}(B, \nu_{\text{sib}}) = \bigcap_{l=1}^{L} \bigcap_{j=1}^{p} \bigcap_{s=1}^{n-1} S_{l,j,s}$, and so we conclude that $a \leq \nu^T y \leq b$. This completes the proof of (i).

To prove (ii), we first prove that when $l = L$, the quadratic inequality in $\phi$ defined in (36) has a non-positive quadratic coefficient. To see this, note that
$$\begin{aligned}
\|\nu_{\text{sib}}\|_2^4 \left[ a\{R^{(L-1)}, j, s\} - a\{R^{(L-1)}, j_L, s_L\} \right]
&= \nu_{\text{sib}}^T M_{R^{(L-1)},j,s}\, \nu_{\text{sib}} - \nu_{\text{sib}}^T M_{R^{(L-1)},j_L,s_L}\, \nu_{\text{sib}} \\
&= \nu_{\text{sib}}^T \left[ P_{1\{R^{(L-1)} \cap \chi_{j,s,1}\}} + P_{1\{R^{(L-1)} \cap \chi_{j,s,0}\}} \right] \nu_{\text{sib}} - \nu_{\text{sib}}^T \left[ P_{1\{R^{(L-1)} \cap \chi_{j_L,s_L,1}\}} + P_{1\{R^{(L-1)} \cap \chi_{j_L,s_L,0}\}} \right] \nu_{\text{sib}} \\
&= \nu_{\text{sib}}^T \left[ P_{1\{R^{(L-1)} \cap \chi_{j,s,1}\}} + P_{1\{R^{(L-1)} \cap \chi_{j,s,0}\}} \right] \nu_{\text{sib}} - \nu_{\text{sib}}^T \nu_{\text{sib}} \\
&= \left\| \left[ P_{1\{R^{(L-1)} \cap \chi_{j,s,1}\}} + P_{1\{R^{(L-1)} \cap \chi_{j,s,0}\}} \right] \nu_{\text{sib}} \right\|_2^2 - \|\nu_{\text{sib}}\|_2^2 \leq 0. \quad (40)
\end{aligned}$$
The first equality follows from (35). The second follows from the definition of $M_{R,j,s}$ in (33) and from the fact that $P_{1\{R^{(L-1)}\}}\, \nu_{\text{sib}} = 0$, because $1\{R^{(L-1)}\}^T \nu_{\text{sib}}$ sums all of the non-zero elements of $\nu_{\text{sib}}$, which sum to 0. The third equality follows because $\nu_{\text{sib}}$ lies in $\text{span}\left\{ 1\{R^{(L-1)} \cap \chi_{j_L,s_L,1}\},\ 1\{R^{(L-1)} \cap \chi_{j_L,s_L,0}\} \right\}$: projecting it onto this span yields itself. Noting that $P_{1\{R^{(L-1)} \cap \chi_{j,s,1}\}} + P_{1\{R^{(L-1)} \cap \chi_{j,s,0}\}}$ is itself a projection matrix, the fourth equality follows from the idempotence of projection matrices, and the inequality follows from the fact that $\|Q\nu_{\text{sib}}\|_2 \leq \|\nu_{\text{sib}}\|_2$ for any projection matrix $Q$.

Thus, when $l = L$, the quadratic that defines $S_{l,j,s}$ has a non-positive quadratic coefficient. Equality is attained in (40) if and only if $\nu_{\text{sib}} \in \text{span}\left\{ 1\{R^{(L-1)} \cap \chi_{j,s,1}\},\ 1\{R^{(L-1)} \cap \chi_{j,s,0}\} \right\}$. This can only happen if splitting $R^{(L-1)}$ on $(j, s)$ yields an identical partition of the data to splitting on $(j_L, s_L)$. If this is the case, then $S_{L,j,s} = (-\infty, 0] \cup [0, \infty)$ from the definition of $S_{l,j,s}$ in Proposition 14, and so (ii) is satisfied with $c = d = 0$.

We now proceed to the setting where the inequality in (40) is strict. In this case, (36) implies that $S_{L,j,s} = (-\infty, c] \cup [d, \infty)$ for some $c \leq d$ with $c, d \in \mathbb{R}$. To complete the proof of (ii), we must argue that $c \leq 0$ and $d \geq 0$. Recall that the quadratic in (34) has the form $\text{gain}_{R^{(L-1)}}\{y'(\phi, \nu), j, s\} - \text{gain}_{R^{(L-1)}}\{y'(\phi, \nu), j_L, s_L\}$. When $\phi = 0$, $\text{gain}_{R^{(L-1)}}\{y'(\phi, \nu), j_L, s_L\} = 0$, because $\phi = 0$ eliminates the contrast between $R_A$ and $R_B$, so that the split on $(j_L, s_L)$ provides zero gain. So, when $\phi = 0$, the quadratic evaluates to $\text{gain}_{R^{(L-1)}}\{y'(0, \nu), j, s\}$, which is non-negative by Lemma 23. Thus, $S_{L,j,s}$ is defined by a downward-facing quadratic that is non-negative when $\phi = 0$, and so the set $S_{L,j,s}$ has the form $(-\infty, c] \cup [d, \infty)$ for $c \leq 0 \leq d$.

To prove (iii), observe that (i) implies that $\bigcap_{l=1}^{L-1} \bigcap_{j=1}^{p} \bigcap_{s=1}^{n-1} S_{l,j,s} = (a_{\max}, b_{\min})$, where $a_{\max}$ is the maximum over all of the $a$'s and $b_{\min}$ is the minimum over all of the $b$'s. This can be computed in $np(L-1)$ steps. Furthermore, (ii) implies that $\bigcap_{j=1}^{p} \bigcap_{s=1}^{n-1} S_{L,j,s} = (-\infty, c_{\min}] \cup [d_{\max}, \infty)$, where $c_{\min}$ and $d_{\max}$ are the minimum over all of the $c$'s and the maximum over all of the $d$'s, respectively. This can be computed in $np$ steps. Thus, we can compute $\bigcap_{l=1}^{L} \bigcap_{j=1}^{p} \bigcap_{s=1}^{n-1} S_{l,j,s}$ in $O(npL)$ operations; a sketch of this final intersection appears below.
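The intersection in part (iii) of Lemma 26 needs only running extremes. The helper below is ours (open versus closed endpoints are glossed over, as befits a sketch); it combines the single intervals from levels $l < L$ with the two-sided sets from level $L$ and returns $S_{\text{grow}}(B, \nu_{\text{sib}})$ as at most two disjoint intervals.

```python
def intersect_sgrow_sib(ab_pairs, cd_pairs):
    """Intersect the S_{l,j,s}: for l < L each is one interval (a, b);
    for l = L each is (-inf, c] u [d, inf) with c <= 0 <= d (Lemma 26).
    Returns S_grow(B, nu_sib) as a list of at most two intervals."""
    a_max = max(a for a, _ in ab_pairs)
    b_min = min(b for _, b in ab_pairs)
    c_min = min(c for c, _ in cd_pairs)
    d_max = max(d for _, d in cd_pairs)
    pieces = []
    if a_max < min(b_min, c_min):           # left piece of the intersection
        pieces.append((a_max, min(b_min, c_min)))
    if max(a_max, d_max) < b_min:           # right piece of the intersection
        pieces.append((max(a_max, d_max), b_min))
    return pieces

S = intersect_sgrow_sib([(-3.0, 4.0), (-2.5, 5.0)], [(-0.5, 0.7), (-0.2, 1.1)])
print(S)   # [(-2.5, -0.5), (1.1, 4.0)]
```

Because $c_{\min} \leq 0 \leq d_{\max}$, the two pieces are always disjoint, matching the at-most-two-interval structure used later in the proof of Lemma 30.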
D.3 Proof of Proposition 16

To prove Proposition 16, we first propose a particular method of constructing an example of $\text{tree}(B, \nu, \lambda)$. We then show that (21) holds for this particular choice of $\text{tree}(B, \nu, \lambda)$. We conclude by evaluating the computational cost of computing such an example of $\text{tree}(B, \nu, \lambda)$, and by arguing that in the special case where $R(B) \subseteq \text{tree}_\lambda(y)$, our example is equal to $\text{tree}_\lambda(y)$.

When Algorithm A2 is called with parameters $(\text{tree}, y, \lambda, O)$, where $O$ is a bottom-up ordering of the $K$ nodes in $\text{tree}$, it computes a sequence of intermediate trees. We use the notation $\text{tree}_k(\text{tree}, y, \lambda, O)$, for $k = 0, \ldots, K$, to denote the $k$th of these intermediate trees. The following lemma helps build up to our proposed example of $\text{tree}(B, \nu, \lambda)$.

Lemma 27 Let $\phi_1 \in S_{\text{grow}}(B, \nu)$ and $\phi_2 \in S_{\text{grow}}(B, \nu)$. Then $\text{tree}_0\{y'(\phi_1, \nu)\} = \text{tree}_0\{y'(\phi_2, \nu)\}$. Let $O$ be a bottom-up ordering of the $K$ regions in $\text{tree}_0\{y'(\phi_1, \nu)\}$ such that the last $L$ regions in the ordering are $R^{(L-1)}, \ldots, R^{(0)}$. Then
$$\text{tree}_{K-L}\left[ \text{tree}_0\{y'(\phi_1, \nu)\},\ y'(\phi_1, \nu),\ \lambda,\ O \right] = \text{tree}_{K-L}\left[ \text{tree}_0\{y'(\phi_2, \nu)\},\ y'(\phi_2, \nu),\ \lambda,\ O \right].$$

Proof We first prove that $\text{tree}_0\{y'(\phi_1, \nu)\} \subseteq \text{tree}_0\{y'(\phi_2, \nu)\}$, where $\phi_1, \phi_2 \in S_{\text{grow}}(B, \nu)$. The fact that $\phi_1 \in S_{\text{grow}}(B, \nu)$ and $\phi_2 \in S_{\text{grow}}(B, \nu)$ implies two properties:

Property 1: $R^{(l)} \in \text{tree}_0\{y'(\phi_1, \nu)\}$ and $R^{(l)} \in \text{tree}_0\{y'(\phi_2, \nu)\}$ for $l \in \{0, \ldots, L\}$, by the definition of $S_{\text{grow}}(B, \nu)$.

Property 2: $R^{(l)}_{\text{sib}} \in \text{tree}_0\{y'(\phi_1, \nu)\}$ and $R^{(l)}_{\text{sib}} \in \text{tree}_0\{y'(\phi_2, \nu)\}$ for $l \in \{1, \ldots, L\}$, where $R^{(l)}_{\text{sib}} \equiv R^{(l-1)} \cap \chi_{j_l, s_l, 1-e_l}$. This follows from Property 1 and Definition 1.

Suppose that $R \in \text{tree}_0\{y'(\phi_1, \nu)\}$. Then $R$ must belong to one of the following three cases, illustrated in Figure 10(a):

Case 1: There exists $l \in \{0, \ldots, L\}$ such that $R = R^{(l)}$. By Property 1, $R \in \text{tree}_0\{y'(\phi_2, \nu)\}$.

Case 2: There exists $l \in \{1, \ldots, L\}$ such that $R = R^{(l)}_{\text{sib}}$. By Property 2, $R \in \text{tree}_0\{y'(\phi_2, \nu)\}$.

Case 3: $R \in \text{desc}[R^{\circ}, \text{tree}_0\{y'(\phi_1, \nu)\}]$, where either $R^{\circ} = R^{(l)}_{\text{sib}}$ for some $l \in \{1, \ldots, L\}$, or else $R^{\circ} = R^{(L)}$. By Properties 1 and 2, $R^{\circ} \in \text{tree}_0\{y'(\phi_2, \nu)\}$. Condition 1 ensures that, for all $i \in R^{\circ}$ and for some constants $c$ and $d$,
$$\{y'(\phi_2, \nu)\}_i = \begin{cases} \{y'(\phi_1, \nu)\}_i & \text{if } R^{\circ} = R^{(l)}_{\text{sib}} \text{ for some } l \in \{1, \ldots, L-1\}, \\ \{y'(\phi_1, \nu)\}_i + c & \text{if } R^{\circ} = R^{(L)}_{\text{sib}}, \\ \{y'(\phi_1, \nu)\}_i + d & \text{if } R^{\circ} = R^{(L)}. \end{cases}$$
As constant shifts preserve within-node sums of squared errors, in each of these three scenarios, $\text{desc}[R^{\circ}, \text{tree}_0\{y'(\phi_1, \nu)\}] = \text{desc}[R^{\circ}, \text{tree}_0\{y'(\phi_2, \nu)\}]$. Thus, $R \in \text{tree}_0\{y'(\phi_2, \nu)\}$.

Thus, if $R \in \text{tree}_0\{y'(\phi_1, \nu)\}$, then $R \in \text{tree}_0\{y'(\phi_2, \nu)\}$. This completes the argument that $\text{tree}_0\{y'(\phi_1, \nu)\} \subseteq \text{tree}_0\{y'(\phi_2, \nu)\}$. Swapping the roles of $\phi_1$ and $\phi_2$ in this argument, we see that $\text{tree}_0\{y'(\phi_2, \nu)\} \subseteq \text{tree}_0\{y'(\phi_1, \nu)\}$. This concludes the proof that $\text{tree}_0\{y'(\phi_1, \nu)\} = \text{tree}_0\{y'(\phi_2, \nu)\}$.

Because $\text{tree}_0\{y'(\phi_1, \nu)\} = \text{tree}_0\{y'(\phi_2, \nu)\}$, it follows that any bottom-up ordering of the regions in $\text{tree}_0\{y'(\phi_1, \nu)\}$ is also a bottom-up ordering for the regions in $\text{tree}_0\{y'(\phi_2, \nu)\}$. We next prove by induction that, if we choose a bottom-up ordering $O$ that places the regions in $R(B)$ at the end of the ordering, then
$$\text{tree}_k\left[ \text{tree}_0\{y'(\phi_1, \nu)\},\ y'(\phi_1, \nu),\ \lambda,\ O \right] = \text{tree}_k\left[ \text{tree}_0\{y'(\phi_2, \nu)\},\ y'(\phi_2, \nu),\ \lambda,\ O \right], \quad (41)$$
for $k = 0, \ldots, K-L$. It follows immediately from Algorithm A2 and the argument above that (41) holds for $k = 0$. Next, suppose that for some $k \in \{1, \ldots, K-L\}$,
$$\text{tree}_{k-1}\left[ \text{tree}_0\{y'(\phi_1, \nu)\},\ y'(\phi_1, \nu),\ \lambda,\ O \right] = \text{tree}_{k-1}\left[ \text{tree}_0\{y'(\phi_2, \nu)\},\ y'(\phi_2, \nu),\ \lambda,\ O \right], \quad (42)$$
and denote this tree by $\text{tree}_{k-1}$ for brevity. We must prove that (41) holds. Let $R$ be the $k$th region in $O$, and recall the assumption that the last $L$ regions in $O$ are $\{R^{(L-1)}, \ldots, R^{(0)}\}$. Since $k \leq K-L$, this implies that $R \notin \{R^{(L-1)}, \ldots, R^{(0)}\}$. This means that either $R \in \text{desc}\left[ R^{(l)}_{\text{sib}}, \text{tree}_0\{y'(\phi_1, \nu)\} \right]$ for some $l \in \{1, \ldots, L\}$, or $R \in \text{desc}\left[ R^{(L)}, \text{tree}_0\{y'(\phi_1, \nu)\} \right]$; that is, $R$ is a black region in Figure 10(b). From Condition 1,
$$\{y'(\phi_2, \nu)\}_i = \begin{cases} \{y'(\phi_1, \nu)\}_i & \text{if } R \in \text{desc}\left[ R^{(l)}_{\text{sib}}, \text{tree}_0\{y'(\phi_1, \nu)\} \right] \text{ for } l \in \{1, \ldots, L-1\}, \\ \{y'(\phi_1, \nu)\}_i + c & \text{if } R \in \text{desc}\left[ R^{(L)}_{\text{sib}}, \text{tree}_0\{y'(\phi_1, \nu)\} \right], \\ \{y'(\phi_1, \nu)\}_i + d & \text{if } R \in \text{desc}\left[ R^{(L)}, \text{tree}_0\{y'(\phi_1, \nu)\} \right]. \end{cases}$$
In any of the three cases illustrated in Figure 10, for $g(\cdot)$ defined in (3), $g\{R, \text{tree}_{k-1}, y'(\phi_1, \nu)\} = g\{R, \text{tree}_{k-1}, y'(\phi_2, \nu)\}$. Combining this with (42) and Step 2(b) of Algorithm A2 yields (41). This completes the proof by induction.
Figure 10: (a) An illustration of Case 1 (red), Case 2 (blue), and Case 3 (black) for a region $R \in \text{tree}_0\{y'(\phi_1, \nu)\}$ in the base case of the proof of Lemma 27, where $R(B) = \{R^{(0)}, \ldots, R^{(3)}\}$. (b) The black regions show the possible cases for $R \in \text{tree}_{k-1}$ in the inductive step of the proof of Lemma 27.

Since Lemma 27 guarantees that each $\phi \in S_{\text{grow}}(B, \nu)$ leads to the same $\text{tree}_0\{y'(\phi, \nu)\}$, we will refer to this tree as $\text{tree}_0$, will let $K$ be the number of regions in this tree, and will let $O$ be a bottom-up ordering of these regions that places $R^{(L-1)}, \ldots, R^{(0)}$ in the last $L$ spots. We will further denote $\text{tree}_{K-L}\{\text{tree}_0, y'(\phi, \nu), \lambda, O\}$ for any $\phi \in S_{\text{grow}}(B, \nu)$ by $\text{tree}_{K-L}$, since Lemma 27 further tells us that this is the same for all $\phi \in S_{\text{grow}}(B, \nu)$.

In what follows, we argue that if we let $\text{tree}(B, \nu, \lambda) = \text{tree}_{K-L}$, where $\text{tree}(B, \nu, \lambda)$ appears in the statement of Proposition 16, then (21) holds. In other words, we prove that $\text{tree}_{K-L}$, which always exists and is well-defined, is a valid example of $\text{tree}(B, \nu, \lambda)$.

Recall from (18) that $S^{\lambda}(B, \nu) = \left\{ \phi \in S_{\text{grow}}(B, \nu) : R^{(L)} \in \text{tree}_\lambda\{y'(\phi, \nu)\} \right\}$. Lemma 27 says that, for $\phi \in S_{\text{grow}}(B, \nu)$, we can rewrite $\text{tree}_\lambda\{y'(\phi, \nu)\}$ as $\text{tree}_K\{\text{tree}_0, y'(\phi, \nu), \lambda, O\}$. So we can rewrite $S^{\lambda}(B, \nu)$ as
$$S^{\lambda}(B, \nu) = \left\{ \phi \in S_{\text{grow}}(B, \nu) : R^{(L)} \in \text{tree}_K\{\text{tree}_0, y'(\phi, \nu), \lambda, O\} \right\}. \quad (43)$$
Furthermore, since $R^{(L-1)}, \ldots, R^{(0)}$ (all of which are ancestors of $R^{(L)}$) are the last $L$ nodes in the ordering $O$, we see that $R^{(L)} \in \text{tree}_K\{\text{tree}_0, y'(\phi, \nu), \lambda, O\}$ if and only if no pruning occurs during the last $L$ iterations of Step 2 in Algorithm A2. This means that we can characterize (43) as
$$\left\{ \phi \in S_{\text{grow}}(B, \nu) : \text{tree}_{K-L}\{\text{tree}_0, y'(\phi, \nu), \lambda, O\} = \text{tree}_K\{\text{tree}_0, y'(\phi, \nu), \lambda, O\} \right\}. \quad (44)$$
Recall that for $k = K-L+1, \ldots, K$, $R^{(K-k)}$ is the $k$th region in $O$, and is an ancestor of $R^{(L)}$. We next argue that we can rewrite (44) as
$$\left\{ \phi \in S_{\text{grow}}(B, \nu) : g\{R^{(K-k)}, \text{tree}_{K-L}, y'(\phi, \nu)\} \geq \lambda \ \text{ for } k = K-L+1, \ldots, K \right\}. \quad (45)$$
To begin, suppose that $\phi \in (45)$. As we are talking about a particular $\phi$, for $k = 0, \ldots, K$ we will suppress the dependence of $\text{tree}_k\{\text{tree}_0, y'(\phi, \nu), \lambda, O\}$ on its arguments and denote it by $\text{tree}_k$. The fact that $\phi \in (45)$ means that $g\{R^{(L-1)}, \text{tree}_{K-L}, y'(\phi, \nu)\} \geq \lambda$, which ensures that no pruning occurs at step $K-L+1$, which in turn ensures that $\text{tree}_{K-L+1} = \text{tree}_{K-L}$. Combined with (45), this implies that $g\{R^{(L-2)}, \text{tree}_{K-L}, y'(\phi, \nu)\} = g\{R^{(L-2)}, \text{tree}_{K-L+1}, y'(\phi, \nu)\} \geq \lambda$, which ensures that no pruning occurs at step $K-L+2$, which in turn ensures that $\text{tree}_{K-L+2} = \text{tree}_{K-L+1} = \text{tree}_{K-L}$. Proceeding in this manner, by tracing through the last $L$ iterations of Step 2 of Algorithm A2, we see that $\phi$ satisfies $\text{tree}_K = \text{tree}_{K-L}$, and so $\phi \in (44)$.

Next, suppose that $\phi \notin (45)$. Let $k^* = \min_{k \in \{K-L+1, \ldots, K\}} \left\{ k : g\{R^{(K-k)}, \text{tree}_{K-L}, y'(\phi, \nu)\} < \lambda \right\}$. As $k^*$ is a minimum, we know that no pruning occurred during steps $K-L+1, \ldots, k^*-1$, and so $\text{tree}_{k^*-1}\left[ \text{tree}_0\{y'(\phi, \nu)\}, y'(\phi, \nu), \lambda, O \right] = \text{tree}_{K-L}$. This implies that $g\{R^{(K-k^*)}, \text{tree}_{K-L}, y'(\phi, \nu)\} < \lambda$ can be rewritten as
$$g\left[ R^{(K-k^*)},\ \text{tree}_{k^*-1}\left[ \text{tree}_0\{y'(\phi, \nu)\},\ y'(\phi, \nu),\ \lambda,\ O \right],\ y'(\phi, \nu) \right] < \lambda.$$
It then follows from Algorithm A2 that pruning occurs at step $k^*$, which means that $\text{tree}_K$ cannot possibly equal $\text{tree}_{K-L}$. Thus, $\phi \notin (44)$.

Thus, $\phi \in (45)$ if and only if $\phi \in (44)$. Finally, Proposition 16 rewrites (45) with the indexing over $k$ changed to an indexing over $l$, and with $\text{tree}(B, \nu, \lambda) = \text{tree}_{K-L}$ plugged in. Therefore, $\text{tree}_{K-L}$ is a valid example of $\text{tree}(B, \nu, \lambda)$.
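The grow-then-partially-prune construction can be mimicked with a toy implementation. The sketch below is ours, using a hypothetical dict-based tree representation rather than the data structures of Algorithms A1 and A2; the key step is a single bottom-up pruning pass in which the last $L$ nodes (the branch to $R^{(L)}$) are exempt, yielding $\text{tree}_{K-L}$.

```python
import numpy as np

def sse(y, idx):
    return ((y[idx] - y[idx].mean()) ** 2).sum() if len(idx) else 0.0

def terminal_sse_and_count(node, y):
    """Total within-leaf SSE and number of leaves below `node`."""
    if node["left"] is None:
        return sse(y, node["idx"]), 1
    sl, cl = terminal_sse_and_count(node["left"], y)
    sr, cr = terminal_sse_and_count(node["right"], y)
    return sl + sr, cl + cr

def g(node, y):
    """Per-split improvement g(R, tree, y) in the spirit of (3): the drop
    in SSE from keeping the subtree below R, per terminal node removed."""
    s_term, n_term = terminal_sse_and_count(node, y)
    return (sse(y, node["idx"]) - s_term) / (n_term - 1)

def prune_bottom_up(order, y, lam, skip_last=0):
    """One Algorithm-A2-style pass over a bottom-up ordering of internal
    nodes, collapsing a node whenever g(R, tree, y) < lambda. Setting
    skip_last = L exempts the branch nodes, giving tree_{K-L}."""
    for node in order[: len(order) - skip_last]:
        if node["left"] is not None and g(node, y) < lam:
            node["left"] = node["right"] = None   # collapse to a leaf

# Tiny usage example: one internal node with a strong split.
y = np.array([0.0, 0.1, 5.0, 5.2])
leafL = {"idx": np.array([0, 1]), "left": None, "right": None}
leafR = {"idx": np.array([2, 3]), "left": None, "right": None}
root = {"idx": np.arange(4), "left": leafL, "right": leafR}
prune_bottom_up([root], y, lam=1.0)
print(root["left"] is None)   # False: g(root) is large, the split survives
```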
To compute $\text{tree}(B, \nu, \lambda)$, we first select an arbitrary $\phi \in S_{\text{grow}}(B, \nu)$. We then apply Algorithm A1 to grow $\text{tree}_0\{y'(\phi, \nu)\}$. We create a bottom-up ordering $O$ of the $K$ nodes in $\text{tree}_0\{y'(\phi, \nu)\}$ such that $R^{(L-1)}, \ldots, R^{(0)}$ are at the end. Finally, we apply the first $K-L$ iterations of Algorithm A2 with arguments $\text{tree}_0\{y'(\phi, \nu)\}$, $y'(\phi, \nu)$, $\lambda$, and $O$ to obtain $\text{tree}(B, \nu, \lambda)$. The worst-case computational cost of CART (the combined Algorithms A1 and A2) is $O(n^2 p)$.

In the special case that $R(B) \subseteq \text{tree}_\lambda(y)$, we have that $\nu^T y \in S_{\text{grow}}(B, \nu)$, because $y = y'(\nu^T y, \nu)$ and $R(B) \subseteq \text{tree}_\lambda(y) \subseteq \text{tree}_0(y)$. In this case, suppose that we carry out the process described in the previous paragraph by selecting $\phi = \nu^T y$. As $R(B) \subseteq \text{tree}_\lambda(y)$, it is clear from Algorithm A2 that no pruning occurs during the last $L$ iterations of Step 2 in Algorithm A2 applied to arguments $\text{tree}_0(y)$, $y$, $\lambda$, and $O$. Therefore, the process from the previous paragraph returns the optimally pruned $\text{tree}_\lambda(y)$. Thus, when $R(B) \subseteq \text{tree}_\lambda(y)$, we can simply plug in $\text{tree}_\lambda(y)$, which has already been built, for $\text{tree}(B, \nu, \lambda)$ in (21).

D.4 Proof of Proposition 17

We first show that we can express
$$\left\{ \phi : g\{R^{(l)}, \text{tree}(B, \nu, \lambda), y'(\phi, \nu)\} \geq \lambda \right\} \quad (46)$$
as the solution set of a quadratic inequality in $\phi$ for $l = 0, \ldots, L-1$, where $g(\cdot)$ was defined in (3). Because only the numerator of $g\{R^{(l)}, \text{tree}(B, \nu, \lambda), y'(\phi, \nu)\}$ depends on $\phi$, it will be useful to introduce the following concise notation:
$$h(R, \text{tree}, y) \equiv \sum_{i \in R} (y_i - \bar{y}_R)^2 - \sum_{r \in \text{term}(R, \text{tree})} \sum_{i \in r} (y_i - \bar{y}_r)^2. \quad (47)$$
We begin with the following lemma.

Lemma 28 Suppose that region $R$ in $\text{tree}$ has children $R \cap \chi_{j,s,0}$ and $R \cap \chi_{j,s,1}$. Then $h(R, \text{tree}, y) = \text{gain}_R(y, j, s) + h(R \cap \chi_{j,s,1}, \text{tree}, y) + h(R \cap \chi_{j,s,0}, \text{tree}, y)$, where $\text{gain}_R(y, j, s)$ is defined in (2).

Proof The result follows from adding and subtracting $\sum_{i \in R \cap \chi_{j,s,1}} (y_i - \bar{y}_{R \cap \chi_{j,s,1}})^2$ and $\sum_{i \in R \cap \chi_{j,s,0}} (y_i - \bar{y}_{R \cap \chi_{j,s,0}})^2$ in (47), and noting that $\text{term}(R, \text{tree}) = \text{term}(R \cap \chi_{j,s,1}, \text{tree}) \cup \text{term}(R \cap \chi_{j,s,0}, \text{tree})$.

Recall that $B = ((j_1, s_1, e_1), \ldots, (j_L, s_L, e_L))$ and that $R(B) = \{R^{(0)}, R^{(1)}, \ldots, R^{(L)}\}$. Lemma 29 follows from Lemma 28 and the fact that, due to the form of the vector $\nu$, there are many regions $R$ for which $h\{R, \text{tree}(B, \nu, \lambda), y'(\phi, \nu)\}$ does not depend on $\phi$.

Lemma 29 For any $\bar{\phi} \in S_{\text{grow}}(B, \nu)$, we can decompose $h\{R^{(l)}, \text{tree}(B, \nu, \lambda), y'(\phi, \nu)\}$ as
$$h\{R^{(l)}, \text{tree}(B, \nu, \lambda), y'(\phi, \nu)\} = \sum_{l'=l+1}^{L} h\{R^{(l')}_{\text{sib}}, \text{tree}(B, \nu, \lambda), y'(\bar{\phi}, \nu)\} + h\{R^{(L)}, \text{tree}(B, \nu, \lambda), y'(\bar{\phi}, \nu)\} + \sum_{l'=l}^{L-1} \text{gain}_{R^{(l')}}\{y'(\phi, \nu), j_{l'+1}, s_{l'+1}\},$$
where $R^{(l)}_{\text{sib}}$ is the sibling of $R^{(l)}$ in $\text{tree}(B, \nu, \lambda)$.

Proof Repeatedly applying Lemma 28 yields
$$h\{R^{(l)}, \text{tree}(B, \nu, \lambda), y'(\phi, \nu)\} = \sum_{l'=l+1}^{L} h\{R^{(l')}_{\text{sib}}, \text{tree}(B, \nu, \lambda), y'(\phi, \nu)\} + h\{R^{(L)}, \text{tree}(B, \nu, \lambda), y'(\phi, \nu)\} + \sum_{l'=l}^{L-1} \text{gain}_{R^{(l')}}\{y'(\phi, \nu), j_{l'+1}, s_{l'+1}\}. \quad (48)$$
For $l' = l+1, \ldots, L-1$, $h\{R^{(l')}_{\text{sib}}, \text{tree}(B, \nu, \lambda), y'(\phi, \nu)\} = h\{R^{(l')}_{\text{sib}}, \text{tree}(B, \nu, \lambda), y'(\bar{\phi}, \nu)\}$, because $R^{(l')}_{\text{sib}}$ only contains observations where $\{y'(\phi, \nu)\}_i = \{y'(\bar{\phi}, \nu)\}_i$. Similarly, $h\{R^{(L)}, \text{tree}(B, \nu, \lambda), y'(\phi, \nu)\} = h\{R^{(L)}, \text{tree}(B, \nu, \lambda), y'(\bar{\phi}, \nu)\}$ and $h\{R^{(L)}_{\text{sib}}, \text{tree}(B, \nu, \lambda), y'(\phi, \nu)\} = h\{R^{(L)}_{\text{sib}}, \text{tree}(B, \nu, \lambda), y'(\bar{\phi}, \nu)\}$, because for $i \in R^{(L)}$ and $i \in R^{(L)}_{\text{sib}}$, $\{y'(\phi, \nu)\}_i$ and $\{y'(\bar{\phi}, \nu)\}_i$ differ only by a constant shift. Plugging these two facts into (48) completes the proof.
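The decomposition in Lemma 28 is a simple sum-of-squares identity, which the following sketch (ours, on an arbitrary two-level partition of toy indices) verifies numerically.

```python
import numpy as np

rng = np.random.default_rng(7)
y = rng.normal(size=16)

# A two-level tree on indices 0..15: R splits into C1 and C0; C1 splits again.
R = np.arange(16)
C1, C0 = R[:8], R[8:]
C11, C10 = C1[:4], C1[4:]

def sse(idx):
    return ((y[idx] - y[idx].mean()) ** 2).sum()

def h(idx, leaves):
    """h(R, tree, y) as in (47): SSE at R minus total SSE over the
    terminal nodes below R."""
    return sse(idx) - sum(sse(leaf) for leaf in leaves)

gain_R = sse(R) - sse(C1) - sse(C0)             # gain_R(y, j, s) as in (2)
lhs = h(R, [C11, C10, C0])                      # leaves below R
rhs = gain_R + h(C1, [C11, C10]) + h(C0, [C0])  # Lemma 28 decomposition
print(np.isclose(lhs, rhs))                     # True
```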
We can now write (46) as
$$\begin{aligned}
&\left\{ \phi : g\{R^{(l)}, \text{tree}(B, \nu, \lambda), y'(\phi, \nu)\} \geq \lambda \right\} \\
&\quad= \left\{ \phi : h\{R^{(l)}, \text{tree}(B, \nu, \lambda), y'(\phi, \nu)\} \geq \lambda \left[ \left| \text{term}\{R^{(l)}, \text{tree}(B, \nu, \lambda)\} \right| - 1 \right] \right\} \\
&\quad= \left\{ \phi : \sum_{l'=l}^{L-1} \text{gain}_{R^{(l')}}\{y'(\phi, \nu), j_{l'+1}, s_{l'+1}\} \geq \lambda \left[ \left| \text{term}\{R^{(l)}, \text{tree}(B, \nu, \lambda)\} \right| - 1 \right] - \sum_{l'=l+1}^{L} h\{R^{(l')}_{\text{sib}}, \text{tree}(B, \nu, \lambda), y'(\bar{\phi}, \nu)\} - h\{R^{(L)}, \text{tree}(B, \nu, \lambda), y'(\bar{\phi}, \nu)\} \right\} \\
&\quad= \left\{ \phi : \sum_{l'=l}^{L-1} \left[ a\{R^{(l')}, j_{l'+1}, s_{l'+1}\}\, \phi^2 + b\{R^{(l')}, j_{l'+1}, s_{l'+1}\}\, \phi + c\{R^{(l')}, j_{l'+1}, s_{l'+1}\} \right] \geq \gamma_l \right\}, \quad (49)
\end{aligned}$$
where the functions $a(\cdot)$, $b(\cdot)$, and $c(\cdot)$ were defined in (35) in Appendix D.2, and where
$$\gamma_l \equiv \lambda \left[ \left| \text{term}\{R^{(l)}, \text{tree}(B, \nu, \lambda)\} \right| - 1 \right] - \sum_{l'=l+1}^{L} h\{R^{(l')}_{\text{sib}}, \text{tree}(B, \nu, \lambda), y'(\bar{\phi}, \nu)\} - h\{R^{(L)}, \text{tree}(B, \nu, \lambda), y'(\bar{\phi}, \nu)\}$$
is a constant that does not depend on $\phi$. The first equality simply applies the definitions of $h(\cdot)$ and $g(\cdot)$. The second equality follows from Lemma 29 and from moving terms that do not depend on $\phi$ to the right-hand side. The third equality follows from plugging in notation from Appendix D.2 and defining the constant $\gamma_l$ for convenience. Thus, (46) is a quadratic inequality in $\phi$. We now just need to argue that its coefficients can be obtained efficiently.

We need to compute the coefficients in (49) for $l = 0, \ldots, L-1$. The quantities $a\{R^{(l')}, j_{l'+1}, s_{l'+1}\}$, $b\{R^{(l')}, j_{l'+1}, s_{l'+1}\}$, and $c\{R^{(l')}, j_{l'+1}, s_{l'+1}\}$ for $l' = 0, \ldots, L-1$ were already computed while computing $S_{\text{grow}}(B, \nu)$. To get the coefficients for the left-hand side of (49) for each $l = 0, \ldots, L-1$, we simply need to compute $L$ partial sums of these quantities, which takes $O(L)$ operations. As we are assuming that we have access to $\text{tree}(B, \nu, \lambda)$, computing $h\{R, \text{tree}(B, \nu, \lambda), y'(\bar{\phi}, \nu)\}$ requires $O(n)$ operations. Therefore, computing $\gamma_0$ takes $O(nL)$ operations. By storing partial sums during the computation of $\gamma_0$, we can subsequently obtain $\gamma_l$ for $l = 1, \ldots, L-1$ in constant time.

We have now seen that we can obtain the coefficients needed to express (46) as a quadratic function of $\phi$ for $l = 0, \ldots, L-1$ in $O(nL)$ total operations. Once we have these quantities, we can compute each set of the form (46) in constant time using the quadratic formula. It remains to compute
$$S^{\lambda}(B, \nu) = S_{\text{grow}}(B, \nu) \cap \left[ \bigcap_{l=0}^{L-1} \left\{ \phi : g\{R^{(l)}, \text{tree}(B, \nu, \lambda), y'(\phi, \nu)\} \geq \lambda \right\} \right]. \quad (50)$$
Recall from Proposition 14 that $S_{\text{grow}}(B, \nu)$ is the intersection of $O(npL)$ quadratic sets. Thus, in the worst case, $S_{\text{grow}}(B, \nu)$ has $O(npL)$ disjoint components, and so this final intersection involves $O(npL)$ components. Thus, we can compute $S^{\lambda}(B, \nu)$ in $O\{npL \log(npL)\}$ operations (Bourgon, 2009).
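Both (36) and (49) reduce to solving scalar quadratic inequalities. The helper below is ours (a sketch: the double-root case is folded into the constant-sign case, and the tolerance is arbitrary); it maps one coefficient triple to the corresponding set of $\phi$ values.

```python
import numpy as np

def solve_quadratic_ineq(a, b, c, geq=True):
    """Solution set of a*phi^2 + b*phi + c >= 0 (or <= 0 if geq=False),
    returned as a list of intervals. This is the constant-time step that
    turns each coefficient triple from (36) or (49) into a region of phi."""
    if abs(a) < 1e-12:                       # effectively linear
        if abs(b) < 1e-12:
            return [(-np.inf, np.inf)] if (c >= 0) == geq else []
        root = -c / b
        if (b > 0) == geq:
            return [(root, np.inf)]
        return [(-np.inf, root)]
    disc = b * b - 4 * a * c
    if disc <= 0:                            # no sign change: constant sign
        return [(-np.inf, np.inf)] if (a > 0) == geq else []
    r1, r2 = sorted([(-b - np.sqrt(disc)) / (2 * a),
                     (-b + np.sqrt(disc)) / (2 * a)])
    inside, outside = [(r1, r2)], [(-np.inf, r1), (r2, np.inf)]
    if a > 0:                                # upward-facing parabola
        return outside if geq else inside
    return inside if geq else outside        # downward-facing parabola

print(solve_quadratic_ineq(-1.0, 0.0, 4.0))  # [(-2.0, 2.0)]
```

A downward-facing quadratic with a non-negative value at zero, as in Lemma 26(ii), yields the two-sided complement set, while the upward-facing case of Lemma 26(i) yields a single interval.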
The following lemma explains why, similar to Proposition 15, computation time can be reduced in the special case where $\nu = \nu_{\text{sib}}$ and $B = \text{branch}\{R_A, \text{tree}_\lambda(y)\}$.

Lemma 30 When $B = \text{branch}\{R_A, \text{tree}_\lambda(y)\}$, the set $\left\{ \phi : g\{R^{(l)}, \text{tree}(B, \nu, \lambda), y'(\phi, \nu_{\text{sib}})\} \geq \lambda \right\}$ has the form $(-\infty, a_l) \cup (b_l, \infty)$, where $a_l \leq 0 \leq b_l$. Therefore, we can compute $\bigcap_{l=0}^{L-1} \left\{ \phi : g\{R^{(l)}, \text{tree}(B, \nu, \lambda), y'(\phi, \nu_{\text{sib}})\} \geq \lambda \right\}$ as
$$\bigcap_{l=0}^{L-1} \left[ (-\infty, a_l) \cup (b_l, \infty) \right] = \left( -\infty,\ \min_{0 \leq l \leq L-1} a_l \right) \cup \left( \max_{0 \leq l \leq L-1} b_l,\ \infty \right)$$
in $O(L)$ operations. Furthermore, we can compute (50) in constant time.

Proof As $R(B) \subseteq \text{tree}_\lambda(y)$, we can take the point $\bar{\phi} \in S_{\text{grow}}(B, \nu)$ in Lemma 29 to be $\nu^T y$, so that $y'(\bar{\phi}, \nu) = y$. We can then apply Lemma 22 to note that
$$\sum_{l'=l}^{L-1} \text{gain}_{R^{(l')}}\{y'(\phi, \nu_{\text{sib}}), j_{l'+1}, s_{l'+1}\} = \sum_{l'=l}^{L-2} \text{gain}_{R^{(l')}}(y, j_{l'+1}, s_{l'+1}) + \text{gain}_{R^{(L-1)}}\{y'(\phi, \nu_{\text{sib}}), j_L, s_L\}.$$
Thus, when $\nu = \nu_{\text{sib}}$ and $B = \text{branch}\{R_A, \text{tree}_\lambda(y)\}$, we can rewrite (49) with all of the terms corresponding to $\text{gain}_{R^{(l')}}\{y'(\phi, \nu_{\text{sib}}), j_{l'+1}, s_{l'+1}\}$ for $l' = l, \ldots, L-2$ moved into the constant on the right-hand side. This lets us rewrite (49) as $\left\{ \phi : \text{gain}_{R^{(L-1)}}\{y'(\phi, \nu_{\text{sib}}), j_L, s_L\} \geq \tilde{\gamma}_l \right\}$, where $\tilde{\gamma}_l$ is an updated constant that does not depend on $\phi$.

To prove that $\left\{ \phi : \text{gain}_{R^{(L-1)}}\{y'(\phi, \nu_{\text{sib}}), j_L, s_L\} \geq \tilde{\gamma}_l \right\}$ has the form $(-\infty, a_l) \cup (b_l, \infty)$ for $a_l \leq 0 \leq b_l$, first recall from Lemma 23 that $\text{gain}_{R^{(L-1)}}\{y'(\phi, \nu_{\text{sib}}), j_L, s_L\}$ is a quadratic function of $\phi$. It then suffices to show that this quadratic has a non-negative second derivative and achieves its minimum at $\phi = 0$. The second derivative of this quadratic is proportional to $a\{R^{(L-1)}, j_L, s_L\} = \|\nu_{\text{sib}}\|_2^{-4}\, \nu_{\text{sib}}^T M_{R^{(L-1)},j_L,s_L}\, \nu_{\text{sib}}$, which is non-negative by Lemma 23. From Lemma 23, $\text{gain}_{R^{(L-1)}}\{y'(\phi, \nu_{\text{sib}}), j_L, s_L\}$ is non-negative, and it equals 0 when $\phi = 0$, because when $\phi = 0$ we have $\bar{y}'_{R^{(L-1)} \cap \chi_{j_L,s_L,1}} = \bar{y}'_{R^{(L-1)} \cap \chi_{j_L,s_L,0}}$.

Intersecting $L$ sets of the form $(-\infty, a_l) \cup (b_l, \infty)$ with $a_l \leq 0 \leq b_l$ takes only $O(L)$ operations, because we simply need to identify the minimum $a_l$ and the maximum $b_l$. Finally, Lemma 26 ensures that $S_{\text{grow}}(B, \nu_{\text{sib}})$ consists of at most two disjoint intervals, and so the final intersection with $S_{\text{grow}}(B, \nu_{\text{sib}})$ takes only $O(1)$ operations.

D.5 Proof of Proposition 18

Let $B = \text{branch}\{R_A, \text{tree}_\lambda(y)\}$, let $R(B) = \{R^{(0)}, \ldots, R^{(L)}\}$, and let $\nu = \nu_{\text{sib}}$ from (6). Applying the expression given for $\{y'(\phi, \nu_{\text{sib}})\}_i$ in Section 3.3 (which follows from algebra), we immediately see that Condition 1 holds with $R_A = R^{(L)}$ and $R_B = R^{(L-1)} \cap \chi_{j_L, s_L, 1-e_L}$.

Let $B = \pi\left[ \text{branch}\{R_A, \text{tree}_\lambda(y)\} \right]$ and let $\nu = \nu_{\text{reg}}$ from (14). Note that $\pi\left[ \text{branch}\{R_A, \text{tree}_\lambda(y)\} \right]$ induces the same region $R^{(L)}$ as the unpermuted $\text{branch}\{R_A, \text{tree}_\lambda(y)\}$: regardless of the permutation, the induced $R^{(L)}$ is equal to $R_A$. Applying the expression for $\{y'(\phi, \nu_{\text{reg}})\}_i$ given in Section 3.3, Condition 1 holds with constant $c_2 = 0$.

Appendix E. Proofs for Section 4.3

E.1 Proof of Proposition 19

As stated in Proposition 19, let
$$p^{Q}_{\text{reg}}(y) = \text{pr}_{H_0}\left( |\nu_{\text{reg}}^T Y - c| \geq |\nu_{\text{reg}}^T y - c| \;\middle|\; \bigcup_{\pi \in Q} \left\{ R\left( \pi\left[ \text{branch}\{R_A, \text{tree}_\lambda(y)\} \right] \right) \subseteq \text{tree}_\lambda(Y) \right\},\ P_{\nu_{\text{reg}}}^{\perp} Y = P_{\nu_{\text{reg}}}^{\perp} y \right).$$
First, we will show that the test based on $p^{Q}_{\text{reg}}(y)$ controls the selective Type 1 error rate, defined in (4); this is a special case of Proposition 3 from Fithian et al. (2014). Define $E_1 = \{Y : R_A \in \text{tree}_\lambda(Y)\}$, $E_2 = \{Y : P_{\nu_{\text{reg}}}^{\perp} Y = P_{\nu_{\text{reg}}}^{\perp} y\}$, and $E_3 = \bigcup_{\pi \in Q} \left\{ Y : R\left( \pi\left[ \text{branch}\{R_A, \text{tree}_\lambda(y)\} \right] \right) \subseteq \text{tree}_\lambda(Y) \right\}$. Recall that the test of $H_0 : \nu_{\text{reg}}^T \mu = c$ based on $p^{Q}_{\text{reg}}(y)$ controls the selective Type 1 error rate if, for all $\alpha \in [0, 1]$, $\text{pr}_{H_0}\{p^{Q}_{\text{reg}}(Y) \leq \alpha \mid E_1\} \leq \alpha$. By construction,
$$\text{pr}_{H_0}\left\{ p^{Q}_{\text{reg}}(Y) \leq \alpha \mid E_2 \cap E_3 \right\} = E\left[ 1\left\{ \text{pr}_{H_0}\left( |\nu_{\text{reg}}^T Y - c| \geq |\nu_{\text{reg}}^T y - c| \mid E_2 \cap E_3 \right) \leq \alpha \right\} \,\middle|\, E_2 \cap E_3 \right] = \alpha.$$
Let $\psi^{Q}_{R_A} = 1\left\{ \text{pr}_{H_0}\left( |\nu_{\text{reg}}^T Y - c| \geq |\nu_{\text{reg}}^T y - c| \mid E_2 \cap E_3 \right) \leq \alpha \right\}$. An argument similar to that of Lemma 13 indicates that $E_3 \subseteq E_1$. The law of total expectation then yields
$$E(\psi^{Q}_{R_A} \mid E_1) = E\left\{ E(\psi^{Q}_{R_A} \mid E_1 \cap E_2 \cap E_3) \mid E_1 \right\} = E\left\{ E(\psi^{Q}_{R_A} \mid E_2 \cap E_3) \mid E_1 \right\} = E(\alpha \mid E_1) = \alpha.$$
Thus, the test based on $p^{Q}_{\text{reg}}(y)$ controls the selective Type 1 error rate. We omit the proof that $p^{Q}_{\text{reg}}(y)$ can be computed as
$$p^{Q}_{\text{reg}}(y) = \text{pr}_{H_0}\left( |\phi - c| \geq |\nu_{\text{reg}}^T y - c| \;\middle|\; \phi \in \bigcup_{\pi \in Q} S^{\lambda}\left( \pi\left[ \text{branch}\{R_A, \text{tree}_\lambda(y)\} \right],\ \nu_{\text{reg}} \right) \right)$$
for $\phi \sim N(c, \|\nu_{\text{reg}}\|_2^2 \sigma^2)$, as the proof is similar to the proof of Theorem 6.

Appendix F. Effect of Using the Computationally Efficient Alternative in Section 4.3

In this section, we investigate the effect on power of using $p^{I}_{\text{reg}}(y)$ from (22) rather than $p_{\text{reg}}(y)$ from (15). We also investigate the effect of using (23) rather than (17) on the width of confidence intervals for $\nu_{\text{reg}}^T \mu$. We generate data as described in Section 5.1, but for simplicity we restrict our attention to the case where $a = 1$ (corresponding to the center panel of Figure 3).
For each tree that we build, we consider (a) testing $H_0 : \nu_{\text{reg}}^T \mu = 0$ and (b) constructing a confidence interval for $\nu_{\text{reg}}^T \mu$, for each region appearing at the third level of the tree. For each test, we compare the test that uses the full conditioning set (i.e., that uses $p_{\text{reg}}(y)$ from (15)) to the test that uses the identity permutation only (i.e., that uses $p^{I}_{\text{reg}}(y)$ from (22)). For each interval, we compare the method that uses the full conditioning set (i.e., (17)) to the method that uses the identity permutation only (i.e., (23)).

The results are displayed in Figure 11. The left panel of Figure 11 shows that the power loss resulting from using (22) instead of (15) is negligible. In fact, we need to zoom in on the left panel, as shown in the center panel, to see any separation between the power curves. We see in the center panel that power is lower when (22) is used, though we emphasize that the differences in power are extremely small. We see in the right panel of Figure 11 that the computationally efficient conditioning set has a more noticeable impact on the median width of our confidence intervals. As expected, the confidence intervals are narrower when we use the full conditioning set (i.e., (17)). However, this difference is most noticeable when $b$ is small, in which case the confidence intervals are wide even when the full conditioning set is used. Thus, the precision lost by constructing confidence intervals using the identity permutation only (i.e., (23)) is not of practical importance. Based on these results, we recommend using the identity permutation in practice: it is the most computationally efficient choice, and it does not meaningfully reduce power or precision compared to the full conditioning set.

Figure 11: Simulation results comparing inference based on the full conditioning set to inference based on the identity permutation only (see Section 4.3). The left panel shows power curves. The center panel zooms in on one section of the left panel. The right panel shows median widths of confidence intervals.

Appendix G. Robustness to Non-Normality

In this section, we explore the performance of the selective Z-test under a global null when the normality assumption on $Y$ is violated. For four choices of cumulative distribution function $F$, we generate $Y_i \overset{\text{i.i.d.}}{\sim} F$ such that all observations have the same expected value. We set $n = 200$ and $p = 10$, and we grow trees to a maximum depth of 3. We plug in $(n-1)^{-1} \sum_{i=1}^{n} (y_i - \bar{y})^2$ as an estimate of $\sigma^2$, as it does not make sense to assume known variance for distributions with a mean-variance relationship.

Figure 12 displays quantile-quantile plots of the p-values for testing $H_0 : \nu_{\text{sib}}^T \mu = 0$, using the test in (8). Figure 12 shows that, despite the fact that our proposed selective inference framework was derived under a normality assumption, it yields approximately uniformly distributed p-values for Poisson(10), Bernoulli(0.5), and Gamma(1, 10) data. We suspect that this is because $P_{\nu}^{\perp} Y$ is approximately independent of $\nu^T Y$ for these distributions (see the proof of Theorem 6 in Appendix B); a crude numerical probe of this heuristic appears below. In the case of Bernoulli(0.1) data, which represents a particularly extreme violation of the normality assumption, the p-values from our selective Z-test are not uniformly distributed. As mentioned in Section 7, future work could involve characterizing conditions on $F$ under which our selective Z-tests will approximately control the selective Type 1 error.

Figure 12: Quantile-quantile plots of the p-values for testing $H_0 : \nu_{\text{sib}}^T \mu = 0$ under a global null. A naive Z-test (green), sample splitting (blue), and selective Z-test (pink) were performed; see Section 5.2.
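The heuristic that $P_{\nu}^{\perp} Y$ is approximately independent of $\nu^T Y$ can be probed crudely by simulation. The sketch below is ours, not part of the paper's experiments: it correlates $(\nu^T Y)^2$ with a nonlinear summary of the orthogonal part, a quantity that tends to zero under exact independence (the normal case). The printed magnitudes are Monte Carlo estimates only, not guarantees about any particular distribution.

```python
import numpy as np

rng = np.random.default_rng(11)
n, reps = 200, 4000
nu = np.zeros(n)
nu[:10], nu[10:20] = 0.1, -0.1          # a sibling-style contrast vector

def probe(draw):
    """Correlate (nu^T Y)^2 with ||P_perp Y||^2 across Monte Carlo draws.
    Exact independence drives this toward zero as reps grows; a clearly
    nonzero value signals dependence between the two quantities."""
    phi2 = np.empty(reps)
    orth = np.empty(reps)
    for r in range(reps):
        y = draw()
        t = nu @ y
        phi2[r] = t * t
        resid = y - t * nu / (nu @ nu)   # P_perp y
        orth[r] = resid @ resid
    return np.corrcoef(phi2, orth)[0, 1]

print(probe(lambda: rng.normal(10.0, 1.0, n)))               # exact independence
print(probe(lambda: rng.poisson(10.0, n).astype(float)))     # Poisson(10)
print(probe(lambda: rng.binomial(1, 0.1, n).astype(float)))  # Bernoulli(0.1)
```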
Appendix H. Alternate Box Lunch Study Analysis

Figure 13 is the same as the left panel of Figure 9, but the selective Z-inference is carried out with $\hat{\sigma}_{\text{cons}}$, from Section 5.7, rather than $\hat{\sigma}_{\text{SSE}}$. The takeaways presented in Section 6 do not change.

Figure 13: A CART tree fit to the Box Lunch Study data. Each split has been labeled with a p-value (8), and each region has been labeled with a confidence interval (23). Inference is carried out by plugging in $\hat{\sigma}_{\text{cons}}$, from Section 5.7, as an estimate of $\sigma$.

References

Susan Athey and Guido Imbens. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360, 2016.

PK Bhattacharya. Some aspects of change-point analysis. Lecture Notes-Monograph Series, pages 28–56, 1994.

Richard Bourgon. Overview of the intervals package, 2009. R vignette, URL https://cran.r-project.org/web/packages/intervals/vignettes/intervals_overview.pdf.

Leo Breiman, Jerome Friedman, Charles J Stone, and Richard A Olshen. Classification and Regression Trees. CRC Press, 1984.

Shuxiao Chen and Jacob Bien. Valid inference corrected for outlier removal. Journal of Computational and Graphical Statistics, 29(2):323–334, 2020.

Yiqun T Chen and Daniela M Witten. Selective inference for k-means clustering. arXiv preprint arXiv:2203.15267, 2022.

William Fithian, Dennis Sun, and Jonathan Taylor. Optimal inference after model selection. arXiv preprint arXiv:1410.2597, 2014.

Lucy L Gao, Jacob Bien, and Daniela Witten. Selective inference for hierarchical clustering. arXiv preprint arXiv:2012.02936, 2020.

Torsten Hothorn and Achim Zeileis. partykit: A modular toolkit for recursive partytioning in R. The Journal of Machine Learning Research, 16(1):3905–3909, 2015.

Torsten Hothorn, Kurt Hornik, and Achim Zeileis. Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3):651–674, 2006.

Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, 1985.

Sangwon Hyun, Kevin Z Lin, Max G'Sell, and Ryan J Tibshirani. Post-selection inference for changepoint detection algorithms with application to copy number variation data. Biometrics, pages 1–13, 2021.

Sean Jewell, Paul Fearnhead, and Daniela Witten. Testing for a change in mean after changepoint detection. Journal of the Royal Statistical Society, Series B, 2022.

Danijel Kivaranovic and Hannes Leeb. On the length of post-model-selection confidence intervals conditional on polyhedral constraints. Journal of the American Statistical Association, 116(534):845–857, 2021.

Jason D Lee, Dennis L Sun, Yuekai Sun, and Jonathan E Taylor. Exact post-selection inference, with application to the lasso. The Annals of Statistics, 44(3):907–927, 2016.

Keli Liu, Jelena Markovic, and Robert Tibshirani. More powerful post-selection inference, with application to the lasso. arXiv preprint arXiv:1801.09037, 2018.

Wei-Yin Loh. Fifty years of classification and regression trees. International Statistical Review, 82(3):329–348, 2014.
Wei-Yin Loh, Haoda Fu, Michael Man, Victoria Champion, and Menggang Yu. Identification of subgroups with differential treatment effects for longitudinal and multiresponse variables. Statistics in Medicine, 35(26):4837–4855, 2016.

Wei-Yin Loh, Michael Man, and Shuaicheng Wang. Subgroups from regression trees with adjustment for prognostic effects and postselection inference. Statistics in Medicine, 38(4):545–557, 2019.

Brian D Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, 1996.

Mark Robert Segal. Regression trees for censored data. Biometrics, 44(1):35–47, 1988.

Jonathan Taylor and Robert J Tibshirani. Statistical learning and selective inference. Proceedings of the National Academy of Sciences, 112(25):7629–7634, 2015.

Terry Therneau and Beth Atkinson. rpart: Recursive Partitioning and Regression Trees, 2019. R package version 4.1-15, available on CRAN.

Xiaoying Tian and Jonathan Taylor. Asymptotics of selective inference. Scandinavian Journal of Statistics, 44(2):480–499, 2017.

Xiaoying Tian and Jonathan Taylor. Selective inference with a randomized response. The Annals of Statistics, 46(2):679–710, 2018.

Ryan J Tibshirani, Jonathan Taylor, Richard Lockhart, and Robert Tibshirani. Exact post-selection inference for sequential regression procedures. Journal of the American Statistical Association, 111(514):600–620, 2016.

Ryan J Tibshirani, Alessandro Rinaldo, Rob Tibshirani, and Larry Wasserman. Uniform asymptotic inference and the bootstrap after model selection. The Annals of Statistics, 46(3):1255–1287, 2018.

Ashwini Venkatasubramaniam and Julian Wolfson. visTree: Visualization of Subgroups for Decision Trees, 2018. R package version 0.8.1, available on CRAN.

Ashwini Venkatasubramaniam, Julian Wolfson, Nathan Mitchell, Timothy Barnes, Meghan JaKa, and Simone French. Decision trees in epidemiological research. Emerging Themes in Epidemiology, 14(1):11, 2017.

Stefan Wager and Guenther Walther. Adaptive concentration of regression trees, with application to random forests. arXiv preprint arXiv:1503.06388, 2015.