# Tree-Values: Selective Inference for Regression Trees

Journal of Machine Learning Research 23 (2022) 1-43. Submitted 6/21; Revised 10/22; Published 11/22. Editor: Garvesh Raskutti.

Anna C. Neufeld (aneufeld@uw.edu), Department of Statistics, University of Washington, Seattle, WA 98195, USA
Lucy L. Gao (lucy.gao@stat.ubc.ca), Department of Statistics, University of British Columbia, Vancouver, British Columbia, V6T 1Z4, Canada
Daniela M. Witten (dwitten@uw.edu), Departments of Statistics and Biostatistics, University of Washington, Seattle, WA 98195, USA

©2022 Anna C. Neufeld, Lucy L. Gao, Daniela M. Witten. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/21-0722.html.

Abstract

We consider conducting inference on the output of the Classification and Regression Tree (CART) (Breiman et al., 1984) algorithm. A naive approach to inference that does not account for the fact that the tree was estimated from the data will not achieve standard guarantees, such as Type 1 error rate control and nominal coverage. Thus, we propose a selective inference framework for conducting inference on a fitted CART tree. In a nutshell, we condition on the fact that the tree was estimated from the data. We propose a test for the difference in the mean response between a pair of terminal nodes that controls the selective Type 1 error rate, and a confidence interval for the mean response within a single terminal node that attains the nominal selective coverage. Efficient algorithms for computing the necessary conditioning sets are provided. We apply these methods in simulation and to a dataset involving the association between portion control interventions and caloric intake.

Keywords: Regression trees, CART, selective inference, post-selection inference, hypothesis testing.

1. Introduction

Regression tree algorithms recursively partition the covariate space using binary splits to obtain regions that are maximally homogeneous with respect to a continuous response. The Classification and Regression Tree (CART; Breiman et al. 1984) proposal, which involves growing a large tree and then pruning it back, is by far the most popular of these algorithms. The regions defined by the splits in a fitted CART tree induce a piecewise constant regression model in which the predicted response within each region is the mean of the observations in that region. CART is popular in large part because it is highly interpretable: someone without technical expertise can easily read the tree to make predictions, and to understand why a certain prediction is made. However, its interpretability belies the fact that CART trees are highly unstable: a small change to the training dataset can drastically change the structure of the fitted tree. In the absence of an established notion of statistical significance associated with a given split in the tree, it is hard for a practitioner to know whether they are interpreting signal or noise. In this paper, we use the framework of selective inference to fill this gap by providing a toolkit for conducting inference on hypotheses motivated by the output of the CART algorithm. Given a CART tree, consider testing for a difference in the mean response of the regions resulting from a binary split.
A very naive approach, such as a two-sample Z-test, that does not account for the fact that the regions were themselves estimated from the data will fail to control the selective Type 1 error rate: the probability of rejecting a true null hypothesis, given that we decided to test it (Fithian et al., 2014). Similarly, a naive Z-interval for the mean response in a region will not attain nominal selective coverage: the probability that the interval covers the parameter, given that we chose to construct it. In fact, approaches for conducting inference on the output of a regression tree are quite limited. Sample splitting involves fitting a CART tree using a subset of the observations, which will naturally lead to an inferior tree to the one resulting from all of the observations, and thus is unsatisfactory in many applied settings; see Athey and Imbens (2016). Wager and Walther (2015) develop convergence guarantees for unpruned CART trees that can be leveraged to build confidence intervals for the mean response within a region; however, they do not provide finite-sample results and cannot accommodate pruning. Loh et al. (2016) and Loh et al. (2019) develop bootstrap calibration procedures that attempt to provide confidence intervals for the regions of a regression tree. In Appendix A, we show that this bootstrap calibration approach fails to provide intervals that achieve nominal coverage for the parameters of interest in this paper. As an alternative to performing inference on a CART tree, one could turn to the conditional inference tree (CTree) framework of Hothorn et al. (2006). This framework uses a different tree-growing algorithm than CART, and at each split tests for linear association between the split covariate and the response. As summarized in Loh (2014), the CTree framework alleviates issues with instability and variable selection bias associated with CART. Despite these advantages, CTree remains far less widely-used than CART. Furthermore, while CTree attaches a notion of statistical significance to each split in a tree, it does not directly allow for inference on the mean response within a region or the difference in mean response between two regions. Finally, while the CTree framework requires few assumptions, its inference is based on asymptotics. In this paper, we introduce a finite-sample selective inference (Fithian et al., 2014) framework for the difference between the mean responses in two regions, and for the mean response in a single region, in a pruned or unpruned CART tree. We condition on the event that CART yields a particular set of regions, and thereby achieve selective Type 1 error rate control as well as nominal selective coverage. The rest of this paper is organized as follows. In Section 2, we review the CART algorithm, and briefly define some key ideas in selective inference. In Section 3, we present our proposal for selective inference on the regions estimated via CART. We show that the necessary conditioning sets can be efficiently computed in Section 4. In Section 5 we compare our framework to sample splitting and CTree via simulation. In Section 6 we Selective Inference for Regression Trees compare our framework to CTree on data from the Box Lunch Study. The discussion is in Section 7. Technical details are relegated to the supplementary materials. 2. Background 2.1 Notation for Regression Trees Given p covariates (X1, . . . , Xp) measured on each of n observations (x1, . . . 
, xn), let xj,(s) denote the sth order statistic of the jth covariate, and define the half-spaces χj,s,1 = z Rp : zj xj,(s) , χj,s,0 = z Rp : zj > xj,(s) . (1) The following definitions are illustrated in Figure 1. Definition 1 (Tree and Region) Consider a set S such that R Rp for all R S. Then S is a tree if and only if (i) Rp S; (ii) every element of S \ Rp equals R χj,s,e for some R S, j {1, . . . , p}, s {1, . . . , n 1}, e {0, 1}; (iii) R χj,s,e S implies that R χj,s,1 e S for e {0, 1}; and (iv) for any R, R S, R R { , R, R }. If R S and S is a tree, then we refer to R as a region. We use the notation tree to refer to a particular tree. Definition 1 implies that any region R tree \{Rp} is of the form R = L l=1χjl,sl,el, where for each l = 1, . . . , L, we have that jl {1, . . . , p}, sl {1, . . . , n 1}, and el {0, 1}. We call L the level of the region, and use the convention that the level of Rp is 0. Definition 2 (Siblings and Children) Suppose that {R, R χj,s,1, R χj,s,0} tree. Then R χj,s,1 and R χj,s,0 are siblings. Furthermore, they are the children of R. Definition 3 (Descendant and Ancestor) If R, R tree and R R , then R is a descendant of R , and R is an ancestor of R. Definition 4 (Terminal Region) A region R tree without descendants is a terminal region. We let desc(R, tree) denote the set of descendants of region R in tree, and we let term(R, tree) denote the subset of desc(R, tree) that are terminal regions. Given a response vector y Rn, let y R = P i:xi R yi / Pn i=1 1(xi R) , where 1(A) is an indicator variable that equals 1 if the event A holds, and 0 otherwise. Then, a tree tree induces the regression model ˆµ(x) = P R term(Rp,tree) y R1(x R). In other words, it predicts the response within each terminal region to be the mean of the observations in that region. 2.2 A Review of the CART Algorithm (Breiman et al., 1984) The CART algorithm (Breiman et al., 1984) greedily searches for a tree that minimizes the sum of squared errors P R term(Rp,tree) P i:xi R(yi y R)2. It first grows a very large tree via recursive binary splits, starting with the full covariate space Rp. To split a region R, it selects the covariate xj and the split point xj,(s) to maximize the gain, defined as gain R (y, j, s) X i R (yi y R)2 yi y R χj,s,1 2 + X yi y R χj,s,0 2 Neufeld, Gao, Witten Figure 1: The regression tree takes the form tree = {Rp, χ1,s1,1, χ1,s1,0, χ1,s1,1 χ2,s2,1, χ1,s1,1 χ2,s2,0}. The regions RA = χ1,s1,1 χ2,s2,1 and RB = χ1,s1,1 χ2,s2,0 are siblings, and are children, and therefore descendants, of the region χ1,s1,1. The ancestors of RA and RB are Rp and χ1,s1,1. Furthermore, RA, RB, and χ1,s1,0 are terminal regions. Details are provided in Algorithm A1. Once a very large tree has been grown, cost-complexity pruning is applied. We define the average per-region gain in sum-of-squared errors provided by the descendants of a region R, g(R, tree, y) = i:xi R (yi y R)2 P r term(R,tree) i:xi r (yi yr)2 | term(R, tree)| 1 . (3) Given a complexity parameter λ 0, if g(R, tree, y) < λ for some R tree, then costcomplexity pruning removes R s descendants from tree, turning R into a terminal region. Details are in Algorithm A2, which involves the notion of a bottom-up ordering. Definition 5 (Bottom-up ordering) Let tree = {R1, . . . , RK}. Let π be a permutation of the integers (1, . . . , K). Then O = Rπ(1), . . . , Rπ(K) is a bottom-up ordering of the regions in tree if, for all k = 1, . . . , K, π(k) π(j) if Rk desc(Rj, tree). 
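As an illustration of the gain in (2) and the greedy search in Step 2 of Algorithm A1, the following minimal R sketch computes the gain of every candidate split of a region and returns the best one. This is not the rpart implementation used later in the paper; the function names `sse`, `gain_R`, and `best_split` are ours, and stopping rules, tie-breaking, and pruning are omitted.

```r
# Sum of squared errors of y around its mean.
sse <- function(y) sum((y - mean(y))^2)

# gain_R(y, j, s): decrease in SSE from splitting on covariate j at the
# s-th order statistic, as in equation (2). X and y are already restricted
# to the observations in the region R being split. (Illustrative sketch.)
gain_R <- function(X, y, j, s) {
  cut <- sort(X[, j])[s]
  left <- X[, j] <= cut
  if (all(left) || !any(left)) return(0)   # an empty child is not a valid split
  sse(y) - sse(y[left]) - sse(y[!left])
}

# Greedy search over all (j, s), as in Step 2 of Algorithm A1.
best_split <- function(X, y) {
  n <- nrow(X); p <- ncol(X)
  best <- list(gain = -Inf, j = NA, s = NA)
  for (j in seq_len(p)) {
    for (s in seq_len(n - 1)) {
      g <- gain_R(X, y, j, s)
      if (g > best$gain) best <- list(gain = g, j = j, s = s)
    }
  }
  best
}

set.seed(1)
X <- matrix(rnorm(100 * 2), 100, 2)
y <- 3 * (X[, 1] <= 0) + rnorm(100)
best_split(X, y)   # should recover a split on covariate 1 near 0
```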
There are other equivalent formulations for cost-complexity pruning (see Proposition 7.2 in Ripley (1996)); the formulation in Algorithm A2 is convenient for establishing the results in this paper. To summarize, the CART algorithm first applies Algorithm A1 to the initial region Rp and the data y to obtain an unpruned tree, which we call tree0(y). It then applies Algorithm A2 to tree0(y) to obtain an optimally-pruned tree using complexity parameter λ, which we call treeλ(y). Algorithm A1 (Growing a tree) Grow(R, y) 1. If a stopping condition is met, return R. 2. Else return {R, Grow(R χ j, s,1, y), Grow(R χ j, s,0, y)}, where ( j, s) arg max(j,s):s {1,...,n 1},j {1,...,p} gain R (y, j, s) . Selective Inference for Regression Trees Algorithm A2 (Cost-complexity pruning) Parameter O is a bottom-up ordering of the K regions in tree. Prune(tree, y, λ, O) 1. Let tree0 = tree. Let K be the number of regions in tree0. 2. For k = 1, . . . , K: (a) Let R be the kth region in O. (b) Update treek as follows, where g( ) is defined in (3): ( treek 1 \ desc(R, treek 1) if g(R, treek 1, y) < λ, treek 1 otherwise. 3. Return tree K. 2.3 A Brief Overview of Selective Inference Here, we provide a very brief overview of selective inference; see Fithian et al. (2014) or Taylor and Tibshirani (2015) for a more detailed treatment. Consider conducting inference on a parameter θ. Classical approaches assume that we were already interested in conducting inference on θ before looking at our data. If, instead, our interest in θ was sparked by looking at our data, then inference must be performed with care: we must account for the fact that we selected θ based on the data (Fithian et al., 2014). In this setting, interest focuses on a p-value p(Y ) such that the test for H0 : θ = θ0 based on p(Y ) controls the selective Type 1 error rate, in the sense that pr H0:θ=θ0 {p(Y ) α | θ selected} α, for all 0 α 1. (4) Also of interest are confidence intervals [L(Y ), U(Y )] that achieve (1 α)-selective coverage for the parameter θ, meaning that pr {θ [L(Y ), U(Y )] | θ selected} 1 α. (5) Roughly speaking, the inferential guarantees in (4) and (5) can be achieved by defining p-values and confidence intervals that condition on the aspect of the data that led to the selection of θ. In recent years, a number of papers have taken this approach to perform selective inference on parameters selected from the data in the regression (Lee et al., 2016; Liu et al., 2018; Tian and Taylor, 2018; Tibshirani et al., 2016), clustering (Gao et al., 2020), and changepoint detection (Hyun et al., 2021; Jewell et al., 2022) settings. In the next section, we propose p-values that satisfy (4) and confidence intervals that satisfy (5) in the setting of CART, where the parameter of interest is either the mean response within a region, or the difference between the mean responses of two sibling regions. 3. The Selective Inference Framework for CART 3.1 Inference on a Pair of Sibling Regions Throughout this paper, we assume that Y Nn(µ, σ2In) with σ > 0 known. We let X Rn p denote a fixed covariate matrix. Suppose that we apply CART with complexity parameter λ to a realization y = (y1, . . . , yn)T from Y to obtain treeλ(y). Given Neufeld, Gao, Witten sibling regions RA and RB in treeλ(y), we define a contrast vector νsib Rn such that (νsib)i = 1(xi RA) Pn i =1 1(xi RA) 1(xi RB) Pn i =1 1(xi RB) , (6) and νT sibµ = P i:xi RA µi / Pn i=1 1(xi RA) P i:xi RB µi / Pn i=1 1(xi RB) . 
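To make the contrast in (6) concrete, the sketch below builds the vector from the two region-membership indicators and evaluates the naive two-sample Z-test mentioned in the introduction, which compares the difference in sample means to a fixed normal reference and therefore ignores the fact that RA and RB were chosen by CART; the selective approach developed below replaces that reference with a truncated normal distribution. The variable names (`in_A`, `in_B`) are ours.

```r
# Contrast vector nu_sib of equation (6), given logical membership
# indicators for the two sibling regions R_A and R_B.
nu_sib <- function(in_A, in_B) {
  as.numeric(in_A) / sum(in_A) - as.numeric(in_B) / sum(in_B)
}

# Naive two-sample Z-test for H0: nu^T mu = 0 with sigma known. This treats
# nu as fixed in advance, so it does not control the selective Type 1 error
# rate when R_A and R_B were selected by CART.
naive_z_test <- function(y, in_A, in_B, sigma) {
  nu <- nu_sib(in_A, in_B)
  stat <- sum(nu * y)              # nu^T y = difference in sample means
  se <- sigma * sqrt(sum(nu^2))    # standard deviation of nu^T Y for fixed nu
  2 * pnorm(abs(stat) / se, lower.tail = FALSE)
}

set.seed(2)
y <- rnorm(50)
in_A <- rep(c(TRUE, FALSE), each = 25)   # a hypothetical pair of sibling regions
in_B <- !in_A
naive_z_test(y, in_A, in_B, sigma = 1)
```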
Now, consider testing the null hypothesis of no difference in means between RA and RB, i.e. H0 : νT sibµ = 0 versus H1 : νT sibµ = 0. This null hypothesis is of interest because RA and RB appeared as siblings in treeλ(y). A test based on a p-value of the form pr H0 (|νT sib Y | |νT siby|) that does not account for this will not control the selective Type 1 error rate in (4). To control the selective Type 1 error rate, we propose a p-value that conditions on the aspect of the data that led us to select νT sibµ, pr H0 n |νT sib Y | |νT siby| | RA, RB are siblings in treeλ(Y ) o . (7) But (7) depends on a nuisance parameter, the portion of µ that is orthogonal to νsib. To remove the dependence on this nuisance parameter, we condition on its sufficient statistic P νsib Y , where P ν = I ννT/ ν 2 2. The resulting p-value, or tree-value , is defined as psib(y) = pr H0 n |νT sib Y | |νT siby| | RA, RB are siblings in treeλ(Y ), P νsib Y = P νsiby o . (8) Results similar to Theorem 6 can be found in Jewell et al. (2022); Lee et al. (2016); Liu et al. (2018), and Tibshirani et al. (2016). Theorem 6 The test based on the p-value psib(y) in (8) controls the selective Type 1 error rate for H0 : νT sibµ = 0, where νsib is defined in (6), in the sense that pr H0 n psib(Y ) α | RA, RB are siblings in treeλ(Y ) o = α, for all 0 α 1. (9) Furthermore, psib(y) = pr |φ| |νT siby| | φ Sλ sib(νsib) , where φ N(0, νsib 2 2σ2), y (φ, ν) = P ν y + φ(ν/ ν 2 2), and Sλ sib(νsib) = {φ : RA, RB are siblings in treeλ{y (φ, νsib)}}. (10) Proofs of all theoretical results are provided in the appendix. Theorem 6 says that given the set Sλ sib(νsib), we can compute the p-value in (8) using psib(y) = 1 F n |νT siby|; 0, νsib 2 2σ2, Sλ sib(νsib) o +F n |νT siby|; 0, νsib 2 2σ2, Sλ sib(νsib) o , (11) where F ; 0, ν 2σ2, S denotes the cumulative distribution function of the N(0, ν 2 2σ2) distribution truncated to the set S. In Section 4, we provide an efficient approach for analytically characterizing the truncation set Sλ sib(νsib). To avoid numerical issues associated with the truncated normal distribution, we compute (11) using methods described in the supplement of Chen and Bien (2020). Note that the proof of Theorem 6, and consequently the efficient computation of psib(y) discussed in Section 4, relies on the assumption that Y Nn(µ, σ2In). Selective Inference for Regression Trees We now consider inverting the test proposed in (8) to construct an equitailed confidence interval for νT sibµ that has (1 α)-selective coverage (5), in the sense that pr n νT sibµ [L(Y ), U(Y )] | RA, RB are siblings in treeλ(Y ) o = 1 α. (12) Proposition 7 For any 0 α 1 and any realization y Rn, the values L(y) and U(y) that satisfy F{ν T siby; L(y), σ2 νsib 2 2, Sλ sib(νsib)} = 1 α/2, F{ν T siby; U(y), σ2 νsib 2 2, Sλ sib(νsib)} = α/2, (13) are unique, and [L(Y ), U(Y )] achieves (1 α)-selective coverage for νT sibµ. 3.2 Inference on a Single Region Given a single region RA in a CART tree, we define the contrast vector νreg such that (νreg)i = 1(xi RA)/ i =1 1(xi RA) Then, νT regµ = P i:xi RA µi / Pn i=1 1(xi RA) . We now consider testing the null hypothesis H0 : νT regµ = c for some fixed c. Because our interest in this null hypothesis results from the fact that RA treeλ(y), we must condition on this event in defining the p-value. We define preg(y) = pr H0 n |νT reg Y c| |νT regy c| | RA treeλ(Y ), P νreg Y = P νregy o , (15) and introduce the following theorem. 
Theorem 8 The test based on the p-value preg(y) in (15) controls the selective Type 1 error rate for H0 : νT regµ = c, where νreg is defined in (14). Furthermore, preg(y) = pr |φ c| |νT regy c| | φ Sreg(νreg) , where φ N(c, νreg 2 2σ2) and, for y (φ, ν) = P ν y + φ(ν/ ν 2 2), Sλ reg(νreg) = {φ : RA treeλ{y (φ, νreg)}}. (16) Theorem 2 and the resulting efficient computations in Section 4 rely on the assumption that Y Nn(µ, σ2In). We can also define a confidence interval for νT regµ that attains nominal selective coverage. Proposition 9 For any 0 α 1 and any realization y Rn, the values L(y) and U(y) that satisfy F{ν T regy; L(y), σ2 νreg 2 2, Sλ reg(νreg)} = 1 α/2, F{ν T regy; U(y), σ2 νreg 2 2, Sλ reg(νreg)} = α/2, (17) are unique, and [L(Y ), U(Y )] achieves (1 α)-selective coverage for νT regµ. In Section 4, we propose an approach to analytically characterize the set Sλ reg(νreg) in (16). Neufeld, Gao, Witten 3.3 Intuition for the Conditioning Sets Sλ sib(νsib) and Sλ reg(νreg) We first develop intuition for the set Sλ sib(νsib) defined in (10). From Theorem 6, y (φ, νsib) i = yi+(φ νT siby) ( Pn i =1 1(xi RB) Pn i =1 1(xi RA RB) 1(xi RA) Pn i =1 1(xi RA) Pn i =1 1(xi RA RB) 1(xi RB) Thus, y (φ, νsib) is a perturbation of y that exaggerates the difference between the observed sample mean responses of RA and RB if |φ| > |νT siby|, and shrinks that difference if |φ| < |νT siby|. The set Sλ sib(νsib) quantifies the amount that we can shift the difference in sample mean responses between RA and RB while still producing a tree containing these sibling regions. The top row of Figure 2 displays tree0{y (φ, νsib)}, as a function of φ, in an example where S0 sib(νsib) = ( 19.8, 1.8) (0.9, 34.9). Figure 2: Data with n = 100 and p = 2. Regions resulting from CART (λ = 0) are delineated using solid lines. Here, RA = χ1,26,0 χ2,72,1 and RB = χ1,26,0 χ2,72,0. Top: Output of CART applied to y (φ, νsib), where νsib in (6) encodes the contrast between RA and RB, for various values of φ. The left-most panel displays y = y (νT siby, νsib). By inspection, we see that 14.9 S0 sib(νsib) and 5 S0 sib(νsib), but 0 S0 sib(νsib) and 40 S0 sib(νsib). In fact, S0 sib(νsib) = ( 19.8, 1.8) (0.9, 34.9). Bottom: Output of CART applied to y (φ, νreg), where νreg in (14) encodes membership in RA. The left-most panel displays y = y (νT regy, νreg). Here, S0 reg(νreg) = ( , 3.1) (5.8, 8.8) (14.1, ). We next develop intuition for Sλ reg(νreg), defined in (16). Note that {y (φ, νreg)}i = yi + (φ νT regy)1(xi RA), where y (φ, νreg) is defined in Theorem 8. Thus, y (φ, νreg) shifts the responses of the observations in RA so that their sample mean equals φ, and leaves the others unchanged. The set Sλ reg(νreg) quantifies the amount that we can exaggerate or shrink the sample mean response in region RA while still producing a tree that contains RA. The bottom row of Figure 2 displays y (φ, νreg) as φ is varied, in an example with S0 reg(νreg) = ( , 3.1) (5.8, 8.8) (14.1, ). Selective Inference for Regression Trees 4. Computing the conditioning sets Sλ sib(νsib) and Sλ reg(νreg) 4.1 Recharacterizing the conditioning sets in terms of branches We begin by introducing the concept of a branch. Definition 10 (Branch) A branch is an ordered sequence of triples B = ((j1, s1, e1), . . . , (j L, s L, e L)) such that jl {1, . . . , p}, sl {1, . . . , n 1}, and el {0, 1} for l = 1, . . . , L. The branch B induces a nested set of regions R(B) = {R(0), R(1), . . . , R(L)}, where R(l) = Tl l =1 χjl ,sl ,el for l = 1, . . . , L, and R(0) = Rp. 
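A branch in the sense of Definition 10 is convenient to represent directly as an ordered list of (j, s, e) triples. The minimal sketch below builds the membership indicators of the nested regions R(0), R(1), ..., R(L) induced by a branch, using the half-spaces in (1); the representation and function names are ours, and the example triples mirror the regions of Figure 2 applied to freshly simulated covariates.

```r
# A branch is an ordered list of triples (j, s, e): covariate index, order
# statistic index, and side (1 for <=, 0 for >), as in Definition 10.
make_branch <- function(j, s, e) {
  Map(function(a, b, c) list(j = a, s = b, e = c), j, s, e)
}

# Logical membership indicators for the nested regions R(0), R(1), ..., R(L):
# R(l) is the intersection of the first l half-spaces of the branch.
branch_regions <- function(X, branch) {
  in_region <- rep(TRUE, nrow(X))          # R(0) = R^p contains every observation
  regions <- list(in_region)
  for (tr in branch) {
    cut <- sort(X[, tr$j])[tr$s]           # the s-th order statistic of covariate j
    halfspace <- if (tr$e == 1) X[, tr$j] <= cut else X[, tr$j] > cut
    in_region <- in_region & halfspace
    regions <- c(regions, list(in_region))
  }
  regions
}

set.seed(4)
X <- matrix(rnorm(100 * 2), 100, 2)
# Triples of the branch of R_A in Figure 2; X here is simulated, so the
# actual cut points differ from the figure.
b <- make_branch(j = c(1, 2), s = c(26, 72), e = c(0, 1))
regions <- branch_regions(X, b)
sapply(regions, sum)   # number of observations in R(0), R(1), R(2)
```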
For a branch B and a vector ν, we define Sλ(B, ν) = n φ : R(B) treeλ{y (φ, ν)} o . (18) For R tree, we let branch(R, tree) denote the branch such that R{branch(R, tree)} contains R and all of its ancestors in tree. Lemma 11 Suppose that RA and RB are siblings in treeλ(y). Then RA and RB are siblings in treeλ{y (φ, νsib)} if and only if R[branch{RA, treeλ(y)}] treeλ{y (φ, νsib)}. Therefore, Sλ sib(νsib) = Sλ[branch{RA, treeλ(y)}, νsib], defined in (10) and (18). Lemma 11 says that treeλ{y (φ, νsib)} contains siblings RA and RB if and only if it contains the entire branch associated with RA in treeλ(y). However, Lemma 11 does not apply in the single region case: for νreg defined in (14) and some RA treeλ(y), the fact that RA treeλ{y (φ, νreg)} does not imply that R[branch{RA, treeλ(y)}] treeλ{y (φ, νreg)}. Instead, a result similar to Lemma 11 holds, involving permutations of the branch. Definition 12 (Permutation of a branch) Let Π denote the set of all L! permutations of (1, 2, . . . , L). Given π Π and a branch B = ((j1, s1, e1), . . . , (j L, s L, e L)), we say that π (B) =((jπ(1), sπ(1), eπ(1)), . . . , (jπ(L), sπ(L), eπ(L))) is a permutation of the branch B. Branch B and its permutation π (B) induce the same region R(L), but R{π (B)} = R(B). Lemma 13 Let RA treeλ(y). Then RA treeλ{y (φ, νreg)} if and only if there exists a π Π such that R[π {branch RA(y)}] treeλ{y (φ, νreg)}. Thus, for Sλ reg(νreg) in (16), Sλ reg(νreg) = [ π Π Sλ π h branch{RA, treeλ(y)} i , νreg . (19) Lemmas 11 and 13 reveal that computing Sλ sib(νsib) and Sλ reg(νreg) requires characterizing sets of the form Sλ(B, ν), defined in (18). To compute Sλ sib(νsib) we will only need to consider Sλ(B, ν) where R(B) treeλ(y). However, to compute Sλ reg(νreg), we will need to consider Sλ{π(B), ν} where R(B) treeλ(y) but R{π(B)} treeλ(y). Neufeld, Gao, Witten 4.2 Computing Sλ(B, ν) in (18) Throughout this section, we consider a vector ν Rn and a branch B = ((j1, s1, e1), . . . , (j L, s L, e L)), where R(B) may or may not be in treeλ(y). Recall from Definition 10 that B induces the nested regions R(l) = Tl l =1 χjl ,sl ,el for l = 1, . . . , L, and R(0) = Rp. Throughout this section, our only requirement on B and ν is the following condition. Condition 1 For y (φ, ν) defined in Theorem 6, B and ν satisfy {y (φ, ν)}i = yi+c11{xi R(L)}+ c21[xi {R(L 1) χj L,s L,1 e L}] for i = 1, . . . , n and for some constants c1 and c2. To characterize Sλ(B, ν) in (18), recall that the CART algorithm in Section 2.2 involves growing a very large tree tree0(y), and then pruning it. We first characterize the set Sgrow(B, ν) = {φ : R(B) tree0{y (φ, ν)}}. (20) Proposition 14 Recall the definition of gain R(l){y (φ, ν), j, s} in (2), and let Sl,j,s = {φ : gain R(l 1){y (φ, ν), j, s} gain R(l 1){y (φ, ν), jl, sl}}. Then, Sgrow(B, ν) = TL l=1 Tp j=1 Tn 1 s=1 Sl,j,s. Proposition 15 says that we can compute Sgrow(B, ν) efficiently. Proposition 15 The set Sl,j,s is defined by a quadratic inequality in φ. Furthermore, we can evaluate all of the sets Sl,j,s, for l = 1, . . . , L, j = 1, . . . , p, s = 1, . . . , n 1, in O {np L + np log(n)} operations. Intersecting these sets to obtain Sgrow(B, ν) requires at most O {np L log(np L)} operations, and only O(np L) operations if B = branch{RA, treeλ(y)} and ν is of the form νsib in (6). Noting that Sλ(B, ν) = φ Sgrow(B, ν) : R(L) treeλ{y (φ, ν)} , it remains to characterize the set of φ Sgrow(B, ν) such that R(L) is not removed during pruning. Recall that g( ) was defined in (3). 
Proposition 16 There exists a tree tree(B, ν, λ) such that Sλ(B, ν) = Sgrow(B, ν) n φ : g n R(l), tree(B, ν, λ), y (φ, ν) o λ o! If R(B) treeλ(y), then tree(B, ν, λ) = treeλ(y) satisfies (21). Otherwise, given the set Sgrow(B, ν), computing a tree(B, ν, λ) that satisfies (21) has a worst-case computational cost of O(n2p). We explain how to compute a tree(B, ν, λ) satisfying (21) when R(B) treeλ(y) in the supplementary materials. Proposition 17 The set TL 1 l=0 φ : g R(l), tree(B, ν, λ), y (φ, ν) λ in (21) is the intersection of the solution sets of L quadratic inequalities in φ. Given tree(B, ν, λ), the coefficients of these quadratics can be obtained in O(n L) operations. After Sgrow(B, ν) has been computed, intersecting it with these quadratic sets to obtain Sλ(B, ν) from (21) requires O{np L log(np L)} operations in general, and only O(L) operations if B = branch{RA, treeλ(y)} and ν = νsib from (6). Selective Inference for Regression Trees The results in this section have relied upon Condition 1. Indeed, this condition holds for branches B and vectors ν that arise in characterizing the sets Sλ sib(νsib) and Sλ reg(νreg). Proposition 18 If either (i) B = branch{RA, treeλ(y)} and ν = νsib (6), where RA and RB are siblings in treeλ(y), or (ii) B is a permutation of branch{RA, treeλ(y)} and ν = νreg (14), where RA treeλ(y), then Condition 1 holds. Combining Lemma 11 with Propositions 14 18, we see that Sλ sib(νsib) can be computed in O{np L + np log(n)} operations. However, computing Sλ reg(νreg) is much more computationally intensive: by Lemma 13 and Propositions 14 18, it requires computing Sλ(π branch{RA, treeλ(y)} , νreg) for all L! permutations π Π, for a total of O L! n2p Llog(p L) operations. In Section 4.3, we discuss ways to avoid these calculations. 4.3 A Computationally-Efficient Alternative to Sλ reg(νreg) Lemma 13 suggests that carrying out inference on a single region requires computing Sλ(π branch{RA, treeλ(y)} , νreg) for every π Π. We now present a less computationally demanding alternative. Proposition 19 Let Q be a subset of the L! permutations in Π, i.e. Q Π. Define p Q reg(y) = pr H0 |ν T reg Y c| |ν T regy c| | [ R(π[branch{RA, treeλ(y)}]) treeλ(Y ) , P νreg Y = P νregy The test based on p Q reg(y) controls the selective Type 1 error rate (4) for H0 : νT regµ = c. Fur- thermore, p Q reg(y) = pr n |φ c| |νT regy c| | φ S π Q Sλ π[branch{RA, treeλ(y)}], νreg o , where φ N(c, νreg 2 2σ2). Using the notation in Proposition 19, preg(y) introduced in (15) equals pΠ reg(y). If we take Q = {I}, where I is the identity permutation, then we arrive at p I reg(y) = P |φ c| |νT regy c| | φ Sλ[branch{RA, treeλ(y)}, νreg] , (22) where φ N(c, νreg 2 2σ2). The set Sλ[branch{RA, treeλ(y)}, νreg] can be easily computed by Proposition 16. Compared to (15), (22) conditions on an extra piece of information: the ancestors of RA. Thus, while (22) controls the selective Type 1 error rate, it may have lower power than (15) (Fithian et al., 2014). Similarly, inverting (22) to form a confidence interval provides correct selective coverage, but may yield intervals that are wider than those in Proposition 9. Proposition 19 is motivated by a proposal by Lee et al. (2016) to condition on both the selected model (necessary information) and the signs of the selected variables (extra information) in the lasso setting, to gain computational efficiency at the possible expense of precision and power. 
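Propositions 14-17 reduce the conditioning sets to intersections of solution sets of quadratic inequalities in φ; once that intersection (a union of intervals) is in hand, it is exactly the truncation set that enters the p-value in (11). The sketch below solves and intersects a few quadratics and then evaluates the corresponding truncated-normal p-value. The coefficient bookkeeping of Propositions 15 and 17 is omitted, the function names are ours, and the example coefficients and contrast norm are illustrative rather than taken from a fitted tree.

```r
# Solution set of {phi : a*phi^2 + b*phi + c <= 0}, as a matrix of disjoint
# intervals (rows are (lo, hi); zero rows means the empty set).
solve_quad <- function(a, b, c, tol = 1e-12) {
  if (abs(a) < tol) {                             # linear or constant case
    if (abs(b) < tol) return(if (c <= 0) cbind(-Inf, Inf) else matrix(numeric(0), 0, 2))
    return(if (b > 0) cbind(-Inf, -c / b) else cbind(-c / b, Inf))
  }
  disc <- b^2 - 4 * a * c
  if (disc <= 0) return(if (a < 0) cbind(-Inf, Inf) else matrix(numeric(0), 0, 2))
  r <- sort(c((-b - sqrt(disc)) / (2 * a), (-b + sqrt(disc)) / (2 * a)))
  if (a > 0) cbind(r[1], r[2]) else rbind(cbind(-Inf, r[1]), cbind(r[2], Inf))
}

# Intersection of two unions of disjoint intervals.
intersect_sets <- function(A, B) {
  out <- matrix(numeric(0), 0, 2)
  for (i in seq_len(nrow(A))) for (j in seq_len(nrow(B))) {
    lo <- max(A[i, 1], B[j, 1]); hi <- min(A[i, 2], B[j, 2])
    if (lo < hi) out <- rbind(out, c(lo, hi))
  }
  out
}

# Probability that a N(mean, sd^2) variable falls in a union of intervals.
pnorm_set <- function(S, mean, sd) sum(pnorm(S[, 2], mean, sd) - pnorm(S[, 1], mean, sd))

# Selective p-value of equation (11): P(|phi| >= |nu^T y|) for
# phi ~ N(0, ||nu||^2 sigma^2) truncated to the set S.
selective_p <- function(stat, nu_norm2, sigma, S) {
  sd <- sigma * sqrt(nu_norm2)
  clip <- function(lo, hi) {
    out <- cbind(pmax(S[, 1], lo), pmin(S[, 2], hi))
    out[out[, 1] < out[, 2], , drop = FALSE]
  }
  tails <- rbind(clip(-Inf, -abs(stat)), clip(abs(stat), Inf))
  pnorm_set(tails, 0, sd) / pnorm_set(S, 0, sd)
}

# Illustrative quadratics (a, b, c); their intersection is a union of intervals.
quads <- list(c(1, -4, -21), c(-1, 0, 4), c(0, 1, -6))
S <- cbind(-Inf, Inf)
for (q in quads) S <- intersect_sets(S, solve_quad(q[1], q[2], q[3]))
S                                                 # (-3, -2) and (2, 6)
selective_p(stat = 2.5, nu_norm2 = 0.08, sigma = 5, S = S)
```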
In Appendix F, we show through simulation that the loss in power associated with using (22) rather than (15) is negligible. Thus, in practice, we suggest using (22) for its computational efficiency. We use (22) for the remainder of this paper. Neufeld, Gao, Witten Figure 3: The true mean model in Section 5, for a = 0.5 (left), a = 1 (center), and a = 2 (right). The difference in means between the sibling nodes at level two in the tree is ab, while the difference in means between the sibling nodes at level three is b. Furthermore, we can consider computing confidence intervals of the form [LSIreg(y), USIreg(y)] rather than (17), where LSIreg(y) and USIreg(y) satisfy F νT regy; LSIreg(y), σ2 νreg 2 2, Sλ h branch{RA, treeλ(y)}, νreg i = 1 α F νT regy; USIreg(y), σ2 νreg 2 2, Sλ h branch{RA, treeλ(y)}, νreg i = α In Appendix F, we show that the confidence intervals resulting from (23) are not much wider than those resulting from (17). We therefore make use of confidence intervals of the form (23) in the remainder of this paper. 5. Simulation Study 5.1 Data Generating Mechanism We simulate X Rn p with n = 200, p = 10, Xij i.i.d. N(0, 1), and y Nn(µ, σ2In) with σ = 5 and µi = b h 1(xi,1 0) {1 + a1(xi,2>0) + 1(xi,3 xi,2>0)} i . This µ vector defines a three-level tree, shown in Figure 3 for three values of a R. 5.2 Methods for Comparison All CART trees are fit using the R package rpart (Therneau and Atkinson, 2019) with λ = 200, a maximum level of three, and a minimum node size of one. We compare three approaches for conducting inference. (i) Selective Z-methods: Fit a CART tree to the data. For each split, test for a difference in means between the two sibling regions using (8), and compute the corresponding confidence interval in (13). Compute the confidence interval for the mean of each region using (23). (ii) Naive Z-methods: Fit a CART tree to the data. For each split, conduct a naive Z-test for the difference in means between the two sibling regions, and compute the corresponding naive Z-interval. Compute a naive Z-interval for each region s mean. (iii) Sample splitting: Split the data into equally-sized training and test sets. Fit a CART tree to the training set. On the test set, conduct a naive Z-test for each Selective Inference for Regression Trees split and compute a naive Z-interval for each split and each region. If a region has no test set observations, then we fail to reject the null hypothesis and fail to cover the parameter. The conditional inference tree (CTree) framework of Hothorn et al. (2006) uses a different criterion than CART to perform binary splits. Within a region, it tests for linear association between each covariate and the response. The covariate with the smallest p-value for this linear association is selected as the split variable, and a Bonferroni corrected p-value that accounts for the number of covariates is reported in the final tree. Then, the split point is selected. If, after accounting for multiple testing, no variable has a p-value below a prespecified significance level α, then the recursion stops. While CTree s p-values assess linear association and thus are not directly comparable to the p-values in (i) (iii) above, it is the most popular framework currently available for determining if a regression tree split is statistically significant. Thus, we also evaluate the performance of (iv) CTree: Fit a CTree to all of the data using the R package partykit (Hothorn and Zeileis, 2015) with α = 0.05. For each split, record the p-value reported by partykit. 
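The tree-fitting side of this comparison can be reproduced at a sketch level with rpart and partykit, as below, here under the global null of Section 5.3 (a = b = 0, so µ = 0n). Note that rpart's cp argument is a rescaled complexity parameter (relative to the root sum of squares), so it does not equal λ directly; the settings shown are our reading of the control options named above, and the selective p-values themselves require the conditioning-set computations of Section 4.

```r
library(rpart)
library(partykit)

set.seed(3)
n <- 200; p <- 10; sigma <- 5
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n, mean = 0, sd = sigma)            # global null: mu = 0
dat <- data.frame(y = y, X)

# (i)/(ii) CART: depth at most three, minimum node size one. cp = 0 grows the
# tree without cost-complexity pruning; the paper prunes with lambda = 200,
# which would be imposed on rpart's rescaled cp scale.
cart_fit <- rpart(y ~ ., data = dat,
                  control = rpart.control(maxdepth = 3, minsplit = 2,
                                          minbucket = 1, cp = 0, xval = 0))

# (iii) Sample splitting: fit on a random half, reserve the other half for testing.
train <- sample(n, n / 2)
cart_split <- rpart(y ~ ., data = dat[train, ],
                    control = rpart.control(maxdepth = 3, minsplit = 2,
                                            minbucket = 1, cp = 0, xval = 0))

# (iv) CTree with significance level alpha = 0.05.
ctree_fit <- ctree(y ~ ., data = dat, control = ctree_control(alpha = 0.05))

cart_fit
ctree_fit
```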
In Sections 5.3 5.6, we assume that σ is known. We consider the case of unknown σ in Section 5.7. 5.3 Uniform p-values under a Global Null We generate 5, 000 datasets with a = b = 0, so that H0 : νT sibµ = 0 holds for all splits in all trees. Figure 4 displays the distributions of p-values across all splits in all fitted trees for the naive Z-test, sample splitting, and the selective Z-test. The selective Z-test and sample splitting achieve uniform p-values under the null, while the naive Z-test (which does not account for the fact that νsib was obtained by applying CART to the same data used for testing) does not. CTree is omitted from the comparison: it creates a split only if the p-value is less than α = 0.05, and thus its p-values over the splits do not follow a Uniform(0,1) distribution. Figure 4: Quantile-quantile plots of the p-values for testing H0 : νT sibµ = 0, as described in Section 5.3. A naive Z-test (green), sample splitting (blue), and selective Z-test (pink) were performed; see Section 5.2. The p-values are stratified by the level of the regions in the fitted tree. Neufeld, Gao, Witten Table 1: A 3 3 contingency table indicating an observation s involvement in a given true split and estimated split. The adjusted Rand index is computed using only the shaded cells Estimated Split In left region In right region In neither True Split In left region t1 t2 t3 In right region u1 u2 u3 In neither v1 v2 v3 We generate 500 datasets for each (a, b) {0.5, 1, 2} {1, . . . , 10}, and evaluate the power of selective Z-tests, sample splitting, and CTree to reject the null hypothesis H0 : νT sibµ = 0. As naive Z-tests do not control the Type 1 error rate (Figure 4), we do not evaluate their power. We consider two aspects of power: the probability that we detect a true split, and the probability that we reject the null hypothesis corresponding to a true split. Given a true split in Figure 3 and an estimated split, we construct the 3 3 contingency table in Table 1, which indicates whether an observation is on the left-hand side, right-hand side, or not involved in the true split (rows) and the estimated split (columns). To quantify the agreement between the true and estimated splits, we compute the adjusted Rand index (Hubert and Arabie, 1985) associated with the 2 3 contingency table corresponding to the shaded region in Table 1. For each true split, we identify the estimated split for which the adjusted Rand index is largest; if this index exceeds 0.75 then this true split is detected . Given that a true split is detected, the associated null hypothesis is rejected if the corresponding p-value is below 0.05. Figure 5 displays the proportion of true splits that are detected and rejected by each method. As sample splitting fits a tree using only half of the data, it detects fewer true splits, and thus rejects the null hypothesis for fewer true splits, than the selective Z-test. When a is small, the difference in means between sibling regions at level two is small. Because CTree makes a split only if there is strong evidence of association at that level, it tends to build one-level trees, and thus fails to detect many true splits; by contrast, the selective Z-test (based on CART) successfully builds more three-level trees. Thus, when a is small, the selective Z-test detects (and rejects) more true differences than CTree between regions at levels two and three. 5.5 Coverage of Confidence Intervals for νT sibµ and νT regµ We generate 500 datasets for each (a, b) {0.5, 1, 2} {0, . . . 
, 10} to evaluate the coverage of 95% confidence intervals constructed using naive Z-methods, selective Z-methods, and sample splitting. CTree is omitted from these comparisons because it does not provide confidence intervals. We say that the interval covers the truth if it contains νTµ, where ν is defined as in (6) (for a particular split) or (14) (for a particular region). Table 2 shows the proportion of each type of interval that covers the truth, aggregated across values of a Selective Inference for Regression Trees Figure 5: Proportion of true splits detected (solid lines) and rejected (dotted lines) for CART with selective Z-tests (pink), CTree (black), and CART with sample splitting (blue) across different settings of the data generating mechanism, stratified by level in tree. As CTree only makes a split if the p-value is less than 0.05, the proportion of detections equals the proportion of rejections. and b. The selective Z-intervals attain correct coverage of 95%, while the naive Z-intervals do not. It may come as a surprise that sample splitting does not attain correct coverage. Recall that ν from (6) or (14) is an n-vector that contains entries for all observations in both the training set and the test set. Thus, νTµ involves the true mean among both training and test set observations in a given region or pair of regions. By contrast, sample splitting attains correct coverage for a different parameter involving the true means of only the test observations that fall within a given region or pair of regions. 5.6 Width of Confidence Intervals Figure 6(a) illustrates that our selective Z-intervals for νT regµ can be extremely wide when b is small, particularly for regions located at deeper levels in the tree. For each tree that we build and for levels 1, 2, and 3, we compute the adjusted Rand Index (Hubert and Arabie, 1985) between the true tree (truncated at the appropriate level) and the estimated tree (truncated at the same level). Figure 6(b) shows that our selective confidence intervals can be extremely wide when this adjusted Rand Index is small, particularly at deeper levels of the tree. When b is small and the adjusted Rand Index is small, the trees built by CART tend to be unstable, in the sense that small perturbations to the data affect the fitted tree. In this Neufeld, Gao, Witten Table 2: Proportion of 95% confidence intervals containing the true parameter, aggregated over all trees fit to the 5,500 datasets generated with (a, b) {0.5, 1, 2} {1, . . . , 10} Parameter νT regµ Parameter νT sibµ Level Selective Z Naive Z Sample Splitting Selective Z Naive Z Sample Splitting 1 0.951 0.889 0.918 0.948 0.834 0.915 2 0.950 0.645 0.921 0.951 0.410 0.917 3 0.951 0.711 0.921 0.950 0.550 0.921 setting, the sample statistics νT regy fall very close to the boundary of the truncation set. See Kivaranovic and Leeb (2021) for a discussion of why wide confidence intervals can arise in these settings. The great width of our confidence intervals reflects the uncertainty about the mean response within each region due to the instability of the tree-fitting procedure. Figure 6: The median width of the selective Z-intervals for parameter νT regµ for regions at levels one (solid), two (dashed), and three (dotted) of the tree. Similar results hold for parameter νT sibµ. Panel (a) breaks results down by the parameters a and b, whereas panel (b) aggregates results across values of parameters a and b, and displays them as a function of the adjusted Rand Index between the true and estimated trees. 
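The adjusted Rand index (Hubert and Arabie, 1985) used as the detection criterion in Section 5.4, and again above to compare true and estimated trees, can be computed directly from a contingency table of counts. A minimal sketch follows; the example counts are hypothetical and correspond to the shaded 2 × 3 portion of Table 1.

```r
# Adjusted Rand index computed from a contingency table of counts.
ari_from_table <- function(tab) {
  n <- sum(tab)
  sum_ij <- sum(choose(tab, 2))
  sum_a  <- sum(choose(rowSums(tab), 2))
  sum_b  <- sum(choose(colSums(tab), 2))
  expected <- sum_a * sum_b / choose(n, 2)
  (sum_ij - expected) / (0.5 * (sum_a + sum_b) - expected)
}

# Hypothetical counts: rows = true split (left, right); columns = estimated
# split (left, right, neither). An estimated split that nearly recovers the
# true one gives an index above the 0.75 detection threshold of Section 5.4.
tab <- rbind(c(48, 1, 1),
             c(2, 45, 3))
ari_from_table(tab)   # roughly 0.81 for these counts
```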
5.7 Results with Unknown σ Thus far, we have assumed that σ is known. In this section, we compare the following three versions of the selective Z-methods that plug different values of σ into the truncated normal CDF when computing p-values and confidence intervals: 1. σ: We plug in the true value of σ, as in Sections 5.3 5.6. 2. ˆσcons: We plug in ˆσcons = (n 1) 1 n P i=1 (yi y)2, where y = n 1 Pn i=1 yi. Selective Inference for Regression Trees Figure 7: QQ plots of the p-values from testing H0 : νT sibµ = 0 when µ = 0n using the selective Z-test with three different values plugged in to the truncated normal CDF for σ. The p-values are stratified by the level of the regions in the fitted tree. 3. ˆσSSE : Let T = term Rp, treeλ(y) be the number of terminal regions in treeλ(y). We plug in ˆσSSE = q (n T ) 1 Pn i=1(yi ˆyi)2, where ˆyi is the predicted value for the ith observation given by Treeλ(y). It is straightforward to show that E[ˆσ2 cons] σ2, for any value of E[y] = µ. Thus, we expect this estimate to lead to conservative inference. On the other hand, ˆσ2 SSE can be made arbitrarily small by making the fitted tree arbitrarily deep, and so we expect inference based on this estimate to be anti-conservative if the fitted CART tree is large. Figure 7 shows the distribution of p-values from testing H0 : νT sibµ = 0 with the three versions of the selective Z-test under the data generating mechanism described in Section 5.1, with a = b = 0. In this setting, νT sibµ = 0 holds for all splits in all trees. We see almost no difference between the three versions of the selective Z-test. In this global null setting, E[ˆσ2 cons] = σ2. Furthermore, the empirical bias of ˆσ2 SSE is small because the trees we grow are not particularly large; as in the rest of Section 5, we build trees to a maximum depth of 3 and prune with λ = 200. Figure 8 displays the proportion of true splits detected and the proportion of true splits detected and rejected, as defined in Section 5.4, for the three versions of the selective Z-test when data is generated as in Section 5.4. For simplicity, we only show the setting where a = 1. All three methods detect the same proportion of true splits, because they all perform inference on the same CART trees. The proportion of splits detected and rejected is very similar for σ and ˆσSSE because ˆσSSE is a very good estimator for σ in this setting. While ˆσcons performs reasonably when b is small, it severely overestimates σ and thus has low power when b is large. Neufeld, Gao, Witten Figure 8: Proportion of true splits detected (solid lines) and rejected (dotted lines) for CART with the three versions of the selective Z-test. The results are stratified by level in tree. Table 3 displays confidence intervals for νT sibµ and νT regµ for the three versions of the selective Z-intervals, where data is generated as in Section 5.5. As expected, ˆσcons leads to slight over-coverage and ˆσSSE leads to slight under-coverage. Parameter νT regµ Parameter νT sibµ Level σ ˆσcons ˆσSSE σ ˆσcons ˆσSSE 1 0.95 0.98 0.95 0.95 0.98 0.94 2 0.95 0.97 0.94 0.95 0.97 0.94 3 0.95 0.96 0.94 0.95 0.96 0.94 Table 3: Proportion of 95% confidence intervals containing the true parameter, aggregated over all trees fit to the 5,500 datasets generated with (a, b) {0.5, 1, 2} {1, . . . , 10} In this section, we have seen that when trees are not grown overly large, plugging in ˆσSSE leads to approximate selective Type 1 error control, approximately correct selective coverage, and good power. 
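Both plug-in estimators of σ compared above can be computed directly from the response vector and a fitted rpart tree; a minimal sketch is below, assuming a fitted object such as cart_fit from the earlier simulation sketch.

```r
# Conservative estimator: sample standard deviation of y. Its square has
# expectation at least sigma^2 whenever mu is not constant.
sigma_cons <- function(y) sqrt(sum((y - mean(y))^2) / (length(y) - 1))

# SSE-based estimator: residual standard deviation of the fitted tree, with
# degrees of freedom n - T, where T is the number of terminal regions.
sigma_sse <- function(y, fit) {
  T_terminal <- sum(fit$frame$var == "<leaf>")   # terminal nodes of an rpart object
  sqrt(sum((y - predict(fit))^2) / (length(y) - T_terminal))
}

# Example (cart_fit and dat as in the earlier sketch):
# sigma_cons(dat$y)
# sigma_sse(dat$y, cart_fit)
```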
Unfortunately, providing theoretical guarantees for our procedures when using ˆσSSE would be quite difficult, as the estimator is anti-conservative and depends on the output of CART. Providing theoretical guarantees for our procedures under ˆσcons is more straightforward, using ideas from Gao et al. (2020), Chen and Witten (2022), and Tibshirani et al. (2018). However, as shown in Figure 8, selective Z-tests based on ˆσcons can have very low power. One promising avenue of future work involves providing theoretical guarantees in the regression tree setting for estimators that are less conservative than ˆσcons. Selective Inference for Regression Trees Figure 9: Left: A CART tree fit to the Box Lunch Study data. Each split has been labeled with a p-value (8), and each region has been labeled with a confidence interval (23). The shading of the nodes indicates the average response values (white indicates a very small value and dark blue a very large value). Top right: A CTree fit to the Box Lunch Study data. Bottom right: A scatterplot showing the relationship between the covariate hunger and the response. 6. An Application to the Box Lunch Study Venkatasubramaniam et al. (2017) compare CART and CTree (Hothorn et al., 2006) within the context of epidemiological studies. They conclude that CTree is preferable to CART because it provides p-values for each split, even though CART has higher predictive accuracy. Since our framework provides p-values for each split in a CART tree, we revisit their analysis of the Box Lunch Study, a clinical trial studying the impact of portion control interventions on 24-hour caloric intake. We consider identifying subgroups of study participants with baseline differences in 24-hour caloric intake on the basis of scores from an assessment that quantifies constructs such as hunger, liking, the relative reinforcement of food (rrvfood), and restraint (resteating). We exactly reproduce the trees presented in Figures 1 and 2 of Venkatasubramaniam et al. (2017) by building a CTree using partykit and a CART tree using rpart on the Box Lunch Study data provided in the R package vis Tree (Venkatasubramaniam and Wolfson, 2018). We apply our selective inference framework to compute p-values (8) for each split in CART, and confidence intervals (23) for each region. In this section, we use ˆσSSE, defined in Section 5.7, to estimate the error variance. The results are shown in Figure 9. Neufeld, Gao, Witten Both CART and CTree choose hunger<1.8 as the first split. For this split, our selective Z-test reports a large p-value of 0.44, while CTree reports a p-value less than 0.001. The conflicting p-values are explained by the difference in null hypotheses. CTree finds strong evidence against the null of no linear association between hunger and caloric intake. By contrast, our selective framework for CART does not find strong evidence for a difference between mean caloric intake of participants with hunger<1.8 and those with hunger 1.8. We see from the bottom right of Figure 9 that while there is evidence of a linear relationship between hunger and caloric intake, there is less evidence of a difference in means across the particular split hunger=1.8. Given that the goal of Venkatasubramaniam et al. 
(2017) is to identify "population subgroups that are relatively homogeneous with respect to an outcome", the p-value resulting from our selective framework is more natural than the p-value output by CTree, since the former relates directly to the subgroups formed by the split, whereas the latter does not take into account the location of the split point. In general, the left-hand panel of Figure 9 shows that the subgroups of patients identified by CART are not significantly different from one another. This is an important finding that would be missed without our selective inference framework. Furthermore, unlike CTree, our framework provides confidence intervals for the mean response in each subgroup. An alternative analysis using $\hat\sigma_{\mathrm{cons}}$, defined in Section 5.7, is provided in Appendix H, and leads to similar findings.

7. Discussion

Our framework relies on the assumption that $Y \sim N_n(\mu, \sigma^2 I_n)$, with $\sigma^2$ known. In Section 5.7, we showed strong empirical performance when the variance is unknown and $\sigma^2$ is estimated. In this section, we briefly comment on the assumptions of spherical variance and normally distributed data.

It is natural to wonder whether the assumption that $Y \sim N_n(\mu, \sigma^2 I_n)$ can be relaxed to the assumption that $Y \sim N_n(\mu, \Sigma)$, with $\Sigma$ known. Following the work of Lee et al. (2016), the results in Section 3 extend to the setting where $Y \sim N_n(\mu, \Sigma)$ if we:

1. Modify (8) and (15) to condition on the event $\left\{ \left( I_n - \frac{\Sigma \nu \nu^T}{\nu^T \Sigma \nu} \right) Y = \left( I_n - \frac{\Sigma \nu \nu^T}{\nu^T \Sigma \nu} \right) y \right\}$ rather than the event $\{P^{\perp}_{\nu} Y = P^{\perp}_{\nu} y\}$, where $\nu = \nu_{\mathrm{sib}}$ in the case of (8) and $\nu = \nu_{\mathrm{reg}}$ in the case of (15).

2. Replace all instances of the perturbation $y'(\phi, \nu)$, defined in Theorem 6, with the perturbation $y'(\phi, \nu) = \left( I_n - \frac{\Sigma \nu \nu^T}{\nu^T \Sigma \nu} \right) y + \frac{\Sigma \nu}{\nu^T \Sigma \nu} \phi$.

Unfortunately, the modified perturbation $y'(\phi, \nu)$ does not satisfy Condition 1 in Section 4.2 when $\Sigma \neq \sigma^2 I_n$, and so many of the results of Section 4 do not extend to this non-spherical setting. Future work could explore how to efficiently compute the conditioning set in this non-spherical setting.

Furthermore, our framework assumes a normally distributed response variable. CART is commonly used for classification, survival (Segal, 1988), and treatment effect estimation in causal inference (Athey and Imbens, 2016). While the idea of conditioning on a selection event to control the selective Type 1 error rate applies regardless of the distribution of the response, our Theorem 6 and Theorem 8, and the resulting computational results, relied on normality of $Y$. In the absence of this assumption, exactly characterizing the conditioning set and the distribution of the test statistic requires further investigation. We show in Appendix G that our selective Z-tests approximately control the selective Type 1 error rate when the normality assumption is violated. Tian and Taylor (2017) and Tibshirani et al. (2018) establish conditions under which selective p-values for linear regression (derived under the assumption of normality) will be asymptotically uniformly distributed under non-normality. This suggests the possibility of developing asymptotic theory for our proposed selective Z-tests under violations of normality.

A reviewer pointed out similarities between the problem of testing significance of the first split in the tree and significance testing for a single changepoint, as in Bhattacharya (1994). Building on this connection may provide an avenue for future work.

A software implementation of the methods in this paper is available in the R package treevalues, at https://github.com/anna-neufeld/treevalues.
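For completeness, the modified perturbation in item 2 above is straightforward to write down; a minimal sketch (ours, not part of the treevalues package) is below.

```r
# Perturbation y'(phi, nu) for a general known covariance Sigma:
#   y'(phi, nu) = (I - Sigma nu nu^T / (nu^T Sigma nu)) y + (Sigma nu / (nu^T Sigma nu)) phi.
# With Sigma = sigma^2 * I this reduces to the perturbation used in Theorem 6,
# namely P_nu_perp y + phi * nu / ||nu||_2^2.
perturb_general <- function(y, nu, Sigma, phi) {
  Snu <- Sigma %*% nu
  denom <- as.numeric(t(nu) %*% Snu)
  as.numeric(y - Snu %*% (t(nu) %*% y) / denom + Snu * phi / denom)
}
```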
Acknowledgments Daniela Witten and Anna Neufeld were supported by the National Institutes of Health (NIH R01 GM12399) and the Simons Foundation (Simons Investigator Award in Mathematical Modeling of Living Systems). Lucy Gao was supported by the Natural Sciences and Engineering Research Council of Canada (Discovery Grants). Appendix A. Comparison to Loh et al. (2019) Loh et al. (2016) and Loh et al. (2019) use regression trees to find subgroups of patients with similar treatment effects in clinical trials. They grow trees based on patient characteristics using a different algorithm than CART. Furthermore, they are interested in the mean treatment effect (which is a linear regression coefficient) within each terminal region of the tree, rather than the mean response within each region. One of the goals of Loh et al. (2019) is to construct valid post-selection confidence intervals for the treatment effect within each terminal node. In this appendix, we show that their approach, when adapted to the setting of this paper, does not yield confidence intervals with nominal coverage. The basic idea of Loh et al. (2019), instantiated to our setting, is as follows. Suppose that RA Treeλ(y), and define the vector νreg such that (νreg)i = 1(xi RA)/ Pn i =1 1(xi RA) , as in (14). We know that the naive Z-interval does not achieve nominal coverage, meaning that ν regy zα/2 σ q Pn i=1 1(xi RA) , ν regy + zα/2 σ q Pn i=1 1(xi RA) < 1 α. (24) This is because the multiplier for the naive confidence interval, zα/2, is derived under the assumption that the region RA (or, equivalently, the vector νreg), is fixed, rather than a function of the data. Neufeld, Gao, Witten Loh et al. (2019) observe that there exists some α < α such that Pr ν regµ ν regy zα /2 σ Pn i=1 1(xi RA) , ν regy + zα /2 σ Pn i=1 1(xi RA) = 1 α. (25) The value for α for a given tree will depend on the number of split covariates p, the number of data points n, the depth of the tree, and the value of λ used for tree pruning, among other considerations. If we know how the data was generated, then we can check whether some value α satisfies (25) as follows: 1. Draw B different simulated datasets {(Xb, yb)}B b=1 from the same distribution (call this F) as the original data. For b = 1, . . . , B: (a) Build a tree using the simulated data (Xb, yb), using the same procedure and the same settings as in Step 1, and denote it treeλ(yb). (b) For each terminal region R term treeλ(yb), Rp in the tree: i. Construct a (1 α ) naive Z-interval for the mean response in the region using (Xb, yb). ii. Check if each interval contains µR, the true mean for this region R. 2. Compute the fraction of intervals in 1(b) that contain µR. 3. If this value is 1 α, then we have found the correct value of α . If not, then we try a larger or smaller value of α . We test this procedure in a very simple simulation study. We generate data Xij N(0, 1) and yi N(0, 1) for i = 1, . . . , 100 and j = 1, . . . , p. In this simple setting, the true mean response for every region in every fitted tree is 0. We carry out the procedure outlined above with α = 0.1. For two values of p and for CART trees with 1, 2, and 3 levels, we create 1000 datasets and 1000 trees and report the empirical coverage of the intervals obtained using this ideal method, averaged over all nodes in all trees. The results, shown in Table 4, show that this ideal procedure procedure leads to intervals that achieve nominal coverage. 
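The calibration check in steps 1-3 can be sketched compactly for depth-one trees under a known data-generating distribution F. The helper below returns the empirical coverage of (1 − α′) naive Z-intervals over B simulated datasets, so that α′ can be adjusted until the coverage reaches 1 − α; the function name and settings are illustrative, not the implementation of Loh et al. (2019).

```r
library(rpart)

# Empirical coverage of naive (1 - alpha_prime) Z-intervals for the terminal
# regions of depth-one CART trees, when data are drawn from the known null
# distribution F: y_i ~ N(0, 1), X_ij ~ N(0, 1), so every region has mean 0.
coverage_at <- function(alpha_prime, B = 200, n = 100, p = 2) {
  z <- qnorm(1 - alpha_prime / 2)
  covered <- c()
  for (b in seq_len(B)) {
    X <- matrix(rnorm(n * p), n, p)
    y <- rnorm(n)
    dat <- data.frame(y = y, X)
    fit <- rpart(y ~ ., data = dat,
                 control = rpart.control(maxdepth = 1, minsplit = 2,
                                         minbucket = 1, cp = 0, xval = 0))
    leaves <- fit$where                    # terminal node of each observation
    for (node in unique(leaves)) {
      yR <- y[leaves == node]
      half <- z / sqrt(length(yR))         # sigma = 1 is known here
      covered <- c(covered, abs(mean(yR) - 0) <= half)   # true region mean is 0
    }
  }
  mean(covered)
}

# Adjust alpha_prime until coverage_at(alpha_prime) is about 1 - alpha,
# e.g. compare coverage_at(0.01) and coverage_at(0.005) for a target of 0.90.
```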
Unfortunately, this ideal procedure is practically infeasible, as it requires the user to know the true distribution of the data F. Thus, in practice, Loh et al. (2019) propose replacing F by ˆF, the empirical distribution of the original data. This amounts to replacing the simulated datasets in Step 2 with bootstrapped datasets, and checking whether the naive Z-intervals in Step 1(b) contain y R rather than µR. We can see why this is problematic in a very simple setting where we fit a tree with depth 1. If all observations have mean 0, then a CART tree fit to a bootstrap sample of the data will nevertheless find regions RL and RR such that the sample mean value of yb within RL is negative and the sample mean value of yb within RR is positive. As yb and y contain many overlapping observations, it is likely that the sample mean value of y within RL is also negative and the sample mean value of y within RR is also positive. In other words, because of the overlap between y and yb, the within-region sample means of yb are closer to the within-region sample means of y than they are to the within-region population Selective Inference for Regression Trees means. Thus, when we calibrate α to cover the mean values of y within various regions, we end up with under-coverage of the true population mean, as shown in Table 4. We see in Table 4 that our selective inference framework approach enables valid inference in this setting, whereas a bootstrap procedure modeled after Loh et al. (2019) does not. Loh (ideal) Loh (bootstrap) Selective CIs p Tree depth Coverage Average α Coverage Average α Coverage 1 0.902 0.008 0.749 0.037 0.890 2 2 0.905 0.004 0.695 0.038 0.904 3 0.900 0.005 0.660 0.047 0.895 1 0.883 0.001 0.601 0.016 0.908 20 2 0.900 0.00025 0.549 0.016 0.901 3 0.904 0.00015 0.543 0.022 0.905 Table 4: Coverage of 90% confidence intervals computed using three methods for the simple setting where yi N(0, 1) and Xij N(0, 1) for for i = 1, . . . , 100 and j = 1, . . . , p. Note that the Loh (ideal) method can never be used in practice, as it requires knowledge of the true parameter. Appendix B. Proofs for Section 3 B.1 Proof of Theorem 6 Let 0 α 1. We start by proving the first statement in Theorem 6: pr H0 n psib(Y ) α | RA, RB are siblings in treeλ(Y ) o = α. This is a special case of Proposition 3 from Fithian et al. (2014). It follows from the definition of psib(Y ) in (8) that pr H0 n psib(Y ) α | RA, RB are siblings in treeλ(Y ), P νsib Y = P νsiby o = α. Therefore, applying the law of total expectation yields pr H0 n psib(Y ) α | RA, RB siblings in treeλ(Y ) o = EH0 h 1{psib(Y )<α} | RA, RB are siblings in treeλ(Y ) i EH0 h 1{psib(Y )<α} | RA, RB are siblings in treeλ(Y ), P νsib Y = P νsiby i | RA, RB are siblings in treeλ(Y ) = EH0 n α | RA, RB are siblings in treeλ(Y ) o = α. The second statement of Theorem 6 follows directly from the following result. Neufeld, Gao, Witten Lemma 20 If Y Nn µ, σ2In , then the random variable νT sib Y has the following conditional distribution: ν T sib Y | {RA, RB are siblings in treeλ(Y ), P νsib Y = P νsiby} T N n ν T sibµ, σ2 νsib 2 2; Sλ sib(νsib) o , (26) where Sλ sib(νsib) is defined in (10) and T N (µ, σ, S) denotes the N(µ, σ2) distribution truncated to the set S. Proof The following holds for any ν Rn. 
pr n νTY > c | RA, RB are siblings in treeλ (Y ) , P ν Y = P ν y o = pr νTY > c | RA, RB are siblings in treeλ P ν Y + ννT ν 2 2 Y , P ν Y = P ν y = pr νTY > c | RA, RB are siblings in treeλ P ν y + ν ν 2 2 νTY , P ν Y = P ν y = pr νTY > c | RA, RB are siblings in treeλ P ν y + ν ν 2 2 νTY = pr n φ > c | φ Sλ sib (ν) o , where φ = νTY. In the fourth line, the condition P ν Y = P ν y can be dropped because when Y Nn µ, σ2In , P ν Y is independent of ν Y . Finally, since φ N νTµ, σ2 ν 2 2 , (26) holds. B.2 Proof of Proposition 7 Theorem 6.1 from Lee et al. (2016) says that the truncated normal distribution has monotone likelihood ratio in the mean parameter. This guarantees that L(y) and U(y) in (13) are unique. Then, for L( ) and U( ) in (13), (26) in Lemma 20 guarantees that pr ν Tµ [L(Y ), U(Y )] | RA, RB are siblings in treeλ(Y ), P ν Y = P ν y = 1 α. (27) Finally, we need to prove that (27) implies (1 α) selective coverage as defined in (12). Following Proposition 3 from Fithian et al. (2014), let η be the random variable P ν Y and let f( ) be its density. Then, pr n νTµ [L(Y ), U(Y )] | RA, RB are siblings in treeλ(Y ) o = Z pr n νTµ [L(Y ), U(Y )] | RA, RB are siblings in treeλ(Y ), P ν Y = P ν y o f(η)dη = Z (1 α)f(η)dη = 1 α. B.3 Proof of Theorem 8 We omit the proof of the first statement of Theorem 8, as it is similar to the proof of the first statement of Theorem 6 in Appendix B.1. The second statement in Theorem 8 follows directly from the following result. Selective Inference for Regression Trees Lemma 21 The random variable νT reg Y has the conditional distribution νT reg Y | {RA treeλ(Y ), P νreg Y = P νregy} T N n νT regµ, σ2 νreg 2 2; Sλ reg(νreg) o , (28) where Sλ reg(νreg) was defined in (16). We omit the proof of Lemma 21, as it is similar to the proof of Lemma 20. B.4 Proof of Proposition 9 The proof largely follows the proof of Proposition 7. The fact that the truncated normal distribution has monotone likelihood ratio (Theorem 6.1 of Lee et al. 2016) ensures that L(y) and U(y) defined in (17) are unique, and (28) in Lemma 21 implies that pr n νT regµ [L(Y ), U(Y )] | RA treeλ(Y ), P νreg Y = P νregy o = 1 α. The rest of the argument is as in the proof of Proposition 7. Appendix C. Proofs for Section 4.1 C.1 Proof of Lemma 11 We first state and prove the following lemma. Lemma 22 Let RA and RB be the regions in the definition of νsib in (6). For an arbitrary region R and for any j {1, . . . , p} and s {1, . . . , n 1}, recall that the potential children of R (the ones that CART will consider adding to the tree when applying Algorithm A1 to region R) are given by R χj,s,0 and R χj,s,1, where χj,s,0 and χj,s,l were defined in (1). If (RA RB) R χj,s,0 or (RA RB) R χj,s,1, then Gain R{y (φ, νsib), j, s} = Gain R{y, j, s} for all φ. Proof It follows from algebra that for Gain R(y, j, s) defined in (2), Gain R(y, j, s) = i=1 1(xi R) i=1 1(xi R χj,s,0) i=1 1(xi R χj,s,1) where y R = P i R yi / Pn i=1 1(xi R) . It follows from (29) that to prove Lemma 22, it suffices to show that y T = y (φ, νsib)T for T {R, R χj,s,0, R χj,s,1}. Recall from Section 3.3 that {y (φ, νsib)}i = yi + i, where (φ νT siby) Pn i =1 1(x i RB) Pn i =1 1(x i RA RB) if i RA (φ νT siby) Pn i =1 1(x i RA) Pn i =1 1(x i RA RB) if i RB 0 otherwise. Neufeld, Gao, Witten Without loss of generality, assume that (RA RB) R χj,s,0. For any T {R, R χj,s,0}, RA RB T. 
Appendix C. Proofs for Section 4.1

C.1 Proof of Lemma 11

We first state and prove the following lemma.

Lemma 22 Let $R_A$ and $R_B$ be the regions in the definition of $\nu_{\text{sib}}$ in (6). For an arbitrary region $R$ and for any $j \in \{1, \ldots, p\}$ and $s \in \{1, \ldots, n-1\}$, recall that the potential children of $R$ (the ones that CART will consider adding to the tree when applying Algorithm A1 to region $R$) are given by $R \cap \chi_{j,s,0}$ and $R \cap \chi_{j,s,1}$, where $\chi_{j,s,0}$ and $\chi_{j,s,1}$ were defined in (1). If $(R_A \cup R_B) \subseteq R \cap \chi_{j,s,0}$ or $(R_A \cup R_B) \subseteq R \cap \chi_{j,s,1}$, then $\text{gain}_R\{y'(\phi, \nu_{\text{sib}}), j, s\} = \text{gain}_R(y, j, s)$ for all $\phi$.

Proof It follows from algebra that, for $\text{gain}_R(y, j, s)$ defined in (2),
$$\text{gain}_R(y, j, s) = \sum_{i=1}^{n} 1(x_i \in R \cap \chi_{j,s,1})\, \bar{y}_{R \cap \chi_{j,s,1}}^2 + \sum_{i=1}^{n} 1(x_i \in R \cap \chi_{j,s,0})\, \bar{y}_{R \cap \chi_{j,s,0}}^2 - \sum_{i=1}^{n} 1(x_i \in R)\, \bar{y}_R^2, \quad (29)$$
where $\bar{y}_R = \left( \sum_{i \in R} y_i \right) / \left( \sum_{i=1}^{n} 1(x_i \in R) \right)$. It follows from (29) that to prove Lemma 22, it suffices to show that $\bar{y}_T = \bar{y}'(\phi, \nu_{\text{sib}})_T$ for $T \in \{R,\ R \cap \chi_{j,s,0},\ R \cap \chi_{j,s,1}\}$. Recall from Section 3.3 that $\{y'(\phi, \nu_{\text{sib}})\}_i = y_i + \Delta_i$, where
$$\Delta_i = \begin{cases} (\phi - \nu_{\text{sib}}^T y)\, \dfrac{\sum_{i'=1}^{n} 1(x_{i'} \in R_B)}{\sum_{i'=1}^{n} 1(x_{i'} \in R_A \cup R_B)} & \text{if } i \in R_A, \\[6pt] -(\phi - \nu_{\text{sib}}^T y)\, \dfrac{\sum_{i'=1}^{n} 1(x_{i'} \in R_A)}{\sum_{i'=1}^{n} 1(x_{i'} \in R_A \cup R_B)} & \text{if } i \in R_B, \\[6pt] 0 & \text{otherwise.} \end{cases}$$
Without loss of generality, assume that $(R_A \cup R_B) \subseteq R \cap \chi_{j,s,0}$. For any $T \in \{R,\ R \cap \chi_{j,s,0}\}$, we have $R_A \cup R_B \subseteq T$. Thus,
$$\bar{y}'(\phi, \nu_{\text{sib}})_T = \frac{1}{\sum_{i=1}^{n} 1(x_i \in T)} \left[ \sum_{i \in T \setminus (R_A \cup R_B)} y_i + \sum_{i \in R_A} (y_i + \Delta_i) + \sum_{i \in R_B} (y_i + \Delta_i) \right] = \bar{y}_T + \frac{\sum_{i \in R_A} \Delta_i + \sum_{i \in R_B} \Delta_i}{\sum_{i=1}^{n} 1(x_i \in T)} = \bar{y}_T + 0 = \bar{y}_T,$$
since $\sum_{i \in R_A} \Delta_i$ and $\sum_{i \in R_B} \Delta_i$ are equal in magnitude and opposite in sign. Furthermore,
$$\bar{y}'(\phi, \nu_{\text{sib}})_{R \cap \chi_{j,s,1}} = \frac{\sum_{i \in R \cap \chi_{j,s,1}} (y_i + \Delta_i)}{\sum_{i=1}^{n} 1(x_i \in R \cap \chi_{j,s,1})} = \frac{\sum_{i \in R \cap \chi_{j,s,1}} (y_i + 0)}{\sum_{i=1}^{n} 1(x_i \in R \cap \chi_{j,s,1})} = \bar{y}_{R \cap \chi_{j,s,1}}.$$

We will now prove Lemma 11. It follows from Definition 10 that if $R\left( \text{branch}\{R_A, \text{tree}_\lambda(y)\} \right) \subseteq \text{tree}_\lambda\{y'(\phi, \nu_{\text{sib}})\}$, then $R_A$ and $R_B$ are siblings in $\text{tree}_\lambda\{y'(\phi, \nu_{\text{sib}})\}$. This establishes the ($\Leftarrow$) direction.

We will prove the ($\Rightarrow$) direction by contradiction. Suppose that $R_A$ and $R_B$ are siblings in $\text{tree}_\lambda\{y'(\phi, \nu_{\text{sib}})\}$. Define $\text{branch}\{R_A, \text{tree}_\lambda(y)\} = ((j_1, s_1, e_1), \ldots, (j_L, s_L, e_L))$, and define $R^{(l')} = \bigcap_{l=1}^{l'} \chi_{j_l, s_l, e_l}$ for $l' = 1, \ldots, L$. Assume that there exists $l \in \{0, \ldots, L-2\}$ such that $R^{(l)} \in \text{tree}_\lambda\{y'(\phi, \nu_{\text{sib}})\}$ and $R^{(l+1)} \notin \text{tree}_\lambda\{y'(\phi, \nu_{\text{sib}})\}$. We assume that any ties between splits that occur at Step 2 of Algorithm A1 are broken in the same way for $y$ and $y'(\phi, \nu_{\text{sib}})$, and so this implies that there exists $(\tilde{j}, \tilde{s}) \neq (j_{l+1}, s_{l+1})$ such that $(\tilde{j}, \tilde{s}) \in \arg\max_{j,s} \text{gain}_{R^{(l)}}\{y'(\phi, \nu_{\text{sib}}), j, s\}$ and
$$\text{gain}_{R^{(l)}}\{y'(\phi, \nu_{\text{sib}}), j_{l+1}, s_{l+1}\} < \text{gain}_{R^{(l)}}\{y'(\phi, \nu_{\text{sib}}), \tilde{j}, \tilde{s}\}. \quad (30)$$
Since $R_A$ and $R_B$ are siblings in $\text{tree}_\lambda\{y'(\phi, \nu_{\text{sib}})\}$, it follows from Lemma 22 that $\text{gain}_{R^{(l)}}\{y'(\phi, \nu_{\text{sib}}), \tilde{j}, \tilde{s}\} = \text{gain}_{R^{(l)}}(y, \tilde{j}, \tilde{s})$. Also, since $R_A$ and $R_B$ are siblings in $\text{tree}_\lambda(y)$, it follows from Lemma 22 that $\text{gain}_{R^{(l)}}\{y'(\phi, \nu_{\text{sib}}), j_{l+1}, s_{l+1}\} = \text{gain}_{R^{(l)}}(y, j_{l+1}, s_{l+1})$. Applying these facts to (30) yields
$$\text{gain}_{R^{(l)}}(y, j_{l+1}, s_{l+1}) < \text{gain}_{R^{(l)}}(y, \tilde{j}, \tilde{s}). \quad (31)$$
But since $R^{(l)}$ and $R^{(l+1)}$ both appeared in $\text{tree}_\lambda(y)$, $(j_{l+1}, s_{l+1}) \in \arg\max_{j,s} \text{gain}_{R^{(l)}}(y, j, s)$. This contradicts (31). Therefore, for any $l \in \{0, \ldots, L-2\}$, if $R^{(l)} \in \text{tree}_\lambda\{y'(\phi, \nu_{\text{sib}})\}$, then $R^{(l+1)} \in \text{tree}_\lambda\{y'(\phi, \nu_{\text{sib}})\}$. Since $R^{(0)} \in \text{tree}_\lambda\{y'(\phi, \nu_{\text{sib}})\}$, the proof follows by induction.

C.2 Proof of Lemma 13

Let $R_A \in \text{tree}_\lambda(y)$ with $\text{branch}\{R_A, \text{tree}_\lambda(y)\} = ((j_1, s_1, e_1), \ldots, (j_L, s_L, e_L))$, so that $R_A = \bigcap_{l=1}^{L} \chi_{j_l, s_l, e_l}$. Since Algorithm A1 creates regions by intersecting halfspaces, and set intersections are invariant to the order of intersection, it follows that $R_A = \bigcap_{l=1}^{L} \chi_{j_l, s_l, e_l} \in \text{tree}_\lambda\{y'(\phi, \nu_{\text{reg}})\}$ if and only if there exists $\pi \in \Pi$ such that
$$\left\{ \bigcap_{l=1}^{l'} \chi_{j_{\pi(l)}, s_{\pi(l)}, e_{\pi(l)}} \right\}_{l'=1}^{L} \subseteq \text{tree}_\lambda\{y'(\phi, \nu_{\text{reg}})\}.$$
By Definitions 10 and 12, $R\left( \pi\left[ \text{branch}\{R_A, \text{tree}_\lambda(y)\} \right] \right) = \left\{ \bigcap_{l=1}^{l'} \chi_{j_{\pi(l)}, s_{\pi(l)}, e_{\pi(l)}} \right\}_{l'=1}^{L}$. Thus,
$$S_{\text{reg}}^{\lambda} = \left\{ \phi : R_A \in \text{tree}_\lambda\{y'(\phi, \nu_{\text{reg}})\} \right\} = \bigcup_{\pi \in \Pi} \left\{ \phi : R\left( \pi\left[ \text{branch}\{R_A, \text{tree}_\lambda(y)\} \right] \right) \subseteq \text{tree}_\lambda\{y'(\phi, \nu_{\text{reg}})\} \right\} = \bigcup_{\pi \in \Pi} S^{\lambda}\left( \pi\left[ \text{branch}\{R_A, \text{tree}_\lambda(y)\} \right],\ \nu_{\text{reg}} \right),$$
where the third equality follows from the definition of $S^{\lambda}(B, \nu)$ in (18).

Appendix D. Proofs for Section 4.2

D.1 Proof of Proposition 14

Recall that $B = ((j_1, s_1, e_1), \ldots, (j_L, s_L, e_L))$ and $R(B) = \{R^{(0)}, \ldots, R^{(L)}\}$. Recall from (20) that $S_{\text{grow}}(B, \nu) = \{\phi : R(B) \subseteq \text{tree}_0\{y'(\phi, \nu)\}\}$, and that we define
$$S_{l,j,s} = \left\{ \phi : \text{gain}_{R^{(l-1)}}\{y'(\phi, \nu), j, s\} \leq \text{gain}_{R^{(l-1)}}\{y'(\phi, \nu), j_l, s_l\} \right\}.$$
For $l = 1, \ldots, L$,
$$R^{(l-1)} \in \text{tree}_0\{y'(\phi, \nu)\} \ \text{ and } \ \phi \in \bigcap_{s=1}^{n-1} \bigcap_{j=1}^{p} S_{l,j,s} \iff \{R^{(l-1)}, R^{(l)}\} \subseteq \text{tree}_0\{y'(\phi, \nu)\}, \quad (32)$$
because, given that $R^{(l-1)} \in \text{tree}_0\{y'(\phi, \nu)\}$, we have $R^{(l)} \in \text{tree}_0\{y'(\phi, \nu)\}$ if and only if $(j_l, s_l) \in \arg\max_{(j,s)\,:\, s \in \{1,\ldots,n-1\},\ j \in \{1,\ldots,p\}} \text{gain}_{R^{(l-1)}}\{y'(\phi, \nu), j, s\}$. Combining (32) with the fact that $\{\phi : R^{(0)} \in \text{tree}_0\{y'(\phi, \nu)\}\} = \mathbb{R}$ yields
$$\bigcap_{l=1}^{L} \bigcap_{j=1}^{p} \bigcap_{s=1}^{n-1} S_{l,j,s} = \bigcap_{l=1}^{L} \left\{ \phi : \{R^{(l-1)}, R^{(l)}\} \subseteq \text{tree}_0\{y'(\phi, \nu)\} \right\} = \left\{ \phi : R(B) \subseteq \text{tree}_0\{y'(\phi, \nu)\} \right\}.$$
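The perturbation $y'(\phi, \nu_{\text{sib}})$ and the gain invariance of Lemma 22 are easy to check numerically. The sketch below is ours (function names and the specific regions are illustrative, assuming the contrast $\nu_{\text{sib}}$ places weight $1/|R_A|$ on $R_A$ and $-1/|R_B|$ on $R_B$); it verifies that a split whose child contains all of $R_A \cup R_B$ has unchanged gain for any $\phi$.

```python
import numpy as np

def gain(y, in_R, in_child1):
    """gain_R(y, j, s) as in (2): SSE(parent) - SSE(child1) - SSE(child0),
    with membership encoded by boolean masks."""
    def sse(mask):
        return ((y[mask] - y[mask].mean()) ** 2).sum() if mask.any() else 0.0
    return sse(in_R) - sse(in_R & in_child1) - sse(in_R & ~in_child1)

def perturb_sib(y, phi, in_A, in_B):
    """y'(phi, nu_sib): shift RA and RB in opposite directions so that
    nu_sib^T y'(phi, nu_sib) = phi, while all coordinates outside
    RA u RB and the overall mean of RA u RB are unchanged (Section 3.3)."""
    nA, nB = in_A.sum(), in_B.sum()
    nu = in_A / nA - in_B / nB              # contrast vector nu_sib as in (6)
    delta = phi - nu @ y
    out = y.copy()
    out[in_A] += delta * nB / (nA + nB)
    out[in_B] -= delta * nA / (nA + nB)
    return out

rng = np.random.default_rng(0)
n = 40
y = rng.normal(size=n)
in_R = np.ones(n, dtype=bool)                          # parent region
in_A = np.zeros(n, dtype=bool); in_A[:10] = True       # RA
in_B = np.zeros(n, dtype=bool); in_B[10:20] = True     # RB
child1 = np.zeros(n, dtype=bool); child1[:20] = True   # contains RA u RB

yp = perturb_sib(y, phi=2.5, in_A=in_A, in_B=in_B)
# Lemma 22: the gain of this split is unchanged by the perturbation.
print(np.isclose(gain(y, in_R, child1), gain(yp, in_R, child1)))  # True
```

The check succeeds because, by (29), the gain depends on $y$ only through the three within-region means, all of which the perturbation leaves unchanged.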
D.2 Proof of Proposition 15

Given a region $R$, let $1(R)$ denote the vector in $\mathbb{R}^n$ whose $i$th element is $1(x_i \in R)$, and let $P_{1(R)} = 1(R)\left( 1(R)^T 1(R) \right)^{-1} 1(R)^T$ denote the orthogonal projection matrix onto the vector $1(R)$.

Lemma 23 For any region $R$, $\text{gain}_R(y, j, s) = y^T M_{R,j,s}\, y$, where
$$M_{R,j,s} = P_{1(R \cap \chi_{j,s,1})} + P_{1(R \cap \chi_{j,s,0})} - P_{1(R)}. \quad (33)$$
Furthermore, the matrix $M_{R,j,s}$ is positive semidefinite.

Proof For any region $R$, $\sum_{i \in R} (y_i - \bar{y}_R)^2 = \sum_{i \in R} y_i^2 - y^T P_{1(R)}\, y$. Thus, from (2),
$$\begin{aligned}
\text{gain}_R(y, j, s) &= \sum_{i \in R} (y_i - \bar{y}_R)^2 - \sum_{i \in R \cap \chi_{j,s,1}} (y_i - \bar{y}_{R \cap \chi_{j,s,1}})^2 - \sum_{i \in R \cap \chi_{j,s,0}} (y_i - \bar{y}_{R \cap \chi_{j,s,0}})^2 \\
&= \sum_{i \in R} y_i^2 - y^T P_{1(R)}\, y - \sum_{i \in R \cap \chi_{j,s,1}} y_i^2 + y^T P_{1(R \cap \chi_{j,s,1})}\, y - \sum_{i \in R \cap \chi_{j,s,0}} y_i^2 + y^T P_{1(R \cap \chi_{j,s,0})}\, y \\
&= y^T \left( P_{1(R \cap \chi_{j,s,1})} + P_{1(R \cap \chi_{j,s,0})} - P_{1(R)} \right) y = y^T M_{R,j,s}\, y.
\end{aligned}$$
To see that $M_{R,j,s}$ is positive semidefinite, observe that, for any vector $v$,
$$v^T M_{R,j,s}\, v = \text{gain}_R(v, j, s) = \sum_{i \in R} (v_i - \bar{v}_R)^2 - \min_{a_1, a_2} \left[ \sum_{i \in R \cap \chi_{j,s,1}} (v_i - a_1)^2 + \sum_{i \in R \cap \chi_{j,s,0}} (v_i - a_2)^2 \right] \geq \sum_{i \in R} (v_i - \bar{v}_R)^2 - \left[ \sum_{i \in R \cap \chi_{j,s,1}} (v_i - \bar{v}_R)^2 + \sum_{i \in R \cap \chi_{j,s,0}} (v_i - \bar{v}_R)^2 \right] = 0.$$

It follows from Lemma 23 that we can express each set $S_{l,j,s}$ from Proposition 14 as
$$S_{l,j,s} = \left\{ \phi : y'(\phi, \nu)^T M_{R^{(l-1)},j,s}\, y'(\phi, \nu) \leq y'(\phi, \nu)^T M_{R^{(l-1)},j_l,s_l}\, y'(\phi, \nu) \right\}. \quad (34)$$
We now use (34) to prove the first statement of Proposition 15.

Lemma 24 Each set $S_{l,j,s}$ is defined by a quadratic inequality in $\phi$.

Proof The definition of $y'(\phi, \nu)$ in (10), namely $y'(\phi, \nu) = P_{\nu}^{\perp} y + \frac{\nu}{\|\nu\|_2^2}\, \phi$, implies that
$$y'(\phi, \nu)^T M_{R,j,s}\, y'(\phi, \nu) = \frac{\nu^T M_{R,j,s}\, \nu}{\|\nu\|_2^4}\, \phi^2 + \frac{2\, \nu^T M_{R,j,s}\, P_{\nu}^{\perp} y}{\|\nu\|_2^2}\, \phi + y^T P_{\nu}^{\perp} M_{R,j,s}\, P_{\nu}^{\perp} y \equiv a(R, j, s)\, \phi^2 + b(R, j, s)\, \phi + c(R, j, s). \quad (35)$$
Therefore, by (34),
$$S_{l,j,s} = \left\{ \phi : \left[ a\{R^{(l-1)}, j, s\} - a\{R^{(l-1)}, j_l, s_l\} \right] \phi^2 + \left[ b\{R^{(l-1)}, j, s\} - b\{R^{(l-1)}, j_l, s_l\} \right] \phi + \left[ c\{R^{(l-1)}, j, s\} - c\{R^{(l-1)}, j_l, s_l\} \right] \leq 0 \right\}. \quad (36)$$

Proposition 14 indicates that to compute $S_{\text{grow}}(B, \nu)$ from (20), we need to compute the coefficients of the quadratic for each $S_{l,j,s}$, where $l = 1, \ldots, L$, $j = 1, \ldots, p$, and $s = 1, \ldots, n-1$.

Lemma 25 We can compute the coefficients $a\{R^{(l-1)}, j, s\}$, $b\{R^{(l-1)}, j, s\}$, and $c\{R^{(l-1)}, j, s\}$, defined in Lemma 24, for $l = 1, \ldots, L$, $j = 1, \ldots, p$, and $s = 1, \ldots, n-1$, in $O\{np\log(n) + npL\}$ operations.

Proof Using the definitions in Lemmas 23 and 24 and algebra, we have that
$$\|\nu\|_2^4\, a\{R^{(l-1)}, j, s\} = \frac{\left[ \nu^T 1\{R^{(l-1)} \cap \chi_{j,s,1}\} \right]^2}{\sum_{i=1}^{n} 1\{i \in R^{(l-1)} \cap \chi_{j,s,1}\}} + \frac{\left[ \nu^T 1\{R^{(l-1)} \cap \chi_{j,s,0}\} \right]^2}{\sum_{i=1}^{n} 1\{i \in R^{(l-1)} \cap \chi_{j,s,0}\}} - \frac{\left[ \nu^T 1\{R^{(l-1)}\} \right]^2}{\sum_{i=1}^{n} 1\{i \in R^{(l-1)}\}}, \quad (37)$$
$$\frac{\|\nu\|_2^2}{2}\, b\{R^{(l-1)}, j, s\} = \frac{\left[ \nu^T 1\{R^{(l-1)} \cap \chi_{j,s,1}\} \right]\left[ (P_{\nu}^{\perp} y)^T 1\{R^{(l-1)} \cap \chi_{j,s,1}\} \right]}{\sum_{i=1}^{n} 1\{i \in R^{(l-1)} \cap \chi_{j,s,1}\}} + \frac{\left[ \nu^T 1\{R^{(l-1)} \cap \chi_{j,s,0}\} \right]\left[ (P_{\nu}^{\perp} y)^T 1\{R^{(l-1)} \cap \chi_{j,s,0}\} \right]}{\sum_{i=1}^{n} 1\{i \in R^{(l-1)} \cap \chi_{j,s,0}\}} - \frac{\left[ \nu^T 1\{R^{(l-1)}\} \right]\left[ (P_{\nu}^{\perp} y)^T 1\{R^{(l-1)}\} \right]}{\sum_{i=1}^{n} 1\{i \in R^{(l-1)}\}}, \quad (38)$$
$$c\{R^{(l-1)}, j, s\} = \frac{\left[ (P_{\nu}^{\perp} y)^T 1\{R^{(l-1)} \cap \chi_{j,s,1}\} \right]^2}{\sum_{i=1}^{n} 1\{i \in R^{(l-1)} \cap \chi_{j,s,1}\}} + \frac{\left[ (P_{\nu}^{\perp} y)^T 1\{R^{(l-1)} \cap \chi_{j,s,0}\} \right]^2}{\sum_{i=1}^{n} 1\{i \in R^{(l-1)} \cap \chi_{j,s,0}\}} - \frac{\left[ (P_{\nu}^{\perp} y)^T 1\{R^{(l-1)}\} \right]^2}{\sum_{i=1}^{n} 1\{i \in R^{(l-1)}\}}. \quad (39)$$
We compute the scalar $\|\nu\|_2^2$ and the vector $P_{\nu}^{\perp} y$ in $O(n)$ operations once at the start of the algorithm. We also sort each feature in $O\{n\log(n)\}$ operations per feature.
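Lemma 23 is straightforward to verify numerically. The following sketch is ours (the regions are arbitrary illustrative masks); it builds $M_{R,j,s}$ from (33) and confirms both the quadratic-form identity and positive semidefiniteness.

```python
import numpy as np

def proj_indicator(mask):
    """Orthogonal projection matrix onto the indicator vector 1(R)."""
    one = mask.astype(float)[:, None]
    return one @ one.T / one.sum()

rng = np.random.default_rng(3)
n = 30
y = rng.normal(size=n)
in_R = np.arange(n) < 24          # parent region R (first 24 points)
in_c1 = np.arange(n) < 10         # child R ∩ chi_{j,s,1}
in_c0 = in_R & ~in_c1             # child R ∩ chi_{j,s,0}

# M_{R,j,s} as in (33)
M = proj_indicator(in_c1) + proj_indicator(in_c0) - proj_indicator(in_R)

def sse(mask):
    return ((y[mask] - y[mask].mean()) ** 2).sum()

# Lemma 23: gain computed from SSEs equals the quadratic form y^T M y,
print(np.isclose(sse(in_R) - sse(in_c1) - sse(in_c0), y @ M @ y))  # True
# and M is positive semidefinite.
print(np.linalg.eigvalsh(M).min() > -1e-10)                        # True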
We will now show that, for the $l$th level and the $j$th feature, we can compute $a\{R^{(l-1)}, j, s\}$, $b\{R^{(l-1)}, j, s\}$, and $c\{R^{(l-1)}, j, s\}$ for all $n-1$ values of $s$ in $O(n)$ operations. The index $s$ appears in (37)–(39) only through $\sum_{i=1}^{n} 1\{i \in R^{(l-1)} \cap \chi_{j,s,1}\}$ and $\sum_{i=1}^{n} 1\{i \in R^{(l-1)} \cap \chi_{j,s,0}\}$, and through the inner products of the vectors $\nu$ and $P_{\nu}^{\perp} y$ with the indicator vectors $1\{R^{(l-1)} \cap \chi_{j,s,1}\}$ and $1\{R^{(l-1)} \cap \chi_{j,s,0}\}$. For simplicity, we assume that covariate $x_j$ is continuous, and thus the order statistics are unique. Let $m_1$ be the index corresponding to the smallest value of $x_j$. Then $\nu^T 1\{R^{(l-1)} \cap \chi_{j,1,1}\} = \nu_{m_1}$ if observation $m_1$ is in $R^{(l-1)}$, and is 0 otherwise. Similarly, $\sum_{i=1}^{n} 1\{i \in R^{(l-1)} \cap \chi_{j,1,1}\} = 1$ if observation $m_1$ is in $R^{(l-1)}$, and is 0 otherwise. Next, let $m_2$ be the index corresponding to the second smallest value of $x_j$. Then $\nu^T 1\{R^{(l-1)} \cap \chi_{j,2,1}\} = \nu^T 1\{R^{(l-1)} \cap \chi_{j,1,1}\} + \nu_{m_2}$ if observation $m_2$ is in $R^{(l-1)}$, and equals $\nu^T 1\{R^{(l-1)} \cap \chi_{j,1,1}\}$ otherwise. We compute $\sum_{i=1}^{n} 1\{i \in R^{(l-1)} \cap \chi_{j,2,1}\}$ in the same manner. Each update is done in constant time. Continuing in this manner, computing the full set of $n-1$ quantities $\nu^T 1\{R^{(l-1)} \cap \chi_{j,s,1}\}$ and $\sum_{i=1}^{n} 1\{i \in R^{(l-1)} \cap \chi_{j,s,1}\}$ for $s = 1, \ldots, n-1$ requires a single forward pass through the sorted values of $x_j$, which takes $O(n)$ operations. The same ideas can be applied to compute $(P_{\nu}^{\perp} y)^T 1\{R^{(l-1)} \cap \chi_{j,s,1}\}$, $\nu^T 1\{R^{(l-1)} \cap \chi_{j,s,0}\}$, $(P_{\nu}^{\perp} y)^T 1\{R^{(l-1)} \cap \chi_{j,s,0}\}$, and $\sum_{i=1}^{n} 1\{i \in R^{(l-1)} \cap \chi_{j,s,0}\}$ using constant-time updates for each value of $s$.

Thus, we can obtain all components of the coefficients $a\{R^{(l-1)}, j, s\}$, $b\{R^{(l-1)}, j, s\}$, and $c\{R^{(l-1)}, j, s\}$ for a fixed $j$ and $l$, and for all $s = 1, \ldots, n-1$, in $O(n)$ operations. These scalar components can be combined to obtain the coefficients in $O(n)$ operations. Therefore, given the sorted features, we compute the $(n-1)pL$ coefficients in $O(npL)$ operations.

Once the coefficients on the right-hand side of (36) have been computed, we can compute $S_{l,j,s}$ in constant time via the quadratic formula: it is either a single interval or the union of two intervals. Finally, in general we can intersect $(n-1)pL$ intervals in $O\{npL \log(npL)\}$ operations (Bourgon, 2009).
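The single forward pass in the proof of Lemma 25 amounts to prefix sums over one sorted feature. The sketch below is ours (names are illustrative; entries of the output where a potential child of $R^{(l-1)}$ is empty are not genuine splits of that region and come out as NaN under the guarded division).

```python
import numpy as np

def split_stats_one_feature(xj, nu, y_perp, in_R):
    """For one feature and one region R = R^(l-1), compute, for every
    split index s = 1, ..., n-1, the quantities needed by (37)-(39):
    nu^T 1{R ∩ chi_{j,s,1}}, (P_perp y)^T 1{R ∩ chi_{j,s,1}}, and
    |R ∩ chi_{j,s,1}|, in a single O(n) pass over the sorted feature.
    The chi_{j,s,0} versions follow by subtracting from the R totals."""
    order = np.argsort(xj)                 # features are sorted once, up front
    member = in_R[order].astype(float)
    nu_cum = np.cumsum(nu[order] * member)[:-1]
    yp_cum = np.cumsum(y_perp[order] * member)[:-1]
    n_cum = np.cumsum(member)[:-1]
    return nu_cum, yp_cum, n_cum

def a_coef(nu_cum, n_cum, nu_tot, n_tot, nu_norm2):
    """a{R, j, s} as in (37), for all s at once; nu_norm2 = ||nu||_2^2."""
    with np.errstate(divide="ignore", invalid="ignore"):
        return (nu_cum**2 / n_cum
                + (nu_tot - nu_cum)**2 / (n_tot - n_cum)
                - nu_tot**2 / n_tot) / nu_norm2**2

rng = np.random.default_rng(5)
n = 50
xj, nu, y = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
nu_norm2 = nu @ nu
y_perp = y - (nu @ y) * nu / nu_norm2      # P_perp y
in_R = np.ones(n, dtype=bool)
nu_cum, yp_cum, n_cum = split_stats_one_feature(xj, nu, y_perp, in_R)
a_all = a_coef(nu_cum, n_cum, nu[in_R].sum(), in_R.sum(), nu_norm2)
print(a_all.shape)                          # (n-1,): one coefficient per split
```

The $b$ and $c$ coefficients of (38) and (39) are assembled from the same cumulative quantities, so the cost per (level, feature) pair stays $O(n)$.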
The final claim of Proposition 15 involves the special case where $\nu = \nu_{\text{sib}}$ and $B = \text{branch}\{R_A, \text{tree}_\lambda(y)\}$.

Lemma 26 Suppose that $\nu = \nu_{\text{sib}}$ from (6) and $B = \text{branch}\{R_A, \text{tree}_\lambda(y)\}$. (i) If $l < L$, then for all $j$ and $s$, there exist $a, b \in [-\infty, \infty]$ such that $a \leq \nu^T y \leq b$ and $S_{l,j,s} = (a, b)$. (ii) If $l = L$, then for all $j$ and $s$, there exist $c, d \in \mathbb{R}$ such that $c \leq 0 \leq d$ and $S_{l,j,s} = (-\infty, c] \cup [d, \infty)$. (iii) We can intersect all $(n-1)pL$ sets of the form $S_{l,j,s}$ in $O(npL)$ operations.

Proof This proof relies on the form of $S_{l,j,s}$ given in (36). To prove (i), note that when $l < L$,
$$\|\nu_{\text{sib}}\|_2^4 \left[ a\{R^{(l-1)}, j, s\} - a\{R^{(l-1)}, j_l, s_l\} \right] = \nu_{\text{sib}}^T M_{R^{(l-1)},j,s}\, \nu_{\text{sib}} - \nu_{\text{sib}}^T M_{R^{(l-1)},j_l,s_l}\, \nu_{\text{sib}} = \nu_{\text{sib}}^T M_{R^{(l-1)},j,s}\, \nu_{\text{sib}} \geq 0.$$
The first equality follows directly from the definition of $a(R, j, s)$ in (35). To see why the second equality holds, observe that $R_A \cup R_B \subseteq R^{(l-1)}$, and without loss of generality assume that $R_A \cup R_B \subseteq R^{(l-1)} \cap \chi_{j_l,s_l,1}$. Recall that the $i$th element of $\nu_{\text{sib}}$ is non-zero if and only if $i \in R_A \cup R_B$, and that the non-zero elements of $\nu_{\text{sib}}$ sum to 0. Thus, $1\{R^{(l-1)}\}^T \nu_{\text{sib}} = 0$ and $1\{R^{(l-1)} \cap \chi_{j_l,s_l,1}\}^T \nu_{\text{sib}} = 0$. Furthermore, the supports of $1\{R^{(l-1)} \cap \chi_{j_l,s_l,0}\}$ and $\nu_{\text{sib}}$ are non-overlapping, and so $1\{R^{(l-1)} \cap \chi_{j_l,s_l,0}\}^T \nu_{\text{sib}} = 0$. Thus,
$$M_{R^{(l-1)},j_l,s_l}\, \nu_{\text{sib}} = \left[ P_{1\{R^{(l-1)} \cap \chi_{j_l,s_l,1}\}} + P_{1\{R^{(l-1)} \cap \chi_{j_l,s_l,0}\}} - P_{1\{R^{(l-1)}\}} \right] \nu_{\text{sib}} = 0.$$
The final inequality follows because $M_{R^{(l-1)},j,s}$ is positive semidefinite (Lemma 23). Thus, when $l < L$, $S_{l,j,s}$ is defined in (36) by a quadratic inequality with a non-negative quadratic coefficient, and so $S_{l,j,s}$ must be a single interval of the form $(a, b)$. Furthermore, since $B = \text{branch}\{R_A, \text{tree}_\lambda(y)\}$, we know that $R(B) \subseteq \text{tree}_0(y) = \text{tree}_0\{y'(\nu^T y, \nu)\}$. Therefore, $\nu^T y \in S_{\text{grow}}(B, \nu_{\text{sib}}) = \bigcap_{l=1}^{L} \bigcap_{j=1}^{p} \bigcap_{s=1}^{n-1} S_{l,j,s}$, and so we conclude that $a \leq \nu^T y \leq b$. This completes the proof of (i).

To prove (ii), we first prove that when $l = L$, the quadratic inequality in $\phi$ defined in (36) has a non-positive quadratic coefficient. To see this, note that
$$\begin{aligned}
\|\nu_{\text{sib}}\|_2^4 \left[ a\{R^{(L-1)}, j, s\} - a\{R^{(L-1)}, j_L, s_L\} \right]
&= \nu_{\text{sib}}^T M_{R^{(L-1)},j,s}\, \nu_{\text{sib}} - \nu_{\text{sib}}^T M_{R^{(L-1)},j_L,s_L}\, \nu_{\text{sib}} \\
&= \nu_{\text{sib}}^T \left[ P_{1\{R^{(L-1)} \cap \chi_{j,s,1}\}} + P_{1\{R^{(L-1)} \cap \chi_{j,s,0}\}} \right] \nu_{\text{sib}} - \nu_{\text{sib}}^T \left[ P_{1\{R^{(L-1)} \cap \chi_{j_L,s_L,1}\}} + P_{1\{R^{(L-1)} \cap \chi_{j_L,s_L,0}\}} \right] \nu_{\text{sib}} \\
&= \nu_{\text{sib}}^T \left[ P_{1\{R^{(L-1)} \cap \chi_{j,s,1}\}} + P_{1\{R^{(L-1)} \cap \chi_{j,s,0}\}} \right] \nu_{\text{sib}} - \nu_{\text{sib}}^T \nu_{\text{sib}} \\
&= \left\| \left[ P_{1\{R^{(L-1)} \cap \chi_{j,s,1}\}} + P_{1\{R^{(L-1)} \cap \chi_{j,s,0}\}} \right] \nu_{\text{sib}} \right\|_2^2 - \|\nu_{\text{sib}}\|_2^2 \leq 0. \quad (40)
\end{aligned}$$
The first equality follows from (35). The second follows from the definition of $M_{R,j,s}$ in (33) and from the fact that $P_{1\{R^{(L-1)}\}}\, \nu_{\text{sib}} = 0$, because $1\{R^{(L-1)}\}^T \nu_{\text{sib}}$ sums all of the non-zero elements of $\nu_{\text{sib}}$, which sum to 0. The third equality follows because $\nu_{\text{sib}}$ lies in $\text{span}\left\{ 1\{R^{(L-1)} \cap \chi_{j_L,s_L,1}\},\ 1\{R^{(L-1)} \cap \chi_{j_L,s_L,0}\} \right\}$: projecting it onto this span yields itself. Noting that $P_{1\{R^{(L-1)} \cap \chi_{j,s,1}\}} + P_{1\{R^{(L-1)} \cap \chi_{j,s,0}\}}$ is itself a projection matrix, the fourth equality follows from the idempotence of projection matrices, and the inequality follows from the fact that $\|Q\nu_{\text{sib}}\|_2 \leq \|\nu_{\text{sib}}\|_2$ for any projection matrix $Q$.

Thus, when $l = L$, the quadratic that defines $S_{l,j,s}$ has a non-positive quadratic coefficient. Equality is attained in (40) if and only if $\nu_{\text{sib}} \in \text{span}\left\{ 1\{R^{(L-1)} \cap \chi_{j,s,1}\},\ 1\{R^{(L-1)} \cap \chi_{j,s,0}\} \right\}$. This can only happen if splitting $R^{(L-1)}$ on $(j, s)$ yields an identical partition of the data to splitting on $(j_L, s_L)$. If this is the case, then $S_{L,j,s} = (-\infty, 0] \cup [0, \infty)$ from the definition of $S_{l,j,s}$ in Proposition 14, and so (ii) is satisfied with $c = d = 0$.

We now proceed to the setting where the inequality in (40) is strict. In this case, (36) implies that $S_{L,j,s} = (-\infty, c] \cup [d, \infty)$ for some $c \leq d$ with $c, d \in \mathbb{R}$. To complete the proof of (ii), we must argue that $c \leq 0$ and $d \geq 0$. Recall that the quadratic in (34) has the form $\text{gain}_{R^{(L-1)}}\{y'(\phi, \nu), j, s\} - \text{gain}_{R^{(L-1)}}\{y'(\phi, \nu), j_L, s_L\}$. When $\phi = 0$, $\text{gain}_{R^{(L-1)}}\{y'(\phi, \nu), j_L, s_L\} = 0$, because $\phi = 0$ eliminates the contrast between $R_A$ and $R_B$, so that the split on $(j_L, s_L)$ provides zero gain. So, when $\phi = 0$, the quadratic evaluates to $\text{gain}_{R^{(L-1)}}\{y'(0, \nu), j, s\}$, which is non-negative by Lemma 23. Thus, $S_{L,j,s}$ is defined by a downward-facing quadratic that is non-negative when $\phi = 0$, and so the set $S_{L,j,s}$ has the form $(-\infty, c] \cup [d, \infty)$ for $c \leq 0 \leq d$.

To prove (iii), observe that (i) implies that $\bigcap_{l=1}^{L-1} \bigcap_{j=1}^{p} \bigcap_{s=1}^{n-1} S_{l,j,s} = (a_{\max}, b_{\min})$, where $a_{\max}$ is the maximum over all of the $a$'s and $b_{\min}$ is the minimum over all of the $b$'s. This can be computed in $np(L-1)$ steps. Furthermore, (ii) implies that $\bigcap_{j=1}^{p} \bigcap_{s=1}^{n-1} S_{L,j,s} = (-\infty, c_{\min}] \cup [d_{\max}, \infty)$, where $c_{\min}$ and $d_{\max}$ are the minimum over all of the $c$'s and the maximum over all of the $d$'s, respectively. This can be computed in $np$ steps. Thus, we can compute $\bigcap_{l=1}^{L} \bigcap_{j=1}^{p} \bigcap_{s=1}^{n-1} S_{l,j,s}$ in $O(npL)$ operations; a sketch of this final intersection appears below.
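The intersection in part (iii) of Lemma 26 needs only running extremes. The helper below is ours (open versus closed endpoints are glossed over, as befits a sketch); it combines the single intervals from levels $l < L$ with the two-sided sets from level $L$ and returns $S_{\text{grow}}(B, \nu_{\text{sib}})$ as at most two disjoint intervals.

```python
def intersect_sgrow_sib(ab_pairs, cd_pairs):
    """Intersect the S_{l,j,s}: for l < L each is one interval (a, b);
    for l = L each is (-inf, c] u [d, inf) with c <= 0 <= d (Lemma 26).
    Returns S_grow(B, nu_sib) as a list of at most two intervals."""
    a_max = max(a for a, _ in ab_pairs)
    b_min = min(b for _, b in ab_pairs)
    c_min = min(c for c, _ in cd_pairs)
    d_max = max(d for _, d in cd_pairs)
    pieces = []
    if a_max < min(b_min, c_min):           # left piece of the intersection
        pieces.append((a_max, min(b_min, c_min)))
    if max(a_max, d_max) < b_min:           # right piece of the intersection
        pieces.append((max(a_max, d_max), b_min))
    return pieces

S = intersect_sgrow_sib([(-3.0, 4.0), (-2.5, 5.0)], [(-0.5, 0.7), (-0.2, 1.1)])
print(S)   # [(-2.5, -0.5), (1.1, 4.0)]
```

Because $c_{\min} \leq 0 \leq d_{\max}$, the two pieces are always disjoint, matching the at-most-two-interval structure used later in the proof of Lemma 30.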
D.3 Proof of Proposition 16

To prove Proposition 16, we first propose a particular method of constructing an example of $\text{tree}(B, \nu, \lambda)$. We then show that (21) holds for this particular choice of $\text{tree}(B, \nu, \lambda)$. We conclude by evaluating the computational cost of computing such an example of $\text{tree}(B, \nu, \lambda)$, and by arguing that in the special case where $R(B) \subseteq \text{tree}_\lambda(y)$, our example is equal to $\text{tree}_\lambda(y)$.

When Algorithm A2 is called with parameters $(\text{tree}, y, \lambda, O)$, where $O$ is a bottom-up ordering of the $K$ nodes in $\text{tree}$, it computes a sequence of intermediate trees. We use the notation $\text{tree}_k(\text{tree}, y, \lambda, O)$, for $k = 0, \ldots, K$, to denote the $k$th of these intermediate trees. The following lemma helps build up to our proposed example of $\text{tree}(B, \nu, \lambda)$.

Lemma 27 Let $\phi_1 \in S_{\text{grow}}(B, \nu)$ and $\phi_2 \in S_{\text{grow}}(B, \nu)$. Then $\text{tree}_0\{y'(\phi_1, \nu)\} = \text{tree}_0\{y'(\phi_2, \nu)\}$. Let $O$ be a bottom-up ordering of the $K$ regions in $\text{tree}_0\{y'(\phi_1, \nu)\}$ such that the last $L$ regions in the ordering are $R^{(L-1)}, \ldots, R^{(0)}$. Then
$$\text{tree}_{K-L}\left[ \text{tree}_0\{y'(\phi_1, \nu)\},\ y'(\phi_1, \nu),\ \lambda,\ O \right] = \text{tree}_{K-L}\left[ \text{tree}_0\{y'(\phi_2, \nu)\},\ y'(\phi_2, \nu),\ \lambda,\ O \right].$$

Proof We first prove that $\text{tree}_0\{y'(\phi_1, \nu)\} \subseteq \text{tree}_0\{y'(\phi_2, \nu)\}$, where $\phi_1, \phi_2 \in S_{\text{grow}}(B, \nu)$. The fact that $\phi_1 \in S_{\text{grow}}(B, \nu)$ and $\phi_2 \in S_{\text{grow}}(B, \nu)$ implies two properties:

Property 1: $R^{(l)} \in \text{tree}_0\{y'(\phi_1, \nu)\}$ and $R^{(l)} \in \text{tree}_0\{y'(\phi_2, \nu)\}$ for $l \in \{0, \ldots, L\}$, by the definition of $S_{\text{grow}}(B, \nu)$.

Property 2: $R^{(l)}_{\text{sib}} \in \text{tree}_0\{y'(\phi_1, \nu)\}$ and $R^{(l)}_{\text{sib}} \in \text{tree}_0\{y'(\phi_2, \nu)\}$ for $l \in \{1, \ldots, L\}$, where $R^{(l)}_{\text{sib}} \equiv R^{(l-1)} \cap \chi_{j_l, s_l, 1-e_l}$. This follows from Property 1 and Definition 1.

Suppose that $R \in \text{tree}_0\{y'(\phi_1, \nu)\}$. Then $R$ must belong to one of the following three cases, illustrated in Figure 10(a):

Case 1: There exists $l \in \{0, \ldots, L\}$ such that $R = R^{(l)}$. By Property 1, $R \in \text{tree}_0\{y'(\phi_2, \nu)\}$.

Case 2: There exists $l \in \{1, \ldots, L\}$ such that $R = R^{(l)}_{\text{sib}}$. By Property 2, $R \in \text{tree}_0\{y'(\phi_2, \nu)\}$.

Case 3: $R \in \text{desc}[R^{\circ}, \text{tree}_0\{y'(\phi_1, \nu)\}]$, where either $R^{\circ} = R^{(l)}_{\text{sib}}$ for some $l \in \{1, \ldots, L\}$, or else $R^{\circ} = R^{(L)}$. By Properties 1 and 2, $R^{\circ} \in \text{tree}_0\{y'(\phi_2, \nu)\}$. Condition 1 ensures that, for all $i \in R^{\circ}$ and for some constants $c$ and $d$,
$$\{y'(\phi_2, \nu)\}_i = \begin{cases} \{y'(\phi_1, \nu)\}_i & \text{if } R^{\circ} = R^{(l)}_{\text{sib}} \text{ for some } l \in \{1, \ldots, L-1\}, \\ \{y'(\phi_1, \nu)\}_i + c & \text{if } R^{\circ} = R^{(L)}_{\text{sib}}, \\ \{y'(\phi_1, \nu)\}_i + d & \text{if } R^{\circ} = R^{(L)}. \end{cases}$$
As constant shifts preserve within-node sums of squared errors, in each of these three scenarios, $\text{desc}[R^{\circ}, \text{tree}_0\{y'(\phi_1, \nu)\}] = \text{desc}[R^{\circ}, \text{tree}_0\{y'(\phi_2, \nu)\}]$. Thus, $R \in \text{tree}_0\{y'(\phi_2, \nu)\}$.

Thus, if $R \in \text{tree}_0\{y'(\phi_1, \nu)\}$, then $R \in \text{tree}_0\{y'(\phi_2, \nu)\}$. This completes the argument that $\text{tree}_0\{y'(\phi_1, \nu)\} \subseteq \text{tree}_0\{y'(\phi_2, \nu)\}$. Swapping the roles of $\phi_1$ and $\phi_2$ in this argument, we see that $\text{tree}_0\{y'(\phi_2, \nu)\} \subseteq \text{tree}_0\{y'(\phi_1, \nu)\}$. This concludes the proof that $\text{tree}_0\{y'(\phi_1, \nu)\} = \text{tree}_0\{y'(\phi_2, \nu)\}$.

Because $\text{tree}_0\{y'(\phi_1, \nu)\} = \text{tree}_0\{y'(\phi_2, \nu)\}$, it follows that any bottom-up ordering of the regions in $\text{tree}_0\{y'(\phi_1, \nu)\}$ is also a bottom-up ordering for the regions in $\text{tree}_0\{y'(\phi_2, \nu)\}$. We next prove by induction that, if we choose a bottom-up ordering $O$ that places the regions in $R(B)$ at the end of the ordering, then
$$\text{tree}_k\left[ \text{tree}_0\{y'(\phi_1, \nu)\},\ y'(\phi_1, \nu),\ \lambda,\ O \right] = \text{tree}_k\left[ \text{tree}_0\{y'(\phi_2, \nu)\},\ y'(\phi_2, \nu),\ \lambda,\ O \right], \quad (41)$$
for $k = 0, \ldots, K-L$. It follows immediately from Algorithm A2 and the argument above that (41) holds for $k = 0$. Next, suppose that for some $k \in \{1, \ldots, K-L\}$,
$$\text{tree}_{k-1}\left[ \text{tree}_0\{y'(\phi_1, \nu)\},\ y'(\phi_1, \nu),\ \lambda,\ O \right] = \text{tree}_{k-1}\left[ \text{tree}_0\{y'(\phi_2, \nu)\},\ y'(\phi_2, \nu),\ \lambda,\ O \right], \quad (42)$$
and denote this tree by $\text{tree}_{k-1}$ for brevity. We must prove that (41) holds. Let $R$ be the $k$th region in $O$, and recall the assumption that the last $L$ regions in $O$ are $\{R^{(L-1)}, \ldots, R^{(0)}\}$. Since $k \leq K-L$, this implies that $R \notin \{R^{(L-1)}, \ldots, R^{(0)}\}$. This means that either $R \in \text{desc}\left[ R^{(l)}_{\text{sib}}, \text{tree}_0\{y'(\phi_1, \nu)\} \right]$ for some $l \in \{1, \ldots, L\}$, or $R \in \text{desc}\left[ R^{(L)}, \text{tree}_0\{y'(\phi_1, \nu)\} \right]$; that is, $R$ is a black region in Figure 10(b). From Condition 1,
$$\{y'(\phi_2, \nu)\}_i = \begin{cases} \{y'(\phi_1, \nu)\}_i & \text{if } R \in \text{desc}\left[ R^{(l)}_{\text{sib}}, \text{tree}_0\{y'(\phi_1, \nu)\} \right] \text{ for } l \in \{1, \ldots, L-1\}, \\ \{y'(\phi_1, \nu)\}_i + c & \text{if } R \in \text{desc}\left[ R^{(L)}_{\text{sib}}, \text{tree}_0\{y'(\phi_1, \nu)\} \right], \\ \{y'(\phi_1, \nu)\}_i + d & \text{if } R \in \text{desc}\left[ R^{(L)}, \text{tree}_0\{y'(\phi_1, \nu)\} \right]. \end{cases}$$
In any of the three cases illustrated in Figure 10, for $g(\cdot)$ defined in (3), $g\{R, \text{tree}_{k-1}, y'(\phi_1, \nu)\} = g\{R, \text{tree}_{k-1}, y'(\phi_2, \nu)\}$. Combining this with (42) and Step 2(b) of Algorithm A2 yields (41). This completes the proof by induction.
Figure 10: (a) An illustration of Case 1 (red), Case 2 (blue), and Case 3 (black) for a region $R \in \text{tree}_0\{y'(\phi_1, \nu)\}$ in the base case of the proof of Lemma 27, where $R(B) = \{R^{(0)}, \ldots, R^{(3)}\}$. (b) The black regions show the possible cases for $R \in \text{tree}_{k-1}$ in the inductive step of the proof of Lemma 27.

Since Lemma 27 guarantees that each $\phi \in S_{\text{grow}}(B, \nu)$ leads to the same $\text{tree}_0\{y'(\phi, \nu)\}$, we will refer to this tree as $\text{tree}_0$, will let $K$ be the number of regions in this tree, and will let $O$ be a bottom-up ordering of these regions that places $R^{(L-1)}, \ldots, R^{(0)}$ in the last $L$ spots. We will further denote $\text{tree}_{K-L}\{\text{tree}_0, y'(\phi, \nu), \lambda, O\}$ for any $\phi \in S_{\text{grow}}(B, \nu)$ by $\text{tree}_{K-L}$, since Lemma 27 further tells us that this is the same for all $\phi \in S_{\text{grow}}(B, \nu)$.

In what follows, we argue that if we let $\text{tree}(B, \nu, \lambda) = \text{tree}_{K-L}$, where $\text{tree}(B, \nu, \lambda)$ appears in the statement of Proposition 16, then (21) holds. In other words, we prove that $\text{tree}_{K-L}$, which always exists and is well-defined, is a valid example of $\text{tree}(B, \nu, \lambda)$.

Recall from (18) that $S^{\lambda}(B, \nu) = \left\{ \phi \in S_{\text{grow}}(B, \nu) : R^{(L)} \in \text{tree}_\lambda\{y'(\phi, \nu)\} \right\}$. Lemma 27 says that, for $\phi \in S_{\text{grow}}(B, \nu)$, we can rewrite $\text{tree}_\lambda\{y'(\phi, \nu)\}$ as $\text{tree}_K\{\text{tree}_0, y'(\phi, \nu), \lambda, O\}$. So we can rewrite $S^{\lambda}(B, \nu)$ as
$$S^{\lambda}(B, \nu) = \left\{ \phi \in S_{\text{grow}}(B, \nu) : R^{(L)} \in \text{tree}_K\{\text{tree}_0, y'(\phi, \nu), \lambda, O\} \right\}. \quad (43)$$
Furthermore, since $R^{(L-1)}, \ldots, R^{(0)}$ (all of which are ancestors of $R^{(L)}$) are the last $L$ nodes in the ordering $O$, we see that $R^{(L)} \in \text{tree}_K\{\text{tree}_0, y'(\phi, \nu), \lambda, O\}$ if and only if no pruning occurs during the last $L$ iterations of Step 2 in Algorithm A2. This means that we can characterize (43) as
$$\left\{ \phi \in S_{\text{grow}}(B, \nu) : \text{tree}_{K-L}\{\text{tree}_0, y'(\phi, \nu), \lambda, O\} = \text{tree}_K\{\text{tree}_0, y'(\phi, \nu), \lambda, O\} \right\}. \quad (44)$$
Recall that for $k = K-L+1, \ldots, K$, $R^{(K-k)}$ is the $k$th region in $O$, and is an ancestor of $R^{(L)}$. We next argue that we can rewrite (44) as
$$\left\{ \phi \in S_{\text{grow}}(B, \nu) : g\{R^{(K-k)}, \text{tree}_{K-L}, y'(\phi, \nu)\} \geq \lambda \ \text{ for } k = K-L+1, \ldots, K \right\}. \quad (45)$$
To begin, suppose that $\phi \in (45)$. As we are talking about a particular $\phi$, for $k = 0, \ldots, K$ we will suppress the dependence of $\text{tree}_k\{\text{tree}_0, y'(\phi, \nu), \lambda, O\}$ on its arguments and denote it by $\text{tree}_k$. The fact that $\phi \in (45)$ means that $g\{R^{(L-1)}, \text{tree}_{K-L}, y'(\phi, \nu)\} \geq \lambda$, which ensures that no pruning occurs at step $K-L+1$, which in turn ensures that $\text{tree}_{K-L+1} = \text{tree}_{K-L}$. Combined with (45), this implies that $g\{R^{(L-2)}, \text{tree}_{K-L}, y'(\phi, \nu)\} = g\{R^{(L-2)}, \text{tree}_{K-L+1}, y'(\phi, \nu)\} \geq \lambda$, which ensures that no pruning occurs at step $K-L+2$, which in turn ensures that $\text{tree}_{K-L+2} = \text{tree}_{K-L+1} = \text{tree}_{K-L}$. Proceeding in this manner, by tracing through the last $L$ iterations of Step 2 of Algorithm A2, we see that $\phi$ satisfies $\text{tree}_K = \text{tree}_{K-L}$, and so $\phi \in (44)$.

Next, suppose that $\phi \notin (45)$. Let $k^* = \min_{k \in \{K-L+1, \ldots, K\}} \left\{ k : g\{R^{(K-k)}, \text{tree}_{K-L}, y'(\phi, \nu)\} < \lambda \right\}$. As $k^*$ is a minimum, we know that no pruning occurred during steps $K-L+1, \ldots, k^*-1$, and so $\text{tree}_{k^*-1}\left[ \text{tree}_0\{y'(\phi, \nu)\}, y'(\phi, \nu), \lambda, O \right] = \text{tree}_{K-L}$. This implies that $g\{R^{(K-k^*)}, \text{tree}_{K-L}, y'(\phi, \nu)\} < \lambda$ can be rewritten as
$$g\left[ R^{(K-k^*)},\ \text{tree}_{k^*-1}\left[ \text{tree}_0\{y'(\phi, \nu)\},\ y'(\phi, \nu),\ \lambda,\ O \right],\ y'(\phi, \nu) \right] < \lambda.$$
It then follows from Algorithm A2 that pruning occurs at step $k^*$, which means that $\text{tree}_K$ cannot possibly equal $\text{tree}_{K-L}$. Thus, $\phi \notin (44)$.

Thus, $\phi \in (45)$ if and only if $\phi \in (44)$. Finally, Proposition 16 rewrites (45) with the indexing over $k$ changed to an indexing over $l$, and with $\text{tree}(B, \nu, \lambda) = \text{tree}_{K-L}$ plugged in. Therefore, $\text{tree}_{K-L}$ is a valid example of $\text{tree}(B, \nu, \lambda)$.
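The grow-then-partially-prune construction can be mimicked with a toy implementation. The sketch below is ours, using a hypothetical dict-based tree representation rather than the data structures of Algorithms A1 and A2; the key step is a single bottom-up pruning pass in which the last $L$ nodes (the branch to $R^{(L)}$) are exempt, yielding $\text{tree}_{K-L}$.

```python
import numpy as np

def sse(y, idx):
    return ((y[idx] - y[idx].mean()) ** 2).sum() if len(idx) else 0.0

def terminal_sse_and_count(node, y):
    """Total within-leaf SSE and number of leaves below `node`."""
    if node["left"] is None:
        return sse(y, node["idx"]), 1
    sl, cl = terminal_sse_and_count(node["left"], y)
    sr, cr = terminal_sse_and_count(node["right"], y)
    return sl + sr, cl + cr

def g(node, y):
    """Per-split improvement g(R, tree, y) in the spirit of (3): the drop
    in SSE from keeping the subtree below R, per terminal node removed."""
    s_term, n_term = terminal_sse_and_count(node, y)
    return (sse(y, node["idx"]) - s_term) / (n_term - 1)

def prune_bottom_up(order, y, lam, skip_last=0):
    """One Algorithm-A2-style pass over a bottom-up ordering of internal
    nodes, collapsing a node whenever g(R, tree, y) < lambda. Setting
    skip_last = L exempts the branch nodes, giving tree_{K-L}."""
    for node in order[: len(order) - skip_last]:
        if node["left"] is not None and g(node, y) < lam:
            node["left"] = node["right"] = None   # collapse to a leaf

# Tiny usage example: one internal node with a strong split.
y = np.array([0.0, 0.1, 5.0, 5.2])
leafL = {"idx": np.array([0, 1]), "left": None, "right": None}
leafR = {"idx": np.array([2, 3]), "left": None, "right": None}
root = {"idx": np.arange(4), "left": leafL, "right": leafR}
prune_bottom_up([root], y, lam=1.0)
print(root["left"] is None)   # False: g(root) is large, the split survives
```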
To compute $\text{tree}(B, \nu, \lambda)$, we first select an arbitrary $\phi \in S_{\text{grow}}(B, \nu)$. We then apply Algorithm A1 to grow $\text{tree}_0\{y'(\phi, \nu)\}$. We create a bottom-up ordering $O$ of the $K$ nodes in $\text{tree}_0\{y'(\phi, \nu)\}$ such that $R^{(L-1)}, \ldots, R^{(0)}$ are at the end. Finally, we apply the first $K-L$ iterations of Algorithm A2 with arguments $\text{tree}_0\{y'(\phi, \nu)\}$, $y'(\phi, \nu)$, $\lambda$, and $O$ to obtain $\text{tree}(B, \nu, \lambda)$. The worst-case computational cost of CART (the combined Algorithms A1 and A2) is $O(n^2 p)$.

In the special case that $R(B) \subseteq \text{tree}_\lambda(y)$, we have that $\nu^T y \in S_{\text{grow}}(B, \nu)$, because $y = y'(\nu^T y, \nu)$ and $R(B) \subseteq \text{tree}_\lambda(y) \subseteq \text{tree}_0(y)$. In this case, suppose that we carry out the process described in the previous paragraph by selecting $\phi = \nu^T y$. As $R(B) \subseteq \text{tree}_\lambda(y)$, it is clear from Algorithm A2 that no pruning occurs during the last $L$ iterations of Step 2 in Algorithm A2 applied to arguments $\text{tree}_0(y)$, $y$, $\lambda$, and $O$. Therefore, the process from the previous paragraph returns the optimally pruned $\text{tree}_\lambda(y)$. Thus, when $R(B) \subseteq \text{tree}_\lambda(y)$, we can simply plug in $\text{tree}_\lambda(y)$, which has already been built, for $\text{tree}(B, \nu, \lambda)$ in (21).

D.4 Proof of Proposition 17

We first show that we can express
$$\left\{ \phi : g\{R^{(l)}, \text{tree}(B, \nu, \lambda), y'(\phi, \nu)\} \geq \lambda \right\} \quad (46)$$
as the solution set of a quadratic inequality in $\phi$ for $l = 0, \ldots, L-1$, where $g(\cdot)$ was defined in (3). Because only the numerator of $g\{R^{(l)}, \text{tree}(B, \nu, \lambda), y'(\phi, \nu)\}$ depends on $\phi$, it will be useful to introduce the following concise notation:
$$h(R, \text{tree}, y) \equiv \sum_{i \in R} (y_i - \bar{y}_R)^2 - \sum_{r \in \text{term}(R, \text{tree})} \sum_{i \in r} (y_i - \bar{y}_r)^2. \quad (47)$$
We begin with the following lemma.

Lemma 28 Suppose that region $R$ in $\text{tree}$ has children $R \cap \chi_{j,s,0}$ and $R \cap \chi_{j,s,1}$. Then $h(R, \text{tree}, y) = \text{gain}_R(y, j, s) + h(R \cap \chi_{j,s,1}, \text{tree}, y) + h(R \cap \chi_{j,s,0}, \text{tree}, y)$, where $\text{gain}_R(y, j, s)$ is defined in (2).

Proof The result follows from adding and subtracting $\sum_{i \in R \cap \chi_{j,s,1}} (y_i - \bar{y}_{R \cap \chi_{j,s,1}})^2$ and $\sum_{i \in R \cap \chi_{j,s,0}} (y_i - \bar{y}_{R \cap \chi_{j,s,0}})^2$ in (47), and noting that $\text{term}(R, \text{tree}) = \text{term}(R \cap \chi_{j,s,1}, \text{tree}) \cup \text{term}(R \cap \chi_{j,s,0}, \text{tree})$.

Recall that $B = ((j_1, s_1, e_1), \ldots, (j_L, s_L, e_L))$ and that $R(B) = \{R^{(0)}, R^{(1)}, \ldots, R^{(L)}\}$. Lemma 29 follows from Lemma 28 and the fact that, due to the form of the vector $\nu$, there are many regions $R$ for which $h\{R, \text{tree}(B, \nu, \lambda), y'(\phi, \nu)\}$ does not depend on $\phi$.

Lemma 29 For any $\bar{\phi} \in S_{\text{grow}}(B, \nu)$, we can decompose $h\{R^{(l)}, \text{tree}(B, \nu, \lambda), y'(\phi, \nu)\}$ as
$$h\{R^{(l)}, \text{tree}(B, \nu, \lambda), y'(\phi, \nu)\} = \sum_{l'=l+1}^{L} h\{R^{(l')}_{\text{sib}}, \text{tree}(B, \nu, \lambda), y'(\bar{\phi}, \nu)\} + h\{R^{(L)}, \text{tree}(B, \nu, \lambda), y'(\bar{\phi}, \nu)\} + \sum_{l'=l}^{L-1} \text{gain}_{R^{(l')}}\{y'(\phi, \nu), j_{l'+1}, s_{l'+1}\},$$
where $R^{(l)}_{\text{sib}}$ is the sibling of $R^{(l)}$ in $\text{tree}(B, \nu, \lambda)$.

Proof Repeatedly applying Lemma 28 yields
$$h\{R^{(l)}, \text{tree}(B, \nu, \lambda), y'(\phi, \nu)\} = \sum_{l'=l+1}^{L} h\{R^{(l')}_{\text{sib}}, \text{tree}(B, \nu, \lambda), y'(\phi, \nu)\} + h\{R^{(L)}, \text{tree}(B, \nu, \lambda), y'(\phi, \nu)\} + \sum_{l'=l}^{L-1} \text{gain}_{R^{(l')}}\{y'(\phi, \nu), j_{l'+1}, s_{l'+1}\}. \quad (48)$$
For $l' = l+1, \ldots, L-1$, $h\{R^{(l')}_{\text{sib}}, \text{tree}(B, \nu, \lambda), y'(\phi, \nu)\} = h\{R^{(l')}_{\text{sib}}, \text{tree}(B, \nu, \lambda), y'(\bar{\phi}, \nu)\}$, because $R^{(l')}_{\text{sib}}$ only contains observations where $\{y'(\phi, \nu)\}_i = \{y'(\bar{\phi}, \nu)\}_i$. Similarly, $h\{R^{(L)}, \text{tree}(B, \nu, \lambda), y'(\phi, \nu)\} = h\{R^{(L)}, \text{tree}(B, \nu, \lambda), y'(\bar{\phi}, \nu)\}$ and $h\{R^{(L)}_{\text{sib}}, \text{tree}(B, \nu, \lambda), y'(\phi, \nu)\} = h\{R^{(L)}_{\text{sib}}, \text{tree}(B, \nu, \lambda), y'(\bar{\phi}, \nu)\}$, because for $i \in R^{(L)}$ and $i \in R^{(L)}_{\text{sib}}$, $\{y'(\phi, \nu)\}_i$ and $\{y'(\bar{\phi}, \nu)\}_i$ differ only by a constant shift. Plugging these two facts into (48) completes the proof.
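The decomposition in Lemma 28 is a simple sum-of-squares identity, which the following sketch (ours, on an arbitrary two-level partition of toy indices) verifies numerically.

```python
import numpy as np

rng = np.random.default_rng(7)
y = rng.normal(size=16)

# A two-level tree on indices 0..15: R splits into C1 and C0; C1 splits again.
R = np.arange(16)
C1, C0 = R[:8], R[8:]
C11, C10 = C1[:4], C1[4:]

def sse(idx):
    return ((y[idx] - y[idx].mean()) ** 2).sum()

def h(idx, leaves):
    """h(R, tree, y) as in (47): SSE at R minus total SSE over the
    terminal nodes below R."""
    return sse(idx) - sum(sse(leaf) for leaf in leaves)

gain_R = sse(R) - sse(C1) - sse(C0)             # gain_R(y, j, s) as in (2)
lhs = h(R, [C11, C10, C0])                      # leaves below R
rhs = gain_R + h(C1, [C11, C10]) + h(C0, [C0])  # Lemma 28 decomposition
print(np.isclose(lhs, rhs))                     # True
```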
We can now write (46) as
$$\begin{aligned}
&\left\{ \phi : g\{R^{(l)}, \text{tree}(B, \nu, \lambda), y'(\phi, \nu)\} \geq \lambda \right\} \\
&\quad= \left\{ \phi : h\{R^{(l)}, \text{tree}(B, \nu, \lambda), y'(\phi, \nu)\} \geq \lambda \left[ \left| \text{term}\{R^{(l)}, \text{tree}(B, \nu, \lambda)\} \right| - 1 \right] \right\} \\
&\quad= \left\{ \phi : \sum_{l'=l}^{L-1} \text{gain}_{R^{(l')}}\{y'(\phi, \nu), j_{l'+1}, s_{l'+1}\} \geq \lambda \left[ \left| \text{term}\{R^{(l)}, \text{tree}(B, \nu, \lambda)\} \right| - 1 \right] - \sum_{l'=l+1}^{L} h\{R^{(l')}_{\text{sib}}, \text{tree}(B, \nu, \lambda), y'(\bar{\phi}, \nu)\} - h\{R^{(L)}, \text{tree}(B, \nu, \lambda), y'(\bar{\phi}, \nu)\} \right\} \\
&\quad= \left\{ \phi : \sum_{l'=l}^{L-1} \left[ a\{R^{(l')}, j_{l'+1}, s_{l'+1}\}\, \phi^2 + b\{R^{(l')}, j_{l'+1}, s_{l'+1}\}\, \phi + c\{R^{(l')}, j_{l'+1}, s_{l'+1}\} \right] \geq \gamma_l \right\}, \quad (49)
\end{aligned}$$
where the functions $a(\cdot)$, $b(\cdot)$, and $c(\cdot)$ were defined in (35) in Appendix D.2, and where
$$\gamma_l \equiv \lambda \left[ \left| \text{term}\{R^{(l)}, \text{tree}(B, \nu, \lambda)\} \right| - 1 \right] - \sum_{l'=l+1}^{L} h\{R^{(l')}_{\text{sib}}, \text{tree}(B, \nu, \lambda), y'(\bar{\phi}, \nu)\} - h\{R^{(L)}, \text{tree}(B, \nu, \lambda), y'(\bar{\phi}, \nu)\}$$
is a constant that does not depend on $\phi$. The first equality simply applies the definitions of $h(\cdot)$ and $g(\cdot)$. The second equality follows from Lemma 29 and from moving terms that do not depend on $\phi$ to the right-hand side. The third equality follows from plugging in notation from Appendix D.2 and defining the constant $\gamma_l$ for convenience. Thus, (46) is a quadratic inequality in $\phi$. We now just need to argue that its coefficients can be obtained efficiently.

We need to compute the coefficients in (49) for $l = 0, \ldots, L-1$. The quantities $a\{R^{(l')}, j_{l'+1}, s_{l'+1}\}$, $b\{R^{(l')}, j_{l'+1}, s_{l'+1}\}$, and $c\{R^{(l')}, j_{l'+1}, s_{l'+1}\}$ for $l' = 0, \ldots, L-1$ were already computed while computing $S_{\text{grow}}(B, \nu)$. To get the coefficients for the left-hand side of (49) for each $l = 0, \ldots, L-1$, we simply need to compute $L$ partial sums of these quantities, which takes $O(L)$ operations. As we are assuming that we have access to $\text{tree}(B, \nu, \lambda)$, computing $h\{R, \text{tree}(B, \nu, \lambda), y'(\bar{\phi}, \nu)\}$ requires $O(n)$ operations. Therefore, computing $\gamma_0$ takes $O(nL)$ operations. By storing partial sums during the computation of $\gamma_0$, we can subsequently obtain $\gamma_l$ for $l = 1, \ldots, L-1$ in constant time.

We have now seen that we can obtain the coefficients needed to express (46) as a quadratic function of $\phi$ for $l = 0, \ldots, L-1$ in $O(nL)$ total operations. Once we have these quantities, we can compute each set of the form (46) in constant time using the quadratic formula. It remains to compute
$$S^{\lambda}(B, \nu) = S_{\text{grow}}(B, \nu) \cap \left[ \bigcap_{l=0}^{L-1} \left\{ \phi : g\{R^{(l)}, \text{tree}(B, \nu, \lambda), y'(\phi, \nu)\} \geq \lambda \right\} \right]. \quad (50)$$
Recall from Proposition 14 that $S_{\text{grow}}(B, \nu)$ is the intersection of $O(npL)$ quadratic sets. Thus, in the worst case, $S_{\text{grow}}(B, \nu)$ has $O(npL)$ disjoint components, and so this final intersection involves $O(npL)$ components. Thus, we can compute $S^{\lambda}(B, \nu)$ in $O\{npL \log(npL)\}$ operations (Bourgon, 2009).
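Both (36) and (49) reduce to solving scalar quadratic inequalities. The helper below is ours (a sketch: the double-root case is folded into the constant-sign case, and the tolerance is arbitrary); it maps one coefficient triple to the corresponding set of $\phi$ values.

```python
import numpy as np

def solve_quadratic_ineq(a, b, c, geq=True):
    """Solution set of a*phi^2 + b*phi + c >= 0 (or <= 0 if geq=False),
    returned as a list of intervals. This is the constant-time step that
    turns each coefficient triple from (36) or (49) into a region of phi."""
    if abs(a) < 1e-12:                       # effectively linear
        if abs(b) < 1e-12:
            return [(-np.inf, np.inf)] if (c >= 0) == geq else []
        root = -c / b
        if (b > 0) == geq:
            return [(root, np.inf)]
        return [(-np.inf, root)]
    disc = b * b - 4 * a * c
    if disc <= 0:                            # no sign change: constant sign
        return [(-np.inf, np.inf)] if (a > 0) == geq else []
    r1, r2 = sorted([(-b - np.sqrt(disc)) / (2 * a),
                     (-b + np.sqrt(disc)) / (2 * a)])
    inside, outside = [(r1, r2)], [(-np.inf, r1), (r2, np.inf)]
    if a > 0:                                # upward-facing parabola
        return outside if geq else inside
    return inside if geq else outside        # downward-facing parabola

print(solve_quadratic_ineq(-1.0, 0.0, 4.0))  # [(-2.0, 2.0)]
```

A downward-facing quadratic with a non-negative value at zero, as in Lemma 26(ii), yields the two-sided complement set, while the upward-facing case of Lemma 26(i) yields a single interval.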
The following lemma explains why, similar to Proposition 15, computation time can be reduced in the special case where $\nu = \nu_{\text{sib}}$ and $B = \text{branch}\{R_A, \text{tree}_\lambda(y)\}$.

Lemma 30 When $B = \text{branch}\{R_A, \text{tree}_\lambda(y)\}$, the set $\left\{ \phi : g\{R^{(l)}, \text{tree}(B, \nu, \lambda), y'(\phi, \nu_{\text{sib}})\} \geq \lambda \right\}$ has the form $(-\infty, a_l) \cup (b_l, \infty)$, where $a_l \leq 0 \leq b_l$. Therefore, we can compute $\bigcap_{l=0}^{L-1} \left\{ \phi : g\{R^{(l)}, \text{tree}(B, \nu, \lambda), y'(\phi, \nu_{\text{sib}})\} \geq \lambda \right\}$ as
$$\bigcap_{l=0}^{L-1} \left[ (-\infty, a_l) \cup (b_l, \infty) \right] = \left( -\infty,\ \min_{0 \leq l \leq L-1} a_l \right) \cup \left( \max_{0 \leq l \leq L-1} b_l,\ \infty \right)$$
in $O(L)$ operations. Furthermore, we can compute (50) in constant time.

Proof As $R(B) \subseteq \text{tree}_\lambda(y)$, we can take the point $\bar{\phi} \in S_{\text{grow}}(B, \nu)$ in Lemma 29 to be $\nu^T y$, so that $y'(\bar{\phi}, \nu) = y$. We can then apply Lemma 22 to note that
$$\sum_{l'=l}^{L-1} \text{gain}_{R^{(l')}}\{y'(\phi, \nu_{\text{sib}}), j_{l'+1}, s_{l'+1}\} = \sum_{l'=l}^{L-2} \text{gain}_{R^{(l')}}(y, j_{l'+1}, s_{l'+1}) + \text{gain}_{R^{(L-1)}}\{y'(\phi, \nu_{\text{sib}}), j_L, s_L\}.$$
Thus, when $\nu = \nu_{\text{sib}}$ and $B = \text{branch}\{R_A, \text{tree}_\lambda(y)\}$, we can rewrite (49) with all of the terms corresponding to $\text{gain}_{R^{(l')}}\{y'(\phi, \nu_{\text{sib}}), j_{l'+1}, s_{l'+1}\}$ for $l' = l, \ldots, L-2$ moved into the constant on the right-hand side. This lets us rewrite (49) as $\left\{ \phi : \text{gain}_{R^{(L-1)}}\{y'(\phi, \nu_{\text{sib}}), j_L, s_L\} \geq \tilde{\gamma}_l \right\}$, where $\tilde{\gamma}_l$ is an updated constant that does not depend on $\phi$.

To prove that $\left\{ \phi : \text{gain}_{R^{(L-1)}}\{y'(\phi, \nu_{\text{sib}}), j_L, s_L\} \geq \tilde{\gamma}_l \right\}$ has the form $(-\infty, a_l) \cup (b_l, \infty)$ for $a_l \leq 0 \leq b_l$, first recall from Lemma 23 that $\text{gain}_{R^{(L-1)}}\{y'(\phi, \nu_{\text{sib}}), j_L, s_L\}$ is a quadratic function of $\phi$. It then suffices to show that this quadratic has a non-negative second derivative and achieves its minimum at $\phi = 0$. The second derivative of this quadratic is proportional to $a\{R^{(L-1)}, j_L, s_L\} = \|\nu_{\text{sib}}\|_2^{-4}\, \nu_{\text{sib}}^T M_{R^{(L-1)},j_L,s_L}\, \nu_{\text{sib}}$, which is non-negative by Lemma 23. From Lemma 23, $\text{gain}_{R^{(L-1)}}\{y'(\phi, \nu_{\text{sib}}), j_L, s_L\}$ is non-negative, and it equals 0 when $\phi = 0$, because when $\phi = 0$ we have $\bar{y}'_{R^{(L-1)} \cap \chi_{j_L,s_L,1}} = \bar{y}'_{R^{(L-1)} \cap \chi_{j_L,s_L,0}}$.

Intersecting $L$ sets of the form $(-\infty, a_l) \cup (b_l, \infty)$ with $a_l \leq 0 \leq b_l$ takes only $O(L)$ operations, because we simply need to identify the minimum $a_l$ and the maximum $b_l$. Finally, Lemma 26 ensures that $S_{\text{grow}}(B, \nu_{\text{sib}})$ consists of at most two disjoint intervals, and so the final intersection with $S_{\text{grow}}(B, \nu_{\text{sib}})$ takes only $O(1)$ operations.

D.5 Proof of Proposition 18

Let $B = \text{branch}\{R_A, \text{tree}_\lambda(y)\}$, let $R(B) = \{R^{(0)}, \ldots, R^{(L)}\}$, and let $\nu = \nu_{\text{sib}}$ from (6). Applying the expression given for $\{y'(\phi, \nu_{\text{sib}})\}_i$ in Section 3.3 (which follows from algebra), we immediately see that Condition 1 holds with $R_A = R^{(L)}$ and $R_B = R^{(L-1)} \cap \chi_{j_L, s_L, 1-e_L}$.

Let $B = \pi\left[ \text{branch}\{R_A, \text{tree}_\lambda(y)\} \right]$ and let $\nu = \nu_{\text{reg}}$ from (14). Note that $\pi\left[ \text{branch}\{R_A, \text{tree}_\lambda(y)\} \right]$ induces the same region $R^{(L)}$ as the unpermuted $\text{branch}\{R_A, \text{tree}_\lambda(y)\}$: regardless of the permutation, the induced $R^{(L)}$ is equal to $R_A$. Applying the expression for $\{y'(\phi, \nu_{\text{reg}})\}_i$ given in Section 3.3, Condition 1 holds with constant $c_2 = 0$.

Appendix E. Proofs for Section 4.3

E.1 Proof of Proposition 19

As stated in Proposition 19, let
$$p^{Q}_{\text{reg}}(y) = \text{pr}_{H_0}\left( |\nu_{\text{reg}}^T Y - c| \geq |\nu_{\text{reg}}^T y - c| \;\middle|\; \bigcup_{\pi \in Q} \left\{ R\left( \pi\left[ \text{branch}\{R_A, \text{tree}_\lambda(y)\} \right] \right) \subseteq \text{tree}_\lambda(Y) \right\},\ P_{\nu_{\text{reg}}}^{\perp} Y = P_{\nu_{\text{reg}}}^{\perp} y \right).$$
First, we will show that the test based on $p^{Q}_{\text{reg}}(y)$ controls the selective Type 1 error rate, defined in (4); this is a special case of Proposition 3 from Fithian et al. (2014). Define $E_1 = \{Y : R_A \in \text{tree}_\lambda(Y)\}$, $E_2 = \{Y : P_{\nu_{\text{reg}}}^{\perp} Y = P_{\nu_{\text{reg}}}^{\perp} y\}$, and $E_3 = \bigcup_{\pi \in Q} \left\{ Y : R\left( \pi\left[ \text{branch}\{R_A, \text{tree}_\lambda(y)\} \right] \right) \subseteq \text{tree}_\lambda(Y) \right\}$. Recall that the test of $H_0 : \nu_{\text{reg}}^T \mu = c$ based on $p^{Q}_{\text{reg}}(y)$ controls the selective Type 1 error rate if, for all $\alpha \in [0, 1]$, $\text{pr}_{H_0}\{p^{Q}_{\text{reg}}(Y) \leq \alpha \mid E_1\} \leq \alpha$. By construction,
$$\text{pr}_{H_0}\left\{ p^{Q}_{\text{reg}}(Y) \leq \alpha \mid E_2 \cap E_3 \right\} = E\left[ 1\left\{ \text{pr}_{H_0}\left( |\nu_{\text{reg}}^T Y - c| \geq |\nu_{\text{reg}}^T y - c| \mid E_2 \cap E_3 \right) \leq \alpha \right\} \,\middle|\, E_2 \cap E_3 \right] = \alpha.$$
Let $\psi^{Q}_{R_A} = 1\left\{ \text{pr}_{H_0}\left( |\nu_{\text{reg}}^T Y - c| \geq |\nu_{\text{reg}}^T y - c| \mid E_2 \cap E_3 \right) \leq \alpha \right\}$. An argument similar to that of Lemma 13 indicates that $E_3 \subseteq E_1$. The law of total expectation then yields
$$E(\psi^{Q}_{R_A} \mid E_1) = E\left\{ E(\psi^{Q}_{R_A} \mid E_1 \cap E_2 \cap E_3) \mid E_1 \right\} = E\left\{ E(\psi^{Q}_{R_A} \mid E_2 \cap E_3) \mid E_1 \right\} = E(\alpha \mid E_1) = \alpha.$$
Thus, the test based on $p^{Q}_{\text{reg}}(y)$ controls the selective Type 1 error rate. We omit the proof that $p^{Q}_{\text{reg}}(y)$ can be computed as
$$p^{Q}_{\text{reg}}(y) = \text{pr}_{H_0}\left( |\phi - c| \geq |\nu_{\text{reg}}^T y - c| \;\middle|\; \phi \in \bigcup_{\pi \in Q} S^{\lambda}\left( \pi\left[ \text{branch}\{R_A, \text{tree}_\lambda(y)\} \right],\ \nu_{\text{reg}} \right) \right)$$
for $\phi \sim N(c, \|\nu_{\text{reg}}\|_2^2 \sigma^2)$, as the proof is similar to the proof of Theorem 6.

Appendix F. Effect of Using the Computationally Efficient Alternative in Section 4.3

In this section, we investigate the effect on power of using $p^{I}_{\text{reg}}(y)$ from (22) rather than $p_{\text{reg}}(y)$ from (15). We also investigate the effect of using (23) rather than (17) on the width of confidence intervals for $\nu_{\text{reg}}^T \mu$. We generate data as described in Section 5.1, but for simplicity we restrict our attention to the case where $a = 1$ (corresponding to the center panel of Figure 3).
For each tree that we build, we consider (a) testing $H_0 : \nu_{\text{reg}}^T \mu = 0$ and (b) constructing a confidence interval for $\nu_{\text{reg}}^T \mu$, for each region appearing at the third level of the tree. For each test, we compare the test that uses the full conditioning set (i.e., that uses $p_{\text{reg}}(y)$ from (15)) to the test that uses the identity permutation only (i.e., that uses $p^{I}_{\text{reg}}(y)$ from (22)). For each interval, we compare the method that uses the full conditioning set (i.e., (17)) to the method that uses the identity permutation only (i.e., (23)).

The results are displayed in Figure 11. The left panel of Figure 11 shows that the power loss resulting from using (22) instead of (15) is negligible. In fact, we need to zoom in on the left panel, as shown in the center panel, to see any separation between the power curves. We see in the center panel that power is lower when (22) is used, though we emphasize that the differences in power are extremely small. We see in the right panel of Figure 11 that the computationally efficient conditioning set has a more noticeable impact on the median width of our confidence intervals. As expected, the confidence intervals are narrower when we use the full conditioning set (i.e., (17)). However, this difference is most noticeable when $b$ is small, in which case the confidence intervals are wide even when the full conditioning set is used. Thus, the precision lost by constructing confidence intervals using the identity permutation only (i.e., (23)) is not of practical importance. Based on these results, we recommend using the identity permutation in practice: it is the most computationally efficient choice, and it does not meaningfully reduce power or precision compared to the full conditioning set.

Figure 11: Simulation results comparing inference based on the full conditioning set to inference based on the identity permutation only (see Section 4.3). The left panel shows power curves. The center panel zooms in on one section of the left panel. The right panel shows median widths of confidence intervals.

Appendix G. Robustness to Non-Normality

In this section, we explore the performance of the selective Z-test under a global null when the normality assumption on $Y$ is violated. For four choices of cumulative distribution function $F$, we generate $Y_i \overset{\text{i.i.d.}}{\sim} F$ such that all observations have the same expected value. We set $n = 200$ and $p = 10$, and we grow trees to a maximum depth of 3. We plug in $(n-1)^{-1} \sum_{i=1}^{n} (y_i - \bar{y})^2$ as an estimate of $\sigma^2$, as it does not make sense to assume known variance for distributions with a mean-variance relationship.

Figure 12 displays quantile-quantile plots of the p-values for testing $H_0 : \nu_{\text{sib}}^T \mu = 0$, using the test in (8). Figure 12 shows that, despite the fact that our proposed selective inference framework was derived under a normality assumption, it yields approximately uniformly distributed p-values for Poisson(10), Bernoulli(0.5), and Gamma(1, 10) data. We suspect that this is because $P_{\nu}^{\perp} Y$ is approximately independent of $\nu^T Y$ for these distributions (see the proof of Theorem 6 in Appendix B); a crude numerical probe of this heuristic appears below. In the case of Bernoulli(0.1) data, which represents a particularly extreme violation of the normality assumption, the p-values from our selective Z-test are not uniformly distributed. As mentioned in Section 7, future work could involve characterizing conditions on $F$ under which our selective Z-tests will approximately control the selective Type 1 error.

Figure 12: Quantile-quantile plots of the p-values for testing $H_0 : \nu_{\text{sib}}^T \mu = 0$ under a global null. A naive Z-test (green), sample splitting (blue), and selective Z-test (pink) were performed; see Section 5.2.
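The heuristic that $P_{\nu}^{\perp} Y$ is approximately independent of $\nu^T Y$ can be probed crudely by simulation. The sketch below is ours, not part of the paper's experiments: it correlates $(\nu^T Y)^2$ with a nonlinear summary of the orthogonal part, a quantity that tends to zero under exact independence (the normal case). The printed magnitudes are Monte Carlo estimates only, not guarantees about any particular distribution.

```python
import numpy as np

rng = np.random.default_rng(11)
n, reps = 200, 4000
nu = np.zeros(n)
nu[:10], nu[10:20] = 0.1, -0.1          # a sibling-style contrast vector

def probe(draw):
    """Correlate (nu^T Y)^2 with ||P_perp Y||^2 across Monte Carlo draws.
    Exact independence drives this toward zero as reps grows; a clearly
    nonzero value signals dependence between the two quantities."""
    phi2 = np.empty(reps)
    orth = np.empty(reps)
    for r in range(reps):
        y = draw()
        t = nu @ y
        phi2[r] = t * t
        resid = y - t * nu / (nu @ nu)   # P_perp y
        orth[r] = resid @ resid
    return np.corrcoef(phi2, orth)[0, 1]

print(probe(lambda: rng.normal(10.0, 1.0, n)))               # exact independence
print(probe(lambda: rng.poisson(10.0, n).astype(float)))     # Poisson(10)
print(probe(lambda: rng.binomial(1, 0.1, n).astype(float)))  # Bernoulli(0.1)
```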
Appendix H. Alternate Box Lunch Study Analysis

Figure 13 is the same as the left panel of Figure 9, but the selective Z-inference is carried out with $\hat{\sigma}_{\text{cons}}$, from Section 5.7, rather than $\hat{\sigma}_{\text{SSE}}$. The takeaways presented in Section 6 do not change.

Figure 13: A CART tree fit to the Box Lunch Study data. Each split has been labeled with a p-value (8), and each region has been labeled with a confidence interval (23). Inference is carried out by plugging in $\hat{\sigma}_{\text{cons}}$, from Section 5.7, as an estimate of $\sigma$.

References

Susan Athey and Guido Imbens. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360, 2016.

PK Bhattacharya. Some aspects of change-point analysis. Lecture Notes-Monograph Series, pages 28–56, 1994.

Richard Bourgon. Overview of the intervals package, 2009. R vignette, URL https://cran.r-project.org/web/packages/intervals/vignettes/intervals_overview.pdf.

Leo Breiman, Jerome Friedman, Charles J Stone, and Richard A Olshen. Classification and Regression Trees. CRC Press, 1984.

Shuxiao Chen and Jacob Bien. Valid inference corrected for outlier removal. Journal of Computational and Graphical Statistics, 29(2):323–334, 2020.

Yiqun T Chen and Daniela M Witten. Selective inference for k-means clustering. arXiv preprint arXiv:2203.15267, 2022.

William Fithian, Dennis Sun, and Jonathan Taylor. Optimal inference after model selection. arXiv preprint arXiv:1410.2597, 2014.

Lucy L Gao, Jacob Bien, and Daniela Witten. Selective inference for hierarchical clustering. arXiv preprint arXiv:2012.02936, 2020.

Torsten Hothorn and Achim Zeileis. partykit: A modular toolkit for recursive partytioning in R. The Journal of Machine Learning Research, 16(1):3905–3909, 2015.

Torsten Hothorn, Kurt Hornik, and Achim Zeileis. Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3):651–674, 2006.

Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, 1985.

Sangwon Hyun, Kevin Z Lin, Max G'Sell, and Ryan J Tibshirani. Post-selection inference for changepoint detection algorithms with application to copy number variation data. Biometrics, pages 1–13, 2021.

Sean Jewell, Paul Fearnhead, and Daniela Witten. Testing for a change in mean after changepoint detection. Journal of the Royal Statistical Society, Series B, 2022.

Danijel Kivaranovic and Hannes Leeb. On the length of post-model-selection confidence intervals conditional on polyhedral constraints. Journal of the American Statistical Association, 116(534):845–857, 2021.

Jason D Lee, Dennis L Sun, Yuekai Sun, and Jonathan E Taylor. Exact post-selection inference, with application to the lasso. The Annals of Statistics, 44(3):907–927, 2016.

Keli Liu, Jelena Markovic, and Robert Tibshirani. More powerful post-selection inference, with application to the lasso. arXiv preprint arXiv:1801.09037, 2018.

Wei-Yin Loh. Fifty years of classification and regression trees. International Statistical Review, 82(3):329–348, 2014.
Wei-Yin Loh, Haoda Fu, Michael Man, Victoria Champion, and Menggang Yu. Identification of subgroups with differential treatment effects for longitudinal and multiresponse variables. Statistics in Medicine, 35(26):4837–4855, 2016.

Wei-Yin Loh, Michael Man, and Shuaicheng Wang. Subgroups from regression trees with adjustment for prognostic effects and postselection inference. Statistics in Medicine, 38(4):545–557, 2019.

Brian D Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, 1996.

Mark Robert Segal. Regression trees for censored data. Biometrics, 44(1):35–47, 1988.

Jonathan Taylor and Robert J Tibshirani. Statistical learning and selective inference. Proceedings of the National Academy of Sciences, 112(25):7629–7634, 2015.

Terry Therneau and Beth Atkinson. rpart: Recursive Partitioning and Regression Trees, 2019. R package version 4.1-15, available on CRAN.

Xiaoying Tian and Jonathan Taylor. Asymptotics of selective inference. Scandinavian Journal of Statistics, 44(2):480–499, 2017.

Xiaoying Tian and Jonathan Taylor. Selective inference with a randomized response. The Annals of Statistics, 46(2):679–710, 2018.

Ryan J Tibshirani, Jonathan Taylor, Richard Lockhart, and Robert Tibshirani. Exact post-selection inference for sequential regression procedures. Journal of the American Statistical Association, 111(514):600–620, 2016.

Ryan J Tibshirani, Alessandro Rinaldo, Rob Tibshirani, and Larry Wasserman. Uniform asymptotic inference and the bootstrap after model selection. The Annals of Statistics, 46(3):1255–1287, 2018.

Ashwini Venkatasubramaniam and Julian Wolfson. visTree: Visualization of Subgroups for Decision Trees, 2018. R package version 0.8.1, available on CRAN.

Ashwini Venkatasubramaniam, Julian Wolfson, Nathan Mitchell, Timothy Barnes, Meghan JaKa, and Simone French. Decision trees in epidemiological research. Emerging Themes in Epidemiology, 14(1):11, 2017.

Stefan Wager and Guenther Walther. Adaptive concentration of regression trees, with application to random forests. arXiv preprint arXiv:1503.06388, 2015.