On Semi-parametric Inference for BART

Veronika Ročková (Booth School of Business, University of Chicago)

Abstract

There has been a growing realization of the potential of Bayesian machine learning as a platform that can provide flexible modeling, accurate predictions, and coherent uncertainty statements. In particular, Bayesian Additive Regression Trees (BART) have emerged as one of today's most effective general approaches to predictive modeling under minimal assumptions. Statistical theoretical developments for machine learning have been mostly concerned with approximability or rates of estimation when recovering infinite-dimensional objects (curves or densities). Despite the impressive array of available theoretical results, the literature has been largely silent about uncertainty quantification. In this work, we continue the theoretical investigation of BART initiated recently by Ročková and van der Pas (2017). We focus on statistical inference questions. In particular, we study the Bernstein-von Mises (BvM) phenomenon (i.e. asymptotic normality) for smooth linear functionals of the regression surface within the framework of non-parametric regression with fixed covariates. Our semi-parametric BvM results show that, beyond rate-optimal estimation, BART can also be used for valid statistical inference.

1. Introduction

With visible successes on a wide range of predictive tasks, the role of machine learning has become increasingly recognized across a wide array of application domains ranging from economics to electronic commerce. Bayesian approaches have been particularly appealing as they provide a structured approach to uncertainty assessment via hierarchical modeling. Uncertainty quantification for inference (hypothesis testing and confidence statements) is a fundamental goal of statistics that goes beyond mere prediction. This note studies the inferential potential of Bayesian Additive Regression Trees (BART) of Chipman et al. (2010), one of the workhorses of Bayesian machine learning (Hahn et al., 2017; Hill, 2011; He et al., 2019; Bleich et al., 2014; Linero, 2016; Linero and Yang, 2017). In particular, we study the Gaussian approximability of certain aspects of posterior distributions in non-parametric regression with tree/forest priors. Results of this type, initially due to Laplace (1810) but most commonly known as Bernstein-von Mises (BvM) theorems, imply that posterior-based inference asymptotically coincides with the one based on traditional frequentist 1/√n-consistent estimators. In this vein, BvM theorems provide a rigorous frequentist justification of Bayesian inference. The main thrust of this work is to understand the extent to which this phenomenon holds for various incarnations of BART. In simple words, the BvM phenomenon occurs when, as the number of observations increases, the posterior distribution has approximately the shape of a Gaussian distribution centered at an efficient estimator of the parameter of interest. Moreover, the posterior credible sets, i.e. regions with prescribed posterior probability, are then also confidence regions with the same asymptotic coverage.
This property has important practical implications: constructing confidence regions from the variation in MCMC draws is relatively straightforward compared to direct constructions. Under fairly mild assumptions, BvM statements can be expected to hold in regular finite-dimensional models (van der Vaart, 2000). Unfortunately, the frequentist theory on asymptotic normality does not generalize fully to semi- and non-parametric estimation problems (Bickel and Kleijn, 2012). Freedman initiated the discussion on the consequences of unwisely chosen priors in the 1960s, providing a negative BvM result in a basic ℓ2-sequence Gaussian conjugate setting. The warnings against seemingly innocuous priors that may lead to posterior inconsistency were then reiterated many times in the literature, including Cox (1993) and Diaconis and Freedman (1986; 1998). Other counterexamples and anomalies of the BvM behavior in infinite-dimensional problems can be found in Johnstone (2010) and Leahu (2011). While, as pointed out by Bickel and Kleijn (1999), analogues of the BvM property for infinite-dimensional parameters are not immediately obvious, rigorous notions of non-parametric BvMs have been introduced in several pioneering works (Leahu, 2011; Ghosal, 2000; Castillo and Nickl, 2013; 2014). Unwisely chosen priors leave room for unintended consequences also in semi-parametric contexts (Rivoirard and Rousseau, 2012). Castillo (2012a) provided an interesting counterexample where the posterior does not display the BvM behavior due to a non-vanishing bias in the centering of the posterior distribution. Various researchers have nevertheless documented instances of the BvM limit in semi-parametric models (a) when the parameter can be separated into a finite-dimensional parameter of interest and an infinite-dimensional nuisance parameter (Castillo, 2012b; Shen, 2002; Bickel and Kleijn, 2012; Cheng and Kosorok, 2008; Johnstone, 2010; Jonge and van Zanten, 2013), and (b) when the parameter of interest is a functional of the infinite-dimensional parameter (Rivoirard and Rousseau, 2012; Castillo and Rousseau, 2015). In this work, we focus on the latter class of semi-parametric BvMs and study the asymptotic behavior of smooth linear functionals of the regression function.

We consider the standard non-parametric regression setup, where a vector of responses Y^(n) = (Y_1, ..., Y_n)' is linked to fixed (rescaled) predictors x_i = (x_{i1}, ..., x_{ip})' ∈ [0,1]^p for each 1 ≤ i ≤ n through

Y_i = f_0(x_i) + ε_i,   ε_i ~iid N(0, 1),    (1)

where f_0 is an unknown α-Hölder continuous function on the unit cube [0,1]^p. The true generative model giving rise to (1) will be denoted by P_0^n. Each model is parametrized by f ∈ F, where F is an infinite-dimensional set of possibilities for f_0. Let Ψ : F → ℝ be a measurable functional of interest and let Π be a probability distribution on F. Given observations Y^(n) from (1), we study the asymptotic behavior of the posterior distribution of Ψ(f), denoted by Π[Ψ(f) | Y^(n)]. Let N(0, V) denote the centered normal law with covariance matrix V. In simple words, we want to show that under the Bayesian CART or BART priors on F, the posterior distribution satisfies the BvM-type property in the sense that

Π[√n (Ψ(f) − Ψ̂) | Y^(n)] ⇝ N(0, V)    (2)

as n → ∞ in P_0^n-probability, where Ψ̂ is a random centering point estimator and where ⇝ stands for weak convergence.
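To preview what (2) buys in practice, here is a minimal Python sketch that simulates model (1) and builds a credible interval for a linear functional of the regression surface from posterior draws. The "posterior" below is a conjugate fixed-K histogram stand-in for an actual Bayesian CART/BART sampler, and all constants (n, K, the choice of f_0) are illustrative assumptions, not quantities fixed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Model (1): fixed design on [0,1], unit-variance Gaussian noise.
n = 500
x = (np.arange(n) + 0.5) / n
f0 = np.sin(2 * np.pi * x)                 # an illustrative Hoelder-smooth truth
y = f0 + rng.standard_normal(n)

# Stand-in posterior draws of f at the design points. In practice these come
# from a Bayesian CART/BART sampler; here we use the conjugate posterior of a
# fixed K-cell histogram (flat prior on cell means), purely to show mechanics.
K, S = 16, 2000
cells = np.minimum((x * K).astype(int), K - 1)
counts = np.bincount(cells, minlength=K)
ybar = np.bincount(cells, weights=y, minlength=K) / counts
cell_draws = ybar + rng.standard_normal((S, K)) / np.sqrt(counts)

# Functional Psi(f) = (1/n) sum_i a(x_i) f(x_i) with a = 1 (average surface).
psi_draws = cell_draws[:, cells].mean(axis=1)

# A 95% credible interval from MCMC-style variation; under a BvM theorem this
# interval is also an asymptotic 95% confidence interval for Psi(f0).
lo, hi = np.quantile(psi_draws, [0.025, 0.975])
print(f"Psi(f0) = {f0.mean():.3f}, credible interval = [{lo:.3f}, {hi:.3f}]")
```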
We make this statement more precise in Section 2. Castillo and Rousseau (2015) provide general conditions on the model and on the functional Ψ to guarantee that (2) holds. These conditions describe how the choice of the prior influences the bias of Ψ̂ and the variance V. Building on this contribution, we (a) provide tailored statements for various incarnations of tree/forest priors that have occurred in the literature, and (b) extend the considerations to adaptive priors under self-similarity.

1.1. Notation

The notation ≲ will be used to denote inequality up to a constant; a ≍ b denotes a ≲ b and b ≲ a; and a ∨ b denotes max{a, b}. The ε-covering number of a set Ω for a semimetric d, denoted by N(ε, Ω, d), is the minimal number of d-balls of radius ε needed to cover the set Ω. We denote by φ(·; σ²) the normal density with zero mean and variance σ². With ‖·‖₂ we denote the standard Euclidean norm, and with λ_min(Σ) and λ_max(Σ) the extremal eigenvalues of a matrix Σ. The set of α-Hölder continuous functions (i.e. Hölder smooth with 0 < α ≤ 1) on [0,1]^p will be denoted by H^α_p.

2. Rudiments

Before delving into our development, we first make precise the definition of asymptotic normality.

Definition 2.1. We say that the posterior distribution for the functional Ψ(f) is asymptotically normal with centering Ψ_n and variance V if, for β the bounded Lipschitz metric for weak convergence and the real-valued mapping τ_n : f ↦ √n [Ψ(f) − Ψ_n], it holds that

β(Π[· | Y^(n)] ∘ τ_n^{-1}, N(0, V)) → 0    (3)

in P_0^n-probability as n → ∞, which we denote as Π[· | Y^(n)] ∘ τ_n^{-1} ⇝ N(0, V).

When efficiency theory at the rate √n is available, we say that the posterior distribution satisfies the BvM theorem if (3) holds with Ψ_n = Ψ̂ + o_P(1/√n) for Ψ̂ a linear efficient estimator of Ψ(f_0). The proof of such a statement typically requires a few steps: (a) a semi-local step, where one proves that the posterior distribution concentrates on certain sets, and (b) a strictly local study on these sets, which can be carried out under the LAN (local asymptotic normality) conditions. Denoting the log-likelihood by ℓ_n(f) = −(1/2) Σ_{i=1}^n [Y_i − f(x_i)]², we define the log-likelihood ratio ℓ_n(f) − ℓ_n(f_0) and express it as a sum of a quadratic term and a stochastic term via the LAN expansion (a short derivation is given at the end of this section):

ℓ_n(f) − ℓ_n(f_0) = −(n/2) ‖f − f_0‖²_L + √n W_n(f − f_0),    (4)

where

‖f − f_0‖²_L = (1/n) Σ_{i=1}^n [f_0(x_i) − f(x_i)]²,
W_n(f − f_0) = √n ⟨f − f_0, ε⟩_L = (1/√n) Σ_{i=1}^n ε_i [f(x_i) − f_0(x_i)],

with ⟨f, g⟩_L = (1/n) Σ_{i=1}^n f(x_i) g(x_i). Recall that the phrase semi-parametric here refers to the problem of estimating functionals in an infinite-dimensional model rather than Euclidean parameters in the presence of infinite-dimensional nuisance parameters. In this paper, we consider the smooth linear functional

Ψ(f) = (1/n) Σ_{i=1}^n a(x_i) f(x_i)    (5)

for some smooth, uniformly bounded, γ-Hölder continuous function a(·), i.e. ‖a‖_∞ < C_a and a ∈ H^γ_p for some 0 < γ ≤ 1. Were γ > 1, Hölder continuity would imply that a(·) is a constant function and (5) would boil down to a constant multiple of the average regression surface evaluated at the fixed design points. This quantity is of independent interest and has been studied in a different setup by Ray and van der Vaart (2018), who focus on the posterior distribution of the half average treatment effect estimator (the mean regression surface) in the presence of missing data and random covariates. Our results can be extended to this scenario.
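For completeness, here is the algebra behind the exact LAN expansion (4) in the fixed-design Gaussian model (1): substituting Y_i = f_0(x_i) + ε_i into ℓ_n(f) = −(1/2) Σ_i [Y_i − f(x_i)]² gives

```latex
\ell_n(f) - \ell_n(f_0)
  = -\frac{1}{2}\sum_{i=1}^{n}\Big\{\big[\varepsilon_i - (f-f_0)(x_i)\big]^2 - \varepsilon_i^2\Big\}
  = -\frac{n}{2}\,\|f-f_0\|_L^2 + \sqrt{n}\,W_n(f-f_0),
\qquad
W_n(f-f_0) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\varepsilon_i\,(f-f_0)(x_i),
```

so the expansion holds with no remainder term in this setting.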
3. Tree and Forest Priors

Regression trees provide a piecewise constant reconstruction of the regression surface f_0, where the pieces correspond to terminal nodes of recursive partitioning (Donoho, 1997). Before introducing the tree function classes, we need to define the notion of tree-shaped partitions. Starting from a parent node [0,1]^p, a binary tree partition is obtained by successively applying a splitting rule on a chosen internal node. Each such internal node is divided into two daughter cells with a split along one of the p coordinates at a chosen observed value. These daughter cells define two new rectangular subregions of [0,1]^p, which can be split further (to become internal nodes) or end up being terminal nodes. The terminal cells after K − 1 splits then yield a tree-shaped partition T = {Ω_k}_{k=1}^K consisting of boxes Ω_k ⊂ [0,1]^p. We denote by V_K the set of all tree-shaped partitions that can be obtained by recursively splitting K − 1 times, with each split made at one of the observed values X ≡ {x_i}_{i=1}^n. The tree functions can then be written as

f_{T,β}(x) = Σ_{k=1}^K I(x ∈ Ω_k) β_k,    (6)

where T = {Ω_k}_{k=1}^K ∈ V_K and where β = (β_1, ..., β_K)' ∈ ℝ^K is a vector of jump sizes.

Solitary trees are not as good performers as ensembles of trees/forests (Breiman, 2001; Chipman et al., 2010). The forest mapping underpinning the BART method of Chipman et al. (2010) is the following sum-of-trees model, indexed by a collection of T tree-shaped partitions E = [T^1, ..., T^T] and heights B = [β^1, ..., β^T]:

f_{E,B}(x) = Σ_{t=1}^T f_{T^t, β^t}(x)  for T^t ∈ V_{K_t} and β^t ∈ ℝ^{K_t}.    (7)

The prior distribution is assigned over the class of forests F = {f_{E,B}(x) as in (7) for some E and B and T ∈ ℕ} in a hierarchical manner. One first assigns a prior distribution over T (or sets it equal to some value) and then a prior over the tree-shaped partitions T^t as well as the heights β^t for 1 ≤ t ≤ T.
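As a concrete illustration of (6) and (7), the following minimal Python sketch (ours, not from the paper) represents a tree-shaped partition as a list of axis-aligned boxes and evaluates the corresponding tree and forest mappings; the box representation and the toy splits are illustrative choices.

```python
import numpy as np

# A tree-shaped partition of [0,1]^p represented as a list of axis-aligned
# boxes, each box a (p, 2) array of [low, high) limits; beta holds jump sizes.
def tree_function(x, boxes, beta):
    """Evaluate f_{T,beta}(x) = sum_k I(x in Omega_k) * beta_k as in (6)."""
    for box, b in zip(boxes, beta):
        if np.all((x >= box[:, 0]) & (x < box[:, 1])):
            return b
    return 0.0  # x outside all cells (cannot happen for a full partition)

def forest_function(x, ensemble):
    """Evaluate the sum-of-trees mapping (7); ensemble = [(boxes, beta), ...]."""
    return sum(tree_function(x, boxes, beta) for boxes, beta in ensemble)

# Toy example in p = 1: one tree splitting [0,1) at 0.4, another at 0.7.
t1 = ([np.array([[0.0, 0.4]]), np.array([[0.4, 1.0]])], [1.0, -0.5])
t2 = ([np.array([[0.0, 0.7]]), np.array([[0.7, 1.0]])], [0.2, 0.8])
print(forest_function(np.array([0.5]), [t1, t2]))  # -0.5 + 0.2 = -0.3
```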
3.1. Tree Partition Priors π(T)

In 1998, two Bayesian CART developments surfaced independently of each other: Chipman et al. (1997) and Denison et al. (1998). Albeit related, the two methods differ in terms of the proposed tree partition prior π(T). The prior of Denison et al. (1998) is egalitarian in the sense that trees with the same number of leaves are a-priori equally likely, regardless of their shape. To prioritize smaller trees (that do not overfit), one assigns a complexity prior π(K) on the tree size (i.e. the number of bottom nodes) K. They considered the (zero-truncated) Poisson distribution

π(K) = λ^K / ((e^λ − 1) K!),  K = 1, 2, ...,  for some λ > 0.    (8)

Given the tree size K, one assigns a uniform prior over tree topologies,

π(T | K) = I(T ∈ V_K) / |V_K|,    (9)

where |V_K| is the cardinality of V_K. This prior can be straightforwardly implemented using Metropolis-Hastings strategies (Denison et al., 1998) and was studied theoretically by Ročková and van der Pas (2017).

The Bayesian CART prior of Chipman et al. (1997), on the other hand, specifies the prior implicitly as a tree-generating Galton-Watson (GW) process. This process provides a mathematical representation of an evolving population of individuals who reproduce and die subject to the laws of chance. The tree-growing process is characterized as follows. Starting with a root node Ω_00 = [0,1]^p, one decides whether to split each node Ω_lk into two children by flipping a coin. We are tacitly using the two-index numbering of nodes (l, k), where l corresponds to the tree layer and k denotes the (k+1)st left-most node at the lth layer. In order to prevent the trees from growing indefinitely, the success probability decays with l, making bushy trees far less likely. Let us denote by γ_lk ∈ {0, 1} the binary splitting indicator for whether or not a node Ω_lk was divided. Chipman et al. (1997) assume

P(γ_lk = 1) = p_lk = α / (1 + l)^δ    (10)

for some α ∈ (0, 1) and δ ≥ 0. A plot of the hierarchically organized split probabilities is in Figure 1.

[Figure 1: A binary tree of prior cut probabilities in Bayesian CART by Chipman et al. (1997).]

Ročková and Saha (2019) propose the modification p_lk = α^l for some 1/n < α < 1, which yields optimal posterior concentration in the ℓ2 sense. If the node Ω_lk splits (i.e. γ_lk = 1), one chooses a splitting rule by first picking a variable j uniformly from the available directions {1, ..., p} and then picking a split point c uniformly from the available data values x_1j, ..., x_nj. Unlike in the homogeneous case (where all γ_lk's are iid), (10) defines a heterogeneous GW process where the offspring distribution is allowed to vary from generation to generation, i.e. the variables γ_lk are independent but non-identical.

3.2. Priors on Jump Sizes π(β | K)

After partitioning the predictor space into K rectangular cells, one needs to assign a prior on the presumed level of the outcome. Throughout this work, we denote by β = (β_1, ..., β_K)' the vector of jump sizes associated with the K partitioning cells. Both Chipman et al. (1997) and Denison et al. (1998) assumed an independent product of Gaussians, π(β) = Π_{k=1}^K φ(β_k; σ²) for some σ² > 0. Chipman et al. (2000) argue, however, that the simple independence prior on the bottom leaf coefficients β_k may not provide enough structure. They claim that values β_k that correspond to nearby cells in the predictor space should be more similar, so that the prior incorporates local smoothness. They suggest a prior on bottom leaves that aggregates priors on the ancestral internal nodes and, in this way, induces correlation among neighboring cells. Motivated by these considerations, here we allow for general correlation structures by assuming a multivariate Gaussian prior

π(β | K) = N_K(0, Σ),    (11)

with λ_min(Σ) > c > 0 and λ_max(Σ) ≲ n, where Σ is some K × K positive definite matrix. We also consider an independent product of Laplace priors (which has not yet been studied in this context),

π(β | K) = Π_{k=1}^K ψ(β_k; λ),    (12)

where ψ(β; λ) = (λ/2) e^{−λ|β|} for some λ > 0.
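The following hypothetical Python sketch draws a partition from the heterogeneous Galton-Watson prior (10) and attaches Gaussian jump sizes in the spirit of (11) with Σ = I; the values of α and δ and the helper names are illustrative, not those of any particular BART implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_gw_partition(X, alpha=0.95, delta=2.0, node=None, level=0):
    """Draw a tree partition from the Galton-Watson prior of Chipman et al.
    (1997): node (l, k) splits with probability p_lk = alpha / (1 + l)^delta,
    with split variable and split point chosen uniformly among observed values.
    Returns the list of terminal cells, each a (p, 2) array of box limits."""
    p = X.shape[1]
    if node is None:
        node = np.column_stack([np.zeros(p), np.ones(p)])
    inside = np.all((X >= node[:, 0]) & (X < node[:, 1]), axis=1)
    if rng.random() < alpha / (1 + level) ** delta and inside.sum() > 1:
        j = rng.integers(p)                       # split direction, uniform
        c = rng.choice(X[inside, j])              # split point at a data value
        left, right = node.copy(), node.copy()
        left[j, 1], right[j, 0] = c, c
        return (sample_gw_partition(X, alpha, delta, left, level + 1)
                + sample_gw_partition(X, alpha, delta, right, level + 1))
    return [node]

X = rng.random((200, 2))
cells = sample_gw_partition(X)
beta = rng.standard_normal(len(cells))   # i.i.d. N(0,1) jumps: (11) with Sigma = I
print(f"sampled a partition with K = {len(cells)} terminal cells")
```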
4. Simple One-dimensional Scenario

To set the stage for our development, we start with a simple toy scenario where (a) p = 1, (b) K is regarded as fixed, and (c) there is only one partition T = {Ω_k}_{k=1}^K, consisting of K equivalent blocks. The equivalent blocks partition T = {Ω_k}_{k=1}^K (Anderson, 1966) comprises K intervals Ω_k with roughly equal numbers of observations in them, i.e. µ(Ω_k) ≡ (1/n) Σ_{i=1}^n I(x_i ∈ Ω_k) ≈ 1/K. For the sake of simplicity, we will also assume that the data points lie on a regular grid, where x_i = i/n for 1 ≤ i ≤ n, in which case the intervals Ω_k also have roughly equal lengths. This setup was studied previously by van der Pas and Ročková (2017) in the study of regression histograms. We relax this assumption in the next section. We denote the class of approximating functions by

F[K] = { f^K_β : [0,1]^p → ℝ;  f^K_β(x) = Σ_{k=1}^K I(x ∈ Ω_k) β_k },    (13)

where we have omitted the subscript T and highlighted the dependence on K. We denote by Π_K(f) the prior distribution over F[K], obtained by assigning the prior (11) or (12). To further simplify the notation, we will drop the subscript β and write f^K when referring to the elements of F[K].

The aim is to study the posterior distribution of Ψ(f^K) and to derive conditions under which

Π[√n (Ψ(f^K) − Ψ_n) | Y^(n)]    (14)

converges to a normal distribution (in P_0^n-probability) with mean zero and variance V_0 = ‖a‖²_L, where Ψ_n is a random centering, distributed according to a Gaussian distribution with mean Ψ(f_0) and variance V_0/n (so that √n(Ψ_n − Ψ(f_0)) ~ N(0, V_0)). Using the fact that convergence of Laplace transforms for all t in probability implies convergence in distribution in probability (Section 1 of Castillo and Rousseau (2015)), this BvM statement holds when, for all t ∈ ℝ,

E^Π[ e^{t √n (Ψ(f^K) − Ψ̂_K)} | Y^(n) ] → e^{(t²/2) ‖a‖²_L}    (15)

in P_0^n-probability as n → ∞, where Ψ̂_K is a linear efficient estimator of Ψ(f_0) such that √n(Ψ̂_K − Ψ_n) = o_P(1). In order to show (15), we first need to introduce some notation. Let a^K be the projection of a onto F[K], i.e.

a^K(x) = Σ_{k=1}^K I(x ∈ Ω_k) a_k^K  with  a_k^K = Σ_{i=1}^n I(x_i ∈ Ω_k) a(x_i) / (n µ(Ω_k)),

where µ(Ω_k) was defined above as µ(Ω_k) = (1/n) Σ_{i=1}^n I(x_i ∈ Ω_k). Similarly, we denote by f_0^K = Σ_{k=1}^K I(x ∈ Ω_k) β_k^K the projection of f_0 onto F[K], with jump sizes β^K = (β_1^K, ..., β_K^K)'. Next, we define

Ψ̂_K = Ψ(f_0^K) + W_n(a^K)/√n  and  Ψ_n = Ψ(f_0) + W_n(a)/√n.

The following theorem characterizes the BvM property when K is sufficiently large and when α is known.

Theorem 4.1. Assume the model (1) with p = 1, where f_0 is endowed with a prior on F[K] in (13) induced by (11). Assume f_0 ∈ H^α_1 and a ∈ H^γ_1 for 1/2 < α ≤ 1 and γ > 1/2. With the choice K = K_n = ⌈(n/log n)^{1/(2α+1)}⌉, we have

Π[ √n (Ψ(f^K) − Ψ̂_K) | Y^(n) ] ⇝ N(0, ‖a‖²_L)

in P_0^n-probability as n → ∞.

Proof: Appendix Section 1.1.

Remark 4.1. Theorem 4.1 applies to the so-called symmetric dyadic trees. In particular, when n = 2^r for some r ∈ ℕ, the equivalent blocks partition with K = 2^L cells can be represented by a symmetric full binary tree which splits all nodes at dyadic rationals up to a resolution L.

Theorem 4.1 is related to Theorem 4.2 of Castillo and Rousseau (2015) for density estimation with non-adaptive histogram priors. The proof here also requires two key ingredients. The first one is the construction of a prior which does not change too much under the change of measure from f^K to f^K_t, where f^K_t is a step function with shifted heights β_k^t = β_k − t a_k^K/√n. This property holds for (correlated) Gaussian step heights and is safely satisfied by other prior distributions with heavier tails.

Remark 4.2. Theorem 4.1 also holds for the Laplace product prior π(β | K) = Π_{k=1}^K ψ(β_k; λ) (as shown in the Supplemental Material). It is interesting to note that under the Laplace prior, one can obviate the need for showing posterior concentration around a projection of f_0 onto F[K], which is needed in the Gaussian case.

The second crucial ingredient (as in Proposition 1 of Castillo and Rousseau (2015)) is the so-called no-bias condition:

√n ⟨a − a^K, f_0 − f_0^K⟩_L = o(1).    (16)

This condition roughly reads as follows: one should be able to simultaneously approximate a(·) and f_0(·) well enough using functions in the inferential class F[K]. Using the Cauchy-Schwarz inequality and Lemma 3.2 of Ročková and van der Pas (2017), this condition will be satisfied when √n K^{−(α+γ)} → 0. Choosing K = K_n = ⌈(n/log n)^{1/(2α+1)}⌉, (16) holds as long as γ > 1/2. Different choices of K_n, however, would imply different restrictions on α and γ. The no-bias condition thus enforces certain limitations on the regularities α and γ. This poses challenges for adaptive priors that only adapt to the regularity of f_0, which may not necessarily be similar to the regularity of a. This phenomenon has been coined the curse of adaptivity (Castillo and Rousseau, 2015).
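The no-bias condition (16) can also be probed numerically. The sketch below (an illustration of ours, not part of the theory) projects illustrative α- and γ-Hölder functions onto the equivalent-blocks class F[K] with K = K_n from Theorem 4.1 and tracks the inner product in (16), which should drift toward zero here since γ > 1/2.

```python
import numpy as np

# Numerical look at the no-bias condition (16): project a and f0 onto the
# equivalent-blocks class F[K] by cell averages and track
# sqrt(n) * <a - a^K, f0 - f0^K>_L as n grows. The smoothness levels and the
# test functions are illustrative assumptions, not fixed by the theory.
alpha, gamma = 0.8, 0.7

for n in [2 ** 10, 2 ** 14, 2 ** 18]:
    x = (np.arange(n) + 0.5) / n
    f0 = np.abs(np.sin(3 * np.pi * x)) ** alpha        # roughly alpha-Hoelder
    a = np.abs(x - 0.3) ** gamma                       # roughly gamma-Hoelder
    K = int((n / np.log(n)) ** (1 / (2 * alpha + 1)))  # K_n of Theorem 4.1
    cells = np.minimum((x * K).astype(int), K - 1)

    def project(g):  # L-projection onto F[K]: replace g by its cell averages
        return (np.bincount(cells, weights=g, minlength=K)
                / np.bincount(cells, minlength=K))[cells]

    bias = np.sqrt(n) * np.mean((a - project(a)) * (f0 - project(f0)))
    print(f"n = {n:7d}, K = {K:3d}, sqrt(n)<a-a^K, f0-f0^K>_L = {bias: .4f}")
```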
5. Overcoming the Curse of Adaptivity

The dependence of K_n on α makes the result in Theorem 4.1 somewhat theoretical. In practice, it is common to estimate the regularity from the data using, e.g., a hierarchical Bayes method, which treats both K and the partition T as unknown with a prior. This fully Bayes model brings us a step closer to the actual Bayesian CART and BART priors. Treating both K and T as random (and assuming a single tree, T = 1), the class of approximating functions now constitutes a union of shells, where each shell ∪_{T ∈ V_K} F[T] is itself a union of the sets F[T] = {f_{T,β} of the form (6) for some β ∈ ℝ^K}. The sets F[T] collect all step functions that grow on the same tree partition T ∈ V_K. As mentioned in Castillo and Rousseau (2015), obtaining BvM in the case of random K is case-dependent. As the prior typically adapts to the regularity of f_0, the no-bias condition (16) may not be satisfied if the regularities of a and f_0 are too different. The adaptive prior can be detrimental in such scenarios, inducing a non-vanishing bias in the centering of the posterior distribution (see Castillo (2012a) or Rivoirard and Rousseau (2012)). Roughly speaking, one needs to make sure that the prior supports large enough values of K and sufficiently regular partitions T so that f_0 and a can both be safely approximated. To ensure this behavior, we enforce a signal-strength assumption through self-similarity, requiring that the function f_0 does not appear smoother than it actually is (Giné and Nickl, 2015). Such qualitative assumptions are natural and necessary for obtaining adaptation properties of confidence bands (Picard and Tribouley, 2000; Bull, 2012; Giné and Nickl, 2010; Nickl and Szabó, 2016; Ray, 2017).

5.1. Self-similarity

Various self-similarity conditions have been considered in various estimation settings, including the supremum-norm loss (Bull, 2012; Giné and Nickl, 2010; 2011) as well as the ℓ2 loss (Nickl and Szabó, 2016; Szabó et al., 2015). In the multi-scale analysis, the term self-similar describes desirable truths f_0 that are not so difficult to estimate, since their regularity looks similar at small and large scales. Giné and Nickl (2010) argue that such self-similarity is a reasonable modeling assumption, and Bull (2012) shows that the set of self-dissimilar Hölder functions (in the ℓ∞ sense) is negligible. Szabó et al. (2015) provided an ℓ2-style self-similarity restriction on a Sobolev parameter space. Nickl and Szabó (2016) weakened this condition, showed that it cannot be improved upon, and showed that the statistical complexity of the estimation problem does not decrease quantitatively under self-similarity in Sobolev spaces. We consider a related notion of ℓ2 self-similar classes within the context of fixed-design regression, as opposed to the asymptotically equivalent white noise model. To this end, let us first formalize the notion of the cell size in terms of the local spread of the data and introduce the partition diameter (Verma et al., 2009; Ročková and van der Pas, 2017).

Definition 5.1. (Diameter) Denote by T = {Ω_k}_{k=1}^K a partition of [0,1]^p and by X = {x_1, ..., x_n} a collection of data points in [0,1]^p. We define the diameter of Ω_k as

diam(Ω_k) = max_{x,y ∈ Ω_k ∩ X} ‖x − y‖₂,

and with

diam(T) = √( Σ_{k=1}^K µ(Ω_k) diam²(Ω_k) )

we define the diameter of the entire partition T, where µ(Ω_k) = (1/n) Σ_{i=1}^n I(x_i ∈ Ω_k) = n(Ω_k)/n.
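Definition 5.1 translates directly into code; the following sketch computes cell and partition diameters for a toy partition (the slab partition at the end is an arbitrary illustrative choice).

```python
import numpy as np

def cell_diameter(points):
    """diam(Omega_k): largest pairwise distance among data points in the cell."""
    if len(points) < 2:
        return 0.0
    diffs = points[:, None, :] - points[None, :, :]
    return np.sqrt((diffs ** 2).sum(-1)).max()

def partition_diameter(X, labels):
    """diam(T) = sqrt( sum_k mu(Omega_k) diam^2(Omega_k) ) per Definition 5.1,
    with labels[i] giving the cell index of data point x_i."""
    n = len(X)
    total = 0.0
    for k in np.unique(labels):
        pts = X[labels == k]
        total += (len(pts) / n) * cell_diameter(pts) ** 2
    return np.sqrt(total)

rng = np.random.default_rng(2)
X = rng.random((300, 2))
labels = (X[:, 0] * 4).astype(int)     # four vertical slabs as a toy partition
print(f"diam(T) = {partition_diameter(X, labels):.3f}")
```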
Here, we do not require that the design points lie strictly on a grid, as long as they are regular according to Definition 3.3 of Ročková and van der Pas (2017). We define regular datasets below. First, we introduce the notion of the k-d tree (Bentley, 1975). Such a partition T̂ is constructed by cycling over the coordinate directions in S = {1, ..., p}, where all nodes at the same level are split along the same axis. For a given direction j ∈ S, each internal node, say Ω̃_k, will be split at a median of the point set (along the jth axis). This split passes ⌈µ(Ω̃_k) n/2⌉ and ⌊µ(Ω̃_k) n/2⌋ observations onto its two children, thereby roughly halving the number of points. After s rounds of splits on each variable, all K terminal nodes have at least ⌊n/K⌋ observations, where K = 2^{s|S|}. We now define regular datasets in terms of the k-d tree partition.

Definition 5.2. Denote by T̂ = {Ω̂_k}_{k=1}^K ∈ V_K the k-d tree partition, where K = 2^{sp}. We say that a dataset X ≡ {x_i}_{i=1}^n is regular if

max_{1 ≤ k ≤ K} diam(Ω̂_k) < M Σ_{k=1}^K µ(Ω̂_k) diam(Ω̂_k)    (17)

for some large enough constant M > 0 and all s ∈ ℕ\{0}.

The definition states that in a regular dataset, the maximal diameter in the k-d tree partition should not be much larger than a typical diameter.

Definition 5.3. We say that the function f_0 ∈ H^α_p is self-similar if there exist constants M > 0 and D > 0 such that

‖f_T^0 − f_0‖²_L ≥ M diam(T)^{2α}    (18)

for all T ∈ V_K such that diam(T) ≤ D, where f_T^0 is the ‖·‖_L-projection of f_0 onto F[T].

We can relate the assumption (18) to the notion of self-similarity in Remark 3.4 of Szabó et al. (2015). To see this connection, assume for now the equivalent blocks partition T from Section 4, whose diameter diam(T) is roughly 1/K when the design points lie on a regular grid. The study of regression histograms with K = 2^L equivalent blocks under a fixed regular design is statistically equivalent to the multi-scale analysis of Haar wavelet coefficients up to resolution L − 1 in the white noise model. The model projected onto the Haar basis can be written as

Y_lk = f_lk^0 + (1/√n) ε_lk  for 0 ≤ l < L and 0 ≤ k < 2^l,

where ε_lk ~iid N(0, 1) and where the f_lk^0 are the wavelet coefficients indexed by the resolution level l and the shift index k. The speed of decay of f_lk^0 determines the statistical properties of f_0, where α-Hölder continuous functions satisfy |f_lk^0| ≲ 2^{−l(α+1/2)}. Assuming the equivalent blocks partition T with K = 2^L blocks, the condition in Remark 3.4 of Szabó et al. (2015) reads as follows: there exists K_0 ∈ ℕ such that for all K ≥ K_0 we have

∫_0^1 |f_T^0(x) − f_0(x)|² dx = Σ_{l ≥ L} Σ_{0 ≤ k < 2^l} (f_lk^0)² ≥ M K^{−2α} ≍ diam(T)^{2α}.

The first equality above stems from the orthogonality of the Haar basis. In this vein, (18) can be regarded as a generalization of this condition to imbalanced partitions and fixed design under the norm ‖·‖_L. To get even more insight into (18) in fixed-design regression, we take a closer look at the approximation gap. We have

‖f_0 − f_T^0‖²_L = Σ_{k=1}^K µ(Ω_k) Var[f_0 | Ω_k],

where

Var[f_0 | Ω_k] = (1/n(Ω_k)) Σ_{x_i ∈ Ω_k} ( f_0(x_i) − (1/n(Ω_k)) Σ_{x_j ∈ Ω_k} f_0(x_j) )²

is the local variability of the function f_0 inside Ω_k. The function f_0 will be self-similar when the variability inside each cell Ω_k is large enough to be detectable, i.e.

inf_{1 ≤ k ≤ K} Var[f_0 | Ω_k] > M diam^{2α}(T)

for all T = {Ω_k}_{k=1}^K ∈ V_K such that diam(T) ≤ D for some D > 0. From the definition of the partition diameter, it turns out that this will be satisfied when Var[f_0 | Ω_k] > diam^{2α}(Ω_k) for all 1 ≤ k ≤ K and T = {Ω_k}_{k=1}^K ∈ V_K. Functions that are nearly constant on long intervals will not satisfy this requirement.
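This detectability heuristic can be visualized numerically. The sketch below uses a truncated Weierstrass-type series (a canonical example of an α-Hölder function with activity at every scale; an illustrative choice of ours, not an example from the paper) and compares the smallest within-cell variability to diam^{2α} ≍ K^{−2α} across equivalent-blocks partitions; the two quantities should stay within a constant factor of each other.

```python
import numpy as np

# Detectability behind Definition 5.3: for a Weierstrass-type alpha-Hoelder
# function, the within-cell variability Var[f0 | Omega_k] keeps pace with
# diam^{2*alpha} across equivalent-blocks partitions of growing resolution.
alpha = 0.6
n = 2 ** 16
x = (np.arange(n) + 0.5) / n
f0 = sum(2 ** (-alpha * l) * np.cos(2 ** l * np.pi * x) for l in range(16))

for K in [8, 32, 128]:
    cells = np.minimum((x * K).astype(int), K - 1)
    cell_var = np.array([f0[cells == k].var() for k in range(K)])
    print(f"K = {K:4d}: min_k Var[f0|cell] = {cell_var.min():.2e}, "
          f"K^(-2*alpha) = {K ** (-2 * alpha):.2e}")
```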
The premise of self-similar functions is that their signal should be detectable with partitions T that undersmooth the truth. In addition, it follows from the proof of Lemma 3.2 of Ročková and van der Pas (2017) that ‖f_T^0 − f_0‖²_L ≲ diam(T)^{2α}. The lower bound in (18) thus matches this upper bound, making the approximation error behave similarly across partitions with similar diameters, essentially identifying the smoothness. Based on these considerations, we introduce the notion of regular partitions that are not too rough, in the sense that their diameters shrink sufficiently fast.

Definition 5.4. For some M > 0 and for some arbitrarily slowly increasing sequence M_n → ∞, we denote

d_n(α) ≡ (M_n/M)^{1/α} n^{−1/(2α+p)} log^{1/(2α)} n.    (19)

A tree partition T ∈ V_K is said to be n-regular for a given n ∈ ℕ if diam(T) ≤ d_n(α). We denote the subset of all n-regular partitions by R_n.

The following lemma states that, when f_0 is self-similar, the posterior concentrates on partitions that are not too complex or irregular.

Lemma 5.1. Assume that f_0 ∈ H^α_p is self-similar, that p ≲ log n, and that the design X ≡ {x_i}_{i=1}^n is regular. Under the Bayesian CART prior ((8) and (9), or (10)) with Gaussian or Laplace step heights ((11) or (12)), we have

Π[ {T ∉ R_n} ∪ {T ∈ V_K : K > K̄_n} | Y^(n) ] → 0

in P_0^n-probability as n, p → ∞, where R_n are all n-regular partitions and K̄_n = ⌊M_2 n ε_n²/log n⌋ ≍ (n/log n)^{p/(2α+p)} for some M_2 > 0, with ε_n the near-minimax posterior concentration rate.

Proof: Appendix Section 1.5.

5.2. Adaptive BvM for Smooth Functionals when p = 1

It is known that signal-strength conditions enforced through self-similarity allow for the construction of honest adaptive credible balls (Giné and Nickl, 2010). Our notion of self-similarity is sufficient for obtaining the adaptive semi-parametric BvM phenomenon for smooth linear functionals. Denote by R(K̄_n) = {T ∈ R_n ∩ V_K for some K ≤ K̄_n}. Lemma 5.1 shows that the posterior concentrates on this set, so that we can perform the analysis locally. For any T ∈ R(K̄_n), we write

Ψ̂_T = Ψ(f_0^T) + W_n(a^T)/√n,

where f_0^T and a^T are the ‖·‖_L-projections of f_0 and a onto F[T]. Under the adaptive prior (treating the partitions as random with a prior), the posterior is asymptotically close to a mixture of normals indexed by T ∈ R(K̄_n) with weights π(T | Y^(n), R(K̄_n)). When

max_{T ∈ R(K̄_n)} | ‖a^T‖_L − ‖a‖_L | = o(1)    (20)

and

max_{T ∈ R(K̄_n)} √n (Ψ_n − Ψ̂_T) = o_P(1),    (21)

this mixture boils down to the target law N(0, ‖a‖²_L). The first condition (20) holds owing to the fact that ‖a − a^T‖_L ≲ d_n(α)^γ → 0 (Lemma 3.2 of Ročková and van der Pas (2017)). The second condition (21) will be satisfied as long as

√n d_n(α)^{α+γ} → 0.    (22)

For T ∈ R(K̄_n) and assuming p = 1, we have for M_n ≲ log n

√n d_n(α)^{α+γ} ≲ n^{1/2 − (α+γ)/(2α+1)} (log n)^{C(α,γ)} → 0  for γ > 1/2,

with some constant C(α, γ) > 0. We can now state an adaptive variant of Theorem 4.1 for random partitions.

Theorem 5.1. Assume model (1) with p = 1, where f_0 ∈ H^α_1 and a ∈ H^γ_1 for 1/2 < α ≤ 1 and γ > 1/2. Assume that f_0 is self-similar. Under the Bayesian CART prior ((8) and (9), or (10)) with Gaussian or Laplace step heights ((11) or (12)), we have

Π[ √n (Ψ(f_{T,β}) − Ψ̂_T) | Y^(n) ] ⇝ N(0, ‖a‖²_L)

in P_0^n-probability as n → ∞.

Proof: Appendix Section 1.3.

Theorem 5.1 can be regarded as an adaptive extension of the regular density histogram result of Castillo and Rousseau (2015). Here, we instead focus on irregular and adaptive regression histograms underpinned by tree priors and treat both K and T as random under self-similarity. The change of measure argument is performed locally for each regular partition.
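Returning to condition (22), a quick arithmetic check (illustrative constants; the slowly increasing factor M_n is dropped) shows the boundary role of γ = 1/2 when p = 1: the quantity √n d_n(α)^{α+γ} drifts down (slowly, owing to the logarithmic factors) for γ > 1/2 and blows up for γ < 1/2.

```python
import numpy as np

# Condition (22) for p = 1 with d_n(alpha) = n^(-1/(2a+1)) * log(n)^(1/(2a)),
# constants dropped: vanishes for gamma > 1/2, diverges for gamma < 1/2.
alpha = 0.8
for gamma in [0.3, 0.7]:
    vals = []
    for n in [1e4, 1e8, 1e12, 1e16]:
        d_n = n ** (-1 / (2 * alpha + 1)) * np.log(n) ** (1 / (2 * alpha))
        vals.append(np.sqrt(n) * d_n ** (alpha + gamma))
    print(f"gamma = {gamma}: " + ", ".join(f"{v:.3g}" for v in vals))
```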
Theorem 5.1 can be extended to tree ensembles. The self-similarity assumption would instead be formulated in terms of a global partition, obtained by super-imposing all tree partitions inside E and merging empty cells with one of their neighbors. Since tree ensembles also concentrate at near-minimax speed (Ročková and van der Pas, 2017; Ročková and Saha, 2019), one obtains that the posterior concentrates on regular ensembles (where the diameter is small). The analysis is then performed locally on regular ensembles, in the same spirit as for single trees.

5.3. Average Regression Surface when p > 1

One of the main limitations of tree/forest methods is that they cannot optimally estimate functions that are smoother than Lipschitz (Scricciolo, 2007). The reason for this limitation is that step functions are relatively rough; e.g., the approximation error of histograms for functions that are smoother than Lipschitz is at least of the order 1/K, where K is the number of bins (Ročková and van der Pas, 2017). The number of steps required to approximate a smooth function well is thus too large, creating a costly bias-variance trade-off. When p > 1, the no-bias condition (16) would be satisfied if γ > p/2, which, from the Hölder continuity (0 < γ ≤ 1), holds when a(·) is a constant function. Focusing on the actual BART method when p > 1, we now rephrase Theorem 5.1 for the average regression functional (5) obtained with a(·) = 1. When a(·) is a constant function, the no-bias condition (16) is automatically satisfied. Recall that the second requirement for BvM pertains to the shift of measures. It turns out that the Gaussian prior (11) may induce too much bias when the variance is too small (fixed as n → ∞). We thereby deploy an additional assumption in the prior to make sure that the variance increases suitably with the number of steps K. For the BART prior on the step heights β^t of each tree T^t ∈ E, we assume either a Gaussian prior β^t ~ N_{K_t}(0, K_t I_{K_t}) or the Laplace prior with λ_t ≍ 1/√K_t. Having the variance scale with the number of steps is generic in the multi-scale analysis of Haar wavelets (see the heuristic calculation following Theorem 5.2). The following theorem is formulated for a few variants of the BART prior. This prior consists of (a) either a fixed number of trees T (as recommended by Chipman et al. (2010)), (b) the Galton-Watson prior (10) or the conditionally uniform tree prior ((8) and (9)), independently for each tree, and (c) the Gaussian prior β^t ~ N_{K_t}(0, K_t I_{K_t}) or the Laplace prior with λ_t ≍ 1/√K_t, where K_t is the number of bottom nodes of the tree T^t. Below, we denote Ψ̂_E = Ψ(f_E^0) + W_n(a)/√n, where f_E^0 is the projection of f_0 onto F[E], the set of all forest mappings (7) supported on the tree ensemble E.

Theorem 5.2. Assume model (1) with p ≥ 1, where f_0 ∈ H^α_p is endowed with the BART prior (as stated above) and where log p ≲ n and 1/2 ≤ α < 1. Assume that a(·) = 1 in (5). When the design X = {x_i}_{i=1}^n is regular, we have

Π[ √n (Ψ(f_{E,B}) − Ψ̂_E) | Y^(n) ] ⇝ N(0, ‖a‖²_L)

in P_0^n-probability as n → ∞.

Proof: Appendix Section 1.4.
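To see heuristically why the prior variance should grow with the number of steps, consider a single tree on an equivalent-blocks partition (µ(Ω_k) ≈ 1/K) with a(·) = 1 and β ~ N_K(0, σ² I_K); this back-of-the-envelope calculation is ours, not a claim taken from the proofs:

```latex
\Psi(f^K) = \frac{1}{n}\sum_{i=1}^{n} f^K(x_i) = \sum_{k=1}^{K}\mu(\Omega_k)\,\beta_k,
\qquad
\operatorname{Var}_{\Pi}\,\Psi(f^K) = \sigma^2 \sum_{k=1}^{K}\mu(\Omega_k)^2 \approx \frac{\sigma^2}{K}.
```

With σ² fixed, the prior on the functional thus degenerates at rate 1/K as the partition refines, which can translate into a bias in the centering; taking σ² ≍ K (or λ ≍ 1/√K in the Laplace case, whose variance is 2/λ²) keeps the prior variance of Ψ(f^K) of order one across resolutions.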
6. Discussion

This paper focuses on the important problem of uncertainty quantification and inference for machine learning. We provide frequentist justifications for Bayesian inference with BART based on (smooth) linear functionals of the regression function. These results can be used for testing hypotheses pertaining to the exceedance of a (weighted) average level, i.e. (1/n) Σ_{i=1}^n a_i f_0(x_i) > c for some c ∈ ℝ, or for causal inference (Hill, 2011; Hahn et al., 2017). Indeed, embedding our development within the missing data framework of Ray and van der Vaart (2018) would provide asymptotic normality results for average treatment effect estimation. This work will be reported elsewhere. Both Theorem 4.1 and Theorem 5.1 impose restrictions on α and γ to make sure that they are compatible. These theorems can be obtained without assuming α > 1/2 when a(·) is constant. The self-similarity assumption (a very typical assumption for uncertainty quantification of densities and functions) makes it possible to identify smoothness, so that α does not need to be known in Theorem 5.1. Variants of this assumption are pervasive in the literature on adaptive confidence set construction. This assumption is, again, not needed when a(·) is constant. We focus on semi-parametric BvMs for linear functionals of the infinite-dimensional regression function parameter. This semi-parametric setup already poses nontrivial challenges for hierarchical Bayes. We have reiterated and highlighted some of these challenges here and addressed them with self-similarity identification. Our results serve as a step towards obtaining a fully non-parametric BvM for the actual function f_0, as opposed to just its low-dimensional summaries. These results will be reported in follow-up work. Finally, the limitation p = 1 for smooth functions a(·) is due to the fact that BART cannot optimally approximate functions smoother than Lipschitz. This can be overcome by considering smoother versions of BART (Linero and Yang, 2017).

References

Anderson, T. W. (1966). Some nonparametric multivariate procedures based on statistically equivalent blocks. Multivariate Analysis 1, 5-27.

Bentley, J. L. (1975). Multidimensional binary search trees used for associative searching. Communications of the ACM 18(9), 509-517.

Bickel, P. and B. Kleijn (1999). On the Bernstein-von Mises theorem with infinite-dimensional parameters. The Annals of Statistics 27, 1119-1140.

Bickel, P. and B. Kleijn (2012). The semiparametric Bernstein-von Mises theorem. The Annals of Statistics 40, 206-237.

Bleich, J., A. Kapelner, E. George, and S. Jensen (2014). Variable selection for BART: An application to gene regulation. The Annals of Applied Statistics 8, 1750-1781.

Breiman, L. (2001). Random forests. Machine Learning 45(1), 5-32.

Bull, A. (2012). Honest adaptive confidence bands and self-similar functions. Electronic Journal of Statistics 6, 1490-1516.

Castillo, I. (2012a). Semiparametric Bernstein-von Mises theorem and bias, illustrated with Gaussian process priors. Sankhya 74, 194-221.

Castillo, I. (2012b). A semiparametric Bernstein-von Mises theorem for Gaussian process priors. Probability Theory and Related Fields 152, 53-99.

Castillo, I. and R. Nickl (2013). Nonparametric Bernstein-von Mises theorems in Gaussian white noise. The Annals of Statistics 41, 1999-2028.

Castillo, I. and R. Nickl (2014). On the Bernstein-von Mises phenomenon for nonparametric Bayes procedures. The Annals of Statistics 42, 1941-1969.

Castillo, I. and J. Rousseau (2015). A Bernstein-von Mises theorem for smooth functionals in semiparametric models. The Annals of Statistics 43, 2353-2383.

Castillo, I. and V. Ročková (2019). Multiscale analysis of BART. arXiv preprint.

Cheng, G. and M. R. Kosorok (2008). General frequentist properties of the posterior profile distribution. The Annals of Statistics 36, 1819-1853.

Chipman, H., E. I. George, and R. McCulloch (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics 4, 266-298.
Chipman, H., E. I. George, and R. E. McCulloch (1997). Bayesian CART model search. Journal of the American Statistical Association 93, 935-960.

Chipman, H., E. I. George, and R. E. McCulloch (2000). Hierarchical priors for Bayesian CART shrinkage. Statistics and Computing 10, 17-24.

Cox, D. (1993). An analysis of Bayesian inference for nonparametric regression. The Annals of Statistics 21, 903-923.

Denison, D., B. Mallick, and A. Smith (1998). A Bayesian CART algorithm. Biometrika 85, 363-377.

Diaconis, P. and D. Freedman (1986). On consistency of Bayes estimates. The Annals of Statistics 14, 1-26.

Diaconis, P. and D. Freedman (1998). Consistency of Bayes estimates for nonparametric regression: normal theory. Bernoulli 4, 411-444.

Donoho, D. (1997). CART and best-ortho-basis: a connection. The Annals of Statistics 25, 1870-1911.

Ghosal, S. (2000). Asymptotic normality of posterior distributions for exponential families when the number of parameters tends to infinity. Journal of Multivariate Analysis 74, 49-68.

Giné, E. and R. Nickl (2010). Confidence bands in density estimation. The Annals of Statistics 38, 1122-1170.

Giné, E. and R. Nickl (2011). On adaptive inference and confidence bands. The Annals of Statistics 39, 2383-2409.

Giné, E. and R. Nickl (2015). Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge University Press.

Hahn, R., J. Murray, and C. Carvalho (2017). Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects. arXiv preprint.

He, J., S. Yalov, and R. Hahn (2019). XBART: Accelerated Bayesian Additive Regression Trees. Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR 89.

Hill, J. (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics 20, 217-240.

Johnstone, I. (2010). High dimensional Bernstein-von Mises: simple examples. In Borrowing Strength: Theory Powering Applications - A Festschrift for Lawrence D. Brown.

Jonge, R. and H. van Zanten (2013). Semiparametric Bernstein-von Mises for the error standard deviation. Electronic Journal of Statistics 7, 217-243.

Kleijn, B. and A. van der Vaart (2006). Misspecification in infinite-dimensional Bayesian statistics. The Annals of Statistics 34, 837-877.

Laplace, P. S. (1810). Mémoire sur les formules qui sont fonctions de très grands nombres et sur leurs applications aux probabilités. Oeuvres de Laplace 12, 301-349.

Leahu, H. (2011). On the Bernstein-von Mises phenomenon in the Gaussian white noise model. Electronic Journal of Statistics 5, 373-404.

Linero, A. (2016). Bayesian regression trees for high dimensional prediction and variable selection. Journal of the American Statistical Association 113, 626-636.

Linero, A. and Y. Yang (2017). Bayesian regression tree ensembles that adapt to smoothness and sparsity. Journal of the Royal Statistical Society, Series B (to appear).

Nickl, R. and B. Szabó (2016). A sharp adaptive confidence ball for self-similar functions. Stochastic Processes and their Applications 126, 3913-3934.

Picard, D. and K. Tribouley (2000). Adaptive confidence interval for pointwise curve estimation. The Annals of Statistics 28, 298-335.

Ray, K. (2017). Adaptive Bernstein-von Mises theorems in Gaussian white noise. The Annals of Statistics 45, 2511-2536.

Ray, K. and A. van der Vaart (2018). Semiparametric Bayesian causal inference using Gaussian process priors. arXiv preprint.
Rivoirard, V. and J. Rousseau (2012). Bernstein-von Mises theorem for linear functionals of the density. The Annals of Statistics 40, 1489-1523.

Ročková, V. and E. Saha (2019). On theory for BART. Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS).

Ročková, V. and S. van der Pas (2017). Posterior concentration for Bayesian regression trees and forests. The Annals of Statistics, 1-40.

Scricciolo, C. (2007). On rates of convergence for Bayesian density estimation. Scandinavian Journal of Statistics 34, 626-642.

Shen, X. (2002). Asymptotic normality of semiparametric and nonparametric posterior distributions. Journal of the American Statistical Association 97, 222-235.

Szabó, B., A. van der Vaart, and J. van Zanten (2015). Frequentist coverage of adaptive nonparametric Bayesian credible sets. The Annals of Statistics 43, 1391-1428.

van der Pas, S. and V. Ročková (2017). Bayesian dyadic trees and histograms for regression. Advances in Neural Information Processing Systems, 1-12.

van der Vaart, A. (2000). Asymptotic Statistics. Cambridge University Press.

Verma, N., S. Kpotufe, and S. Dasgupta (2009). Which spatial partition trees are adaptive to intrinsic dimension? In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 565-574. AUAI Press.