# Extrapolated Random Tree for Regression

Yuchao Cai*¹, Yuheng Ma*¹, Yiwei Dong¹, Hanfang Yang¹²

*Equal contribution. ¹School of Statistics, Renmin University of China. ²Center for Applied Statistics, School of Statistics, Renmin University of China. Correspondence to: Hanfang Yang. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

In this paper, we propose a novel tree-based algorithm named Extrapolated Random Tree for Regression (ERTR) that adapts to arbitrary smoothness of the regression function while maintaining the interpretability of the tree. We first put forward the homothetic random tree for regression (HRTR), which converges to the target function as the homothetic ratio approaches zero. ERTR then uses a linear regression model to extrapolate HRTR estimations with different ratios to the ratio zero. From the theoretical perspective, we for the first time establish the optimal convergence rates for ERTR when the target function resides in the general Hölder space $C^{k,\alpha}$ for $k \in \mathbb{N}$, whereas the lower bound of the convergence rate of the random tree for regression (RTR) is strictly slower than that of ERTR in the space $C^{k,\alpha}$ for $k \geq 1$. This shows that ERTR outperforms RTR for target functions with high-order smoothness, thanks to the extrapolation. In the experiments, we compare ERTR with state-of-the-art tree algorithms on real datasets to show the superior performance of our model. Moreover, promising improvements are brought by using extrapolated trees as base learners in the extension of ERTR to ensemble methods.

1. Introduction

Ensembles of trees (Breiman, 2001; Friedman, 2002; Chen & Guestrin, 2016) are powerful and well-known methods that have been successfully adopted for regression problems in many applications and machine learning competitions (Yu et al., 2021; Basak et al., 2022; Künzel et al., 2022; Amini et al., 2022). In fact, the success of ensemble methods stems from the advantages inherited from tree methods: the wide applicability of trees due to weak distributional assumptions on the data, their computational efficiency, as well as their interpretability once the model is built.

Beyond their empirical success, the merit of regression trees has also been addressed from theoretical perspectives. For instance, Gao & Zhou (2020) studied the convergence rate of random trees, which attain a minimax optimal rate for Lipschitz functions. Mourtada et al. (2020) derived similar results for Mondrian trees. As another line of work, Bayesian additive regression trees (Chipman et al., 2010) impose a prior distribution on the parameters of the partitioning process and are shown to achieve the optimal convergence rate for Lipschitz functions (Ročková & Saha, 2019; Ročková & van der Pas, 2020). These works imply that standard regression trees, which are based on piecewise constant functions, can only adapt to functions that are at most Lipschitz smooth. Therefore, they may have difficulty adapting to high-order smoothness of the target function.

To overcome this issue, researchers have sought to improve regression trees. Most works focus on employing smooth estimations, which are achieved by attributing data through soft assignment instead of hard assignment. Several closely related works, including Suárez & Lutsko (1999); Da Rosa et al. (2008); Irsoy et al. (2012); Frosst & Hinton (2017); Alkhoury et al. (2020), assign each data point to all leaves with a certain class membership, and the final prediction is a smooth combination of the predictions at each node. Despite their contributions from the methodology perspective, none of the works above managed to explain theoretically how their smooth or flexible tree structures facilitate adaptation to high-order smoothness. A recent work on Bayesian additive regression trees (Linero & Yang, 2018) utilized soft tree ensembles and derived the minimax rate for arbitrary smoothness. However, its Bayesian nature yields a heavy computational burden. Besides, the choice of priors is strongly restricted in order to achieve theoretical optimality and computational efficiency.

Against this background, borrowing the extrapolation techniques from Brezinski & Zaglia (2013); Okuno & Shimodaira (2020), we propose a novel tree-based algorithm called Extrapolated Random Tree for Regression (ERTR) that is adaptive to high-order smoothness of the target function while maintaining interpretability. First, we introduce the random tree partition that divides the input space recursively by cutting the longest edge of each cell. Then we apply a homothetic transformation that shrinks the cell by a scale factor, the same in all directions and called the homothetic ratio, with respect to a center point. Based on these transformations, we put forward the homothetic random tree for regression (HRTR), which represents the average of the response variable on these homothetic cells. Then, using a Taylor expansion with respect to the homothetic ratio, we show that HRTR converges to the target function as the ratio approaches zero. Finally, we propose the extrapolated random tree for regression (ERTR), which extrapolates HRTR estimations with a series of pre-specified ratios to the ratio zero. More specifically, we train a linear regression model that takes the ratios as covariates and the corresponding HRTR estimations as responses, and use it to estimate the value at ratio zero.

Two advantages of ERTR are as follows. First, the extrapolation procedure eradicates the dominant terms in the approximation error, which leads to adaptivity to target functions with high-order smoothness. Moreover, ERTR inherits the interpretability of the tree model and the linear regression model, which tells us how important different input data points are for the prediction.

Our contributions are summarized as follows.

(i) We propose a tree-based algorithm called extrapolated random tree for regression (ERTR) that is adaptive to the high-order smoothness of the target function while preserving the interpretability of the tree.

(ii) From the learning theory perspective, we establish the optimal convergence rate $O(n^{-\frac{2(k+\alpha)}{2(k+\alpha)+d}})$ for ERTR when the target function resides in the general Hölder space $C^{k,\alpha}$, $k \in \mathbb{N}$, whereas the lower bound of the convergence rate of the random tree for regression (RTR) is merely of the order $O(n^{-\frac{2}{2+d}})$ in the space $C^{k,\alpha}$, $k \geq 1$. This shows that ERTR outperforms RTR for target functions with high-order smoothness, thanks to the extrapolation.

(iii) From the experimental perspective, we first conduct synthetic experiments to illustrate the power of extrapolation, which coincides with the established theoretical results. Moreover, we demonstrate that ERTR outperforms other state-of-the-art tree methods on almost all real-world data sets. Furthermore, we extend ERTR to ensemble methods including random forest and gradient boosting.
Promising improvements are brought by using the extrapolated trees as base learners. 2. Methodology We dedicate this section to the methodology of ERTR. In Section 2.1, we first present preliminaries related to regres- sion problems. Then we introduce the standard random tree for regression in Section 2.2. Next, in Section 2.3, we propose the homothetic random tree for regression. Finally, we apply the extrapolation method to these estimations to obtain the ERTR algorithm in Section 2.4. 2.1. Preliminaries Regression is to predict the value of an unobserved output variable Y based on the observed input variable X, based on a dataset D := {(X1, Y1), . . . , (Xn, Yn)} consisting of i.i.d. observations drawn from an unknown probability measure P on X Y. Throughout this paper, we assume that X = [0, 1]d Rd and that Y [ M, M] are compact and non-empty. Recall that for 1 p < , the Lp-norm of x = (x1, . . . , xd) is defined by x p := (|x1|p + +|xd|p)1/p, and the L -norm is defined by x := maxi=1,...,d |xi|. Throughout this paper, we use the notation an bn and an bn to denote that there exist positive constant c and c such that an cbn and an c bn, for all n N. In addition, we denote an bn if there hold an bn and bn an. Moreover, for any x R, let x denote the largest integer less than or equal to x. In the sequel, the following multi-index notations are used frequently. For any vector x = (xi)d i=1 Rd, we write x := ( xi )d i=1, x 1 := (x 1 i )d i=1, log(x) := (log xi)d i=1, x = maxi=1,...,d xi, and x = mini=1,...,d xi. In addition, for any matrix R, let R denote the transpose of the matrix R. Besides, for any set A Rd, the dimameter of A is defined by diam(A) := supx,x A x x 2. It is legitimate to consider the least square loss L : X Y Y [0, ) defined by L(x, y, f(x)) := (y f(x))2 for our target of regression. Then, for a measurable decision function f : X Y, the risk is defined by RL,P(f) := R X Y L(x, y, f(x)) d P(x, y) and the empirical risk is defined by RL,D(f) := 1 n Pn i=1 L(Xi, Yi, f(Xi)). The Bayes risk, which is the smallest possible risk with respect to P and L, is given by R L,P := inf{RL,P(f)|f : X Y measurable}. The function that achieves the Bayes risk is called Bayes function, namely, f L,P(x) := E Y |X = x . 2.2. Random Tree for Regression Breiman s original algorithm (Li et al., 1984) is difficult to analyze as a result of the complexity of the partitioning procedure. Nevertheless, grown independently of the samples, random trees are amenable to theoretical analysis. In this paper, we investigate the random tree partitions with the mid-point splitting rule on the max edges suggested by Gao & Zhou (2020). To be specific, let A1 0 := [0, 1]d be the initial rectangular cell and π0 := {A1 0} be the initialized cell partition. In addition, let p N represent the depth of Extrapolated Random Tree for Regression the tree, which is fixed beforehand by the user, and possibly depending on n. In the first step, we choose one of the coordinates X = (X1, . . . , Xd) with the ℓ-th feature Xℓhaving a probability 1/d of being selected and then split A1 0 into two rectangular cells along the midpoint of the chosen side. In this way, we get a partition with two rectangular cells denoted as π1 := {A1 1, A2 1}. Suppose after i 1 steps of the recursion, 1 i p, we have obtained a partition πi 1 of X with 2i 1 rectangular cells. In the i-th step, further partitioning of the region is defined as follows: (i) For each rectangular cell Aj i 1, 1 j 2i 1. 
Let $e^\ell_{i-1,j}$ denote the length of the edge of $A^j_{i-1}$ along the $\ell$-th coordinate for $1 \leq \ell \leq d$, and let $e^*_{i-1,j} := \max_{1 \leq \ell \leq d} e^\ell_{i-1,j}$. Then a coordinate of $X = (X_1, \ldots, X_d)$, namely $Z_{i,j}$, is uniformly selected among the coordinates with maximal side length,
$$P(Z_{i,j} = \ell) = \frac{\mathbf{1}\{e^\ell_{i-1,j} = e^*_{i-1,j}\}}{\sum_{\ell'=1}^d \mathbf{1}\{e^{\ell'}_{i-1,j} = e^*_{i-1,j}\}}. \quad (1)$$

(ii) For each rectangular cell $A^j_{i-1}$, $1 \leq j \leq 2^{i-1}$, once the coordinate is selected, the split is made at the midpoint of the chosen side. Each cell $A^j_{i-1}$ is thereby divided into two new ones, namely $A^{2j-1}_i$ and $A^{2j}_i$. We denote the set of all these cells $\{A^j_i, 1 \leq j \leq 2^i\}$ by $\pi_i$.

Figure 1. Illustration of the random tree partition (panels show depths p = 0, 1, 2, 3).

After $p$ recursive steps, we obtain the partition of $[0,1]^d$, i.e.
$$\pi_p := \{A^j_p\}_{j \in I_p} = \{A^j_p, 1 \leq j \leq 2^p\}. \quad (2)$$
We call it a random tree partition with the maximal-edge (max-edge) splitting rule of depth $p$. The complete process is presented in Algorithm 1, with an illustration in Figure 1. For any $x \in [0,1]^d$, there exists $j \in I_p$ such that $x \in A^j_p$. We then denote the cell containing $x$ by $A(x) := A^j_p$. With these preparations, we introduce the random tree for regression (RTR) based on the random tree partition $\pi_p$.

Algorithm 1 Random Tree Partition
Input: Depth of the random tree $p$; initial partition $\pi_0 = \{A^1_0 := [0,1]^d\}$.
for $i = 1$ to $p$ do
  for $j = 1$ to $2^{i-1}$ do
    For the rectangular cell $A^j_{i-1}$, randomly choose one coordinate $Z_{i,j}$ among the longest edges as in (1);
    Divide the cell $A^j_{i-1}$ into two subregions, that is, $A^j_{i-1} = A^{2j-1}_i \cup A^{2j}_i$, along the midpoint of dimension $Z_{i,j}$;
  end for
  Get $\pi_i = \{A^j_i, 1 \leq j \leq 2^i\}$.
end for
Output: Partition $\pi_p$.

Definition 2.1. Let $Q$ be a probability measure and $\pi_p := \{A^j_p\}_{j \in I_p}$ be a random tree partition with depth $p$ as in (2). Then the random tree for regression (RTR), namely $f_Q : [0,1]^d \to \mathbb{R}$, is defined by
$$f_Q(x) := \frac{\int_{A(x) \times \mathcal{Y}} y \, dQ}{\int_{A(x) \times \mathcal{Y}} dQ}. \quad (3)$$

Taking $Q := P$, we get the population RTR as
$$f_P(x) := \frac{\int_{A(x)} f^*_{L,P}(x') \, dP_X(x')}{\int_{A(x)} dP_X(x')}. \quad (4)$$

From the above definition, $f_P(x)$ represents the average of the Bayes function $f^*_{L,P}(x')$ over $A(x)$. Since $P$ is inaccessible, we need to estimate it from the data. When $Q$ is taken as the empirical measure $D := \frac{1}{n} \sum_{i=1}^n \delta_{(X_i, Y_i)}$ given the data set $D = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, the empirical RTR is expressed as
$$f_D(x) := \frac{\sum_{i=1}^n \mathbf{1}\{X_i \in A(x)\} Y_i}{\sum_{i=1}^n \mathbf{1}\{X_i \in A(x)\}}. \quad (5)$$
The denominator on the right-hand side of (5) counts the number of observations falling in $A(x)$, and thus $f_D(x)$ is the average of the responses in the cell $A(x)$.

2.3. Homothetic Random Tree for Regression

In this section, we put forward the homothetic random tree for regression and present a Taylor expansion with respect to the homothetic ratio. In mathematics, a homothetic transformation of an affine space is determined by a point $x$ called its center and a nonzero number $r$ called its ratio, which sends a point $z$ to a point $z'$ by the rule $z' = x + r(z - x)$. Recall that for $x \in [0,1]^d$, we use $A(x)$ to denote the cell of the partition $\pi_p$ containing $x$. Then for any $0 \leq r \leq 1$, we define the homothetic transformation $T_{x,r} : A(x) \to \mathbb{R}^d$ with center $x$ and ratio $r$ as
$$T_{x,r}(z) := x + r(z - x), \quad z \in A(x). \quad (6)$$

Figure 2. Illustration of homothetic transformations. The black dots between x and z denote the images $T_{x,1/4}(z)$, $T_{x,1/2}(z)$, and $T_{x,3/4}(z)$.

Then we define $A_r(x)$ as the image of $T_{x,r}$, that is,
$$A_r(x) := \{T_{x,r}(z) : z \in A(x)\}. \quad (7)$$
$A_r(x)$ defines a collection of rectangles parametrized by $r$. For $r = 1$, $A_1(x)$ remains $A(x)$. As $r$ approaches 0, $A_r(x)$ gradually shrinks and in the end degenerates to $x$.
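To make the partition and the homothetic cells concrete, the following is a minimal NumPy sketch of Algorithm 1 together with the sets $A(x)$ and $A_r(x)$ from (2), (6), and (7). The function and variable names (`random_tree_partition`, `find_cell`, `homothetic_cell`) are our own illustration under the definitions above, not the authors' implementation.

```python
import numpy as np

def random_tree_partition(p, d, rng=None):
    """Max-edge random tree partition of [0, 1]^d (Algorithm 1).

    Returns a list of 2**p cells; each cell is a (d, 2) array of
    [lower, upper] bounds per coordinate.
    """
    rng = np.random.default_rng(rng)
    cells = [np.column_stack([np.zeros(d), np.ones(d)])]  # pi_0 = {[0, 1]^d}
    for _ in range(p):
        new_cells = []
        for cell in cells:
            edges = cell[:, 1] - cell[:, 0]
            # uniformly pick one coordinate among those with maximal edge, as in (1)
            candidates = np.flatnonzero(np.isclose(edges, edges.max()))
            ell = rng.choice(candidates)
            mid = 0.5 * (cell[ell, 0] + cell[ell, 1])
            left, right = cell.copy(), cell.copy()
            left[ell, 1] = mid   # A_i^{2j-1}
            right[ell, 0] = mid  # A_i^{2j}
            new_cells += [left, right]
        cells = new_cells
    return cells

def find_cell(cells, x):
    """Return the cell A(x) of the partition containing x."""
    for cell in cells:
        if np.all((x >= cell[:, 0]) & (x <= cell[:, 1])):
            return cell
    raise ValueError("x lies outside [0, 1]^d")

def homothetic_cell(cell, x, r):
    """Bounds of A_r(x) = {x + r (z - x) : z in A(x)}, cf. (6)-(7)."""
    return x[:, None] + r * (cell - x[:, None])
```

Representing a cell by its per-coordinate lower and upper bounds suffices here because every $A^j_p$, and hence every $A_r(x)$, is an axis-aligned rectangle.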
With these preparations, we can define a broader class of estimations as below. Definition 2.2. Let Q be a probability measure and πp := {Aj p}j Ip be a random tree partition with depth p as in (2). Moreover, let Ar(x) be defined by (7), then the homothetic random tree for regression (HRTR) with the ratio r, namely, f Q,r : [0, 1]d [0, ) is defined by f Q,r(x) := Ar(x) Y Y d Q R Ar(x) Y d Q . (8) Taking Q := P, we get the population HRTR as f P,r(x) := Ar(x) f L,P(x ) d PX(x ) R Ar(x) d PX(x ) . (9) Clearly, f P,r(x) is the average of f L,P(x ) on Ar(x). Similar to (5), we can estimate HRTR from the data by f D,r(x) := Pn i=1 1{Xi Ar(x)}Yi Pn i=1 1{Xi Ar(x)} . (10) To make HRTR converge to f(x) as r approaches 0, we need the following definition of H older continuity space. Definition 2.3. Let k N, α (0, 1]. We say that a function f : X R is (k, α)-H older continuous, if there exists a finite constant c L > 0 such that ℓf c L for all ℓ {1, . . . , k} and kf(x) kf(x ) c L x x α for all x, x X. The set of such functions is denoted by Ck,α(X). We remark that k decides the order of smoothness for f Ck,α, and larger k indicates that f enjoys a higher order of smoothness. For the special case k = 0, the function space C0,α coincides with the commonly used α-H older continuous function space Cα. With these preparations, the next proposition shows that when the Bayes function f L,P Ck,α. HRTR can be approximated by the Taylor series concerning the homothetic ratio whose degree depends on the smoothness of the Bayes function. Proposition 2.4. Suppose that PX has upper and lower bounded function over [0, 1]d and the Bayes function f L,P Ck,α Moreover, let f P,r be defined by (9). Then we have f P,r(x) f L,P(x) = j=1 bjrj + δr,A, (11) where the remainder δr,A diam(A(x))k+α and b1, bk < B are constants that depends on x. (11) tells us that HRTR converges to the target function as the ratio converges to zero. The theorem above directly indicates that f D,r(x) f L,P(x) = j=1 bjrj + εr, (12) where εr = δr,A + f D,r(x) f P,r(x). This implies that the f D,r(x) behaves similar to f P,r(x) with additional error f D,r(x) f P,r(x) due to the empirical measure D. 2.4. Extrapolated Random Tree for Regression We dedicate this section to the extrapolation procedure. To be specific, we first fix a ratio sequence {ri}V i=1 with ri = i/V and a pre-specified order parameter L V 1. Then, we compute the HRTR estimation for r = ri, 1 i V . Motivated by (12), we consider the following linear regression model to extrapolate these estimations to r = 0, f D,ri(x) = b0 + j=1 bjrj i + ϵi, 1 i V, (13) where ϵi := δri,A + f D,ri(x) f P,ri(x), and b = (b0, . . . , b L) is the regression coefficient vector to be estimated. For 1 i k, bi is specified in Proposition 2.4 and bi = 0 for k + 1 i L. Note that the regression function is a polynomial of ri. Therefore, it can be solved efficiently through the standard least square method. ˆb := arg min b RL+1 f D,ri(x) b0 Extrapolated Random Tree for Regression 0.0 0.2 0.4 0.6 0.8 1.0 x (a) ERTR estimation 0.0 0.2 0.4 0.6 0.8 1.0 r (b) Extrapolation at (5π/32, 1) 0.0 0.2 0.4 0.6 0.8 1.0 r Samples for Regression Regression Curve (c) Extrapolation at (π/8, 0) Figure 3. Illustration of extrapolation. In (a), the blue line stands for the target function and the orange line stands for ERTR estimation. In (b) and (c), the blue points are (ri, f D,ri(x)) with V = 10 for extrapolation. The green line represents the linear regression of the extrapolation. 
The Bayes function value at two points is represented by the horizontal dashed line. The estimation can be explicitly expressed as ˆb = (R R) 1R (f D,r1(x), , f D,r V (x)) (15) where R = (Rij) is a V (L + 1) matrix with Rij = rj 1 i . Then we propose the extrapolated random tree for regression (ERTR) as f D,E(x) = ˆb0. (16) The complete algorithm is presented in Algorithm 2. Algorithm 2 Extrapolated Random Tree for Regression Input: Depth of the random tree p, order L, and number of estimations V . Generate random tree partition πp by Algorithm 1; for i = 1 to V do Compute Ari(x) with ri = i/V by (7); Compute f D,ri(x) by (10); end for Compute the coefficient estimation ˆb by (14); Output: Extrapolated random tree for regression f D,E(x) by (16). ERTR first randomly splits the input space by the max-edge rule introduced in Algorithm 1. Then, for the test instance x, ERTR computes several HRTRs with a sequence of homothetic ratios ri, 1 i V . Finally, we obtain the estimation by using the linear regression model to extrapolate these estimations to r = 0. Since ERTR eradicates the dominant terms in Taylor s expansion, i.e. rj, 1 j k, ERTR adapts to the high-order smoothness of the target function. Moreover, f D,E(x) is actually a weighted average of f D,ri(x), 1 i V from (15) and (16). This implies that ERTR is a well-interpretable model, which can not only tell us which node of the tree is taken for the prediction but also quantify the contributions of these data points in the prediction. An explanation of ERTR is presented in Figure 3 for the following synthetic model employed in Cai et al. (2020), Y = sin(16X) + ε, (17) where X Unif[0, 1] and ε N(0, 1). We generate 2000 samples from this model for training and set p = 3, L = 1, and V = 10 for our algorithm. Figure 3(a) displays the target function and the estimation. To visualize the regression model in (14), we plot the curve of (r, f D,r(x)) in blue for two fixed points x = 5π/32 and π/8 which reside in the same cell (0.375, 0.5), respectively. The samples selected for the regression model (13) are colored in orange. It is clear to see that RTR predicts both points by an identical value of around 0.5. On the contrary, from Figure 3(b) and 3(c), we see that the curve of HRTR can be approximated by a polynomial function. As a result, ERTR can make a more precise prediction thanks to the extrapolation method. 3. Theoretical Results In this section, we present the theoretical results and related comments. To demonstrate the benefits of extrapolation, we establish optimal convergence rates of ERTR and the lower bound of the convergence rates for RTR in section 3.1 and 3.2, respectively. Finally, in Section 3.3, we conduct complexity analysis for ERTR. 3.1. Convergence Rates for ERTR in Ck,α Theorem 3.1. Suppose that PX has upper and lower bounded density over [0, 1]d and the Bayes function f L,P Ck,α. Let f D,E(x) be the random tree extrapolation for regression defined by (16). Moreover, let pn, V , and L be chosen as pn log(n/ log n) and V 1 L k. Then for all sufficiently large n, with probability Pn at least 1 2/n2, we have RL,D(f D,E) R L,P (log n/n) 2(k+α) 2(k+α)+d . (18) Theorem 3.1 shows that when the Bayes function lies in the function space Ck,α, our ERTR achieves optimal convergence rates with properly chosen depth p and a sufficiently large V . Although Linero & Yang (2018) established similar convergence rates, their conclusion is derived from a bayesian perspective. 
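As a companion to Algorithm 2 above, here is a minimal sketch of the ERTR prediction step, reusing `find_cell` from the earlier sketch: it computes the empirical HRTR values $f_{D,r_i}(x)$ at the ratios $r_i = i/V$ as in (10) and extrapolates them to $r = 0$ by ordinary least squares as in (13)-(16). The names are illustrative, not the authors' code; the real-data experiments use a ridge-regularized version of (14) with shrinkage $\lambda$, which would replace the plain least-squares solve below.

```python
import numpy as np

def hrtr_estimate(X, Y, cell, x, r):
    """Empirical HRTR f_{D,r}(x): average response over A_r(x), cf. (10)."""
    ar = x[:, None] + r * (cell - x[:, None])           # bounds of A_r(x)
    inside = np.all((X >= ar[:, 0]) & (X <= ar[:, 1]), axis=1)
    return Y[inside].mean() if inside.any() else np.nan

def ertr_predict(X, Y, cells, x, V=10, L=1):
    """ERTR prediction at x: extrapolate HRTR estimates to r = 0, cf. (13)-(16)."""
    cell = find_cell(cells, x)                          # A(x)
    ratios = np.arange(1, V + 1) / V                    # r_i = i / V
    f_r = np.array([hrtr_estimate(X, Y, cell, x, r) for r in ratios])
    ok = ~np.isnan(f_r)                                 # drop ratios whose cell is empty
    R = np.vander(ratios, N=L + 1, increasing=True)     # R_{ij} = r_i^{j-1}, as in (15)
    b_hat, *_ = np.linalg.lstsq(R[ok], f_r[ok], rcond=None)
    return float(b_hat[0])                              # f_{D,E}(x) = b_hat_0, cf. (16)
```

Since the fit is a low-dimensional polynomial regression in $r$, the additional cost over RTR is dominated by the $V$ cell-membership passes, consistent with the complexity discussion in Section 3.3.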
To the best of our knowledge, we for the first time establish the optimal convergence rates for the Extrapolated Random Tree for Regression smooth regression tree from the learning theory perspective. 3.2. Convergence Rates for RTR in Ck,α The following theorem presents the lower bound of the excess risk of the random tree for regression with some mild conditions on the gradients. Theorem 3.2. Let the regression model be defined by Y := f(X) + ε, where PX is the uniform distribution over [0, 1]d , E(ε|X = x) = 0, and Var(ε|X = x) = σ2 < for x [0, 1]d. Moreover, assume that f Ck,α, k 1 and there exists a constant cf (0, ) such that p (12d 7)/(12d 9) f(x) 2 f(x) 1 cf for all x [0, 1]d. Then for all n > N1, there holds RL,P(f D) R L,P n 2/(2+d) (19) in expectation with respect to Pn, where the constant N1 is specified in the proof. The theorem above shows that in the space Ck,α with k 1 and 0 < α 1, the lower bound of the convergence rates RTR is merely of the order O(n 2/(2+d)). This together with 3.1 implies that for any k 1 and α (0, 1], the upper bound of the convergence rate (18) for ERTR will be strictly smaller than the lower bound (19) for RTR, which explains the benefits of the extrapolation in our method for the target function with high-order smoothness. In fact, as pointed out in Theorem A.3 in Appendix A.1.2, RTR only achieves the optimal convergence rates in the space C0,α. Since the edges of each cell are parallel to the axes, it is more convenient to use the conditions on the derivatives to derive the lower bound of RTR. The condition on the gradients in Theorem 3.2 can be satisfied in the general space Rd with d 1. Let us consider the following example, where f(x) = Pd i=1 aixi + 1 with ai = (12d 8)i 1 for x = (x1, . . . , xd) Rd. It is clear to verify that p (12d 7)/(12d 9) f(x) 2 f(x) 1. In fact, this condition is satisfied for a large class of functions whose derivatives along different directions vary widely. 3.3. Complexity Analysis We consider the computation complexity of ERTR including both the training and testing stages. The training stage consists of only the Algorithm 1, whose average cost is O(n log n). Then for each test sample, computation of f D,ri, 1 i V takes O dn 2(k+α) 2(k+α)+d time since the number of samples in each cell is around O(n/2p) = 2(k+α) 2(k+α)+d , where we use 2p (n/ log n) by Theorem 3.1. Therefore, the test complexity of ERTR 2(k+α) 2(k+α)+d . Thus, ERTR is an efficient method with sub-linear complexity. In comparison, we discuss the complexities of several other tree models. For the standard decision tree, the training complexity is O(n log n) and the testing complexity is O(log n). For PR tree (Alkhoury et al., 2020), the training stage involves solving the weight matrix for all training samples which yields computation cost of O(n K2d) where K is the number of cells. To ensure consistency, K needs to grow with some order of n, which yields the super-linear cost of the PR tree. Soft Bayesian additive trees (Linero & Yang, 2018) require MCMC sampling which brings a heavy computation burden. For soft trees (Irsoy et al., 2012), its optimization relies on first-order methods, and thus their complexities are not comparable to ours. In short, though ERTR requires additional computation compared to the decision tree, it is still an efficient method compared to other regression trees. Therefore, an extension of ERTR to ensemble methods appears to be computationally feasible. 4. 
Experiments In experiments, we first conduct experiments on synthetic datasets in Section 4.1 to illustrate theoretical findings in Section 3. Then, in Section 4.2, we compare ERTR, as well as its ensemble extensions, with other tree-based regression models to show its superior performance. 4.1. Synthetic Experiments In synthetic experiments, we first investigate the power of extrapolation in Section 4.1.1 and demonstrate the interpretability of ERTR in Section 4.1.2. Then, we conduct parameter analysis in Section 4.1.3. 4.1.1. POWER OF EXTRAPOLATION We first conduct experiments to show that the extrapolation method helps to improve the accuracy and the smoothness. To this end, we investigate the choice of the extrapolation order L. We fit RTR and ERTR with L = 0, 1, 3 on 2000 training samples from the synthetic model in (17). The depth of partition p for both methods is set to 3. The regression curves as well as the ground truth are plotted in Figure 4. As in Figure 4(a), RTR using piece-wise constant functions fails to capture the variation of smooth functions and has poor performance. By contrast, ERTR is adaptive to the smoothness of the target function. From Figure 4(b), 4(c), and 4(d), we see that the choice of order L affects the smoothness of the regressor. ERTR is under-fitting with a small L due to poor approximation ability. In this case, ERTR can not benefit from high-order smoothness as in Figure 4(b). On the other hand, when the order L is chosen too large, there are only a few samples in the cell. As a result, ERTR tends to overfit the model as shown in Figure 4(d). A proper extrapolation order can lead to a stable regressor with adequate approximation ability, as is shown in 4(c) for Extrapolated Random Tree for Regression 0.0 0.2 0.4 0.6 0.8 1.0 x 0.0 0.2 0.4 0.6 0.8 1.0 x 0.0 0.2 0.4 0.6 0.8 1.0 x 0.0 0.2 0.4 0.6 0.8 1.0 x Figure 4. The estimated regression curve of RTR and ERTR with L = 0, 1, 2. The blue curves and the orange curves stand for ground truth and estimated target function respectively. L = 1. These observations are compatible with Theorem A.3 that extrapolation brings a faster convergence rate with a suitably chosen order L for a smoother target function. 4.1.2. INTERPRETABILITY We illustrate the interpretability of ERTR in prediction as mentioned in Section 2.4. (15) and (16) tell us that ERTR is the weighted average of the responses in the cell of the random tree partition. Therefore, to demonstrate the interpretability, we visualize the weights versus r for L = 0, 1, 2 and RTR under the same simulation setting in Section 4. In contrast to RTR, which assigns equal weight to each sample, ERTR assigns larger weights to samples close to x and smaller weights to samples far from x. Moreover, as L increases, ERTR becomes overfitting as it puts too much weight on the nearby points. This implies that ERTR is interpretable in the sense that, we can not only recognize the points in the cell that influence the estimation but also quantify the influence of each data point. 0.0 0.2 0.4 0.6 0.8 1.0 r RTR L = 0 L = 1 L = 2 Figure 5. The weights of samples versus r for RTR and ERTR with L = 0, 1, 2 at x = 5π/32. We mention that the parameters V and L should be carefully chosen in terms of balancing the convergence rate and interpretability. (i) According to the parameter choices in Theorem 3.1, a smaller V yields a smaller L. In particular, if V k, we can show that the convergence rate of ERTR becomes (n/ log n) 2L/(2L+d). 
In this case, ERTR can not fully use the target function s smoothness to achieve the optimal convergence rate. (ii) Both the choices of V and L affect the interpretability of the model. More specifically, L controls the pattern of the weights and V affects the values of these weights. As shown in Figure 5, ERTR with larger L assigns larger weights to samples close to x and smaller weights to samples far from x, while ERTR with smaller L assigns close weights to the samples in the cell A(x). This implies that small V and L help to enhance the interpretability of the model. Therefore, we need to choose appropriate V and L to balance the convergence rate and interpretability. 4.1.3. PARAMETER ANALYSIS In this section, we conduct experiments to investigate the selections of p and L in terms of MSE. We pick p {1, 2, 3} and plot MSE versus L for L {0, 1, 2, 3, 4} on (17). For each pair of (p, L), we set λ = 10 4 as the regularized parameter for ridge regression and choose V {15, 20, 25} by cross-validation. We take 10 times averaged MSE with 1000 training samples in each repetition. The result is displayed in Figure 6(a). Apparently, for each p, as L increases, MSE first decreases until L reaches a certain value. Then MSE begins to increase as L grows. This further confirms the trade-off observed in Section 4.1.1. Moreover, Figure 6(a) shows that the order L at which the test error achieves the minimum decreases as p increases. Intuitively, both the increasing of p and L enhance the approximation ability of the model. Therefore, a small L is demanded to achieve the best performance for large p. Next, we investigate the relation between depth p and MSE under different L under analogous settings. We also plot the MSE-p curve for the RTR estimation. The result is displayed in Figure 6(b). The relation between MSE and p is U-shaped under each L. This illustrates that a properly chosen p is needed as explained in Theorem 3.1. 0 1 2 3 4 L 1 2 3 4 5 6 7 p RTR L=0 L=1 L=2 Figure 6. 6(a) MSE versus L with different p. 6(b) MSE versus p under different L. Extrapolated Random Tree for Regression Table 1. Average MSE over real data sets for tree methods. The best results are bolded and the second best results are underlined. The best results with significance are marked with . Running of ST is corrupted on three data sets that are marked with -. Random Partition Variance Reduction Partition ERTR RTR STRT PRT ERTR DT ST STRT PRT ABA 5.60* 8.01 7.33 6.11 5.20 5.75 4.59* 5.53 5.89 AIR 2.79e+1* 3.92e+1 3.31e+1 3.56e+1 1.13e+1* 1.43e+1 1.22e+1 2.02e+1 3.49e+1 ALG 9.27e-2* 1.87e-1 1.51e-1 9.66e-2 3.03e-2* 3.44e-2 2.77e-1 3.21e-2 3.30e-2 BIAS 3.87 4.38 4.55 2.45* 1.60* 1.75 - 2.01 2.09 CBM 1.52e-1* 1.53 1.70e-1 3.61e-1 2.31e-10 4.24e-27* 3.70 9.08e-4 1.89e-1 CCP 3.95e+1 6.33e+1 3.60e+1 4.93e+1 3.92e+1 1.82e+1 2.93e+2 2.07e+1 2.56e+1 CPU 5.01e+1* 3.18e+2 3.23e+2 2.77e+2 1.36e+1* 1.54e+1 3.38e+2 1.63e+1 3.00e+1 IST 1.20e-4 1.58e-4 6.55e-5 2.71e-5* 3.69e-5 4.43e-5 - 5.57e-5 3.98e-5 FOR 4.56e+3 4.15e+3 4.70e+3 4.39e+3 4.25e+3 5.14e+3 4.08e+3 4.35e+3 4.11e+3 MG 1.90e-2 2.09e-2 2.75e-2 1.58e-2* 1.63e-2* 1.83e-2 2.15e-2 1.85e-2 2.58e-2 WR 5.72e-1 6.02e-1 5.85e-1 4.46e-1* 4.63e-1 4.64e-1 5.12e-1 4.29e-1* 4.63e-1 SPA 1.89e-2* 2.83e-2 2.25e-2 2.18e-2 1.72e-2 1.84e-2 1.72e-2 1.88e-2 2.46e-2 WW 5.95e-1* 7.59e-1 7.36e-1 7.62e-1 5.46e-1* 5.57e-1 - 5.50e-1 5.82e-1 rank sum 21 45 37 27 22 37 50 38 48 4.2. 
Real Data Performance In this section, we conduct experiments on real-world data sets to show the promising performance of ERTR compared with other regression trees in Section 4.2.1. In Section 4.2.2, we further investigate the extension of extrapolated trees to ensemble methods. Splitting Rule Note that most popular tree methods design their partition rules utilizing the information gained from the data. For a fair comparison, we also introduce the variance reduction scheme (Breiman, 2001) to the tree construction. To be more specific, we replace the random partition rule in Algorithm 2 with the variance reduction splitting rule. Each current region will be decomposed into two sub-regions with respect to a coordinate and a splitting point that minimizes the weighted sum of variances in the resulting sub-regions. Comparison Methods Under the variance reduction splitting rule, the comparison methods include DT, ST (Irsoy et al., 2012), STRT (Da Rosa et al., 2008), and PRT (Alkhoury et al., 2020). Under the random splitting rule in Section 2.2, the comparison methods include RTR, PRT, and STRT. For forest methods, we compare the extension of ERTR, called ERF, with RF, PRRF, and SBART (Linero & Yang, 2018). For boosting methods, we compare the extension of ERTR, called GBERTR, with GBRT, GBPRT, and GBSTRT. The variance reduction splitting rule is used when ERTR is extened to the ensemble methods. All implementation details are presented in Appendix C.1. Experiment Setup We conduct experiments on 12 real data sets. To ensure significance, we adopt the Wilcoxon signedrank test (Wilcoxon, 1992) to check if the best result is significant. For ERTR, regression in (14) is ridge regularized with shrinkage parameter λ. We summarize the data sets and pre-processing details in Appendix C.1. 4.2.1. COMPARISON OF TREE METHODS Results on the comparison of tree methods are presented in Table 1. For models with the random splitting rule, ERTR shows superior performance over the other models by achieving 7 best results and rank sum 21. Especially, ERTR outperforms RTR on most of the data sets. For models with the variance reduction splitting rule, ERTR outperforms DT on 11 out of 13 data sets. Moreover, ERTR performs promisingly compared to the other popular tree methods by achieving 8 best results and rank sum 22. In both settings, extrapolated methods greatly improve the regression performance compared to their ordinary counterparts. Table 2. Average MSE over real data sets for forest methods. The best results are bolded and the second best results are underlined. The best results with significance are marked with . Running of SBART is corrupted on one data set which is marked with -. ERF RF PRRF SBART ABA 4.64* 4.71 4.87 4.91 AIR 6.35 6.11 2.00e+1 3.90* ALG 1.76e-2* 2.13e-2 4.83e-2 2.59e-2 BIAS 1.12 1.19 1.90 8.67e-1* CBM 5.88e-9 1.21e-27* 3.11e-4 - CCP 1.27e+1* 1.41e+1 1.96e+1 1.55e+1 CPU 9.79 9.87 2.02e+1 7.67* IST 3.45e-5 3.40e-5 3.03e-5 2.93e-5 FOR 3.24e+3 3.52e+3 5.14e+3 3.81e+3 MG 1.39e-2 1.42e-2 1.88e-2 1.43e-2 WR 3.89e-1 3.69e-1 3.97e-1 3.90e-1 SPA 1.36e-2 1.37e-2 1.94e-2 1.07e-2* WW 4.55e-1* 4.62e-1 5.40e-1 4.93e-1 rank sum 23 28 50 30 4.2.2. COMPARISON OF ENSEMBLE METHODS Extending our estimator to a forest method that adopts ERTR as base learners is straightforward. We compare Extrapolated Random Tree for Regression the arising forest method, which we refer to as ERF, with other forest approaches using different base learners. The results for forest methods are presented in Table 2. 
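The paper describes ERF only as a forest that adopts ERTR as base learners, built with the variance-reduction splitting rule. One plausible reading, sketched below purely for illustration and reusing the helpers from the earlier sketches, is standard bootstrap aggregation of ERTR predictors; the bagging recipe and the use of the random max-edge split in this sketch are our assumptions, not the authors' exact procedure.

```python
import numpy as np

def erf_predict(X, Y, x, n_trees=100, p=6, V=10, L=1, rng=None):
    """Hypothetical ERF-style prediction: average ERTR predictions over trees
    grown on bootstrap resamples (bootstrap aggregation assumed here)."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    preds = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)          # bootstrap resample of the data
        cells = random_tree_partition(p, d, rng)  # random max-edge split for brevity;
                                                  # the paper's ERF uses variance reduction
        preds.append(ertr_predict(X[idx], Y[idx], cells, x, V=V, L=L))
    return float(np.nanmean(preds))
```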
ERF not only achieves much smaller MSE than single tree methods in Table 1 but also outperforms other competitors. ERF has the best performance on 6 data sets and the lowest rank sum 22. The promising performance of ERF comes from the complementary property of random forest and extrapolation. Random forests aim at reducing the variance with a small cost of increasing the bias, whereas extrapolation intends to reduce the bias. Therefore, base learners with low bias are averaged to have low variance, which leads to a method that significantly outperforms RF. Similarly, ERTR is extended to gradient boosting with results displayed in Table 4, Appendix C.3, where GBERTR attains the best ranking sum 25. These results show that the application of extrapolated trees enhances the performance of ensemble methods. 5. Conclusion In this paper, we propose a novel tree-based algorithm called Extrapolated Random Tree for Regression (ERTR) that adapts to the high-order smoothness of the target function while maintaining the interpretability of the tree. On the theoretical side, we for the first time establish optimal convergence rates for ERTR when the target function resides in the general H older space and the lower bound of the convergence rates of the RTR, which shows that ERTR outperforms RTR for the target function with high-order smoothness by taking advantage of extrapolation. In experiments, we empirically demonstrate the power of the extrapolation method. Moreover, we show the experimental preponderance of ERTR compared to state-of-the-art regression trees. Furthermore, we extend ERTR to ensemble methods including random forest and gradient boosting. Promising improvements are brought by using the extrapolated trees as base learners. The application of the extrapolation method to enhance the performance of learning algorithms can be served as future works. Acknowledgements The authors would like to thank the reviewers for their constructive comments, which led to a significant improvement in this work. This work was supported in part by the National Key R&D Program of China (2022YFB2702401). This research was supported by Public Computing Cloud, Renmin University of China. Abid, F. and Izeboudjen, N. Predicting forest fire in algeria using data mining techniques: Case study of the decision tree algorithm. In International Conference on Advanced Intelligent Systems for Sustainable Development, pp. 363 370. Springer, 2020. Akbilgic, O., Bozdogan, H., and Balaban, M. E. A novel hybrid rbf neural networks model as a forecaster. Statistics and Computing, 24(3):365 375, 2014. Alkhoury, S., Devijver, E., Clausel, M., Tami, M., Gaussier, E., et al. Smooth and consistent probabilistic regression trees. Advances in Neural Information Processing Systems, 33:11345 11355, 2020. Altosole, M., Benvenuto, G., Figari, M., and Campora, U. Real-time simulation of a cogag naval ship propulsion system. Proceedings of the institution of mechanical engineers, part M: journal of engineering for the maritime environment, 223(1):47 62, 2009. Amini, S., Saber, M., Rabiei-Dastjerdi, H., and Homayouni, S. Urban land use and land cover change analysis using random forest classification of landsat time series. Remote Sensing, 14(11):2654, 2022. Basak, P., Linero, A., Sinha, D., and Lipsitz, S. Semiparametric analysis of clustered interval-censored survival data using soft bayesian additive regression trees (sbart). Biometrics, 78(3):880 893, 2022. Bernstein, S. N. The Theory of Probabilities. 
Gastehizdat Publishing House, Moscow, 1946. Breiman, L. Random forests. Machine learning, 45(1): 5 32, 2001. Brezinski, C. and Zaglia, M. R. Extrapolation methods: theory and practice. Elsevier, 2013. Brooks, T. F., Pope, D. S., and Marcolini, M. A. Airfoil self-noise and prediction. Technical report, 1989. Cai, Y., Hang, H., Yang, H., and Lin, Z. Boosted histogram transform for regression. In International Conference on Machine Learning, pp. 1251 1261. PMLR, 2020. Chang, C.-C. and Lin, C.-J. Libsvm: a library for support vector machines. ACM transactions on intelligent systems and technology (TIST), 2(3):1 27, 2011. Chen, T. and Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pp. 785 794, 2016. Chipman, H. A., George, E. I., and Mc Culloch, R. E. Bart: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1):266 298, 2010. Extrapolated Random Tree for Regression Cho, D., Yoo, C., Im, J., and Cha, D.-H. Comparative assessment of various machine learning-based bias correction methods for numerical weather prediction model forecasts of extreme air temperatures in urban areas. Earth and Space Science, 7(4):e2019EA000740, 2020. Cortez, P. and Morais, A. d. J. R. A data mining approach to predict forest fires using meteorological data. 2007. Cortez, P., Cerdeira, A., Almeida, F., Matos, T., and Reis, J. Modeling wine preferences by data mining from physicochemical properties. Decision support systems, 47(4): 547 553, 2009. Cucker, F. and Zhou, D.-X. Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, 2007. Da Rosa, J. C., Veiga, A., and Medeiros, M. C. Treestructured smooth transition regression models. Computational Statistics & Data Analysis, 52(5):2469 2488, 2008. Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml. Flake, G. W. and Lawrence, S. Efficient svm regression training with smo. Machine Learning, 46(1):271 290, 2002. Friedman, J. H. Stochastic gradient boosting. Computational statistics & data analysis, 38(4):367 378, 2002. Frosst, N. and Hinton, G. Distilling a neural network into a soft decision tree. ar Xiv preprint ar Xiv:1711.09784, 2017. Gao, W. and Zhou, Z.-H. Towards convergence rate analysis of random forests for classification. Advances in neural information processing systems, 33:9300 9311, 2020. Gin e, E. and Nickl, R. Mathematical Foundations of Infinite Dimensional Statistical Models. Cambridge University Press, 2021. Horn, R. A. and Johnson, C. R. Matrix analysis. Cambridge university press, 2012. Irsoy, O., Yıldız, O. T., and Alpaydın, E. Soft decision trees. In Proceedings of the 21st international conference on pattern recognition (ICPR2012), pp. 1819 1822. IEEE, 2012. Kosorok, M. R. Introduction to Empirical Processes and Semiparametric Inference. Springer Series in Statistics. Springer, New York, 2008. K unzel, S. R., Saarinen, T. F., Liu, E. W., and Sekhon, J. S. Linear aggregation in tree-based estimators. Journal of Computational and Graphical Statistics, pp. 1 18, 2022. Li, B., Friedman, J., Olshen, R., and Stone, C. Classification and regression trees (cart). Biometrics, 40(3):358 361, 1984. Linero, A. R. and Yang, Y. Bayesian regression tree ensembles that adapt to smoothness and sparsity. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(5):1087 1110, 2018. Massart, P. 
Concentration Inequalities and Model Selection, volume 1896 of Lecture Notes in Mathematics. Springer, Berlin, 2007. Mourtada, J., Ga ıffas, S., and Scornet, E. Minimax optimal rates for mondrian trees and forests. The Annals of Statistics, 48(4):2253 2276, 2020. Nash, W. J., Sellers, T. L., Talbot, S. R., Cawthorn, A. J., and Ford, W. B. The population biology of abalone (haliotis species) in tasmania. i. blacklip abalone (h. rubra) from the north coast and islands of bass strait. Sea Fisheries Division, Technical Report, 48:p411, 1994. Okuno, A. and Shimodaira, H. Extrapolation towards imaginary 0-nearest neighbour and its improved convergence rate. Advances in Neural Information Processing Systems, 33:21889 21899, 2020. Pace, R. K. and Barry, R. Quick computation of spatial autoregressive estimators. Geographical analysis, 29(3): 232 247, 1997. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825 2830, 2011. Roˇckov a, V. and Saha, E. On theory for bart. In The 22nd international conference on artificial intelligence and statistics, pp. 2839 2848. PMLR, 2019. Roˇckov a, V. and van der Pas, S. Posterior concentration for bayesian regression trees and forests. The Annals of Statistics, 48(4):2108 2131, 2020. Steinwart, I. and Christmann, A. Support Vector Machines. Springer Science & Business Media, 2008. Su arez, A. and Lutsko, J. F. Globally optimal fuzzy decision trees for classification and regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(12): 1297 1311, 1999. T ufekci, P. Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods. International Journal of Electrical Power & Energy Systems, 60:126 140, 2014. Extrapolated Random Tree for Regression van der Vaart, A. W. and Wellner, J. A. Weak Convergence and Empirical Processes. Springer Series in Statistics. Springer-Verlag, New York, 1996. Wilcoxon, F. Individual comparisons by ranking methods. In Breakthroughs in statistics, pp. 196 202. Springer, 1992. Yu, X., Rao, Y., Zhao, W., Lu, J., and Zhou, J. Group-aware contrastive regression for action quality assessment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7919 7928, 2021. Extrapolated Random Tree for Regression The appendix consists of supplementary for both theoretical analysis and experiments. In Appendix A, we present the error analysis for both RTR and ERTR. The proofs for theoretical results in the main content and Appendix A are presented in Appendix B. In Appendix C, we show the supplementary for numerical experiments, including implementation details, data set details and additional real data results. A. Error Analysis In this section, we present the error analysis of RTR and ERTR in Section A.1 and A.2, respectively. A.1. Error Analysis of RTR In this section, we present the error decompositions for the lower bound and upper bound of the convergence for RTR in Section A.1.1 and A.1.2, respectively. A.1.1. LOWER BOUND FOR THE CONVERGENCE RATES OF RTR For the base regressor RTR, we are concerned with the lower bound for f D. We make the error decomposition Eνn RL,P(f D) R L,P = Eνn EPX f D(X) f L,P(x) 2 = Eνn EPX f D(X) f P(X) 2 + Eνn EPX f P(X) f L,P(x) 2. 
(20) It is important to note that the two terms on the right-hand side of (20) are dataand partition-independent due to the expectation with respect to D and Z. Loosely speaking, the first error term corresponds to the expected estimation error of the estimator f D, while the second one demonstrates the expected approximation error. The following two propositions present the lower bound of approximation error and sample error of RTR respectively. Proposition A.1. Let the random tree for regression f P be defined as in (4) and the regression model be defined by Y := f(X) + ε, (21) where PX is the uniform distribution over [0, 1]d, E(ε|X) = 0 and Var(ε|X) = σ2 < . Moreover, assume that f Ck,α, k 1 and there exists a constant cf (0, ) such that f(x) 2 c f(x) 1 cf for all x [0, 1]d with the constant c := p (12d 9)/(12d 7). Then for all n 1, there holds RL,P(f P) R L,P d c2 f(384d 288) 1 2 2p/d. in expectation with respect to PZ. Proposition A.2. Let the random tree for regression f P and its empirical estimate f D be defined by (4) and (5), respectively. Moreover, let the regression model be defined as in (21) with f Ck,α, k 1. Moreover, assume that PX is the uniform distribution over [0, 1]d , E(ε|X = x) = 0, and Var(ε|X = x) = σ2 < for x [0, 1]d. Then there holds RL,P(f D) RL,P(f P) σ2 2p(1 2e 1) n 1 in expectation with respect to Pn. A.1.2. UPPER BOUND FOR THE CONVERGENCE RATES OF RTR In addition to the lower bound of RTR, we also provide the following upper bound. Theorem A.3. Suppose that PX has upper and lower bounded density over [0, 1]d and the Bayes function f L,P(x) Ck,α. Let f D(x) be the random tree for regression defined by (5) and pn log(n/ log n). Then for all sufficiently large n, with probability Pn at least 1 2/n2, we have RL,D(f D) R L,P (n/ log n) 2((α+k) 1) 2((α+k) 1)+d . (22) The theorem above implies that when the Bayes function lies in the function space Ck,α, under mild assumptions, RTR has the convergence rates of the order O(n 2α 2α+d ). This rate is optimal only when the target function lies in the space C0,α. In Extrapolated Random Tree for Regression other words, RTR can not take full advantage of the smoothness of the Bayes function and thus have slower convergence rates. The proof relies on the following error decomposition. f D f L,P f D f P + f L,P f P . The two terms are bounded by the following two propositions. Proposition A.4. Let πp be a random tree partition of [0, 1]d as in (2). Let f P(x) be defined by (4) and f L,P(x) be the Bayes function. Then we have f P f L,P c L(2 d)(k+α) 12 p(k+α) 1 /d. Proposition A.5. Let πp be a random tree partition of [0, 1]d as in (2). Let f P(x) and f D(x) be defined by (4) and (5), respectively. Assume that PX has uppper and lower bounded density over [0, 1]d and Y [ M, M]. Then for all n 1, with probability Pn PZ at least 1 2/n2, there holds 2p+1(4d + 5) log n n + 2p+2 M(4d + 5) log n 3n + 2p+3 M A.2. Error Analysis of ERTR Let e1 denote the vector (1, 0, , 0) . The definition of ERTR in (16) can be reformalized as f D,E(x) = e 1 (R R) 1R (f D,r1(x), , f D,r V (x)) . We also define f P,E(x) = e 1 (R R) 1R (f P,r1(x), , f P,r V (x)) (23) which stands for the population version of the extrapolated estimator. Then we have the following error decomposition f D,E f L,P f D,E f P,E + f L,P f P,E . Proposition A.6. Let f P,E(x) be defined by (23) and f L,P(x) be the Bayes function. If we choose V 1 L k, there holds f P,E f L,P c V,L(2 d)k+α2 p(k+α)/d for constant c V,L depending only on V and L. Proposition A.7. 
Let f P,E(x) and f D,E(x) be defined by (23) and (16), respectively. Suppose that PX has upper and lower bounded density over [0, 1]d and Y [ M, M]. Then for all n 1, with probability Pn PZ at least 1 2/n2, there holds f D,E f P,E 2c V,LV d M 2p+1(4d + 5) log n n + 2p+2 c V,LV d M(4d + 5) log n 3n + 2p+3 c V,LV d M for constant c V,L depending only on V and L. B.1. Fundamental Results on the Properties of the Random Tree Lemma B.1. Let πp := {Aj p, j Ip} be the random tree partition as in (2). Then for any Aj p := d i=1[ai, bi], j Ip, we have d/2 2 p/d diam(Aj p) 2 d 2 p/d. Moreover, for any 1 i d, there holds diam(Aj p)2 (4d 3)(bi ai)2. (24) Extrapolated Random Tree for Regression Proof of Lemma B.1. According to the random tree partition rule, when the depth of the tree p is a multiple of dimension d, each cell of the random tree partition is a high-dimensional cube with side length of 2 p/d. On the other hand, when the depth of the tree p is not a multiple of dimension d, we consider the random tree partition with depth p/d and p/d , whose corresponding side length of the higher dimensional cube is 2 p/d and 2 p/d . Note that under the splitting criterion of random tree, the side length of each sub-rectangle decreases monotonically with the increase of p, so the side length of a random tree partition cell is between 2 p/d and 2 p/d . This implies that d 2 p/d diam(Aj p) Since p/d 1 p/d p/d p/d + 1, we immediately get d/2 2 p/d diam(Aj p) 2 d 2 p/d. This shows the first assertion. Since the random tree partition rule divides the rectangles along the midpoint of the coordinate with maximal length, for every 1 i d, there holds |bj aj| 2|bi ai|, 1 j d, j = i. This implies that diam(Aj p)2 ℓ=1 (bℓ aℓ)2 4(d 1)(bi ai)2 + (bi ai)2 = (4d 3)(bi ai)2. This completes the proof. B.2. Proofs of the Results for RTR B.2.1. PROOFS RELATED TO SECTION A.1.1 Proof of Proposition A.1. Recall that the regression model is defined as Y = f(X) + ε with X following the uniform distribution. Let πp = {Aj p, j Ip} be the random tree partition in (2). Then we have RL,P(f P) R L,P = EPX f P(X) f L,P(x) 2 = X f P(x) f L,P(x) 2 dx. (25) Since PX is the uniform distribution over [0, 1]d, for any x Aj p, we have Aj p f L,P(x ) dx = 1 Aj p f L,P(x ) dx . where µ is the Lesbugue measure. Since the Bayes function f L,P(x) is continuous on Aj p, the mean value theorem implies that there exists an ξ Aj p such that f P(x) = f L,P(ξ). Consequently, we get Z f P(x) f L,P(x) 2 dx = Z f L,P(ξ) f L,P(x) 2 dx. (26) For every fixed x Aj p, we define h(t) := f L,P((1 t)x + tξ) for 0 t 1. It is clear to see that h(0) = f L,P(x) and h(1) = f L,P(ξ). Then by Lagrange s mean value theorem, there exists 0 tx,ξ 1 such that f L,P(x) f L,P(ξ) = h(1) h(0) = h (tx,ξ) = f L,P((1 tx,ξ)x + tx,ξξ) (ξ x). For the sake of notation simplicity, we write η := f L,P((1 tx,ξ)x + tx,ξξ). Then, we obtain f L,P(ξ) f L,P(x) 2 dx = Z η (ξ x) 2 dx = Z i=1 ηi(ξi xi) 2 dx ηi(ξi xi) 2 dx + 2 Z 1 i 0, 0 if Zj = 0. By the law of total probability, we get EPX f D(X) f P(X) 2 = X j Ip EPX f D(X) f P(X) 2 X Aj p P(X Aj p) j Ip EPX f D(X) f P(X) 2 X Aj p, Zj > 0 P(Zj > 0) P(X Aj p) (31) j Ip EPX f D(X) f P(X) 2 X Aj p, Zj = 0 P(Zj = 0) P(X Aj p). 
(32) For the term (31), we have j Ip EPX (f D(X) f P(X))2 X Aj p, Zj > 0 P(Zj > 0)P(X Aj p) Pn i=1 Yi1Aj p(Xi) Pn i=1 1Aj p(Xi) E(f L,P(X)|X Aj p) 2 P(Zj > 0)P(X Aj p) i=1 1Aj p(Xi) Yi E(f L,P(X)|X Aj p) 2 P(X Aj p) (Pn i=1 1Aj p(Xi))2 P(Zj > 0), which yields that for a fixed j Ip, there holds i=1 1Aj p(Xi) Yi E(f L,P(X)|X Aj p) 2 P(X Aj p) (Pn i=1 1Aj p(Xi))2 i=1 12 Aj p(Xi)E Y f P(X) 2 X Aj p P(X Aj p) (Pn i=1 1Aj p(Xi))2 P(X Aj p) Pn i=1 1Aj p(Xi) E Y f P(X))2 X Aj p . (33) Obviously, for any fixed j Ip, there holds E(f P(X)|X Aj p) = E(f L,P(X)|X Aj p) and consequently we obtain E (Y f P(X))2 X Aj p = E (Y f L,P(X))2 X Aj p + E (f L,P(X) f P(X))2 X Aj p = σ2 + E (f L,P(X) f P(X))2 X Aj p . Extrapolated Random Tree for Regression Taking expectation over both sides of (33) with respect to Pn, we get ED Pn EPX f D(X) f P(X) 2 = ED Pn E EPX f D(X) f P(X))2 Xi Aj p P(X Aj p)ED Pn n X i=1 1Aj p(Xi) 1 Zj > 0 σ2 + E(f L,P(X) f P(X))2 P(Zj > 0) = σ2 + E(f L,P(X) f P(X))2 X P(X Aj p)ED Pn(Z 1 j |Zj > 0) P(Zj > 0) = n 1 σ2 + E(f L,P(X) f P(X))2 X E(Zj) E(Z 1 j |Zj > 0) P(Zj > 0). Clearly, x 1 is convex for x > 0. Therefore, by Jensen s inequality, we get E(Zj) E(Z 1 j |Z > 0)P(Zj > 0) E(Zj) E(Zj|Zj > 0) 1P(Zj > 0) = E(Z) E(Z1{Z>0}) 1P(Z > 0)P(Z > 0) = P(Z > 0)2 = (1 P(Z = 0))2 = 1 (1 P(X Aj p))n 2 1 2e n P(X Aj p), where the last inequality follows from (1 x)n e nx, x (0, 1). We now turn to estimate the term (32). By the definition of f D, we have X j Ip EPX f D(X) f P(X))2 X Aj p, Zj = 0 P(Zj = 0) P(X Aj p) j Ip EPX f P(X) 2 X Aj p P(Zj = 0) P(X Aj p) 0. Then we obviously have P(X Aj p) = µ Aj p 2 p d for all j Ip. Combing the above results, we obtain ED Pn EPX f D(X) f P(X) 2 = X j Ip EPX (f D(X) f P(X))2|X Aj p, Zj > 0 P(Zj > 0) P(X Aj p) j Ip EPX (f D(X) f P(X))2 X Aj p, Zj = 0 P(Zj = 0) P(X Aj p) j Ip EPX (f D(X) f P(X))2 X Aj p, Zj > 0 P(Zj > 0) P(X Aj p) 1 2e n P(X Aj p) E(f L,P(X) f P(X))2 + σ2 j Ip 2e n P(X Aj p) . Therefore, we have ED Pn EPX f D(X) f P(X))2 σ2 j Ip 2e n P(X Aj p) = σ2 |Ip| 2|Ip| exp n2 p d σ2 2p(1 2e 1) n 1. (34) Taking expectation with respect to PZ, we obtain the desired assertion. B.2.2. PROOFS RELATED TO SECTION 3.2 Proof of Theorem 3.2. By Proposition A.1 and A.2, the error decomposition in (20) tells us that EPZ RL,P(f D) R L,P = EPZEPX f D(X) f L,P(x) 2 d c2 f(384d 288) 1 2 2p/d + σ2 2p(1 2e 1) n 1 n 2 2+d , where the last inequality holds if and only if 2p nd/(2+d). This yields the desired assertion. Extrapolated Random Tree for Regression B.2.3. PROOFS RELATED TO SECTION A.1.2 proof of Proposition A.4. By the definition of f P, we have |f P(x) f L,P(x)| = A(x) f L,P(x )d PX(x ) R A(x) d PX(x ) f (x) A(x) |f L,P(x ) f L,P(x)| d PX(x ) R A(x) d PX(x ) Since f L,P(x) Ck,α, we have |f L,P(x ) f L,P(x)| c L x x (k+α) 1. Then, by Lemma B.1, we get |f L,P(x ) f L,P(x)| c Ldiam(A(x))(k+α) 1. Consequently, we have |f P(x) f L,P(x)| c Ldiam(A(x))(k+α) 1 R A(x) PX(x ) R A(x) PX(x ) c Ldiam(A(x))(k+α) 1. This together with Lemma B.1 yields the desired assertion. To conduct our analysis, we first need to recall the definitions of VC dimension (VC index) and covering number, which are frequently used in capacity-involved arguments and measure the complexity of the underlying function class (van der Vaart & Wellner, 1996; Kosorok, 2008; Gin e & Nickl, 2021). Definition B.2 (VC dimension). Let B be a class of subsets of X and A X be a finite set. The trace of B on A is defined by {B A : B B}. Its cardinality is denoted by B(A). 
We say that B shatters A if B(A) = 2#(A), that is, if for every A A, there exists a B B such that A = B A. For n N, let m B(n) := sup A X, #(A)=n B(A). (35) Then, the set B is a Vapnik-Chervonenkis class if there exists n < such that m B(n) < 2n and the minimal of such n is called the VC dimension of B, and abbreviate as VC(B). Since an arbitrary set of n points {x1, . . . , xn} possess 2n subsets, we say that B picks out a certain subset from {x1, . . . , xn} if this can be formed as a set of the form B {x1, . . . , xn} for a B B. The collection B shatters {x1, . . . , xn} if each of its 2n subsets can be picked out in this manner. From Definition B.2 we see that the VC dimension of the class B is the smallest n for which no set of size n is shattered by B, that is, VC(B) = inf n n : max x1,...,xn B({x1, . . . , xn}) 2no , where B({x1, . . . , xn}) = #{B {x1, . . . , xn} : B B}. Clearly, the more refined B is, the larger is its index. To further bound the capacity of the function sets, we need to introduce the following fundamental descriptions of covering number which enables an approximation of an infinite set by finite subsets. Definition B.3 (Covering Number). Let (X, d) be a metric space and A X. For ε > 0, the ε-covering number of A is denoted as N(A, d, ε) := min n 1 : x1, . . . , xn X such that A i=1 B(xi, ε) , where B(x, ε) := {x X : d(x, x ) ε}. To prove Lemma B.4, we need the following fundamental lemma concerning with the VC dimension of random partitions in Section 2.2, which follows the idea put forward by Gao & Zhou (2020) of the construction of random forest. To this end, let p N be fixed and πp be a partition of X with number of splits p and π(p) denote the collection of all partitions πp. Lemma B.4. Let A be the collection of all cells d i=1[ai, bi] in Rd. The VC index of A equals 2d + 1. Moreover, for all 0 < ε < 1, there exists a universal constant C such that N(1 A, L1(Q), ε) C(2d + 1)(4e)2d+1(1/ε)2d. Extrapolated Random Tree for Regression Proof of Lemma B.4. The first result of VC index follows from Example 2.6.1 in van der Vaart & Wellner (1996). The second result of covering number follows directly from Theorem 9.2 in Kosorok (2008). Before we proceed, we list the well-known Bernstein s inequality that will be used frequently in the proofs. Lemma B.5 was introduced in Bernstein (1946) and can be found in many statistical learning textbooks, see e.g., Massart (2007); Cucker & Zhou (2007); Steinwart & Christmann (2008). Lemma B.5 (Bernstein s inequality). Let B > 0 and σ > 0 be real numbers, and n 1 be an integer. Furthermore, let ξ1, . . . , ξn be independent random variables satisfying EP ξi = 0, ξi B, and EP ξi2 σ2 for all i = 1, . . . , n. Then for all τ > 0, we have Proof of Proposition A.5. Let πp = {Aj p, j Ip} be the random tree partition of [0, 1]d as in (2). Then by the definition of f P(x) and f D(x) and the triangle inequality, we have f P f D = sup j Ip Pn i=1 1Aj p(Xi)Yi Pn i=1 1Aj p(Xi) Aj p f L,P(x )d PX(x ) R Aj p d PX(x ) Pn i=1 1Aj p(Xi)Yi R Aj p d PX(x ) Pn i=1 1Aj p(Xi) R Aj p f L,P(x )d PX(x ) Pn i=1 1Aj p(Xi) R Aj p d PX(x ) Pn i=1 1Aj p(Xi) R Aj p f L,P(x )d PX(x ) Pn i=1 1Aj p(Xi)Yi Pn i=1 1Aj p(Xi) R Aj p d PX(x ) Pn i=1 1Aj p(Xi)Yi R Aj p d PX(x ) Pn i=1 1Aj p(Xi) Pn i=1 1Aj p(Xi) R Aj p d PX(x ) . 
For the notation simplicity, we write (I) := sup j Ip i=1 1Aj p(Xi) Z Aj p d PX(x ) , (II) := sup j Ip i=1 1Aj p(Xi)Yi Z Aj p f L,P(x )d PX(x ) Then the estimation error in L -norm is bounded by f P f D (I) 2p sup j Ip Pn i=1 1Aj p(Xi)Yi Pn i=1 1Aj p(Xi) + (II) 2p. Since Y [ M, M], we have |Yi| M for 1 i n. Consequently, there holds f P f D M (I) + (II) 2p. (36) Therefore, it suffice to bound (I) and (II) respectively. Let us first consider the term (I). Let A be the collection of all cells d i=1[ai, bi] in Rd. Applying Lemma B.4 with Q := (DX + PX)/2, there exists an ε-net { Ak}K k=1 A with K C(2d + 1)(4e)2d+1(1/ε)2d (37) such that for any j Ip, there exist some k {1, . . . , K} such that 1Aj p 1 Ak L1((DX+PX)/2) ε, Since 1Aj p 1 Ak L1((DX+PX)/2) = 1/2 1Aj p 1 Ak L1(DX) + 1/2 1Aj p 1 Ak L1(PX), we get 1Aj p 1 Ak L1(DX) 2ε, 1Aj p 1 Ak L1(PX) 2ε. (38) Extrapolated Random Tree for Regression Consequently, by the definition of the covering number and the triangle inequality, for any j Ip, there holds 1 n i=1 1Aj p(Xi) PX Aj p 1 n i=1 1 Ak(Xi) PX( Ak) + 1Aj p 1 Ak L1(DX) + 1Aj p 1 Ak L1(PX) i=1 1 Ak(Xi) PX( Ak) + 4ε. Therefore, we get (I) = sup j Ip i=1 1Aj p(Xi) PX Aj p sup 1 k K i=1 1 Ak(Xi) PX( Ak) + 4ε. (39) For any fixed 1 k K, let the random variable ξi be defined by ξi := 1 Ak(Xi) PX( Ak). Then we have EPXξi = 0, ξ 1, and EPXξ2 i PX( Ak). Since PX has upper and lower bounded density over [0, 1]d, there holds EPXξ2 i 2 p. Applying Bernstein s inequality in Lemma B.5, we obtain 1 n i=1 1 Ak(Xi) PX( Ak) n + 2τ log n with probability Pn at least 1 2e τ. Then the union bound together with the covering number estimate (37) implies that i=1 1 Ak(Xi) PX( Ak) 21 p(τ + log(2K)) n + 2(τ + log(2K)) log n with probability Pn at least 1 e τ. Let τ = 2 log n and ε = 1/n. Then for any n > N1 := (2C) (2d + 1) (4e), we have τ + log(2K) = 2 log n + log(2C) + log(2d + 1) + (2d + 1) log(4e) + 2d log n (4d + 5) log n. Therefore, we have i=1 1 Ak(Xi) PX( Ak) 21 p(4d + 5) log n n + 2(4d + 5) log n with probability Pn at least 1 1/n2. This together with (39) yields that 21 p(4d + 5) log n n + 2(4d + 5) log n Next, let us consider the term (II). Let A be the collection of all cells d i=1[ai, bi] in Rd. Then there exists an ε-net { Ak}K k=1 A with K bounded by (37) such that for any j Ip, (38) holds for some k {1, . . . , K}. Consequently, by the definition of the covering number and the triangle inequality, for any j Ip, there holds i=1 1Aj p(Xi)Yi Z Aj p f L,P(x )d PX(x ) i=1 1 Ak(Xi)Yi Z Ak f L,P(x )d PX(x ) 1Aj p(x ) 1 Ak(x ) f L,P(x ) d PX(x ) + 1 Ak(Xi) 1Aj p(Xi) Yi . Consequently, we have i=1 1Aj p(Xi)Yi Z Aj p f L,P(x )d PX(x ) i=1 1 Ak(Xi)Yi Z Ak f L,P(x )d PX(x ) + max 1 i n |Yi| 1Aj p 1 Ak L1(DX) + f L,P 1Aj p 1 Ak L1(PX) i=1 1 Ak(Xi)Yi Z Ak f L,P(x )d PX(x ) + 4Mε. (42) Extrapolated Random Tree for Regression where the last inequality follow from the condition Y [ M, M]. For any fixed 1 k K, let the random variable ξi be defined by ξi := 1 Ak(Xi)Yi R Ak f L,P(x ) d PX(x ). Then we have EP ξi = 0, ξ 1, and EP ξ2 i M 2PX( Ak). Since PX has upper and lower bounded density over [0, 1]d, there holds EP ξ2 i M 2 2 p. Applying Bernstein s inequality in Lemma B.5, we obtain i=1 1 Ak(Xi)Yi Z Ak f L,P(x )d PX(x ) n + 2Mτ log n with probability Pn at least 1 2e τ. Similar to (40), one can show that for any n N1, there holds i=1 1 Ak(Xi)Yi Z Ak f L,P(x )d PX(x ) M n + 2Mτ log n with probability Pn at least 1 1/n2. 
This together with (42) yields that 21 p(4d + 5) log n n + 2M(4d + 5) log n On combining (41) and (43) with (36), we get 2p+1(4d + 5) log n n + 2p+2 M(4d + 5) log n 3n + 2p+3 M This completes the proof of Proposition A.5. Proof of Theorem A.3. By Proposition A.4 and A.5, the triangle inequality tells us that f D f L,P f D f P + f P f L,P d)(k+α) 12 (p(k+α) 1)/d + 2M 2p+1(4d + 5) log n n + 2p+2 M(4d + 5) log n 3n + 2p+3 M holds with probability Pn PZ at least 1 2/n2. By choosing 2p (n/ log n)d/(2((k+α) 1)+d), i.e. p log(n/ log n), we obtain the desired assertion. B.3. Proofs of the Results for ERTR B.3.1. PROOFS RELATED TO SECTION 2.4 Lemma B.6. Suppose that g(x) : [0, 1] R Ck,α with the constant c L. Let g(i)(x) be the i-th order derivative of g(x). Then for 0 r 1, we have Proof of Lemma B.6. Let h(x) = g(x) A simple calculation yields that h(0) = h(r) = 0. Therefore, Rolle s theorem tells us that there exists an 0 ξ1 r such that h(1)(ξ1) = 0. Note that h(i)(0) = 0 for 1 i k 1, applying Rolle s theorem k 1 times again implies that there exists an ξk [0, r] such that h(k)(ξk) = 0. This yields that i! ri = rkg(k)(ξk) Extrapolated Random Tree for Regression Consequently, we have = rk|g(k)(ξk) g(k)(0)| k! c Lrkξα k k! c L where the last inequality follows from g(k)(x) g(k)(x ) c L x x α and ξk r 1. This completes the proof. Proof of Proposition 2.4. From the definition of f P,r(x) in (9), we get Ar(x) f L,P(x ) d PX(x ) R Ar(x) d PX(x ) . Since PX has the upper and lower bounded density over [0, 1]d, we have Ar(x) f L,P(x )f X(x ) dx R Ar(x) f X(x ) dx , where f X(x) represents the density function of PX. From (6), we have x = x + r(z x) with z A(x). Therefore, by the substitution x = x + r(z x), we get A(x) f L,P(x + r(z x)) f X(x + r(z x)) dz A(x) f X(x + r(z x)) dz . (44) For fixed x, z Rd, we define g : [0, 1] R by g(r) = f L,P(x + r(z x)). Then we have g(r) Ck,α since f L,P(x) Ck,α. Moreover, for 0 ℓ k, the l-th order derivative of g(r) at r = 0 is g(j)(0) = X j! jf L,P(x) xi1 1 xid d iℓ! dj A(x) X j! ℓf L,P(x) xi1 1 xid d c Lj! (j + 1)ddℓ A(x) < . where the last inequality follows from ℓf c L. Furthermore, for 0 r r 1, we have |g(k)(r) g(k)(r )| c L(k + 1)! (k + 1)ddk+α A (x)|r r |α. Consequently, by Lemma B.6, we get c L(k + 1)d+1dk+α A (x). Therefore, we have f L,P(x + r(z x)) f X(x + r(z x)) j=0 b jrj f X(x + r(z x)) c L(k + 1)d+1dk+α A (x) f X(x + r(z x)). with b j expressed as xi1 1 xid d Extrapolated Random Tree for Regression Consequently, we get A(x) f L,P(x + r(z x)) f X(x + r(z x)) dz A(x) b jf X(x + r(z x)) dz A(x) c L(k + 1)d+1dk+α A (x) f X(x + r(z x)) dz c L(k + 1)d+1dk+α A (x) Z A(x) f X(x + r(z x)) dz (45) On combining (44) with (45), we get j=0 bjrj c L(k + 1)d+1dk+α A (x). A(x) b jf X(x + r(z x)) dz R A(x) f X(x + r(z x)) dz . This completes the proof. B.3.2. PROOFS RELATED TO SECTION A.2 Lemma B.7. Let ri = i/V for i = 1, , V for V > 0. For V L + 1, let R be a V (L + 1) matrix whose i, j-th entry is rj 1 i . Then we have e 1 (R R) 1R 1 c V,L for some constant c V,L depending only on V and L. Proof of Lemma B.7. Note that R is a Vandermonde matrix (Horn & Johnson, 2012). Then, for V L + 1 and ri = rj for any i = j, R has rank L + 1 and its eigenvalues are strictly positive. Thus, the operator norm of (R R) 1 2 can be bounded by some constant c V,L depending only on V and L. Then, there holds e 1 (R R) 1R 1 V e 1 (R R) 1R 2 = e 1 (R R) 1e1 q V c V,L := c V,L. This completes the proof. Proof of Proposition A.6. 
By Proposition 2.4, there exist $b_1, \ldots, b_L$ such that
$$
\begin{pmatrix} f_{P,r_1}(x) \\ f_{P,r_2}(x) \\ \vdots \\ f_{P,r_V}(x) \end{pmatrix}
=
\begin{pmatrix} 1 & r_1 & \cdots & r_1^L \\ 1 & r_2 & \cdots & r_2^L \\ \vdots & \vdots & \ddots & \vdots \\ 1 & r_V & \cdots & r_V^L \end{pmatrix}
\begin{pmatrix} f_{L,P}(x) \\ b_1 \\ \vdots \\ b_L \end{pmatrix}
+
\begin{pmatrix} \delta_{r_1,A} \\ \delta_{r_2,A} \\ \vdots \\ \delta_{r_V,A} \end{pmatrix},
$$
where $|\delta_{r_i,A}| \lesssim d_A^{k+\alpha}$ for $i = 1, \ldots, V$. For notational simplicity, let $b$ denote the vector $(f_{L,P}(x), b_1, \ldots, b_L)^\top$ and $\delta$ denote the vector $(\delta_{r_1,A}, \ldots, \delta_{r_V,A})^\top$. Combining (46) and (23), we have
$$
f_{P,E}(x) - f_{L,P}(x) = e_1^\top (R^\top R)^{-1} R^\top (R b + \delta) - f_{L,P}(x) = e_1^\top (R^\top R)^{-1} R^\top \delta,
$$
since $e_1^\top (R^\top R)^{-1} R^\top R\, b = e_1^\top b = f_{L,P}(x)$. Then, by Lemma B.7, there holds
$$
|f_{P,E}(x) - f_{L,P}(x)| \leq \big\| e_1^\top (R^\top R)^{-1} R^\top \big\|_1 \, \|\delta\|_\infty \lesssim c_{V,L}\, d_A^{k+\alpha}
$$
for all $x \in \mathcal{X}$. Applying Lemma B.1 completes the proof.

Proof of Proposition A.7. Note that $f_{D,r}$ and $f_{P,r}$ are built on the rectangles $A_{i/V}(x)$ for $i = 1, \ldots, V$. Since the collection of the $A_{i/V}(x)$ is a subset of the collection of all cells in $\mathcal{X}$, as a direct corollary of Proposition A.5, we have
$$
\|f_{P,r} - f_{D,r}\|_\infty \leq 2 V^d M \sqrt{\frac{2^{p+1}(4d+5)\log n}{n}} + \frac{2^{p+2} V^d M (4d+5)\log n}{3n} + \frac{2^{p+3} V^d M}{n}
$$
with probability $\mathrm{P}^n \otimes \mathrm{P}_Z$ at least $1 - 2/n^2$. The additional factor $V^d$ appears because now $\int_{A_r(x)} \mathrm{d}\mathrm{P}_X(x') \gtrsim 2^{-p} V^{-d}$. Then, by Lemma B.7, there holds
$$
|f_{P,E}(x) - f_{D,E}(x)| \leq \big\| e_1^\top (R^\top R)^{-1} R^\top \big\|_1 \max_{1 \leq i \leq V} \|f_{P,r_i} - f_{D,r_i}\|_\infty \leq 2 c_{V,L} V^d M \sqrt{\frac{2^{p+1}(4d+5)\log n}{n}} + \frac{2^{p+2} c_{V,L} V^d M (4d+5)\log n}{3n} + \frac{2^{p+3} c_{V,L} V^d M}{n}
$$
for all $x \in \mathcal{X}$, which completes the proof.

B.3.3. PROOFS RELATED TO SECTION 3.1

Proof of Theorem 3.1. By Propositions A.6 and A.7, the triangle inequality tells us that
$$
\|f_{D,E} - f_{L,P}\|_\infty \leq \|f_{D,E} - f_{P,E}\|_\infty + \|f_{P,E} - f_{L,P}\|_\infty \lesssim (\sqrt{d}\,)^{k+\alpha} 2^{-p(k+\alpha)/d} + 2 c_{V,L} V^d M \sqrt{\frac{2^{p+1}(4d+5)\log n}{n}} + \frac{2^{p+2} c_{V,L} V^d M (4d+5)\log n}{3n} + \frac{2^{p+3} c_{V,L} V^d M}{n}
$$
holds with probability $\mathrm{P}^n \otimes \mathrm{P}_Z$ at least $1 - 2/n^2$. By choosing $2^p \asymp (n/\log n)^{d/(2(k+\alpha)+d)}$, i.e., $p \asymp \log(n/\log n)$, we obtain the desired assertion.

C. Experiments

C.1. Implementation Details

All experiments are conducted on a machine with a 72-core Intel Xeon 2.60GHz CPU and 128GB of main memory. All code is available on GitHub at https://github.com/Karlmyh/ERTR. For ERTR, we use the parameter grids $p \in \{2, 3, 4, 5, 6, 7, 8\}$, $C \in \{0, 1\}$ and $\lambda \in \{0.001, 0.01, 0.1\}$. $V$ is fixed to be $\max(n\,2^{-(p+2)}, 5)$. For each node, if the number of samples in the node is less than 5, we stop the recursive partitioning of that node. The grids for each base learner in ERF and GBERTR are set similarly.
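To make the extrapolation step concrete, the following sketch shows how a single ERTR prediction can be formed from HRTR estimates at the ratios $r_i = i/V$, i.e., how the map $e_1^\top (R^\top R)^{-1} R^\top$ analyzed above is evaluated in practice. It is a minimal illustration only, assuming the query point is used as the homothety center (matching the substitution $x' = x + r(z - x)$ in the proof of Proposition 2.4); the helper `hrtr_estimate`, the plain least-squares fit, and the fact that the tuning parameters $C$ and $\lambda$ from the grids above are not modeled are simplifications, not the released implementation.

```python
import numpy as np

def hrtr_estimate(X, y, x, cell, r):
    """Average of y over the homothetic cell A_r(x): the cell containing x,
    shrunk towards the query point x by the ratio r (illustrative helper)."""
    low, high = cell                      # arrays of shape (d,)
    low_r = x + r * (low - x)             # x' = x + r (z - x), z in A(x)
    high_r = x + r * (high - x)
    mask = np.all((X >= low_r) & (X <= high_r), axis=1)
    return y[mask].mean() if mask.any() else 0.0

def ertr_predict_point(X, y, x, cell, V=5, L=2):
    """Extrapolate the HRTR estimates at r_i = i/V to r = 0 by fitting a
    degree-L polynomial in r and returning its value at zero, i.e. the
    first entry of (R^T R)^{-1} R^T f_{D,r}."""
    ratios = np.arange(1, V + 1) / V
    f_r = np.array([hrtr_estimate(X, y, x, cell, r) for r in ratios])
    R = np.vander(ratios, N=L + 1, increasing=True)   # columns 1, r, ..., r^L
    coef, *_ = np.linalg.lstsq(R, f_r, rcond=None)
    return coef[0]                                    # extrapolated value at r = 0
```

In the actual algorithm the cell $A(x)$ comes from the max-edge random partition of depth $p$ and $V$ is tied to $p$ as described above; here they are free arguments purely for readability.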
For ERF, we set the number of trees to 200 and subsample $\{\lfloor 0.5d\rfloor, \lfloor 0.75d\rfloor, d\}$ features in each split procedure to look for the best cut. In addition, each base learner is trained on $\{\lfloor 0.8n\rfloor, n, \lfloor 1.2n\rfloor\}$ samples bootstrapped with replacement from $D$. For GBERTR, we set the number of trees to 100 and the learning rate to 0.01. $\{\lfloor 0.5d\rfloor, \lfloor 0.75d\rfloor, d\}$ features are subsampled in each split procedure to look for the best cut. In addition, each base learner is trained on $\{\lfloor 0.8n\rfloor, n, \lfloor 1.2n\rfloor\}$ samples bootstrapped with replacement from $D$.

The implementation details for deterministically partitioned tree methods are as follows.

Decision Tree (DT). For standard decision trees, we use the implementation in Scikit-Learn (Pedregosa et al., 2011). We select max_depth in {2, 4, 6, 8} and leave the other parameters at their defaults.

Soft Tree (ST) is proposed by Irsoy et al. (2012). We use the C++ implementation at https://github.com/oir/soft-tree. The implementation provides a self-contained validation procedure that requires an additional validation set, so we take 30% of the training data as the validation set.

Smooth Transition Tree (STRT) is proposed by Da Rosa et al. (2008). We use the R implementation at https://github.com/gabrielrvsc/BooST. We set the depth d_max in {2, 4, 6, 8} and choose the ratio of samples considered in each cell split, p, to be 0.5.

Probabilistic Regression Tree (PRT) is proposed by Alkhoury et al. (2020). We use the Python implementation at https://gitlab.com/sami.kh/pr-tree. We set σ ∈ {0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2} and min_leaf_percentage = 0.1. Their implementation is not compatible with the cross-validation tools provided by Scikit-Learn, so we take 30% of the training data as the validation set.

The implementation details for random partitioned tree methods are as follows.

Random Tree for Regression (RTR) is defined in (5). We pick p ∈ {2, 3, 4, 5, 6, 7, 8}. A sketch of the max-edge random partition shared by RTR and the two max-edge variants below is given at the end of this subsection.

Smooth Transition Tree with max-edge partition (STRT) replaces the partition procedure in Da Rosa et al. (2008) with the max-edge random partition. We modify the implementation at https://github.com/gabrielrvsc/BooST. We set the depth d_max in {2, 4, 6, 8}.

Probabilistic Regression Tree with max-edge partition (PRT) replaces the partition procedure in Alkhoury et al. (2020) with the max-edge random partition. We set σ ∈ {0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2} and min_leaf_percentage = 0.1.

The implementation details for other forest methods are as follows. All of these methods are ideally parallelizable.

Random Forest (RF). For the standard random forest, the Scikit-Learn package in Python is applied with n_estimators = 200 and max_depth ∈ {2, 4, 6, 8}.

Soft Bayesian Additive Regression Tree (SBART) is proposed by Linero & Yang (2018) and is implemented in R at https://github.com/theodds/SoftBART. To avoid unacceptable running times, we set num_tree = 50 and leave the other parameters at their defaults.

Probabilistic Regression Random Forest (PRRF) is a direct extension of PRT in Alkhoury et al. (2020). We set n_estimators = 200 and min_leaf_percentage = 0.1. σ is fixed to 1.

The implementation details for other boosting methods are as follows.

Gradient Boosting Regression Tree (GBRT). For the standard gradient boosting proposed by Friedman (2002), the Scikit-Learn package in Python is applied with n_estimators = 100 and max_depth ∈ {2, 4, 6, 8}.

Gradient Boosted Smooth Transition (GBSTRT) utilizes the smooth transition tree (Da Rosa et al., 2008) as base learner. We set the depth to 4 and the number of trees to 100.

Probabilistic Regression Boosting (GBPRT) is a direct extension of PRT in Alkhoury et al. (2020). We set n_estimators = 100 and min_leaf_percentage = 0.1.
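As referenced in the RTR description above, the sketch below illustrates the max-edge random partition shared by RTR and the max-edge variants: starting from $[0,1]^d$, each cell is split along its longest edge, recursively, producing $2^p$ cells after depth $p$. Drawing the cut position uniformly on the longest edge is an assumption made here for illustration; the exact splitting rule is the one defined in Section 2.2.

```python
import numpy as np

def max_edge_partition(cell, p, rng):
    """Recursively split `cell` (a list of (low, high) pairs, one per
    dimension) to depth p, always cutting the longest edge, which yields
    2**p cells. The uniform cut position is an illustrative assumption."""
    if p == 0:
        return [cell]
    lows = np.array([lo for lo, _ in cell])
    highs = np.array([hi for _, hi in cell])
    j = int(np.argmax(highs - lows))            # index of the longest edge
    cut = rng.uniform(lows[j], highs[j])        # assumed: uniform cut position
    left, right = list(cell), list(cell)
    left[j] = (lows[j], cut)
    right[j] = (cut, highs[j])
    return (max_edge_partition(left, p - 1, rng)
            + max_edge_partition(right, p - 1, rng))

# Example: a depth-3 partition of [0, 1]^2 into 8 cells.
rng = np.random.default_rng(0)
cells = max_edge_partition([(0.0, 1.0), (0.0, 1.0)], p=3, rng=rng)
```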
C.2. Details of Real Data Sets

We summarize the details of the real data sets in Table 3, with the number of instances and features after pre-processing reported. Each feature is min-max scaled to the range [0, 1] individually. We also present additional information on the data sets, including the data source and the preprocessing details.

Table 3. Description of real datasets

DATASET      n     d     DATASET      n     d
ABA       4177     8     CPU       8192    12
AIR       1503     6     IST        536     8
ALG        244    12     FOR        517    13
BIAS      7750    25     MG        1385     6
CBM      11934    16     WR        4898    12
CCP       9568     4     SPA       3107     6
WW        4898    12

ABA: The Abalone dataset originally comes from biological research (Nash et al., 1994) and is now accessible on the UCI Machine Learning Repository (Dua & Graff, 2017). ABA contains 4177 observations of one target variable and 8 attributes related to the physical measurements of abalone.

AIR: The Airfoil Self-Noise dataset on the UCI Machine Learning Repository records the results of a series of aerodynamic and acoustic tests of airfoil blade sections conducted in an anechoic wind tunnel (Brooks et al., 1989). It comprises 1503 instances of 6 attributes, including wind tunnel speeds and angles of attack.

ALG: The Algerian Forest Fires dataset on the UCI Machine Learning Repository contains 244 instances of 11 attributes and 1 output attribute. The task is to predict the condition of forest fires in Algeria (Abid & Izeboudjen, 2020). The attribute date is omitted when conducting the regression in our experiments.

BIAS: The Bias Correction of Numerical Prediction Model Temperature Forecast dataset on the UCI Machine Learning Repository is intended for bias correction of the next-day maximum and minimum air temperature forecasts of the model operated by the Korea Meteorological Administration over Seoul (Cho et al., 2020). It contains 7750 instances of 23 input attributes and 2 output attributes. We choose the output attribute minimum air temperature as the target variable of our regression model.

CBM: The Condition Based Maintenance of Naval Propulsion Plants dataset (Altosole et al., 2009) on the UCI Machine Learning Repository was generated from a sophisticated simulator of gas turbines. It contains 11934 instances of 16 features.

CCP: The Combined Cycle Power Plant dataset (Tüfekci, 2014) on the UCI Machine Learning Repository contains 9568 data points. There are 4 features that can be used to predict the net hourly electrical energy output of the power plant.

CPU: The cpusmall dataset is from LIBSVM (Chang & Lin, 2011). It contains 8192 instances, each with 12 attributes.

IST: The Istanbul Stock Exchange dataset (Akbilgic et al., 2014) on the UCI Machine Learning Repository contains returns of the Istanbul Stock Exchange together with seven other international indices. It has 536 instances, and the time range is from January 5, 2009, to February 22, 2011. The date column in the dataset is dropped before constructing the regression models.

FOR: The Forest Fires dataset (Cortez & Morais, 2007) on the UCI Machine Learning Repository comprises 517 instances of 13 attributes. The task is to predict the burned area of forest fires. In our experiments, two attributes, the month of the year and the day of the week, are not used in the regression models.

MG: This dataset can be traced back to Flake & Lawrence (2002). It consists of 1385 observations of dimension 6.

WR: This dataset contains the information on red wine from the Wine Quality dataset (Cortez et al., 2009) on the UCI Machine Learning Repository. There are 11 input variables to predict the output variable, wine quality. 4898 instances are collected in the dataset.

SPA: The Geographical Analysis Spatial dataset is accessible from StatLib at CMU and was originally uploaded by Pace & Barry (1997). It comprises 3107 observations of dimension 6.

WW: This dataset also originates from the Wine Quality dataset (Cortez et al., 2009) on the UCI Machine Learning Repository. There are 11 features related to white wine that are used to predict the corresponding wine quality.

C.3. Additional Experiment Results for GBERTR

We present the results for the boosting methods in Table 4.

Table 4. Average MSE over real data sets for boosting methods.

             GBERTR     GBRT       GBSTRT     GBPRT
ABA          4.64       4.82       4.47       4.72
AIR          2.88*      3.53       8.01       1.34e+1
ALG          2.13e-2    2.18e-2    3.32e-2    3.30e-2
BIAS         7.28e-1*   7.40e-1    1.33       1.67
CBM          2.71e-8    4.85e-9*   9.40e-6    4.70e-5
CCP          1.76e+1    1.06e+1    1.65e+1    1.77e+1
CPU          8.37       7.79       7.52       1.29e+1
DAK          3.40e-5    3.27e-5    3.09e-5    3.01e-5
FOR          4.13e+3    6.84e+3    8.66e+3    4.89e+3
MG           1.43e-2*   1.51e-2    1.47e-2    1.62e-2
WR           3.81e-1    3.73e-1    4.03e-1    3.90e-1
SPA          1.15e-2    1.25e-2    1.07e-2    1.36e-2
WW           4.20e-1    4.05e-1*   4.74e-1    5.05e-1
ranking sum  25         28         33         44

C.4. Additional Experiment Results for Efficiency

In Section 3.3, we justified that ERTR is efficient enough for adaptation to ensemble methods. As evidence, we provide the computation time of the forest methods in Table 5.

Table 5. Average running time (s) over real data sets for forest methods.

          ERF        RF         PRRF       SBART
ABA       1.01e+1    1.15       2.31e+2    5.94e+2
AIR       2.12       3.05e-1    8.07e+1    3.02e+2
ALG       9.28e-1    2.36e-1    1.09e+1    4.06e+1
BIAS      2.76e+1    6.23       4.45e+2    2.71e+3
CBM       2.78e+2    4.33       9.66e+2    -
CCP       5.17e+2    2.13       5.94e+2    1.61e+3
CPU       2.93e+1    3.80       5.65e+2    1.71e+3
DAK       8.74e-1    4.00e-1    2.79e+1    9.89e+1
FOR       1.15       2.98e-1    3.41e+1    6.12e+1
MG        2.49       5.62e-1    7.35e+1    2.46e+2
WR        4.18       9.56e-1    9.75e+1    2.16e+2
SPA       5.82       1.09       1.66e+2    4.50e+2
WW        1.85e+1    1.90       2.87e+2    9.19e+2

The best results are bolded and the second-best results are underlined. The best results with statistical significance are marked with *. The runs of SBART are corrupted on three data sets, which are marked with -.

As we can see in Table 5, ERF stably achieves the second-best computation time on each data set and often outperforms PRRF and SBART by orders of magnitude. It is reasonable that ERF is always slower than RF; note also that ERF is implemented in pure Python while RF is implemented in Cython, which can account for part of the performance gap.
We argue that the test stage of ERTR is highly parallelizable, since the computation of $f_{D,r}(x)$ only involves the samples in $A(x)$. Thus, we can use a divide-and-conquer strategy to distribute the computation tasks on each cell to $t$ different machines. Since no additional storage is needed, the space complexity remains $O(nd)$ and the prediction time is divided by $t$. In comparison, $k$-nearest-neighbor-based methods, if parallelized, require $O(ndt)$ memory to achieve the same computation time. Similarly, a parallelized PR tree requires $O(Kdt)$ storage. This means that, for large data sets, the overhead of parallel computing caused by memory bandwidth restrictions is significantly smaller for ERTR than for the PR tree. Hence, ERTR is well suited to parallelism to speed up computation.
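To illustrate the divide-and-conquer strategy described above, the sketch below groups test points by the partition cell that contains them and evaluates the groups in parallel with Python's standard `concurrent.futures`. The helper `ertr_predict_cell` is a hypothetical stand-in for the per-cell ERTR prediction (e.g., assembled from the sketches earlier in this appendix); it is not part of the released code, and an actual deployment on $t$ machines would replace the local process pool with a distributed backend.

```python
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor

import numpy as np

def find_cell(x, cells):
    """Return the index of the partition cell containing x; cells are lists
    of (low, high) pairs, as produced by the partition sketch above."""
    for idx, cell in enumerate(cells):
        if all(lo <= xi <= hi for xi, (lo, hi) in zip(x, cell)):
            return idx
    raise ValueError("x lies outside the partition")

def parallel_predict(X_test, cells, ertr_predict_cell, n_workers=4):
    """Distribute per-cell prediction tasks over n_workers processes.
    `ertr_predict_cell(cell_idx, X_cell)` is a hypothetical helper that
    returns the ERTR predictions for the test points falling in one cell."""
    groups = defaultdict(list)
    for i, x in enumerate(X_test):
        groups[find_cell(x, cells)].append(i)
    preds = np.empty(len(X_test))
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        futures = {
            pool.submit(ertr_predict_cell, idx, X_test[rows]): rows
            for idx, rows in groups.items()
        }
        for fut, rows in futures.items():
            preds[rows] = fut.result()
    return preds
```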