# optimal_weighted_random_forests__2f386235.pdf

Journal of Machine Learning Research 25 (2024) 1-81 Submitted 5/23; Revised 10/24; Published 10/24

Optimal Weighted Random Forests

Xinyu Chen SA21204192@mail.ustc.edu.cn International Institute of Finance School of Management University of Science and Technology of China Hefei, 230026, Anhui, China

Dalei Yu yudalei@126.com School of Mathematics and Statistics Xi an Jiaotong University Xi an, 710049, Shaanxi, China

Xinyu Zhang xinyu@amss.ac.cn Academy of Mathematics and Systems Science Chinese Academy of Sciences Beijing, 100190, China International Institute of Finance School of Management University of Science and Technology of China Hefei, 230026, Anhui, China

Editor: Mladen Kolar

The random forest (RF) algorithm has become a very popular prediction method for its great ﬂexibility and promising accuracy. In RF, it is conventional to put equal weights on all the base learners (trees) to aggregate their predictions. However, the predictive performance of diﬀerent trees within the forest can vary signiﬁcantly due to the randomization of the embedded bootstrap sampling and feature selection. In this paper, we focus on RF for regression and propose two optimal weighting algorithms, namely the 1 Step Optimal Weighted RF (1step-WRFopt) and 2 Steps Optimal Weighted RF (2steps-WRFopt), that combine the base learners through the weights determined by weight choice criteria. Under some regularity conditions, we show that these algorithms are asymptotically optimal in the sense that the resulting squared loss and risk are asymptotically identical to those of the infeasible but best possible weighted RF. Numerical studies conducted on real-world data sets and semi-synthetic data sets indicate that these algorithms outperform the equalweight forest and two other weighted RFs proposed in the existing literature in most cases.

Keywords: weighted random forest, model averaging, optimality, splitting criterion

. Dalei Yu is the corresponding author.

c 2024 Xinyu Chen, Dalei Yu and Xinyu Zhang.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/23-0607.html.

Chen, Yu and Zhang

1. Introduction

Random forest (RF) (Breiman, 2001), growing trees using Classiﬁcation and Regression Trees (CART) algorithm, is one of the most successful machine learning algorithms that scale with the volume of information while maintaining suﬃcient statistical eﬃciency (Biau and Scornet, 2016). Due to its great ﬂexibility and promising accuracy, RF has been widely used in diverse areas of data analysis, including policy-making (Yoon, 2021; Lin et al., 2021), business analysis (Pallathadka et al., 2023; Ghosh et al., 2022), chemoinformatics (Svetnik et al., 2003), real-time recognition of human pose (Shotton et al., 2011), and so on. RF ensembles multiple decision trees grown on bootstrap samples and yields highly accurate predictions. In the conventional implementation of RF, it is customary and convenient to allocate equal weight to each decision tree. Theoretically, the predictive performance varies from tree to tree due to the application of randomly selected sub-spaces of data and features. In other words, trees exhibit greater diversity due to the injected randomness. An immediate question then arises: Is it always optimal to employ equal weights? In fact, there is suﬃcient evidence to indicate that an averaging strategy with appropriately selected unequal weights may achieve better performance than simple averaging (that is, equally weighting) if individual learners exhibit non-identical strength (Zhou, 2012; Peng and Yang, 2022).

To solve the problem mentioned above, some eﬀorts have been made in the literature regarding weighted RFs. Speciﬁcally, Trees Weighting Random Forest (TWRF) introduced by Li et al. (2010) employs the accuracy in the out-of-bag data as an index that measures the classiﬁcation power of the tree and sets it as the weight. Winham et al. (2013) develop Weighted Random Forests (w RF), where the weights are determined based on tree-level prediction error. Based on w RF, Xuan et al. (2018) put forward Reﬁned Weighted Random Forests (RWRF) using all training data, including in-bag and out-of-bag data. A novel weights formula is also developed in RWRF but cannot be manipulated into a regression pattern. Pham and Olafsson (2019) replace the regular average with a Cesáro average with theoretical analysis. However, these studies have predominantly focused on classiﬁcation, and less attention has been paid to the regression pattern (that is, estimating the conditional expectation), although some mechanisms for classiﬁcation can be transformed into corresponding regression patterns. In addition, none of the aforementioned studies have investigated the theoretical underpinnings regarding the optimality properties of their methods.

Recently, Qiu et al. (2020) propose a novel framework that averages the outputs of multiple machine learning algorithms by the weights determined by Mallows-type criteria. Motivated by their work, we extend this approach by developing an asymptotically optimal weighting strategy for RF. Speciﬁcally, we treat the individual trees within the RF as base learners and employ Mallows-type criterion to obtain their respective weights. It is worth noting that Qiu et al. (2020) implicitly assume the independence of the hat matrix from the response values (the term hat matrix generally refers to the matrix that maps the response vector to the corresponding vector of ﬁtting values), which is deemed unrealistic in practical situations. In the current study, we remove this restriction and establish the asymptotic optimality under the standard setting where the hat matrix is determined by a response-based splitting criterion (Breiman, 2001; Chi et al., 2022). Moreover, to reduce

Optimal Weighted Random Forests

computational burden, we further propose an accelerated algorithm that requires only two quadratic optimization tasks. Asymptotic optimality is established for both the original and accelerated weighted RF estimators. Extensive analyses on real-world and semi-synthetic data sets demonstrate that the proposed methods show promising performance over existing RFs. The remaining part of the paper proceeds as follows: Section 2 formulates the problem. Section 3 establishes our weighted RF algorithms and provides theoretical analysis. Section 4 shows their promising performance on 12 real-world data sets from UCI Machine Learning Repository and Openml.org, as well as on semi-synthetic data. Section 5 concludes. The data and codes are available publicly on https://github.com/Xinyu Chen-hey/Optimal Weighted-Random-Forests.

2. Model and Problem Formulation

Let x0 i = (xi1, xi2, . . .) be a set of countably inﬁnite predictors (or explanatory variables, attributes, features) and yi be a univariate response variable for i = 1, . . . , n. The data generating process is as follows yi = µi + ei,

where {ei, x0 i }n i=1 are independent and identically distributed random variables with E(ei|x0 i ) = 0 and E(e2 i |x0 i ) = σ2 i , and µi = E(yi|x0 i ). So conditional heteroscedasticity is allowed here. Consider an observable data set D = {yi, xi}n i=1, where xi = (xi1, . . . , xip) with p being the dimension of the feature. Given a predictor vector xi, the corresponding prediction for yi by a tree (or base learner, BL) in the construction of RF can be written as follows

byi = P BL(xi, X, y, B, Θ)y,

where y = (y1, . . . , yn) is the vector of the response variable, PBL(xi, X, y, B, Θ) is an ndimensional vector for tree conﬁguration, and X = (x1, . . . , xn) is the matrix of predictors. The variables B and Θ play implicit roles in injecting randomness into RFs. First, each tree is ﬁtted to an independent bootstrap sample from the original data, with B determining the randomization inherent in this bootstrap sampling process. Second, Θ dictates the randomness in feature selection at each node within the trees. The speciﬁc nature and dimensions of B and Θ are contingent upon the speciﬁc implementation of each tree. Note that B is irrelevant to y, while Θ relies on y for guiding splits.1

Let us assume that we have drawn Mn bootstrap data sets of size n and grown Mn trees on their bootstrap data, where Mn can grow with n or remain ﬁxed. Take the mth tree for example. Dropping an instance (y0, x0) down this base learner and end up with a speciﬁc tree leaf l with nl observations Dl = {(yi1, xi1), . . . , (yinl, xinl)}. Assume that the number of occurrences of instance (yi, xi) in this tree is hi for all i because of the bootstrap sampling procedure. Then PBL(x0, X, y, B, Θ) for this tree is a sparse vector, with elements being hi/nl or zero, depending on the relationship between D and Dl. More speciﬁcally, the ith

element of PBL(x0, X, y, B, Θ) equals 0 if (yi, xi) / Dl and (y0, x0) Dl. Elements of PBL(x0, X, y, B, Θ) are weights put on elements of y to make a prediction for y0.

1. In case where the tree structure is unaﬀected by y, Θ becomes independent of y, therefore reducing PBL(xi, X, y, B, Θ) to PBL(xi, X, B, Θ). Such a situation occurs with split-unsupervised trees as discussed in Section 3.2.

Chen, Yu and Zhang

By randomly selecting sub-spaces of data and features, trees in RFs are given more randomness than trees without these randomization techniques. Speciﬁcally, bootstrap data are used to grow trees rather than the original training data. In addition, when searching for the best splitting variable at each node, we draw q (q < p) variables from the total pool of p variables, rather than using all p variables. If without the bootstrap procedure, we have hi 1 for all i {1, . . . , n}, and PBL(x0, X, y, B, Θ) will contain fewer zero elements. The prediction for yi by the mth tree (or the mth base learner) within the forest obeys the following relationship

by(m) i = P BL(m)(xi, X, y, B(m), Θ(m))y, (1)

where by(m) i is the prediction for yi by the mth tree, and PBL(m)(xi, X, y, B(m), Θ(m)) is the n-dimensional vector for conﬁguring the mth tree. The ﬁnal output of the forest is integrated by

m=1 w(m)by(m) i ,

where w(m) is the weight put on the mth tree. Our goal is to determine appropriate weights to improve prediction accuracy of RF, given a predictor vector x. Clearly, the conventional RF has w(m) 1/Mn for m = 1, . . . , Mn.

3. Mallows-type Weighted RFs

Let PBL(m) be an n n hat matrix , of which the ith row is P BL(m)(xi, X, y, B(m), Θ(m)). Let

m=1 w(m)PBL(m),

m=1 w(m)by(m),

by(m) = by(m) 1 , . . . , by(m) n .

Deﬁne the following averaged squared error function

Ln(w) by(w) µ 2, (2)

which measures the sum of squared biases between the true µ = (µ1, . . . , µn) and its model averaging estimate by(w). Denote Rn(w) = E {Ln(w)|X}, where X is the σ-algebra generated by {x0 1, . . . , x0 n, B(1), . . . , B(Mn), Θ(1), . . . , Θ(Mn)}. We will propose some weight choice criteria to obtain weights based on Rn(w).

Optimal Weighted Random Forests

3.1 Mallows-type Weight Choice Criteria

Considering the choice of weights, we use the solution obtained by minimizing the following Mallows-type criterion (3) with the restriction of w H n w [0, 1]Mn : PMn m=1 w(m) = 1 o

Cn(w) = y P(w)y 2 + 2

i=1 e2 i Pii(w), (3)

where Pii(w) is the ith diagonal term in P(w), and e = (e1, . . . , en) is the true error term vector. This criterion is originally proposed by Zhao et al. (2016) for considering linear models. In the context of linear models, E{Cn(w) | X} equals the conditional risk Rn(w) up to a constant term that is irrelevant to w. Zhao et al. (2016) further show that the criterion is asymptotically optimal in the context considered therein. However, ei s are unobservable terms in practice. So they further consider the following feasible criterion, replacing the true error terms with averaged residuals

C n(w) = y P(w)y 2 + 2

i=1 ˆei(w)2Pii(w), (4)

ˆe(w) = {ˆe1(w), . . . , ˆen(w)} =

m=1 w(m)ˆe(m) = {In P(w)}y,

ˆe(m) is the residual vector for the mth candidate model, and In is an identity matrix with dimension n. This feasible criterion also accommodates conditional heteroscedasticity. Besides, it relies on all candidate models to estimate the true error vector, which avoids placing too much conﬁdence on a single model. Similar criterion has also been considered in Qiu et al. (2020). We apply criterion (4) to determine w in by(w). Criterion (4) comprises two terms. The ﬁrst term measures the ﬁtting error of the weighted RF in the training data, by computing the residual sum of squares. The second term penalizes the complexity of the trees in the forest. For each m {1, . . . , Mn}, PBL(m),ii denotes the ith diagonal term in PBL(m). As explained in Section 2, PBL(m),ii is the proportion of the ith observation to the total number of samples in the leaf that includes the ith observation. Thus, for each m {1, . . . , Mn} and i {1, . . . , n}, the larger the value of PBL(m),ii, the smaller the gap between yi and

by(m) i . In the most extreme case where a tree is so deep that the leaf node containing the ith observation is pure, PBL(m),ii equals 1, and by(m) i equals yi. Essentially, this tree has low prediction error within the training sample, but may exhibit poor generalization performance when applied to new data. To mitigate the contribution of overﬁtted trees in the ensemble output, this algorithm assigns lower weights to these trees, thereby decreasing the second term. From another aspect, noting that Pn i=1 ˆei(w)2Pii(w) min1 i n ˆei(w)2 Pn i=1 Pii(w), the summation part Pn i=1 Pii(w) = PMn m=1 w(m) Pn i=1 PBL(m),ii = PMn m=1 w(m)ℓ(m) represents the weighted number of leaves of all trees, where ℓ(m) is the number of leaves of the mth tree, representing the complexity of the tree. The regularized objective for minimization in the

Chen, Yu and Zhang

Extreme Gradient Boosting (XGBoost) algorithm, proposed by Chen and Guestrin (2016), also contains a penalty term that penalizes the number of leaves in the tree. In light of this, both the weighted RF with weights obtained by minimizing criterion (4) and XGBoost employ the number of leaves in a tree to measure its complexity. The regularization term helps to allocate the weights to avoid overﬁtting (Chen and Guestrin, 2016). Inherent from the Mallows criterion (Hansen, 2007), the resulting weights of (4) exhibit sparsity. It is important to note that some trees within a RF might contribute to the deterioration of the ensemble s overall performance. Therefore, forming a more accurate committee via removal of trees with poor performance is a more rational strategy. It is called the theorem of MCBTA ( many could be better than all ) introduced by Zhou et al. (2002), which indicates that for supervised learning, given a set of individual learners, it may be better to ensemble some instead of all of these individual learners. Employing sparse weights can be regarded as a form of adaptive tree selection, reducing the risk of integrating trees that could weaken the ﬁnal outcome of the ensemble. Additionally, it oﬀers advantages over model selection methods, which only choose a single tree and thereby ignore model uncertainty. In short, our approach with sparse weights provides a balanced and adaptive solution, selectively aggregating members while acknowledging the signiﬁcance of trees diversity. It is clear that criterion (4) is a cubic function of w, whose optimization is substantially more time-consuming than that of quadratic programming. To expedite the process, we further suggest an accelerated algorithm that estimates e using a vector that is irrelevant to w. The accelerated algorithm consists of two steps, where the ﬁrst step involves calculating the estimated error terms, and the second step involves substituting the vector obtained in the ﬁrst step for the true error terms in criterion (3). In speciﬁc, consider the following intermediate criterion,

C n(w) = y P(w)y 2 + 2

i=1 ˆσ2Pii(w), (5)

where ˆσ2 = y P(w0)y 2/n, and w0 is an n-dimensional vector with all elements being 1/Mn. Solve this quadratic programming task over w H, and get a solution w . Then, calculate the residual vector by

e = ( e1, . . . , en) = {In P(w )}y.

Next, consider the following criterion

C n(w) = y P(w)y 2 + 2

i=1 e2 i Pii(w). (6)

Both (5) and (6) are quadratic functions of w, while criterion (4) is a cubic function. Many contemporary software packages, such as quadprog in R or MATLAB, can eﬀectively handle quadratic programming problems. In fact, from the real data analysis conducted in Section 4.1, it is observed that the time required to solve two quadratic programming problems is notably lower compared to that required to solve a more intricate nonlinear programming problem of higher order. Please see Table 5 for more details.

Optimal Weighted Random Forests

Algorithm 1: 1step-WRFopt

Input: (1) The training data set D = {yi, xi}n i=1 (2) The number of trees in

random forest Mn Output: Weight vector bw H

1 for m = 1 to Mn do

2 Draw a bootstrap data set D(m) of size n from the training data set D;

3 Grow a random-forest tree ˆf(m) to the bootstrap data D(m), by recursively

repeating the following steps for each terminal node of the tree, until the

minimum node size nodesize is reached ; // nodesize, q are hyper

4 i. Select q variables at random from the p variables;

5 ii. Pick the best variable/ splitting point among the q;

6 iii. Split the node into two daughter nodes.

7 for i = 1 to n do

8 Drop xi down the the mth tree and get PBL(m)(xi, X, y, B(m), Θ(m)).

10 PBL(m) {PBL(m)(x1, X, y, B(m), Θ(m)), . . . , PBL(m)(xn, X, y, B(m), Θ(m))} .

12 Solve the convex optimization problem:

bw = ( bw(1), . . . , bw(Mn)) arg min w H C n(w).

We refer to the RF with tree-level weights derived from optimizing criterion (4) as 1 Step Optimal Weighted RF (1step-WRFopt), and the RF with weights of trees obtained by optimizing criterion (6) as 2 Steps Optimal Weighted RF (2steps-WRFopt). Their details are given in Algorithms 1 and B.1, respectively. For the sake of simplicity, we provide the complete description of Algorithm B.1 in Appendix B.1, since it shares a large part in common with Algorithm 1 except for the weight choice criteria. In the following, we use the WRFopt to refer to the algorithms including both the 1step-WRFopt and 2steps-WRFopt.

3.2 Asymptotic Optimality

Denote the selected weight vectors from C n(w) and C n(w) by

bw = arg min w H C n(w) and ew = arg min w H C n(w),

respectively. In this section, we aim to analyze the loss and risk behavior associated with bw and ew. Before providing the theoretical support for the proposed algorithms, as an intermediate step, we ﬁrst introduce a special tree-type algorithm which splits nodes without the guidance of response values in learning samples. Namely, the hat matrix is independent

Chen, Yu and Zhang

Tree-Based Candidate Models BL-of-RF-Based Candidate Models Dependence of P on y Least squares model averaging (Hansen, 2007)

Mallows-type averaging (Qiu et al., 2020)

WRFopt with SUT WRFopt with CART

Table 1: Flowchart of Theoretical Analysis (Inside the dotted box are our works.)

of the output values of learning samples. In this context, the vector PBL(xi, X, y, B, Θ) reduces to PBL(xi, X, B, Θ). This setup has also been imposed in the theoretical analysis of Geurts et al. (2006) and Biau (2012). Besides, this theoretical framework is referred to as honesty in the ﬁeld of causal inference (Athey and Imbens, 2016). We term this methodology as Split-Unsupervised Tree (SUT) in contrast to CART whose splitting criterion relies on response information. Adopting the SUT simpliﬁes theoretical analysis, aligning it with the framework of Mallows model averaging estimator (Hansen, 2007) in linear regression, where the hat matrix P is independent of response values. This approach is also similar to the estimators explored by Qiu et al. (2020), making it a useful intermediary for further analysis on the asymptotic optimality of our proposed weighted RFs. In the remaining part of this section, we ﬁrst discuss the asymptotic optimality of the WRFopt with SUT. Subsequently, we present the asymptotic optimality of the WRFopt with CART, which is the most important conclusion in this paper. This is achieved by theoretically linking the large sample behavior of WRFopt with CART to the behavior of WRFopt with SUT. The corresponding theoretical analysis process is outlined in Table 1. Appendix A provides the details of the CART algorithm and an example of SUT algorithm (used in Section 4.1).

3.2.1 Asymptotic Optimality with SUT

In this section, we will establish the asymptotic optimality of the 1step-WRFopt estimator and 2steps-WRFopt estimator with SUT trees. To diﬀerentiate notations associated with the SUT methodology from those used in CART, we employ the script on RF-related notations when referring to their SUT counterparts. This indicates that the notations pertain to the SUT methodology, while maintaining their fundamental meanings with the exception of the splitting criterion. For example, for m = 1, . . . , Mn, P BL(m) and PBL(m) share the same fundamental meaning, with the former associated with SUT trees and the latter related to CART trees. Likewise, P (w) = PMn m=1 w(m)P BL(m) contrasts with P(w) for CART trees. Let ξn = infw H Rn(w), n(m),l be the number of observations in the lth leaf of the mth

Optimal Weighted Random Forests

tree, and n = min1 m Mn min1 l ℓ(m) n(m),l be the smallest sample size (controlled by the hyper parameter nodesize in R package random Forest) across all leaves in all trees within the CART trees. We employ ξ n and n to denote the counterparts of ξn and n pertaining to SUT trees, respectively. For brevity, we will not enumerate each SUT-corresponding notation individually. We list and discuss some technical conditions as follows.

Condition 1 ξ 1 n M2 n = o(1) almost surely, and E ξ 1 n M2 n exists for all ﬁxed n 1.

Condition 2 There exists a positive constant v such that E e4 i | x0 i v < almost surely for i = 1, . . . , n.

Additionally, we deﬁne h(m),i as the number of occurrences of instance (yi, xi) in the mth

bootstrap sample. Following Chi et al. (2022), when the summation is over an empty set, we deﬁne its value as zero; also, we deﬁne 0/0 = 0. Thus, without loss of generality, we assume that h(m),i > 0 for all m = 1, . . . , Mn and i = 1, . . . , n in the remaining part of this study. Denote hmax = max1 m Mn,1 i n h(m),i.

Condition 3 hmaxn 1 n1/2 = O(1) almost surely.

Condition 4 For each ﬁxed i, j {1, . . . , n}, there exists a positive constant c such that ch(m),j h(m),i each m = 1, . . . , Mn, almost surely.

Condition 5 ξ 1 n Mnn1/2 = o(1) almost surely, and E ξ 1 n Mnn1/2 exists for all ﬁxed n 1.

Conditions 1 and 5 restrict the increasing rates of the number of trees Mn and regulate the behavior of the minimum averaging risk ξ n. Similar conditions have been considered and discussed by Zhang et al. (2020), Zhang (2021), Zou et al. (2022), and others. One typical scenario that guarantees these two conditions occurs when all trees are misspeciﬁed, which precludes any trees within the RF yield perfect predictions and dominate others. The misspeciﬁcation scenario is particularly common under high-dimensional data contexts, where important predictors might be omitted from tree construction due to the randomization process inherent in feature selection at each split. Besides, inspired by Chi et al. (2022), we ﬁrst rigorously deﬁne the notion of important predictors in the context of RF under a simpliﬁed scenario in Appendix C.7. Then, under this situation, it is reasonable to expect that the key identity ξ 1 n converges to 0 at the rate O(n 1). In such cases, the rate of convergence in Conditions 1 and 5 can both be further simpliﬁed to M2 nn 1 = o(1). Moreover, even when all the important predictors are incorporated into the tree-building process, in a very simpliﬁed situation where a non-adaptive tree is considered, Lin and Jeon (2006) show that the mean square error of the limiting value of the RF estimator is bounded below by C/{N logp 1(n)}, where C is a positive constant and N represents the maximal number of samples across all leaves and all trees. This indicates that ξ 1 n = O[1/{N logp 1(n)}]. In this case, Conditions 1 and 5 can be further reduced to M2 n/{N logp 1(n)} = o(1) and Mnn1/2/{N logp 1(n)} = o(1), respectively. However, given the complexity of RF, achieving an explicit rate of convergence for ξ 1 n under general RF scenarios remains an open question, which we identify as a direction for future research.

Chen, Yu and Zhang

Condition 2 imposes the boundedness of the conditional moments, which is a mild condition and can be found in much literature, including works by Hansen and Racine (2012), and Hansen (2007). Condition 3 is a high-level assumption that restricts the structure of the RF and its constituent trees, which is equivalent to Condition C.9 of Qiu et al. (2020) in the context therein. In fact, Condition 3 requires that the minimum sample size across all leaves and all trees grows with order no slower than n1/2. Besides, it yields that trace(P BL(m)) = O(n1/2), almost surely. This poses restrictions on the degrees-of-freedom or complexity of trees. In other words, trees should not be fully developed. Actually, as noted by Probst et al. (2019), increasing hyper parameter nodesize, leading to less tree leaves, can decreases the computation time approximately exponentially. Our practical experience, particularly with large data sets, suggests that setting this hyper parameter to a value higher than the default can signiﬁcantly reduce runtime, often without markedly compromising prediction performance. Speciﬁcally, Segal (2004) show an example where increasing the number of noise variables results in a higher optimal nodesize. Therefore, it is advisable to construct trees in a RF that are moderately developed, balancing between being too shallow, which may result in underﬁtting (Hastie et al., 2009; James et al., 2013), and being excessively deep, which can impose a signiﬁcant computational burden. Thus, Condition 3 is reasonable and easy to be satisﬁed in practice. Condition 4 excludes the unbalanced sampling in bootstrap sampling process, speciﬁcally ruling out the extreme case where certain samples disproportionately dominate the bootstrap samples. The following theorems establish the asymptotic optimality of the 1step-WRFopt estimator and 2steps-WRFopt estimator, respectively.

Theorem 1 (Asymptotic Optimality for 1step-WRFopt with SUT) Assume Conditions 1 - 4 hold. Then, as n ,

L n (bw ) infw H L n(w)

If, in addition, there exists an integrable random variable η such that {L n(bw ) ξ n} ξ 1 n η,

then R n(bw ) infw H R n(w)

Theorem 2 (Asymptotic Optimality for 2steps-WRFopt with SUT) Assume Conditions 1 - 5 hold. Then, as n ,

L n (ew ) infw H L n(w)

If, in addition, there exists an integrable random variable η such that {L n(ew ) ξ n} ξ 1 n η,

then R n(ew ) infw H R n(w)

Optimal Weighted Random Forests

Results obtained in (7) and (9) regard the asymptotic optimality in the sense of achieving the lowest possible squared loss, while (8) and (10) yield asymptotic optimality in the sense of achieving the lowest possible squared risk. Proofs of Theorems 1 and 2 are presented in Appendices C.2 and C.3, respectively.

3.2.2 Asymptotic Optimality with CART

Theoretical analysis regarding CART is generally very challenging, as the splitting criterion relies on response information and the black-box nature of the procedure (Scornet et al., 2015). Nevertheless, there are some pioneering studies in the literature that employ the CART splitting criterion. For example, in the context of additive regression models, Scornet et al. (2015) study the consistency of Breiman s results (Breiman, 2001). Moreover, Klusowski (2021) establishes the universal consistency of CART in the context of high dimensional additive models. In addition, Syrgkanis and Zampetakis (2020) analyze the ﬁnite sample properties of CART with binary features, under a sparsity constraint. More recently, Chi et al. (2022) establish the consistency rates for the original CART. Now, we are equipped to establish the asymptotic optimality of WRFopt using CART trees, which will be achieved by studying the diﬀerence between CART-based RF and its limiting version, based on the intermediate results established in Section 3.2.1. More speciﬁcally, following the recent work of Chi et al. (2022) and Scornet et al. (2015), we term the limiting version of CART-splitting criterion as the theoretical CART-splitting criterion. Unlike its practical counterpart, the theoretical CART-splitting criterion does not depend on response values and therefore can be considered as a special type of splitting criterion employed by SUT trees. Our analysis in this section focuses on the discrepancy between estimators using the CART-splitting criterion and those using the theoretical CART-splitting criterion (referred to as theoretical RF). To facilitate the technical presentation, we ﬁrst introduce additional notations for the structure of a tree and its cells. Following Chi et al. (2022), without loss of generality, we set xi [0, 1]p and deﬁne a cell as a rectangle t = p d=1td, representing the Cartesian product of the closed or half-closed interval td in [0, 1]. Let θk = {θk,1, . . . , θk,2k} with elements θk, {1, . . . , p} be the sets of available features for the 2k 1 splits at level k 1 that grow the 2k cells (including empty cells). Given the CART-splitting criterion and a set of θ1:k = {θ1, . . . , θk}, let T(θ1:k) = {t1:k,1, . . . , t1:k,2k} be the collection of all cell sequences connecting the root to the end cells at level k, which is determined by Θ. Each sequence t1:k,s for s = 1, . . . , 2k can be considered as a tree branch , representing a partition of [0, 1]p. One can refer to Figure 1 of Chi et al. (2022) for a graphical illustration for this splitting scheme. It is natural to consider varying tree heights within a RF, implying that the maximum level k depends on m for each m = 1, . . . , Mn. Therefore, we use k(m) to denote the maximum level of the mth tree. For a new instance x, which is an independent copy of x1, the mth tree estimator given T(θ1:k(m)) and {h(m),1, . . . , h(m),n} can be expressed as follows

bµT(θ1:k(m))(x) =

s=1 I x tk(m),s Pn i=1 h(m),iyi I xi tk(m),s

Pn i=1 h(m),i I xi tk(m),s

Chen, Yu and Zhang

h(m),i I x tk(m),s I xi tk(m),s

Pn i =1 h(m),i I xi tk(m),s , (11)

where tk(m),s is the end cell of the tree branch t1:k(m),s (also referred to leaves if it is not empty), and I( ) denotes the indicator function. Now, by deﬁnition, the (i, j)th element of PBL(m) can be expressed as

PBL(m),ij =

h(m),j I xi tk(m),s I xj tk(m),s

Pn j =1 h(m),j I xj tk(m),s , (12)

for all m = 1, . . . , Mn. Consistent with Chi et al. (2022), given a cell t, the CART-splitting criterion is deﬁned as

CARTt(d, c) = min c1 R1

Pn i=1 I(xid < c, xi t) (yi c1)2

+ min c2 R1

Pn i=1 I(xid c, xi t) (yi c2)2

with its theoretical version

CART t(d, c) = Pr (x1d < c | x1 t) var (y1 | x1d < c, x1 t)

+ Pr (x1d c | x1 t) var (y1 | x1d c, x1 t) ,

where d is the label of splitting variable, c is a corresponding splitting point, and xid is the dth entry of xi. Analogous to (12), we denote the corresponding theoretical version of PBL(m) pertaining to CART t(d, c) for each m {1, . . . , Mn} as

P BL(m) = P BL(m),ij

h(m),j I xi t k(m),s I xj t k(m),s

Pn j =1 h(m),j I xj t k(m),s

where t k(m),s for each m {1, . . . , Mn} and s {1, . . . , 2k(m)} is the sth end cell of the mth

base learner based on the theoretical CART-splitting criterion CART t(d, c). Further, we continue employing the script on RF-related notations when referring to their theoretical counterparts. Detailed deﬁnitions corresponding to theoretical CART-splitting criterion will not be enumerated here, in the interest of brevity. Note that the theoretical CART-splitting criterion is independent of response values and therefore can be viewed as a special type of splitting criterion used by SUT trees. Consequently, as an immediate application, Theorems 1 - 2 in Section 3.2.1 guarantee the asymptotic optimality of the WRFopt with theoretical CART-splitting criterion. In what follows, we introduce additional notations crucial for quantifying the discrepancy between the RF and theoretical RF. Let

Pr n I xi tk(m),s t k(m),s = 1 o = p(m),is,

Optimal Weighted Random Forests

δn = max 1 i n,1 m Mn,1 s 2k(m) I xi tk(m),s t k(m),s ,

where S1 S2 = (S1 S2) (S1 S2) is the symmetric diﬀerence between two generic sets S1, S2. Let Kn = max1 m Mn 2k(m) be the largest number of end cells (including empty end cells) among all trees, and N = max1 m Mn,1 l ℓ(m) n(m),l be the largest sample size across all leaves in all trees within the RF,2 with N being the theoretical counterpart of N. Now, we are equipped to explore the discrepancy between the RF and theoretical RF. The following result bounds the gap between two pivotal matrices PBL(m) and P BL(m) for all m = 1, . . . , Mn.

Lemma 1 Under Condition 4, there exist two positive constants c1, c2 such that

PBL(m),ij P BL(m),ij c1 Kn(N + N )δn, (14)

PBL(m),ij P BL(m),ij 2 + c2

almost surely, uniformly for all m = 1, . . . , Mn.

Clearly, δn plays a pivotal role in bounding the proximity between the matrices PBL(m) and P BL(m). The proof of Lemma 1 will be provided in Appendix C.4. Let ξ n and n be the counterparts of ξn and n under theoretical CART, we list and discuss some technical conditions as follows.

Condition 1 ξ 1 n M2 n = o(1) almost surely, and E ξ 1 n M2 n exists for all ﬁxed n 1.

Condition 2 There exist two positive constants v1 and v2 such that E |ei|r | x0 i v2 1vr 2 2 r!/2 almost surely for every i = 1, . . . , n and r 2.

Condition 3 hmaxn 1 n1/2 = O(1) almost surely.

Condition 5 ξ 1 n Mnn1/2 = o(1) almost surely, and E ξ 1 n Mnn1/2 exists for all ﬁxed n 1.

Condition 6 As n , N /n, N + N and p(m),is are bounded above by deterministic series, say, rn , Nn and pn, respectively.

2. Note that for all m = 1, . . . , Mn, l {1, . . . , ℓ(m)} represents the index of leaves (non-empty end cells) in the mth tree, while s {1, . . . , 2k(m)} denotes the index of all end cells, inclusive of those that are empty.

Chen, Yu and Zhang

Condition 7 As n , pn < 1/2,

ϵn = ξ 1 n Kn Nn log2 (n)

v u u t2 log(2n Mn Kn)

log 1 pn 1 + pn

almost surely, and E( ϵn) exists all ﬁxed n 1.

Conditions 1 - 3 and 5 correspond to Conditions 1 - 3 and 5, respectively, with the focus shifted to theoretical CART-splitting. Moreover, Condition 2 guarantees that E|ei|r v2 1vr 2 2 r!/2, which is referred to as Bernstein s moment condition (Zhang and Chen, 2021). Particularly, if {ei}n i=1 are generated independently by Normal(0, σ2), it follows that v2 1 = 2σ2 and v2 = σ2 (Zhang and Chen, 2021, Example 5.4). In addition, Condition 6 imposes restrictions on the hyper parameters of RF, requiring the behavior of these parameters does not become too erratic as the amount of data increases. Condition 7 requires that the CART criterion leads to reasonably accurate splits in the sense that the diﬀerence between tk(m),s and t k(m),s (measured by E(δn), which is bounded above by q

2 log(2n Mn Kn)/log p 1 n 1 + pn) is ignorable in comparison to ξn.

Theorem 3 (Asymptotic Optimality under CART and criterion 5) Assume that Conditions 1 - 3 , 4, and 6 - 7 hold. Then, as n ,

Ln(w ) infw H Ln(w)

where w is the solution of minimizing the criterion (5) over w H.

The proof of Theorem 3 is provided in Appendix C.5. Theorem 3 demonstrates the asymptotic optimality of the weighted RF with CART-splitting criterion, with weights obtained by criterion (5). This criterion is derived under a working homoskedastic framework and serves as a groundwork for further theoretical analysis. Based on Theorem 3, similar theoretical results regarding the heteroskedastic criteria (that is, C n(w) and C n(w)) can be established.

Corollary 1 (Asymptotic Optimality for 1step-WRFopt with CART) Assume Conditions 1 - 3 , 4, and 6 - 7 hold. Then, as n ,

Ln (bw) infw H Ln(w)

If, in addition, there exists an integrable random variable η such that {Ln(bw) ξn} ξ 1 n η,

then Rn(bw) infw H Rn(w)

Optimal Weighted Random Forests

Corollary 2 (Asymptotic Optimality for 2steps-WRFopt with CART) Assume Conditions 1 - 3 , 4, 5 and 6 - 7 hold. Then, as n ,

Ln (ew) infw H Ln(w)

If, in addition, there exists an integrable random variable η such that {Ln(ew) ξn} ξ 1 n η,

then Rn(ew) infw H Rn(w)

With the result established in Theorem 3, Corollaries 1 and 2 can readily be veriﬁed within the same framework that proves Theorem 2 and thus is omitted. Theorem 3 and Corollaries 1 and 2 extend the results in Theorems 1 and 2 from SUT to CART. We are unaware of any similar results in this ﬁeld.

4. Numerical Study

In this section, we will conduct experiments using real and semi-synthetic data, the latter derived from the former, to evaluate the performance of diﬀerent weighted RFs.

4.1 Real Data Analysis

To assess the prediction performance of diﬀerent weighted RFs in practical situations, we used 11 data sets from the UCI data repository for machine learning (Dua and Graﬀ, 2017). Because most of these data sets are low-dimensional, one additional high-dimensional data set from openml.org (Vanschoren et al., 2013) was also included. The details of the 12 data sets are listed in Table 2. Appendix B features a demonstration of two competitors, namely w RF and CRF. For the sake of brevity, in the following, we will refer to each data set by its abbreviation. We randomly partitioned each data set into training data, testing data and validation data, in the ratio of 0.5 : 0.3 : 0.2. The training data was used to construct trees and to calculate weights, and the test data was used to evaluate the predictive performance of diﬀerent algorithms. The validation data was employed to select tuning parameters, such as the exponent in the expression for calculating weights in the w RF, and probability sequence in the SUT algorithm. In this section, the number of trees Mn was set to 100. Before each split, the dimension of random feature sub-space q was set to p/3 , which is the default value in the regression mode of the R package random Forest. We set the minimum leaf size nodesize to n in CART trees and 5 in SUT trees, in order to control the depth of trees. We also tried other values of Mn and nodesize, and the patterns of the performance remain stable in general. Figures D.12 - D.23 in Appendix D will provide more information on the robustness of the proposed methods over diﬀerent RF hyper parameters.3

3. In our robustness tests for Mn, we varied Mn across 100, 200, 400 and 800, while keeping nodesize ﬁxed at n . For nodesize robustness tests, nodesize was set to the quintiles within the range of [5, n ], with Mn ﬁxed at 100.

Chen, Yu and Zhang

Data set Abbreviation Attributes Samples Boston Housing BH 12 506 Servo Servo 4 167 Concrete Compressive Strength CCS 9 1030 Airfoil Self-Noise ASN 5 1503 Combined Cycle Power Plant CCPP 4 9568 Concrete Slump Test CST 7 103 Energy Eﬃciency EE 8 768 Parkinsons Telemonitoring PT 20 5875 QSAR aquatic toxicity QSAR 8 546 Synchronous Machine SM 4 557 Yacht Hydrodynamics YH 6 308 Tecator Tecator 124 240

Table 2: Summary of 12 Data Sets

For each strategy, the number of replication was set to D = 1000 and the forecasting performance was accessed by the following two criteria:

MSFE = 1 D ntest

i=1 (yi byi,d)2 and MAFE = 1 D ntest

i=1 |yi byi,d| ,

where ntest is the size of testing data, and byi,d is the forecast for yi in the dth repetition. MSFE and MAFE are abbreviations of Mean Squared Forecast Error and Mean Absolute Forecast Error , respectively. As noted in Section 1, an averaging strategy with appropriately selected unequal weights may outperform simple averaging if individual learners display non-identical strength. To ascertain the relationship between the performance of the WRFopt and the diversity level of base learners, we employ the following weighted correlation between the residuals r(m) = y by(m) and r(m ) = y by(m ) where 1 m, m Mn and m = m , as proposed by Breiman (2001),

1 m<m Mn ρ(r(m), r(m )) sd(r(m)) sd(r(m ))

m=1 sd(r(m))

where ρ(r(m), r(m )) is the correlation between r(m) and r(m ), and sd(r(m)) is the standard deviation of r(m). It is clear that a larger ρ signiﬁes reduced diversity among base learners. In this scenario, it is expected that equal weights may also yield reasonable performance (Zhou, 2012). In fact, our numerical experiments below echo this speculation, showing that the WRFopt strategies signiﬁcantly outperform conventional RFs when ρ is relatively small, for instance, ρ < 0.5. Consequently, with a relatively small ρ, one can expect more pronounced improvements in predictive performance by adopting appropriate unequal weights. Next, we will provide the results of diﬀerent weighting techniques on RFs with CART trees and RFs with SUT trees, respectively.

Optimal Weighted Random Forests

Data set RF 2steps-WRFopt 1step-WRFopt w RF CRF BH 15.484(5) 13.958(1) 14.038(2) 14.517(3) 14.664(4)

Servo 1.610(5) 0.825(1) 0.836(2) 0.860(3) 1.169(4)

CCS 60.460(5) 50.004(1) 50.048(2) 52.868(3) 54.065(4)

ASN 20.022(5) 14.572(1) 14.575(2) 15.611(3) 17.054(4)

CCPP 18.016(5) 16.065(2) 16.062(1) 16.237(3) 16.556(4)

CST 26.421(5) 19.877(1) 20.074(2) 21.500(3) 22.534(4)

EE 4.332(5) 3.643(2) 3.642(1) 3.964(3) 4.087(4)

PT 14.641(5) 8.653(2) 8.649(1) 9.099(3) 10.819(4)

QSAR 1.436(5) 1.423(3) 1.434(4) 1.417(1) 1.420(2)

SM( 10 4) 6.981(5) 3.403(1) 3.404(2) 4.342(3) 5.214(4)

YH 35.442(5) 3.727(1) 3.735(2) 5.603(3) 13.422(4)

Tecator 5.274(5) 3.295(2) 3.282(1) 3.458(3) 3.879(4)

Table 3: Test Error Comparisons by MSFE for Diﬀerent Forests with CART Trees

4.1.1 RFs with CART Trees

Tables 3 and 4 exhibit the risks of RFs with CART trees in terms of MSFE and MAFE, respectively. Each row presents the results for a speciﬁc data set, comparing the risks associated with diﬀerent RF algorithms across columns. Values in parentheses in the upper right corner indicate the risk ranking for each RF algorithm within the same data set, with a lower rank denoting a lower risk. Regarding MSFE, the 1step-WRFopt or 2steps-WRFopt estimator manifests the best performance in 11 out of 12 data sets, whereas exhibits the best performance in 10 out of 12 data sets in terms of MAFE. It is observed that the w RF becomes the best method in some data sets. Of all cases considered, the CRF is found to never be the best method. It is also noticeable that the 2steps-WRFopt is superior to the 1step-WRFopt in most cases, albeit with minor diﬀerences. Table 5 compares the time consumption of the 2steps-WRFopt and 1step-WRFopt algorithms for a single run, averaged over D repetitions, with the ratio of the latter to the former in the fourth column. Apparently, the 2steps-WRFopt can accelerate optimization by tens or hundreds of times when compared to the 1step-WRFopt, given that solving quadratic optimization is considerably faster than solving a higher-order nonlinear optimization task. To further assess the performance of the 2steps-WRFopt over other competing methods, we evaluated their relative risks with respect to the 2steps-WRFopt. Speciﬁcally, we calculated the relative risks of the RF, 1step-WRFopt, w RF, and CRF by dividing their respective risks by that of the benchmark 2steps-WRFopt. In the following, we assert that a relative risk is not essential if it falls in the interval of (0.95, 1.05), while it is essential if it is lower than 0.95 or higher than 1.05. The relative MSFE and MAFE of each method on 12 data sets are reported in Figures 1 and 2, respectively. The results are depicted by blue, green, purple and red bars, respectively, for the RF, 1step-WRFopt, w RF, and CRF. Furthermore, to illustrate the relationship between the performance of the WRFopt and the diversity level of base learners, we displayed averaged ρ over D replications below the names of the data sets, arranged in ascending order from the smallest to the largest.

Chen, Yu and Zhang

Data set RF 2steps-WRFopt 1step-WRFopt w RF CRF BH 2.608(5) 2.536(1) 2.549(3) 2.539(2) 2.562(4)

Servo 0.900(5) 0.550(2) 0.550(3) 0.535(1) 0.754(4)

CCS 6.092(5) 5.500(1) 5.503(2) 5.668(3) 5.751(4)

ASN 3.607(5) 3.013(1) 3.013(2) 3.121(3) 3.295(4)

CCPP 3.243(5) 3.058(2) 3.058(1) 3.075(3) 3.108(4)

CST 4.023(5) 3.425(1) 3.445(2) 3.568(3) 3.676(4)

EE 1.563(5) 1.349(2) 1.349(1) 1.423(3) 1.487(4)

PT 2.933(5) 2.201(2) 2.201(1) 2.241(3) 2.493(4)

QSAR 0.892(4) 0.888(3) 0.892(5) 0.885(1) 0.887(2)

SM( 10 2) 2.044(5) 1.374(1) 1.375(2) 1.551(3) 1.746(4)

YH 3.877(5) 1.182(1) 1.182(2) 1.358(3) 2.329(4)

Tecator 1.637(5) 1.319(2) 1.318(1) 1.362(3) 1.435(4)

Table 4: Test Error Comparisons by MAFE for Diﬀerent Forests with CART Trees

Data set 2steps-WRFopt 1step-WRFopt Ratio BH 0.065 3.898 60.371 Servo 0.072 1.347 18.778 CCS 0.081 6.822 83.982 ASN 0.094 6.468 68.840 CCPP 0.566 2.785 4.916 CST 0.075 2.061 27.630 EE 0.072 3.498 48.304 PT 0.291 40.445 138.967 QSAR 0.069 3.148 45.431 SM 0.060 1.409 23.327 YH 0.065 2.631 40.546 Tecator 0.041 2.710 66.098

Table 5: Time Consumption Comparisons (Unit: seconds)

Some ﬁndings are worth mentioning in Figure 1. First, the improvement of the WRFopt (including the 1step-WRFopt and 2steps-WRFopt) over the conventional RF is essential in 11 out of 12 data sets. What stands out in the ﬁgure is that the relative MSFEs of others with respect to the benchmark are conspicuously large in the YH data set. This spells the great success of our WRFopt methods in practice. More importantly, the WRFopt methods outperform competitors essentially in 8 out of 12 data sets, while none of the competitors dominate the benchmark essentially in all cases, underscoring the robustness of the WRFopt. Figure 2 remains the similar qualitative results, albeit with less notable power of the WRFopt than Figure 1. Speciﬁcally, the WRFopt shows essential improvement over the conventional RF in 10 out of 12 data sets, and dominates all competitors essentially in 3 out of 12 data sets. These proportions are relatively lower than those in Figure 1. But none of the competitors surpass the benchmark essentially in all cases, which is consistent with Figure 1.

Optimal Weighted Random Forests

0.85 0.90 0.95 1.00 1.05 1.10 1.15 1.20 1.25 1.30 1.35 1.40 1.45 1.50 1.55 1.60 1.65 1.70 1.75 1.80 1.85 1.90 1.95 2.00 2.05 2.10

Servo 0.725 Data sets

Relative MSFE

RF 1step WRFopt w RF CRF

Figure 1: Relative MSFE for Diﬀerent Forests with CART Trees (The horizontal axis shows the names of 12 data sets, arranged in ascending order of their corresponding averaged ρ displayed beneath each data set s abbreviation. The green bars are barely discernible, with a relative risk close to 1 due to their negligible predictive error diﬀerence compared to the 2steps-WRFopt. This pattern consistent across subsequent Figures 2 - 4 and 7 - 8.)

Note that the w RF algorithm requires tuning a parameter outside of the training set, whereas the WRFopt and CRF do not. For the fairness of the comparison, all three weighted RFs should use identical tree models built in the same training data set. Were WRFopt and CRF not compared with w RF, they can employ more training samples, potentially leading to superior predictive performance than currently observed. Combining all the ﬁndings together, we can conclude that the proposed WRFopt methods yield more accurate predictions compared to the conventional RF and other existing weighted RFs in most cases. Additionally, it is clear from Figures 1 and 2 that in the ﬁrst four data sets, which have relatively low averaged ρ , WRFopt methods tend to yield more pronounced performance over their competitors, with the exception of the Servo data set. In other words, these ﬁrst four data sets grow trees with greater diversity, suggesting a stronger preference for unequal weights over equal weights, thus yielding better performance of the WRFopt. This observation aligns with our anticipation, and exactly the motivation for adopting unequal weights, as discussed in Section 1. The Servo data set stands out as an exception, possessing the highest ρ value among the 12 data sets yet exhibiting relatively pronounced predictive

Chen, Yu and Zhang

1.70 2.15 2.60 3.05 3.50

Servo 0.725 Data sets

Relative MAFE

RF 1step WRFopt w RF CRF

Figure 2: Relative MAFE for Diﬀerent Forests with CART Trees (The horizontal axis shows the names of 12 data sets, arranged in ascending order of their corresponding averaged ρ displayed beneath each data set s abbreviation.)

performance. This anomaly is likely due to its small number of samples and attributes (n = 167 and p = 4). Similar phenomenon has also been observed in the literature, as seen in Hansen (2007), Zhang et al. (2013) and Zhang et al. (2016).

4.1.2 RFs with SUT Trees

For comparison purpose, we also study the performance of our proposed methods based on SUT trees, reporting results in Tables 6 and 7. The 1step-WRFopt or 2steps-WRFopt estimator consistently outperforms the conventional RFs and the two competitors in terms of MSFE, while performing best in 11 out of 12 data sets in terms of MAFE. Additionally, the gaps between the 1step-WRFopt and 2steps-WRFopt are relatively small, akin to RFs with CART trees. The relative MSFE and MAFE are depicted in Figures 3 and 4, respectively. With SUT trees rather than CART trees, the WRFopt methods perform better at upgrading equal-weight forests. Concerning the MSFE and MAFE, the number of supporting data sets remains 11 and 10 out of 12 data sets, respectively. Additionally, the proportion of outperforming rivals climbs to 10 out of 12 data sets in terms of MSFE and 8 out of 12 data sets in terms of MAFE. Notably, the advantages in the SM, YH, and EE data sets are

Optimal Weighted Random Forests

Data set RF 2steps-WRFopt 1step-WRFopt w RF CRF BH 38.213(5) 24.516(1) 24.522(2) 28.797(3) 29.974(4)

Servo 1.604(5) 0.964(1) 0.968(2) 0.998(3) 1.232(4)

CCS 149.276(5) 119.471(1) 119.505(2) 136.243(4) 132.435(3)

ASN 36.391(5) 33.465(1) 33.472(2) 35.675(4) 35.419(3)

CCPP 50.329(5) 36.619(2) 36.613(1) 39.145(4) 38.600(3)

CST 42.899(5) 25.933(2) 25.928(1) 32.806(3) 36.783(4)

EE 17.768(5) 5.140(2) 5.140(1) 6.193(3) 8.026(4)

PT 98.864(5) 89.299(1) 89.325(2) 97.924(4) 95.321(3)

QSAR 1.737(5) 1.657(1) 1.661(2) 1.680(4) 1.668(3)

SM( 10 4) 20.963(5) 0.212(1) 0.212(2) 0.251(3) 4.460(4)

YH 33.241(5) 2.433(2) 2.431(1) 3.171(3) 8.152(4)

Tecator 143.584(5) 101.741(1) 101.940(2) 126.409(3) 128.264(4)

Table 6: Test Error Comparisons by MSFE for Diﬀerent Forests with SUT Trees

Data set RF 2steps-WRFopt 1step-WRFopt w RF CRF BH 3.759(5) 3.128(1) 3.131(2) 3.314(3) 3.349(4)

Servo 0.847(5) 0.578(1) 0.578(2) 0.606(3) 0.701(4)

CCS 9.914(5) 8.764(1) 8.764(2) 9.421(4) 9.273(3)

ASN 4.903(5) 4.689(2) 4.685(1) 4.813(4) 4.765(3)

CCPP 5.805(5) 4.860(2) 4.858(1) 5.019(4) 4.999(3)

CST 5.209(5) 3.927(2) 3.926(1) 4.455(3) 4.773(4)

EE 3.356(5) 1.611(2) 1.609(1) 1.799(3) 2.094(4)

PT 7.995(5) 7.553(2) 7.544(1) 7.948(4) 7.815(3)

QSAR 1.001(5) 0.979(3) 0.980(4) 0.977(2) 0.972(1)

SM( 10 2) 3.674(5) 0.290(1) 0.291(2) 0.304(3) 1.628(4)

YH 3.508(5) 0.856(1) 0.857(2) 0.902(3) 1.483(4)

Tecator 9.624(5) 7.787(1) 7.805(2) 8.886(3) 8.993(4)

Table 7: Test Error Comparisons by MAFE for Diﬀerent Forests with SUT Trees

particularly substantial. Besides, the relationship between the performance of the WRFopt and the diversity level of SUT trees follows a similar pattern to that observed in Figures 1 and 2. Without response data for guiding splits, the WRFopt methods with SUT trees yield worse predictive performance than their counterparts with CART trees. However, it is worthwhile noting that the improvement of the WRFopt methods over RFs employing equal weights become more signiﬁcant, demonstrating the potential advantage of our WRFopt with weaker base learners.

4.2 Semi-Synthetic Experiments

To further investigate the data set characteristics that aﬀect the eﬀectiveness of the WRFopt, we conducted semi-synthetic experiments using the real data sets in Section 4.1. Clearly,

Chen, Yu and Zhang

8 38 68 98 128

0.85 0.95 1.05 1.15 1.25 1.35 1.45 1.55 1.65 1.75 1.85 1.95 2.05 2.15 2.25 2.35 2.45 2.55 2.65 2.75 2.85 2.95 3.05 3.15 3.25 3.35 3.45 3.55

Servo 0.820

PT 0.919 Data sets

Relative MSFE

RF 1step WRFopt w RF CRF

Figure 3: Relative MSFE for Diﬀerent Forests with SUT Trees (The horizontal axis shows the names of 12 data sets, arranged in ascending order of their corresponding averaged ρ displayed beneath each data set s abbreviation.)

both the signal-to-noise ratio (SNR) and dimension are critical characteristics of data sets. Therefore, we will ﬁrst assess the performance of WRFopt on semi-synthetic data sets under various noise scenarios. Moreover, given our prior analysis on low-dimensional data sets, we now turn our attention to the performance of diﬀerent algorithms on high-dimensional data, which has been augmented via feature engineering.

4.2.1 Improvement Ratio vs SNR

In order to examine the eﬀectiveness of the WRFopt on noisy data sets, we consider three diﬀerent noise injection schemes, similar to Reis et al. (2018):4

(a) Noise in the predictive variables only (that is, X),

(b) Noise in the response variable only (that is, y),

(c) Noise both in the predictive and response variables (that is, X and y).

We follow Reis et al. (2018) to conﬁgure noise injection:

4. They focus on classiﬁcation scenarios.

Optimal Weighted Random Forests

3 6 9 12 15

0.85 0.90 0.95 1.00 1.05 1.10 1.15 1.20 1.25 1.30 1.35 1.40 1.45 1.50 1.55 1.60 1.65 1.70 1.75 1.80 1.85 1.90 1.95 2.00 2.05 2.10 2.15 2.20

Servo 0.820

PT 0.919 Data sets

Relative MAFE

RF 1step WRFopt w RF CRF

Figure 4: Relative MAFE for Diﬀerent Forests with SUT Trees (The horizontal axis shows the names of 12 data sets, arranged in ascending order of their corresponding averaged ρ displayed beneath each data set s abbreviation.)

For response variable or continuous predictive variables The noise comes from a Gaussian distribution, with its magnitude randomly set for each object and feature in the data set. Speciﬁcally, a per-object noise coeﬃcient, No for each o {1, . . . , n}, is randomly drawn from a uniform distribution between 0 and 1. Likewise, a per-feature noise coeﬃcient, Nf for each f = 1, . . . , p, follows the same uniform distribution. Then, the noise coeﬃcient for a speciﬁc measurement, corresponding to a particular object-feature pair, is deﬁned as No,f = No Nf Ns. Here, Ns is the overall noise coeﬃcient for the data set, which will be varied from 0 to 1 throughout the experiment. The synthetic noise for the fth feature is deﬁned by σo,f = No,f σf, with σf being the standard deviation of the given feature across all objects. This multiplication ensures that the noisy data retains the same physical units as the original. Finally, the noisy measurement for each object-feature pair is drawn from the normal distribution Normal xo,f, σ2 o,f , where xo,f denotes the original measurement. To evaluate performances of diﬀerent strategies as a function of the noise level in the data set, σo,f/σf, averaged across diﬀerent features and objects, can be used to represent the average scatter due to the noise with respect to the intrinsic scatter of the original data set (Reis et al., 2018). Thus, it can be considered as the average inverse of the relative

Chen, Yu and Zhang

SNR of the semi-synthetic data set, with respect to the real data set. An Ns value of 0 implies no noise injection, while a value of 1 yields an relative SNR close to the scale of 4. By combining the inherent noise from the original data with the injected noise, the real SNR of the noisy data (that is, semi-synthetic data) is suﬃciently low to mimic the actual noisy scenarios encountered in real world.

For categorical predictive variables the probability of a class switch is determined by po,f = po pf. In this context, po and pf are independently drawn from a uniform distribution ranging between 0 and 0.5. Consequently, the class of the object-feature pair is randomly switched to another class with a probability of po,f/(C 1), where C represents the total number of classes for the given feature. That is, the class remains unchanged with a probability of 1 po,f. Note that the number of categorical features in the data set is considerably less than that of continuous features, resulting in a relatively minor impact on the overall noise level of the data. For simplicity, we opt not to introduce an additional parameter for varying noise levels speciﬁcally for categorical predictive variables.

The 12 noisy data sets serve as inputs to the RF algorithms, and we focus on the relative performance of four weighted RFs over conventional RFs, rather than the predictive performance itself. Hence, the improvement ratio (IR) is used to evaluate diﬀerent algorithms:

Pntest i=1 yi by RF i,d 2

Pntest i=1 yi by XRF i,d 2 1

where by RF i,d and by XRF i,d represent the forecasts for yi by conventional RF and any weighted RF in the dth repetition, respectively. If not speciﬁed, other settings remain the same as those in Section 4.1.1. The IR values of all weighted strategies under various noise levels were calculated for all 12 data sets. For the sake of simplicity, we display only two representative ﬁgures here, while the remaining ﬁgures can be found in Appendix D. Figures 5 and 6 illustrate the IR of the 1step-WRFopt, 2steps-WRFopt, w RF and CRF on the ASN and CCPP data sets, represented by blue, green, purple, and red lines, respectively. Additionally, the averaged ρ under diﬀerent noise scenarios are shown using yellow bars. It is clear from Figures 5 and 6 that the WRFopt algorithms consistently outperform the conventional RF and two competitors across all noise scenarios. As expected, the IR values of all weighted RFs decrease as the noise level increases. This phenomenon can be attributed to data sets with higher levels of noise yielding trees that are less informative and possess reduced predictive power, thereby diminishing the beneﬁts from post-processing (that is, weighting strategy). What stands out in the ﬁgures is the inverse relationship between ρ and IR, highlighting the practical signiﬁcance of ρ as an indicator for the eﬀectiveness of WRFopt. Regarding Figures D.1 - D.10 in Appendix D, the trends are mostly align to those observed in the ASN and CCPP data sets. However, note that the WRFopt methods continue to underperform compared to the conventional RF or the two competitors in varied noise scenarios on the data sets where they failed in Section 4.1. To summarize, SNR plays an unneglectable role in aﬀecting the improvability of the WRFopt over equal weight strategy.

Optimal Weighted Random Forests

ρ 1step WRFopt 2steps WRFopt w RF CRF

Improvement Ratio

(a) Noise in X Only

(b) Noise in y Only

(c) Noise in Both X and y

Figure 5: Improvement Ratio vs Noise on ASN Data Set (The green and blue lines are nearly indistinguishable from each other, owing to the similar performance between the 1step-WRFopt and 2steps-WRFopt. This pattern may consistent across subsequent Figures 6, D.1 - D.10 and D.12 - D.23).

Chen, Yu and Zhang

ρ 1step WRFopt 2steps WRFopt w RF CRF

Improvement Ratio

(a) Noise in X Only

(b) Noise in y Only

(c) Noise in Both X and y

Figure 6: Improvement Ratio vs Noise on CCPP Data Set

More speciﬁcally, it is observed from numerical study that when the data are deemed to have relatively low noise levels, WRFopt strategies typically provide more obvious improvements.

4.2.2 Performance under High-Dimensional Setting

Having analyzed performance on low-dimensional data sets earlier (that is, where the sample size is much larger than the number of attributes), we now shift our focus to high-dimensional data sets. For this purpose, we semi-synthesize high-dimensional data using all the data sets from Section 4.1, generating additional attributes by sklearn.preprocessing.Polynomial Features (Pedregosa et al., 2011). More speciﬁcally, we create new attributes comprising all polynomial combinations of the original attributes up to a speciﬁed degree. For instance, in a two-dimensional feature set represented as {a, b}, the degree-2 polynomial attributes would be {1, a, b, a2, ab, b2}. The details of these 12 semi-synthetic data sets are provided in Table 8, with the last column indicating the degree parameter used in Polynomial Features.

In alignment with the analytical methods in Section 4.1, we present the risks of diﬀerent RFs regarding MSFE and MAFE in Tables D.1 and D.2, respectively. For brevity, these tables are included in Appendix D, as their patterns are similar to those in Tables 3 and 4

Optimal Weighted Random Forests

Data set Abbreviation Attributes Samples Degree Boston Housing BH 558 506 3 Servo Servo 363 167 3 Concrete Compressive Strength CCS 1286 1030 5 Airfoil Self-Noise ASN 1286 1503 8 Combined Cycle Power Plant CCPP 10625 9568 20 Concrete Slump Test CST 119 103 3 Energy Eﬃciency EE 1286 768 5 Parkinsons Telemonitoring PT 10625 5875 4 QSAR aquatic toxicity QSAR 494 546 4 Synchronous Machine SM 494 557 8 Yacht Hydrodynamics YH 461 308 5 Tecator Tecator 7874 240 1

Table 8: Summary of 12 High-Dimensional Data Sets

but with notably smaller values.5 Additionally, the relative MSFE and MAFE, based on the 2steps-WRFopt, are depicted in Figures 7 and 8, respectively. While the scale of improvement for conventional RFs is less pronounced in these ﬁgures compared to their original counterparts, which is expected given the signiﬁcant risk reduction achieved through feature engineering. Generally, it is challenging to achieve comparable levels of improvement in a model that has already undergone substantial optimization. Nonetheless, the WRFopt methods continue to exhibit commendable performance on high-dimensional data. Specifically, they signiﬁcantly improve conventional RFs in 5 out of 12 data sets in terms of MSFE, and in 4 out of 12 data sets in terms of MAFE. Moreover, none of the competitors consistently outperform the WRFopt across all scenarios, highlighting the robustness of WRFopt in high-dimensional situations. Once again, it is clear that the WRFopt methods show more pronounced improvement when associated with relatively low ρ , similar to the low-dimensional scenario. As mentioned before, this phenomenon is owing to the fact that the data set with large ρ yields similar base learners and thus the conventional RF with equal weights can also provide promising prediction in this situation.

5. Conclusion

This study proposes an optimal weighted RF algorithm for regression and the corresponding accelerated variant is also studied. These methods are proven to be asymptotically optimal. We also employ a cost-eﬃcient indicator, ρ, to guide RF users in deciding whether to consider the WRFopt methods as a post-processing technique to enhance predictive performance. Empirical evidence demonstrates that the proposed methods yield lower risks compared to RFs with equal weights and other existing unequally weighted forests, on both low-

5. Conventional RFs exhibit enhanced predictive power after feature engineering, potentially reducing risks to as little as one percent of the original. For a comprehensive comparison of relative risks between conventional RFs using augmented (that is, high-dimensional) data sets and original data sets, refer to Figure D.11 in Appendix D. It is a common practice in machine learning to employ automated feature engineering tools to boost the predictive capabilities of models.

Chen, Yu and Zhang

Servo 0.948 Data sets

Relative MSFE

RF 1step WRFopt w RF CRF

Figure 7: Relative MSFE for Diﬀerent Forests on High-Dimensional Data (The horizontal axis shows the names of 12 data sets, arranged in ascending order of their corresponding averaged ρ displayed beneath each data set s abbreviation.)

dimensional and high-dimensional data. Additionally, our weighting methods show good robustness across various conﬁgurations of key hyper parameters within the RF algorithm, as veriﬁed in numerical experiments. In light of the results obtained, we provide the following suggestions for RF users seeking to improve the predictive capabilities of their ﬁnely-tuned models:

1. We suggest RF users to consider the WRFopt under the following scenarios:

(a) Pre-RF building: The data are considered to exhibit a relatively high signal-tonoise ratio or to be well cleansed.

(b) Post-RF building: The ρ of the RF model is relatively small, for instance, ρ < 0.5.

2. We recommend using the 2steps-WRFopt rather than the 1step-WRFopt as the former oﬀers comparable performance but is less computationally burdensome.

While the current study focuses on regression, it is also important to study the optimal weighted RF for classiﬁcation with diﬀerent loss functions. We identify this as a promising future research direction.

Optimal Weighted Random Forests

Servo 0.948 Data sets

Relative MAFE

RF 1step WRFopt w RF CRF

Figure 8: Relative MAFE for Diﬀerent Forests on High-Dimensional Data (The horizontal axis shows the names of 12 data sets, arranged in ascending order of their corresponding averaged ρ displayed beneath each data set s abbreviation.)

Acknowledgments

The authors would like to thank the Action Editor Professor Mladen Kolar and two anonymous reviewers for their constructive suggestions and comments that have substantially improved earlier version of this article. Dalei Yu was supported by National Natural Science Foundation of China (NNSFC, grant no. 12071414). Xinyu Zhang was supported by National Natural Science Foundation of China (NNSFC, grant nos. 71925007, 72091212 and 12288201), Beijing Natural Science Foundation (Z240004) and the CAS Project for Young Scientists in Basic Research (YSBR-008). All authors equally contributed to this work and are listed in the alphabetical order.

Chen, Yu and Zhang

Appendix Appendix A. Tree-Building Algorithms

We will elucidate the diﬀerences between the two practical splitting criteria in this appendix. When constructing RFs using CART trees, we employ Algorithm A.1, and when building them with SUT trees, we adopt Algorithm A.2. The structures of SUT trees are developed in an unsupervised manner, eliminating the reliance on response values during splits, whereas CART trees use the information of y to obtain the best splitting variables and cut points. They are the same in other procedures, such as growing on the bootstrap data. Note that there are many ways to grow SUT trees, provided that their splitting processes are not dependent on response values. Algorithm A.2 represents just one of these methods. When selecting the probability sequence P in Algorithm A.2, we built conventional RFs with CART trees in the validation data to compute variables importance. The variable importance is the total decrease in node impurities from splitting on the variable, averaged over all trees. For regression, the node impurity is measured by residual sum of squares. After that, the probability sequence P was determined by the normalized variables importance.

Appendix Appendix B. Detailed Demonstration of the 2steps-WRFopt and Competitors

In this appendix, we provide details of the 2steps-WRFopt algorithm, and present an exposition of two weighted RF algorithms introduced in Section 1. Since these two competitors are proposed for classiﬁcation trees, we further describe a methodology to transform classiﬁcation patterns into regression patterns to address regression tasks.

B.1 2steps-WRFopt

The following Algorithm B.1 presents the complete 2steps-WRFopt algorithm described in Section 3.1.

B.2 Weighted RF (w RF)

Much of the current literature on binary classiﬁcation pay particular attention to out-of-bag data. Namely, Li et al. (2010) use the accuracy in the out-of-bag data as an index of the classiﬁcation ability of a given tree. This metric is subsequently employed to assign weights to the individual trees. Winham et al. (2013) provide a family of weights choice based on the prediction error in the out-of-bag data of each tree. The reason why using out-of-bag individuals instead of another shared data set is that it gives internal estimates that are helpful in understanding the predictive performance and how to improve it without testing data set aside (Breiman, 2001). Speciﬁcally, Winham et al. (2013) deﬁne the tree-level prediction error (t PE), measuring the predictive ability of the mth tree as follows

t PEm = 1 Pn i=1 OOBim

i=1 |vim yi| OOBim, (B.1)

where vim is the vote for the ith subject in the mth tree, and OOBim is the indicator for the out-of-bag status of the ith subject in the mth tree. By drawing on the concept of t PE, they

Optimal Weighted Random Forests

Algorithm A.1: CART

Split_a_node(S)

Input: The local learning subset S corresponding to the node we want to split

Output: A split [a < c] or nothing

-If Stop_split(S) is TRUE then return nothing.

-Otherwise select q attributes Aq = aj1, . . . , ajq randomly among all non constant

(in S) candidate attributes;

-Return the best split s , where s = Find_the_best_split(S, Aq).

Find_the_best_split(S, Aq)

Input: The subset S and the selected attribute list Aq Output: The best split

- Seek the splitting variable aj and cut point c that solve

mind {j1,...,jq},c CARTt(d, c);

- Return the split [aj < c].

Stop_split(S)

Input: A subset S

Output: A boolean

- If |S| < nodesize, then return TRUE;

- If all attributes are constant in S, then return TRUE;

- If the output is constant in S, then return TRUE;

- Otherwise, return FALSE.

have been able to show that weights inversely related to t PE are appropriate. Such as

w(m) = 1 t PEm, (B.2)

w(m) = exp 1 t PEm

w(m) = 1 t PEm

λ for some λ. (B.4)

In their proposed w RF algorithm, they normalized weights of the form

w(m) = w(m) PMn m=1 w(m) .

Chen, Yu and Zhang

The classiﬁcation model can be easily turned into a regression model by simply changing (B.1) to the following deﬁnition

t PE m = 1 Pn i=1 OOBim

ˆf(m)(xi) yi OOBim, (B.5)

where ˆf(m)(xi) is the prediction for yi by the mth tree. The details of the w RF in regression pattern is in Algorithm B.2, which selects (B.4) for example. For simplicity, we only present the best result of the w RF family as a representative in Section 4.

B.3 Cesáro RF (CRF)

Another unequally weighted RF mentioned earlier is the CRF proposed by Pham and Olafsson (2019), which replaces the regular average with the Cesáro average. Their method is based on a renowned theory that if a sequence converges to a number c, then the Cesáro sequence also converges to c. To implement the CRF, a strategy for sequencing Mn trees from best to worst must be established. This can be done by ranking trees based on their out-of-bag error rates or accuracy on a separate training set. Next, a weight sequence w(m) Mn m=1 is obtained by arranging weights in descending order, where w(m) = PMn ν=m ν 1, with normalizer being PMn m=1 PMn ν=m ν 1. This classiﬁcation model can be easily converted into a regression model as well through a simple modiﬁcation in the sequencing methods. We can draw t PE deﬁned by the w RF algorithm, and subsequently rank trees using out-of-bag data. The details of the CRF in regression pattern are in Algorithm B.3.

Appendix Appendix C. Proofs and Derivations

In the current appendix, we provide proofs of the lemmas and theorems in Section 3.2, along with derivations and further discussions to support our ﬁndings.

C.1 Preliminary Results

The following preliminary results will be used in the proofs. Note that this appendix and the following ones, Appendices C.2 - C.3, focus on the SUT methodology and are based on the work by Qiu et al. (2020). Their theoretical framework presumes tree structures are independent of response values (that is, hat matrix is independent of y). Consequently, notations throughout these appendices are distinguished with the script for clarity and consistency.

Lemma C.1 For each m {1, . . . , Mn}, let P X(m) denote the hat matrix corresponding to any algorithm for ensemble. In addition to the independence of P X(m) from y, assume the following Conditions C.4 - C.9 hold as well.6

C.4 There exists a positive constant c0 such that for all m, r {1, . . . , Mn},

trace P X(m)P X(m) c0 > 0 and trace P X(m)P X(r) 0,

6. The condition labels here are directly adopted from Qiu et al. (2020) for ease of reference.

Optimal Weighted Random Forests

almost surely.

C.5 There exists a positive constant c1 such that for all m {1, . . . , Mn},

ζmax P X(m)P X(m) c1,

almost surely, where ζmax(B) denotes the largest singular value of a generic matrix B.

C.6 There exists a positive constant c2 such that for all m, r {1, . . . , Mn},

trace P2 X(m) c2 trace P X(m)P X(m) ,

and trace P X(r)P X(m)P X(r)P X(m) c2 trace P X(m)P X(m) ,

almost surely.

C.7 ξ 1 n M2 n 0 almost surely, as n , and E(ξ 1 n M2 n) exists for any ﬁxed n 1.

C.8 There exists a positive constant v such that E e4 i | x0 i v < almost surely for i = 1, . . . , n.

C.9 ιn = max1 m Mn max1 i n ι(m) ii = O(n 1/2) almost surely, as n , where ι(m) ii is the ith diagonal element of P X(m).

Then, as n , we have (7).

The original version of Lemma C.1 is established in the proof of Theorem 1 in Qiu et al. (2020) for non-stochastic X. However, as demonstrated in Appendix C.6, by substituting expectations with conditional expectations, and applying the Law of Iterated (or Total) Expectation, Pull-out rule and Lebesgue s Dominated Convergence Theorem, the proof by Qiu et al. (2020) can be readily extended to accommodate scenarios with stochastic X. Thus we omit the step-by-step proof in the current study. Next, we introduce four other lemmas for proving Theorems 1 - 3.

Lemma C.2 (Gao et al., 2019) Let

ew = argminw H {Ln(w) + an(w) + bn} ,

where an(w) is a term related to w and bn is a term unrelated to w. If

sup w H |an(w)| /Rn(w) = op(1), sup w H |Rn(w) Ln(w)| /Rn(w) = op(1),

and there exists a constant c and a positive integer n0 so that when n n0 and infw H Rn(w) c > 0 almost surely, then Ln(ew)/ infw H Ln(w) 1 in probability.

Chen, Yu and Zhang

Lemma C.3 (Saniuk and Rhodes, 1987) For any n n matrices G1 and G2 with both G1, G2 0, trace(G1G2) G1 2 trace(G2),

where 2 denotes the spectral norm or largest singular value. Besides, for any n n symmetric matrices G1 and G2 with G2 0,

trace(G1G2) λmax(G1) trace(G2),

where λmax( ) denotes the largest eigenvalue.

Lemma C.4 (Buldygin and Moskvichova, 2013) Let β(p) denote a Bernoulli random variable with probabilities p [0, 1] and q = 1 p. Let β(0)(p) be the centered Bernoulli random variable with parameter p, that is

β(0)(p) = β(p) E {β(p)} = β(p) p.

Then, when p (0, 1/2), the variance proxy (or the square of sub-Gaussian norm) of β(0)(p) is

log 1 p 1 ,

which is a monotonically increasing function of p.

Lemma C.5 (Zhang and Chen, 2021) Let {Zi}n i=1 be sub-Gaussian random variables (without independence assumption) with mean zero and variance proxy τ 2. Then, we have

E max 1 i n |Zi| τ p

C.2 Proof of Theorem 1

To verify (7), it remains to verify that Conditions 1 - 4 in the main paper can guarantee Conditions C.4 - C.9 in Lemma C.1. Having clearly conﬁgured the base learners within RFs, we now verify that Conditions C.4 - C.6 are satisﬁed when P BL(m) is used in place of P X(m). For each m = 1, . . . , Mn, i = 1, . . . , n and j = 1, . . . , n, recall that

P BL(m),ij =

h(m),j I xi t k(m),s I xj t k(m),s

Pn j=1 h(m),j I xj t k(m),s ,

which is the counterpart of PBL(m)i,j (the precise deﬁnition of PBL(m)i,j is given in Equation 12),7 pertaining to SUT trees. Then, there exists a positive constant c such that

trace P BL(m)P BL(m)

7. This notation is introduced in Section 3.2.2, under the context of CART trees, but is also applicable to SUT trees.

Optimal Weighted Random Forests

h(m),j I xi t k(m),s I xj t k(m),s

Pn j=1 h(m),j I xj t k(m),s

h2 (m),j I xi t k(m),s I xj t k(m),s I xi t k(m),k I xj t k(m),k

Pn j=1 h(m),j I xj t k(m),s Pn j=1 h(m),j I xj t k(m),k

h2 (m),j I xi t k(m),s I xj t k(m),s

n Pn j=1 h(m),j I xj t k(m),s o2

h(m),ih(m),j I xi t k(m),s I xj t k(m),s

n Pn j=1 h(m),j I xj t k(m),s o2

Pn i=1 h(m),i I xi t k(m),s Pn j=1 h(m),j I xj t k(m),s

n Pn j=1 h(m),j I xj t k(m),s o2

= c ℓ(m) > 0, (C.1)

almost surely, where the fourth step is from Condition 4. Clearly, for all m = 1, . . . , Mn, the elements of P BL(m) are non-negative. Therefore, for all m, r {1, . . . , Mn}, it is clear that

trace P BL(m)P BL(r) 0.

Thus, Condition C.4 in Lemma C.1 is satisﬁed. Besides, we have

P BL(m) = max 1 i n

h(m),j I xi t k(m),s I xj t k(m),s

Pn l=1 h(m),l I xl t k(m),s

= max 1 i n

s=1 I xi t k(m),s Pn j=1 h(m),j I xj t k(m),s

Pn l=1 h(m),l I xl t k(m),s

P BL(m) 1 = max 1 j n

h(m),j I xi t k(m),s I xj t k(m),s

Pn l=1 h(m),l I xl t k(m),s

= max 1 j n

s=1 I xj t k(m),s Pn i=1 h(m),j I xj t k(m),s

Pn l=1 h(m),l I xl t k(m),s

c max 1 j n

s=1 I xj t k(m),s Pn i=1 h(m),i I xi t k(m),s

Pn l=1 h(m),l I xl t k(m),s

Chen, Yu and Zhang

=c max 1 j n

s=1 I xj t k(m),s

almost surely. Here, the last step in (C.2) is from the fact that P2k(m) s=1 I xi t k(m),s 1, and the inequality in (C.3) comes from Condition 4. Combining (C.2) and (C.3), we have

ζ2 max P BL(m)P BL(m) =ζ2 max P BL(m)P BL(m)

=λmax P BL(m)P BL(m)P BL(m)P BL(m)

= P BL(m)P BL(m) 2

P BL(m) P BL(m) 1 c,

almost surely. Thus, Condition C.5 in Lemma C.1 is satisﬁed. In addition, from Condition 4 and (C.1), we have

trace P2 BL(m) =

h(m),lh(m),i I xi t k(m),s I xl t k(m),s

n Pn j=1 h(m),j I xj t k(m),s o2

h2 (m),l I xi t k(m),s I xl t k(m),s

n Pn j=1 h(m),j I xj t k(m),s o2

= c trace P BL(m)P BL(m) ,

almost surely. Additionally, by Lemma C.3, for each m, r {1, . . . , Mn}, we have

trace P BL(m)P BL(r)P BL(m)P BL(r)

trace P BL(r)P BL(r)P BL(m)P BL(m)

λmax P BL(r)P BL(r) trace P BL(m)P BL(m)

c trace P BL(m)P BL(m) ,

almost surely, where the ﬁrst inequality comes from the fact that trace n A B 2o

trace AA BB for any generic matrices A, B Rm n, and the second step stems from Lemma C.3. Thus, Condition C.6 in Lemma C.1 is satisﬁed. Further, Conditions 1 and 2 in the main text are equivalent to Conditions C.7 and C.8 in Lemma C.1, respectively. Besides, Condition 3 guarantees Conditions C.9. Based on these observations, it is readily seen that Lemma C.1 holds under Conditions 1 - 4, and this proves (7).

Optimal Weighted Random Forests

By (7) in Section 3, and (A34) in Proof of Theorem 2 by Qiu et al. (2020) , we further have L n(bw )ξ 1 n = 1 + op(1). (C.4)

This is owing to the fact that

L n(bw )ξ 1 n sup w H

n L n(bw )L n 1(w) o sup w H

n L n(w)R n 1(w) o

n L n(bw )L n 1(w) o 1 + sup w H

n |L n(w) R n(w)| R n 1(w) o

=1 + op(1), (C.5)

L n(bw )ξ 1 n sup w H

n L n(bw )L n 1(w) o inf w H

n L n(w)R n 1(w) o

n L n(bw )L n 1(w) o 1 + inf w H

n (L n(w) R n(w)) R n 1(w) o

n L n(bw )L n 1(w) o 1 sup w H

n |L n(w) R n(w)| R n 1(w) o

= 1 + op(1). (C.6)

Then, under the same framework of Appendix A.4 in Zhang et al. (2020), it is readily seen from Lebesgue s Dominated Convergence Theorem that E {L n(bw ) ξ n} ξ 1 n 0, and this proves (8).

C.3 Proof of Theorem 2

Based on Lemma C.2, we now present the proof of Theorem 2. It is seen that

C n (w) = C n(w) + 2

e 2 i e2 i P ii(w),

where C n(w) and C n (w), related to SUT trees, are the counterparts of Cn(w) and C n(w), respectively. Hence, from Lemma C.2 above and the derivation of (24) in Qiu et al. (2020), in order to prove (9), we need only to verify that

e 2 i e2 i P ii(w)

= op(1). (C.7)

For each m = 1, . . . , Mn, let ι(m) ii be the ith diagonal element of P BL(m),

Q (m) = diag ι(m) 11 , . . . , ι(m) nn ,

m=1 w(m)Q (m),

Chen, Yu and Zhang

and K n = diag e 2 1 e2 1, . . . , e 2 n e2 n .

Then, we have

e 2 i e2 i P ii(w)

|trace {Q (w)K n}| R n(w) .

We observe that for any δ > 0, under Conditions 2 and 3,

trace {Q (w)K n} R n(w)

trace Q (m)K n

trace Q (m)K n

m=1 E n trace Q (m)K n ξ 1 n o

m=1 E n trace Q (m) K n ξ 1 n o

m=1 E ℓ (m) K n ξ 1 n

c1δ 1Mn E n1/2 K n ξ 1 n

c1δ 1Mn E1/2 n1/2ξ 1 n 2 E1/2 K n 2

c2δ 1E1/2 Mnn1/2ξ 1 n 2 ,

where c1 and c2 are positive constants, the second inequality follows from the Markov s Inequality, the fourth inequality is obtained by Lemma C.3, the ﬁfth inequality comes from Condition 3, the last second inequality is from the Cauchy-Schwarz Inequality, and the last step stems from the boundedness of E1/2 K n 2 by combining equality (C.2) and Conditions 2 and 3. Thus, (9) is proved by Condition 5 and the Lebesgue s Dominated Convergence Theorem. Similar to the proof techniques of Theorem 1, we have L n(ew )ξ 1 n = 1 + op(1), which yields (10).

C.4 Proof of Lemma 1

It follows from the deﬁnitions of PBL(m),ij and P BL(m),ij that

PBL(m),ij P BL(m),ij

Optimal Weighted Random Forests

h(m),j I xi tk(m),s I xj tk(m),s

Pn j =1 h(m),j I xj tk(m),s

h(m),j I xi tk(m),s I xj t k(m),s

Pn j =1 h(m),j I xj t k(m),s

h(m),j I xi tk(m),s I xj t k(m),s

Pn j =1 h(m),j I xj t k(m),s

h(m),j I xi t k(m),s I xj t k(m),s

Pn j =1 h(m),j I xj t k(m),s

ΞC41 + ΞC42. (C.8)

In addition, note that

I xj tk(m),s I xj t k(m),s = I xj tk(m),s I xj t k(m),s 2 .

From Condition 4, one has

h(m),j I xi tk(m),s I xj tk(m),s

Pn j =1 h(m),j I xj tk(m),s

h(m),j I xi tk(m),s I xj t k(m),s

Pn j =1 h(m),j I xj tk(m),s

h(m),j I xi tk(m),s I xj t k(m),s

Pn j =1 h(m),j I xj tk(m),s

h(m),j I xi tk(m),s I xj t k(m),s

Pn j =1 h(m),j I xj t k(m),s

I xj tk(m),s I xj t k(m),s h(m),i I xi tk(m),s

Pn j =1 h(m),j I xj tk(m),s

Pn j =1 I xj tk(m),s I xj t k(m),s Pn j =1 h(m),j I xj tk(m),s Pn j =1 h(m),j I xj t k(m),s

i=1 h(m),ih(m),j I xi tk(m),s I xj t k(m),s )

Chen, Yu and Zhang

I xj tk(m),s I xj t k(m),s

I xj tk(m),s I xj t k(m),s 2

n I xj tk(m),s + I xj t k(m),s o

2c Kn(N + N )δn, (C.9)

almost surely, and

h(m),j I xi tk(m),s I xi t k(m),s I xj t k(m),s

Pn j =1 h(m),j I xj t k(m),s

Kn(N + N )δn. (C.10)

Thus, (14) is veriﬁed by combining (C.9) - (C.10) with (C.8).

Besides, for each i = 1, . . . , n, we have

PBL(m),ij P BL(m),ij

h(m),j I xi tk(m),s I xj tk(m),s

Pn j =1 h(m),j I xj tk(m),s h(m),j I xi t k(m),s I xj t k(m),s

Pn j =1 h(m),j I xj tk(m),s

h(m),j I xi t k(m),s I xj t k(m),s

Pn j =1 h(m),j I xj tk(m),s h(m),j I xi t k(m),s I xj t k(m),s

Pn j =1 h(m),j I xj t k(m),s

h(m),j I xi t k(m),s I xj tk(m),s I xj t k(m),s Pn j =1 h(m),j I xj tk(m),s

h(m),j I xj tk(m),s I xi tk(m),s I xi t k(m),s Pn j =1 h(m),j I xj tk(m),s

h(m),j I xi t k(m),s I xj t k(m),s

Pn j =1 h(m),j I xj tk(m),s h(m),j I xi t k(m),s I xj t k(m),s

Pn j =1 h(m),j I xj t k(m),s

ΞC43 + ΞC44 + ΞC45. (C.11)

Optimal Weighted Random Forests

Note that P2k(m) s=1 I xi t k(m),s 1 and P2k(m) s=1 I xi tk(m),s 1 for all i = 1, . . . , n and m = 1, . . . , Mn. Similar to the proofs of (C.9) and (C.10), it is seen that under Condition 4,

s=1 I xi t k(m),s n X

h(m),i I xj tk(m),s I xj t k(m),s Pn j =1 h(m),j I xj tk(m),s

s=1 I xi t k(m),s Pn j=1 h(m),j I xj tk(m),s + I xj t k(m),s Pn j =1 h(m),j I xj tk(m),s

almost surely,

I xi tk(m),s I xi t k(m),s

n I xi tk(m),s + I xi t k(m),s o

=2δn, (C.13)

s=1 I xi t k(m),s

Pn j=1 h(m),j I xj t k(m),s

Pn j =1 h(m),j I xj t k(m),s

Pn j =1 h(m),j I xj t k(m),s I xj tk(m),s Pn j =1 h(m),j I xj tk(m),s

s=1 I xi t k(m),s Pn j =1 h(m),j n I xj t k(m),s + I xj tk(m),s o

Pn j =1 h(m),j I xj tk(m),s

s=1 I xi t k(m),s

Thus, (15) can be veriﬁed by combining (C.11) - (C.14) together and this concludes the proof.

C.5 Proof of Theorem 3

In accordance with Section 3.2.2, notations without the script correspond to the CARTsplitting criterion, while those with the script are related to the theoretical CART-splitting

Chen, Yu and Zhang

criterion. It is evident that δn is a key factor bounding the diﬀerence between matrices PBL(m) and P BL(m). It is clear that

Pr n I xi tk(m),s t k(m),s = 0 o = 1 p(m),is,

from the deﬁnition of p(m),is. Therefore, when 0 < p(m),is < 1/2, from Lemma C.4,

I xi tk(m),s t k(m),s p(m),is is a centered Bernoulli random variable with variance proxy

τ 2 p(m),is =

1 2 p(m),is

log 1 p(m),is 1 ,

where τ 2 p(m),is is a strictly increasing function of p(m),is. Then, for p(m),is (0, 1/2), we have

max 1 i n,1 m Mn,1 s 2k(m) τ 2 p(m),is

1 2 log 1 pn 1 .

By Lemma C.5 (which does not require the assumption of independence), for each r 1, we have

E|δn|r = E|δn| E

max 1 i n,1 m Mn,1 s 2k(m)

I xi tk(m),s t k(m),s p(m),is

v u u t2 log(2n Mn Kn)

log 1 pn 1 + pn, (C.15)

and this provides the upper bound of E|δn|. With the above point addressed, we now resume the primary trajectory of the proof. It is suﬃcient to verify that

|Rn(w) Ln(w)|

Rn(w) = op(1), (C.16)

|C n(w) Ln(w)|

Rn(w) = op(1). (C.17)

We will verify them successively. Note that

|Rn(w) Ln(w)|

Rn(w) sup w H

|Rn(w) R n(w)| Rn(w) + sup w H

|L n(w) Ln(w)|

|R n(w) L n(w)| Rn(w) , (C.18)

|C n(w) Ln(w)|

Rn(w) sup w H

|{C n(w) Ln(w)} {C n (w) L n(w)}| Rn(w)

Optimal Weighted Random Forests

|C n (w) L n(w)| Rn(w) . (C.19)

Moreover, we have

|Rn(w) R n(w)|

E n µ P (w)P(w)µ + 2µ P (w)P(w)e + e P (w)P(w)e | X o

E n µ P (w)P (w)µ + 2µ P (w)P (w)e + e P (w)P (w)e | X o

+2 E n µ P(w)µ + µ P(w)e | X o E n µ P (w)µ + µ P (w)e | X o

E h e n P (w)P(w) P (w)P (w) o e | X i

+2 E h µ n P (w)P(w) P (w)P (w) o e | X i

+ µ n P (w)P(w) P (w)P (w) o µ

+2 E h µ {P(w) P (w)} e | X i + 2 µ {P(w) P (w)} µ , (C.20)

|Ln(w) L n(w)|

µ P (w)P(w)µ + 2µ P (w)P(w)e + e P (w)P(w)e

n µ P (w)P (w)µ + 2µ P (w)P (w)e + e P (w)P (w)e o

+2 n µ P(w)µ + µ P(w)e o n µ P (w)µ + µ P (w)e o . (C.21)

Now, by (14) of Lemma 1, there exists a positive constant c1 such that

e n P (w)P(w) P (w)P (w) o e

r=1 w(m)w(r) e P BL(m)PBL(r) P BL(m)P BL(r) e

max 1 m,r Mn

e P BL(m)PBL(r) P BL(m)P BL(r) e

max 1 m,r Mn e 2

PBL(m),ti PBL(r),tj PBL(m),ti P BL(r),tj

+ PBL(m),ti P BL(r),tj P BL(m),ti P BL(r),tj

max 1 m,r Mn e 2

PBL(r),tj P BL(r),tj

i=1 PBL(m),ti

PBL(m),ti P BL(m),ti

j=1 P BL(r),tj

Chen, Yu and Zhang

c1 e 2 Kn(N + N )δn c1 e 2 Kn Nnδn, (C.22)

almost surely, where the last second step is from the fact that

i=1 PBL(m),ti =

i=1 P BL(m),ti 1,

for all m = 1, . . . , Mn and t = 1, . . . , n, and the last step comes from Condition 6. Similarly, one has

µ n P (w)P(w) P (w)P (w) o e c1 µ e Kn Nnδn, (C.23)

almost surely, and

µ n P (w)P(w) P (w)P (w) o µ c1 µ 2 Kn Nnδn, (C.24)

almost surely. Besides, we have

µ {P(w) P (w)} e sup w H

m=1 w(m) µ PBL(m) P BL(m) e

µ PBL(m) P BL(m) e

max 1 m Mn µ e

PBL(m),ij P BL(m),ij

c1 µ e Kn Nnδn, (C.25)

almost surely, and

µ {P(w) P (w)} µ c1 µ 2 Kn Nnδn, (C.26)

almost surely. Then, under Conditions 2 , 4, 6, and 7, by combining (C.20), (C.22) and (C.23) - (C.26) together, it is readily seen that there exists a positive constant c3 such that

E supw H |Rn(w) R n(w)| ξn

c3 Kn Nn E δn(1 + e + e 2 ) ξn

+ E1/2 e 2 E1/2 δn

+ E1/2 e 4 E1/2 δn

3c3E1/2 K2 n N2 n log4(n)δn

3c3E1/4(δn)E1/4 K4 n N4 n log8(n)

Optimal Weighted Random Forests

v u u t2 log(2n Mn Kn)

log 1 pn 1 + pn

E1/4 K4 n N4 n log8(n)

= o(1), (C.27)

almost surely. Here, several facts contribute to obtaining (C.27). The second inequality arises from the Cauchy-Schwartz Inequality. The third is owing to the Maximal Inequality with Bernstein s moment conditions (Zhang and Chen, 2021, Proposition 7.1), along with the fact that E1/r|Z|r is a non-decreasing function of r for r > 0 and any generic random variable Z. The penultimate step results from (C.15). Finally, the last equality is from Condition 7 combined with the Lebesgue s Dominated Convergence Theorem. By (C.21), under the same conditions and framework that derives (C.27), we have

E supw H |Ln(w) L n(w)| ξn

= o(1), (C.28)

almost surely. Besides, since the theoretical CART-splitting criterion is independent of response values, it can be considered as a special type of splitting criterion used by SUT trees. Therefore, under Conditions 1 - 3 and 4, by combining Theorem 1, (C.27) and the proof of Theorem 1 by Qiu et al. (2020), we have

|R n(w) L n(w)| Rn(w) sup w H

|R n(w) L n(w)| R n(w)

1 + sup w H

|Rn(w) R n(w)| Rn(w)

= op(1). (C.29)

Then, (C.16) can be veriﬁed by combining this with (C.18), (C.20), (C.21), (C.27) and (C.28) together. Additionally,

|{C n(w) Ln(w)} {C n (w) L n(w)}|

= 2e {P (w) P(w)}µ 2e {P (w) P(w)}e

+2bσ2 trace{P(w)} 2bσ2 trace{P (w)}

2 e {P (w) P(w)}µ + 2 e {P (w) P(w)}e

+2bσ2 trace{P(w) P (w)} + 2 bσ2 bσ2 trace{P (w)} . (C.30)

Under the same framework that establishes (C.25), one has

e {P(w) P (w)} e c1 e 2 Kn Nnδn. (C.31)

In the light of (15) in Lemma 1 and Condition 6, there exists a positive constant c2 such that

sup w H |trace{P(w) P (w)}| max 1 m Mn

PBL(m),ii P BL(m),ii

Chen, Yu and Zhang

δn{2 + c2(1 + rn)}, (C.32)

almost surely. As analogues of (C.2) and (C.3) in the context of CART trees, under Condition 4, there exists a positive constant c such that PBL(m) = 1,

and PBL(m) 1 c,

almost surely, for all m = 1, . . . , Mn. Then, we have

1 Mn PBL(m)

PBL(m) max 1 m Mn

PBL(m) 1 PBL(m) c,

(C.33) almost surely. Therefore, it is seen that

E1/2 bσ2 2 = E1/2 y P(w0)y 2

2E1/2 y 2 + P(w0) 2 y 2

2(1 + c2)E1/2 y 2

= O(1), (C.34)

almost surely, where the third step comes from (C.33), and the last step is from Condition 2 . Likewise, we have

E1/2 bσ2 bσ2 2 E1/2 y P(w0)y 2

2 + E1/2 y P (w0)y 2

= O(1), (C.35)

almost surely. By combining this with (C.25), (C.31), (C.32), (C.34) and (C.35) with (C.30), we have

E supw H | {C n(w) Ln(w)} {C n (w) L n(w)} | ξn

= o(1). (C.36)

Similar to (C.29), under Conditions 1 - 3 and 4, by combining Theorem 1, (C.27) and the proof of Theorem 1 in Qiu et al. (2020), we have

|C n (w) L n(w)| Rn(w) sup w H

|C n (w) L n(w)| R n(w)

1 + sup w H

|Rn(w) R n(w)| Rn(w)

= op(1). (C.37)

By combining (C.36), (C.37) with (C.19), we obtain (C.17) and this concludes the proof.

Optimal Weighted Random Forests

C.6 Verifying Results of Qiu et al. (2020) for Stochastic X

For the sake of convenience, Qiu et al. (2020) assume in all proofs that X is non-stochastic rather than stochastic. However, the framework developed in Qiu et al. (2020) can readily be extended for stochastic X. We take

e A (w)µ /Rn(w) = op(1)

for example, where A (w) = In P (w). In Appendix A.2 of Qiu et al. (2020), this result is labeled as (A6). Note that the framework proposed by Qiu et al. (2020) necessitates the independence of the hat matrix from response values, warranting notations in this section to carry the script . Proof of (A6) in Qiu et al. (2020) when X is Stochastic. Let Ω= diag(σ2 1, . . . , σ2 n),

A (m) = In P BL(m) for all m = 1, . . . , Mn. Let Φ = µ A (m)A (s)µ

Mn Mn, that is,

the (m, s)th component of Φ is µ A (m)A (s)µ, Gn Mn = A (1)µ, . . . , A (Mn)µ , Ψ = n trace P BL(m)P BL(s)Ω o

Ψ0 = diag n trace P BL(1)P BL(1)Ω , . . . , trace P BL(Mn)P BL(Mn)Ω o .

So Φ = G G. For any w H,

w Ψ0w w Ψw, (C.38)

because for any m, s {1, . . . , Mn} , w(m) 0, w(s) 0 and trace P BL(m)P BL(s)Ω 0 by Condition C.4. In addition, it is clear that

R n(w) = E P (w)µ µ + P (w)e 2 X

= A (w)µ 2 + trace n P (w)P (w)Ω o

= w (Φ + Ψ)w

w (Φ + Ψ0) w, (C.39)

where the last step is from (C.38). We also have

Φ + Ψ0 > 0, (C.40)

because Φ = G G and Ψ0 > 0 by deﬁnition. Let ρ = e A (1)µ . . . , e A (Mn)µ . It is straightforward to show that

E(ρ|X) = 0. (C.41)

Besides, under Condition 2, there exists a positive constant v such that

Var(ρ|X) = E ρρ X

= E e A (m)µµ A (s)e

Chen, Yu and Zhang

= µ A (s)ΩA (m)µ

Mn Mn vΦ, (C.42)

almost surely. It is seen that

R 2 n (w) = sup w H

PMn m=1 w(m)e A (m)µ 2

w (Φ + Ψ0) w sup w H

ξ 1 n ρ (Φ + Ψ0) 1 ρ, (C.43)

where the third step is from (C.39), and the last step is from (C.40) and Lemma 1 in Qiu et al. (2020). By Markov Inequality, we have that for any δ > 0,

Pr n ξ 1 n ρ (Φ + Ψ0) 1 ρ > δ o

δ 1E n ξ 1 n ρ (Φ + Ψ0) 1 ρ o

= δ 1E h E n ξ 1 n ρ (Φ + Ψ0) 1 ρ X oi

= δ 1E h ξ 1 n E n ρ (Φ + Ψ0) 1 ρ X oi

= vδ 1E h ξ 1 n trace n (Φ + Ψ0) 1 Φ oi

vδ 1E h ξ 1 n trace n (Φ + Ψ0) 1 Φ + Ψ1/2 0 (Φ + Ψ0) 1 Ψ1/2 0 oi

= vδ 1E ξ 1 n Mn , (C.44)

where the second step follows from the Law of Iterated (or Total) Expectation, the third step is obtained by the Pull-out rule, and the fourth step is guaranteed by (C.41) and (C.42). Combining (C.43), (C.44) and Condition 1, we obtain similar result in (A6) of Qiu et al. (2020) by the Lebesgue s Dominated Convergence Theorem. This demonstration bears a striking resemblance to Proof of (A6) in Appendix A.2 of Qiu et al. (2020). The only modiﬁcation lies in the substitution of expectations with conditional expectations in (C.41) and (C.42), as well as the applications of the Law of Iterated Expectation, Pull-out rule, and Lebesgue Dominated Convergence Theorem in (C.44). Similarly, (A7) - (A9) and (A34) - (A35) for proving Theorems 1 and 2 in Qiu et al. (2020) can also be extrapolated using the same techniques.

C.7 Some Additional Discussions on the Behavior of Risk Function When Relevant/Important Features Are Not Involved in the Model

In many practical situations, investigators often cannot include all the relevant features in the model (Flynn et al., 2013), which leads to unignorable misspeciﬁcation. In the current

Optimal Weighted Random Forests

section, we will adopt a simpliﬁed setting to demonstrate the impact of ignoring important or relevant features in weighted random forest. For simplicity, we use the full sample to grow random forest and only theoretical CART is considered (similar setting has also been considered in Chi et al. (2022)). In this case, the cells are only constructed based on Θ = {Θ(1), . . . , Θ(Mn)}, and X reduces to the σalgebra generated by {x0 1, . . . , x0 n, Θ}. Now we formally introduce the notion of ignoring relevant/important features in our context. Inspired by the Deﬁnition 1 in Chi et al. (2022), we say our variable pool misses relevant/important features when

E µ(x0 1) E µ(x0 1) | x1, Θ 2 c0 > 0 (C.45)

for some positive constant c0, where µ(x0 1) = µ1. In this scenario, we will demonstrate that the conditional risk is asymptotically bounded below by a positive constant. To see this, assume that min1 m Mn E n I x1 tk(m),s | Θ o c0. Then, for a given tree, we have

m=1 w(m)bµm(xi)

m=1 w(m){µ(x0 i ) bµm(xi)}

s=1 I xi tk(m),s Pn j=1 yj I xj tk(m),s

Pn j=1 I xj tk(m),s

s=1 I xi tk(m),s Pn j=1 n µ(x0 j) + εj o I xj tk(m),s

Pn j=1 I xj tk(m),s

s=1 I xi tk(m),s E n µ(x0 1)I x1 tk(m),s | Θ o

E n I x1 tk(m),s | Θ o

s=1 I xi tk(m),s

"E n µ(x0 1)I x1 tk(m),s | Θ o

E n I x1 tk(m),s | Θ o

Pn j=1 µ(x0 j)I xj tk(m),s

n Pn j=1 I xj tk(m),s

s=1 I xi tk(m),s Pn j=1 εj I xj tk(m),s

Pn j=1 I xj tk(m),s

s=1 I xi tk(m),s E n µ(x0 1)I x1 tk(m),s | Θ o

E n I x1 tk(m),s | Θ o

+ r1,i + r2,i.

Chen, Yu and Zhang

Now we aim to bound r1,i and r2,i. In speciﬁc, if we assume that

and max 1 m Mn 2k(m) Kn,

where nn and Kn are deterministic positive series that grow to inﬁnity as n , and c0 is a positive constant. Then, by Condition 2 and the fact that x0 j s are independent conditionally on Θ, uniformly for every i = 1, . . . , n, we have

E1/2|r1,i|2

I xi tk(m),s

E n µ(x0 1)I x1 tk(m),s | Θ o

E{I x1 tk(m),s | Θ}

Pn j=1 µ(x0 j)I xj tk(m),s

n Pn j=1 I xj tk(m),s

E n µ(x0 1)I x1 tk(m),s | Θ o

E n I x1 tk(m),s | Θ o

Pn j=1 µ(x0 j)I xj tk(m),s

n Pn j=1 I xj tk(m),s

Pn j=1 µ(x0 j)I xj tk(m),s

n E n µ(x0 1)I x1 tk(m),s | Θ o

Pn j=1 I xj tk(m),s

E n µ(x0 1)I x1 tk(m),s | Θ o

Pn j=1 I xj tk(m),s

E n µ(x0 1)I x1 tk(m),s | Θ o

E{I x1 tk(m),s | Θ}

where the last inequality is based on Chebyshev s inequality. Similarly, under Condition 2, uniformly for every i = 1, . . . , n, we also have

E1/2|r2,i|2 = O

Combine (C.47) and (C.48) with (C.46), it is readily seen that Pn i=1 E n |µ(x0 i ) PMn m=1 w(m)bµm(xi)|2 | X o

µ(x0 i ) PMn m=1 w(m) P2k(m) s=1 I xi tk(m),s E n µ(x0 1)I x1 tk(m),s |Θ o

E n I x1 tk(m),s |Θ o

Optimal Weighted Random Forests

= Ex1,c|x1,Θ

s=1 I x1 tk(m),s E n µ(x0 1)I x1 tk(m),s | Θ o

E n I x1 tk(m),s | Θ o

1 n1/2 + n1/2 Kn

where x1,c represents the vector containing the features that are not included in the model, and the Ex1,c|x1,Θ( ) is taken with respect to those missing features, conditional on x1 and Θ. Then, by the Projection Theorem, it is readily seen that the leading term on the righthand-side of the last line of (C.49) satisﬁes that

s=1 I x1 tk(m),s E n µ(x0 1)I x1 tk(m),s | Θ o

E n I x1 tk(m),s | Θ o

Ex1,c|x1,Θ µ(x0 1) E µ(x0 1) | x1, Θ 2

where we have used the fact that

s=1 I x1 tk(m),s E n µ(x0 1)I x1 tk(m),s | Θ o

E n I x1 tk(m),s | Θ o

is σ(x1, Θ)-measurable with σ(x1, Θ) being the σ-algebra generated by {x1, Θ}. This yields that if we further assume that n1/2 Kn/ nn = o(1), one has

inf w H 1 Pn i=1 E n |µ(x0 i ) PMn m=1 w(m)bµm(xi)|2 | X o = Op

These discussions indicate that if some relevant/important features are missing in the model, it is expected that ξ 1 n achieves a satisfactory rate of convergence.

Appendix Appendix D. Additional Tables and Figures

Chen, Yu and Zhang

ρ 1step WRFopt 2steps WRFopt w RF CRF

Improvement Ratio

(a) Noise in X Only

(b) Noise in y Only

(c) Noise in Both X and y

Figure D.1: Improvement Ratio vs Noise on BH Data Set

Optimal Weighted Random Forests

Algorithm A.2: SUT

Split_a_node(S)

Input: The local learning subset S corresponding to the node we want to split

Output: A split [a < c] or nothing

-If Stop_split(S) is TRUE then return nothing.

-Otherwise select q attributes {a1, . . . , aq} by probability sequence P among all non

constant (in S ) candidate attributes ; // Hyper parameter: probability

sequence P = {P1, . . . , Pp}, where Pj [0, 1], j = 1, . . . , p and Pp j=1 Pj = 1

-Draw q splits {s1, . . . , sq}, where si = Pick_a_split(S, ai) , i = 1, . . . , q;

-Return a split s such that Score (s , S) = maxi=1,...,q Score (si, S).

Pick_a_split(S, a)

Input: A subset S and an attribute a

Output: A split

- Let a S max and a S min be the maximal and minimal value of a in S;

- Calculate the cut-point c (a S min, a S max)/2 ;

- Return the split [a < c].

Stop_split(S)

Input: A subset S

Output: A boolean

- If |S| < nodesize, then return TRUE.

- If all attributes are constant in S, then return TRUE.

- If the output is constant in S, then return TRUE.

- Otherwise, return FALSE.

Score(s, S)

Input: A split s and a subset S

Output: The score of this split method

-Let XP , XL, XR be the attribute matrix of this local parent node, left daughter,

right daughter, respectively;

-Let n P , n L, n R be the number of samples contained in the local parent node, left

daughter, right daughter, respectively;

-Obtain XP , XL, XR by centering and scaling of each column of the matrices

XP , XL, XR, respectively;

-score XP n L

-Return score. 53

Chen, Yu and Zhang

Algorithm B.1: 2steps-WRFopt

Input: (1) The training data set D = {yi, xi}n i=1 (2) The number of trees in

random forest Mn Output: Weight vector ew H

1 for m = 1 to Mn do

2 Draw a bootstrap data set D(m) of size n from the training data set D;

3 Grow a random-forest tree ˆf(m) to the bootstrap data D(m), by recursively

repeating the following steps for each terminal node of the tree, until the

minimum node size nodesize is reached ; // nodesize, q are hyper

4 i. Select q variables at random from the p variables;

5 ii. Pick the best variable/ splitting point among the q;

6 iii. Split the node into two daughter nodes.

7 for i = 1 to n do

8 Drop xi down the the mth tree and get PBL(m)(xi, X, y, B(m), Θ(m)).

10 PBL(m) {PBL(m)(x1, X, y, B(m), Θ(m)), . . . , PBL(m)(xn, X, y, B(m), Θ(m))} .

12 Solve the quadratic programming problem:

w = w (1), . . . , w (Mn) arg min w H C n(w);

13 e {In P(w )}y with P(w ) = PMn m=1 w (m)PBL(m);

14 Solve the quadratic programming problem:

ew = ew(1), . . . , ew(Mn) arg min w H C n(w).

Optimal Weighted Random Forests

Algorithm B.2: w RF

Input: (1) The training data set D = {yi, xi}n i=1 (2) The number of trees in RF

Mn (3) Parameter λ

Output: Weight vector bw H

1 for m = 1 to Mn do

2 Draw a bootstrap data set D(m) of size n from the training data set D;

3 Grow a random-forest tree ˆf(m) to the bootstrap data D(m), by recursively

repeating the following steps for each node of the tree, until the minimum node

size nodesize is reached ; // nodesize, q are hyper parameters

4 i. Select q variables at random from the p variables;

5 ii. Pick the best variable/ split-point among the q;

6 iii. Split the node into two daughter nodes.

7 t PE m 1 Pn i=1 OOBim Pn i=1 ˆf(m)(xi) yi OOBim;

8 bw(m) ( 1 t PE m )λ;

10 bw bw(1), . . . , bw(Mn) ;

11 bw bw bw 1 .

Chen, Yu and Zhang

Algorithm B.3: CRF

Input: (1) The training data set D = {yi, xi}n i=1 (2) The number of trees in RF Mn Output: Weight vector bw H

1 for m = 1 to Mn do

2 Draw a bootstrap data set D(m) of size n from the training data set D;

3 Grow a random-forest tree ˆf(m) to the bootstrap data D(m), by recursively

repeating the following steps for each node of the tree, until the minimum node

size nodesize is reached ; // nodesize, q are hyper parameters

4 i. Select q variables at random from the p variables;

5 ii. Pick the best variable/ split-point among the q;

6 iii. Split the node into two daughter nodes.

7 t PE m 1 Pn i=1 OOBim Pn i=1 ˆf(m)(xi) yi OOBim;

9 Sequence {t PE 1, . . . , t PE Mn} from smallest to largest;

10 for m = 1 to Mn do

11 rm the order of the mth tree in sorted sequence;

12 bw(m) PMn ν=rm 1 ν ;

14 bw bw(1), . . . , bw(Mn) ;

15 bw bw bw 1 .

Data set RF 2steps-WRFopt 1step-WRFopt w RF CRF BH 11.913(5) 11.380(1) 11.445(3) 11.401(2) 11.521(4)

Servo 1.053(3) 1.057(4) 1.063(5) 1.017(1) 1.049(2)

CCS 27.944(4) 27.658(1) 27.721(3) 27.703(2) 27.962(5)

ASN 5.308(5) 5.095(1) 5.103(2) 5.241(4) 5.232(3)

CCPP 15.818(2) 16.084(4) 16.256(5) 15.790(1) 15.901(3)

CST 10.618(5) 9.319(1) 9.395(2) 9.740(3) 10.109(4)

EE 3.533(2) 3.534(3) 3.541(4) 3.485(1) 3.559(5)

PT 1.806(5) 1.452(2) 1.449(1) 1.571(3) 1.609(4)

QSAR 1.378(3) 1.387(4) 1.397(5) 1.358(1) 1.368(2)

SM( 10 5) 2.629(5) 1.994(1) 1.999(2) 2.444(3) 2.444(4)

YH 2.495(5) 1.850(1) 1.891(3) 1.869(2) 2.072(4)

Tecator 3.790(5) 2.955(2) 2.948(1) 3.178(3) 3.338(4)

Table D.1: Test Error Comparisons by MSFE for Diﬀerent Forests on High-Dimensional Data

Optimal Weighted Random Forests

ρ 1step WRFopt 2steps WRFopt w RF CRF

Improvement Ratio

(a) Noise in X Only

(b) Noise in y Only

(c) Noise in Both X and y

Figure D.2: Improvement Ratio vs Noise on Servo Data Set

Data set RF 2steps-WRFopt 1step-WRFopt w RF CRF BH 2.293(5) 2.273(3) 2.284(4) 2.254(1) 2.271(2)

Servo 0.610(3) 0.613(4) 0.614(5) 0.604(1) 0.609(2)

CCS 3.870(5) 3.836(1) 3.841(2) 3.847(3) 3.866(4)

ASN 1.672(5) 1.646(1) 1.648(2) 1.662(3) 1.663(4)

CCPP 3.030(2) 3.048(4) 3.060(5) 3.026(1) 3.034(3)

CST 2.504(5) 2.311(1) 2.320(2) 2.377(3) 2.434(4)

EE 1.191(2) 1.192(3) 1.194(4) 1.184(1) 1.197(5)

PT 0.860(5) 0.777(1) 0.777(1) 0.794(3) 0.808(4)

QSAR 0.871(3) 0.873(4) 0.876(5) 0.864(1) 0.867(2)

SM( 10 3) 2.882(5) 2.763(1) 2.773(2) 2.813(3) 2.846(4)

YH 0.724(5) 0.667(2) 0.675(3) 0.651(1) 0.676(4)

Tecator 1.342(5) 1.207(1) 1.207(1) 1.246(3) 1.275(4)

Table D.2: Test Error Comparisons by MAFE for Diﬀerent Forests on High-Dimensional Data

Chen, Yu and Zhang

ρ 1step WRFopt 2steps WRFopt w RF CRF

Improvement Ratio

(a) Noise in X Only

(b) Noise in y Only

(c) Noise in Both X and y

Figure D.3: Improvement Ratio vs Noise on CCS Data Set

Optimal Weighted Random Forests

ρ 1step WRFopt 2steps WRFopt w RF CRF

Improvement Ratio

(a) Noise in X Only

(b) Noise in y Only

(c) Noise in Both X and y

Figure D.4: Improvement Ratio vs Noise on CST Data Set

Chen, Yu and Zhang

ρ 1step WRFopt 2steps WRFopt w RF CRF

0.00.10.20.30.40.50.60.70.80.91.0

Improvement Ratio

(a) Noise in X Only

(b) Noise in y Only

0.00.10.20.30.40.50.60.70.80.91.0

(c) Noise in Both X and y

Figure D.5: Improvement Ratio vs Noise on EE Data Set

Optimal Weighted Random Forests

ρ 1step WRFopt 2steps WRFopt w RF CRF

Improvement Ratio

(a) Noise in X Only

(b) Noise in y Only

(c) Noise in Both X and y

Figure D.6: Improvement Ratio vs Noise on PT Data Set

Chen, Yu and Zhang

ρ 1step WRFopt 2steps WRFopt w RF CRF

0.00.10.20.30.40.50.60.70.80.91.0

Improvement Ratio

(a) Noise in X Only

(b) Noise in y Only

0.00.10.20.30.40.50.60.70.80.91.0

(c) Noise in Both X and y

Figure D.7: Improvement Ratio vs Noise on QSAR Data Set

Optimal Weighted Random Forests

ρ 1step WRFopt 2steps WRFopt w RF CRF

Improvement Ratio

(a) Noise in X Only

(b) Noise in y Only

(c) Noise in Both X and y

Figure D.8: Improvement Ratio vs Noise on SM Data Set

Chen, Yu and Zhang

ρ 1step WRFopt 2steps WRFopt w RF CRF

Improvement Ratio

(a) Noise in X Only

(b) Noise in y Only

(c) Noise in Both X and y

Figure D.9: Improvement Ratio vs Noise on YH Data Set

Optimal Weighted Random Forests

ρ 1step WRFopt 2steps WRFopt w RF CRF

Improvement Ratio

(a) Noise in X Only

(b) Noise in y Only

(c) Noise in Both X and y

Figure D.10: Improvement Ratio vs Noise on Tecator Data Set

Chen, Yu and Zhang

BH Servo CCS ASN CCPP CST EE PT QSAR SM YH Tecator Data Sets

Relative Risk

Figure D.11: Comparative Analysis of Conventional RF Predictive Performance Between Original and Augmented Data Sets (Relative risks are calculated as the ratio of conventional RF risks on augmented data sets to those on original data sets.)

Optimal Weighted Random Forests

ρ 1step WRFopt 2steps WRFopt w RF CRF

100 200 400 800 ntree

Improvement Ratio

5 8 11 14 18 nodesize

Figure D.12: Improvement Ratio vs Hyper Parameters of RF on BH Data Set

Chen, Yu and Zhang

ρ 1step WRFopt 2steps WRFopt w RF CRF

100 200 400 800 ntree

Improvement Ratio

5 6 7 8 9 10 nodesize

Figure D.13: Improvement Ratio vs Hyper Parameters of RF on Servo Data Set

Optimal Weighted Random Forests

ρ 1step WRFopt 2steps WRFopt w RF CRF

100 200 400 800 ntree

Improvement Ratio

5 10 15 20 26 nodesize

Figure D.14: Improvement Ratio vs Hyper Parameters of RF on CCS Data Set

Chen, Yu and Zhang

ρ 1step WRFopt 2steps WRFopt w RF CRF

100 200 400 800 ntree

Improvement Ratio

5 12 19 32 nodesize

Figure D.15: Improvement Ratio vs Hyper Parameters of RF on ASN Data Set

Optimal Weighted Random Forests

ρ 1step WRFopt 2steps WRFopt w RF CRF

100 200 400 800 ntree

Improvement Ratio

5 24 43 62 81 nodesize

Figure D.16: Improvement Ratio vs Hyper Parameters of RF on CCPP Data Set

Chen, Yu and Zhang

ρ 1step WRFopt 2steps WRFopt w RF CRF

100 200 400 800 ntree

Improvement Ratio

5 6 7 8 nodesize

Figure D.17: Improvement Ratio vs Hyper Parameters of RF on CST Data Set

Optimal Weighted Random Forests

ρ 1step WRFopt 2steps WRFopt w RF CRF

100 200 400 800 ntree

Improvement Ratio

5 9 13 17 23 nodesize

Figure D.18: Improvement Ratio vs Hyper Parameters of RF on EE Data Set

Chen, Yu and Zhang

ρ 1step WRFopt 2steps WRFopt w RF CRF

100 200 400 800 ntree

Improvement Ratio

5 20 35 64 nodesize

Figure D.19: Improvement Ratio vs Hyper Parameters of RF on PT Data Set

Optimal Weighted Random Forests

ρ 1step WRFopt 2steps WRFopt w RF CRF

100 200 400 800 ntree

Improvement Ratio

5 9 13 19 nodesize

Figure D.20: Improvement Ratio vs Hyper Parameters of RF on QSAR Data Set

Chen, Yu and Zhang

ρ 1step WRFopt 2steps WRFopt w RF CRF

100 200 400 800 ntree

Improvement Ratio

5 9 13 19 nodesize

Figure D.21: Improvement Ratio vs Hyper Parameters of RF on SM Data Set

Optimal Weighted Random Forests

ρ 1step WRFopt 2steps WRFopt w RF CRF

100 200 400 800 ntree

Improvement Ratio

5 7 9 11 14 nodesize

Figure D.22: Improvement Ratio vs Hyper Parameters of RF on YH Data Set

Chen, Yu and Zhang

ρ 1step WRFopt 2steps WRFopt w RF CRF

100 200 400 800 ntree

Improvement Ratio

5 7 9 12 nodesize

Figure D.23: Improvement Ratio vs Hyper Parameters of RF on Tecator Data Set

Optimal Weighted Random Forests

S. Athey and G. Imbens. Recursive partitioning for heterogeneous causal eﬀects. Proceedings of the National Academy of Sciences, 113(27):7353 7360, 2016.

G. Biau. Analysis of a random forests model. The Journal of Machine Learning Research, 13(1):1063 1095, 2012.

G. Biau and E. Scornet. A random forest guided tour. Test, 25:197 227, 2016.

L. Breiman. Random forests. Machine Learning, 45(1):5 32, 2001.

V. V. Buldygin and K. K. Moskvichova. The sub-Gaussian norm of a binary random variable. Theory of Probability and Mathematical Statistics, 86:33 49, 2013.

T. Chen and C. Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785 794, 2016.

C. M. Chi, P. Vossler, Y. Fan, and J. Lv. Asymptotic properties of high-dimensional random forests. The Annals of Statistics, 50(6):3415 3438, 2022.

D. Dua and C. Graﬀ. UCI machine learning repository, 2017. URL http://archive.ics. uci.edu/ml.

C. J. Flynn, C. M. Hurvich, and J. S. Simonoﬀ. Eﬃciency for regularization parameter selection in penalized likelihood estimation of misspeciﬁed models. Journal of the American Statistical Association, 108:1031 1043, 2013.

Y. Gao, X. Zhang, S. Wang, T. T. Chong, and G. Zou. Frequentist model averaging for threshold models. Annals of the Institute of Statistical Mathematics, 71(2):275 306, 2019.

P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine Learning, 63 (1):3 42, 2006.

P. Ghosh, A. Neufeld, and J. K. Sahoo. Forecasting directional movements of stock prices for intraday trading using LSTM and random forests. Finance Research Letters, 46(Part A):102280, 2022.

B. E. Hansen. Least squares model averaging. Econometrica, 75(4):1175 1189, 2007.

B. E. Hansen and J. S. Racine. Jackknife model averaging. Journal of Econometrics, 167 (1):38 46, 2012.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, chapter 9, pages 307 308. Springer, 2009.

G. James, D. Witten, T. Hastie, and R. Tibshirani. An Introduction to Statistical Learning, chapter 8, pages 311 314. Springer, 2013.

J. M. Klusowski. Universal consistency of decision trees in high dimensions. ar Xiv preprint ar Xiv:2104.13881, 2021.

Chen, Yu and Zhang

H. Li, W. Wang, H. Ding, and J. Dong. Trees weighting random forest method for classifying high-dimensional noisy data. In 2010 IEEE 7th International Conference on E-Business Engineering, pages 160 163, 2010.

J. Lin, S. Lu, X. He, and F. Wang. Analyzing the impact of three-dimensional building structure on CO2 emissions based on random forest regression. Energy, 236(1):121502, 2021.

Y. Lin and Y. Jeon. Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101(474):578 590, 2006.

H. Pallathadka, E. H. Ramirez-Asis, T. P. Loli-Poma, K. Kaliyaperumal, R. J. M. Ventayen, and M. Naved. Applications of artiﬁcial intelligence in business management, e-commerce and ﬁnance. Materials Today: Proceedings, 80:2610 2613, 2023.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825 2830, 2011.

J. Peng and Y. Yang. On improvability of model selection by model averaging. Journal of Econometrics, 229(2):246 262, 2022.

H. Pham and S. Olafsson. On Cesáro averages for weighted trees in the random forest. Journal of Classiﬁcation, 37(1):1 14, 2019.

P. Probst, M. N. Wright, and A. L. Boulesteix. Hyperparameters and tuning strategies for random forest. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 9(3):e1301, 2019.

Y. Qiu, T. Xie, J. Yu, and X. Zhang. Mallows-type averaging machine learning techniques. Working paper, 2020.

I. Reis, D. Baron, and S. Shahaf. Probabilistic random forest: A machine learning algorithm for noisy data sets. The Astronomical Journal, 157(1):16, 2018.

J. Saniuk and I. Rhodes. A matrix inequality associated with bounds on solutions of algebraic riccati and lyapunov equations. IEEE Transactions on Automatic Control, 32(8):739 740, 1987.

E. Scornet, G. Biau, and J. P. Vert. Consistency of random forests. The Annals of Statistics, 43(4):1716 1741, 2015.

M. R. Segal. Machine learning benchmarks and random forest regression. Center for Bioinformatics and Molecular Biostatistics, 2004.

J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1297 1304, 2011.

Optimal Weighted Random Forests

V. Svetnik, A. Liaw, C. Tong, J. C. Culberson, R. P. Sheridan, and B. P. Feuston. Random forest: A classiﬁcation and regression tool for compound classiﬁcation and QSAR modeling. Journal of Chemical Information and Computer Sciences, 43(6):1947 1958, 2003.

V. Syrgkanis and M. Zampetakis. Estimation and inference with trees and forests in high dimensions. In Proceedings of Thirty Third Conference on Learning Theory: PMLR, volume 125, pages 3453 3454, 2020.

J. Vanschoren, J. N. Van Rijn, B. Bischl, and L. Torgo. Openml: Networked science in machine learning. SIGKDD Explorations, 15(2):49 60, 2013. doi: 10.1145/2641190.2641198. URL http://doi.acm.org/10.1145/2641190.2641198.

S. J. Winham, R. R. Freimuth, and J. M. Biernacka. A weighted random forests approach to improve predictive performance. Statistical Analysis and Data Mining: The ASA Data Science Journal, 6(6):496 505, 2013.

S. Xuan, G. Liu, and Z. Li. Reﬁned weighted random forest and its application to credit card fraud detection. In Computational Data and Social Networks, pages 343 355, 2018.

J. Yoon. Forecasting of real GDP growth using machine learning models: Gradient boosting and random forest approach. Computational Economics, 57(1):247 265, 2021.

H. Zhang and S. Chen. Concentration inequalities for statistical inference. Communications in Mathematical Research, 37(1):1 85, 2021.

X. Zhang. A new study on asymptotic optimality of least squares model averaging. Econometric Theory, 37(2):388 407, 2021.

X. Zhang, A. T. K. Wan, and G. Zou. Model averaging by jackknife criterion in models with dependent data. Journal of Econometrics, 174(2):82 94, 2013.

X. Zhang, D. Yu, G. Zou, and H. Liang. Optimal model averaging estimation for generalized linear models and generalized linear mixed-eﬀects models. Journal of the American Statistical Association, 111(516):1775 1790, 2016.

X. Zhang, G. Zou, H. Liang, and R. J. Carroll. Parsimonious model averaging with a diverging number of parameters. Journal of the American Statistical Association, 115 (530):972 984, 2020.

S. Zhao, X. Zhang, and Y. Gao. Model averaging with averaging covariance matrix. Economics Letters, 145:214 217, 2016.

Z. Zhou. Ensemble Methods: Foundations and Algorithms, chapter 4, pages 68 71. CRC Press, 2012.

Z. Zhou, J. Wu, and W. Tang. Ensembling neural networks: many could be better than all. Artiﬁcial Intelligence, 137(1-2):239 263, 2002.

J. Zou, W. Wang, X. Zhang, and G. Zou. Optimal model averaging for divergent-dimensional Poisson regressions. Econometric Reviews, 41(7):775 805, 2022.