Journal of Machine Learning Research 25 (2024) 1-81. Submitted 5/23; Revised 10/24; Published 10/24.

Optimal Weighted Random Forests

Xinyu Chen (SA21204192@mail.ustc.edu.cn), International Institute of Finance, School of Management, University of Science and Technology of China, Hefei, 230026, Anhui, China

Dalei Yu (yudalei@126.com), School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, 710049, Shaanxi, China

Xinyu Zhang (xinyu@amss.ac.cn), Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, China; International Institute of Finance, School of Management, University of Science and Technology of China, Hefei, 230026, Anhui, China

Editor: Mladen Kolar

Abstract

The random forest (RF) algorithm has become a very popular prediction method for its great flexibility and promising accuracy. In RF, it is conventional to put equal weights on all the base learners (trees) to aggregate their predictions. However, the predictive performance of different trees within the forest can vary significantly due to the randomization of the embedded bootstrap sampling and feature selection. In this paper, we focus on RF for regression and propose two optimal weighting algorithms, namely the 1 Step Optimal Weighted RF (1step-WRFopt) and the 2 Steps Optimal Weighted RF (2steps-WRFopt), which combine the base learners through weights determined by weight choice criteria. Under some regularity conditions, we show that these algorithms are asymptotically optimal in the sense that the resulting squared loss and risk are asymptotically identical to those of the infeasible but best possible weighted RF. Numerical studies conducted on real-world data sets and semi-synthetic data sets indicate that these algorithms outperform the equal-weight forest and two other weighted RFs proposed in the existing literature in most cases.

Keywords: weighted random forest, model averaging, optimality, splitting criterion
Dalei Yu is the corresponding author. © 2024 Xinyu Chen, Dalei Yu and Xinyu Zhang. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/23-0607.html.

1. Introduction

Random forest (RF) (Breiman, 2001), which grows trees using the Classification and Regression Trees (CART) algorithm, is one of the most successful machine learning algorithms: it scales with the volume of information while maintaining sufficient statistical efficiency (Biau and Scornet, 2016). Due to its great flexibility and promising accuracy, RF has been widely used in diverse areas of data analysis, including policy-making (Yoon, 2021; Lin et al., 2021), business analysis (Pallathadka et al., 2023; Ghosh et al., 2022), chemoinformatics (Svetnik et al., 2003), real-time recognition of human pose (Shotton et al., 2011), and so on. RF ensembles multiple decision trees grown on bootstrap samples and yields highly accurate predictions. In the conventional implementation of RF, it is customary and convenient to allocate equal weight to each decision tree. Theoretically, however, the predictive performance varies from tree to tree due to the randomly selected sub-spaces of data and features; in other words, trees exhibit greater diversity due to the injected randomness. An immediate question then arises: is it always optimal to employ equal weights? In fact, there is sufficient evidence that an averaging strategy with appropriately selected unequal weights may achieve better performance than simple averaging (that is, equal weighting) when individual learners exhibit non-identical strength (Zhou, 2012; Peng and Yang, 2022). To address this problem, some efforts have been made in the literature on weighted RFs. Specifically, Trees Weighting Random Forest (TWRF), introduced by Li et al.
(2010), employs the accuracy on the out-of-bag data as an index measuring the classification power of a tree and sets it as the weight. Winham et al. (2013) develop Weighted Random Forests (wRF), where the weights are determined based on tree-level prediction error. Building on wRF, Xuan et al. (2018) put forward Refined Weighted Random Forests (RWRF) using all training data, including in-bag and out-of-bag data. A novel weighting formula is also developed in RWRF, but it cannot be adapted to the regression setting. Pham and Olafsson (2019) replace the regular average with a Cesàro average and provide a theoretical analysis. However, these studies have predominantly focused on classification, and less attention has been paid to the regression setting (that is, estimating the conditional expectation), although some mechanisms for classification can be transformed into corresponding regression counterparts. In addition, none of the aforementioned studies investigate the theoretical underpinnings regarding the optimality properties of their methods. Recently, Qiu et al. (2020) propose a novel framework that averages the outputs of multiple machine learning algorithms with weights determined by Mallows-type criteria. Motivated by their work, we extend this approach by developing an asymptotically optimal weighting strategy for RF. Specifically, we treat the individual trees within the RF as base learners and employ a Mallows-type criterion to obtain their respective weights. It is worth noting that Qiu et al. (2020) implicitly assume the independence of the hat matrix from the response values (the term "hat matrix" generally refers to the matrix that maps the response vector to the corresponding vector of fitted values), which is unrealistic in practical situations.
In the current study, we remove this restriction and establish the asymptotic optimality under the standard setting where the hat matrix is determined by a response-based splitting criterion (Breiman, 2001; Chi et al., 2022). Moreover, to reduce the computational burden, we further propose an accelerated algorithm that requires only two quadratic optimization tasks. Asymptotic optimality is established for both the original and accelerated weighted RF estimators. Extensive analyses on real-world and semi-synthetic data sets demonstrate that the proposed methods show promising performance over existing RFs. The remaining part of the paper proceeds as follows: Section 2 formulates the problem. Section 3 establishes our weighted RF algorithms and provides theoretical analysis. Section 4 shows their promising performance on 12 real-world data sets from the UCI Machine Learning Repository and OpenML.org, as well as on semi-synthetic data. Section 5 concludes. The data and code are publicly available at https://github.com/XinyuChen-hey/Optimal-Weighted-Random-Forests.

2. Model and Problem Formulation

Let $x_i^0 = (x_{i1}, x_{i2}, \ldots)$ be a set of countably infinite predictors (or explanatory variables, attributes, features) and $y_i$ be a univariate response variable for $i = 1, \ldots, n$. The data generating process is $y_i = \mu_i + e_i$, where $\{e_i, x_i^0\}_{i=1}^n$ are independent and identically distributed random variables with $E(e_i \mid x_i^0) = 0$ and $E(e_i^2 \mid x_i^0) = \sigma_i^2$, and $\mu_i = E(y_i \mid x_i^0)$. So conditional heteroscedasticity is allowed here. Consider an observable data set $D = \{y_i, x_i\}_{i=1}^n$, where $x_i = (x_{i1}, \ldots, x_{ip})'$ with $p$ being the dimension of the features. Given a predictor vector $x_i$, the corresponding prediction for $y_i$ by a tree (or base learner, BL) in the construction of RF can be written as $\hat{y}_i = P_{\mathrm{BL}}'(x_i, X, y, B, \Theta)\, y$, where $y = (y_1, \ldots, y_n)'$ is the vector of the response variable, $P_{\mathrm{BL}}(x_i, X, y, B, \Theta)$ is an $n$-dimensional vector for the tree configuration, and $X = (x_1, \ldots, x_n)'$ is the matrix of predictors. The variables $B$ and $\Theta$ play implicit roles in injecting randomness into RFs. First, each tree is fitted to an independent bootstrap sample from the original data, with $B$ determining the randomization inherent in this bootstrap sampling process. Second, $\Theta$ dictates the randomness in feature selection at each node within the trees. The specific nature and dimensions of $B$ and $\Theta$ are contingent upon the specific implementation of each tree. Note that $B$ is irrelevant to $y$, while $\Theta$ relies on $y$ for guiding splits.¹ Assume that we have drawn $M_n$ bootstrap data sets of size $n$ and grown $M_n$ trees on them, where $M_n$ can grow with $n$ or remain fixed. Take the $m$th tree for example. Drop an instance $(y_0, x_0)$ down this base learner and end up in a specific tree leaf $l$ with $n_l$ observations $D_l = \{(y_{i_1}, x_{i_1}), \ldots, (y_{i_{n_l}}, x_{i_{n_l}})\}$. Assume that the number of occurrences of instance $(y_i, x_i)$ in this tree is $h_i$ for all $i$, owing to the bootstrap sampling procedure. Then $P_{\mathrm{BL}}(x_0, X, y, B, \Theta)$ for this tree is a sparse vector, with elements being $h_i/n_l$ or zero, depending on the relationship between $D$ and $D_l$. More specifically, the $i$th element of $P_{\mathrm{BL}}(x_0, X, y, B, \Theta)$ equals 0 if $(y_i, x_i) \notin D_l$ and $(y_0, x_0) \in D_l$. The elements of $P_{\mathrm{BL}}(x_0, X, y, B, \Theta)$ are the weights put on the elements of $y$ to make a prediction for $y_0$.

¹ In case the tree structure is unaffected by $y$, $\Theta$ becomes independent of $y$, thereby reducing $P_{\mathrm{BL}}(x_i, X, y, B, \Theta)$ to $P_{\mathrm{BL}}(x_i, X, B, \Theta)$. Such a situation occurs with split-unsupervised trees, as discussed in Section 3.2.

By randomly selecting sub-spaces of data and features, trees in RFs are given more randomness than trees without these randomization techniques. Specifically, bootstrap data are used to grow trees rather than the original training data.
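The leaf-weight construction just described can be sketched in a few lines of numpy. This is purely illustrative (the helper `p_bl_matrix`, the toy leaf assignment, and the bootstrap counts are our own assumptions, not the paper's code): each row of a tree's hat matrix places weight $h_j/n_l$ on the observations that share the query point's leaf, where $n_l$ is read here as the total bootstrap multiplicity in that leaf.

```python
import numpy as np

def p_bl_matrix(leaf, h):
    """Hat matrix of one tree: P[i, j] = h[j] * 1{leaf[j] == leaf[i]} / n_l,
    where n_l is the total bootstrap multiplicity h in the leaf containing i."""
    n = len(leaf)
    P = np.zeros((n, n))
    for i in range(n):
        same = (leaf == leaf[i]).astype(float)   # indicator of shared leaf
        P[i] = h * same / np.sum(h * same)       # weights h_j / n_l
    return P

# toy example: 6 observations, two leaves, bootstrap counts h
leaf = np.array([0, 0, 0, 1, 1, 1])
h = np.array([2, 1, 0, 1, 3, 1])   # observation 2 is out-of-bag (h = 0)
P = p_bl_matrix(leaf, h)
print(P.sum(axis=1))   # every row sums to 1: the prediction is a weighted average of y
print(np.trace(P))     # trace equals the number of leaves (here 2)
```

Note that the row for the out-of-bag observation still sums to one: its prediction borrows entirely from the in-bag points in its leaf, which is exactly the sparsity pattern described above.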
In addition, when searching for the best splitting variable at each node, we draw $q$ ($q < p$) variables from the total pool of $p$ variables, rather than using all $p$ variables. Without the bootstrap procedure, we would have $h_i \equiv 1$ for all $i \in \{1, \ldots, n\}$, and $P_{\mathrm{BL}}(x_0, X, y, B, \Theta)$ would contain fewer zero elements. The prediction for $y_i$ by the $m$th tree (or the $m$th base learner) within the forest obeys the following relationship:

$\hat{y}_i^{(m)} = P_{\mathrm{BL}(m)}'(x_i, X, y, B^{(m)}, \Theta^{(m)})\, y, \quad (1)$

where $\hat{y}_i^{(m)}$ is the prediction for $y_i$ by the $m$th tree, and $P_{\mathrm{BL}(m)}(x_i, X, y, B^{(m)}, \Theta^{(m)})$ is the $n$-dimensional vector configuring the $m$th tree. The final output of the forest is $\sum_{m=1}^{M_n} w^{(m)} \hat{y}_i^{(m)}$, where $w^{(m)}$ is the weight put on the $m$th tree. Our goal is to determine appropriate weights to improve the prediction accuracy of RF, given a predictor vector $x$. Clearly, the conventional RF has $w^{(m)} \equiv 1/M_n$ for $m = 1, \ldots, M_n$.

3. Mallows-type Weighted RFs

Let $P_{\mathrm{BL}(m)}$ be the $n \times n$ hat matrix whose $i$th row is $P_{\mathrm{BL}(m)}'(x_i, X, y, B^{(m)}, \Theta^{(m)})$. Let $P(w) = \sum_{m=1}^{M_n} w^{(m)} P_{\mathrm{BL}(m)}$, $\hat{y}(w) = \sum_{m=1}^{M_n} w^{(m)} \hat{y}^{(m)}$, and $\hat{y}^{(m)} = (\hat{y}_1^{(m)}, \ldots, \hat{y}_n^{(m)})'$. Define the averaged squared error function

$L_n(w) = \|\hat{y}(w) - \mu\|^2, \quad (2)$

which measures the sum of squared biases between the true $\mu = (\mu_1, \ldots, \mu_n)'$ and its model averaging estimate $\hat{y}(w)$. Denote $R_n(w) = E\{L_n(w) \mid \mathcal{X}\}$, where $\mathcal{X}$ is the σ-algebra generated by $\{x_1^0, \ldots, x_n^0, B^{(1)}, \ldots, B^{(M_n)}, \Theta^{(1)}, \ldots, \Theta^{(M_n)}\}$. We will propose some weight choice criteria to obtain weights based on $R_n(w)$.

3.1 Mallows-type Weight Choice Criteria

Considering the choice of weights, we use the solution obtained by minimizing the following Mallows-type criterion over $w \in \mathcal{H}_n = \{w \in [0,1]^{M_n} : \sum_{m=1}^{M_n} w^{(m)} = 1\}$:

$C_n(w) = \|y - P(w)y\|^2 + 2\sum_{i=1}^n e_i^2 P_{ii}(w), \quad (3)$

where $P_{ii}(w)$ is the $i$th diagonal term of $P(w)$, and $e = (e_1, \ldots, e_n)'$ is the true error vector. This criterion was originally proposed by Zhao et al. (2016) for linear models.
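As a quick numerical illustration of the aggregation scheme, the forest output $\hat y(w) = P(w)y = \sum_m w^{(m)} \hat y^{(m)}$ is linear in the weights, so the conventional equal-weight forest is just one point of the weight simplex. The sketch below uses random row-stochastic matrices as stand-ins for tree hat matrices (our own toy construction, not data from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, M = 8, 3
y = rng.normal(size=n)

# M hypothetical tree hat matrices: nonnegative rows summing to one,
# mimicking the leaf-averaging structure of P_BL(m)
P_list = [rng.dirichlet(np.ones(n), size=n) for _ in range(M)]
yhat_trees = np.column_stack([P @ y for P in P_list])  # n x M tree predictions

def forest_prediction(w):
    # yhat(w) = P(w) y = sum_m w^(m) yhat^(m), by linearity of the hat matrices
    return yhat_trees @ w

w_equal = np.full(M, 1 / M)   # conventional RF: w^(m) = 1/M_n
print(np.allclose(forest_prediction(w_equal), yhat_trees.mean(axis=1)))
```

Any other point of the simplex reweights the same tree predictions; the criteria below pick the weights data-adaptively.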
In the context of linear models, $E\{C_n(w) \mid \mathcal{X}\}$ equals the conditional risk $R_n(w)$ up to a constant term that is irrelevant to $w$. Zhao et al. (2016) further show that the criterion is asymptotically optimal in the context considered therein. However, the $e_i$'s are unobservable in practice. So they further consider the following feasible criterion, replacing the true error terms with averaged residuals:

$C_n^*(w) = \|y - P(w)y\|^2 + 2\sum_{i=1}^n \hat{e}_i(w)^2 P_{ii}(w), \quad (4)$

where $\hat{e}(w) = \{\hat{e}_1(w), \ldots, \hat{e}_n(w)\}' = \sum_{m=1}^{M_n} w^{(m)} \hat{e}^{(m)} = \{I_n - P(w)\} y$, $\hat{e}^{(m)}$ is the residual vector of the $m$th candidate model, and $I_n$ is the identity matrix of dimension $n$. This feasible criterion also accommodates conditional heteroscedasticity. Besides, it relies on all candidate models to estimate the true error vector, which avoids placing too much confidence in a single model. A similar criterion has also been considered in Qiu et al. (2020). We apply criterion (4) to determine $w$ in $\hat{y}(w)$. Criterion (4) comprises two terms. The first term measures the fitting error of the weighted RF on the training data, by computing the residual sum of squares. The second term penalizes the complexity of the trees in the forest. For each $m \in \{1, \ldots, M_n\}$, $P_{\mathrm{BL}(m),ii}$ denotes the $i$th diagonal term of $P_{\mathrm{BL}(m)}$. As explained in Section 2, $P_{\mathrm{BL}(m),ii}$ is the proportion of the $i$th observation relative to the total number of samples in the leaf that contains the $i$th observation. Thus, for each $m \in \{1, \ldots, M_n\}$ and $i \in \{1, \ldots, n\}$, the larger the value of $P_{\mathrm{BL}(m),ii}$, the smaller the gap between $y_i$ and $\hat{y}_i^{(m)}$. In the most extreme case, where a tree is so deep that the leaf node containing the $i$th observation is pure, $P_{\mathrm{BL}(m),ii}$ equals 1, and $\hat{y}_i^{(m)}$ equals $y_i$. Essentially, such a tree has low prediction error within the training sample but may exhibit poor generalization performance when applied to new data.
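The feasible criterion (4) is straightforward to evaluate for a given weight vector. A minimal numpy sketch follows (the function name and the random stand-ins for the tree hat matrices are our own illustration); note that because the residuals $\hat e(w)$ themselves depend on $w$, the criterion is cubic in $w$:

```python
import numpy as np

def criterion_feasible(w, P_list, y):
    """Feasible Mallows-type criterion, as in (4):
    ||y - P(w)y||^2 + 2 * sum_i ehat_i(w)^2 * P_ii(w),
    with ehat(w) = (I_n - P(w)) y the weighted residual vector."""
    Pw = sum(wm * Pm for wm, Pm in zip(w, P_list))
    resid = y - Pw @ y
    fit = resid @ resid                             # residual sum of squares
    penalty = 2 * np.sum(resid**2 * np.diag(Pw))    # complexity penalty
    return fit + penalty

rng = np.random.default_rng(1)
n, M = 10, 4
y = rng.normal(size=n)
# hypothetical tree hat matrices: nonnegative rows summing to one
P_list = [rng.dirichlet(np.ones(n), size=n) for _ in range(M)]
w = np.full(M, 1 / M)
print(criterion_feasible(w, P_list, y))
```

Since the diagonal entries of each hat matrix are nonnegative, the penalty term is nonnegative and the criterion never falls below the residual sum of squares.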
To mitigate the contribution of overfitted trees to the ensemble output, the algorithm assigns lower weights to these trees, thereby decreasing the second term. From another aspect, noting that $\sum_{i=1}^n \hat{e}_i(w)^2 P_{ii}(w) \geq \min_{1\le i\le n} \hat{e}_i(w)^2 \sum_{i=1}^n P_{ii}(w)$, the summation part $\sum_{i=1}^n P_{ii}(w) = \sum_{m=1}^{M_n} w^{(m)} \sum_{i=1}^n P_{\mathrm{BL}(m),ii} = \sum_{m=1}^{M_n} w^{(m)} \ell^{(m)}$ represents the weighted number of leaves of all trees, where $\ell^{(m)}$ is the number of leaves of the $m$th tree, representing the complexity of the tree. The regularized objective minimized in the Extreme Gradient Boosting (XGBoost) algorithm, proposed by Chen and Guestrin (2016), also contains a penalty term on the number of leaves in the tree. In light of this, both the weighted RF with weights obtained by minimizing criterion (4) and XGBoost employ the number of leaves in a tree to measure its complexity. The regularization term helps to allocate the weights so as to avoid overfitting (Chen and Guestrin, 2016). Inherited from the Mallows criterion (Hansen, 2007), the weights resulting from (4) exhibit sparsity. It is important to note that some trees within a RF might contribute to the deterioration of the ensemble's overall performance. Therefore, forming a more accurate committee via the removal of trees with poor performance is a more rational strategy. This is the "many could be better than all" (MCBTA) theorem introduced by Zhou et al. (2002), which indicates that for supervised learning, given a set of individual learners, it may be better to ensemble some instead of all of them. Employing sparse weights can be regarded as a form of adaptive tree selection, reducing the risk of integrating trees that could weaken the final outcome of the ensemble. Additionally, it offers advantages over model selection methods, which choose only a single tree and thereby ignore model uncertainty.
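The identity $\sum_i P_{ii}(w) = \sum_m w^{(m)} \ell^{(m)}$ can be verified numerically. In the sketch below (our own toy setup: two trees encoded by leaf assignments, no bootstrap so that all counts are one), the trace of the weighted hat matrix matches the weighted leaf count exactly:

```python
import numpy as np

def tree_hat(leaf, h):
    # hat matrix of a single tree (see Section 2): P[i, j] = h_j 1{same leaf} / n_l
    sim = (leaf[:, None] == leaf[None, :]).astype(float)
    P = sim * h[None, :]
    return P / P.sum(axis=1, keepdims=True)

n = 6
h = np.ones(n)                        # no bootstrap, for simplicity
leafs = [np.array([0, 0, 1, 1, 2, 2]),   # tree 1: 3 leaves
         np.array([0, 0, 0, 1, 1, 1])]   # tree 2: 2 leaves
P_list = [tree_hat(l, h) for l in leafs]
w = np.array([0.25, 0.75])
trace_Pw = np.trace(sum(wm * Pm for wm, Pm in zip(w, P_list)))
print(trace_Pw, w @ np.array([3, 2]))  # both equal 0.25*3 + 0.75*2 = 2.25
```

Each leaf contributes exactly one to its tree's trace, so the trace counts leaves; deeper trees are thus penalized more, which is the mechanism shared with XGBoost's leaf-count regularizer.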
In short, our approach with sparse weights provides a balanced and adaptive solution, selectively aggregating members while acknowledging the significance of tree diversity. Criterion (4) is a cubic function of $w$, whose optimization is substantially more time-consuming than quadratic programming. To expedite the process, we further suggest an accelerated algorithm that estimates $e$ using a vector that is irrelevant to $w$. The accelerated algorithm consists of two steps: the first step calculates the estimated error terms, and the second step substitutes the vector obtained in the first step for the true error terms in criterion (3). Specifically, consider the following intermediate criterion:

$C_n^\dagger(w) = \|y - P(w)y\|^2 + 2\sum_{i=1}^n \hat{\sigma}^2 P_{ii}(w), \quad (5)$

where $\hat{\sigma}^2 = \|y - P(w_0)y\|^2/n$, and $w_0$ is an $M_n$-dimensional vector with all elements being $1/M_n$. Solve this quadratic programming task over $w \in \mathcal{H}$ to get a solution $w^\dagger$. Then, calculate the residual vector $\tilde{e} = (\tilde{e}_1, \ldots, \tilde{e}_n)' = \{I_n - P(w^\dagger)\} y$. Next, consider the following criterion:

$C_n^\ddagger(w) = \|y - P(w)y\|^2 + 2\sum_{i=1}^n \tilde{e}_i^2 P_{ii}(w). \quad (6)$

Both (5) and (6) are quadratic functions of $w$, while criterion (4) is cubic. Many contemporary software packages, such as quadprog in R or MATLAB, can effectively handle quadratic programming problems. In fact, the real data analysis conducted in Section 4.1 shows that the time required to solve two quadratic programming problems is notably lower than that required to solve a more intricate higher-order nonlinear programming problem. Please see Table 5 for more details.
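The two-step accelerated procedure can be sketched end-to-end. This is our own illustration, not the paper's implementation: random row-stochastic matrices stand in for the tree hat matrices, and in place of quadprog we use a simple Frank-Wolfe solver on the weight simplex, which suffices because both (5) and (6) are quadratic objectives with only a simplex constraint:

```python
import numpy as np

def fw_minimize(Yhat, y, pen, iters=500):
    """Frank-Wolfe on the simplex for a quadratic Mallows-type criterion:
    minimize ||y - Yhat w||^2 + 2 * pen @ w over w >= 0, sum(w) = 1,
    where Yhat[:, m] = P_m y and pen[m] is the penalty attached to tree m."""
    M = Yhat.shape[1]
    w = np.full(M, 1 / M)
    for _ in range(iters):
        r = y - Yhat @ w
        grad = -2 * Yhat.T @ r + 2 * pen
        s = np.zeros(M); s[np.argmin(grad)] = 1.0      # best simplex vertex
        d = s - w
        a = Yhat @ d
        if a @ a <= 1e-12:
            break
        gamma = np.clip((a @ r - pen @ d) / (a @ a), 0.0, 1.0)  # exact line search
        w = w + gamma * d
    return w

rng = np.random.default_rng(2)
n, M = 30, 5
y = rng.normal(size=n)
P_list = [rng.dirichlet(np.ones(n), size=n) for _ in range(M)]
Yhat = np.column_stack([P @ y for P in P_list])
diags = np.column_stack([np.diag(P) for P in P_list])   # n x M diagonal terms

# Step 1: criterion (5), with sigma^2 estimated under equal weights w0
w0 = np.full(M, 1 / M)
sigma2 = np.sum((y - Yhat @ w0) ** 2) / n
w_dag = fw_minimize(Yhat, y, sigma2 * diags.sum(axis=0))

# Step 2: criterion (6), plugging in the step-1 residuals for the true errors
e = y - Yhat @ w_dag
w2 = fw_minimize(Yhat, y, diags.T @ e**2)
print(w2.sum(), (w2 >= -1e-9).all())
```

Since Frank-Wolfe with exact line search starts at the equal-weight vector and never increases the objective, the returned weights are at least as good (in criterion value) as equal weighting.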
Algorithm 1: 1step-WRFopt
Input: (1) the training data set $D = \{y_i, x_i\}_{i=1}^n$; (2) the number of trees in the random forest, $M_n$.
Output: weight vector $\hat{w} \in \mathcal{H}$.
1. For $m = 1$ to $M_n$:
   (a) Draw a bootstrap data set $D^{(m)}$ of size $n$ from the training data set $D$.
   (b) Grow a random-forest tree $\hat{f}^{(m)}$ on the bootstrap data $D^{(m)}$ by recursively repeating the following steps for each terminal node of the tree, until the minimum node size nodesize is reached (nodesize and $q$ are hyperparameters):
      i. select $q$ variables at random from the $p$ variables;
      ii. pick the best variable/splitting point among the $q$;
      iii. split the node into two daughter nodes.
   (c) For $i = 1$ to $n$: drop $x_i$ down the $m$th tree and get $P_{\mathrm{BL}(m)}(x_i, X, y, B^{(m)}, \Theta^{(m)})$.
   (d) Set $P_{\mathrm{BL}(m)} = \{P_{\mathrm{BL}(m)}(x_1, X, y, B^{(m)}, \Theta^{(m)}), \ldots, P_{\mathrm{BL}(m)}(x_n, X, y, B^{(m)}, \Theta^{(m)})\}'$.
2. Solve the convex optimization problem $\hat{w} = (\hat{w}^{(1)}, \ldots, \hat{w}^{(M_n)})' \in \arg\min_{w \in \mathcal{H}} C_n^*(w)$, where $C_n^*(w)$ is criterion (4).

We refer to the RF with tree-level weights derived from optimizing criterion (4) as the 1 Step Optimal Weighted RF (1step-WRFopt), and the RF with tree weights obtained by optimizing criterion (6) as the 2 Steps Optimal Weighted RF (2steps-WRFopt). Their details are given in Algorithms 1 and B.1, respectively. For the sake of simplicity, we provide the complete description of Algorithm B.1 in Appendix B.1, since it shares a large part in common with Algorithm 1, differing only in the weight choice criterion. In the following, we use WRFopt to refer to both the 1step-WRFopt and the 2steps-WRFopt.

3.2 Asymptotic Optimality

Denote the weight vectors selected by criteria (4) and (6) by $\hat{w} = \arg\min_{w \in \mathcal{H}} C_n^*(w)$ and $\tilde{w} = \arg\min_{w \in \mathcal{H}} C_n^\ddagger(w)$, respectively. In this section, we analyze the loss and risk behavior associated with $\hat{w}$ and $\tilde{w}$. Before providing the theoretical support for the proposed algorithms, as an intermediate step, we first introduce a special tree-type algorithm that splits nodes without the guidance of the response values in the learning samples.
Namely, the hat matrix is independent of the output values of the learning samples. In this context, the vector $P_{\mathrm{BL}}(x_i, X, y, B, \Theta)$ reduces to $P_{\mathrm{BL}}(x_i, X, B, \Theta)$. This setup has also been imposed in the theoretical analyses of Geurts et al. (2006) and Biau (2012). Besides, this theoretical framework is referred to as "honesty" in the field of causal inference (Athey and Imbens, 2016). We term this methodology the Split-Unsupervised Tree (SUT), in contrast to CART, whose splitting criterion relies on response information. Adopting the SUT simplifies theoretical analysis, aligning it with the framework of the Mallows model averaging estimator (Hansen, 2007) in linear regression, where the hat matrix $P$ is independent of the response values. This approach is also similar to the estimators explored by Qiu et al. (2020), making it a useful intermediary for further analysis of the asymptotic optimality of our proposed weighted RFs. In the remainder of this section, we first discuss the asymptotic optimality of the WRFopt with SUT. Subsequently, we present the asymptotic optimality of the WRFopt with CART, which is the most important conclusion of this paper. This is achieved by theoretically linking the large sample behavior of the WRFopt with CART to that of the WRFopt with SUT. The corresponding theoretical analysis is outlined in Table 1.

Table 1: Flowchart of the theoretical analysis, from tree-based candidate models with $P$ independent of $y$ to BL-of-RF-based candidate models with $P$ dependent on $y$: least squares model averaging (Hansen, 2007) → Mallows-type averaging (Qiu et al., 2020) → WRFopt with SUT → WRFopt with CART. (The last two are our work.)

Appendix A provides the details of the CART algorithm and an example of the SUT algorithm (used in Section 4.1).

3.2.1 Asymptotic Optimality with SUT

In this section, we establish the asymptotic optimality of the 1step-WRFopt and 2steps-WRFopt estimators with SUT trees.
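A SUT can be sketched concretely: since the splits must not consult $y$, any rule based on the predictors alone qualifies. The toy numpy sketch below (the function `sut_leaf_ids` and the median-split rule are our own illustration of the idea, not the paper's Appendix A example) grows a balanced partition by splitting each node at the median of a randomly chosen feature:

```python
import numpy as np

def sut_leaf_ids(X, depth, rng):
    """Split-unsupervised tree (SUT) sketch: recursively split each node at
    the median of a randomly chosen feature. The response y is never
    consulted, so the partition (hence the hat matrix) is independent of y."""
    ids = np.zeros(len(X), dtype=int)
    nodes = [np.arange(len(X))]
    for _ in range(depth):
        new_nodes = []
        for idx in nodes:
            d = rng.integers(X.shape[1])        # random feature; ignores y
            c = np.median(X[idx, d])
            left = idx[X[idx, d] <= c]
            right = idx[X[idx, d] > c]
            new_nodes += [a for a in (left, right) if len(a) > 0]
        nodes = new_nodes
    for l, idx in enumerate(nodes):
        ids[idx] = l
    return ids

rng = np.random.default_rng(3)
X = rng.uniform(size=(40, 4))
leaf = sut_leaf_ids(X, depth=2, rng=rng)
print(len(np.unique(leaf)))   # at most 2^depth leaves
```

Swapping such a split rule into the forest leaves the leaf-averaging prediction unchanged; only the dependence of the partition on $y$ is removed, which is exactly what the theory below exploits.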
To differentiate notations associated with the SUT methodology from those used with CART, we attach a superscript ∗ to RF-related notations when referring to their SUT counterparts. This indicates that the notation pertains to the SUT methodology while maintaining its fundamental meaning, with the exception of the splitting criterion. For example, for $m = 1, \ldots, M_n$, $P^*_{\mathrm{BL}(m)}$ and $P_{\mathrm{BL}(m)}$ share the same fundamental meaning, with the former associated with SUT trees and the latter with CART trees. Likewise, $P^*(w) = \sum_{m=1}^{M_n} w^{(m)} P^*_{\mathrm{BL}(m)}$ contrasts with $P(w)$ for CART trees. Let $\xi_n = \inf_{w \in \mathcal{H}} R_n(w)$, $n_{(m),l}$ be the number of observations in the $l$th leaf of the $m$th tree, and $\underline{n} = \min_{1\le m\le M_n} \min_{1\le l\le \ell^{(m)}} n_{(m),l}$ be the smallest sample size (controlled by the hyperparameter nodesize in the R package randomForest) across all leaves in all trees within the CART trees. We employ $\xi^*_n$ and $\underline{n}^*$ to denote the counterparts of $\xi_n$ and $\underline{n}$ pertaining to SUT trees, respectively. For brevity, we will not enumerate each SUT-corresponding notation individually. We list and discuss some technical conditions as follows.

Condition 1: $\xi_n^{*\,-1} M_n^2 = o(1)$ almost surely, and $E(\xi_n^{*\,-1} M_n^2)$ exists for all fixed $n \geq 1$.

Condition 2: There exists a positive constant $v$ such that $E(e_i^4 \mid x_i^0) \leq v < \infty$ almost surely for $i = 1, \ldots, n$.

Additionally, we define $h_{(m),i}$ as the number of occurrences of instance $(y_i, x_i)$ in the $m$th bootstrap sample. Following Chi et al. (2022), when a summation is over an empty set, we define its value as zero; we also define $0/0 = 0$. Thus, without loss of generality, we assume that $h_{(m),i} > 0$ for all $m = 1, \ldots, M_n$ and $i = 1, \ldots, n$ in the remainder of this study. Denote $h_{\max} = \max_{1\le m\le M_n,\,1\le i\le n} h_{(m),i}$.

Condition 3: $h_{\max}\, \underline{n}^{*\,-1} n^{1/2} = O(1)$ almost surely.

Condition 4: For each fixed $i, j \in \{1, \ldots, n\}$, there exists a positive constant $c$ such that $h_{(m),i} \leq c\, h_{(m),j}$ for each $m = 1, \ldots, M_n$, almost surely.
Condition 5: $\xi_n^{*\,-1} M_n n^{1/2} = o(1)$ almost surely, and $E(\xi_n^{*\,-1} M_n n^{1/2})$ exists for all fixed $n \geq 1$.

Conditions 1 and 5 restrict the growth rates of the number of trees $M_n$ and regulate the behavior of the minimum averaging risk $\xi^*_n$. Similar conditions have been considered and discussed by Zhang et al. (2020), Zhang (2021), Zou et al. (2022), and others. One typical scenario that guarantees these two conditions occurs when all trees are misspecified, which precludes any tree within the RF from yielding perfect predictions and dominating the others. The misspecification scenario is particularly common in high-dimensional data contexts, where important predictors might be omitted from tree construction due to the randomization inherent in feature selection at each split. Besides, inspired by Chi et al. (2022), we first rigorously define the notion of important predictors in the context of RF under a simplified scenario in Appendix C.7. Under this scenario, it is reasonable to expect that the key quantity $\xi_n^{*\,-1}$ converges to 0 at the rate $O(n^{-1})$. In such cases, the rates in Conditions 1 and 5 can both be simplified to $M_n^2 n^{-1} = o(1)$. Moreover, even when all the important predictors are incorporated into the tree-building process, in a very simplified situation where a non-adaptive tree is considered, Lin and Jeon (2006) show that the mean squared error of the limiting value of the RF estimator is bounded below by $C/\{\bar{N} \log^{p-1}(n)\}$, where $C$ is a positive constant and $\bar{N}$ represents the maximal number of samples across all leaves and all trees. This indicates that $\xi_n^{*\,-1} = O[1/\{\bar{N} \log^{p-1}(n)\}]$. In this case, Conditions 1 and 5 can be further reduced to $M_n^2/\{\bar{N} \log^{p-1}(n)\} = o(1)$ and $M_n n^{1/2}/\{\bar{N} \log^{p-1}(n)\} = o(1)$, respectively. However, given the complexity of RF, achieving an explicit rate of convergence for $\xi_n^{*\,-1}$ under general RF scenarios remains an open question, which we identify as a direction for future research.
Condition 2 imposes boundedness of the conditional moments, which is a mild condition and can be found in much of the literature, including Hansen and Racine (2012) and Hansen (2007). Condition 3 is a high-level assumption that restricts the structure of the RF and its constituent trees; it is equivalent to Condition C.9 of Qiu et al. (2020) in the context therein. In fact, Condition 3 requires that the minimum sample size across all leaves and all trees grow at a rate no slower than $n^{1/2}$. Besides, it yields that $\mathrm{trace}(P^*_{\mathrm{BL}(m)}) = O(n^{1/2})$ almost surely. This restricts the degrees of freedom, or complexity, of the trees. In other words, trees should not be fully developed. Indeed, as noted by Probst et al. (2019), increasing the hyperparameter nodesize, which leads to fewer tree leaves, can decrease the computation time approximately exponentially. Our practical experience, particularly with large data sets, suggests that setting this hyperparameter to a value higher than the default can significantly reduce runtime, often without markedly compromising prediction performance. Specifically, Segal (2004) shows an example where increasing the number of noise variables results in a higher optimal nodesize. Therefore, it is advisable to construct trees in a RF that are moderately developed, balancing between being too shallow, which may result in underfitting (Hastie et al., 2009; James et al., 2013), and being excessively deep, which can impose a significant computational burden. Thus, Condition 3 is reasonable and easy to satisfy in practice. Condition 4 excludes unbalanced sampling in the bootstrap sampling process, specifically ruling out the extreme case where certain samples disproportionately dominate the bootstrap samples. The following theorems establish the asymptotic optimality of the 1step-WRFopt and 2steps-WRFopt estimators, respectively.
Theorem 1 (Asymptotic Optimality for 1step-WRFopt with SUT): Assume Conditions 1-4 hold. Then, as $n \to \infty$,

$\frac{L^*_n(\hat{w}^*)}{\inf_{w \in \mathcal{H}} L^*_n(w)} \to 1 \text{ in probability}. \quad (7)$

If, in addition, there exists an integrable random variable $\eta$ such that $L^*_n(\hat{w}^*)\, \xi_n^{*\,-1} \leq \eta$, then

$\frac{R^*_n(\hat{w}^*)}{\inf_{w \in \mathcal{H}} R^*_n(w)} \to 1. \quad (8)$

Theorem 2 (Asymptotic Optimality for 2steps-WRFopt with SUT): Assume Conditions 1-5 hold. Then, as $n \to \infty$,

$\frac{L^*_n(\tilde{w}^*)}{\inf_{w \in \mathcal{H}} L^*_n(w)} \to 1 \text{ in probability}. \quad (9)$

If, in addition, there exists an integrable random variable $\eta$ such that $L^*_n(\tilde{w}^*)\, \xi_n^{*\,-1} \leq \eta$, then

$\frac{R^*_n(\tilde{w}^*)}{\inf_{w \in \mathcal{H}} R^*_n(w)} \to 1. \quad (10)$

The results in (7) and (9) concern asymptotic optimality in the sense of achieving the lowest possible squared loss, while (8) and (10) yield asymptotic optimality in the sense of achieving the lowest possible squared risk. Proofs of Theorems 1 and 2 are presented in Appendices C.2 and C.3, respectively.

3.2.2 Asymptotic Optimality with CART

Theoretical analysis of CART is generally very challenging, because the splitting criterion relies on response information and because of the black-box nature of the procedure (Scornet et al., 2015). Nevertheless, there are some pioneering studies in the literature that analyze the CART splitting criterion. For example, in the context of additive regression models, Scornet et al. (2015) establish the consistency of Breiman's original procedure (Breiman, 2001). Moreover, Klusowski (2021) establishes the universal consistency of CART in the context of high-dimensional additive models. In addition, Syrgkanis and Zampetakis (2020) analyze the finite sample properties of CART with binary features under a sparsity constraint. More recently, Chi et al. (2022) establish consistency rates for the original CART. Now, we are equipped to establish the asymptotic optimality of the WRFopt using CART trees, which is achieved by studying the difference between the CART-based RF and its limiting version, building on the intermediate results established in Section 3.2.1. More specifically, following the recent work of Chi et al.
(2022) and Scornet et al. (2015), we term the limiting version of the CART-splitting criterion the theoretical CART-splitting criterion. Unlike its practical counterpart, the theoretical CART-splitting criterion does not depend on the response values and can therefore be considered a special type of splitting criterion employed by SUT trees. Our analysis in this section focuses on the discrepancy between estimators using the CART-splitting criterion and those using the theoretical CART-splitting criterion (referred to as the theoretical RF). To facilitate the technical presentation, we first introduce additional notations for the structure of a tree and its cells. Following Chi et al. (2022), without loss of generality, we set $x_i \in [0,1]^p$ and define a cell as a rectangle $t = \prod_{d=1}^p t_d$, representing the Cartesian product of the closed or half-closed intervals $t_d$ in $[0,1]$. Let $\theta_k = \{\theta_{k,1}, \ldots, \theta_{k,2^k}\}$ with elements $\theta_{k,\cdot} \in \{1, \ldots, p\}$ be the sets of available features for the $2^{k-1}$ splits at level $k \geq 1$ that grow the $2^k$ cells (including empty cells). Given the CART-splitting criterion and a set $\theta_{1:k} = \{\theta_1, \ldots, \theta_k\}$, let $\mathcal{T}(\theta_{1:k}) = \{t_{1:k,1}, \ldots, t_{1:k,2^k}\}$ be the collection of all cell sequences connecting the root to the end cells at level $k$, which is determined by $\Theta$. Each sequence $t_{1:k,s}$ for $s = 1, \ldots, 2^k$ can be considered a "tree branch", representing a partition of $[0,1]^p$. One can refer to Figure 1 of Chi et al. (2022) for a graphical illustration of this splitting scheme. It is natural to consider varying tree heights within a RF, implying that the maximum level $k$ depends on $m$ for each $m = 1, \ldots, M_n$. Therefore, we use $k(m)$ to denote the maximum level of the $m$th tree. For a new instance $x$, which is an independent copy of $x_1$, the $m$th tree estimator given $\mathcal{T}(\theta_{1:k(m)})$ and $\{h_{(m),1}, \ldots, h_{(m),n}\}$ can be expressed as

$\hat{\mu}_{\mathcal{T}(\theta_{1:k(m)})}(x) = \sum_{s=1}^{2^{k(m)}} I\{x \in t_{k(m),s}\}\, \frac{\sum_{i=1}^n h_{(m),i}\, y_i\, I\{x_i \in t_{k(m),s}\}}{\sum_{i=1}^n h_{(m),i}\, I\{x_i \in t_{k(m),s}\}} = \sum_{s=1}^{2^{k(m)}} \sum_{i=1}^n \frac{h_{(m),i}\, I\{x \in t_{k(m),s}\}\, I\{x_i \in t_{k(m),s}\}}{\sum_{i'=1}^n h_{(m),i'}\, I\{x_{i'} \in t_{k(m),s}\}}\, y_i, \quad (11)$

where $t_{k(m),s}$ is the end cell of the tree branch $t_{1:k(m),s}$ (also referred to as a leaf if it is not empty), and $I(\cdot)$ denotes the indicator function. Now, by definition, the $(i,j)$th element of $P_{\mathrm{BL}(m)}$ can be expressed as

$P_{\mathrm{BL}(m),ij} = \sum_{s=1}^{2^{k(m)}} \frac{h_{(m),j}\, I\{x_i \in t_{k(m),s}\}\, I\{x_j \in t_{k(m),s}\}}{\sum_{j'=1}^n h_{(m),j'}\, I\{x_{j'} \in t_{k(m),s}\}}, \quad (12)$

for all $m = 1, \ldots, M_n$. Consistent with Chi et al. (2022), given a cell $t$, the CART-splitting criterion is defined as

$\mathrm{CART}_t(d, c) = \min_{c_1 \in \mathbb{R}^1} \sum_{i=1}^n I(x_{id} < c,\, x_i \in t)\, (y_i - c_1)^2 + \min_{c_2 \in \mathbb{R}^1} \sum_{i=1}^n I(x_{id} \geq c,\, x_i \in t)\, (y_i - c_2)^2,$

with its theoretical version

$\mathrm{CART}^*_t(d, c) = \Pr(x_{1d} < c \mid x_1 \in t)\, \mathrm{var}(y_1 \mid x_{1d} < c,\, x_1 \in t) + \Pr(x_{1d} \geq c \mid x_1 \in t)\, \mathrm{var}(y_1 \mid x_{1d} \geq c,\, x_1 \in t),$

where $d$ is the label of the splitting variable, $c$ is a corresponding splitting point, and $x_{id}$ is the $d$th entry of $x_i$. Analogously to (12), we denote the corresponding theoretical version of $P_{\mathrm{BL}(m)}$ pertaining to $\mathrm{CART}^*_t(d, c)$ for each $m \in \{1, \ldots, M_n\}$ as $P^*_{\mathrm{BL}(m)}$, with elements

$P^*_{\mathrm{BL}(m),ij} = \sum_{s=1}^{2^{k(m)}} \frac{h_{(m),j}\, I\{x_i \in t^*_{k(m),s}\}\, I\{x_j \in t^*_{k(m),s}\}}{\sum_{j'=1}^n h_{(m),j'}\, I\{x_{j'} \in t^*_{k(m),s}\}},$

where $t^*_{k(m),s}$ for each $m \in \{1, \ldots, M_n\}$ and $s \in \{1, \ldots, 2^{k(m)}\}$ is the $s$th end cell of the $m$th base learner based on the theoretical CART-splitting criterion $\mathrm{CART}^*_t(d, c)$. Further, we continue to attach the superscript ∗ to RF-related notations when referring to their theoretical counterparts. Detailed definitions corresponding to the theoretical CART-splitting criterion will not be enumerated here, in the interest of brevity. Note that the theoretical CART-splitting criterion is independent of the response values and can therefore be viewed as a special type of splitting criterion used by SUT trees. Consequently, as an immediate application, Theorems 1-2 in Section 3.2.1 guarantee the asymptotic optimality of the WRFopt with the theoretical CART-splitting criterion.
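The empirical criterion $\mathrm{CART}_t(d, c)$ is easy to evaluate directly, because the inner minimizations over $c_1$ and $c_2$ are attained at the child means. The following toy numpy sketch (our own illustration: one feature, root cell $t = [0,1]$, a noisy step function) scans a grid of cut points and recovers a split near the true jump:

```python
import numpy as np

def cart_criterion(x, y, c):
    """Empirical CART-splitting criterion at cut point c for one feature:
    sum of within-child squared errors around the child means (the inner
    minimizations over c_1, c_2 are solved by the child averages)."""
    left, right = y[x < c], y[x >= c]
    sse = lambda v: np.sum((v - v.mean()) ** 2) if len(v) else 0.0
    return sse(left) + sse(right)

rng = np.random.default_rng(4)
x = rng.uniform(size=200)
y = (x >= 0.5).astype(float) + 0.01 * rng.normal(size=200)  # step at 0.5

cuts = np.linspace(0.05, 0.95, 181)
best = cuts[np.argmin([cart_criterion(x, y, c) for c in cuts])]
print(best)   # the best empirical split sits near the true jump at 0.5
```

The theoretical criterion $\mathrm{CART}^*_t(d, c)$ replaces these empirical child sums with population probabilities and conditional variances, which is why it no longer depends on the observed responses.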
In what follows, we introduce additional notation crucial for quantifying the discrepancy between the RF and the theoretical RF. Let
$$\Pr\left[ I\{x_i \in t_{k(m),s} \,\triangle\, t^{\star}_{k(m),s}\} = 1 \right] = p_{(m),is}, \qquad \delta_n = \max_{1 \le i \le n,\, 1 \le m \le M_n,\, 1 \le s \le 2^{k(m)}} I\{x_i \in t_{k(m),s} \,\triangle\, t^{\star}_{k(m),s}\},$$
where $S_1 \triangle S_2 = (S_1 \cup S_2) \setminus (S_1 \cap S_2)$ is the symmetric difference between two generic sets $S_1$ and $S_2$. Let $K_n = \max_{1 \le m \le M_n} 2^{k(m)}$ be the largest number of end cells (including empty end cells) among all trees, and let $N = \max_{1 \le m \le M_n,\, 1 \le l \le \ell(m)} n_{(m),l}$ be the largest sample size across all leaves in all trees within the RF,² with $N^{\star}$ being the theoretical counterpart of $N$. Now, we are equipped to explore the discrepancy between the RF and the theoretical RF. The following result bounds the gap between the two pivotal matrices $P_{BL(m)}$ and $P^{\star}_{BL(m)}$ for all $m = 1, \ldots, M_n$.

Lemma 1 Under Condition 4, there exist two positive constants $c_1$ and $c_2$ such that
$$\left| P_{BL(m),ij} - P^{\star}_{BL(m),ij} \right| \le c_1 K_n (N + N^{\star}) \delta_n, \qquad (14)$$
$$\sum_{i=1}^{n} \left| P_{BL(m),ii} - P^{\star}_{BL(m),ii} \right| \le \left\{ 2 + c_2 (1 + N^{\star}/n) \right\} \delta_n, \qquad (15)$$
almost surely, uniformly for all $m = 1, \ldots, M_n$.

Clearly, $\delta_n$ plays a pivotal role in bounding the proximity between the matrices $P_{BL(m)}$ and $P^{\star}_{BL(m)}$. The proof of Lemma 1 is provided in Appendix C.4. Letting $\xi^{\star}_n$ and related quantities denote the counterparts of their unstarred versions under the theoretical CART-splitting criterion, we list and discuss some technical conditions as follows.

Condition 1* $\xi^{\star-1}_n M_n^2 = o(1)$ almost surely, and $E(\xi^{\star-1}_n M_n^2)$ exists for all fixed $n \ge 1$.

Condition 2* There exist two positive constants $v_1$ and $v_2$ such that $E(|e_i|^r \mid x^0_i) \le v_1^2 v_2^{r-2}\, r!/2$ almost surely for every $i = 1, \ldots, n$ and $r \ge 2$.

Condition 3* $h^{\star}_{\max}\, n^{-1/2} = O(1)$ almost surely.

Condition 5* $\xi^{\star-1}_n M_n n^{1/2} = o(1)$ almost surely, and $E(\xi^{\star-1}_n M_n n^{1/2})$ exists for all fixed $n \ge 1$.

Condition 6 As $n \to \infty$, $N^{\star}/n$, $N + N^{\star}$ and $p_{(m),is}$ are bounded above by deterministic sequences, say $\bar r_n$, $\bar N_n$ and $\bar p_n$, respectively.

2. Note that for all $m = 1, \ldots, M_n$, $l \in \{1, \ldots, \ell(m)\}$ indexes the leaves (non-empty end cells) of the $m$th tree, while $s \in \{1, \ldots$
$\ldots, 2^{k(m)}\}$ indexes all end cells, inclusive of those that are empty.

Condition 7 As $n \to \infty$, $\bar p_n < 1/2$,
$$\bar\epsilon_n = \xi^{\star-1}_n K_n \bar N_n \log^2(n) \left\{ \sqrt{ \frac{2 \log(2 n M_n K_n)}{\log(\bar p_n^{-1} - 1)} } + \bar p_n \right\} = o(1)$$
almost surely, and $E(\bar\epsilon_n)$ exists for all fixed $n \ge 1$.

Conditions 1*-3* and 5* correspond to Conditions 1-3 and 5, respectively, with the focus shifted to the theoretical CART-splitting criterion. Moreover, Condition 2* guarantees that $E|e_i|^r \le v_1^2 v_2^{r-2}\, r!/2$, which is referred to as Bernstein's moment condition (Zhang and Chen, 2021). In particular, if $\{e_i\}_{i=1}^n$ are generated independently from $\mathrm{Normal}(0, \sigma^2)$, it follows that $v_1^2 = 2\sigma^2$ and $v_2 = \sigma^2$ (Zhang and Chen, 2021, Example 5.4). In addition, Condition 6 imposes restrictions on the hyperparameters of the RF, requiring that the behavior of these parameters not become too erratic as the amount of data increases. Condition 7 requires that the CART criterion lead to reasonably accurate splits, in the sense that the difference between $t_{k(m),s}$ and $t^{\star}_{k(m),s}$, measured by $E(\delta_n)$, which is bounded above by $\sqrt{2 \log(2 n M_n K_n)/\log(\bar p_n^{-1} - 1)} + \bar p_n$, is negligible in comparison with $\xi^{\star}_n$.

Theorem 3 (Asymptotic optimality under CART and criterion (5)) Assume that Conditions 1*-3*, 4, and 6-7 hold. Then, as $n \to \infty$,
$$\frac{L_n(w^{\circ})}{\inf_{w \in \mathcal{H}} L_n(w)} \to 1 \quad \text{in probability},$$
where $w^{\circ}$ is the minimizer of criterion (5) over $w \in \mathcal{H}$.

The proof of Theorem 3 is provided in Appendix C.5. Theorem 3 establishes the asymptotic optimality of the weighted RF with the CART-splitting criterion, with weights obtained from criterion (5). This criterion is derived under a working homoskedastic framework and serves as groundwork for further theoretical analysis. Based on Theorem 3, similar theoretical results regarding the heteroskedastic criteria (that is, the criteria underlying 1step-WRFopt and 2steps-WRFopt) can be established.

Corollary 1 (Asymptotic optimality for 1step-WRFopt with CART) Assume that Conditions 1*-3*, 4, and 6-7 hold.
Then, as $n \to \infty$,
$$\frac{L_n(\hat w)}{\inf_{w \in \mathcal{H}} L_n(w)} \to 1 \quad \text{in probability}.$$
If, in addition, there exists an integrable random variable $\eta$ such that $\{L_n(\hat w) \vee \xi_n\}\, \xi_n^{-1} \le \eta$, then
$$\frac{R_n(\hat w)}{\inf_{w \in \mathcal{H}} R_n(w)} \to 1.$$

Corollary 2 (Asymptotic optimality for 2steps-WRFopt with CART) Assume that Conditions 1*-3*, 4, 5* and 6-7 hold. Then, as $n \to \infty$,
$$\frac{L_n(\tilde w)}{\inf_{w \in \mathcal{H}} L_n(w)} \to 1 \quad \text{in probability}.$$
If, in addition, there exists an integrable random variable $\eta$ such that $\{L_n(\tilde w) \vee \xi_n\}\, \xi_n^{-1} \le \eta$, then
$$\frac{R_n(\tilde w)}{\inf_{w \in \mathcal{H}} R_n(w)} \to 1.$$

With the result established in Theorem 3, Corollaries 1 and 2 can readily be verified within the same framework that proves Theorem 2, and their proofs are thus omitted. Theorem 3 and Corollaries 1 and 2 extend the results in Theorems 1 and 2 from SUT to CART. We are unaware of any similar results in this field.

4. Numerical Study

In this section, we conduct experiments using real and semi-synthetic data, the latter derived from the former, to evaluate the performance of different weighted RFs.

4.1 Real Data Analysis

To assess the prediction performance of different weighted RFs in practical situations, we used 11 data sets from the UCI machine learning repository (Dua and Graff, 2017). Because most of these data sets are low-dimensional, one additional high-dimensional data set from openml.org (Vanschoren et al., 2013) was also included. The details of the 12 data sets are listed in Table 2. Appendix B provides a description of two competitors, namely w-RF and CRF. For the sake of brevity, in the following we refer to each data set by its abbreviation. We randomly partitioned each data set into training, testing and validation data in the ratio 0.5 : 0.3 : 0.2. The training data were used to construct trees and to calculate weights, and the testing data were used to evaluate the predictive performance of the different algorithms. The validation data were employed to select tuning parameters, such as the exponent in the expression for calculating weights in w-RF and the probability sequence in the SUT algorithm.
In this section, the number of trees $M_n$ was set to 100. Before each split, the dimension of the random feature sub-space $q$ was set to $\lfloor p/3 \rfloor$, the default value in the regression mode of the R package randomForest. We set the minimum leaf size nodesize to $\lfloor \sqrt{n} \rfloor$ in CART trees and 5 in SUT trees, in order to control the depth of trees. We also tried other values of $M_n$ and nodesize, and the patterns of performance remain stable in general. Figures D.12-D.23 in Appendix D provide more information on the robustness of the proposed methods over different RF hyperparameters.³

3. In our robustness tests for $M_n$, we varied $M_n$ across 100, 200, 400 and 800, while keeping nodesize fixed at $\lfloor \sqrt{n} \rfloor$. For the nodesize robustness tests, nodesize was set to the quintiles of the range $[5, \lfloor \sqrt{n} \rfloor]$, with $M_n$ fixed at 100.

Data set                        Abbreviation   Attributes   Samples
Boston Housing                  BH             12           506
Servo                           Servo          4            167
Concrete Compressive Strength   CCS            9            1030
Airfoil Self-Noise              ASN            5            1503
Combined Cycle Power Plant      CCPP           4            9568
Concrete Slump Test             CST            7            103
Energy Efficiency               EE             8            768
Parkinsons Telemonitoring       PT             20           5875
QSAR Aquatic Toxicity           QSAR           8            546
Synchronous Machine             SM             4            557
Yacht Hydrodynamics             YH             6            308
Tecator                         Tecator        124          240

Table 2: Summary of the 12 data sets.

For each strategy, the number of replications was set to $D = 1000$, and the forecasting performance was assessed by the following two criteria:
$$\mathrm{MSFE} = \frac{1}{D\, n_{\mathrm{test}}} \sum_{d=1}^{D} \sum_{i=1}^{n_{\mathrm{test}}} (y_i - \hat y_{i,d})^2 \quad \text{and} \quad \mathrm{MAFE} = \frac{1}{D\, n_{\mathrm{test}}} \sum_{d=1}^{D} \sum_{i=1}^{n_{\mathrm{test}}} |y_i - \hat y_{i,d}|,$$
where $n_{\mathrm{test}}$ is the size of the testing data and $\hat y_{i,d}$ is the forecast of $y_i$ in the $d$th repetition. MSFE and MAFE abbreviate Mean Squared Forecast Error and Mean Absolute Forecast Error, respectively. As noted in Section 1, an averaging strategy with appropriately selected unequal weights may outperform simple averaging if individual learners display non-identical strength.
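The two forecast-error criteria above are straightforward to compute. The following is our own minimal Python sketch (not the authors' code), assuming a hypothetical array `preds` of shape $D \times n_{\mathrm{test}}$ holding the forecasts $\hat y_{i,d}$:

```python
import numpy as np

def msfe_mafe(y_true, preds):
    """MSFE and MAFE: mean squared / absolute forecast error, averaged
    over all D replications and all n_test test points."""
    resid = np.asarray(preds, dtype=float) - np.asarray(y_true, dtype=float)[None, :]
    return np.mean(resid ** 2), np.mean(np.abs(resid))

# Tiny example: D = 2 replications, n_test = 2 test points.
msfe, mafe = msfe_mafe([1.0, 2.0], [[1.0, 2.0], [2.0, 4.0]])
```

Averaging over both axes at once is equivalent to the double sum in the definitions, since each replication uses the same test set.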
To ascertain the relationship between the performance of the WRFopt and the diversity level of the base learners, we employ the weighted correlation, proposed by Breiman (2001), between the residuals $r_{(m)} = y - \hat y_{(m)}$ and $r_{(m')} = y - \hat y_{(m')}$, where $1 \le m, m' \le M_n$ and $m \ne m'$.

C.4 For all $m, r \in \{1, \ldots, M_n\}$, $\mathrm{trace}\{P^{\star}_{X(m)} P^{\star\top}_{X(m)}\} > 0$ and $\mathrm{trace}\{P^{\star}_{X(m)} P^{\star}_{X(r)}\} \ge 0$,⁶ almost surely.

6. The condition labels here are directly adopted from Qiu et al. (2020) for ease of reference.

C.5 There exists a positive constant $c_1$ such that for all $m \in \{1, \ldots, M_n\}$, $\zeta_{\max}\{P^{\star}_{X(m)} P^{\star\top}_{X(m)}\} \le c_1$ almost surely, where $\zeta_{\max}(B)$ denotes the largest singular value of a generic matrix $B$.

C.6 There exists a positive constant $c_2$ such that for all $m, r \in \{1, \ldots, M_n\}$,
$$\mathrm{trace}\{P^{\star 2}_{X(m)}\} \le c_2\, \mathrm{trace}\{P^{\star}_{X(m)} P^{\star\top}_{X(m)}\} \quad \text{and} \quad \mathrm{trace}\{P^{\star}_{X(r)} P^{\star}_{X(m)} P^{\star}_{X(r)} P^{\star}_{X(m)}\} \le c_2\, \mathrm{trace}\{P^{\star}_{X(m)} P^{\star\top}_{X(m)}\},$$
almost surely.

C.7 $\xi^{\star-1}_n M_n^2 \to 0$ almost surely as $n \to \infty$, and $E(\xi^{\star-1}_n M_n^2)$ exists for any fixed $n \ge 1$.

C.8 There exists a positive constant $v$ such that $E(e_i^4 \mid x^0_i) \le v < \infty$ almost surely for $i = 1, \ldots, n$.

C.9 $\iota_n = \max_{1 \le m \le M_n} \max_{1 \le i \le n} \iota^{(m)}_{ii} = O(n^{-1/2})$ almost surely as $n \to \infty$, where $\iota^{(m)}_{ii}$ is the $i$th diagonal element of $P^{\star}_{X(m)}$.

Then, as $n \to \infty$, we have (7).

The original version of Lemma C.1 is established in the proof of Theorem 1 in Qiu et al. (2020) for non-stochastic $X$. However, as demonstrated in Appendix C.6, by substituting expectations with conditional expectations and applying the Law of Iterated (or Total) Expectation, the pull-out rule and Lebesgue's Dominated Convergence Theorem, the proof of Qiu et al. (2020) readily extends to scenarios with stochastic $X$. We thus omit the step-by-step proof in the current study. Next, we introduce four other lemmas for proving Theorems 1-3.

Lemma C.2 (Gao et al., 2019) Let $\tilde w = \arg\min_{w \in \mathcal{H}} \{L_n(w) + a_n(w) + b_n\}$, where $a_n(w)$ is a term related to $w$ and $b_n$ is a term unrelated to $w$.
If
$$\sup_{w \in \mathcal{H}} \frac{|a_n(w)|}{R_n(w)} = o_p(1), \qquad \sup_{w \in \mathcal{H}} \frac{|R_n(w) - L_n(w)|}{R_n(w)} = o_p(1),$$
and there exist a constant $c$ and a positive integer $n_0$ such that when $n \ge n_0$, $\inf_{w \in \mathcal{H}} R_n(w) \ge c > 0$ almost surely, then $L_n(\tilde w)/\inf_{w \in \mathcal{H}} L_n(w) \to 1$ in probability.

Lemma C.3 (Saniuk and Rhodes, 1987) For any $n \times n$ matrices $G_1$ and $G_2$ with $G_1, G_2 \ge 0$, $\mathrm{trace}(G_1 G_2) \le \|G_1\|_2\, \mathrm{trace}(G_2)$, where $\|\cdot\|_2$ denotes the spectral norm (largest singular value). Moreover, for any $n \times n$ symmetric matrices $G_1$ and $G_2$ with $G_2 \ge 0$, $\mathrm{trace}(G_1 G_2) \le \lambda_{\max}(G_1)\, \mathrm{trace}(G_2)$, where $\lambda_{\max}(\cdot)$ denotes the largest eigenvalue.

Lemma C.4 (Buldygin and Moskvichova, 2013) Let $\beta(p)$ denote a Bernoulli random variable with success probability $p \in [0, 1]$ and $q = 1 - p$. Let $\beta^{(0)}(p)$ be the centered Bernoulli random variable with parameter $p$, that is, $\beta^{(0)}(p) = \beta(p) - E\{\beta(p)\} = \beta(p) - p$. Then, for $p \in (0, 1/2)$, the variance proxy (the square of the sub-Gaussian norm) of $\beta^{(0)}(p)$ is $(1 - 2p)/\{2 \log(p^{-1} - 1)\}$, which is a monotonically increasing function of $p$.

Lemma C.5 (Zhang and Chen, 2021) Let $\{Z_i\}_{i=1}^n$ be sub-Gaussian random variables (without an independence assumption) with mean zero and variance proxy $\tau^2$. Then
$$E\left( \max_{1 \le i \le n} |Z_i| \right) \le \tau \sqrt{2 \log(2n)}.$$

C.2 Proof of Theorem 1

To verify (7), it remains to show that Conditions 1-4 in the main paper guarantee Conditions C.4-C.9 in Lemma C.1. Having configured the base learners within RFs, we now verify that Conditions C.4-C.6 are satisfied when $P^{\star}_{BL(m)}$ is used in place of $P^{\star}_{X(m)}$. For each $m = 1, \ldots, M_n$ and $i, j = 1, \ldots, n$, recall that
$$P^{\star}_{BL(m),ij} = \sum_{s=1}^{2^{k(m)}} \frac{h_{(m),j}\, I\{x_i \in t^{\star}_{k(m),s}\}\, I\{x_j \in t^{\star}_{k(m),s}\}}{\sum_{j'=1}^{n} h_{(m),j'}\, I\{x_{j'} \in t^{\star}_{k(m),s}\}},$$
which is the counterpart of $P_{BL(m),ij}$ (whose precise definition is given in Equation 12),⁷ pertaining to SUT trees.

7. This notation is introduced in Section 3.2.2 in the context of CART trees, but it is also applicable to SUT trees.

Then, there exists a positive constant $c$ such that
$$\mathrm{trace}\{P^{\star}_{BL(m)} P^{\star\top}_{BL(m)}\} = \sum_{s=1}^{2^{k(m)}} \frac{\sum_{i=1}^{n} I\{x_i \in t^{\star}_{k(m),s}\} \sum_{j=1}^{n} h_{(m),j}^2\, I\{x_j \in t^{\star}_{k(m),s}\}}{\left[ \sum_{j'=1}^{n} h_{(m),j'}\, I\{x_{j'} \in t^{\star}_{k(m),s}\} \right]^2} \ge c \sum_{s=1}^{2^{k(m)}} \frac{\sum_{i=1}^{n} h_{(m),i}\, I\{x_i \in t^{\star}_{k(m),s}\} \sum_{j=1}^{n} h_{(m),j}\, I\{x_j \in t^{\star}_{k(m),s}\}}{\left[ \sum_{j'=1}^{n} h_{(m),j'}\, I\{x_{j'} \in t^{\star}_{k(m),s}\} \right]^2} = c\, \ell^{\star}(m) > 0, \qquad (\mathrm{C.1})$$
almost surely, where the first equality uses the disjointness of the end cells and the inequality follows from Condition 4. Clearly, for all $m = 1, \ldots, M_n$, the elements of $P^{\star}_{BL(m)}$ are non-negative; therefore, for all $m, r \in \{1, \ldots, M_n\}$, $\mathrm{trace}\{P^{\star}_{BL(m)} P^{\star}_{BL(r)}\} \ge 0$. Thus, Condition C.4 in Lemma C.1 is satisfied. Besides, we have
$$\|P^{\star}_{BL(m)}\|_{\infty} = \max_{1 \le i \le n} \sum_{j=1}^{n} \sum_{s=1}^{2^{k(m)}} \frac{h_{(m),j}\, I\{x_i \in t^{\star}_{k(m),s}\}\, I\{x_j \in t^{\star}_{k(m),s}\}}{\sum_{l=1}^{n} h_{(m),l}\, I\{x_l \in t^{\star}_{k(m),s}\}} = \max_{1 \le i \le n} \sum_{s=1}^{2^{k(m)}} I\{x_i \in t^{\star}_{k(m),s}\} \le 1, \qquad (\mathrm{C.2})$$
and
$$\|P^{\star}_{BL(m)}\|_{1} = \max_{1 \le j \le n} \sum_{i=1}^{n} \sum_{s=1}^{2^{k(m)}} \frac{h_{(m),j}\, I\{x_i \in t^{\star}_{k(m),s}\}\, I\{x_j \in t^{\star}_{k(m),s}\}}{\sum_{l=1}^{n} h_{(m),l}\, I\{x_l \in t^{\star}_{k(m),s}\}} \le c \max_{1 \le j \le n} \sum_{s=1}^{2^{k(m)}} I\{x_j \in t^{\star}_{k(m),s}\} \frac{\sum_{i=1}^{n} h_{(m),i}\, I\{x_i \in t^{\star}_{k(m),s}\}}{\sum_{l=1}^{n} h_{(m),l}\, I\{x_l \in t^{\star}_{k(m),s}\}} \le c, \qquad (\mathrm{C.3})$$
almost surely. Here, the last step in (C.2) uses the fact that $\sum_{s=1}^{2^{k(m)}} I\{x_i \in t^{\star}_{k(m),s}\} \le 1$, and the inequality in (C.3) comes from Condition 4. Combining (C.2) and (C.3), we have
$$\zeta^2_{\max}\{P^{\star}_{BL(m)} P^{\star\top}_{BL(m)}\} = \lambda_{\max}\{P^{\star}_{BL(m)} P^{\star\top}_{BL(m)} P^{\star}_{BL(m)} P^{\star\top}_{BL(m)}\} = \|P^{\star}_{BL(m)} P^{\star\top}_{BL(m)}\|_2^2 \le \|P^{\star}_{BL(m)}\|_{\infty}^2 \|P^{\star}_{BL(m)}\|_{1}^2 \le c^2,$$
almost surely. Thus, Condition C.5 in Lemma C.1 is satisfied. In addition, from Condition 4 and (C.1), we have
$$\mathrm{trace}\{P^{\star 2}_{BL(m)}\} = \sum_{s=1}^{2^{k(m)}} \frac{\sum_{i=1}^{n} \sum_{l=1}^{n} h_{(m),l}\, h_{(m),i}\, I\{x_i \in t^{\star}_{k(m),s}\}\, I\{x_l \in t^{\star}_{k(m),s}\}}{\left[ \sum_{j'=1}^{n} h_{(m),j'}\, I\{x_{j'} \in t^{\star}_{k(m),s}\} \right]^2} \le c\, \mathrm{trace}\{P^{\star}_{BL(m)} P^{\star\top}_{BL(m)}\},$$
almost surely. Additionally, by Lemma C.3, for each $m, r \in \{1, \ldots$
$\ldots, M_n\}$, we have
$$\mathrm{trace}\{P^{\star}_{BL(m)} P^{\star}_{BL(r)} P^{\star}_{BL(m)} P^{\star}_{BL(r)}\} \le \mathrm{trace}\{P^{\star}_{BL(r)} P^{\star\top}_{BL(r)} P^{\star}_{BL(m)} P^{\star\top}_{BL(m)}\} \le \lambda_{\max}\{P^{\star}_{BL(r)} P^{\star\top}_{BL(r)}\}\, \mathrm{trace}\{P^{\star}_{BL(m)} P^{\star\top}_{BL(m)}\} \le c\, \mathrm{trace}\{P^{\star}_{BL(m)} P^{\star\top}_{BL(m)}\},$$
almost surely, where the first inequality comes from the fact that $\mathrm{trace}\{(A^{\top} B)^2\} \le \mathrm{trace}(A A^{\top} B B^{\top})$ for any generic matrices $A, B \in \mathbb{R}^{m \times n}$, and the second step stems from Lemma C.3. Thus, Condition C.6 in Lemma C.1 is satisfied. Further, Conditions 1 and 2 in the main text are equivalent to Conditions C.7 and C.8 in Lemma C.1, respectively, and Condition 3 guarantees Condition C.9. Based on these observations, it is readily seen that Lemma C.1 holds under Conditions 1-4, and this proves (7).

By (7) in Section 3 and (A34) in the proof of Theorem 2 of Qiu et al. (2020), we further have
$$L^{\star}_n(\hat w^{\star})\, \xi^{\star-1}_n = 1 + o_p(1). \qquad (\mathrm{C.4})$$
This is owing to the facts that
$$L^{\star}_n(\hat w^{\star})\, \xi^{\star-1}_n \le \sup_{w \in \mathcal{H}} \frac{L^{\star}_n(\hat w^{\star})}{L^{\star}_n(w)}\, \sup_{w \in \mathcal{H}} \frac{L^{\star}_n(w)}{R^{\star}_n(w)} \le \frac{L^{\star}_n(\hat w^{\star})}{\inf_{w \in \mathcal{H}} L^{\star}_n(w)} \left\{ 1 + \sup_{w \in \mathcal{H}} \frac{|L^{\star}_n(w) - R^{\star}_n(w)|}{R^{\star}_n(w)} \right\} = 1 + o_p(1), \qquad (\mathrm{C.5})$$
and
$$L^{\star}_n(\hat w^{\star})\, \xi^{\star-1}_n \ge \sup_{w \in \mathcal{H}} \frac{L^{\star}_n(\hat w^{\star})}{L^{\star}_n(w)}\, \inf_{w \in \mathcal{H}} \frac{L^{\star}_n(w)}{R^{\star}_n(w)} \ge \frac{L^{\star}_n(\hat w^{\star})}{\inf_{w \in \mathcal{H}} L^{\star}_n(w)} \left\{ 1 - \sup_{w \in \mathcal{H}} \frac{|L^{\star}_n(w) - R^{\star}_n(w)|}{R^{\star}_n(w)} \right\} = 1 + o_p(1). \qquad (\mathrm{C.6})$$
Then, under the same framework as Appendix A.4 of Zhang et al. (2020), it is readily seen from Lebesgue's Dominated Convergence Theorem that $E[\{L^{\star}_n(\hat w^{\star}) - \xi^{\star}_n\}\, \xi^{\star-1}_n] \to 0$, and this proves (8).

C.3 Proof of Theorem 2

Based on Lemma C.2, we now present the proof of Theorem 2. It is seen that
$$\bar C^{\star}_n(w) = C^{\star}_n(w) + 2 \sum_{i=1}^{n} (e^{\star 2}_i - e^2_i)\, P^{\star}_{ii}(w),$$
where $C^{\star}_n(w)$ and $\bar C^{\star}_n(w)$, related to SUT trees, are the counterparts of $C_n(w)$ and $\bar C_n(w)$, respectively. Hence, from Lemma C.2 above and the derivation of (24) in Qiu et al. (2020), in order to prove (9), we need only verify that
$$\sup_{w \in \mathcal{H}} \frac{\left| \sum_{i=1}^{n} (e^{\star 2}_i - e^2_i)\, P^{\star}_{ii}(w) \right|}{R^{\star}_n(w)} = o_p(1). \qquad (\mathrm{C.7})$$
For each $m = 1, \ldots, M_n$, let $\iota^{\star(m)}_{ii}$ be the $i$th diagonal element of $P^{\star}_{BL(m)}$, and define $Q^{\star}_{(m)} = \mathrm{diag}\{\iota^{\star(m)}_{11}, \ldots, \iota^{\star(m)}_{nn}\}$, $Q^{\star}(w) = \sum_{m=1}^{M_n} w_{(m)} Q^{\star}_{(m)}$, and $K^{\star}_n = \mathrm{diag}\{e^{\star 2}_1 - e^2_1, \ldots, e^{\star 2}_n - e^2_n\}$.
Then, we have
$$\sup_{w \in \mathcal{H}} \frac{\left| \sum_{i=1}^{n} (e^{\star 2}_i - e^2_i)\, P^{\star}_{ii}(w) \right|}{R^{\star}_n(w)} \le \sup_{w \in \mathcal{H}} \frac{|\mathrm{trace}\{Q^{\star}(w) K^{\star}_n\}|}{R^{\star}_n(w)}.$$
We observe that for any $\delta > 0$, under Conditions 2 and 3,
$$\Pr\left\{ \sup_{w \in \mathcal{H}} \frac{|\mathrm{trace}\{Q^{\star}(w) K^{\star}_n\}|}{R^{\star}_n(w)} > \delta \right\} \le \delta^{-1} \sum_{m=1}^{M_n} E\left\{ \mathrm{trace}\left( Q^{\star}_{(m)} |K^{\star}_n| \right) \xi^{\star-1}_n \right\} \le \delta^{-1} \sum_{m=1}^{M_n} E\left( \ell^{\star}(m)\, \|K^{\star}_n\|_2\, \xi^{\star-1}_n \right) \le c_1 \delta^{-1} M_n\, E\left( n^{1/2} \|K^{\star}_n\|_2\, \xi^{\star-1}_n \right) \le c_1 \delta^{-1} M_n\, E^{1/2}\left( n^{1/2} \xi^{\star-1}_n \right)^2 E^{1/2} \|K^{\star}_n\|_2^2 \le c_2 \delta^{-1} E^{1/2}\left( M_n n^{1/2} \xi^{\star-1}_n \right)^2,$$
where $c_1$ and $c_2$ are positive constants, the first inequality follows from Markov's inequality, the second is obtained by Lemma C.3, the third comes from Condition 3, the fourth is the Cauchy-Schwarz inequality, and the last step stems from the boundedness of $E^{1/2}\|K^{\star}_n\|_2^2$, which follows by combining (C.2) with Conditions 2 and 3. Thus, (9) is proved by Condition 5 and Lebesgue's Dominated Convergence Theorem. Similarly to the proof techniques of Theorem 1, we have $L^{\star}_n(\tilde w^{\star})\, \xi^{\star-1}_n = 1 + o_p(1)$, which yields (10).

C.4 Proof of Lemma 1

It follows from the definitions of $P_{BL(m),ij}$ and $P^{\star}_{BL(m),ij}$ that
$$\left| P_{BL(m),ij} - P^{\star}_{BL(m),ij} \right| \le \sum_{s=1}^{2^{k(m)}} \left| \frac{h_{(m),j}\, I\{x_i \in t_{k(m),s}\}\, I\{x_j \in t_{k(m),s}\}}{\sum_{j'=1}^{n} h_{(m),j'}\, I\{x_{j'} \in t_{k(m),s}\}} - \frac{h_{(m),j}\, I\{x_i \in t_{k(m),s}\}\, I\{x_j \in t^{\star}_{k(m),s}\}}{\sum_{j'=1}^{n} h_{(m),j'}\, I\{x_{j'} \in t^{\star}_{k(m),s}\}} \right| + \sum_{s=1}^{2^{k(m)}} \left| \frac{h_{(m),j}\, I\{x_i \in t_{k(m),s}\}\, I\{x_j \in t^{\star}_{k(m),s}\}}{\sum_{j'=1}^{n} h_{(m),j'}\, I\{x_{j'} \in t^{\star}_{k(m),s}\}} - \frac{h_{(m),j}\, I\{x_i \in t^{\star}_{k(m),s}\}\, I\{x_j \in t^{\star}_{k(m),s}\}}{\sum_{j'=1}^{n} h_{(m),j'}\, I\{x_{j'} \in t^{\star}_{k(m),s}\}} \right| =: \Xi_{C41} + \Xi_{C42}. \qquad (\mathrm{C.8})$$
In addition, note that
$$\left| I\{x_j \in t_{k(m),s}\} - I\{x_j \in t^{\star}_{k(m),s}\} \right| = \left[ I\{x_j \in t_{k(m),s}\} - I\{x_j \in t^{\star}_{k(m),s}\} \right]^2.$$
From Condition 4, by splitting $\Xi_{C41}$ according to the change of numerator indicator and the change of normalizing denominator, and bounding each resulting term through the symmetric-difference indicators $I\{x_j \in t_{k(m),s} \triangle t^{\star}_{k(m),s}\}$, one obtains
$$\Xi_{C41} \le 2c\, K_n (N + N^{\star})\, \delta_n, \qquad (\mathrm{C.9})$$
almost surely, and
$$\Xi_{C42} = \sum_{s=1}^{2^{k(m)}} \frac{h_{(m),j}\, \left| I\{x_i \in t_{k(m),s}\} - I\{x_i \in t^{\star}_{k(m),s}\} \right| I\{x_j \in t^{\star}_{k(m),s}\}}{\sum_{j'=1}^{n} h_{(m),j'}\, I\{x_{j'} \in t^{\star}_{k(m),s}\}} \le K_n (N + N^{\star})\, \delta_n. \qquad (\mathrm{C.10})$$
Thus, (14) is verified by combining (C.9)-(C.10) with (C.8). Besides, for each pair $(i, j)$, we also have the decomposition
$$\left| P_{BL(m),ij} - P^{\star}_{BL(m),ij} \right| \le \Xi_{C43} + \Xi_{C44} + \Xi_{C45}, \qquad (\mathrm{C.11})$$
where $\Xi_{C43}$, $\Xi_{C44}$ and $\Xi_{C45}$ collect, respectively, the terms involving $|I\{x_j \in t_{k(m),s}\} - I\{x_j \in t^{\star}_{k(m),s}\}|$, the terms involving $|I\{x_i \in t_{k(m),s}\} - I\{x_i \in t^{\star}_{k(m),s}\}|$, and the terms arising from the change of the normalizing denominators. Note that $\sum_{s=1}^{2^{k(m)}} I\{x_i \in t^{\star}_{k(m),s}\} \le 1$ and $\sum_{s=1}^{2^{k(m)}} I\{x_i \in t_{k(m),s}\} \le 1$ for all $i = 1, \ldots, n$ and $m = 1, \ldots, M_n$.
Similarly to the proofs of (C.9) and (C.10), it is seen that under Condition 4, each of $\Xi_{C43}$, $\Xi_{C44}$ and $\Xi_{C45}$ is bounded by a multiple of $\delta_n$ ((C.12) and (C.14)); in particular,
$$\left| I\{x_i \in t_{k(m),s}\} - I\{x_i \in t^{\star}_{k(m),s}\} \right| \le I\{x_i \in t_{k(m),s} \triangle t^{\star}_{k(m),s}\} \left[ I\{x_i \in t_{k(m),s}\} + I\{x_i \in t^{\star}_{k(m),s}\} \right] \le 2\delta_n. \qquad (\mathrm{C.13})$$
Thus, (15) can be verified by combining (C.11)-(C.14), and this concludes the proof.

C.5 Proof of Theorem 3

In accordance with Section 3.2.2, notations without the script $\star$ correspond to the CART-splitting criterion, while those with the script $\star$ are related to the theoretical CART-splitting criterion. It is evident that $\delta_n$ is a key factor bounding the difference between the matrices $P_{BL(m)}$ and $P^{\star}_{BL(m)}$. From the definition of $p_{(m),is}$, it is clear that
$$\Pr\left[ I\{x_i \in t_{k(m),s} \triangle t^{\star}_{k(m),s}\} = 0 \right] = 1 - p_{(m),is}.$$
Therefore, when $0 < p_{(m),is} < 1/2$, Lemma C.4 implies that $I\{x_i \in t_{k(m),s} \triangle t^{\star}_{k(m),s}\} - p_{(m),is}$ is a centered Bernoulli random variable with variance proxy
$$\tau^2(p_{(m),is}) = \frac{1 - 2 p_{(m),is}}{2 \log(p_{(m),is}^{-1} - 1)},$$
which is a strictly increasing function of $p_{(m),is}$. Then, for $p_{(m),is} \in (0, 1/2)$, we have
$$\max_{1 \le i \le n,\, 1 \le m \le M_n,\, 1 \le s \le 2^{k(m)}} \tau^2(p_{(m),is}) \le \frac{1}{2 \log(\bar p_n^{-1} - 1)}.$$
By Lemma C.5 (which does not require an independence assumption), for each $r \ge 1$, we have
$$E|\delta_n|^r = E|\delta_n| \le E\left[ \max_{1 \le i \le n,\, 1 \le m \le M_n,\, 1 \le s \le 2^{k(m)}} \left| I\{x_i \in t_{k(m),s} \triangle t^{\star}_{k(m),s}\} - p_{(m),is} \right| \right] + \bar p_n \le \sqrt{ \frac{2 \log(2 n M_n K_n)}{\log(\bar p_n^{-1} - 1)} } + \bar p_n, \qquad (\mathrm{C.15})$$
and this provides the upper bound of $E|\delta_n|$ (note that $E|\delta_n|^r = E|\delta_n|$ because $\delta_n \in \{0, 1\}$). With this point addressed, we now resume the primary trajectory of the proof. It suffices to verify that
$$\sup_{w \in \mathcal{H}} \frac{|R_n(w) - L_n(w)|}{R_n(w)} = o_p(1), \qquad (\mathrm{C.16})$$
$$\sup_{w \in \mathcal{H}} \frac{|\bar C_n(w) - L_n(w)|}{R_n(w)} = o_p(1). \qquad (\mathrm{C.17})$$
We will verify them successively.
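The sub-Gaussian step above is easy to probe numerically. The sketch below is our own check (not part of the paper): it evaluates the variance proxy $(1-2p)/\{2\log(p^{-1}-1)\}$ of Lemma C.4, confirms its monotonicity on $(0, 1/2)$, and verifies the moment-generating-function bound $E\exp[\lambda\{\beta(p)-p\}] \le \exp(\tau^2 \lambda^2/2)$ on a grid of $(p, \lambda)$ values.

```python
import math

def variance_proxy(p):
    """Variance proxy of a centered Bernoulli(p), p in (0, 1/2), per
    Lemma C.4: (1 - 2p) / (2 * log(1/p - 1))."""
    return (1 - 2 * p) / (2 * math.log(1 / p - 1))

def mgf_bound_holds(p, lam):
    """Check E exp(lam * (beta - p)) <= exp(tau^2 * lam^2 / 2) exactly,
    using the two-point distribution of the Bernoulli variable."""
    lhs = p * math.exp(lam * (1 - p)) + (1 - p) * math.exp(-lam * p)
    rhs = math.exp(variance_proxy(p) * lam ** 2 / 2)
    return lhs <= rhs + 1e-12

# Monotonicity of the proxy on a grid of p values in (0, 1/2).
proxies = [variance_proxy(p) for p in (0.05, 0.15, 0.25, 0.35, 0.45)]
```

As $p \uparrow 1/2$ the proxy approaches $1/4$, the variance of a fair coin, which matches the bound $1/\{2\log(\bar p_n^{-1}-1)\}$ used in the proof.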
Note that
$$\sup_{w \in \mathcal{H}} \frac{|R_n(w) - L_n(w)|}{R_n(w)} \le \sup_{w \in \mathcal{H}} \frac{|R_n(w) - R^{\star}_n(w)|}{R_n(w)} + \sup_{w \in \mathcal{H}} \frac{|L^{\star}_n(w) - L_n(w)|}{R_n(w)} + \sup_{w \in \mathcal{H}} \frac{|R^{\star}_n(w) - L^{\star}_n(w)|}{R_n(w)}, \qquad (\mathrm{C.18})$$
$$\sup_{w \in \mathcal{H}} \frac{|\bar C_n(w) - L_n(w)|}{R_n(w)} \le \sup_{w \in \mathcal{H}} \frac{|\{\bar C_n(w) - L_n(w)\} - \{\bar C^{\star}_n(w) - L^{\star}_n(w)\}|}{R_n(w)} + \sup_{w \in \mathcal{H}} \frac{|\bar C^{\star}_n(w) - L^{\star}_n(w)|}{R_n(w)}. \qquad (\mathrm{C.19})$$
Moreover, writing $\Delta(w) = P^{\top}(w) P(w) - P^{\star\top}(w) P^{\star}(w)$, we have
$$|R_n(w) - R^{\star}_n(w)| \le \left| E\{ e^{\top} \Delta(w)\, e \mid \mathcal{X} \} \right| + 2 \left| E\{ \mu^{\top} \Delta(w)\, e \mid \mathcal{X} \} \right| + \left| \mu^{\top} \Delta(w)\, \mu \right| + 2 \left| E[ \mu^{\top} \{ P(w) - P^{\star}(w) \} e \mid \mathcal{X} ] \right| + 2 \left| \mu^{\top} \{ P(w) - P^{\star}(w) \} \mu \right|, \qquad (\mathrm{C.20})$$
$$|L_n(w) - L^{\star}_n(w)| \le \left| e^{\top} \Delta(w)\, e \right| + 2 \left| \mu^{\top} \Delta(w)\, e \right| + \left| \mu^{\top} \Delta(w)\, \mu \right| + 2 \left| \mu^{\top} \{ P(w) - P^{\star}(w) \} e \right| + 2 \left| \mu^{\top} \{ P(w) - P^{\star}(w) \} \mu \right|. \qquad (\mathrm{C.21})$$
Now, by (14) of Lemma 1, there exists a positive constant $c_1$ such that
$$\left| e^{\top} \Delta(w)\, e \right| \le \max_{1 \le m, r \le M_n} \left| e^{\top} \left( P^{\top}_{BL(m)} P_{BL(r)} - P^{\star\top}_{BL(m)} P^{\star}_{BL(r)} \right) e \right| \le c_1 \|e\|^2 K_n (N + N^{\star})\, \delta_n \le c_1 \|e\|^2 K_n \bar N_n\, \delta_n, \qquad (\mathrm{C.22})$$
almost surely, where the second-to-last step uses (14) together with the fact that $\sum_{i=1}^{n} P_{BL(m),ti} \le 1$ and $\sum_{i=1}^{n} P^{\star}_{BL(m),ti} \le 1$ for all $m = 1, \ldots, M_n$ and $t = 1, \ldots, n$, and the last step comes from Condition 6. Similarly, one has
$$\left| \mu^{\top} \Delta(w)\, e \right| \le c_1 \|\mu\| \|e\| K_n \bar N_n\, \delta_n, \qquad (\mathrm{C.23})$$
almost surely, and
$$\left| \mu^{\top} \Delta(w)\, \mu \right| \le c_1 \|\mu\|^2 K_n \bar N_n\, \delta_n, \qquad (\mathrm{C.24})$$
almost surely. Besides, we have
$$\sup_{w \in \mathcal{H}} \left| \mu^{\top} \{ P(w) - P^{\star}(w) \} e \right| \le \max_{1 \le m \le M_n} \left| \mu^{\top} \left( P_{BL(m)} - P^{\star}_{BL(m)} \right) e \right| \le c_1 \|\mu\| \|e\| K_n \bar N_n\, \delta_n, \qquad (\mathrm{C.25})$$
almost surely, and
$$\sup_{w \in \mathcal{H}} \left| \mu^{\top} \{ P(w) - P^{\star}(w) \} \mu \right| \le c_1 \|\mu\|^2 K_n \bar N_n\, \delta_n, \qquad (\mathrm{C.26})$$
almost surely.
Then, under Conditions 2*, 4, 6 and 7, combining (C.20) with (C.22)-(C.26), it is readily seen that there exists a positive constant $c_3$ such that
$$E\left\{ \sup_{w \in \mathcal{H}} \frac{|R_n(w) - R^{\star}_n(w)|}{\xi^{\star}_n} \right\} \le c_3 K_n \bar N_n \left[ E\left\{ \frac{\delta_n (1 + \|e\| + \|e\|^2)}{\xi^{\star}_n} \right\} \right] \le 3 c_3\, E^{1/2}\left\{ \xi^{\star-2}_n K_n^2 \bar N_n^2 \log^4(n)\, \delta_n \right\} \le 3 c_3\, E^{1/4}(\delta_n)\, E^{1/4}\left\{ \xi^{\star-4}_n K_n^4 \bar N_n^4 \log^8(n) \right\} \le 3 c_3 \left\{ \sqrt{ \frac{2 \log(2 n M_n K_n)}{\log(\bar p_n^{-1} - 1)} } + \bar p_n \right\}^{1/4} E^{1/4}\left\{ \xi^{\star-4}_n K_n^4 \bar N_n^4 \log^8(n) \right\} = o(1), \qquad (\mathrm{C.27})$$
almost surely. Several facts contribute to obtaining (C.27). The second inequality arises from the Cauchy-Schwarz inequality. The third is owing to the maximal inequality under Bernstein's moment conditions (Zhang and Chen, 2021, Proposition 7.1), along with the fact that $E^{1/r}|Z|^r$ is a non-decreasing function of $r$ for $r > 0$ and any generic random variable $Z$. The penultimate step results from (C.15). Finally, the last equality follows from Condition 7 combined with Lebesgue's Dominated Convergence Theorem. By (C.21), under the same conditions and framework that derive (C.27), we have
$$E\left\{ \sup_{w \in \mathcal{H}} \frac{|L_n(w) - L^{\star}_n(w)|}{\xi^{\star}_n} \right\} = o(1), \qquad (\mathrm{C.28})$$
almost surely. Besides, since the theoretical CART-splitting criterion is independent of the response values, it can be considered a special type of splitting criterion used by SUT trees. Therefore, under Conditions 1*-3* and 4, by combining Theorem 1, (C.27) and the proof of Theorem 1 in Qiu et al. (2020), we have
$$\sup_{w \in \mathcal{H}} \frac{|R^{\star}_n(w) - L^{\star}_n(w)|}{R_n(w)} \le \sup_{w \in \mathcal{H}} \frac{|R^{\star}_n(w) - L^{\star}_n(w)|}{R^{\star}_n(w)} \left\{ 1 + \sup_{w \in \mathcal{H}} \frac{|R_n(w) - R^{\star}_n(w)|}{R_n(w)} \right\} = o_p(1). \qquad (\mathrm{C.29})$$
Then, (C.16) can be verified by combining this with (C.18), (C.20), (C.21), (C.27) and (C.28). Additionally,
$$\left| \{ \bar C_n(w) - L_n(w) \} - \{ \bar C^{\star}_n(w) - L^{\star}_n(w) \} \right| \le 2 \left| e^{\top} \{ P^{\star}(w) - P(w) \} \mu \right| + 2 \left| e^{\top} \{ P^{\star}(w) - P(w) \} e \right| + 2 \hat\sigma^2 \left| \mathrm{trace}\{ P(w) - P^{\star}(w) \} \right| + 2 \left| \hat\sigma^2 - \hat\sigma^{\star 2} \right| \left| \mathrm{trace}\{ P^{\star}(w) \} \right|. \qquad (\mathrm{C.30})$$
Under the same framework that establishes (C.25), one has
$$\left| e^{\top} \{ P(w) - P^{\star}(w) \} e \right| \le c_1 \|e\|^2 K_n \bar N_n\, \delta_n.$$
(C.31) In light of (15) in Lemma 1 and Condition 6, there exists a positive constant $c_2$ such that
$$\sup_{w \in \mathcal{H}} \left| \mathrm{trace}\{ P(w) - P^{\star}(w) \} \right| \le \max_{1 \le m \le M_n} \sum_{i=1}^{n} \left| P_{BL(m),ii} - P^{\star}_{BL(m),ii} \right| \le \delta_n \{ 2 + c_2 (1 + \bar r_n) \}, \qquad (\mathrm{C.32})$$
almost surely. As analogues of (C.2) and (C.3) in the context of CART trees, under Condition 4 there exists a positive constant $c$ such that $\|P_{BL(m)}\|_{\infty} \le 1$ and $\|P_{BL(m)}\|_{1} \le c$, almost surely, for all $m = 1, \ldots, M_n$. Then, we have
$$\max_{1 \le m \le M_n} \|P_{BL(m)}\|_2^2 \le \max_{1 \le m \le M_n} \|P_{BL(m)}\|_{1}\, \|P_{BL(m)}\|_{\infty} \le c, \qquad (\mathrm{C.33})$$
almost surely. Therefore, it is seen that
$$E^{1/2}\left( \hat\sigma^4 \right) = E^{1/2}\left\{ n^{-1} \| y - P(w^0) y \|^2 \right\}^2 \le 2 E^{1/2}\left\{ n^{-1} \left( \|y\|^2 + \|P(w^0)\|_2^2 \|y\|^2 \right) \right\}^2 \le 2 (1 + c^2)\, E^{1/2}\left( n^{-1} \|y\|^2 \right)^2 = O(1), \qquad (\mathrm{C.34})$$
almost surely, where the third step comes from (C.33) and the last step is from Condition 2*. Likewise, we have
$$E^{1/2}\left( \left| \hat\sigma^2 - \hat\sigma^{\star 2} \right|^2 \right) \le E^{1/2}\left\{ n^{-1} \| y - P(w^0) y \|^2 \right\}^2 + E^{1/2}\left\{ n^{-1} \| y - P^{\star}(w^0) y \|^2 \right\}^2 = O(1), \qquad (\mathrm{C.35})$$
almost surely. By combining (C.25), (C.31), (C.32), (C.34) and (C.35) with (C.30), we have
$$E\left\{ \sup_{w \in \mathcal{H}} \frac{|\{\bar C_n(w) - L_n(w)\} - \{\bar C^{\star}_n(w) - L^{\star}_n(w)\}|}{\xi^{\star}_n} \right\} = o(1). \qquad (\mathrm{C.36})$$
Similarly to (C.29), under Conditions 1*-3* and 4, by combining Theorem 1, (C.27) and the proof of Theorem 1 in Qiu et al. (2020), we have
$$\sup_{w \in \mathcal{H}} \frac{|\bar C^{\star}_n(w) - L^{\star}_n(w)|}{R_n(w)} \le \sup_{w \in \mathcal{H}} \frac{|\bar C^{\star}_n(w) - L^{\star}_n(w)|}{R^{\star}_n(w)} \left\{ 1 + \sup_{w \in \mathcal{H}} \frac{|R_n(w) - R^{\star}_n(w)|}{R_n(w)} \right\} = o_p(1). \qquad (\mathrm{C.37})$$
By combining (C.36) and (C.37) with (C.19), we obtain (C.17), and this concludes the proof.

C.6 Verifying the Results of Qiu et al. (2020) for Stochastic X

For the sake of convenience, Qiu et al. (2020) assume in all proofs that $X$ is non-stochastic rather than stochastic. However, the framework developed in Qiu et al. (2020) can readily be extended to stochastic $X$. We take $\sup_{w \in \mathcal{H}} |e^{\top} A^{\star}(w) \mu| / R^{\star}_n(w) = o_p(1)$ as an example, where $A^{\star}(w) = I_n - P^{\star}(w)$. In Appendix A.2 of Qiu et al. (2020), this result is labeled (A6). Note that the framework proposed by Qiu et al. (2020) requires the hat matrix to be independent of the response values, warranting the script $\star$ on the notations in this section.

Proof of (A6) in Qiu et al. (2020) when $X$ is stochastic. Let $\Omega = \mathrm{diag}(\sigma^2_1, \ldots$
$\ldots, \sigma^2_n)$ and $A^{\star}_{(m)} = I_n - P^{\star}_{BL(m)}$ for all $m = 1, \ldots, M_n$. Let $\Phi = \left( \mu^{\top} A^{\star\top}_{(m)} A^{\star}_{(s)} \mu \right)_{M_n \times M_n}$, that is, the $(m,s)$th component of $\Phi$ is $\mu^{\top} A^{\star\top}_{(m)} A^{\star}_{(s)} \mu$; let $G_{n \times M_n} = \left( A^{\star}_{(1)} \mu, \ldots, A^{\star}_{(M_n)} \mu \right)$,
$$\Psi = \left( \mathrm{trace}\{ P^{\star\top}_{BL(m)} P^{\star}_{BL(s)} \Omega \} \right)_{M_n \times M_n}, \qquad \Psi_0 = \mathrm{diag}\left[ \mathrm{trace}\{ P^{\star\top}_{BL(1)} P^{\star}_{BL(1)} \Omega \}, \ldots, \mathrm{trace}\{ P^{\star\top}_{BL(M_n)} P^{\star}_{BL(M_n)} \Omega \} \right].$$
So $\Phi = G^{\top} G$. For any $w \in \mathcal{H}$,
$$w^{\top} \Psi_0 w \le w^{\top} \Psi w, \qquad (\mathrm{C.38})$$
because for any $m, s \in \{1, \ldots, M_n\}$, $w_{(m)} \ge 0$, $w_{(s)} \ge 0$ and $\mathrm{trace}\{ P^{\star\top}_{BL(m)} P^{\star}_{BL(s)} \Omega \} \ge 0$ by Condition C.4. In addition, it is clear that
$$R^{\star}_n(w) = E\left( \| P^{\star}(w) y - \mu \|^2 \mid \mathcal{X} \right) = \| A^{\star}(w) \mu \|^2 + \mathrm{trace}\{ P^{\star\top}(w) P^{\star}(w) \Omega \} = w^{\top} (\Phi + \Psi) w \ge w^{\top} (\Phi + \Psi_0) w, \qquad (\mathrm{C.39})$$
where the last step is from (C.38). We also have
$$\Phi + \Psi_0 > 0, \qquad (\mathrm{C.40})$$
because $\Phi = G^{\top} G \ge 0$ and $\Psi_0 > 0$ by definition. Let $\rho = \left( e^{\top} A^{\star}_{(1)} \mu, \ldots, e^{\top} A^{\star}_{(M_n)} \mu \right)^{\top}$. It is straightforward to show that
$$E(\rho \mid \mathcal{X}) = 0. \qquad (\mathrm{C.41})$$
Besides, under Condition 2, there exists a positive constant $v$ such that
$$\mathrm{Var}(\rho \mid \mathcal{X}) = E(\rho \rho^{\top} \mid \mathcal{X}) = \left( \mu^{\top} A^{\star\top}_{(s)} \Omega A^{\star}_{(m)} \mu \right)_{M_n \times M_n} \le v \Phi, \qquad (\mathrm{C.42})$$
almost surely. It is seen that
$$\sup_{w \in \mathcal{H}} \frac{ \left| e^{\top} A^{\star}(w) \mu \right|^2 }{ R^{\star 2}_n(w) } = \sup_{w \in \mathcal{H}} \frac{ \left( \sum_{m=1}^{M_n} w_{(m)}\, e^{\top} A^{\star}_{(m)} \mu \right)^2 }{ R^{\star 2}_n(w) } \le \sup_{w \in \mathcal{H}} \frac{ \rho^{\top} (\Phi + \Psi_0)^{-1} \rho \; w^{\top} (\Phi + \Psi_0) w }{ R^{\star 2}_n(w) } \le \xi^{\star-1}_n\, \rho^{\top} (\Phi + \Psi_0)^{-1} \rho, \qquad (\mathrm{C.43})$$
where the first inequality is from (C.39), and the last step is from (C.40) and Lemma 1 in Qiu et al. (2020). By Markov's inequality, we have that for any $\delta > 0$,
$$\Pr\left\{ \xi^{\star-1}_n \rho^{\top} (\Phi + \Psi_0)^{-1} \rho > \delta \right\} \le \delta^{-1} E\left\{ \xi^{\star-1}_n \rho^{\top} (\Phi + \Psi_0)^{-1} \rho \right\} = \delta^{-1} E\left[ E\left\{ \xi^{\star-1}_n \rho^{\top} (\Phi + \Psi_0)^{-1} \rho \mid \mathcal{X} \right\} \right] = \delta^{-1} E\left[ \xi^{\star-1}_n E\left\{ \rho^{\top} (\Phi + \Psi_0)^{-1} \rho \mid \mathcal{X} \right\} \right] \le v \delta^{-1} E\left[ \xi^{\star-1}_n\, \mathrm{trace}\left\{ (\Phi + \Psi_0)^{-1} \Phi \right\} \right] \le v \delta^{-1} E\left[ \xi^{\star-1}_n\, \mathrm{trace}\left\{ (\Phi + \Psi_0)^{-1} \Phi + \Psi_0^{1/2} (\Phi + \Psi_0)^{-1} \Psi_0^{1/2} \right\} \right] = v \delta^{-1} E\left( \xi^{\star-1}_n M_n \right), \qquad (\mathrm{C.44})$$
where the second step follows from the Law of Iterated (or Total) Expectation, the third step is obtained by the pull-out rule, and the fourth step is guaranteed by (C.41) and (C.42). Combining (C.43), (C.44) and Condition 1, we obtain the analogue of (A6) in Qiu et al. (2020) by Lebesgue's Dominated Convergence Theorem. This demonstration closely parallels the proof of (A6) in Appendix A.2 of Qiu et al. (2020).
The only modifications lie in the substitution of expectations with conditional expectations in (C.41) and (C.42), and in the applications of the Law of Iterated Expectation, the pull-out rule, and Lebesgue's Dominated Convergence Theorem in (C.44). Similarly, (A7)-(A9) and (A34)-(A35), used in proving Theorems 1 and 2 in Qiu et al. (2020), can also be extended using the same techniques.

C.7 Some Additional Discussion on the Behavior of the Risk Function When Relevant/Important Features Are Not Included in the Model

In many practical situations, investigators often cannot include all the relevant features in the model (Flynn et al., 2013), which leads to non-ignorable misspecification. In the current section, we adopt a simplified setting to demonstrate the impact of ignoring important or relevant features in a weighted random forest. For simplicity, we use the full sample to grow the random forest, and only the theoretical CART-splitting criterion is considered (a similar setting is also considered in Chi et al. (2022)). In this case, the cells are constructed based only on $\Theta = \{\Theta_{(1)}, \ldots, \Theta_{(M_n)}\}$, and $\mathcal{X}$ reduces to the $\sigma$-algebra generated by $\{x^0_1, \ldots, x^0_n, \Theta\}$. We now formally introduce the notion of ignoring relevant/important features in our context. Inspired by Definition 1 of Chi et al. (2022), we say that our variable pool misses relevant/important features when
$$E\left[ \mu(x^0_1) - E\{ \mu(x^0_1) \mid x_1, \Theta \} \right]^2 \ge c_0 > 0 \qquad (\mathrm{C.45})$$
for some positive constant $c_0$, where $\mu(x^0_1) = \mu_1$. In this scenario, we will demonstrate that the conditional risk is asymptotically bounded below by a positive constant. To see this, assume that $\min_{1 \le m \le M_n} E\left[ I\{x_1 \in t_{k(m),s}\} \mid \Theta \right] \ge c_0$.
Then, for a given tree, we have
$$\mu(x^0_i) - \sum_{m=1}^{M_n} w_{(m)} \hat\mu_m(x_i) = \sum_{m=1}^{M_n} w_{(m)} \left\{ \mu(x^0_i) - \hat\mu_m(x_i) \right\},$$
with
$$\hat\mu_m(x_i) = \sum_{s=1}^{2^{k(m)}} I\{x_i \in t_{k(m),s}\}\, \frac{ \sum_{j=1}^{n} y_j\, I\{x_j \in t_{k(m),s}\} }{ \sum_{j=1}^{n} I\{x_j \in t_{k(m),s}\} } = \sum_{s=1}^{2^{k(m)}} I\{x_i \in t_{k(m),s}\}\, \frac{ \sum_{j=1}^{n} \{ \mu(x^0_j) + \varepsilon_j \}\, I\{x_j \in t_{k(m),s}\} }{ \sum_{j=1}^{n} I\{x_j \in t_{k(m),s}\} } = \sum_{s=1}^{2^{k(m)}} I\{x_i \in t_{k(m),s}\}\, \frac{ E\left[ \mu(x^0_1)\, I\{x_1 \in t_{k(m),s}\} \mid \Theta \right] }{ E\left[ I\{x_1 \in t_{k(m),s}\} \mid \Theta \right] } + r_{1,i} + r_{2,i},$$
where
$$r_{1,i} = \sum_{s=1}^{2^{k(m)}} I\{x_i \in t_{k(m),s}\} \left[ \frac{ \sum_{j=1}^{n} \mu(x^0_j)\, I\{x_j \in t_{k(m),s}\} }{ \sum_{j=1}^{n} I\{x_j \in t_{k(m),s}\} } - \frac{ E\left[ \mu(x^0_1)\, I\{x_1 \in t_{k(m),s}\} \mid \Theta \right] }{ E\left[ I\{x_1 \in t_{k(m),s}\} \mid \Theta \right] } \right], \qquad r_{2,i} = \sum_{s=1}^{2^{k(m)}} I\{x_i \in t_{k(m),s}\}\, \frac{ \sum_{j=1}^{n} \varepsilon_j\, I\{x_j \in t_{k(m),s}\} }{ \sum_{j=1}^{n} I\{x_j \in t_{k(m),s}\} }. \qquad (\mathrm{C.46})$$
Now we aim to bound $r_{1,i}$ and $r_{2,i}$. Specifically, assume in addition that $\max_{1 \le m \le M_n} 2^{k(m)} \le \bar K_n$, where $\bar n_n$ and $\bar K_n$ are deterministic positive series that grow to infinity as $n \to \infty$, and $c_0$ is a positive constant. Then, by Condition 2 and the fact that the $x^0_j$ are independent conditionally on $\Theta$, and using Chebyshev's inequality, uniformly for every $i = 1, \ldots, n$ we have
$$E^{1/2} |r_{1,i}|^2 = O\left( \frac{n^{1/2} \bar K_n}{\bar n_n} \right). \qquad (\mathrm{C.47})$$
Similarly, under Condition 2, uniformly for every $i = 1, \ldots, n$, we also have
$$E^{1/2} |r_{2,i}|^2 = O\left( \frac{1}{n^{1/2}} \right). \qquad (\mathrm{C.48})$$
Combining (C.47) and (C.48) with (C.46), it is readily seen that
$$\frac{1}{n} \sum_{i=1}^{n} E\left\{ \Big| \mu(x^0_i) - \sum_{m=1}^{M_n} w_{(m)} \hat\mu_m(x_i) \Big|^2 \,\Big|\, \mathcal{X} \right\} = E_{x_{1,c} \mid x_1, \Theta}\left\{ \Big| \mu(x^0_1) - \sum_{m=1}^{M_n} w_{(m)} \sum_{s=1}^{2^{k(m)}} I\{x_1 \in t_{k(m),s}\}\, \frac{ E\left[ \mu(x^0_1)\, I\{x_1 \in t_{k(m),s}\} \mid \Theta \right] }{ E\left[ I\{x_1 \in t_{k(m),s}\} \mid \Theta \right] } \Big|^2 \right\} + O_p\left( \frac{1}{n^{1/2}} + \frac{n^{1/2} \bar K_n}{\bar n_n} \right), \qquad (\mathrm{C.49})$$
where $x_{1,c}$ represents the vector containing the features that are not included in the model, and $E_{x_{1,c} \mid x_1, \Theta}(\cdot)$ is taken with respect to those missing features, conditional on $x_1$ and $\Theta$.
Then, by the Projection Theorem, it is readily seen that the leading term on the right-hand side of the last line of (C.49) satisfies
$$E_{x_{1,c} \mid x_1, \Theta}\left\{ \Big| \mu(x^0_1) - \sum_{m=1}^{M_n} w_{(m)} \sum_{s=1}^{2^{k(m)}} I\{x_1 \in t_{k(m),s}\}\, \frac{ E\left[ \mu(x^0_1)\, I\{x_1 \in t_{k(m),s}\} \mid \Theta \right] }{ E\left[ I\{x_1 \in t_{k(m),s}\} \mid \Theta \right] } \Big|^2 \right\} \ge E_{x_{1,c} \mid x_1, \Theta}\left[ \mu(x^0_1) - E\{ \mu(x^0_1) \mid x_1, \Theta \} \right]^2,$$
where we have used the fact that
$$\sum_{m=1}^{M_n} w_{(m)} \sum_{s=1}^{2^{k(m)}} I\{x_1 \in t_{k(m),s}\}\, \frac{ E\left[ \mu(x^0_1)\, I\{x_1 \in t_{k(m),s}\} \mid \Theta \right] }{ E\left[ I\{x_1 \in t_{k(m),s}\} \mid \Theta \right] }$$
is $\sigma(x_1, \Theta)$-measurable, with $\sigma(x_1, \Theta)$ being the $\sigma$-algebra generated by $\{x_1, \Theta\}$. This yields that, if we further assume $n^{1/2} \bar K_n / \bar n_n = o(1)$, one has
$$\inf_{w \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} E\left\{ \Big| \mu(x^0_i) - \sum_{m=1}^{M_n} w_{(m)} \hat\mu_m(x_i) \Big|^2 \,\Big|\, \mathcal{X} \right\} \ge c_0 + o_p(1).$$
These discussions indicate that if some relevant/important features are missing from the model, it is expected that $\xi^{-1}_n$ achieves a satisfactory rate of convergence.

Appendix D. Additional Tables and Figures

Figure D.1: Improvement Ratio versus noise on the BH data set, comparing 1step-WRFopt, 2steps-WRFopt, w-RF and CRF across three panels: (a) noise in X only; (b) noise in y only; (c) noise in both X and y.

Algorithm A.2: SUT

Split_a_node(S)
Input: the local learning subset S corresponding to the node we want to split
Output: a split [a < c] or nothing
- If Stop_split(S) is TRUE, then return nothing.
- Otherwise, select q attributes {a1, . . . , aq} according to the probability sequence P among all non-constant (in S) candidate attributes; // hyperparameter: probability sequence P = {P1, . . . , Pp}, where Pj in [0, 1] for j = 1, . . . , p and P1 + · · · + Pp = 1
- Draw q splits {s1, . . . , sq}, where si = Pick_a_split(S, ai), i = 1, . . . , q;
- Return a split s* such that Score(s*, S) = max over i = 1, . . . , q of Score(si, S).

Pick_a_split(S, a)
Input: a subset S and an attribute a
Output: a split
- Let a_max^S and a_min^S be the maximal and minimal values of a in S;
- Calculate the cut-point c = (a_min^S + a_max^S)/2;
- Return the split [a < c].

Stop_split(S)
Input: a subset S
Output: a boolean
- If |S| < nodesize, then return TRUE.
- If all attributes are constant in S, then return TRUE.
- If the output is constant in S, then return TRUE.
- Otherwise, return FALSE.

Score(s, S)
Input: a split s and a subset S
Output: the score of this split
- Let XP, XL, XR be the attribute matrices of the local parent node, the left daughter, and the right daughter, respectively;
- Let nP, nL, nR be the numbers of samples contained in the local parent node, the left daughter, and the right daughter, respectively;
- Obtain X̃P, X̃L, X̃R by centering and scaling each column of the matrices XP, XL, XR, respectively;
- Compute the score from X̃P, X̃L, X̃R and nP, nL, nR;
- Return score.

Algorithm B.1: 2steps-WRFopt
Input: (1) the training data set D = {yi, xi}_{i=1}^n; (2) the number of trees in the random forest, Mn
Output: weight vector w̃ ∈ H
1 for m = 1 to Mn do
2   Draw a bootstrap data set D(m) of size n from the training data set D;
3   Grow a random-forest tree f̂(m) on the bootstrap data D(m), by recursively repeating the following steps for each terminal node of the tree, until the minimum node size nodesize is reached; // nodesize and q are hyper-parameters
4     i. Select q variables at random from the p variables;
5     ii. Pick the best variable/splitting point among the q;
6     iii. Split the node into two daughter nodes.
7   for i = 1 to n do
8     Drop xi down the mth tree and get PBL(m)(xi, X, y, B(m), Θ(m)).
9   end
10  PBL(m) ← {PBL(m)(x1, X, y, B(m), Θ(m)), . . . , PBL(m)(xn, X, y, B(m), Θ(m))}⊤.
11 end
12 Solve the quadratic programming problem: w* = (w*(1), . . . , w*(Mn))⊤ ← argmin_{w∈H} Cn(w);
13 ẽ ← {In − P(w*)}y with P(w*) = Σ_{m=1}^{Mn} w*(m) PBL(m);
14 Solve the quadratic programming problem: w̃ = (w̃(1), . . . , w̃(Mn))⊤ ← argmin_{w∈H} C̃n(w), with the criterion recomputed from the residual vector ẽ.
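The weight-determination steps of Algorithm B.1 (lines 12 and 14) are quadratic programs over the simplex H. The sketch below shows only this QP structure, using a plain squared-error objective in place of the criteria Cn(w) (whose exact form is given in the main text); the fitted vectors in F and the projected-gradient routine are illustrative stand-ins, not the paper's implementation.

```python
import numpy as np

# A minimal sketch of the weight-determination step in 2steps-WRFopt:
# given per-tree fitted vectors (the columns of F, playing the role of
# PBL(m) y), choose weights on the probability simplex H that minimize a
# squared-error criterion.  The paper's actual criteria C_n(w) contain
# additional terms; plain squared error is used here only to show the
# quadratic-programming structure, and all data below are hypothetical.

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u > css / (np.arange(len(v)) + 1))[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def solve_weights(F, y, n_iter=2000):
    """Projected gradient descent for min_w ||y - F w||^2 over the simplex."""
    m = F.shape[1]
    w = np.full(m, 1.0 / m)                          # start from equal weights
    step = 0.5 / (np.linalg.norm(F, 2) ** 2)         # 1/L for this quadratic
    for _ in range(n_iter):
        grad = 2.0 * F.T @ (F @ w - y)
        w = project_to_simplex(w - step * grad)
    return w

rng = np.random.default_rng(1)
y = rng.normal(size=50)
F = np.column_stack([y,                          # "tree" 1: fits y exactly
                     y + rng.normal(0, 1, 50),   # "tree" 2: noisy fit
                     rng.normal(size=50)])       # "tree" 3: unrelated
w = solve_weights(F, y)
print(np.round(w, 3))  # most weight should land on the first column
```

In practice, a dedicated QP solver would replace the hand-rolled projected gradient loop; the point here is only that the simplex constraint and quadratic objective make each of lines 12 and 14 a standard convex program.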
Algorithm B.2: w RF
Input: (1) the training data set D = {yi, xi}_{i=1}^n; (2) the number of trees in the RF, Mn; (3) the parameter λ
Output: weight vector ŵ ∈ H
1 for m = 1 to Mn do
2   Draw a bootstrap data set D(m) of size n from the training data set D;
3   Grow a random-forest tree f̂(m) on the bootstrap data D(m), by recursively repeating the following steps for each node of the tree, until the minimum node size nodesize is reached; // nodesize and q are hyper-parameters
4     i. Select q variables at random from the p variables;
5     ii. Pick the best variable/split-point among the q;
6     iii. Split the node into two daughter nodes.
7   tPE_m ← (Σ_{i=1}^n OOB_im)^{-1} Σ_{i=1}^n |f̂(m)(xi) − yi| OOB_im;
8   ŵ(m) ← (1/tPE_m)^λ;
9 end
10 ŵ ← (ŵ(1), . . . , ŵ(Mn))⊤;
11 ŵ ← ŵ/‖ŵ‖₁.

Algorithm B.3: CRF
Input: (1) the training data set D = {yi, xi}_{i=1}^n; (2) the number of trees in the RF, Mn
Output: weight vector ŵ ∈ H
1 for m = 1 to Mn do
2   Draw a bootstrap data set D(m) of size n from the training data set D;
3   Grow a random-forest tree f̂(m) on the bootstrap data D(m), by recursively repeating the following steps for each node of the tree, until the minimum node size nodesize is reached; // nodesize and q are hyper-parameters
4     i. Select q variables at random from the p variables;
5     ii. Pick the best variable/split-point among the q;
6     iii. Split the node into two daughter nodes.
7   tPE_m ← (Σ_{i=1}^n OOB_im)^{-1} Σ_{i=1}^n |f̂(m)(xi) − yi| OOB_im;
8 end
9 Sort the sequence {tPE_1, . . . , tPE_Mn} from smallest to largest;
10 for m = 1 to Mn do
11   r_m ← the rank of the mth tree in the sorted sequence;
12   ŵ(m) ← Σ_{ν=r_m}^{Mn} 1/ν;
13 end
14 ŵ ← (ŵ(1), . . . , ŵ(Mn))⊤;
15 ŵ ← ŵ/‖ŵ‖₁.
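In contrast to the optimization-based weights of Algorithm B.1, Algorithms B.2 and B.3 assign weights in closed form from the tuned OOB prediction errors tPE_m. A minimal sketch of the two weight formulas, with hypothetical tPE values and λ:

```python
import numpy as np

# Sketch of the two closed-form weighting rules of Algorithms B.2 and B.3,
# starting from tuned OOB prediction errors tPE_m.  The tPE values and
# lambda below are hypothetical; only the weight formulas follow the
# algorithm listings.
t_pe = np.array([0.8, 1.0, 1.3, 2.0])   # hypothetical OOB errors, one per tree
lam = 2.0                                # hyper-parameter lambda of w RF

# w RF (Algorithm B.2): w(m) proportional to (1 / tPE_m)^lambda
w_wrf = (1.0 / t_pe) ** lam
w_wrf /= w_wrf.sum()                     # normalize onto the simplex

# CRF (Algorithm B.3): rank trees by tPE (smallest error gets rank r_m = 1),
# then w(m) proportional to sum_{nu = r_m}^{M_n} 1 / nu
M = len(t_pe)
ranks = np.argsort(np.argsort(t_pe)) + 1          # r_m in {1, ..., M}
harmonic_tails = np.array([np.sum(1.0 / np.arange(r, M + 1)) for r in ranks])
w_crf = harmonic_tails / harmonic_tails.sum()

print(np.round(w_wrf, 3), np.round(w_crf, 3))
```

Both rules are monotone in tPE_m (smaller OOB error means larger weight); w RF sharpens or flattens this ordering through λ, while CRF depends on the errors only through their ranks.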
Data set      RF         2steps-WRFopt  1step-WRFopt  w RF       CRF
BH            11.913(5)  11.380(1)      11.445(3)     11.401(2)  11.521(4)
Servo         1.053(3)   1.057(4)       1.063(5)      1.017(1)   1.049(2)
CCS           27.944(4)  27.658(1)      27.721(3)     27.703(2)  27.962(5)
ASN           5.308(5)   5.095(1)       5.103(2)      5.241(4)   5.232(3)
CCPP          15.818(2)  16.084(4)      16.256(5)     15.790(1)  15.901(3)
CST           10.618(5)  9.319(1)       9.395(2)      9.740(3)   10.109(4)
EE            3.533(2)   3.534(3)       3.541(4)      3.485(1)   3.559(5)
PT            1.806(5)   1.452(2)       1.449(1)      1.571(3)   1.609(4)
QSAR          1.378(3)   1.387(4)       1.397(5)      1.358(1)   1.368(2)
SM (×10^-5)   2.629(5)   1.994(1)       1.999(2)      2.444(3)   2.444(4)
YH            2.495(5)   1.850(1)       1.891(3)      1.869(2)   2.072(4)
Tecator       3.790(5)   2.955(2)       2.948(1)      3.178(3)   3.338(4)

Table D.1: Test Error Comparisons by MSFE for Different Forests on High-Dimensional Data

[Figure D.2: Improvement Ratio vs Noise on Servo Data Set; panels: (a) Noise in X Only, (b) Noise in y Only, (c) Noise in Both X and y.]

Data set      RF        2steps-WRFopt  1step-WRFopt  w RF      CRF
BH            2.293(5)  2.273(3)       2.284(4)      2.254(1)  2.271(2)
Servo         0.610(3)  0.613(4)       0.614(5)      0.604(1)  0.609(2)
CCS           3.870(5)  3.836(1)       3.841(2)      3.847(3)  3.866(4)
ASN           1.672(5)  1.646(1)       1.648(2)      1.662(3)  1.663(4)
CCPP          3.030(2)  3.048(4)       3.060(5)      3.026(1)  3.034(3)
CST           2.504(5)  2.311(1)       2.320(2)      2.377(3)  2.434(4)
EE            1.191(2)  1.192(3)       1.194(4)      1.184(1)  1.197(5)
PT            0.860(5)  0.777(1)       0.777(1)      0.794(3)  0.808(4)
QSAR          0.871(3)  0.873(4)       0.876(5)      0.864(1)  0.867(2)
SM (×10^-3)   2.882(5)  2.763(1)       2.773(2)      2.813(3)  2.846(4)
YH            0.724(5)  0.667(2)       0.675(3)      0.651(1)  0.676(4)
Tecator       1.342(5)  1.207(1)       1.207(1)      1.246(3)  1.275(4)

Table D.2: Test Error Comparisons by MAFE for Different Forests on High-Dimensional Data

[Figure D.3: Improvement Ratio vs Noise on CCS Data Set; panels: (a) Noise in X Only, (b) Noise in y Only, (c) Noise in Both X and y.]
[Figure D.4: Improvement Ratio vs Noise on CST Data Set; panels: (a) Noise in X Only, (b) Noise in y Only, (c) Noise in Both X and y.]
[Figure D.5: Improvement Ratio vs Noise on EE Data Set; panels: (a) Noise in X Only, (b) Noise in y Only, (c) Noise in Both X and y.]
[Figure D.6: Improvement Ratio vs Noise on PT Data Set; panels: (a) Noise in X Only, (b) Noise in y Only, (c) Noise in Both X and y.]
[Figure D.7: Improvement Ratio vs Noise on QSAR Data Set; panels: (a) Noise in X Only, (b) Noise in y Only, (c) Noise in Both X and y.]
[Figure D.8: Improvement Ratio vs Noise on SM Data Set; panels: (a) Noise in X Only, (b) Noise in y Only, (c) Noise in Both X and y.]
[Figure D.9: Improvement Ratio vs Noise on YH Data Set; panels: (a) Noise in X Only, (b) Noise in y Only, (c) Noise in Both X and y.]
[Figure D.10: Improvement Ratio vs Noise on Tecator Data Set; panels: (a) Noise in X Only, (b) Noise in y Only, (c) Noise in Both X and y.]
[Figure D.11: Comparative Analysis of Conventional RF Predictive Performance Between Original and Augmented Data Sets, over the BH, Servo, CCS, ASN, CCPP, CST, EE, PT, QSAR, SM, YH, and Tecator data sets. Relative risks are calculated as the ratio of conventional RF risks on augmented data sets to those on original data sets.]
[Figure D.12: Improvement Ratio vs Hyper Parameters of RF on BH Data Set; panels over ntree ∈ {100, 200, 400, 800} and nodesize ∈ {5, 8, 11, 14, 18}; curves: 1step-WRFopt, 2steps-WRFopt, w RF, CRF.]
[Figure D.13: Improvement Ratio vs Hyper Parameters of RF on Servo Data Set; ntree ∈ {100, 200, 400, 800}, nodesize ∈ {5, 6, 7, 8, 9, 10}.]
[Figure D.14: Improvement Ratio vs Hyper Parameters of RF on CCS Data Set; ntree ∈ {100, 200, 400, 800}, nodesize ∈ {5, 10, 15, 20, 26}.]
[Figure D.15: Improvement Ratio vs Hyper Parameters of RF on ASN Data Set; ntree ∈ {100, 200, 400, 800}, nodesize ∈ {5, 12, 19, 32}.]
[Figure D.16: Improvement Ratio vs Hyper Parameters of RF on CCPP Data Set; ntree ∈ {100, 200, 400, 800}, nodesize ∈ {5, 24, 43, 62, 81}.]
[Figure D.17: Improvement Ratio vs Hyper Parameters of RF on CST Data Set; ntree ∈ {100, 200, 400, 800}, nodesize ∈ {5, 6, 7, 8}.]
[Figure D.18: Improvement Ratio vs Hyper Parameters of RF on EE Data Set; ntree ∈ {100, 200, 400, 800}, nodesize ∈ {5, 9, 13, 17, 23}.]
[Figure D.19: Improvement Ratio vs Hyper Parameters of RF on PT Data Set; ntree ∈ {100, 200, 400, 800}, nodesize ∈ {5, 20, 35, 64}.]
[Figure D.20: Improvement Ratio vs Hyper Parameters of RF on QSAR Data Set; ntree ∈ {100, 200, 400, 800}, nodesize ∈ {5, 9, 13, 19}.]
[Figure D.21: Improvement Ratio vs Hyper Parameters of RF on SM Data Set; ntree ∈ {100, 200, 400, 800}, nodesize ∈ {5, 9, 13, 19}.]
[Figure D.22: Improvement Ratio vs Hyper Parameters of RF on YH Data Set; ntree ∈ {100, 200, 400, 800}, nodesize ∈ {5, 7, 9, 11, 14}.]
[Figure D.23: Improvement Ratio vs Hyper Parameters of RF on Tecator Data Set; ntree ∈ {100, 200, 400, 800}, nodesize ∈ {5, 7, 9, 12}.]

References

S. Athey and G. Imbens. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353-7360, 2016.
G. Biau. Analysis of a random forests model. The Journal of Machine Learning Research, 13(1):1063-1095, 2012.
G. Biau and E. Scornet. A random forest guided tour. Test, 25:197-227, 2016.
L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.
V. V. Buldygin and K. K. Moskvichova. The sub-Gaussian norm of a binary random variable. Theory of Probability and Mathematical Statistics, 86:33-49, 2013.
T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785-794, 2016.
C. M. Chi, P. Vossler, Y. Fan, and J. Lv. Asymptotic properties of high-dimensional random forests. The Annals of Statistics, 50(6):3415-3438, 2022.
D. Dua and C. Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
C. J. Flynn, C. M. Hurvich, and J. S. Simonoff. Efficiency for regularization parameter selection in penalized likelihood estimation of misspecified models. Journal of the American Statistical Association, 108:1031-1043, 2013.
Y. Gao, X. Zhang, S. Wang, T. T. Chong, and G. Zou. Frequentist model averaging for threshold models. Annals of the Institute of Statistical Mathematics, 71(2):275-306, 2019.
P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3-42, 2006.
P. Ghosh, A. Neufeld, and J. K. Sahoo.
Forecasting directional movements of stock prices for intraday trading using LSTM and random forests. Finance Research Letters, 46(Part A):102280, 2022.
B. E. Hansen. Least squares model averaging. Econometrica, 75(4):1175-1189, 2007.
B. E. Hansen and J. S. Racine. Jackknife model averaging. Journal of Econometrics, 167(1):38-46, 2012.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, chapter 9, pages 307-308. Springer, 2009.
G. James, D. Witten, T. Hastie, and R. Tibshirani. An Introduction to Statistical Learning, chapter 8, pages 311-314. Springer, 2013.
J. M. Klusowski. Universal consistency of decision trees in high dimensions. arXiv preprint arXiv:2104.13881, 2021.
H. Li, W. Wang, H. Ding, and J. Dong. Trees weighting random forest method for classifying high-dimensional noisy data. In 2010 IEEE 7th International Conference on E-Business Engineering, pages 160-163, 2010.
J. Lin, S. Lu, X. He, and F. Wang. Analyzing the impact of three-dimensional building structure on CO2 emissions based on random forest regression. Energy, 236(1):121502, 2021.
Y. Lin and Y. Jeon. Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101(474):578-590, 2006.
H. Pallathadka, E. H. Ramirez-Asis, T. P. Loli-Poma, K. Kaliyaperumal, R. J. M. Ventayen, and M. Naved. Applications of artificial intelligence in business management, e-commerce and finance. Materials Today: Proceedings, 80:2610-2613, 2023.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.
J. Peng and Y. Yang. On improvability of model selection by model averaging. Journal of Econometrics, 229(2):246-262, 2022.
H. Pham and S. Olafsson.
On Cesàro averages for weighted trees in the random forest. Journal of Classification, 37(1):1-14, 2019.
P. Probst, M. N. Wright, and A. L. Boulesteix. Hyperparameters and tuning strategies for random forest. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 9(3):e1301, 2019.
Y. Qiu, T. Xie, J. Yu, and X. Zhang. Mallows-type averaging machine learning techniques. Working paper, 2020.
I. Reis, D. Baron, and S. Shahaf. Probabilistic random forest: A machine learning algorithm for noisy data sets. The Astronomical Journal, 157(1):16, 2018.
J. Saniuk and I. Rhodes. A matrix inequality associated with bounds on solutions of algebraic Riccati and Lyapunov equations. IEEE Transactions on Automatic Control, 32(8):739-740, 1987.
E. Scornet, G. Biau, and J. P. Vert. Consistency of random forests. The Annals of Statistics, 43(4):1716-1741, 2015.
M. R. Segal. Machine learning benchmarks and random forest regression. Center for Bioinformatics and Molecular Biostatistics, 2004.
J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1297-1304, 2011.
V. Svetnik, A. Liaw, C. Tong, J. C. Culberson, R. P. Sheridan, and B. P. Feuston. Random forest: A classification and regression tool for compound classification and QSAR modeling. Journal of Chemical Information and Computer Sciences, 43(6):1947-1958, 2003.
V. Syrgkanis and M. Zampetakis. Estimation and inference with trees and forests in high dimensions. In Proceedings of Thirty Third Conference on Learning Theory: PMLR, volume 125, pages 3453-3454, 2020.
J. Vanschoren, J. N. Van Rijn, B. Bischl, and L. Torgo. OpenML: Networked science in machine learning. SIGKDD Explorations, 15(2):49-60, 2013. doi: 10.1145/2641190.2641198. URL http://doi.acm.org/10.1145/2641190.2641198.
S. J. Winham, R. R. Freimuth, and J.
M. Biernacka. A weighted random forests approach to improve predictive performance. Statistical Analysis and Data Mining: The ASA Data Science Journal, 6(6):496-505, 2013.
S. Xuan, G. Liu, and Z. Li. Refined weighted random forest and its application to credit card fraud detection. In Computational Data and Social Networks, pages 343-355, 2018.
J. Yoon. Forecasting of real GDP growth using machine learning models: Gradient boosting and random forest approach. Computational Economics, 57(1):247-265, 2021.
H. Zhang and S. Chen. Concentration inequalities for statistical inference. Communications in Mathematical Research, 37(1):1-85, 2021.
X. Zhang. A new study on asymptotic optimality of least squares model averaging. Econometric Theory, 37(2):388-407, 2021.
X. Zhang, A. T. K. Wan, and G. Zou. Model averaging by jackknife criterion in models with dependent data. Journal of Econometrics, 174(2):82-94, 2013.
X. Zhang, D. Yu, G. Zou, and H. Liang. Optimal model averaging estimation for generalized linear models and generalized linear mixed-effects models. Journal of the American Statistical Association, 111(516):1775-1790, 2016.
X. Zhang, G. Zou, H. Liang, and R. J. Carroll. Parsimonious model averaging with a diverging number of parameters. Journal of the American Statistical Association, 115(530):972-984, 2020.
S. Zhao, X. Zhang, and Y. Gao. Model averaging with averaging covariance matrix. Economics Letters, 145:214-217, 2016.
Z. Zhou. Ensemble Methods: Foundations and Algorithms, chapter 4, pages 68-71. CRC Press, 2012.
Z. Zhou, J. Wu, and W. Tang. Ensembling neural networks: Many could be better than all. Artificial Intelligence, 137(1-2):239-263, 2002.
J. Zou, W. Wang, X. Zhang, and G. Zou. Optimal model averaging for divergent-dimensional Poisson regressions. Econometric Reviews, 41(7):775-805, 2022.