# generalized_random_forests_using_fixedpoint_trees__616669ff.pdf Generalized Random Forests using Fixed-Point Trees David Fleischer * 1 David A. Stephens * 1 Archer Y. Yang * 1 2 We propose a computationally efficient alternative to generalized random forests (GRFs) for estimating heterogeneous effects in large dimensions. While GRFs rely on a gradient-based splitting criterion, which in large dimensions is computationally expensive and unstable, our method introduces a fixed-point approximation that eliminates the need for Jacobian estimation. This gradientfree approach preserves GRF s theoretical guarantees of consistency and asymptotic normality while significantly improving computational efficiency. We demonstrate that our method achieves a speedup of multiple times over standard GRFs without compromising statistical accuracy. Experiments on both simulated and real-world data validate our approach. Our findings suggest that the proposed method is a scalable alternative for localized effect estimation in machine learning and causal inference applications. 1. Introduction In many real-world machine learning (ML) applications, practitioners seek to estimate how quantities of interest vary across different feature subgroups rather than assuming uniform effects. For example, medical interventions and policy treatments often have heterogeneous impacts across subpopulations, making localized estimation crucial for improving outcomes (Imai & Ratkovic, 2013; Knaus et al., 2021; Murdoch et al., 2019; Lee et al., 2020). Similarly, individualized recommendation systems adapt to user-specific features to enhance performance (Kohavi et al., 2013). A key example of localized estimation arises in causal inference, where modern applications prioritize individualized treatment effects over average treatment effects (Neyman, 1Department of Mathematics and Statistics, Mc Gill University, Montreal, Canada 2Mila - Quebec AI Institute, Montreal, Quebec, Canada. Correspondence to: Archer Y. Yang . Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s). 1923; Rubin, 1974). The double machine learning framework (Chernozhukov et al., 2018) unifies various ML-based causal estimation methods, including lasso (Belloni et al., 2017), random forests (Athey et al., 2019; Cevid et al., 2022), boosting (Powers et al., 2018), deep learning (Johansson et al., 2016; Shalit et al., 2017), and general-purpose meta-algorithms (Nie & Wager, 2021; K unzel et al., 2019), all of which focus on capturing variation over feature space. Generalized random forests (GRFs) (Athey et al., 2019; Wager & Athey, 2018) have emerged as a powerful tool for such tasks, leveraging adaptive partitioning with problemspecific moment conditions instead of standard loss-based splits. GRFs apply broadly to a wide range of important statistical models local linear regression (Friedberg et al., 2020), survival analysis and missing data problems (Cui et al., 2023), nonparametric quantile regression, heterogeneous treatment effect estimation, and nonlinear instrumental variables regression (Athey & Imbens, 2016; Athey et al., 2019). Unlike local linear models (Fan et al., 1995; Fan & Gijbels, 1996; Friedberg et al., 2020) or kernel-based models (Staniswalis, 1989; Severini & Staniswalis, 1994; Lewbel, 2007; Speckman, 1988; Robinson, 1988) which suffer from the curse of dimensionality (Robins & Ritov, 1997), the tree-based approach of GRF offers a more scalable solution. 
However, GRFs gradient-based approach (Athey et al., 2019) becomes computationally expensive and unstable in large dimensions due to the reliance on Jacobian estimators for tree splitting. To address this, we propose a gradientfree approach based on fixed-point iteration, eliminating the need for Jacobian estimation while retaining GRF s theoretical guarantees of consistency and asymptotic normality. Our method significantly improves computational efficiency while maintaining statistical accuracy, achieving significant speedups in experiments on simulated and real-world datasets. 2. Background and Related Work Given data (Xi, Oi) X O, GRF estimates a target function θ (x), defined as the solution to an estimating equation of the form 0 = EO|X ψθ (x),ν (x)(O) | X = x , (1) Generalized Random Forests using Fixed-Point Trees for all x X, where ψ is a score function that identifies the true (θ (x), ν (x)) as the root of (1), and ν (x) is an optional nuisance function. GRF can be understood from a nearest-neighbor perspective as approximating θ (x) through a locally parametric θ within small neighborhoods of test point x. Suppose L(x) {Xi}n i=1 is a subset of training observations of the covariates found in a region around x X over which θ (x) can be well-approximated by a local parameter. Observations Xi L(x) serve as local representatives for x in estimating θ (x) such that, given sufficiently many training samples in a small enough neighborhood of x, an empirical version of (1) over Xi L(x) defines an estimator ˆθL(x) that approaches θ (x), (ˆθL(x), ˆνL(x)) arg min θ,ν |L(x)| ψθ,ν(Oi) In GRF, the set of local representatives L(x) is determined by tree-based partitions which divide the input space into disjoint regions, or leaves. The training samples Xi that fall in the same leaf as x form the subset L(x). However, single trees are known to have high variance with respect to small changes in the training data (Amit & Geman, 1997; Breiman, 1996; 2001; Dietterich, 2000), leading to estimates (2) that do not generalize well to values of x that are not part of the training set. GRF improves its estimates by leveraging an estimating function that averages many estimating functions of the form (2). Specifically, let Lb(x) denote the set of training covariates that fall in the same leaf as x, identified by a tree trained on an independent subsample of the data, indexed by b = 1, . . . , B. The GRF estimator is obtained by aggregating the individual estimating functions (2) across a forest of B independently trained trees, i.e. the solution to the following forest-averaged estimating equation: (ˆθ(x), ˆν(x)) arg min θ,ν i=1 αbi(x)ψθ,ν(Oi) (3) where αbi(x) := 1(Xi Lb(x)) |Lb(x)| . Define observational weights αi(x) that measure the relative frequency with which training sample Xi falls in the same leaf as x, averaged over B trees: b=1 αbi(x), (4) for i = 1, . . . , n. Then, the solution (ˆθ(x), ˆν(x)) to the forest-averaged model (3) is equivalent to solving the following locally weighted estimating equation (ˆθ(x), ˆν(x)) arg min θ,ν i=1 αi(x)ψθ,ν(Oi) Athey et al. (2019) present (5) as the definition of the GRF estimator, motivated in part by the mature analyses of local kernel methods (Newey, 1994) alongside more recent work on tree-based partitioning and estimating equations (Athey & Imbens, 2016; Zeileis & Hornik, 2007; Zeileis et al., 2008). The GRF algorithm for estimating θ (x) can be summarized as a two-stage procedure. 
Stage I: Use trees to calculate weight functions αi(x) for any test observation x X, measuring the relative importance of the i-th training sample to estimating θ ( ) near x. Stage II: Given a test observation x X, compute estimate ˆθ(x) of θ (x) by solving the locally weighted empirical estimating equation (5). Our contribution improves the computational cost of Stage I by introducing a more efficient procedure to train the trees. Training the forest is the most resource-intensive step of GRF, and the cost of each split in the existing approach scales quadratically with the dimension of θ (x). We adopt a gradient-free splitting mechanism and significantly reduce both the time and memory demands of Stage I. Crucially, solving Stage II with weights αi(x) following our streamlined Stage I produces an estimator ˆθ(x) that preserves the finite-sample performance and asymptotic guarantees of GRF. 3. Our Method In this section we describe the details of our accelerated algorithm for GRF. We closely follow the approach of Athey et al. (2019), and define ˆθ(x) as the solution to a locally weighted problem (5) with weighting functions αi(x) of the form (4). The weight functions are induced by a collection of local subsets {Lb(x)}B b=1, such that each subset Lb(x) is determined by the partition rules of a tree trained on a subsample. The construction of each tree, in turn, is determined by recursive splits of the subsample based on a splitting criterion designed to identify regions of X that are homogeneous with respect to θ (x). Therefore, to fully specify the weight functions αi(x), we must describe a feasible criterion for producing a split of X. 3.1. The target tree-splitting criterion for Stage I In GRF, the goal of Stage I is to use recursive tree-based splits of the training data to induce a partition over the input space. Each split starts with a parent node P X and results in child nodes C1, C2 X, defined by a binary, axisaligned splitting rule of the form C1 = {Xi : Xi,ℓ t} and C2 = {Xi : Xi,ℓ> t}, where ℓdenotes a candidate splitting feature/axis and t R the splitting threshold. For a parent P and any child nodes C1, C2 of P, let (ˆθP , ˆνP ) and (ˆθCj, ˆνCj) denote local solutions analogous to (2) defined Generalized Random Forests using Fixed-Point Trees over the samples in P and Cj, respectively: (ˆθP , ˆνP ) arg min θ,ν {i:Xi P } ψθ,ν(Oi) (ˆθCj, ˆνCj) arg min θ,ν {i:Xi Cj} ψθ,ν(Oi) for j = 1, 2. A strategy to split P into two subsets of greater homogeneity with respect to θ ( ) is as follows: Find child nodes C1 and C2 such that the total deviation between the local solutions ˆθCj and the target θ (X) is minimized, conditional on X Cj, j = 1, 2. A natural measure of deviation is the squared-error loss, err(C1, C2) := X j=1,2 P (X Cj | X P) E θ (X) ˆθCj 2 X Cj such that the resulting split (C1, C2) corresponds to least-squares optimal solutions ˆθC1 and ˆθC2. However, err(C1, C2) is intractable since θ ( ) is unknown. GRF considers a criterion that measures heterogeneity across a pair of local solutions over a candidate split (C1, C2) := n C1n C2 ˆθC1 ˆθC2 2 , (8) where n C1, n C2, and n P denote the number of observations in C1, C2, and P, respectively. In particular, rather than minimizing err(C1, C2), one can seek a split of P such that the cross-split heterogeneity between ˆθC1 and ˆθC2 is maximized. Athey et al. 
(2019) observe that err(C1, C2) and Δ(C1, C2) are coupled according to

err(C_1, C_2) = K(P) - \mathbb{E}[\Delta(C_1, C_2)] + o(r^2),

where r > 0 is a small radius term tied to the sampling variance, and K(P) does not depend on the split of P. That is, splits that maximize Δ(C1, C2), which emphasize the heterogeneity of the child solutions across a split, will asymptotically minimize err(C1, C2), which aims to improve the homogeneity of the child solutions within a split. Although the criterion Δ(C1, C2) is computable, evaluating it is very computationally expensive since it requires solving (7) to obtain ˆθ_C1, ˆθ_C2 for all possible splits of P, and closed-form solutions for ˆθ_Cj are generally not available except in special cases of ψ. Instead, GRF approximates the target Δ-criterion with a criterion of the form

\widetilde{\Delta}_{\mathrm{grad}}(C_1, C_2) := \frac{n_{C_1} n_{C_2}}{n_P^2} \left\| \tilde{\theta}^{\mathrm{grad}}_{C_1} - \tilde{\theta}^{\mathrm{grad}}_{C_2} \right\|^2, \quad (9)

where θgrad_Cj denotes a gradient-based approximation of ˆθ_Cj. Specifically, θgrad_Cj is a first-order approximation interpreted as the result of taking a gradient step away from the parent estimate in the direction of the true child solution ˆθ_Cj:

\tilde{\theta}^{\mathrm{grad}}_{C_j} := \hat{\theta}_P - \frac{1}{n_{C_j}} \sum_{\{i : X_i \in C_j\}} \xi^\top A_P^{-1} \psi_{\hat{\theta}_P, \hat{\nu}_P}(O_i), \quad (10)

where (ˆθ_P, ˆν_P) is the local solution over the parent, A_P is any consistent estimator of the local Jacobian matrix \nabla_{(\theta,\nu)} \mathbb{E}[\psi_{\hat{\theta}_P, \hat{\nu}_P}(O_i) \mid X_i \in P], and ξ can be thought of as a matrix that selects the θ-subvector from a (θ, ν)-vector; e.g., if θ ∈ R^K and ν ∈ R, then the ξ satisfying θ = ξ⊤(θ, ν) is the rectangular diagonal matrix with ξ⊤ = [I_K 0]. When the scoring function ψ is continuously differentiable in (θ, ν), the Jacobian estimator A_P can be computed as

A_P = \nabla_{(\theta,\nu)} \left\{ \frac{1}{n_P} \sum_{\{i : X_i \in P\}} \psi_{\hat{\theta}_P, \hat{\nu}_P}(O_i) \right\} = \frac{1}{n_P} \sum_{\{i : X_i \in P\}} \nabla_{(\theta,\nu)} \psi_{\hat{\theta}_P, \hat{\nu}_P}(O_i). \quad (11)

3.2. Limitations of gradient-based approximation

The use of the Jacobian estimator A_P in (10) introduces considerable computational challenges. First, each parent node P in every tree of the forest requires a distinct A_P matrix, which imposes a significant computational burden when explicitly calculating A_P^{-1} ψ_{ˆθ_P, ˆν_P}(O_i) to determine θgrad_Cj. Second, if the local Jacobian \nabla_{(\theta,\nu)} \mathbb{E}[\psi_{\hat{\theta}_P, \hat{\nu}_P}(O_i) \mid X_i \in P] is ill-conditioned, then the resulting A_P estimator may be nearly singular. This instability can lead to highly variable gradient-based approximations θgrad_Cj and highly variable splits of P. For example, consider the following varying-coefficient model for an outcome Yi given regressors Wi = (Wi,1, . . . , Wi,K) in the presence of mediating auxiliary covariates Xi:

\mathbb{E}[Y_i \mid X_i = x] = \nu^*(x) + W_i^\top \theta^*(x), \quad (12)

where ν*(·) is a nuisance intercept function and θ*(x) = (θ*1(x), . . . , θ*K(x)) are the target coefficients. Models of the form (12) encompass time- or spatially-varying coefficient frameworks, where (Xi, Yi, Wi) represents the i-th sample associated with spatiotemporal values Xi. Such models are particularly relevant in applications like heterogeneous treatment effects; see Section 5 for a more in-depth discussion. The local estimating function ψθ,ν(Yi, Wi), identifying (θ*(x), ν*(x)) through moment conditions as in (1), is given by:

\psi_{\theta,\nu}(Y_i, W_i) := \begin{pmatrix} (Y_i - W_i^\top \theta - \nu)\, W_i \\ Y_i - W_i^\top \theta - \nu \end{pmatrix}.

Figure 1: Split values (top) and split variance (bottom), with 10th and 90th percentile bands, across correlations of Wi,1 and Wi,2.
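To make the role of A_P concrete, the following minimal numpy sketch (function and variable names are ours, not the grf implementation) computes gradient-style pseudo-outcomes ξ⊤A_P^{-1}ψ_{ˆθ_P,ˆν_P}(O_i) for the varying-coefficient score over a single parent node and reports the condition number of A_P. Pushing the regressor correlation of the toy model (14) toward 1 inflates cond(A_P), which is the ill-conditioning behind the unstable splits in Figure 1.

```python
import numpy as np

def grad_pseudo_outcomes(W, Y):
    """Gradient-based pseudo-outcomes rho_i = A_P^{-1} psi_i for the
    varying-coefficient score, computed over one parent node P.

    W : (n_P, K) regressors, Y : (n_P,) outcomes.
    Returns (rho, cond_A): the (n_P, K) pseudo-outcomes and cond(A_P).
    """
    n_P = len(Y)
    Wc = W - W.mean(axis=0)                      # centering removes the nuisance intercept
    Yc = Y - Y.mean()
    A_P = Wc.T @ Wc / n_P                        # Jacobian estimator over the parent
    theta_P = np.linalg.solve(A_P, Wc.T @ Yc / n_P)   # parent OLS solution
    resid = Yc - Wc @ theta_P                    # Y_i - Ybar - (W_i - Wbar)' theta_P
    psi = Wc * resid[:, None]                    # theta-block of the parent scores
    rho = np.linalg.solve(A_P, psi.T).T          # rho_i = A_P^{-1} psi_i
    return rho, np.linalg.cond(A_P)

# Illustration with the correlated regressors of the toy model (14):
# as corr -> 1, cond(A_P) grows and the pseudo-outcomes become unstable.
rng = np.random.default_rng(0)
n, corr = 1000, 0.98
W = rng.multivariate_normal(np.zeros(2), [[1.0, corr], [corr, 1.0]], size=n)
X = rng.uniform(size=n)
Y = (X > 0.5) * W[:, 0] + W[:, 1] + rng.normal(size=n)
rho, cond_A = grad_pseudo_outcomes(W, Y)
print(f"cond(A_P) = {cond_A:.1f}")
```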
Consequently, the corresponding local Jacobian estimator is {i:Xi P } (θ,ν)ψθ,ν(Yi, Wi) Wi W i W i Wi 1 When the regressors are highly correlated, the summation over the Wi W i block of the AP matrix leads to nearly singular values of AP , resulting in an unstable matrix inverse A 1 P , and therefore unstable values of θgrad Cj and unstable splits. This issue becomes more pronounced as the number of parent samples n P decreases, as is the case at deeper levels of the tree. These challenges highlight the limitations of relying on AP as part of an approximation for the child solutions ˆθCj. As an illustration, consider a simple varying coefficient model with primary regressors Wi,1, Wi,2 N(0, 1), auxiliary covariates Xi Unif(0, 1), and outcomes Yi generated as Yi = 1(Xi > 0.5)Wi,1 + Wi,2 + ϵi, (14) where ϵi N(0, 1). Figure 1 illustrates the distribution of 2000 e grad-optimal binary splits (gradient-based tree stumps) fit over 1000 samples of the varying coefficient model (14), repeated over different regressor correlation levels Corr(Wi,1, Wi,2) {0.80, 0.81, . . . , 0.98, 0.99}. It is clear that splits based on the e grad-criterion exhibit high variability when the correlation between the regressors is large. In contrast, our proposed method, discussed in the next section, does not suffer from the same problem. 3.3. Fixed-point approximation To address the limitations of gradient-based approximations, we propose a gradient-free approach based on the form of a single fixed-point iteration. Let ΨCj(θ, ν) := 1 n Cj P {i:Xi Cj} ψθ,ν(Oi) denote the empirical estimating function for the child solution (ˆθCj, ˆνCj) such that (7) is equivalently written as: (ˆθCj, ˆνCj) arg min θ,ν ΨCj(θ, ν) , j = 1, 2. (15) Under mild regularity conditions, (ˆθCj, ˆνCj) is a Zestimator that solves the estimating equation ΨCj(θ, ν) = 0. Reformulating this equation as a fixed-point problem, we write: (θ, ν) = (θ, ν) ηΨCj(θ, ν) | {z } =:f(θ,ν) , η > 0. (16) A necessary and sufficient condition for (ˆθCj, ˆνCj) to be a solution of (15) is characterized by the fixed-point problem (ˆθCj, ˆνCj) = f(ˆθCj, ˆνCj), where f is as defined in (16). Iterative fixed-point methods (Picard, 1890; Lindel of, 1894; Banach, 1922; Ryu & Boyd, 2016; Yang et al., 2021) solve such problems by considering an update rule of the form (θ+, ν+) f(θ, ν). (17) The form of (17) inspires us to approximate the true child solution ˆθCj using a single fixed-point update taken from the parent solution ˆθP : Cj :=ˆθP ηξ ΨCj(ˆθP , ˆνP ) =ˆθP η n Cj ξ X {i:Xi Cj} ψˆθP ,ˆνP (Oi), (18) where the product with ξ is interpreted similarly to its role in the gradient-based approximation (10) and to express the update (17) solely in terms of the target θ-quantity. We interpret θFPT Cj as an approximation of ˆθCj obtained by taking a step from ˆθP in a direction that reduces the magnitude of the local estimating function ΨCj. Notably, the approximation θFPT Cj does not involve the AP matrix, relying only on the scores ψˆθP ,ˆνP (Oi) evaluated at the parent solutions. In general, removing the inverse A 1 P provides computational cost savings of O(K3). The corresponding splitting criterion, which uses the fixed-point approximations θFPT Cj as substitutes for ˆθCj is given by e FPT(C1, C2) := n C1n C2 C2 2 . (19) Revisiting the varying coefficient example from Section 3.2, we see that splits based on fixed-point approximations θFPT Generalized Random Forests using Fixed-Point Trees are significantly more stable than those based on θgrad Cj . 
Specifically, Figure 1 illustrates that splits that maximize e FPT(C1, C2) are more robust to ill-conditioning in the underlying local Jacobian (θ,ν)E[ψˆθP ,ˆνP (Oi) | Xi P], as is the case for highly correlated regressors in the varying coefficient model (14), and leading to highly stable splits. 3.4. Pseudo-outcomes Approximations θCj of the form (10) and (18) offer an additional benefit: they enable the e -criteria of the form (9) and (19) to be efficiently optimized through a single multivariate CART split. A CART split performed with respect to vectorvalued responses ρi RK over a parent node P produces a split (C1, C2) that minimizes the following least-squares criterion: X {i : Xi C1} ρi ρC1 2 + X {i : Xi C2} ρi ρC2 2 , (20) where ρCj := 1 n Cj P {i:Xi Cj} ρi.1 Equivalently, a CART split that minimizes (20) will maximize: n C1 ρC1 2 + n C2 ρC2 2 . (21) The equivalence between the split that minimizes the leastsquares CART criterion (20) and the split that maximizes (21) is shown in Appendix B.1.1. GRF performs its splits by adopting gradient-based pseudo-outcomes, defined as ρgrad i := ξ A 1 P ψˆθP ,ˆνP (Oi) (22) such that the gradient-based approximation θgrad Cj in (10) is equivalently written: θgrad Cj = ˆθP + 1 n Cj {i:Xi Cj} ρgrad i = ˆθP + ρ grad Cj . In the case of fixed-point approximation, we define fixedpoint pseudo-outcomes: i := ηξ ψˆθP ,ˆνP (Oi), η = 0, (23) such that the fixed-point approximation θFPT Cj in (18) is equivalently written as Cj = ˆθP + 1 n Cj {i:Xi Cj} ρFPT i = ˆθP + ρ FPT Substitute the above form of θFPT Cj into the e FPT-criterion (19) to equivalently express the criterion in terms of the FPT pseudo-outcomes: e FPT(C1, C2) = n C1n C2 C2 2 , (25) 1The multivariate CART criterion uses a sum of squares impurity measure, as in De ath (2002); Segal (1992). where an analogous equivalence holds for e grad in terms of the gradient-based pseudo-outcomes. We demonstrate in Lemma B.1 (in Appendix B.1.2) that maximizing the fixedpoint criterion e FPT(C1, C2) is equivalent to maximizing the CART criterion (21), and extend this property to any e -style criterion induced by pseudo-outcomes that can be expressed as a split-independent linear transformation of the parent scores ψˆθP ,ˆνP (Oi). Note that our method does not rely on iterative fixed-point procedures at all. Instead, it uses only a single step of fixed-point approximation to simplify the pseudo-outcomes. These simplified pseudo-outcomes are then passed directly to a standard CART algorithm for splitting. The numerical convergence of our method therefore relies solely on CART s established and well-known stability, not on fixedpoint iteration. CART splits on pseudo-outcomes are computationally efficient. Given a parent node P, the value ρi = BψˆθP ,ˆνP (Oi) does not depend on a candidate split (C1, C2) for any matrix B that is fixed with respect to the parent. This allows much of the computation required to maximize e FPT(C1, C2) to be done at the parent level, and in particular avoids re-calculating the approximations θFPT C1 and θFPT C2 across the sequence of candidate splits. Once P is fixed and ρFPT i are computed, the value of e FPT(C1, C2) for the first candidate split requires O(n P ) time, and the value for all other candidate splits of P are queried in O(1) time. While gradient-based pseudo-outcomes share this property, the use of fixed-point pseudo-outcomes eliminates the computational overhead and instability associated with estimating AP , as discussed in Section 3.2. 
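As a concrete sketch of the two ingredients just described, a minimal numpy illustration follows (helper names are ours, not the grf package). For the varying-coefficient score of Section 3.2, the fixed-point pseudo-outcomes reduce to centered parent scores, and the CART split on them can be scanned with cumulative sums so that, after sorting each candidate feature, every additional threshold is an O(1)-in-n_P update of criterion (21).

```python
import numpy as np

def fpt_pseudo_outcomes(W, Y):
    """Fixed-point pseudo-outcomes over a parent node (eta = 1) for the
    varying-coefficient score: the parent-score theta-block, no Jacobian inverse."""
    Wc = W - W.mean(axis=0)
    Yc = Y - Y.mean()
    theta_P, *_ = np.linalg.lstsq(Wc, Yc, rcond=None)   # parent solution theta_hat_P
    resid = Yc - Wc @ theta_P
    return Wc * resid[:, None]                           # rho_i^FPT

def best_cart_split(X, rho, min_child=5):
    """Single CART split on vector-valued pseudo-outcomes rho, maximizing
    n_C1*||mean_C1||^2 + n_C2*||mean_C2||^2, i.e. criterion (21).
    Returns (criterion, feature, threshold); feature is None if no valid split."""
    n, p = X.shape
    total = rho.sum(axis=0)
    best_crit, best_feat, best_thr = -np.inf, None, None
    for j in range(p):
        order = np.argsort(X[:, j])
        csum = np.cumsum(rho[order], axis=0)             # running sums: one O(n_P) pass
        xs = X[order, j]
        for k in range(min_child, n - min_child + 1):    # left child = first k sorted samples
            if xs[k - 1] == xs[k]:
                continue                                 # tied values: not a valid threshold
            left, right = csum[k - 1], total - csum[k - 1]
            crit = left @ left / k + right @ right / (n - k)   # constant-time update per threshold
            if crit > best_crit:
                best_crit, best_feat, best_thr = crit, j, 0.5 * (xs[k - 1] + xs[k])
    return best_crit, best_feat, best_thr
```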
We show in Lemma B.2 (Appendix B.1.3) that choosing different values of η does not change the outcome of the fixed-point splitting mechanism. Specifically, the optimal split identified by CART on pseudo-outcomes ρFPT i of the form (23) does not depend on η. This can be heuristically understood by studying how the criterion changes as a function of the candidate splits. To illustrate, we consider a VCM model of the form (12) for bivariate regressors Wi, univariate Xi [0, 1], and scalar outcomes Yi. A detailed summary of the settings is found in Appendix D.1. The sequence of valid candidate child nodes obtained by a split over univariate Xi can be parameterized through scalar t as C1(t) := {Xi : Xi t} and C2(t) := {Xi : Xi > t}. Let (t) := (C1(t), C2(t)) denote the parameterized target criterion (8), and consider the behavior of (t), e grad(t), and two fixed-point criteria e FPT 1 (t) and e FPT 2 (t) of the form (25) based on pseudooutcomes with scale factors η = 1 and η = 1/ 2, respectively. Figure 2 illustrates the different splitting criteria values plotted against the sequence of candidate splits. The visualization clearly shows that the criteria curves for (t), e grad(t), and e FPT 1 (t) with η = 1 are all very close to one Generalized Random Forests using Fixed-Point Trees 0.00 0.25 0.50 0.75 1.00 Threshold t Criterion value 1 FPT η = 1 2 FPT η = 1 2 C 1 = {Xi : Xi t} and C 2 = {Xi : Xi > t} Criterion values (C 1,C 2) over candidate splits {C 1,C 2} Figure 2: Criterion values across candidate splits (C1(t), C2(t)) over threshold t [0, 1]. The location of the optimal split under each criterion is given by the corresponding vertical line. another. Critically, the fixed-point criterion with η = 1/ 2, i.e. e FPT 2 (t), although scaled differently, still identifies the same maximizing split as e FPT 1 (t). This is because CART chooses a split based on a rank ordering of the criterion over all candidate splits. The absolute scale of the CART criterion does not matter, and it is only criterion rankings over the candidates that determines the optimal split. Therefore, choosing a different scalar η does not change the outcome of the splitting process. Based on the scale-invariance of our splitting criterion, we now detail the recursive procedure for growing our fixedpoint trees pseudo-outcomes with η = 1. The fixed-point tree algorithm. The entire fixed-point tree-growing procedure recursively applies the following two steps on a given parent node P: (i) Labeling: Solve (6) over P to obtain the parent estimate (ˆθP , ˆνP ). Compute the pseudo-outcomes: i := ξ ψˆθP ,ˆνP (Oi), (26) for all i such that Xi P. (ii) Regression: Maximize e FPT(C1, C2) by performing a CART split on the pseudo-outcomes ρFPT 3.5. Estimates of ˆθ(x) for Stage II The fixed-point tree algorithm generates a single tree-based partition of X. Repeating this process over subsamples of the training data yields a forest of trees, each specifying local leaf functions Lb(x). These leaf functions define the local weight functions αi(x) via (4), completing Stage I of GRF. The full fixed-point tree training algorithm is described in Algorithm 1, while Algorithm 2 provides the pseudocode for the forest-wide Stage I procedure. To compute the final GRF estimates ˆθ(x) for the target θ (x), we follow the standard GRF mechanism for Stage II. After the fixed-point trees are trained in Stage I, a test observation x0 X is assigned to local leaves Lb(x0), indexed by trees b {1, . . . , B}. 
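Putting Sections 3.4 and 3.5 together, here is a minimal end-to-end sketch under simplifying assumptions (hypothetical names; it reuses the fpt_pseudo_outcomes and best_cart_split helpers sketched above, omits honest sample splitting, and is not the grf implementation). It grows one fixed-point tree by alternating the labeling and regression steps, then forms the Stage II estimate from the forest weights (4) and the weighted solve (5) for the VCM score.

```python
import numpy as np

def grow_fpt_tree(X, W, Y, idx=None, min_node=20, depth=0, max_depth=8):
    """One fixed-point tree: (i) label the node samples with rho_i^FPT,
    (ii) CART-split the labels; recurse until the node is small or no split exists."""
    if idx is None:
        idx = np.arange(len(Y))
    if len(idx) < 2 * min_node or depth == max_depth:
        return {"samples": idx}                                   # leaf node
    rho = fpt_pseudo_outcomes(W[idx], Y[idx])                     # labeling step
    _, feat, thr = best_cart_split(X[idx], rho, min_child=min_node)
    if feat is None:
        return {"samples": idx}
    go_left = X[idx, feat] <= thr
    return {"feature": feat, "threshold": thr,
            "left": grow_fpt_tree(X, W, Y, idx[go_left], min_node, depth + 1, max_depth),
            "right": grow_fpt_tree(X, W, Y, idx[~go_left], min_node, depth + 1, max_depth)}

def leaf_samples(tree, x0):
    """Training indices falling in the same leaf as test point x0 (the set L_b(x0))."""
    while "feature" in tree:
        tree = tree["left"] if x0[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["samples"]

def grf_fpt_estimate(trees, W, Y, x0):
    """Stage II: forest weights alpha_i(x0) as in (4), then the weighted VCM solve of (5)."""
    n, B = len(Y), len(trees)
    alpha = np.zeros(n)
    for tree in trees:
        members = leaf_samples(tree, x0)
        alpha[members] += 1.0 / (len(members) * B)                # average of alpha_{bi}(x0)
    Z = np.column_stack([W, np.ones(n)])                          # regressors plus intercept
    coef = np.linalg.solve(Z.T @ (alpha[:, None] * Z), Z.T @ (alpha * Y))
    return coef[:-1]                                              # theta-block; last entry is nu_hat(x0)

# Usage sketch: B trees on half-subsamples drawn without replacement (Specification 4),
# then an estimate at a test point x0.
# rng = np.random.default_rng(1)
# trees = [grow_fpt_tree(X, W, Y, idx=rng.choice(len(Y), len(Y) // 2, replace=False))
#          for _ in range(50)]
# theta_hat_x0 = grf_fpt_estimate(trees, W, Y, x0)
```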
Each leaf Lb(x0) contains the training observations that fall into the same leaf as x0 in tree b. Using these local leaves, the forest computes training weights αi(x0) as in (4). The final estimate ˆθ(x0) is obtained by solving the locally weighted estimating equation (5). Importantly, as discussed in Section 2, solving for ˆθ(x0) in Stage II is independent of the specific mechanism used in Stage I. The only requirement is that Stage I produces valid weights. This ensures that Stage II remains a standard weighted estimating equation, enabling the fixed-point tree algorithm to integrate seamlessly into GRF s two-stage framework. We refer to the complete algorithm for estimating θ (x) using fixed-point trees as GRF-FPT. By preserving Stage II of GRF, the GRF-FPT estimator ˆθ(x) retains GRF s theoretical guarantees of consistency and asymptotic normality while offering a computationally efficient tree-building method. Pseudocode for Stage II of the GRF-FPT algorithm is provided in Algorithm 3, located in Appendix C.3. 4. Theoretical Analysis In this section, we provide a theoretical foundation for the GRF-FPT estimator ˆθ(x). For Stage I, Proposition 4.1 establishes an asymptotic equivalence between the FPT criterion and a weighted oracle criterion V (C1, C2) in (27), while Lemma 4.2 demonstrates that the Specifications A.2 are met by a forest based on the V -criterion whenever they are met by a forest based on the -criterion. Assumptions A.1 and Specifications A.2 are the sufficient conditions for the consistency and asymptotic normality of ˆθ(x) in (5), and thus are used to formally justify the FPT algorithm as a mechanism for specifying an estimator of θ (x). Proposition 4.1. Suppose Assumptions A.1 hold, and assume moreover Neyman orthogonal moment conditions (defined in Appendix A.4). Denote by r := sup{i:Xi P } Xi x P the radius of the parent P, where x P denotes the center of mass over Xi P. Let Vθθ(x P ) denote the θ-block of V (x P ) in (37). Denote by V the weighted Euclidean norm z V := Vθθ(x P )z 2 = q z V θθ(x P )Vθθ(x P )z. Define the weighted oracle criterion V (C1, C2): V (C1, C2) := n C1n C2 ˆθC1 ˆθC2 2 Then, treating the split as fixed with r 2 n C1, n C2 and Generalized Random Forests using Fixed-Point Trees sufficiently small r > 0, e FPT(C1, C2) = V (C1, C2) + o P r2, 1 n C1 , 1 n C2 Lemma 4.2. Let T ( ) denote a tree whose splitting mechanism seeks splits that maximize (C1, C2) defined in (8), and let T ( V ) denote a tree whose splitting mechanism seeks splits that maximize V (C1, C2) defined in (27). Suppose Assumptions A.1 hold and assume moreover that T ( ) is a tree that satisfies Specifications A.2. Then, T ( V ) satisfies Specifications A.2. For Stage II, Theorem 4.3 establishes the consistency of the GRF-FPT estimator ˆθ(x): Theorem 4.3. Suppose that Assumptions A.1 hold, and let (ˆθ(x), ˆν(x)) be estimates that solve (5) based on weights induced by a forest of trees grown under the fixed-point tree algorithm satisfying Specifications A.2. Then, (ˆθ(x), ˆν(x)) converges in probability to (θ (x), ν (x)). The proof of Theorem 4.3 follows directly from Theorem 3 of Athey et al. (2019), which, under Assumptions A.1, establishes consistency for estimates (ˆθ(x), ˆν(x)) that solve (5) with weights from a forest that satisfies Specifications 1-5. Thanks to Lemma 4.2, these forest specifications must also apply to a forest grown under the FPT mechanism. Specifications 1-3 collectively impose mild boundary conditions on the splitting procedure. 
Meanwhile, Specification 4 requires that trees are trained on subsamples drawn without replacement (Biau et al., 2008; Scornet et al., 2015; Wager et al., 2014; Wager & Athey, 2018), and Specification 5 requires that trees must be grown using an additional subsample splitting mechanism known as honesty (Athey & Imbens, 2016; Biau, 2012; Denil et al., 2014). Appendix C.1 provides a detailed explanation of the subsampling and honest sample splitting procedure. Finally, Theorem 4.4 establishes the asymptotic normality of the GRF-FPT estimator ˆθ(x): Theorem 4.4. Under the conditions of Theorem 4.3, suppose moreover that Regularity Condition 1 holds, and that a forest is grown on subsamples of size s scaling as s = nβ, where β satisfies Regularity Condition 2. Then, there exists a sequence σn(x) such that (ˆθn(x) θ (x))/σn(x) N(0, 1) and σ2 n(x) = polylog(n/s) 1s/n, where polylog(n/s) is a function that is bounded away from 0 and increases at most polynomially with the log of the inverse sampling ratio log(n/s). The proof of Theorem 4.4 is an immediate consequence of Theorem 5 of Athey et al. (2019). Theorems 4.3 and 4.4 demonstrate that the GRF-FPT estimator is able to meet key statistical guarantees. 5. Applications In this section, we explore applications of GRF-FPT for two related models: varying coefficient models and heterogeneous treatment effects. We consider an outcome model of the form introduced in Section 3.2. For each observation, let Yi denote the observed outcome, Wi = (Wi,1, . . . , Wi,K) a K-dimensional regressor, and Xi a set of mediating auxiliary variables, such that Yi = ν (Xi) + W i θ (Xi) + ϵi, (28) where ν ( ) is a nuisance intercept function, θ (x) = (θ 1(x), . . . , θ K(x)) are the target effect functions local to Xi = x, under the assumptions E[ϵi | Xi = x] = 0 and E[ϵi Wi | Xi = x] = 0. Varying coefficient models (VCM). Given regressors Wi RK, models of the form (28) can be characterized as varying coefficient models (Hastie & Tibshirani, 1993). As discussed in Section 3.2, we must also assume that the regressors Wi are conditionally exogenous given Xi = x. Heterogeneous treatment effects (HTE). A special case of (28) arises within the Neyman-Rubin potential outcome framework, which models the causal effect of treatment on an outcome (Neyman, 1923; Rubin, 1974). Here, θ (x) = (θ 1(x), . . . , θ K(x)) represents heterogeneous treatment effects associated with K discrete treatment levels. Let Ti {1, . . . , K} denote the observed treatment level for the i-th observation, and Yi(k) the potential outcome that would have been observed if treatment level k had been applied. The regressors Wi {0, 1}K in (28) are interpreted as a vector of dummy variables indicating the observed treatment level, Wi,k := 1(Ti = k). The auxiliary variables Xi account for potential confounding effects. The conditional average treatment effect of treatment level k {2, . . . , K} relative to the baseline level k = 1 is then defined as: θ k(x) := E [Yi(k) Yi(1) | Xi = x] , where the baseline contrast is set to θ 1(x) := 0. Under exogeneity of the regressors, the target effects θ (x) in models (28) are identified by moment conditions (1) for scoring function (Angrist & Pischke, 2009; Athey et al., 2019) ψθ,ν(Yi, Wi) := (Yi W i θ ν) Wi Yi W i θ ν The gradient-based pseudo-outcomes (22) are computed as ρgrad i = A 1 P (Wi W P ) Yi Y P (Wi W P ) ˆθP , (29) where W P and Y P are the local means of Wi and Yi over the observations in P. 
Centering Yi Y P and Wi W P Generalized Random Forests using Fixed-Point Trees removes the baseline effect of the mean ˆνP on ρgrad i , and where AP is given by (13) as: {i:Xi P } (Wi W P )(Wi W P ) . (30) Computing ρgrad i in (29) involves the OLS coefficients ˆθP from regressing Yi Y P on Wi W P , over the observations in P: ˆθP := A 1 P 1 n P {i:Xi P } (Wi W P )(Yi Y P ). (31) In comparison, ρFPT i in (26) are computed as: i := ξ ψˆθP ,ˆνP (Yi, Wi), = (Wi W P ) Yi Y P (Wi W P ) ˆθP , The relationship ρgrad i = A 1 P ρFPT i reveals a significant benefit of FPT pseudo-outcomes. The form of ρFPT i eliminates the computational cost associated with the multiplication of A 1 P , leading to O(K3) computational savings. Furthermore, the computation of ˆθP in (32) no longer requires solving for A 1 P . Therefore, we can further enhance computational efficiency by using an accelerated form of pseudo-outcome ϕFPT i instead of ρFPT i := (Wi W P ) Yi Y P (Wi W P ) θP , (33) where ˆθP is replaced by θP in (32), which is defined as a one-step gradient descent approximation of ˆθP taken from the origin: {i:Xi P } (Wi W P )(Yi Y P ). (34) Here, γ denotes the exact line search step size for the regression of Yi Y P on Wi W P over P: (W W P ) (Y Y P ) 2 2 (W W P )(W W P ) (Y Y P ) 2 2 , (35) where W = [W1 Wn P ] and Y = [Y1 Yn P ] with the notation W W P and Y Y P understood as row-wise centering. The computational cost associated with θP is comparatively small because many of the products that appear in (34) and (35) are already computed as part of ρFPT i in (32). Meanwhile, we show in Appendix B.3 that the approximation for the FPT child estimator: Cj := ˆθP + 1 n Cj {i:Xi Cj} ϕFPT is consistent for the original FPT child estimator θFPT Cj = o P (1), meaning that this approximation does not alter the asymptotic behavior of our estimator. These accelerations are particularly compelling when the dimension of θ (x) is large and computational efficiency is critical, as in large-scale A/B testing with multiple concurrent treatment arms or observational studies with numerous treatment levels (Kohavi et al., 2013; Bakshy et al., 2014). 6. Simulations In this section, we perform empirical evaluations of the computational efficiency and estimation accuracy of the GRF-FPT method. We let GRF-FPT1 denote the FPT algorithm using the exact form of the FPT VCM/HTE pseudooutcomes (32) and we let GRF-FPT2 denote the accelerated FPT algorithm based on the form of the FPT pseudooutcome approximation (33) in Section 5. We compare both implementations relative to GRF-grad under VCM and HTE designs. Implementation details and links to the reproducible code are found in Appendix C.4. Settings. We follow the structural model in (28). The auxiliary variables Xi are drawn from the Gaussian copula with latent covariance matrix Σ, where [Σ]j,k = (0.3)|j k|. Supporting experiments for multicollinearity in Xi can be found in Appendix D.2. The outcomes Yi follow (28) with Gaussian noise ϵi N(0, 1). For VCM experiments, regressors Wi RK are sampled from NK(0, I). For HTE experiments, Wi {0, 1}K follows a multinomial distribution, Wi | Xi = x Multinomial(1, (π1(x), . . . , πK(x))), where πk(x) is the probability of treatment level k {1, . . . , K}, characterizing a variety of different locationspecific dependence structures through the setting of πk( ). We set ν (x) := 0 and vary the target effect functions θ k(x) and treatment probabilities πk(x) across different settings, fully detailed in Appendix C.4. 
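For concreteness, below is a minimal sketch of one simulated VCM draw under the settings above: Gaussian-copula auxiliary covariates with latent covariance 0.3^{|j-k|}, regressors W ~ N_K(0, I), ν*(x) = 0, and unit Gaussian noise. The single step-function effect θ*_1(x) = 1(x_1 > 0.5) used here is illustrative only; the actual effect functions and settings are those detailed in Appendix C.4.

```python
import numpy as np
from scipy.stats import norm

def simulate_vcm(n=10_000, p=5, K=4, seed=0):
    """One simulated VCM dataset in the spirit of Section 6's settings."""
    rng = np.random.default_rng(seed)
    Sigma = 0.3 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    Z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    X = norm.cdf(Z)                                   # Gaussian copula: uniform marginals
    W = rng.standard_normal((n, K))                   # regressors W ~ N_K(0, I)
    theta = np.zeros((n, K))
    theta[:, 0] = (X[:, 0] > 0.5).astype(float)       # illustrative step effect; nu*(x) = 0
    Y = np.sum(W * theta, axis=1) + rng.standard_normal(n)
    return X, W, Y
```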
Throughout our experiments we use subsampling ratio s/n = 0.5. Supporting experiments under different subsample ratios are found in Appendix D.2.

Results. The relative computational advantage of forests trained under GRF-FPT is displayed in Figure 3, while Figure 5 (in Appendix D.3) summarizes the absolute fit times across the three methods. These data show that the FPT mechanism consistently offers a relative advantage, with speedups of up to 3.5× over the gradient-based approach at the largest dimension K = 256. Figure 3 also shows increasing gains with increasing K and provides an empirical measurement of the theoretical scaling benefits discussed in Section 5. Moreover, the absolute fit times in Figure 5 (in Appendix D.3) illustrate that our method consistently remains faster than GRF-grad, with no clear computational or algorithmic bottleneck as a function of either n or K.

Figure 3: Speedup factor for GRF-FPT in comparison to GRF-grad for VCM timing experiments (fit-time ratio GRF-grad/GRF-FPT over regressor dimension K ∈ {4, 16, 64, 256}; panels n = 10000, 20000, 100000; dim(X) = 5; forests of 100 trees).

Supporting experiments exploring the effects of sample sizes up to n = 500,000 are presented in Appendix D.2, while Figures 7 and 8 (in Appendix D.3) show that even when n is small, GRF-FPT still observes a noticeable gain relative to GRF-grad. Additional timing benchmarks for VCM experiments and all HTE experiments are discussed in Appendix D.3. To assess estimation accuracy, we evaluate the mean squared error (MSE) of ˆθ(x) across 50 replications of the model, testing on a separate set of 5,000 observations. Figure 6 in Appendix D.3 confirms that GRF-FPT matches the accuracy of GRF-grad, while significantly reducing computation time. Further comparisons for both VCM and HTE settings are provided in Appendix D.3.

7. Real Data Application

Data. In this section we apply GRF-FPT to the analysis of geographically-varying effects θ*(x) on housing prices. The data, first appearing in Kelley Pace & Barry (1997), contains 20,640 observations of housing prices taken from the 1990 California census.
Each observation corresponds to measurements aggregated over a small geographical census block, and contains measurements of 9 variables: median housing value, longitude, latitude, median housing age, total rooms, total bedrooms, population, households, and median income. We employ a VCM design of the form (28) where Yi denotes the housing value, Xi denotes the spatial coordinates, and Wi = (Wi,1, . . . , Wi,6) are the remaining six regressors. Details of the model and data transformations used for the California housing analysis are found in Appendix F.

Results. Table 7 summarizes the computational benefit of GRF-FPT applied to the California housing data. Figure 4 illustrates the six geographically-varying effect estimates under GRF-FPT2, with qualitatively similar results shown in Figure 16 for GRF-FPT1 and GRF-grad in Appendix F.

Figure 4: Geographically-varying GRF-FPT2 estimates ˆθ(x) of the effects on log median house value (panels: log population, log census-block bedrooms, log census-block rooms, log households, median housing age, log median income).

Figure 4 shows clearly the geographically-dependent relationship between different housing features and housing prices. In major urban centers such as LA, San Francisco, and Sacramento, housing prices tend to decrease with an increasing number of households, which may reflect overcrowding in densely populated areas. In contrast, rural regions show the opposite trend: prices rise slightly when rural areas have a larger number of housing units. This suggests that, in sparsely populated rural areas, a modest increase in households makes these places more attractive and livable. Median income, however, consistently shows a positive effect on prices across nearly all of California, while population size tends to show a negative effect, highlighting broader state-wide pressures on housing affordability.

8. Conclusion

Our results demonstrate that the FPT algorithm offers a substantial computational advantage over GRF-grad with comparable statistical accuracy, and they highlight GRF-FPT as a powerful method for multi-dimensional estimation, particularly when estimates of the target function must be learned from the data rather than observed directly. Future work may explore extensions to larger-scale problems and alternative estimation tasks, as in unsupervised learning and structured prediction. Our findings position GRF-FPT as a scalable and robust alternative for practitioners seeking efficient localized estimation.

Impact Statement

This paper presents work whose goal is to advance the field of machine learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

Acknowledgments

This work was supported by Natural Sciences and Engineering Research Council (NSERC) Discovery Grant (RGPIN2024-06780) and FRQNT Team Research Project Grant (FRQ-NT 327788).

References

Amit, Y. and Geman, D. Shape quantization and recognition with randomized trees. Neural Computation, 9(7):1545-1588, 1997.

Angrist, J. D. and Pischke, J.-S. Mostly harmless econometrics: An empiricist's companion. Princeton University Press, 2009.

Athey, S. and Imbens, G. Recursive partitioning for heterogeneous causal effects.
Proceedings of the National Academy of Sciences, 113(27):7353 7360, 2016. Athey, S., Tibshirani, J., and Wager, S. Generalized random forests. The Annals of Statistics, 47(2):1148 1178, 2019. doi: 10.1214/18-AOS1709. URL https:// doi.org/10.1214/18-AOS1709. Bakshy, E., Eckles, D., and Bernstein, M. S. Designing and deploying online field experiments. In Proceedings of the 23rd International Conference on World Wide Web, WWW 14, pp. 283 292, New York, NY, USA, 2014. Association for Computing Machinery. ISBN 9781450327442. doi: 10.1145/ 2566486.2567967. URL https://doi.org/10. 1145/2566486.2567967. Banach, S. Sur les op erations dans les ensembles abstraits et leur application aux equations int egrales. Fundamenta Mathematicae, 3:133 181, 1922. Belloni, A., Chernozhukov, V., Fernandez-Val, I., and Hansen, C. Program evaluation and causal inference with high-dimensional data. Econometrica, 85(1):233 298, 2017. Biau, G. Analysis of a random forests model. The Journal of Machine Learning Research, 13(1):1063 1095, 2012. Biau, G., Devroye, L., and Lugosi, G. Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research, 9(66):2015 2033, 2008. URL http://jmlr.org/papers/v9/biau08a. html. Breiman, L. Bagging predictors. Machine Learning, 24: 123 140, 1996. Breiman, L. Random forests. Machine Learning, 45:5 32, 2001. Breiman, L., Friedman, J., Olshen, R. A., and Stone, C. J. Classification and Regression Trees. CRC, 1984. ISBN 9780412048418. Cevid, D., Michel, L., N af, J., B uhlmann, P., and Meinshausen, N. Distributional random forests: Heterogeneity adjustment and multivariate distributional regression. Journal of Machine Learning Research, 23(333):1 79, 2022. URL http://jmlr.org/papers/v23/ 21-0585.html. Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1 C68, 01 2018. ISSN 1368-4221. doi: 10.1111/ectj.12097. URL https://doi.org/10.1111/ectj.12097. Cui, Y., Kosorok, M. R., Sverdrup, E., Wager, S., and Zhu, R. Estimating heterogeneous treatment effects with rightcensored data via causal survival forests. Journal of the Royal Statistical Society Series B: Statistical Methodology, 85(2):179 211, 02 2023. ISSN 1369-7412. doi: 10.1093/jrsssb/qkac001. URL https://doi.org/ 10.1093/jrsssb/qkac001. De ath, G. Multivariate regression trees: a new technique for modeling species environment relationships. Ecology, 83(4):1105 1117, 2002. Denil, M., Matheson, D., and De Freitas, N. Narrowing the gap: Random forests in theory and in practice. In Xing, E. P. and Jebara, T. (eds.), Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pp. 665 673, Bejing, China, 22 24 Jun 2014. PMLR. URL https://proceedings.mlr.press/v32/ denil14.html. Dietterich, T. G. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40:139 157, 2000. Fan, J. and Gijbels, I. Local Polynomial Modelling and Its Applications, volume 66 of Monographs on Statistics and Applied Probability. Chapman & Hall/CRC, London, 1996. doi: 10.1201/9780203748725. URL https://www.taylorfrancis.com/books/ mono/10.1201/9780203748725. Generalized Random Forests using Fixed-Point Trees Fan, J., Heckman, N. E., and Wand, M. P. 
Local polynomial kernel regression for generalized linear models and quasilikelihood functions. Journal of the American Statistical Association, 90(429):141 150, 1995. Friedberg, R., Tibshirani, J., Athey, S., and Wager, S. Local linear forests. Journal of Computational and Graphical Statistics, 30(2):503 517, 2020. Friedman, J. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pp. 1189 1232, 2001. Hastie, T. and Tibshirani, R. Varying-coefficient models. Journal of the Royal Statistical Society. Series B (Methodological), 55(4):757 796, 1993. ISSN 00359246. URL http://www.jstor.org/stable/2345993. Imai, K. and Ratkovic, M. Estimating treatment effect heterogeneity in randomized program evaluation. The Annals of Applied Statistics, 7(1):443 470, 2013. doi: 10.1214/12-AOAS593. URL https://doi.org/10. 1214/12-AOAS593. Johansson, F., Shalit, U., and Sontag, D. Learning representations for counterfactual inference. In International Conference on Machine Learning, pp. 3020 3029. PMLR, 2016. Kelley Pace, R. and Barry, R. Sparse spatial autoregressions. Statistics & Probability Letters, 33(3):291 297, 1997. ISSN 0167-7152. doi: https://doi.org/10.1016/S0167-7152(96)00140-X. URL https://www.sciencedirect.com/ science/article/pii/S016771529600140X. Knaus, M. C., Lechner, M., and Strittmatter, A. Machine learning estimation of heterogeneous causal effects: Empirical monte carlo evidence. The Econometrics Journal, 24(1):134 161, 2021. Kohavi, R., Deng, A., Frasca, B., Walker, T., Xu, Y., and Pohlmann, N. Online controlled experiments at large scale. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 13, pp. 1168 1176, New York, NY, USA, 2013. Association for Computing Machinery. ISBN 9781450321747. doi: 10.1145/ 2487575.2488217. URL https://doi.org/10. 1145/2487575.2488217. K unzel, S. R., Sekhon, J. S., Bickel, P. J., and Yu, B. Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 116(10):4156 4165, 2019. Lee, Y., Veerubhotla, K., Jeong, M. H., and Lee, C. H. Deep learning in personalization of cardiovascular stents. Journal of Cardiovascular Pharmacology and Therapeutics, 25(2):110 120, 2020. Lewbel, A. A local generalized method of moments estimator. Economics Letters, 94(1):124 128, 2007. Lindel of, E. Sur l application de la m ethode des approximations successives aux equations diff erentielles ordinaires du premier ordre. Comptes Rendus Hebdomadaires des S eances de l Acad emie des Sciences, 116:454 457, 1894. Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R., and Yu, B. Definitions, methods, and applications in interpretable machine learning. Proceedings of the National Academy of Sciences, 116(44):22071 22080, 2019. Newey, W. K. Kernel estimation of partial means and a general variance estimator. Econometric Theory, 10(2): 1 21, 1994. Neyman, J. Sur les applications de la th eorie des probabilit es aux experiences agricoles: Essai des principes. Roczniki Nauk Rolniczych, 10(1):1 51, 1923. Reprinted and translated in Neyman, J. (1990). Statistical Science, 5(4), 463 480. Nie, X. and Wager, S. Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108(2):299 319, 2021. Picard, E. M emoire sur la th eorie des equations aux d eriv ees partielles et la m ethode des approximations successives. Journal de Math ematiques Pures et Appliqu ees, 6:145 210, 1890. Powers, S., Qian, J., Jung, K., Schuler, A., Shah, N. 
H., Hastie, T., and Tibshirani, R. Some methods for heterogeneous treatment effect estimation in high dimensions. Statistics in Medicine, 37(11):1767 1787, 2018. Robins, J. M. and Ritov, Y. Toward a curse of dimensionality appropriate (coda) asymptotic theory for semiparametric models. Statistics in Medicine, 16(1-3):285 319, 1997. doi: 10.1002/(SICI)1097-0258(19970215)16: 3 285::AID-SIM535 3.0.CO;2-\#. URL https:// pubmed.ncbi.nlm.nih.gov/9004398/. Robinson, P. M. Root-n-consistent semiparametric regression. Econometrica: Journal of the Econometric Society, pp. 931 954, 1988. Rubin, D. B. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688, 1974. Generalized Random Forests using Fixed-Point Trees Ryu, E. K. and Boyd, S. A primer on monotone operator methods (survey). Applied and Computational Mathematics, 15(1):3 43, 2016. Survey article. Scornet, E., Biau, G., and Vert, J.-P. Consistency of random forests. The Annals of Statistics, 43(4):1716 1741, 2015. doi: 10.1214/15-AOS1321. URL https:// doi.org/10.1214/15-AOS1321. Segal, M. R. Tree-structured methods for longitudinal data. Journal of the American Statistical Association, 87(418): 407 418, 1992. Severini, T. A. and Staniswalis, J. G. Quasi-likelihood estimation in semiparametric models. Journal of the American Statistical Association, 89(426):501 511, 1994. Shalit, U., Johansson, F. D., and Sontag, D. Estimating individual treatment effect: generalization bounds and algorithms. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 3076 3085. PMLR, 06 11 Aug 2017. URL https://proceedings.mlr.press/v70/ shalit17a.html. Speckman, P. Kernel smoothing in partial linear models. Journal of the Royal Statistical Society. Series B (Methodological), 50(3):413 436, 1988. ISSN 00359246. URL http://www.jstor.org/stable/2345705. Staniswalis, J. G. The kernel estimate of a regression function in likelihood-based models. Journal of the American Statistical Association, 84(405):276 283, 1989. Tibshirani, J., Athey, S., Sverdrup, E., and Wager, S. grf: Generalized Random Forests, 2024. URL https:// github.com/grf-labs/grf. R package version 2.4.0. Wager, S. and Athey, S. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228 1242, 2018. Wager, S. and Walther, G. Adaptive concentration of regression trees, with application to random forests. ar Xiv preprint ar Xiv:1503.06388, 2015. Wager, S., Hastie, T., and Efron, B. Confidence intervals for random forests: The jackknife and the infinitesimal jackknife. Journal of Machine Learning Research, 15 (1):1625 1651, 2014. URL https://jmlr.org/ papers/volume15/wager14a/wager14a.pdf. Yang, Y., Gu, Y., Zhao, Y., and Fan, J. Flexible regularized estimating equations: Some new perspectives. 2021. URL https://arxiv.org/abs/2110.11074. Zeileis, A. and Hornik, K. Generalized m-fluctuation tests for parameter instability. Statistica Neerlandica, 61(4): 488 508, 2007. Zeileis, A., Hothorn, T., and Hornik, K. Model-based recursive partitioning. Journal of Computational and Graphical Statistics, 17(2):492 514, 2008. Generalized Random Forests using Fixed-Point Trees A. Technical Preliminaries A.1. Assumptions We follow the key assumptions of Athey et al. (2019) made for the theoretical analyses of GRF. 
The predictor and parameter spaces are both subsets of Euclidean space such that x X = [0, 1]p and (θ, ν) B RK, where B is a compact subset of RK. Under the analyses of Wager & Walther (2015), we suppose that the features of the auxiliary covariates Xi = (Xi,1, . . . , Xi,p) have density f X that is bounded away from 0 and , i.e. c f X(x) C < , for some constants c > 0 and C < . GRF does not require that the score function ψ is continuous in (θ, ν), as is the case for quantile estimation, one does require that the expected score/moment function Mθ,ν(x) := EO|X [ψθ,ν(O) | X = x] , (36) is smoothly varying in its parameters (θ, ν). ASSUMPTION 1. For fixed (θ, ν), the M-function (36) is Lipschitz continuous in x. ASSUMPTION 2. For fixed x, the M-function is twice-differentiable in (θ, ν) with uniformly bounded second derivative, 2 (θ,ν)Mθ,ν(x) < , where denotes the appropriate tensor norm for the second derivative of Mθ,ν taken with respect to (θ, ν). Let V (x) := (θ,ν)Mθ,ν(x) θ=θ (x),ν=ν (x) denote the population Jacobian at the true (θ (x), ν (x)), and assume that V (x) is invertible for all x X. We write V (x) in block form as V (x) = Vθθ(x) Vθν(x) Vνθ(x) Vνν(x) ASSUMPTION 3. The score functions ψθ,ν(Oi) have a continuous covariance structure in the following sense: Let γ( , ) denote the worst-case variogram: Var O|X (ψθ1,ν1(Oi) ψθ2,ν2(Oi) | Xi = x) F , then, for some L > 0, 2 , for all (θ1, ν1), (θ2, ν2). ASSUMPTION 4. The score function ψθ,ν(Oi) can be written as ψθ,ν(Oi) = λ(θ, ν; Oi) + ζθ,ν(g(Oi)), where λ is Lipschitz-continuous in (θ, ν), g : {Oi} R a univariate summary of the observables Oi, and ζθ : R R any family of monotone and bounded functions. ASSUMPTION 5. For any weights αi with P αi = 1, the minimizer (ˆθ, ˆν) of the weighted empirical estimation problem (5) satisfies: i=1 αiψˆθ,ˆν(Oi) 2 C max 1 i n{αi}, for C 0. ASSUMPTION 6. The score function ψθ,ν(Oi) is a negative subgradient of a convex function, and the moment function Mθ,ν(Xi) is the negative gradient of a strongly convex function. Generalized Random Forests using Fixed-Point Trees A.2. Forest specifications The consistency and asymptotic normality results, Theorems 4.3 and 4.4, require that the forest trained following Algorithm 2 consists of trees that satisfy a certain set of specifications. These forest specifications are precisely those imposed by Athey et al. (2019) for forests of gradient-based trees, and collectively, these specifications describe fairly mild conditions on the tree splitting mechanism, as well as specific requirements for the sampling procedure. SPECIFICATION 1. (Symmetric) Tree estimates are invariant to permutations of the training indices. In other words, the output of a tree does not depend on the order in which the training samples are indexed. SPECIFICATION 2. (Balanced/ω-regular) The proportion of parent observations assigned into either child is bound below by some ω > 0, i.e. n Cj ωn P . SPECIFICATION 3. (Randomized/random-split) The probability of splitting along any feature/dimension of the input space is bound below by some π > 0. SPECIFICATION 4. (Subsampling) Trees are trained on subsample of size s, drawn without replacement from n training samples, where s/n 0 as s . SPECIFICATION 5. (Honesty) Trees are trained using the sample splitting procedure described in Appendix C.1. A.3. Regularity conditions REGULARITY CONDITION 1. 
Let V (x) be as defined in Assumption 2 and let ρ i (x) denote the influence function of the i-th observation with respect to the target θ (x): ρ i (x) := ξ V (x) 1ψθ (x),ν (x)(Oi). Then, Var(ρ i (x) | Xi = x) > 0, for all x X. REGULARITY CONDITION 2. Trees are grown on subsamples of size s scaling as s = nβ, for some subsample scaling exponent β bound according to βmin < β < 1, such that log ((1 ω) 1) where 0 < π, ω < 1 are constants defined in forest Specifications 2 and 3. A.4. Neyman orthogonality To identify the underlying local parameters (θ (x), ν (x)) RK one must have a score ψθ,ν(O) with at least K = Kθ +Kν components, where here we use Kθ and Kν to denote the dimensions of the component subvectors θ (x) RKθ and ν (x) RKν. Conceptually, a score ψθ,ν(O) can be partitioned into the components that identify the θ-coordinates, denoted by ψ1, and those that identify the ν-coordinates, denoted by ψ2, and thus the moment functions Mθ,ν(x) in (36) can also be partitioned the same way: ψθ,ν(O) = ψ1(θ, ν; O) ψ2(θ, ν; O) , Mθ,ν(x) = M1(θ, ν; x) M2(θ, ν; x) = E[ψ1(θ, ν; O) | X = x] E[ψ2(θ, ν; O) | X = x] The corresponding Jacobian matrix of Mθ,ν(x) taken with respect to (θ, ν) and evaluated at the truth (θ (x), ν (x)) is V (x) = (θ,ν) M(θ, ν; x)|θ=θ (x),ν=ν (x) = Vθθ(x) Vθν(x) Vνθ(x) Vνν(x) Generalized Random Forests using Fixed-Point Trees where here the subscripts in the block expressions of V (x) indicate the coordinates with which the gradient is taken, and in all cases are evaluated at the truth (θ (x), ν (x)): Vθθ(x) = θ M1(θ, ν; x)|θ=θ (x),ν=ν (x) , Vθν(x) = ν M1(θ, ν; x)|θ=θ (x),ν=ν (x) , Vνθ(x) = θ M2(θ, ν; x)|θ=θ (x),ν=ν (x) , Vνν(x) = ν M2(θ, ν; x)|θ=θ (x),ν=ν (x) . In this context, the assumption of Neyman orthogonal moment conditions is more completely labeled as Neyman orthogonality for the estimation of θ (x) with respect to the nuisance ν (x), and can be summarized as an assumption that the moment conditions for θ (x) are insensitive to first-order changes in ν around the truth ν (x) whenever θ = θ (x). For GRF, this means that one assumes (1) satisfies M1(θ (x), ν (x); x) = 0, and in other words, the partial derivatives of the moment functions for θ (x) with respect to ν are zero at (θ (x), ν (x)): Vθν(x) = 0. A.5. Example: Neyman orthogonality for VCM and HTE Consider the VCM/HTE model with data (Yi, Wi, Xi) related according to E[Yi | Xi = x] = ν (x) + W i θ (x), such that, as discussed in Section 3.2, the score function ψθ,ν that identifies the underlying (θ (x), ν (x)) is ψθ,ν(Yi, Wi) := (Yi W i θ ν)Wi Yi W i θ ν and the corresponding local Jacobian V (x) has block form V (x) = E Wi W i W i Wi 1 Xi = x = E[Wi W i | Xi = x] E[W i | Xi = x] E[Wi | Xi = x] 1 Therefore, for Neyman orthogonality to hold one requires that E[Wi | Xi = x] = 0. B. Derivations and Proofs B.1. Proofs for Section 3.4 B.1.1. MULTIVARIATE CART CRITERIA Let ρi RK be vector-valued responses associated with covariates Xi P. A standard CART split (C1, C2) of P minimizes the conventional least-squares criterion: X {i : Xi C1} ρi ρC1 2 + X {i : Xi C2} ρi ρC2 2 , (38) where ρCj := n 1 Cj P {i:Xi Cj} ρi is the local prediction over child node Cj. We verify that a split (C1, C2) minimizes (38) if and only if it maximizes n C1 ρC1 2 + n C2 ρC2 2 . (39) Proof. Each sum in (38) can be expanded as X {i : Xi Cj} ρi ρCj 2 = X ρi ρCj 2 1(Xi Cj), ρi 2 2ρ i ρCj + ρCj 2 1(Xi Cj), {i:Xi P } ρi 2 1(Xi Cj) n Cj ρCj 2 . 
Generalized Random Forests using Fixed-Point Trees Therefore, the least-squares criterion CART (38) is equivalently written as {i : Xi Cj} ρi ρCj 2 = X {i:Xi P } ρi 2 1(Xi Cj) n Cj ρCj 2 {i:Xi P } ρi 2 1(Xi Cj) n C1 ρC1 2 + n C2 ρC2 2 , {i:Xi P } ρi 2 n C1 ρC1 2 + n C2 ρC2 2 . The first term does not depend on the choice of split, and therefore the split that minimizes (38) is equivalent to the split that maximizes (39). B.1.2. SPLITS VIA CART ON PSEUDO-OUTCOMES The following result is a generalization to the claim made in Section 3.4 that a CART split on pseudo-outcomes ρFPT i will produce a split that maximizes the e FPT-criterion, and is sufficiently general to cover gradient-based pseudo-outcomes ρgrad i and the corresponding e grad-criterion. Lemma B.1. Suppose we can write θCj = a + 1 n Cj {i:Xi Cj} ρi, ρi = BψˆθP ,ˆνP (Oi), (40) where a and B denote appropriately sized vectors and matrices whose values do not depend on the candidate child node Cj. Under Assumptions A.1, the split (C1, C2) that maximizes e (C1, C2) = n C1n C2 θC1 θC2 2 , is exactly the split chosen by CART for vector-valued responses ρi fit over covariates Xi P. Proof of Lemma B.1. The scores ψθ,ν(Oi) satisfy subgradient conditions by Assumption 6, and therefore the parent solutions (ˆθP , ˆνP ) satisfy the first-order conditions X {i:Xi P } ψˆθP ,ˆνP (Oi) = 0. {i:Xi P } ψˆθP ,ˆνP (Oi) = X {i:Xi C1} ψˆθP ,ˆνP (Oi) + X {i:Xi C2} ψˆθP ,ˆνP (Oi), {i:Xi C1} ψˆθP ,ˆνP (Oi) + X {i:Xi C2} ψˆθP ,ˆνP (Oi) {i:Xi C1} ρi + X {i:Xi C2} ρi. Each sum in the previous expression is equivalently written as P ρi = n Cj( θCj a). Hence, {i:Xi C1} ρi + X {i:Xi C2} ρi, = n C1( θC1 a) + n C2( θC2 a), n P θC1 + n C2 Generalized Random Forests using Fixed-Point Trees Writing ρCj := 1 n Cj P {i:Xi Cj} ρi, one has: ρC1 = θC1 a, n P θC1 n C2 n P ρC1 2 = n C1n2 C2 n3 P θC1 θC2 2 . Applying analogous arguments with respect to C2, one has the symmetric result: n P ρC2 2 = n C2n2 C1 n3 P θC1 θC2 2 . n C1 ρC1 2 + n C2 ρC2 2 = n C1n2 C2 n3 P θC1 θC2 2 + n C2n2 C1 n3 P θC1 θC2 2 , θC1 θC2 2 , = e (C1, C2). Based on the arguments in Appendix B.1.1, a split (C1, C2) maximizes n C1 ρC1 2 +n C2 ρC2 2 if and only if it is a CART split performed on the ρi over P. That is, e (C1, C2) is precisely maximized by a single CART split on ρi = BψˆθP ,ˆνP (Oi) fit over covariates Xi P, as desired. B.1.3. SCALE INVARIANCE OF CART SPLITS Lemma B.2 (Argmax equivalence of FPT criteria). The optimal split identified by CART on pseudo-outcomes ρFPT i of the form (23) does not depend on the scale factor η, for any η = 0. Proof of Lemma B.2. Denote by ρ(η) i FPT pseudo-outcomes based on an arbitrary scale factor η = 0 of the form (23): ρ(η) i := ηξ ψˆθP ,ˆνP (Oi), (41) and let ψCj denote the child-leaf average score evaluated at the parent solution (ˆθP , ˆνP ): ψCj := 1 n Cj {i:Xi Cj} ψˆθP ,ˆνP (Oi), such that the corresponding child-leaf pseudo-outcome averages ρ (η) Cj := 1 n Cj P {i:Xi Cj} ρ(η) i are equivalently written as ρ (η) Cj = ηξ ψCj η (C1, C2) denote the FPT criterion of the form (25) based on pseudo-outcomes (41): η (C1, C2) = n C1n C2 ρ (η) C1 ρ (η) Cj 2 = n C1n C2 ηξ (ψC1 ψC2) 2 . One has: ηξ (ψC1 ψC2) 2 = η2 ξ (ψC1 ψC2) 2 , Generalized Random Forests using Fixed-Point Trees and hence the e FPT η -criteria obey the scaling relation: η (C1, C2) = η2 e FPT 1 (C1, C2), (42) where e FPT 1 denotes the FPT criterion induced by pseudo-outcomes ρ(1) i based on unit scale factor η = 1. 
The relation (42) implies that any nonzero split-independent rescaling ρ(η) i = ηρ(1) i will induce a splitting criterion e FPT η (C1, C2) with the same maximizer as e FPT 1 (C1, C2): arg max (C1,C2) η (C1, C2) o = arg max (C1,C2) 1 (C1, C2) o = arg max (C1,C2) 1 (C1, C2) o . Intuitively, a CART split is chosen by ranking the criterion values among the candidate splits and selecting the maximizing split (C1, C2). Therefore, the FPT splitting mechanism is unaffected by the scale factor η used to specify fixed-point pseudo-outcomes (23). The absolute scale of the e FPT-criterion does not matter when searching for the optimal split, and only the criterion rankings across the candidate splits determine the final partition. B.2. Proofs for Section 4 Notation and definitions. Let o P (a, b, c) := o P (max{a, b, c}), with an analogous abbreviation for OP ( ). For a fixed parent node P, denote by x P the center of mass of the Xi P, and let r := sup{i:Xi P } Xi x P denote the radius of the parent P. Throughout, we consider an asymptotic regime where n Cj and r 0, corresponding to leaves over X of vanishing radius. Further, r and n Cj are related under the conditions of GRF Proposition 1, namely, r 2 n Cj and hence n Cjr2 and 1/ n Cj = o(r). Let θ Cj denote the true parameter expectation over the child node: θ Cj := E[θ (X) | X Cj], j = 1, 2, (43) and let θ Cj(x P ) denote an oracle version of the gradient-based leaf statistic: θ Cj(x P ) := θ (x P ) 1 n Cj {i:Xi Cj} ξ V (x P ) 1ψθ (x P ),ν (x P )(Oi), where V (x) is the underlying local Jacobian in Assumption 2. Equivalently, in terms of the oracle pseudooutcome/influence function ρ i ( ) defined in Regularity Condition 1, θ Cj(x P ) := θ (x P ) + 1 n Cj {i:Xi Cj} ρ i (x P ). The following are technical lemmas used for the proof of Proposition 4.1. Lemma B.3. Suppose Assumptions A.1 and Specifications A.2 hold. Then, (C1, C2) = n C1n C2 θ C1 θ C2 2 + o P r2, 1 n C1 , 1 n C2 Proof of Lemma B.3. Write the difference ˆθCj θ Cj as ˆθCj θ Cj = ˆθCj θ Cj(x P ) + θ Cj(x P ) E[ θ Cj(x P ) | X Cj] + E[ θ Cj(x P ) | X Cj] θ Cj Generalized Random Forests using Fixed-Point Trees Under standard LLN arguments, the second term satisfies T2 = OP (1/ n Cj), and in an asymptotic regime with r 2 n Cj one has T2 = o P (r). Meanwhile, the first and third terms appear in the proofs of Propositions 2 and 1 of Athey et al. (2019), respectively, and satisfy T1 = o P (r, 1/ n Cj) and T3 = O(r2) = T3 = o(r). It follows ˆθCj θ Cj = o P r, 1/ n Cj , and in particular ˆθC1 ˆθC2 = θ C1 θ C2 + o P r, 1 n C1 , 1 n C2 Write A = θ C1 θ C2 and let E be any term satisfying E = o P (r, 1/ n C1, 1/ n C2) such that (C1, C2) is equivalently written (C1, C2) = (n C1n C2/n2 P ) A + E 2. Consider the difference (C1, C2) n C1n C2 θ C1 θ C2 2 = n C1n C2 A + E 2 A 2 , 2 A, E + E 2 . Under Specification 2 there exists a fixed proportion ω > 0 such that n C1, n C2 ωn P , and hence n C1n C2/n2 P ω(1 ω) and also n C1n C2/n2 P 1/4 for all n C1 + n C2 = n P . Therefore n C1n C2/n2 P = O(1). Meanwhile, E 2 = o P (r2, 1/n C1, 1/n C2) is true by definition of E, and under our assumptions one may follow the arguments of Athey et al. (2019) Proposition 1 to see that A = θ C1 θ C2 = O(r). Thus, A, E = O(r) o P r, 1 n C1 , 1 n C2 r2, r n C1 , r n C2 and therefore (C1, C2) n C1n C2 θ C1 θ C2 2 = o P r2, 1 n C1 , 1 n C2 as desired. Lemma B.4. 
Suppose the conditions of Lemma B.3 hold, and assume moreover Neyman orthogonal moment conditions such that the underlying Jacobian V (x) defined in Assumption 2 with block form (37). Then, η (C1, C2) = n C1n C2 n2 P η2 Vθθ(x P )(θ C1 θ C2) 2 + o P r2, 1 n C1 , 1 n C2 η defined in Lemma B.2 denotes the FPT criterion with arbitrary scale factor η = 0. Proof of Lemma B.4. From the proof of Lemma B.2 one finds that FPT η (C1, C2) is equivalently written η (C1, C2) := n C1n C2 n2 P η2 ξ (ψC1 ψC2) 2 , ψCj := 1 n C1 {i:Xi Cj} ψˆθP ,ˆνP (Oi). Under standard LLN arguments the average scores ψCj satisfy ψCj = E[ψˆθP ,ˆνP (O) | X Cj] + OP (1/ n Cj). (44) One applies iterated expectation to see E[ψˆθP ,ˆνP (O) | X Cj] = E h E h ψˆθP ,ˆνP (O) | X i | X Cj i = E[MˆθP ,ˆνP (X) | X Cj], ψCj = E[MˆθP ,ˆνP (X) | X Cj] + OP (1/ n Cj). (45) Generalized Random Forests using Fixed-Point Trees Expansion of MˆθP ,ˆνP (X). Under Assumption 2 one considers the Taylor expansion of MˆθP ,ˆνP (X) about (θ, ν) = (θ (x P ), ν (x P )): MˆθP ,ˆνP (X) = Mθ (x P ),ν (x P )(X) + (θ,ν)Mθ (x P ),ν (x P )(X) ˆθP θ (x P ) ˆνP ν (x P ) ˆθP θ (x P ) ˆνP ν (x P ) The consistency of the parent solutions (ˆθP , ˆνP ) for (θ (x P ), ν (x P )) is established by Athey et al. (2019), and in particular (ˆθP , ˆνP ) (θ (x P ), ν (x P )) = OP (r, 1/ n P ). The asymptotic regime r 2 n P implies 1/ n P = o(r) and therefore the higher order quadratic term is equivalently expressed: MˆθP ,ˆνP (X) = Mθ (x P ),ν (x P )(X) + (θ,ν)Mθ (x P ),ν (x P )(X) ˆθP θ (x P ) ˆνP ν (x P ) and therefore E h MˆθP ,ˆνP (X) | X Cj i = E Mθ (x P ),ν (x P )(X) | X Cj + E (θ,ν)Mθ (x P ),ν (x P )(X) X Cj ˆθP θ (x P ) ˆνP ν (x P ) One has (θ,ν)Mθ (x P ),ν (x P )(X) = V (x P )+OP (r) because Mθ,ν(x) is Lipschitz in x, and the expansion in the previous display becomes: E h MˆθP ,ˆνP (X) | X Cj i = E Mθ (x P ),ν (x P )(X) | X Cj + V (x P ) ˆθP θ (x P ) ˆνP ν (x P ) + OP (r2). (46) Expansion of Mθ (x P ),ν (x P )(X). Following similar arguments, the term Mθ (x P ),ν (x P )(X) is expanded about (θ, ν) = (θ (X), ν (X)) as: Mθ (x P ),ν (x P )(X) = Mθ (X),ν (X)(X) + V (X) θ (x P ) θ (X) ν (x P ) ν (X) = V (X) θ (x P ) θ (X) ν (x P ) ν (X) where Mθ (X),ν (X)(X) = 0 holds because (θ (X), ν (X)) are defined as satisfying the GRF moment conditions (1) local to X. One takes the conditional expectation of the previous display: E Mθ (x P ),ν (x P )(X) | X Cj = E V (X) θ (x P ) θ (X) ν (x P ) ν (X) Whenever X Cj one has X x P = O(r), and the same Lipschitz arguments can be applied to see V (X) = V (x P ) + OP (r) conditional on X Cj, and the previous display simplifies: E Mθ (x P ),ν (x P )(X) | X Cj = V (x P ) θ (x P ) θ Cj ν (x P ) ν Cj + OP (r2), (47) where θ Cj := E[θ (X) | X Cj] and ν Cj := E[ν (X) | X Cj]. 
Substitute (47) into the conditional expectation (46): E h MˆθP ,ˆνP (X) | X Cj i = V (x P ) θ (x P ) θ Cj ν (x P ) ν Cj + V (x P ) ˆθP θ (x P ) ˆνP ν (x P ) "ˆθP θ Cj ˆνP ν Cj Generalized Random Forests using Fixed-Point Trees Therefore, the child node score averages ψCj in (45) satisfy ψCj = V (x P ) "ˆθP θ Cj ˆνP ν Cj + OP r2, 1/ n Cj , and the difference ψC1 ψC2 satisfies ψC1 ψC2 = V (x P ) ˆθP θ C1 ˆνP ν C1 V (x P ) ˆθP θ C2 ˆνP ν C2 r2, 1 n C1 , 1 n C2 = V (x P ) θ C1 θ C2 ν C1 ν C2 r2, 1 n C1 , 1 n C2 We assume η is a fixed scalar and ξ a fixed matrix, it follows: ηξ (ψC1 ψC2) = ηξ V (x P ) θ C1 θ C2 ν C1 ν C2 r2, 1 n C1 , 1 n C2 The fixed matrix ξ selects the coordinates of the target effect as ξ (θ, ν) = θ, and hence the product ξ V (x P ) simplifies: ξ V (x P ) = ξ Vθθ(x P ) Vθν(x P ) Vνθ(x P ) Vνν(x P ) = Vθθ(x P ) Vθν(x P ) . Under Neyman orthogonality one has Vθν(x P ) = 0, implying that ξ V (x P ) = [Vθθ(x P ) 0], and (48) becomes ηξ (ψC1 ψC2) = ηVθθ(x P )(θ C1 θ C2) + OP r2, 1 n C1 , 1 n C2 Asymptotic analysis. Let E be any term satisfying E = OP (r2, 1/ n C1, 1/ n C2). In our asymptotic regime with r 2 n Cj = 1/ n Cj = o(r), one has r2, 1 n C1 , 1 n C2 r, 1 n C1 , 1 n C2 and therefore (49) satisfies ηξ (ψC1 ψC2) = ηVθθ(x P )(θ C1 θ C2) + o P r, 1 n C1 , 1 n C2 Write A = Vθθ(θ C1 θ C2) such that e FPT η (C1, C2) is equivalently written e FPT η (C1, C2) = (n C1n C2/n2 P ) η2 A + E 2. Consider the difference η (C1, C2) n C1n C2 n2 P η2 Vθθ(θ C1 θ C2) 2 = n C1n C2 n2 P η2 A + E 2 A 2 , n2 P η2 2 A, E + E 2 . One repeats the same arguments used in the final asymptotic analysis of Lemma B.3 to show 2 A, E + E 2 = o P r2, 1 n C1 , 1 n C2 and thus e FPT η (C1, C2) n C1n C2 n2 P η2 Vθθ(x P )(θ C1 θ C2) 2 = o P r2, 1 n C1 , 1 n C2 as desired. Generalized Random Forests using Fixed-Point Trees Proof of Proposition 4.1. First, under Assumptions A.1 the θ-block Vθθ(x P ) of the local Jacobian V (x P ) is strictly positive definite and thus V defines a true norm. From the proof of Lemma B.3: ˆθC1 ˆθC2 = θ C1 θ C2 + o P r, 1 n C1 , 1 n C2 The matrix V (x P ) is non-random and fixed given P and η is a fixed scalar. It follows: ηVθθ(x P )(ˆθC1 ˆθC2) = ηVθθ(x P )(θ C1 θ C2) + o P r, 1 n C1 , 1 n C2 Up to a negative factor, the expression on the right is precisely the same as (50) in the proof of Lemma B.4, and thus one repeats the arguments to arrive at ηVθθ(x P )(ˆθC1 ˆθC2) 2 2 = ηVθθ(x P )(θ C1 θ C2) 2 2 + o P r2, 1 n C1 , 1 n C2 ηV (C1, C2) = n C1n C2 n2 P η2 Vθθ(x P )(θ C1 θ C2) 2 2 + o P r2, 1 n C1 , 1 n C2 The right hand side is precisely the same as in the statement of Lemma B.4 established for e FPT η (C1, C2), and thus η (C1, C2) ηV (C1, C2) = o P r2, 1 n C1 , 1 n C2 as desired. Proof of Lemma 4.2. Firstly, Specifications 4 (subsampling) and 5 (honesty) describe conditions imposed on the sampling mechanism and are not affected by the form of the splitting criterion. It remains to verify whether Specification 1 (symmetry), Specification 2 (balanced/ω-regular), and Specification 3 (randomized/random-split) are satisfied by T ( V ). 1. Specification 1: Symmetry. A tree is said to be symmetric if its estimates are invariant under permutations of the tree s training samples. Conditional on a sequence of criterion values computed over splits of P, the CART mechanism of selecting the best split by scanning over the collection of candidates does not depend on the parent samples at all. This means that asymmetry could only enter through the criterion values. 
Therefore, a sufficient condition for symmetry in the tree estimates with respect to permutations of the tree samples is whether the criterion V (C1, C2) is symmetric. Conditional on the child solutions ˆθC1, ˆθC2, the map (ˆθC1, ˆθC2) 7 V (C1, C2) does not depend on the parent samples at all, and therefore asymmetry could only enter through child solutions ˆθCj. However, both criteria use precisely the same child solutions ˆθCj in (7), and therefore V (C1, C2) will be symmetric whenever (C1, C2) is symmetric (specifically, whenever ψθ,ν(Oi) is symmetric with respect to permutations of i). 2. Specification 2: Balanced/ω-regular. This condition is enforced by by GRF by adding an additional stopping condition to the gradient-based version of Algorithm 1. Specifically, GRF stops a recursive splitting path if a proposed -optimal split were to send fewer than ωn P of the parent samples into either child. Simply stated, GRF enforces balanced splits by defining the set of valid candidate splits as those with at least ωn P parent samples in each child. This is left unchanged by our method. 3. Specification 3: Randomized/random-split. The asymptotic theory of GRF requires that, at each node, each variable is selected for a split with some lower bound probability π > 0. In order to satisfy the minimum split probability GRF uses the feature sampling device of Denil et al. (2014) which, at each step, considers only min{max{Poisson(m), 1}, p} randomly selected features as candidate variables, where p = dim(X) and m is a GRF tuning parameter. In other words, GRF defines the set of valid candidate splits such that the set of valid splitting dimensions is itself a random variable. This mechanism is left unchanged under our method. No column of V (x P ) is all-zero V ,k(x P ) = 0 because V (x P ) is strictly positive definite symmetric, and therefore V (C1, C2) will not systematically ignore signals along parameter dimensions θk that can be detected by the - criterion. Finally, ˆθC1 ˆθC1 = 0 = Vθθ(x P )(ˆθC1 ˆθC2) = 0, Generalized Random Forests using Fixed-Point Trees because V (x) is strictly positive definite symmetric by Assumption 2. Therefore (C1, C2) > 0 = V (C1, C2) > 0 meaning that the V -criterion is non-degenerate and will always be able to select at least one feature whenever the -criterion can select a feature. Therefore, all five specifications are met, and one concludes that T ( V ) must satisfy the forest Specifications A.2 whenever they are satisfied by T ( ). B.3. Asymptotic equivalence of the pseudo-outcome approximation for VCM/HTE models In this section we establish the asymptotic equivalence of the further acceleration of the fixed-point method proposed in Section 5 for VCM/HTE models. The accelerated algorithm is based on FPT pseudo-outcomes that use an approximation θP for the actual parent solutions ˆθP in (31). Specifically, the parent-leaf approximations θP are found by a single gradient descent step towards ˆθP taken from the origin (54). Let ρFPT i denote the original FPT pseudo-outcomes for VCM/HTE models: ρFPT i := (Wi W P ) Yi Y P (Wi W P ) ˆθP , (52) where the solution ˆθP for the local model over the parent P are precisely the OLS coefficients from the regression the centered outcomes Yi Y P R on the centered regressors Wi W P RK. In contrast, let ϕFPT i denote approximations of ρFPT i pseudo-outcomes that are of the form i := (Wi W P ) Yi Y P (Wi W P ) θP , (53) where θP approximates ˆθP as: {i:Xi P } (Wi W P )(Yi Y P ) = γ 1 n P W P YP . 
(54) Here, WP Rn P K and YP Rn P denote the centered data matrices, WP := [Wi W P ]i:Xi P and YP := [Yi Y P ]i:Xi P , and the scalar γ > 0 denotes the exact line search step size corresponding to the regression of the centered outcomes on the centered regressors: W P YP 2 2 WP W P YP 2 2 . (55) Lemma B.5. Let θCj denote the FPT estimator of the form (24) for the child solution ˆθCj for VCM/HTE models. One can express θCj in terms of the corresponding fixed-point pseudo-outcomes: θCj := ˆθP + 1 n Cj i:Xi Cj ρFPT Similarly, denote by θCj the FPT estimator of ˆθCj induced by pseudo-outcomes approximations ϕFPT θCj := ˆθP + 1 n Cj {i:Xi Cj} ϕFPT Then, under the assumptions of Proposition 4.1, θCj is consistent for θCj as: θCj θCj = o P (1). Proof. A direct calculation reveals that the difference between the original FPT pseudo-outcomes ρFPT i in (52) and the approximations ϕFPT i in (53) satisfy i = (Wi W P ) h Yi Y P (Wi W P ) ˆθP i h Yi Y P (Wi W P ) θP i , = (Wi W P )(Wi W P ) (ˆθP θP ). (56) Generalized Random Forests using Fixed-Point Trees Therefore, the difference between the original FPT child estimator θCj and the approximation θCj satisfies θCj θCj = 1 n Cj {i:Xi Cj} (ρFPT {i:Xi Cj} (Wi W P )(Wi W P ) (ˆθP θP ), = SCj(ˆθP θP ), where we denote SCj := 1 n Cj P {i:Xi Cj}(Wi W P )(Wi W P ) . Therefore, θCj θCj = SCj(ˆθP θP ) SCj F ˆθP θP . (57) Under GRF s regularity conditions, in a limit where n Cj and the parent radius r := sup{i:Xi P } Xi XP goes to zero r 0, we have SCj p Q for some positive semidefinite symmetric matrix Q, and hence SCj F = OP (1). Meanwhile, by definition (31) for the OLS coefficients ˆθP and definition (54) for the one-step approximations θP , we have ˆθP θP = A 1 P n 1 P W P YP γ n 1 P W P YP , = A 1 P γI n 1 P W P YP , A 1 P γI F n 1 P W P YP , = n 1 P W P WP 1 γI F n 1 P W P YP , (58) where AP = n 1 P W P WP follows from the definition of AP as an estimator of the Jacobian ψ, e.g. (13) in the context of VCM/HTE models. Under the Lipschitz continuity Assumptions 1 & 3, one has the standard stochastic bound for the cross term n 1 P W P YP : n 1 P W P YP = OP while the difference [n 1 P W P WP ] 1 γI is stochastically bound as n 1 P W P WP 1 γI F = OP (1), because n 1 P W P WP p Cov(Wi | Xi P) is non-singular under Assumption 2. Coupling these stochastic bounds together according to (58) gives ˆθP θP = OP and trivially, because n Cj < n P , ˆθP θP = OP Under Proposition 1 of GRF one assumes r 2 n C1, n C2 and thus, in an asymptotic regime where n Cj and r 0, one has 1/ n Cj = o(r), and hence: ˆθP θP = o P (1). (61) Returning to (57), the consistency of the parent approximation θP as (61) implies that the approximation θCj is itself consistent for the original FPT child estimator θCj: θCj θCj = o P (1), (62) as desired. Generalized Random Forests using Fixed-Point Trees C. Implementation Details C.1. Honest subsampling In this section we present the honest subsampling mechanism. Trees are used to form partitions of the input space such as to to specify weight functions αi(x), defined as b=1 αbi(x), for αbi(x) := 1(Xi Lb(x)) |Lb(x)| , i = 1, . . . , n, (63) where Lb(x) denotes a subset training samples that fall alongside x according to the partition of tree b. The honesty mechanism ensures that no observation in leaf Lb(x) was used to build the partition rules of tree b. 
This is achieved by separating an initial subsample into two subsets: One for building the partition rules, and the other allocated as samples to the local leaves Lb(x) according to the trained rules. Below, we give a detailed outline of how subsampling and honest sample splitting is used to train a forest of trees, then show that weight function αi(x) given by honest trees according to (63) is conditionally independent of Oi given Xi. Honest subsampling for GRF For tree b {1, . . . , B}, 1. (Subsampling). Draw an initial subsample I(b) of size s := |I(b)| from the training set (without replacement). 2. (Honest splitting). Split I(b) into disjoint sets J (b) 1 and J (b) 2 of size |J (b) 1 | = s/2 and |J (b) 2 | = s/2 . (a) Train tree T(J (b) 1 ) based on the first subsample {(Xi, Oi) : i J (b) 1 }. Let R(b) 1 , . . . , R(b) M denote the partition of X induced by T(J (b) 1 ) such that R(b) m := n x X : x satisfies the partition rules for leaf m of T(J (b) 1 ) o . (b) Subset the samples from the second subsample {Xi : i J (b) 2 } according to the trained rules of T(J (b) 1 ), i.e. the samples of J (b) 2 in the leaves are determined by the rules of T(J (b) 1 ). For any x X, the local leaf Lb(x) that appears in (63) is defined as the specific subset of J (b) 2 samples belonging to the same leaf of tree T(J (b) 1 ) as x, Lb(x) = {Xi R(b) m : i J (b) 2 and x R(b) m }, Conditional independence of αi(x) and Oi given Xi. By definition, the partition rules of tree T(J (b) 1 ) depend only on the J (b) 1 subsample. The rules of a tree operate only on covariate values, and therefore the task of subsetting {Xi : i J (b) 2 } into leaves according to the rules of T(J (b) 1 ) requires knowledge of the Xi values from the J (b) 2 subsample but not necessarily the Oi. Based on this understanding, we will show that αi(x) is conditionally independent of Oi given Xi. Based on (63), it is sufficient to show E[αbi(x) | Oi, Xi] = E[αbi(x) | Xi]. Case 1. Suppose i / J (b) 2 . By definition Lb(x) {Xj : j J (b) 2 }. It is immediate that 1({Xi Lb(x)}) = 0, and therefore αbi(x) = 0, and trivially E[αbi(x) | Oi, Xi] = E[αbi(x) | Xi] = 0, for all i / J (b) 2 . Case 2. Suppose i J (b) 2 . We show that each component used to specify αbi(x) in (63) is conditionally independent of Oi given Xi: Tree T(J (b) 1 ) is trained using only the J (b) 1 subsample. This does not depend on Oi, for all i J (b) 2 , conditionally on Xi or otherwise. Generalized Random Forests using Fixed-Point Trees The rules of tree T(J (b) 1 ) operate only on input values. Therefore, conditionally on Xi for all i J (b) 2 , the leaves of the J (b) 2 subsample specified by tree T(J (b) 1 ) do not depend on the value of Oi. Leaf Lb(x) is the specific subset of the J (b) 2 samples satisfying the same partition rules of T(J (b) 1 ) as x. Given the leaves have been specified by the previous step, this depends only on x. Therefore, the individual component functions αbi(x) = 1({Xi Lb(x)})/|Lb(x)| are conditionally independent of Oi given Xi, E[αbi(x) | Oi, Xi] = E[αbi(x) | Xi], for all i J (b) 2 . Demonstration of honest subsampling. Let {(Xi, Oi)}n i=1 denote a training set of n = 20 observations, where each Xi = (Xi,1, Xi,2) is over X R2. We will use a forest of a single tree (B = 1) to specify the functional form of weights αi(x). 1. (Subsampling). Draw an initial subsample I of size s = 10. 2. (Honest splitting). Split I into two disjoint sets J1 and J2 , each with s/2 = 5 samples. 
i Xi,1 Xi,2 Oi 1 i I Xi,1 Xi,2 Oi 2 3 5 8 10 11 14 15 16 20 i J1 Xi,1 Xi,2 Oi 2 3 8 15 20 i J2 Xi,1 Xi,2 Oi 5 1 0 10 2 -2 11 0 1 14 1 -2 16 2 2 (a) Train a tree using the data from the first subsample J1 , inducing a partition of X R2. Suppose the fitted tree has the following structure: Xi,2 < 1 R3 (b) Use the trained partition rules to subset the J2 subsample into separate leaves. Generalized Random Forests using Fixed-Point Trees The tree trained on the J1 subsample will subset the J2 subsample as {Xi : i J2} = {X10, X14} {X5, X11, X16} , where we include the trivial union with to note that the tree assigns none of the J2 samples to the partition of R2 where Xi,1 3. The leaf Lb(x) is the specific subset of the J2 subsample such that Xi J2 satisfy the same partition rules as x. Given a test point x = x0, there are three possible scenarios for Lb(x0) that correspond to the three regions R1, R2, R3 R2 in which the test point x0 can appear. Region 1. If x0 R1 then Lb(x0) = {X10, X14} and αbi(x0) = 1({Xi Lb(x0)}) ( 1 2 if i {10, 14}, 0 otherwise. Therefore, αi(x0) = 1 2 for i = 10, 14 and zero for i {1, . . . , 20} \ {10, 14}. Region 2. If x0 R2 then Lb(x0) = {X5, X11, X16} and αbi(x0) = 1({Xi Lb(x0)}) ( 1 3 if i {5, 11, 16}, 0 otherwise. Therefore, αi(x0) = 1 3 for i = 5, 11, 16 and zero i {1, . . . , 20} \ {5, 11, 16}. Region 3. If x0 R3 then Lb(x0) = . This is a degenerate case such that αbi(x0) = 1({Xi Lb(x0)}) is undefined, leading to a non-identifiability problem whenever x0 R3. When this occurs, Tibshirani et al. (2024) recommends calculating αi(x0) based on only the trees with non-empty Lb(x0). Let B := {b {1, . . . , B} : |Lb(x0)| > 0} denote the indices of non-empty leaves associated with x0. Then, the GRF weight functions based on this recommendation can be written as b B αbi(x0). Generalized Random Forests using Fixed-Point Trees Algorithm 1 The fixed-point tree algorithm function TRAINFIXEDPOINTTREE Input: node N node P0 GETSAMPLES(N) queue Q INITIALIZEQUEUE(P0) while NOTNULL(node P POP(Q)) do (ˆθP , ˆνP ) SOLVEESTIMATINGEQUATION(P) Computes (6). ρFPT FPTPSEUDOOUTCOMES(ˆθP , ˆνP ) Applies (26) over P. split Σ CARTSPLIT(P, ρFPT) Optimizes (19). if SPLITSUCCEEDED(Σ) then SETCHILDREN(P, GETLEFTCHILD(Σ), GETRIGHTCHILD(Σ)) ADDTOQUEUE(Q, GETLEFTCHILD(Σ)) ADDTOQUEUE(Q, GETRIGHTCHILD(Σ)) end if end while Output: tree with root node P0 end function POP returns and removes the oldest element of queue a Q, unless Q is empty, in which case it returns NULL. CARTSPLIT runs a multivariate CART split on the pseudo-outcomes ρFPT := {ρFPT i }i P , and either returns a pair of child nodes or indicates that no split of P is possible. C.2. In-sample predictions There is additional bias associated with making predictions based on in-sample observations Xi that may have been used either to train the tree structure or to populate the local leaves Lb(x). The recommendation of Tibshirani et al. (2024) is along the lines of the out-of-bag mechanism used by Breiman (2001). For an in-sample observation x {Xi}n i=1, calculate weights αoob i (x ) based only on those trees whose initial subsample I(b) does not contain x . 
Then the out-of-bag weight is defined as: αoob i (x ) := 1 |{b : x / I(b)}| {b:x / I(b)} αoob bi (x ) for αoob bi (x ) := 1(Xi Lb(x )) The in-sample prediction ˆθoob(x ) for x is made by GRF by solving a version of the locally weighted estimating equation (5) using out of bag weights αoob i (x ) ˆθoob(x ), ˆνoob(x ) arg min θ,ν i=1 αoob i (x )ψθ,ν(Oi) which preserve the consistency and asymptotic normality of the GRF estimator at in-sample observations. C.3. Algorithms and Pseudocode C.4. Simulation Details Implementation details. We implement the GRF-FPT algorithm in a fork of grf (Tibshirani et al., 2024) available at https://github.com/dfleis/grf. The functions grf::lm forest and grf::multi arm causal forest provide an easy to use interface for VCM and HTE estimation, respectively, and we allow the choice GRF-FPT1, GRF-FPT2, or GRF-grad to be controlled via the method argument. Code and data for reproducing all experiments and figures are available at https://github.com/dfleis/grf-experiments. Data-generating settings. The different setting for the target effects θ k(x) include a sparse linear setting, a sparse logistic setting with interaction, a dense logistic setting, and a random function generator setting. Tables 2 and 3 provide the details of each regime for VCM and HTE experiments, respectively, for the data-generating model (28). These tables also summarize Generalized Random Forests using Fixed-Point Trees Algorithm 2 Stage I GRF-FPT: Training a generalized random forest using fixed-point trees function TRAINGENERALIZEDRANDOMFORESTFPT Input: samples S, number of trees B for b = 1, . . . , B do set of samples I SUBSAMPLE(S) sets of samples JBUILD, JPOPULATE HONESTSPLIT(I) See honesty: Appendix C.1. tree Tb TRAINFIXEDPOINTTREE(JBUILD) See Algorithm 1. leaves Lb POPULATELEAVES(Tb, JPOPULATE) See honesty: Appendix C.1. end for Output: forest F {L1, . . . , LB} end function POPULATELEAVES creates a collection of subsets (leaves) of the JPOPULATE samples based on the partition rules of tree Tb. For weight functions αi(x), see GETWEIGHTS in Algorithm 3. For Stage II, see ESTIMATE in Algorithm 3, where estimates ˆθ(x) are made given a forest F. Algorithm 3 GRF-FPT: Estimates of θ (x) function ESTIMATE Input: forest F, test observation x X weights α GETWEIGHTS(F, x) Output: ˆθ(x), the solution to the weighted estimating equation (5) using weights α end function function GETWEIGHTS Input: forest F, test observation x vector of weights α ZEROS(n) Initialize weights; n = |S| used to train F. for indices i : Xi Lb(x) do α[i] += 1/|Lb(x)| end for Output: local weights α α/|F| Weights (4). end function Stage II of the GRF-FPT algorithm. The procedure ESTIMATE returns an estimate of θ (x) given a forest F trained under Stage I and a test observation x; see Algorithm 2. the different settings used to generate the K-dimensional regressors Wi = (Wi,1, . . . , Wi,K) . For VCM experiments, Wi,k N(0, 1) for all k = 1, . . . , K. For HTE experiments, Wi | Xi = x Multinomial(1, (π1(x), . . . , πK(x))), where πk(x) denotes the underlying probability the sample is observed as having treatment level k {1, . . . , K}. Random function generator. The effect functions θ k(x) = RFG(x) under VCM Setting 4 (in Table 2) and HTE Setting 5 (in Table 3) follow the random function generator design of Friedman (2001). The idea is to measure the performance of the estimator under a variety of randomly generated targets. 
Each θ k( ) is randomly generated as a linear combination of functions {gℓ( )}20 ℓof the form ℓ=1 aℓgℓ(zℓ), where the coefficients {aℓ}20 ℓ=1 are randomly generated from a uniform distribution aℓ U([ 1, 1]). Each gl(zl) is a function of a randomly selected pℓ-size subset of the p-dimensional variable x, where the size of each subset pℓis randomly chosen by pℓ= min( 1.5 + rℓ , p), and rℓis generated from an exponential distribution with mean 2, rℓ Exp(0.5). Each g(zℓ) uses a pℓ-sized random subset zℓ Rpℓof the p-dimensional input x Rp: zℓ:= xϕℓ(1), . . . , xϕℓ(pℓ) Rpℓ, Generalized Random Forests using Fixed-Point Trees Parameter Values K 4; 16; 64; 256 n 10,000; 20,000; 100,000 dim(X) 5 n Trees 100 Parameter Values K 4; 16; n 1000; 4000 dim(X) 2 n Trees 100; 500 Table 1: Parameter values for VCM and HTE experiments in Section 6. Target/regressor dimension K, number of observations n, dimension of the auxiliary variables dim(X), and number of trees n Trees. Experiments include a large-n setting (left table) and a small-n setting (right table). VCM Setting Effect function θ k(x) Wi,k 1 θ k(x) = βk1x1, βk1 N(0, 1) N(0, 1) 2 θ k(x) = ς(βk1x1)ς(βk2x2), βk1, βk2 N(0, 1) N(0, 1) 3 θ k(x) = ς(β k x), for βk Np(0, I) N(0, 1) 4 θ k(x) = RFG(x) N(0, 1) Table 2: Settings for the true effects θ k( ) and the regressors Wi,k for VCM experiments in Section 6. The function ς(u) := 1 + (1 + e 20(u 1/3)) 1 is a logistic-type function in (Athey et al., 2019). The random function generator RFG(x) is described in Appendix C.4. HTE Setting Treatment effect θ k(x) Treatment probability πk(x) for Wi,k 1 θ k(x) = βk1x1, βk1 N(0, 1) πk(x) = 1/K for all k. 2 θ k(x) = βk1x1, βk1 N(0, 1) πk(x) = ( x1 k = 1, 1 K 1(1 x1) k = 2, . . . , K 3 θ k(x) = ς(βk1x1)ς(βk2x2) πk(x) = 1/K for all k. for βk1, βk2 N(0, 1) 4 θ k(x) = ς(β k x), for βk Np(0, I) πk(x) = ( x1 k = 1, 1 K 1(1 x1) k = 2, . . . , K. 5 θ k(x) = RFG(x) πk(x) = exp{γ k x} PK j=1 exp{γ j x} for γk Np(0, I). Table 3: Settings for the underlying treatment effects θ k( ) and treatment probabilities πk(x) for HTE experiments in Section 6. The function ς(u) := 1+(1+e 20(u 1/3)) 1 is a logistic-type function used in (Athey et al., 2019). The random function generator RFG(x) is described in Appendix C.4. such that {ϕℓ(1), . . . , ϕℓ(pℓ)} is a length-pℓpermutation of indices drawn from {1, . . . , p}, without replacement. The functions gℓ( ) are Gaussian functions of the pℓsampled variables: gℓ(zℓ) := exp 1 2(zℓ µℓ) Vℓ(zℓ µℓ) , where the mean vector µℓ Rpℓis randomly generated from a standard multivariate Gaussian, µℓ Npℓ(0, I). The pl pl covariance matrix Vl are formed through the spectral decomposition: Vℓ= UℓDℓU ℓ, where Uℓis a random pℓ pℓorthonormal matrix and Dℓ:= diag(d1,ℓ, . . . , dpℓ,ℓ) with diagonal entries dj,ℓgenerated from a uniform distribution according to p dj,ℓ U(0.1, 2.0). D. Additional Simulations D.1. Settings for the criterion value experiment in Section 3.4 The criterion value experiment in Section 3.4 was run under a varying coefficient model of the form Yi := W i θ (Xi) + ϵi, ϵi N(0, 0.52), (64) Generalized Random Forests using Fixed-Point Trees where the regressors Wi were generated as bivariate standard Gaussian samples Wi N2(0, I) and the auxiliary covariates were generated as standard uniform samples Xi U(0, 1). The data-generating coefficient functions were θ (x) := (sin(2πx), x) and the criterion values were computed based on n = 1000 samples following (64). D.2. Supporting experiments for Section 6 Multicollinearity in auxiliary covariates. 
We conducted a VCM experiment with highly correlated auxiliary covariate features. We ran a modified version of VCM Setting 3, generating auxiliary covariates as X_i ~ N(0, Σ) with [Σ]_{j,k} = ω^{|j−k|} for ω ∈ {0, 0.5, 0.9}; a sketch of this covariate-generating step is given after Table 6 below. Table 4 summarizes the computational performance of GRF-FPT relative to GRF-grad together with statistical accuracy (MSE multiplied by 100 for readability). All experiments were run over a forest of 10 trees using GRF-FPT2 and GRF-grad, and MSE estimates were computed over 50 replications of the model and evaluated on a separate set of n = 5,000 samples. These results show that GRF-FPT remains robust, stable, and computationally efficient even under high multicollinearity in X_i.

dim(X)   n        K    ω      Speedup   100 × MSE grad   100 × MSE FPT2
5        10,000   64   0.00   2.55      16.60            16.83
5        10,000   64   0.50   2.53      15.48            15.47
5        10,000   64   0.90   2.35      10.95            11.09

Table 4: Effect of multicollinearity in the auxiliary covariates X_i on the relative computational gain of GRF-FPT2, as well as the statistical accuracy of both the GRF-FPT and GRF-grad estimators.

Subsampling ratio. We carried out an experiment to show that the subsample proportion does not affect the computational advantage or statistical accuracy of GRF-FPT relative to GRF-grad. We varied the subsampling ratio s/n ∈ {0.25, 0.50, 0.75} under VCM Setting 3 over a forest of 10 trees using GRF-FPT2 and GRF-grad. Table 5 summarizes the results, averaged over 50 replications of the model, with a test set of 5,000 samples. The statistical accuracy (MSE) of GRF-FPT2 relative to GRF-grad does not depend strongly on the subsample ratio.

dim(X)   n        K    s/n    Speedup   100 × MSE grad   100 × MSE FPT2
2        10,000   64   0.25   2.77      2.86             2.90
2        10,000   64   0.50   3.10      2.91             2.90
2        10,000   64   0.75   2.98      3.21             3.19

Table 5: Effect of the subsampling ratio s/n on the relative computational gain of GRF-FPT2, as well as the statistical accuracy of both the GRF-FPT and GRF-grad estimators.

Large sample size. We ran additional experiments to show how our method scales to very large datasets. Using a forest of 10 trees, we tested our method on VCM Setting 3 with sample sizes up to n = 500,000, using GRF-FPT2 and GRF-grad. The results are summarized in Table 6 and demonstrate that, even as the dataset grows very large, our method consistently remains faster than GRF-grad. While the relative speedup decreases slightly at first, it stabilizes towards a consistent advantage as n grows sufficiently large, suggesting that the advantage is not bottlenecked by n and persists at scale.

dim(X)   K     n         Speedup
2        256   10,000    4.54
2        256   20,000    3.59
2        256   50,000    3.49
2        256   100,000   3.11
2        256   200,000   3.04
2        256   500,000   3.08

Table 6: Effect of increasing sample size n on the relative computational gain of GRF-FPT2.
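The covariate-generating step in the multicollinearity experiment above can be sketched in a few lines of R; this is an illustrative sketch, not part of grf, and the helper name make_correlated_X is ours.

```r
# Draw auxiliary covariates X_i ~ N(0, Sigma) with [Sigma]_{j,k} = omega^|j - k|,
# as in the multicollinearity experiment above (illustrative sketch only).
make_correlated_X <- function(n, p, omega) {
  Sigma <- omega^abs(outer(1:p, 1:p, "-"))  # AR(1)-type correlation matrix (omega = 0 gives the identity)
  Z <- matrix(rnorm(n * p), nrow = n, ncol = p)
  Z %*% chol(Sigma)                         # rows are draws from N(0, Sigma)
}

set.seed(1)
X <- make_correlated_X(n = 10000, p = 5, omega = 0.9)
round(cor(X), 2)  # empirical correlations decay approximately as omega^|j - k|
```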
Figure 5: Absolute fit times for VCM timing experiments under the settings in Table 2 and large-n settings in Table 1.

D.3. Supporting figures for Section 6

D.3.1. VCM EXPERIMENTS

Large n VCM simulations. Figure 5 illustrates the absolute fit times for the GRF-FPT algorithms under the four VCM settings for θ*_k(x) described in Table 2 over the large-n settings in Table 1. Across all settings and all dimensions, GRF-FPT is consistently several factors faster than GRF-grad. The speedup factor is summarized in Figure 3, which illustrates the relative speedup of GRF-FPT, calculated as the ratio of GRF-grad fit times over GRF-FPT fit times. Consistent with the observations in Section 5, we find that the speed advantage of GRF-FPT increases as the dimension of the target increases. Figure 6 shows that this speed advantage comes while performing comparably to GRF-grad in terms of statistical accuracy. Across all settings for VCMs with K = 4 dimensional targets, the MSE estimates from GRF-FPT are highly similar to those of GRF-grad, while for K = 256 dimensional targets one sees more variation in MSE estimates across the methods. This effect likely reflects the increased variance associated with high-dimensional estimation. In some cases GRF-FPT1 slightly outperforms both GRF-FPT2 and GRF-grad, in other cases GRF-grad slightly outperforms both GRF-FPT methods, and in others GRF-FPT2 yields the lowest MSE. These differences are typically small. The key benefit we emphasize is that GRF-FPT achieves nearly identical statistical accuracy with a substantial improvement in computational speed.

Small n VCM simulations. Figures 8 and 7 illustrate the absolute fit times and relative speed advantage, respectively, of GRF-FPT under the VCM design of θ*_k(x) over the small-n settings. Even when n is more modest, GRF-FPT consistently offers a computational advantage over GRF-grad, with possible outliers under VCM Setting 2 at K = 4. We believe this negative relative advantage is caused by random fluctuations in computation and is not representative of the FPT algorithm itself, particularly since the negative effect vanishes when the number of trees increases from 100 to 500. As one would expect based on the large-n results, the relative advantage tends to increase with increasing K, and generally stabilizes with increasing n. Figure 9 shows that the GRF-FPT speed advantage does not come at any material cost in statistical accuracy, with similar performance to GRF-grad across all settings.
Figure 6: Estimates of MSE E[∥θ*(X) − θ̂(X)∥_2^2 / K] for VCM with K = 256 dimensional (top) and K = 4 dimensional (bottom) targets under the large-n settings in Table 1.

Figure 7: Speedup factor for GRF-FPT in comparison to GRF-grad for VCM timing experiments under the small-n settings in Table 1.

D.4. HTE experiments

Large n HTE simulations. Figure 11 illustrates the absolute fit times for the GRF-FPT algorithm under the five HTE settings of θ*_k(x) and π_k(x) described in Table 3 over the large-n settings in Table 1. We find that GRF-FPT is consistently faster than GRF-grad. The speedup factor of GRF-FPT relative to GRF-grad is summarized in Figure 10, calculated as the ratio of GRF-grad fit times over GRF-FPT fit times. As was seen for the VCM experiments, the speed advantage of GRF-FPT scales with the dimensionality K of the target. One sees from both Figures 10 and 11 that GRF-FPT's computational advantage is less dramatic than under the VCM experiments. This can be understood from the fact that the VCM regressors W_i are continuous while the HTE regressors are binary indicators. Continuous regressors provide more granularity when fitting the child statistics θ_Cj, and as a result provide a larger set of candidate splits over the covariates. Nevertheless, one sees in Figure 10 that the FPT splitting mechanism is still up to 1.5× faster under the largest regressor setting K = 256, with a more modest but persistent savings across all settings. The statistical benchmarks for our HTE experiments are shown in Figure 12. Consistent with the VCM experiments, the computational advantage of GRF-FPT does not come at the cost of statistical accuracy.

Small n HTE simulations.
Figures 13 and 14 summarize the relative speed advantage and absolute fit times for the GRF-FPT algorithm under the small-n HTE design. Consistent with the large-n HTE experiments, the FPT2 mechanism shows a stable computational advantage across all settings, with an effect that increases with K, while the FPT1 mechanism displays a persistent advantage for K = 16 and comparable computational performance for K = 4. The more modest relative advantage in the small-n experiments is itself consistent with the VCM small-n experiments, owing in large part to the smaller values of K. Figure 15 compares the statistical performance of GRF-FPT to GRF-grad, with no material difference in estimation accuracy between GRF-FPT1, GRF-FPT2, and GRF-grad.

Figure 8: Absolute fit times for VCM timing experiments under the settings in Table 2 and small-n settings in Table 1.

Figure 9: Estimates of MSE E[∥θ*(X) − θ̂(X)∥_2^2 / K] for VCM with K = 16 dimensional (top) and K = 4 dimensional (bottom) targets under the small-n settings in Table 1.
Figure 10: Speedup factor for GRF-FPT in comparison to GRF-grad for HTE timing experiments under the large-n setting in Table 1.

E. Additional Examples

E.1. Pseudo-outcomes for nonparametric regression

Consider the task of estimating the conditional mean function θ*(x) := E[Y | X = x]. The target θ*(x) is identified by a moment condition of the form (1) with scoring function ψ_θ(Y_i) := Y_i − θ, the residual associated with using θ as the local estimate for the i-th sample. The local solution θ̂_P over P is the mean observed response over the parent, θ̂_P = Ȳ_P := (1/n_P) Σ_{i : X_i ∈ P} Y_i. The fixed-point pseudo-outcomes are simply the (negative) residuals that result from fitting (6) over P: ρ_i^FPT = −(Y_i − θ̂_P) = −(Y_i − Ȳ_P). The gradient of the score function is ∂_θ ψ_θ(y) = −1, and hence A_P = −1. Therefore, up to a constant factor, the gradient-based pseudo-outcomes ρ_i^grad for conditional mean estimation reduce to their fixed-point counterparts ρ_i^FPT: ρ_i^grad = A_P^{-1} ψ_{θ̂_P}(Y_i) = −(Y_i − Ȳ_P) = ρ_i^FPT. In this special case, we recover the conventional splitting algorithm used for univariate responses (Breiman et al., 1984; Breiman, 2001) or for multivariate responses (De'ath, 2002; Segal, 1992). Trees grown using ρ_i^FPT, ρ_i^grad, or Y_i will be identical to one another because CART splits are scale and translation invariant with respect to the response. More generally, for targets θ*(x) beyond the conditional mean, the form of ρ_i^grad will be equivalent to ρ_i^FPT whenever the target function θ* : X → Θ is a map from the input space X to a one-dimensional parameter space Θ.
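To make this special case concrete, the following self-contained R sketch (ours, not the grf implementation) builds the centered-response pseudo-outcomes for conditional-mean estimation and checks that the CART criterion (39) selects the same split point whether it is evaluated on the raw responses Y_i, on ρ_i^FPT, or on ρ_i^grad; the sign and scale of the pseudo-outcomes do not affect which split maximizes the criterion.

```r
# Illustrative check that CART splits on Y, rho_FPT, and rho_grad coincide
# for conditional-mean estimation (single parent node, one splitting covariate).
set.seed(1)
n <- 200
x <- runif(n)                              # splitting covariate X_i
y <- sin(2 * pi * x) + rnorm(n, sd = 0.5)  # responses Y_i

rho_fpt  <- -(y - mean(y))  # fixed-point pseudo-outcomes: (negative) parent residuals
rho_grad <-   y - mean(y)   # gradient pseudo-outcomes, up to the sign of A_P

# CART criterion (39): n_C1 * mean(resp in C1)^2 + n_C2 * mean(resp in C2)^2
crit <- function(split, resp) {
  left <- x <= split
  sum(left) * mean(resp[left])^2 + sum(!left) * mean(resp[!left])^2
}
splits <- sort(x)[10:(n - 10)]  # candidate cut points away from the edges
best <- function(resp) splits[which.max(sapply(splits, crit, resp = resp))]

c(best(y), best(rho_fpt), best(rho_grad))  # the same optimal split under all three
```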
Figure 11: Absolute fit times for HTE timing experiments under the settings in Table 3 and large-n settings in Table 1.

F. Real Data Comparison: California Housing

Data. The California housing data appeared in Kelley Pace & Barry (1997) and can be obtained directly from the Carnegie Mellon StatLib repository (https://lib.stat.cmu.edu/datasets/houses.zip). The data include 20,640 observations, where each observation corresponds to measurements over an individual census block group in California taken from the 1990 census. A census block group is the smallest geographical area for which the U.S. Census Bureau publishes sample data, typically with a population between 600 and 3,000 people per block group. Each observation from the California housing data set contains measurements of 9 variables: median housing value (dollars), longitude, latitude, median housing age (years), total rooms (count, aggregated over the census block), total bedrooms (count, aggregated over the census block), population (count), households (count), and median income (dollars).

Model. We consider a varying coefficient model of the form
Y_i = ν*(X_i) + θ*_1(X_i) W_i,1 + ··· + θ*_6(X_i) W_i,6 + ε_i,   (65)
where we suppose that the effects are local to the spatial coordinates X_i := (latitude_i, longitude_i), Y_i denotes the log median housing value of the census block, and the primary regressors W_i = (W_i,1, . . . , W_i,6) passed to the model were median housing age, log(total rooms), log(total bedrooms), log(population), log(households), and log(median income). Here, each θ*_k(x) denotes the geographically-varying effect of the corresponding regressor W_i,k, for k = 1, . . . , 6. The empirical distribution of the transformed regressors passed to each of the GRF models is shown in Figure 17.
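The preprocessing implied by the model above can be sketched in a few lines of R; the file name and column names are placeholders standing in for however the raw StatLib file is read, not the exact names used in our scripts.

```r
# Illustrative preprocessing for the varying coefficient model (65);
# "cal_housing.csv" and the column names below are hypothetical placeholders.
housing <- read.csv("cal_housing.csv")

X <- as.matrix(housing[, c("latitude", "longitude")])  # auxiliary covariates X_i (spatial coordinates)
Y <- log(housing$median_house_value)                   # response: log median housing value
W <- cbind(                                            # regressors W_i = (W_i1, ..., W_i6)
  median_housing_age = housing$housing_median_age,
  log_rooms          = log(housing$total_rooms),
  log_bedrooms       = log(housing$total_bedrooms),
  log_population     = log(housing$population),
  log_households     = log(housing$households),
  log_median_income  = log(housing$median_income)
)
```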
Figure 12: Estimates of MSE E[∥θ*(X) − θ̂(X)∥_2^2 / K] for HTE with K = 256 dimensional (top) and K = 4 dimensional (bottom) targets under the large-n settings in Table 1.

Algorithms. We target GRF estimates θ̂(x) = (θ̂_1(x), . . . , θ̂_6(x)) of θ*(x) = (θ*_1(x), . . . , θ*_6(x)) based on the GRF-FPT1 and GRF-FPT2 algorithms described in Section 6, and compare those to GRF-grad. All forests were fit using the grf::lm_forest function, which trains the Stage I forest and optionally solves for the Stage II estimates θ̂(x) for varying coefficient models (65); a sketch of this fitting step is given after the Results paragraph below. All versions fit a forest of 2000 trees (the default settings of the original R implementation; Tibshirani et al., 2024) with a subsample ratio of 0.5 and a target minimum node size of 5 observations.

Algorithm    Training time (sec.)   Speedup factor
GRF-grad     19.1
GRF-FPT1     15.4                   1.24
GRF-FPT2     12.6                   1.52

Table 7: Fit times to train a forest of 2000 trees on the California housing data.

Figure 13: Speedup factor for GRF-FPT in comparison to GRF-grad for HTE timing experiments under the small-n settings in Table 1.

Results. Table 7 summarizes the computational benefit of GRF-FPT applied to the California housing data. Figure 4 illustrates the local estimates θ̂(x) made by GRF-FPT2, while Figure 16 illustrates the fits under GRF-FPT1 and GRF-grad.
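A minimal sketch of the fitting step described in the Algorithms paragraph, using the X, Y, and W constructed above. grf::lm_forest is the varying coefficient interface referenced there; the method argument selecting the FPT criteria exists only in our fork of grf, so it is shown as a comment, and the exact argument names and values should be checked against the installed version.

```r
library(grf)  # or the fork at https://github.com/dfleis/grf for the FPT criteria

fit <- lm_forest(X, Y, W,
                 num.trees       = 2000,  # forest size reported in Table 7
                 sample.fraction = 0.5,   # subsample ratio
                 min.node.size   = 5)     # target minimum node size
# In the fork: pass method = "FPT1" or method = "FPT2" to select fixed-point splitting.

theta_hat <- predict(fit)$predictions  # Stage II estimates of theta*(x) at the training points
dim(theta_hat)                         # one coefficient estimate per regressor and observation
```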
Figure 14: Absolute fit times for HTE timing experiments under the settings in Table 3 and small-n settings in Table 1.

Figure 15: Estimates of MSE E[∥θ*(X) − θ̂(X)∥_2^2 / K] for HTE with K = 16 dimensional (top) and K = 4 dimensional (bottom) targets under the small-n settings in Table 1.

Figure 16: Geographically-varying local estimates θ̂(x) = (θ̂_1(x), . . . , θ̂_6(x)), fit under GRF-FPT1 (top) and GRF-grad (bottom). Results for GRF-FPT2 are presented in Figure 4 found in Section 7.
Figure 17: Empirical distribution of the regressors from the California housing data passed to GRF.