# LinCDE: Conditional Density Estimation via Lindsey's Method

Journal of Machine Learning Research 23 (2022) 1-55. Submitted 7/21; Revised 11/21; Published 2/22.

Zijun Gao (zijungao@stanford.edu), Department of Statistics, Stanford University, Stanford, CA 94305, USA

Trevor Hastie (hastie@stanford.edu), Department of Statistics and Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA

Editor: Victor Chernozhukov

(c) 2022 Zijun Gao and Trevor Hastie. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/21-0840.html.

Abstract

Conditional density estimation is a fundamental problem in statistics, with scientific and practical applications in biology, economics, finance and environmental studies, to name a few. In this paper, we propose a conditional density estimator based on gradient boosting and Lindsey's method (LinCDE). LinCDE admits flexible modeling of the density family and can capture distributional characteristics like modality and shape. In particular, when suitably parametrized, LinCDE will produce smooth and non-negative density estimates. Furthermore, like boosted regression trees, LinCDE does automatic feature selection. We demonstrate LinCDE's efficacy through extensive simulations and three real data examples.

Keywords: Conditional Density Estimation, Gradient Boosting, Lindsey's Method

1. Introduction

In statistics, a fundamental problem is characterizing how a response depends on a set of covariates. Numerous methods have been developed for estimating the mean response conditional on the covariates, the so-called regression problem. However, the conditional mean may not always be sufficient in practice, and various distributional characteristics or even the full conditional distribution are called for, such as the mean-variance analysis of portfolios (Markowitz, 1959), the bimodality of gene expression distributions (De Santis et al., 2014; Moody et al., 2019), and the peak patterns of galaxy redshift densities (Ball et al., 2008). Conditional distributions can be used for constructing prediction intervals, downstream analysis, visualization, and interpretation (Arnold et al., 1999). Therefore, it is worthwhile to take a step forward from the conditional mean to the conditional distribution.

There are several difficulties in estimating conditional distributions. First, distribution estimation is more complicated than mean estimation, regardless of the conditioning. Second, as with conditional mean estimation, conditioning on a potentially large number of covariates suffers from the curse of dimensionality. When only a small subset of the covariates are relevant, proper variable selection is necessary to mitigate overfitting, reduce computational burden, and identify covariates that may be of interest to practitioners.

In this paper, we develop a tree-boosted conditional density estimator based on Lindsey's method, which we call LinCDE (pronounced "linseed") boosting. LinCDE boosting is built on the base learner LinCDE tree. A LinCDE tree partitions the covariate space into subregions with homogeneous conditional distributions, estimates a local unconditional density in each subregion, and aggregates the unconditional densities to form the final conditional estimator. LinCDE boosting combines LinCDE trees to form a strong ensemble learner.
LinCDE boosting possesses several desirable properties. It provides a flexible modeling paradigm and is capable of capturing distributional properties such as heteroscedasticity and multimodality. LinCDE boosting also inherits the advantages of tree and boosting methods; in particular, it is able to detect influential covariates. Furthermore, the conditional density estimates are automatically non-negative and smooth, and other useful statistics such as conditional quantiles or conditional cumulative distribution functions (CDFs) can be obtained in a straightforward way from the LinCDE boosting estimates.

The organization of the paper is as follows. In Section 2, we formulate the problem and discuss related work. We develop LinCDE in three steps: in Section 3, we describe Lindsey's method for (marginal) density estimation; in Section 4, we introduce LinCDE trees for conditional density estimation, which combine Lindsey's method with recursive partitioning; in Section 5, we develop a boosted ensemble model using LinCDE trees. In Section 6, we discuss two optional but helpful preprocessing steps: response transformation and conditional mean centering. In Section 7, we evaluate the performance of LinCDE boosting on simulated data sets. In Section 8, we apply LinCDE boosting to three real data sets. We conclude the paper with discussions in Section 9 and provide links to the R software and instructions in Section 10. All proofs are deferred to the appendix.

2. Background

In this section, we formulate the conditional density estimation problem and discuss related work.

2.1 Problem Formulation

Let y in R be a continuous response.1 Let x be a d-dimensional covariate vector and x^{(j)} be its j-th coordinate. We assume the covariates are generated from an unknown underlying distribution f_x(x), and the response given the covariates is sampled from an unknown conditional density f_{y|x}(y | x). The model is summarized as

    x_i \overset{\text{i.i.d.}}{\sim} f_x, \qquad y_i \mid x_i \overset{\text{ind.}}{\sim} f_{y|x}.   (1)

1. The paper focuses on univariate responses; the generalization to multivariate responses is straightforward and discussed in Section 9.

We observe n data pairs {(x_i, y_i)} and aim to estimate the conditional density f_{y|x}(y | x).

2.2 Literature

In general, there are three ways to characterize a conditional distribution: the conditional density, the conditional quantiles, and the conditional CDF. We categorize related work on conditional density, quantile, and CDF estimation according to the methodology and provide a review below. We conclude by discussing two desired properties of conditional density estimation and how our work complements the existing literature on these two points.

One line of study estimates the conditional distribution by localizing unconditional distribution estimators. Localization methods weight observations according to the distances between their covariates and those at the target point, and solve the unconditional distribution estimation problem based on the weighted sample. For conditional density, Fan et al. (1996) obtain conditional density estimates by local polynomial regression. For conditional quantiles, Chaudhuri (1991a,b) partitions the covariate space into bins and fits a quantile model in each bin separately, and Yu and Jones (1998) tackle conditional quantile estimation via local quantile loss minimization. For the conditional CDF, Stone (1977) proposes a weighted sum of indicator functions, and Hall et al. (1999) consider a local logistic regression and a locally adjusted Nadaraya-Watson estimator.
Localization methods enable systematic extensions of any unconditional estimator. Nevertheless, the weights usually treat covariates as equally important, and variable selection is typically not accommodated. This leaves the methods vulnerable to the curse of dimensionality.

As discussed by Stone (1991a,b), another approach making use of unconditional methods first obtains the joint distribution estimate \hat f_{y,x}(y, x) and the covariate distribution estimate \hat f_x(x), and then uses

    \hat f_{y|x}(y \mid x) = \hat f_{y,x}(y, x) / \hat f_x(x)   (2)

to derive the conditional density. Nevertheless, joint distribution estimation is at least as challenging. Arnold et al. (1999) point out that, except for special cases like the multivariate Gaussian, the estimation of a bivariate joint distribution in a certain exponential form is onerous due to the normalizing constant. Moreover, the approach is inefficient both statistically and computationally if the conditional distribution is comparatively simpler than the joint and covariate distributions, for example when the response is independent of the covariates.

A different thread directly models the dependence of the response's distribution on the covariates by a linear combination of a finite or infinite number of bases. For conditional density, Kooperberg and Stone (1991, 1992), Stone (1991b), Mâsse and Truong (1999), and Barron and Sheu (1991) study the conditional logspline density model: modeling log(f_{y|x}(y | x)) by tensor products of splines or trigonometric series and maximizing the conditional log-likelihood to estimate the parameters. The method is also known as entropy maximization subject to empirical constraints. In addition, Sugiyama et al. (2010) model the conditional density in reproducing kernel Hilbert spaces and estimate the loadings by unconstrained least-squares importance fitting, and Izbicki and Lee (2016) expand the conditional density in the eigenfunctions of a kernel-based operator which adapts to the intrinsic dimension of the covariates.2 For conditional quantiles, Koenker and Bassett Jr (1978) formulate conditional quantiles as linear functions of covariates and minimize quantile losses to estimate parameters. Koenker et al. (1994) explore quantile smoothing splines for a single covariate, and He et al. (1998) extend the approach to the bivariate setting, i.e., two covariates. Li et al. (2007) propose kernel quantile regression (KQR), considering quantile regression in reproducing kernel Hilbert spaces. Belloni et al. (2019) approximate the conditional quantile function by a growing number of bases as the sample size increases. For the conditional CDF, Foresi and Peracchi (1995) and Chernozhukov et al. (2013) consider distribution regression: estimating a sequence of conditional logit models over a grid of values of the response variable. Belloni et al. (2019) further extend the method to the high-dimensional sparse-model setting. The performance of parametric models depends on the selected bases or kernels. Covariate-specific bases or kernels require a lot of tuning, and the bases or kernels treat covariates equally, making the approaches less powerful in the presence of many nuisance covariates.

2. We remark that the approach of Izbicki and Lee (2016) adapts to the low intrinsic dimensionality of the covariates but not to sparse dependency of the response on the covariates.

More recently, tree-based estimators arose in conditional distribution estimation. The overall idea is to partition the covariate space recursively and fit an unconditional model at each terminal node.
For conditional quantiles, Chaudhuri and Loh (2002) investigate a tree-structured quantile regression. Nevertheless, the estimation of different quantiles requires separate quantile loss minimizations, which complicates the computation of the full conditional distribution. Meinshausen (2006) proposes the quantile regression forest (QRF), which computes all quantiles simultaneously. QRF first builds a standard random forest, then estimates the conditional CDF by a weighted distribution of the observed responses, and finally inverts the CDF to quantiles. For the conditional CDF, Friedman (2019) proposes distribution boosting (DB). DB relies on Friedman's contrast trees, a method to detect the lack-of-fit regions of any conditional distribution estimator. DB estimates the conditional distribution by iteratively transforming the conditional distribution estimator and correcting the errors uncovered by contrast trees.

Beyond trees, neural network-based conditional distribution estimators have also been developed. For conditional quantiles, standard neural networks with quantile losses serve the goal. For conditional density, Bishop (1994, 2006) introduces mixture density networks (MDNs), which model the conditional density as a mixture of Gaussian distributions. Neural network-based methods can theoretically approximate any conditional distribution well but are computationally heavy and lack interpretation.

In this paper, we propose LinCDE boosting, which complements existing works by producing smooth density curves and performing feature selection. Compared to quantiles and the CDF, smooth density curves are more suitable for the following reasons.

- Smooth density curves can give valuable indications of distributional characteristics such as skewness and multimodality.
- Visualization of smooth density curves is comprehensible to practitioners and can yield self-evident conclusions or point the way to further analysis.
- Conditional densities can be used to compute class-posterior probabilities using the Bayes rule. The class-posterior probabilities can be further used for classification.
- Densities can be used to detect outliers: if an observation lies in a very low-density region, the data point is likely to be an outlier.

LinCDE boosting generates smooth density curves, while directly transforming estimates of conditional quantiles or the CDF to conditional density estimates usually produces bumpy results. LinCDE trees and LinCDE boosting calculate feature importances and automatically focus on influential features in estimation. In contrast, methods like localization and modeling conditional densities in a linear space or RKHS often treat covariates equally, and covariate-specific neighborhoods, bases, and kernels may require a lot of tuning. Therefore, LinCDE boosting is more scalable to a large number of features and less sensitive to the presence of nuisance covariates.

3. Lindsey's Method

In this section, we first introduce the density estimation problem, an intermediate step towards conditional density estimation. We then discuss how to solve the density estimation problem by Lindsey's method (Lindsey, 1974), a stepping stone of LinCDE.
Lindsey's method cleverly avoids the normalizing issue by discretization and solves the problem by fitting a simple Poisson regression. It can be thought of as a method for fitting a smooth histogram with a large number of bins.

We consider the density family

    f(y) = \kappa(y) e^{g(y)},   (3)

where \kappa: R \to R is some carrying density, and g(y) is known as a tilting function. The idea is that \kappa(y) is known or assumed (such as Gaussian or uniform), and g(y) is represented by a model. We represent g as a linear expansion

    g(y) = z(y)^\top \beta + \beta_0,   (4)

where z(y) is a basis of k smooth functions. As a simple example, if we use the standard Gaussian as the carrying measure and choose z(y) = (y, y^2), the resulting density family corresponds to all possible Gaussian distributions. More generally, we use a basis of natural cubic splines in z(y) with knots spread over the domain of y to achieve a flexible representation (Wahba, 1975, 1990).

Our goal is to find the density that maximizes the log-likelihood

    \max_{\beta, \beta_0} \; \frac{1}{n} \sum_{i=1}^n \left[ \log(\kappa(y_i)) + z(y_i)^\top \beta + \beta_0 \right], \quad \text{s.t.} \; \int \kappa(y) e^{z(y)^\top \beta + \beta_0} \, dy = 1.   (5)

The constrained optimization problem (5) can be simplified to the unconstrained counterpart below by the method of Lagrange multipliers (Silverman, 1986),

    \frac{1}{n} \sum_{i=1}^n \left[ \log(\kappa(y_i)) + z(y_i)^\top \beta + \beta_0 \right] - \int \kappa(y) e^{z(y)^\top \beta + \beta_0} \, dy,   (6)

where the multiplier turns out to be one. The optimization problem (6) is difficult since the integral \int \kappa(y) e^{z(y)^\top \beta + \beta_0} dy is generally unavailable in closed form.

One way to avoid the integral is by discretization, the key idea underlying Lindsey's method. One divides the response range into B equal bins of width \Delta with mid-points y_b. The integral is approximated by the finite sum

    \int \kappa(y) e^{z(y)^\top \beta + \beta_0} \, dy \approx \Delta \sum_{b=1}^B \kappa(y_b) e^{z(y_b)^\top \beta + \beta_0}.

As for the first part of (6), one replaces y_i by its bin mid-point y_{b(i)} and groups the observations,

    \sum_{i=1}^n \left[ \log(\kappa(y_i)) + z(y_i)^\top \beta + \beta_0 \right] \approx \sum_{b=1}^B n_b \left[ \log(\kappa(y_b)) + z(y_b)^\top \beta + \beta_0 \right],

where b(i) denotes the bin that the i-th response falls in, and n_b represents the number of samples in bin b. Combining the above two parts, the Lagrangian function with response discretization takes the form

    \frac{1}{n} \sum_{b=1}^B n_b \left[ \log(\kappa(y_b)) + z(y_b)^\top \beta + \beta_0 \right] - \Delta \sum_{b=1}^B \kappa(y_b) e^{z(y_b)^\top \beta + \beta_0}.   (7)

The objective function (7) is equivalent to that of a Poisson regression with B observations {n_b}_{1 <= b <= B} and mean parameters \mu_b \propto \kappa(y_b) e^{z(y_b)^\top \beta}. Therefore, Lindsey's method estimates the coefficient \beta by fitting the Poisson regression with predictors z(y_b) and offset \log(\kappa(y_b)). The normalizing constant \beta_0 in (7) is absorbed in the Poisson regression's intercept. Despite the discretization error, Lindsey's estimates are consistent, asymptotically normal, and remarkably efficient (Moschopoulos and Staniswalis, 1994; Efron, 2004). We will demonstrate the efficacy of Lindsey's method in two examples at the end of this section.

The number of bins B balances the statistical performance and the computational complexity of Lindsey's method: as B increases, the discretized objective (7) approaches the original target (6), and the resulting estimator converges to the true likelihood maximizer; on the other hand, the computations increase linearly in B. This can become a factor later when we fit many of these Poisson models repeatedly. The relationship with a histogram becomes clear now as well. We could use the counts in the B bins to form a density estimate, but this would be very jumpy. Typically we would control this by reducing the number of bins. Lindsey's method finesses this by having B large, but controlling the smoothness of the bin means via the k << B basis functions and associated coefficients.
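To make the recipe concrete, here is a minimal numerical sketch of Lindsey's method, assuming only NumPy. The polynomial basis, the Gaussian carrier, and the plain ridge penalty are simplified stand-ins for the transformed natural cubic splines and the weighted penalty used in the paper, and the function name is ours, not the package's.

```python
import numpy as np

def lindsey_density(y, n_bins=40, k=6, lam=1e-2):
    """Sketch of Lindsey's method: bin the responses, then fit a ridge-penalized
    Poisson regression of the bin counts on a smooth basis z(y) with offset
    log(kappa(y_b)). Simplified stand-ins are used for the basis and penalty."""
    edges = np.linspace(y.min(), y.max(), n_bins + 1)
    mids = 0.5 * (edges[:-1] + edges[1:])
    counts, _ = np.histogram(y, bins=edges)
    delta = edges[1] - edges[0]

    t = (mids - y.mean()) / y.std()
    Z = np.vstack([t ** j for j in range(1, k + 1)]).T       # basis z(y_b), B x k
    kappa = np.exp(-0.5 * t ** 2) / (np.sqrt(2 * np.pi) * y.std())  # Gaussian carrier

    X = np.column_stack([np.ones(n_bins), Z])                # intercept plays beta_0
    pen = np.diag([0.0] + [lam] * k)                          # do not penalize beta_0
    offset = np.log(kappa) + np.log(len(y) * delta)

    beta = np.zeros(k + 1)
    for _ in range(100):                                      # Newton ascent
        eta = np.clip(offset + X @ beta, -30, 30)
        mu = np.exp(eta)                                      # Poisson means per bin
        grad = X.T @ (counts - mu) - 2 * pen @ beta
        hess = X.T @ (mu[:, None] * X) + 2 * pen
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.linalg.norm(step) < 1e-8:
            break

    dens = kappa * np.exp(Z @ beta[1:])                       # kappa(y) e^{z(y)'beta}
    dens /= (dens * delta).sum()                              # renormalize exactly
    return mids, dens
```

Calling `lindsey_density(y)` on a sample returns the bin mid-points and a smooth, automatically non-negative density estimate; increasing `lam` pulls the fit toward the Gaussian carrier.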
Lindsey s method finesses this by having B Lin CDE: Conditional Density Estimation via Lindsey s Method large, but controlling the smoothness of the bin means via the k B basis functions and associated coefficients. To control the model complexity further and avoid numeric instability, we add a regularization term to (6). For example, we can penalize deviations from Gaussian distributions via the regularizer (Wahba, 1977; Silverman, 1982, 1986)3 z(y) β + β0 2 dy. (8) The penalty measures the roughness of the tilting function and is zero if and only if the tilting function s exponent is a quadratic polynomial, i.e., a Gaussian distribution. We also attach a hyper-parameter λ to trade-offthe objective (6) and the penalty (8), and tune λ to achieve the best performance on validation data sets.4 It is convenient to tailor the spline basis functions to the penalty (8). Note that, for arbitrary bases z(y), the penalty (8) is a quadratic form in β z(y) β + β0 2 dy = Z z j (y)z l (y)dy =: β Ωβ, (9) where Ωjl = R z j (y)z l (y)dy. We transform our splines so that the associated Ω= diag(ω1, . . . , ωk) is diagonal and the penalty reduces to a weighted ridge penalty j=1 ωjβ2 j . (10) (details in Appendix B). Figure 1 depicts an example of the transformed spline bases (in increasing order of ωj) and the corresponding smoothed versions. Among the transformed bases, the linear and quadratic components (the first and the second bases in Figure 1) are not shrunk by the roughness penalty (Claim 1), and higher-complexity splines are more heavily penalized (Hastie et al., 2009, chap. 5, for example). Claim 1 Assume u Rk and Ωu = 0. Then z(y) u is a linear or quadratic function of y. We provide examples of Lindsey s performance in estimating bimodal and skewed distributions in Appendix G. 4. Lin CDE Trees In this section, we extend the density estimation problem to the conditional density estimation problem. We introduce Lin CDE trees combining Lindsey s method and recursive partitioning: Lin CDE trees partition the covariate space and estimate a local unconditional density via Lindsey s method in each subregion. 3. We regard exponential distribution as a special case of Gaussian distribution with σ2 = . 4. The hyper-parameter is a function of the penalized Poisson regression s degrees of freedom (see Appendix A for more details), and we tune the degrees of freedom to achieve the best performance on validation data sets. Gao and Hastie 0.0 0.4 0.8 0.0 0.4 0.8 0.0 0.4 0.8 0.0 0.4 0.8 0.0 0.4 0.8 unsmoothed smoothed Figure 1: Transformed natural cubic spline bases. The transformed basis functions before smoothing (in black) are normalized such that R (zj(y))2dy = 1 and ordered by increasing penalty factors. The damped functions (in red) represent smoothed basis functions (with 6 degrees of freedom). To begin, we restate the target density family (3) in the language of exponential families. We call z(y) the sufficient statistics and g(y) the natural parameter. We replace the normalizing constant β0 by the negative cumulant generating function ψ(β) defined as eψ(β) = Z κ(y)ez(y) βdy. As a result, the density normalizing constraint of κ(y)ez(y) β ψ(β) is automatically satisfied. In the conditional density estimation problem, we consider the target family generalized from (3) fy|x(y | x) = κ(y)ez(y) β(x) ψ(β(x)). (11) The dependence of the response on the covariates is encoded in the parameter function β(x). 
4. LinCDE Trees

In this section, we extend the density estimation problem to the conditional density estimation problem. We introduce LinCDE trees, which combine Lindsey's method and recursive partitioning: LinCDE trees partition the covariate space and estimate a local unconditional density via Lindsey's method in each subregion.

To begin, we restate the target density family (3) in the language of exponential families. We call z(y) the sufficient statistics and g(y) the natural parameter. We replace the normalizing constant \beta_0 by the negative cumulant generating function \psi(\beta) defined by

    e^{\psi(\beta)} = \int \kappa(y) e^{z(y)^\top \beta} \, dy.

As a result, the normalizing constraint of the density \kappa(y) e^{z(y)^\top \beta - \psi(\beta)} is automatically satisfied. In the conditional density estimation problem, we consider the target family generalized from (3),

    f_{y|x}(y \mid x) = \kappa(y) e^{z(y)^\top \beta(x) - \psi(\beta(x))}.   (11)

The dependence of the response on the covariates is encoded in the parameter function \beta(x). If there is only k = 1 basis function z(y) and the parameter function satisfies \beta(x) = x^\top \theta, then the conditional density family (11) reduces to a generalized linear model. If we consider a standard Gaussian carrying density, basis function z(y) = y, and a tree-structured \beta(x), then the conditional density family (11) reduces to a regression tree model. In fact, the optimization problem based on (11) is then equivalent to finding the tree \beta(x) that minimizes the sum of squares \sum_{i=1}^n (y_i - \beta(x_i))^2.

Similar to the density estimation problem, we aim to find the member of the family (11) that maximizes the conditional log-likelihood with the ridge penalty (10),

    \ell(R_0; \beta) := \frac{1}{n} \sum_{i=1}^n \left[ \log(\kappa(y_i)) + z(y_i)^\top \beta(x_i) - \psi(\beta(x_i)) - \lambda \sum_{j=1}^k \omega_j \beta_j^2(x_i) \right],   (12)

where R_0 denotes the full covariate space. If \beta(x) is constant, the problem (12) simplifies to the unconditional problem (5). The conditional density estimation problem (12) is more complicated than the unconditional version due to the covariates x:

1. Given a covariate configuration x, there is often at most one observation whose covariates take the value x, and it is infeasible to estimate the multi-dimensional natural parameter \beta(x) based on a single observation;

2. There may be a multitude of covariates, and only a few are influential. Proper variable selection or shrinkage is necessary to avoid serious overfitting.

One way to finesse these difficulties is to use trees (Breiman et al., 1984). We divide the covariate space into subregions with approximately homogeneous conditional distributions, and in each subregion, we estimate a density independent of the covariate values. We name the method "LinCDE trees". In response to the first difficulty, by conditioning on a subregion instead of a specific covariate value, we have more samples for local density estimation. In response to the second difficulty, trees perform internal feature selection, and are thus resistant to the inclusion of many irrelevant covariates. Moreover, the advantages of tree-based methods are automatically inherited, such as being tolerant of all types of covariates, computationally efficient, and easy to interpret.

Before we delve into the details, we again draw a connection between LinCDE trees and a naive binning approach: fitting a multinomial model using trees. The naive approach discretizes the response into multiple bins and predicts conditional cell probabilities through recursive partitioning. The normalized conditional cell probabilities serve as an approximation of the conditional densities, and the more bins used, the higher the resolution of the approximation. The naive approach is able to detect subregions with homogeneous multinomial distributions. However, the estimates are bumpy, especially with a large number of bins. To stabilize the method, restrictions enforcing smoothness are required, and LinCDE trees achieve this by modeling the density exponent using splines.

We now explain how LinCDE trees work. In standard tree algorithms, there are two major steps:

- Splitting: partitioning the covariate space into subregions;
- Fitting: performing estimation in each subregion.

The estimator is usually obtained by maximizing a specific objective function. For example, in a regression tree with l2 loss, the estimator is the sample average; in a classification tree with misclassification error, the estimator is the majority's label.
The fitting step is a direct application of Lindsey's method from Section 3. In a subregion R, we treat the natural parameter function as a constant vector and solve the density estimation problem via Lindsey's method. We denote the objective function value in region R with parameter \beta by \ell(R; \beta), and let \hat\beta_R := \arg\max_\beta \ell(R; \beta).

Now for the splitting step. Similar to standard regression and classification trees, we proceed with a greedy algorithm and select the candidate split that improves the objective the most. Mathematically, starting from a region R, we maximize the improvement statistic

    \Delta\ell(R, s) := \ell(R_{s,L}; \hat\beta_{R_{s,L}}) + \ell(R_{s,R}; \hat\beta_{R_{s,R}}) - \ell(R; \hat\beta_R),   (13)

where R_{s,L} and R_{s,R} are the regions on the left and right of the candidate split s, respectively.

Direct computation of the difference (13) requires running Lindsey's method twice for each candidate split s to obtain \hat\beta_{R_{s,L}} and \hat\beta_{R_{s,R}}, and the total computation time is prohibitive. Instead, we approximate the difference (13) by a simple quadratic term in Proposition 2, which can be computed much faster.

Proposition 2 (Improvement approximation for LinCDE trees) Let n_R, \bar z_R be the sample size and average sufficient statistics in a region R. Assume that \nabla^2\psi(\hat\beta_R) + 2\lambda\Omega is invertible. Then for a candidate split s,

    \frac{1}{n_R} \Delta\ell(R, s) = \frac{n_{R_{s,L}} n_{R_{s,R}}}{2 n_R^2} (\bar z_{R_{s,L}} - \bar z_{R_{s,R}})^\top \left[ \nabla^2\psi(\hat\beta_R) + 2\lambda\Omega \right]^{-1} (\bar z_{R_{s,L}} - \bar z_{R_{s,R}}) + r_s,

where the remainder term satisfies r_s = O( \| \bar z_{R_{s,L}} - \bar z_R \|_2^3 + \| \bar z_{R_{s,R}} - \bar z_R \|_2^3 ).

Proposition 2 writes the difference (13) as a quadratic form plus a higher-order residual term. If z(y) = y, the model amounts to a regression tree, and the residual term is zero. For general z(y), when the average sufficient statistics \bar z_{R_{s,L}}, \bar z_{R_{s,R}} are similar, the residual term is of smaller order than the quadratic form and can thus be dropped; when \bar z_{R_{s,L}}, \bar z_{R_{s,R}} are considerably different, the residual term is not guaranteed to be small theoretically. However, we demonstrate empirically in Appendix C that at such splits, the quadratic form is still sufficiently close to the true log-likelihood difference. Based on this empirical evidence, we use the quadratic approximation to determine the optimal splits.

The quadratic approximation suggested by Proposition 2 is the squared difference between the average sufficient statistics in R_{s,L} and R_{s,R}, normalized by \nabla^2\psi(\hat\beta_R) + 2\lambda\Omega and multiplied by the sample proportions in R_{s,L} and R_{s,R}. By selecting the candidate split that maximizes the quadratic term, we will end up with two subregions that differ in their sufficient statistics means and are reasonably balanced in sample sizes.

To compute the quadratic approximation, we need the subsample proportions n_{R_{s,L}}/n_R, n_{R_{s,R}}/n_R, the average sufficient statistics \bar z_{R_{s,L}}, \bar z_{R_{s,R}}, and the inverse of the matrix \nabla^2\psi(\hat\beta_R) + 2\lambda\Omega. For the candidate splits based on the same covariate, {n_{R_{s,L}}, n_{R_{s,R}}, \bar z_{R_{s,L}}, \bar z_{R_{s,R}}} can be computed efficiently for all split points by scanning through the samples in R once. For all candidate splits, this takes O(d n_R k) operations in total. The matrix \nabla^2\psi(\hat\beta_R) is shared by all candidate splits and needs to be computed only once. The difficulty is that \nabla^2\psi(\beta) is often unavailable in closed form. However, since \nabla^2\psi(\beta_R) is the covariance matrix of the sufficient statistics z(y) when the responses are generated from the model parameterized by \beta_R, we apply Lindsey's method to estimate \beta_R and compute the covariance matrix of the sufficient statistics based on the multinomial cell probabilities, which takes O(k^2 B). Appendix F shows that the resulting covariance matrix approximates \nabla^2\psi(\hat\beta_R) under a fine discretization.
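A sketch of this machinery is given below, assuming NumPy: `hessian_from_bins` forms the covariance of z(y) under the fitted cell probabilities, and `best_split_one_feature` scans the candidate splits along one covariate with the quadratic score of Proposition 2. Function and argument names are illustrative, not the package's interface.

```python
import numpy as np

def hessian_from_bins(Z_bins, kappa, beta, delta):
    """Approximate grad^2 psi(beta) by the covariance of z(y) under the fitted
    multinomial cell probabilities (the O(k^2 B) step described above)."""
    p = kappa * np.exp(Z_bins @ beta) * delta
    p /= p.sum()
    mean = p @ Z_bins
    centered = Z_bins - mean
    return centered.T @ (p[:, None] * centered)          # k x k covariance of z(y)

def best_split_one_feature(x_j, Z, H, Omega, lam):
    """Scan all candidate splits along one covariate and score them with the
    quadratic approximation of Proposition 2. Z holds the sufficient statistics
    z(y_i) (n x k); candidate splits are midpoints of consecutive sorted values."""
    order = np.argsort(x_j)
    Zs = Z[order]
    n = len(x_j)
    M = np.linalg.inv(H + 2 * lam * Omega)
    csum = np.cumsum(Zs, axis=0)                          # prefix sums of z
    total = csum[-1]
    best_gain, best_cut = -np.inf, None
    for nL in range(1, n):                                # nL samples go left
        nR = n - nL
        d = csum[nL - 1] / nL - (total - csum[nL - 1]) / nR
        gain = nL * nR / (2.0 * n ** 2) * d @ M @ d
        if gain > best_gain:
            best_gain, best_cut = gain, 0.5 * (x_j[order[nL - 1]] + x_j[order[nL]])
    return best_cut, best_gain
```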
The total time complexity of the above splitting procedure is summarized in Proposition 3.

Proposition 3 Assume that there are S candidate splits, k basis functions, d covariates, B discretization bins, and n_R observations in the current region. Then the splitting step for LinCDE trees is of time complexity O(d n_R k + k^2 B + k^3 + S k^2).

According to Proposition 3, the computation time based on the quadratic approximation is significantly reduced compared to running Lindsey's method in R_{s,L} and R_{s,R} for all candidate splits, which takes O(S(n_R k + k^2 B + k^3)).

Having found the best split s_max, we partition R into two subregions R_{s_max,L} and R_{s_max,R}, and repeat the splitting procedure in the two subregions. Along the recursive partitioning, the heterogeneity of the response distribution is reduced. The fitting and splitting steps of LinCDE trees are summarized below, the complete algorithm is given in Algorithm 1, and implementation details are displayed in Section 10. Stopping criteria for LinCDE trees are discussed in Appendix D.

Fitting (LinCDE tree). At a region R:
1. Count the number of observations {n_{R,b}} in each bin.
2. Fit a Poisson regression model with the response variable {n_{R,b}}, regressors z(y_b), the offset log(\kappa(y_b)), and a weighted ridge penalty.5 Denote the estimated coefficients by \hat\beta_R.

Splitting (LinCDE tree). At a region R:
1. Compute {n_{R_{s,L}}, n_{R_{s,R}}, \bar z_{R_{s,L}}, \bar z_{R_{s,R}}} for each candidate split, and approximate \nabla^2\psi(\hat\beta_R) by the covariance matrix of z(y) in R under \hat\beta_R.
2. For each candidate split s in S, compute the quadratic approximation \widehat{\Delta\ell}(R, s) from Proposition 2, and choose the split s_max = \arg\max_{s \in S} \widehat{\Delta\ell}(R, s).

5. Weights (penalty factors) of the ridge penalty are \omega = [\omega_1, \ldots, \omega_k]^\top.

Algorithm 1: LinCDE tree
Start at the full covariate space.
1. Apply Fitting (LinCDE tree) and obtain the natural parameter estimator \hat\beta.
2. Apply Splitting (LinCDE tree) and obtain the optimal split s_max.
3. Repeat steps 1 and 2 on the left and right children of s_max until the stopping rule is satisfied, e.g., the maximal tree depth is reached.
Output the natural parameter estimator \hat\beta in each subregion.

To conclude the section, we demonstrate the effectiveness of LinCDE trees in three toy examples. We generate 10 covariates randomly and uniformly on [-1, 1]. The response follows

    f_{y|x}(y \mid x) = f_1(y), \; x^{(1)} < 0.2; \quad f_2(y), \; x^{(1)} \ge 0.2, \, x^{(2)} \ge 0; \quad f_3(y), \; x^{(1)} \ge 0.2, \, x^{(2)} < 0,   (14)

with three different local densities f_l(y), 1 <= l <= 3, varying in variance, number of modes, and skewness. The response distribution is determined by the first two covariates and independent of the rest. In Figure 2, we plot the average conditional density estimates in the three subregions. LinCDE trees are able to distinguish densities differing in these characteristic properties and produce good fits. We also compute the normalized importance score, the proportion of the overall improvement in the split criterion attributed to each splitting variable. In all settings, the first two covariates contribute over 99% of the importance. In other words, LinCDE trees focus on the first two influential covariates and avoid splitting on nuisance covariates.
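Putting the two steps together, Algorithm 1 is an ordinary recursive partitioning loop. A compact skeleton is sketched below, with `fit_leaf` and `find_split` as placeholders for the Fitting and Splitting procedures above.

```python
def grow_lincde_tree(X, y, depth, max_depth, min_leaf, fit_leaf, find_split):
    """Skeleton of Algorithm 1. `fit_leaf(y)` runs Lindsey's method on the node's
    responses; `find_split(X, y, beta_hat)` scans all covariates with the
    Proposition 2 score and returns (feature, cut, gain). Both are placeholders."""
    beta_hat = fit_leaf(y)
    if depth >= max_depth or len(y) < 2 * min_leaf:
        return {"leaf": True, "beta": beta_hat}
    j, cut, gain = find_split(X, y, beta_hat)
    if cut is None or gain <= 0:
        return {"leaf": True, "beta": beta_hat}
    left = X[:, j] <= cut
    return {
        "leaf": False, "feature": j, "cut": cut,
        "left": grow_lincde_tree(X[left], y[left], depth + 1, max_depth,
                                 min_leaf, fit_leaf, find_split),
        "right": grow_lincde_tree(X[~left], y[~left], depth + 1, max_depth,
                                  min_leaf, fit_leaf, find_split),
    }
```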
5. LinCDE Boosting

Although LinCDE trees are useful as stand-alone tools, our ultimate goal is to use them as weak learners in a boosting paradigm. Standard tree boosting (Friedman, 2001) builds an additive model of shallow trees in a forward stagewise manner. Though a single shallow tree is high in bias, tree boosting manages to reduce the bias by successively making small modifications to the current estimate.6 We proceed with the boosting idea and propose LinCDE boosting. Starting from a null estimate, we iteratively modify the current estimate by updating the natural parameter functions via a LinCDE tree. In particular, at the t-th iteration,

    \gamma_t(x) = \arg\max_{\text{LinCDE tree } \gamma(x)} \ell(R_0; \beta_t(x) + \gamma(x)), \qquad \beta_{t+1}(x) \leftarrow \beta_t(x) + \gamma_t(x).   (15)

Section 5.1 gives details. We remark that the LinCDE tree modifier \gamma_t(x) for boosting is an expanded version of that in Section 4: in the previous LinCDE trees, all samples share the same carrying density \kappa(y), while in LinCDE trees for boosting, the carrying densities \kappa(y) e^{z(y)^\top \beta_t(x) - \psi(\beta_t(x))} differ across units. We elaborate on LinCDE trees with heterogeneous carrying densities in Sections 5.2 and 5.3.

6. We remark that another successful ensemble method, random forests (Breiman, 2001), is not appropriate for LinCDE trees. Random forests construct a large number of trees with low correlation and average the predictions. Deep trees are grown to ensure low-bias estimates, which is, however, unsatisfactory here because deep LinCDE trees will have leaves with too few observations for density estimation.

Before discussing the details of LinCDE boosting, we compare LinCDE boosting and LinCDE trees on a toy example in Figure 3. We consider a locally Gaussian distribution with heterogeneous mean and variance,

    y = x^{(1)} + 0.5 x^{(2)} \varepsilon, \qquad \varepsilon \sim N(0, 1).   (16)

Figure 2: LinCDE trees' conditional density estimates of heteroscedastic, multimodal and skewed distributions. The responses are generated from the model (14). The local densities are Gaussian with variances \sigma^2 in {0.25, 1, 4} (first row), Gaussian mixtures with 1, 2, 3 components (second row), and Beta distributions with skewness \kappa in {-0.8, 0, 0.8} (third row). In each trial, we sample 400 observations. We use 10 natural cubic splines, 5 degrees of freedom, and a maximal tree depth of 2. We repeat 100 times and plot the average fits against the true densities.

Figure 3: Comparison of LinCDE trees and LinCDE boosting. The responses are generated from (16). We pick 4 landmarks corresponding to different conditional means and variances. From top to bottom, the conditional means decrease from 0.5 to -0.5. From left to right, the conditional variances increase from 0.6 to 1.6. In each trial, we sample 400 observations from the target distribution. We repeat each setting 100 times, and plot the average estimated conditional densities plus and minus one standard deviation against the true densities.
The covariate generating mechanism is the same as in Figure 2. We plot the average estimated conditional densities plus and minus one standard deviation. Though both LinCDE trees and LinCDE boosting produce good fits in all settings, the estimation bands of LinCDE boosting are always narrower, by around a half. This observation implies that LinCDE boosting is more stable than LinCDE trees.

5.1 Additive Model in the Natural Parameter Scale

In LinCDE boosting, we build an additive model in the natural parameter scale of the density (11). We find a sequence of LinCDE tree-based learners with parameter functions {\gamma_t(x)}_{0 <= t <= T-1}, and aggregate those basis functions to obtain the final estimate7

    \beta_T(x) = \sum_{t=0}^{T-1} \gamma_t(x).   (17)

7. To stabilize the performance, we may shrink \gamma_t(x) by some learning rate \eta in (0, 1], and let \beta_T(x) = \sum_{t=0}^{T-1} \eta \gamma_t(x).

In other words, at the t-th iteration, we tilt the current conditional density estimate

    f^{t+1}_{y|x}(y \mid x) = f^{t}_{y|x}(y \mid x) \, e^{z(y)^\top \gamma_t(x) - \phi_{\beta_t(x)}(\gamma_t(x))}   (18)

based on the knowledge \gamma_t(x) learned by the new tree. Here \phi_{\beta_t(x)}(\gamma_t(x)) = \psi(\beta_t(x) + \gamma_t(x)) - \psi(\beta_t(x)) is the updated normalizing function (depending on x). Appendix E shows that if the true conditional density is smooth, the approximation error of LinCDE boosting's function class (18) with splines z(y) vanishes as the number of splines k increases.

We determine the LinCDE tree modifiers in (17) by log-likelihood maximization. We aim to find the modifier that produces the largest improvement in the objective \ell(R_0; \beta_t(x) + \gamma(x)), defined as

    \frac{1}{n} \sum_{i=1}^n \left[ \log(f^{t}_{y|x}(y_i \mid x_i)) + z(y_i)^\top \gamma(x_i) - \phi_{\beta_t(x_i)}(\gamma(x_i)) - \lambda \sum_{j=1}^k \omega_j \gamma_j^2(x_i) \right].   (19)

Compared to the objective (12) of LinCDE trees, the only difference in (19) is the normalizing function \phi_{\beta_t(x)}(\gamma(x)). When \beta_t(x) is a constant function, the normalizing functions of LinCDE trees and LinCDE boosting coincide. In Sections 5.2 and 5.3, we demonstrate how LinCDE boosting's heterogeneous normalizing function complicates the fitting and splitting steps, and propose corresponding solutions.

5.2 Fitting Step

For fitting, given a subregion R, the problem (19) cannot be solved by Lindsey's method as in LinCDE trees, because \beta_t(x) could be non-constant in the subregion R. Explicitly, instead of the single constraint in (5), we could have up to n_R constraints

    \int \kappa(y) e^{z(y)^\top (\beta_t(x_i) + \gamma_t(x_i)) + \beta_0(x_i)} \, dy = 1, \qquad x_i \in R.   (20)

As a result, the Lagrangian function (6), as well as the subsequent discrete approximations of Lindsey's method, are invalid. Fortunately, we can solve the fitting problem iteratively (Fitting (LinCDE boosting) below). Define the bin probabilities

    p_b(\beta_t(x)) := \frac{\kappa(y_b) e^{z(y_b)^\top \beta_t(x)}}{\sum_{b'=1}^{B} \kappa(y_{b'}) e^{z(y_{b'})^\top \beta_t(x)}}, \qquad \bar p_b(R; \beta_t(x)) := \frac{1}{n_R} \sum_{x_i \in R} p_b(\beta_t(x_i)).   (21)

We feed the marginal cell probabilities \bar p_b(R; \beta_t(x)) to the fitting step as the baseline for modification. In Step 1, Lindsey's method produces a natural parameter modifier and a universal intercept for all samples in R. The intercept produced by Lindsey's method guarantees that the marginal cell probabilities sum to unity, but not the cell probabilities of every individual x_i. In Step 2, we update the individual normalizing constants to ensure all constraints (20) are satisfied. In Proposition 4, we show that the fitting step of LinCDE boosting converges to the maximizer of the objective (19).

Fitting (LinCDE boosting). In a region R, initialize \gamma = 0 in R^k, \gamma_0 = 0 in R^{n_R}. Count the number of observations {n_{R,b}} in each bin.
1. Updating \gamma. Compute \bar p_b(R; \beta_t(x) + \gamma) in (21), and fit a Poisson regression model with the response variable {n_{R,b}}, regressors z(y_b), the offset \log(\bar p_b(R; \beta_t(x) + \gamma)), and a weighted ridge penalty. Denote the estimated coefficients by \Delta\gamma. Update \gamma \leftarrow \gamma + \Delta\gamma.
2. Updating \gamma_0 (normalization). Compute the normalizing constants for all samples in R,

    \gamma_{0,i} = -\log\left( \sum_{b=1}^{B} p_b(\beta_t(x_i)) e^{z(y_b)^\top \gamma} \right), \qquad x_i \in R.

3. Repeat steps 1 and 2 until \|\Delta\gamma\|_2 \le \epsilon. Output \gamma, \gamma_0.

Proposition 4 Assume that \lambda = 0 and Y is supported on the mid-points {y_b}. Then the fitting step of LinCDE boosting converges, and the output \gamma^t_R satisfies

    \gamma^t_R = \arg\max_{\gamma} \ell(R; \beta_t(x) + \gamma).

We offer some intuition behind Proposition 4, i.e., why the fitting step of LinCDE boosting stops at the likelihood maximizer. If \beta_t(x) + \gamma is already optimal, then the average sufficient statistics z(y) under the marginal probabilities (21) match the observations, which yields the KKT condition of the Poisson regression in Lindsey's method. As a result, Lindsey's method produces zero updates, and the algorithm converges.
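The alternating scheme can be sketched as follows, assuming a routine `lindsey_fit(counts, offset)` that returns the coefficient update of the penalized Poisson regression in Step 1; the routine and the array conventions are placeholders of ours.

```python
import numpy as np

def cell_probs(Z_bins, kappa, Beta):
    """Bin probabilities p_b(beta(x_i)) of Equation (21) for every sample.
    Beta is (n x k) with rows beta(x_i); Z_bins is (B x k) with rows z(y_b)."""
    logits = np.log(kappa)[None, :] + Beta @ Z_bins.T        # n x B
    logits -= logits.max(axis=1, keepdims=True)              # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def fit_boosting_region(Z_bins, kappa, Beta_R, counts_R, lindsey_fit, tol=1e-6):
    """Sketch of Fitting (LinCDE boosting): alternate the shared Lindsey update
    of gamma (Step 1) with per-sample renormalization (Step 2)."""
    n_R, k = Beta_R.shape
    gamma = np.zeros(k)
    while True:
        P = cell_probs(Z_bins, kappa, Beta_R + gamma)         # current fit
        p_bar = P.mean(axis=0)                                # marginal cell probs
        delta = lindsey_fit(counts_R, offset=np.log(p_bar))   # Step 1: Poisson fit
        gamma += delta
        # Step 2: individual normalizing constants gamma_{0,i}
        gamma0 = -np.log(cell_probs(Z_bins, kappa, Beta_R) @ np.exp(Z_bins @ gamma))
        if np.linalg.norm(delta) < tol:
            break
    return gamma, gamma0
```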
5.3 Splitting Step

Reminiscent of the splitting step for LinCDE trees, we seek the split that produces the largest improvement in the objective (19). Proposition 2 is not valid due to the heterogeneity in \beta_t(x), and we propose an expanded version.

Proposition 5 In a region R, let n_R be the sample size and \gamma^t_R be the optimal update. Define the average sufficient-statistic residuals

    \bar r^t_{R} := \frac{1}{n_{R}} \sum_{x_i \in R} \left[ z_i - \nabla\psi(\beta_t(x_i) + \gamma^t_R) \right],

and analogously \bar r^t_{R_{s,L}}, \bar r^t_{R_{s,R}} for the two children of a candidate split s. Define

    \Psi^t_s(\gamma^t_R) := \frac{n_{R_{s,R}}}{n_R \, n_{R_{s,L}}} \sum_{x_i \in R_{s,L}} \nabla^2\psi(\beta_t(x_i) + \gamma^t_R) + \frac{n_{R_{s,L}}}{n_R \, n_{R_{s,R}}} \sum_{x_i \in R_{s,R}} \nabla^2\psi(\beta_t(x_i) + \gamma^t_R).

Then the improvement of the unpenalized conditional log-likelihood satisfies

    \frac{1}{n_R} \Delta\ell^t(R, s) = \frac{n_{R_{s,L}} n_{R_{s,R}}}{2 n_R^2} \left(\bar r^t_{R_{s,L}} - \bar r^t_{R_{s,R}}\right)^\top \left[\Psi^t_s(\gamma^t_R)\right]^{-1} \left(\bar r^t_{R_{s,L}} - \bar r^t_{R_{s,R}}\right) + r_s,

where r_s = O( \| \bar r^t_{R_{s,L}} - \bar r^t_R \|_2^3 + \| \bar r^t_{R_{s,R}} - \bar r^t_R \|_2^3 ).

The average sufficient-statistic residual \bar r^t_R measures the deviation of the current estimator from the observations. By maximizing the quadratic approximation in Proposition 5, we find a candidate split s such that \bar r^t_{R_{s,L}} and \bar r^t_{R_{s,R}} are far apart, and we modify the current estimator differently in the left and right children determined by the selected split. The updated splitting procedure is summarized in Splitting (LinCDE boosting).

Splitting (LinCDE boosting). In a region R:
1. Compute {n_{R_{s,L}}, n_{R_{s,R}}, \bar z_{R_{s,L}}, \bar z_{R_{s,R}}} for each candidate split s, and approximate \Psi^t(\gamma^t_R) in (22) by the average covariance matrix of z(y) in R.
2. For each candidate split s in S, compute the quadratic approximation \widehat{\Delta\ell}(R, s) from Proposition 5, and choose the split s_max = \arg\max_{s \in S} \widehat{\Delta\ell}(R, s).

The computation of the quadratic approximation is largely the same as for LinCDE trees, except that the normalization matrix \Psi^t_s(\gamma^t_R) varies across candidate splits and requires separate computation. To relieve the computational burden, we propose the following surrogate, independent of the candidate splits,8

    \Psi^t(\gamma^t_R) := \frac{1}{n_R} \sum_{x_i \in R} \nabla^2\psi(\beta_t(x_i) + \gamma^t_R).   (22)

8. In practice, we add a universal diagonal matrix to \frac{1}{n_R} \sum_{x_i \in R} \nabla^2\psi(\beta_t(x_i) + \gamma^t_R) to stabilize the matrix inversion.

The surrogate \Psi^t(\gamma^t_R) coincides with \Psi^t_s(\gamma^t_R) if \beta_t(x) is a constant vector, or if the normalizing function \psi(\beta) is quadratic. Proposition 6 gives the computational time complexity of the splitting procedure (Splitting (LinCDE boosting)). The computation time scales linearly in the sample size multiplied by the dimension, and in the number of candidate splits. The extra computation compared to LinCDE trees comes from the residual calculations and the individual normalizations.

Proposition 6 Assume that there are S candidate splits. Then the splitting step for LinCDE boosting is of computational time complexity O(d n_R k B + n_R k^2 B + k^3 + S k^2).

With the fitting and splitting steps in place, LinCDE boosting is summarized in Algorithm 2.

Algorithm 2: LinCDE boosting
Initialize the natural parameter function \beta_0(x).9
for t = 1:T do
  1. Apply Algorithm 1 with Fitting (LinCDE boosting) and Splitting (LinCDE boosting), and obtain the optimal LinCDE tree modifier \hat\gamma_{t-1}(x).
  2. Update \hat\beta_t(x) \leftarrow \hat\beta_{t-1}(x) + \hat\gamma_{t-1}(x).
end
Output \hat\beta_T(x).

9. The initialization \beta_0(x) is usually a constant vector, e.g., the zero vector, independent of the covariates.
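A high-level sketch of Algorithm 2 then looks as follows; `fit_modifier` stands in for one round of Fitting (LinCDE boosting) and Splitting (LinCDE boosting) and is a placeholder, not the R package's interface.

```python
import numpy as np

def lincde_boost(X, y, n_trees=100, lr=0.01, k=10, fit_modifier=None):
    """Sketch of Algorithm 2: start from a null natural-parameter function and
    add shrunken LinCDE-tree modifiers. `fit_modifier(X, y, Beta)` is assumed to
    return a function gamma(x) in R^k fitted by one LinCDE tree."""
    n = X.shape[0]
    Beta = np.zeros((n, k))                  # beta_0(x_i): null initialization
    modifiers = []
    for t in range(n_trees):
        gamma = fit_modifier(X, y, Beta)     # one LinCDE tree (Sections 5.2-5.3)
        Beta += lr * np.array([gamma(x) for x in X])
        modifiers.append(gamma)

    def beta_hat(x):                         # beta_T(x) = sum_t eta * gamma_t(x)
        return lr * sum(g(x) for g in modifiers)
    return beta_hat
```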
6. Pretreatment

In this section, we discuss two pretreatments: response transformation and centering. The pretreatments are helpful when the response is heavy-tailed and when the conditional distributions f_{y|x}(y | x) vary wildly in location.

6.1 Response Transformation

Heavy-tailed response distributions are common in practice, such as incomes and waiting times. If the response is heavy-tailed, then in Lindsey's method most bins will be approximately empty. As a result, the model tends to be over-parameterized and the estimates tend to overfit.

In response to heavy-tailed responses, we recommend first transforming the response to be marginally normally distributed. Often log and cube-root transformations are useful. In a more principled way, we can apply the Box-Cox transformation to the responses and choose the optimal power parameter to produce the best approximation of a Gaussian distribution. Once the model is fit to the transformed data, we map the estimated conditional densities of the transformed responses back to those of the original observations. See Figure 4 for an example.

Figure 4: Response transformation and Lindsey's method. We generate 100 responses y = w^2, w ~ Exp(1), and apply Lindsey's method. The left panel plots the histogram of the responses, and we observe several extreme values (>= 10). The middle panel plots the true density and Lindsey's estimate. Due to a limited number (20) of bins and a wide response range caused by outliers, Lindsey's estimate is inaccurate (not sharp enough) around zero. The right panel demonstrates Lindsey's estimate using log-transformed responses. The log-transformation helps the estimation around zero and does not sacrifice the fit in the tail.

6.2 Centering

For a distribution whose conditional components differ wildly in location, LinCDE needs a large number of sufficient statistics to capture local distributional characteristics. For instance, Figure 5 displays a conditional Gaussian mixture with a location shift,

    y = 3x^{(1)} + w z^{(1)} + (1 - w) z^{(2)}, \quad w \sim \mathrm{Ber}(0.5), \quad z^{(1)} \sim N(-0.5, 0.06), \quad z^{(2)} \sim N(0.5, 0.06), \quad z^{(1)} \perp z^{(2)}.   (23)

When we apply LinCDE boosting with k = 10 sufficient statistics, the estimates do not reproduce the bimodality due to a lack of flexibility. We call this the disjoint support problem. A straightforward solution to the disjoint support problem is to increase the number k so that the sufficient statistics z(y) are adequately expressive. As a consequence, the number of components in the parameter function \beta(x) goes up.
This approach is prone to overfitting, especially when there are a small number (<= 20) of samples in a terminal node. In addition, this approach significantly slows down the splitting procedure, which scales as O(k^3) by Proposition 6.

Our solution is to center the response prior to fitting the LinCDE model. Since the difference in location causes the disjoint support problem, we suggest aligning the centers of the conditional densities in advance. Explicitly, we first estimate the locations via some conditional mean estimator and then subtract the estimates from the responses. The supports of the residuals are less heterogeneous, and we apply LinCDE boosting to these residuals to capture additional distributional structure. Finally, we transform the resulting density estimates back to those of the responses. The procedure is summarized in Algorithm 3.

Algorithm 3: Centering
1. Estimate the conditional mean \hat h(x) using the training data {(x_i, y_i)}. Compute the residuals r_i = y_i - \hat h(x_i).
2. Apply LinCDE boosting to {(x_i, r_i)}, and obtain \hat f_{R|X}(r | x).
3. Define \hat f_{Y|X}(y | x) = \hat f_{R|X}(y - \hat h(x) | x) and output \hat f_{Y|X}.

Centering splits the task of conditional distribution estimation into conditional mean estimation and distributional property estimation. For centering we have available a variety of popular conditional mean estimators, such as the standard random forest, boosting, and neural networks. Once the data are centered, LinCDE boosting has a more manageable task. Figure 5 shows that with centering, LinCDE boosting is able to reproduce the bimodal structure in the above example with the same set of sufficient statistics.

Figure 5: Conditional density estimation with and without centering. We consider the conditional density (23) and pick 3 landmarks corresponding to different locations (-2.4, 0, and 2.4). The first row plots LinCDE boosting's estimates without centering, and the second row plots the estimates augmented with the true means. In each trial, we sample 1000 observations from the target distribution. We repeat each setting 100 times, and plot the average estimated conditional densities. In both settings, LinCDE boosting uses k = 10 sufficient statistics and 20 response bins.
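A minimal sketch of the centering wrapper of Algorithm 3, assuming scikit-learn's random forest as the conditional mean learner and placeholder routines `lincde_fit` and `lincde_density` standing in for a LinCDE boosting implementation:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def centered_lincde(X, y, lincde_fit, lincde_density):
    """Algorithm 3 (sketch). `lincde_fit(X, r)` returns a fitted model for the
    residual density f_{R|X}; `lincde_density(model, r_grid, x)` evaluates it."""
    mean_model = RandomForestRegressor(n_estimators=500).fit(X, y)
    resid = y - mean_model.predict(X)              # step 1: center the responses
    resid_model = lincde_fit(X, resid)             # step 2: LinCDE on residuals

    def density(y_grid, x):
        # step 3: shift the residual density back to the response scale
        mu = mean_model.predict(x.reshape(1, -1))[0]
        return lincde_density(resid_model, y_grid - mu, x)
    return density
```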
The skewness or symmetry is controlled by x(3) in the Gaussian mixture subregion: larger absolute values of x(3) imply higher asymmetry. Overall, the conditional distribution has location, shape, and symmetry dependent on the first three covariates. Covariates x(4) to x(20) are nuisance variables. The training data set consists of 1000 i.i.d. samples. The performance is evaluated on an independent test data set of size 1000. We compare Lin CDE boosting with quantile regression forest and distribution boosting.10 There are a number of tuning parameters in Lin CDE boosting. The primary parameter is the number of trees (iteration number). Secondary tuning parameters include the tree size, the learning rate, and the ridge penalty parameter. On a separate validation data set, we experimented with a grid of secondary parameters, each associated with a sequence of iteration numbers, and select the best-performing configuration. By default, we use k = 10 transformed natural cubic splines and a Gaussian carrying density We use a small learning rate η = 0.01 to avoid overfitting. We use 40 discretization bins for training, and 20 or 50 for testing. The simulation examples do not have heavy-tail or disjoint support issues, and thus no pretreatments are needed. 7.2 Results of Conditional Density Estimation Let the oracle be provided with the true density, and the null method estimates a marginal Gaussian distribution. We consider the following metric ℓmethod ℓnull ℓoracle ℓnull , (24) 10. Lin CDE boosting: https://github.com/Zijun Gao/Lin CDE. Quantile regression forest: R package quantreg Forest (Meinshausen, 2017). Distribution boosting: R package con Tree (Friedman and Narasimhan, 2020). Gao and Hastie QRF DB Lin CDE 0.0 0.2 0.4 0.6 0.8 (lmethod lnull) (loracle lnull) QRF DB Lin CDE 0.0 0.2 0.4 0.6 0.8 (lmethod lnull) (loracle lnull) Figure 6: Box plots of goodness-of-fit measures (24) in the setting LGD (left panel) and the setting LGGMD (right panel). The goodness-of-fit measure is based on loglikelihoods, and a larger value indicates a better estimate. We compare quantile regression forests (QRF), distribution boosting (DB), and Lin CDE boosting. Densities of quantile regression forests and distribution boosting are computed according to (25) with 20 bins. where ℓ denotes the test conditional log-likelihood of a specific method. The criterion is analogous to the goodness-of-fit measure R2 of linear regression. It measures the performance of the method relative to the oracle; larger values indicate better fits, and the ideal value is one. Quantile regression forests and distribution boosting estimate conditional quantiles instead of densities. To convert the quantile estimates to density estimates, we define a grid of bins with endpoints yb,L and yb,R, and approximate the density in bin b by ˆfb = ˆq 1(yb,R) ˆq 1(yb,L) yb,R yb,L , (25) where ˆq 1(y) represents the inverse function of the quantile estimates. As the bin width shrinks, ˆfb is less biased but of larger variance. In simulations, we display the results with 20 bins and 50 bins (Appendix G). We observe that Lin CDE boosting is robust to the bin size, while distribution boosting and quantile regression forests prefer 20 bins due to the smaller variances. Figure 6 presents the goodness-of-fit measure (24) of the three methods under the LGD and LGGMD settings. In both settings, Lin CDE boosting leads in performance, improving the null method by 60% to 80% of the oracle s improvements. 
Figures 7 and 8 depict the estimated conditional densities of LinCDE boosting in different subregions. In both settings, LinCDE boosting identifies the roles of the important covariates: in the LGD setting, the estimated conditional densities vary in location as x^{(1)} changes and in scale as x^{(2)} changes; in the LGGMD setting, the estimated conditional densities vary in location as x^{(1)} changes, in shape as x^{(2)} changes, and in symmetry as x^{(3)} changes.

Figure 7: Conditional densities estimated by LinCDE boosting in the LGD setting. We take x^{(2)} in {-0.6, 0, 0.6} (upper panel) and x^{(1)} in {-0.6, 0, 0.6} (lower panel) and fix the other covariates. The estimated conditional densities are close to the truth.

Figure 8: Conditional densities estimated by LinCDE boosting in the LGGMD setting. We take x^{(1)}, x^{(2)}, x^{(3)} in {-0.6, 0.6} respectively. The estimated conditional densities vary in location as x^{(1)} changes, in shape as x^{(2)} changes, and in symmetry as x^{(3)} changes. The estimated conditional densities are close to the truth.

To further illustrate the ability of LinCDE boosting to detect influential covariates, we present the importance scores in Figure 9. In the LGD setting, LinCDE boosting puts around 87% of the importance on x^{(1)} and x^{(2)}, while quantile regression forest distributes more importance to the nuisance covariates (x^{(1)} and x^{(2)} accounting for 40%). In the LGGMD setting, LinCDE boosting is able to detect all influential covariates x^{(1)}, x^{(2)}, x^{(3)}, while the quantile regression forest only recognizes x^{(1)}.

Figure 9: Importance scores of LinCDE boosting and quantile regression forest in the LGD (left panel) and LGGMD (right panel) settings. We normalize the importance scores to sum to one. In the LGD setting, both methods detect all the influential covariates x^{(1)} and x^{(2)}. In the LGGMD setting, LinCDE boosting identifies all the influential covariates x^{(1)}, x^{(2)}, and x^{(3)}, while quantile regression forest only identifies x^{(1)}.

7.3 Results of Conditional CDF Estimation

Here we evaluate the conditional CDF estimates of the three methods, bringing the comparisons closer to the home court of distribution boosting. We consider the average absolute error (AAE) used by Friedman (2019),

    \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} \left| \hat F(q(u_j \mid x_i) \mid x_i) - F(q(u_j \mid x_i) \mid x_i) \right|,   (26)

where {u_j}_{j=1}^m is an evenly spaced grid on [0, 1], and q(u | x) denotes the u-quantile at the covariate value x. To compute the CDF estimates, for distribution boosting and quantile regression forest, we directly invert the estimated quantiles to CDFs. For LinCDE boosting, we compute the multinomial cell probabilities on a fine grid (50 bins) and obtain the CDFs based on the cell probabilities.

Figure 10: Box plots of the AAE (26) in the LGD (left panel) and LGGMD (right panel) settings. The AAE is a metric naturally defined for conditional CDF estimates (Friedman, 2019), and a smaller value indicates a better estimate.

Figure 10 depicts the AAE metrics. In both settings, LinCDE boosting produces the smallest AAE. Notice that

    \hat F(q(u_j \mid x_i) \mid x_i) - u_j = \hat F(q(u_j \mid x_i) \mid x_i) - \hat F(\hat q(u_j \mid x_i) \mid x_i) \approx \hat f(q(u_j \mid x_i) \mid x_i) \left( q(u_j \mid x_i) - \hat q(u_j \mid x_i) \right).

Though distribution boosting and quantile regression forest estimate the quantiles well, the CDF estimates can be harmed by the implicit density estimator by which the quantile error is multiplied. In Appendix G, we also compare the CDF estimates using the Cramér-von Mises distance and observe patterns consistent with what we see here.
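Conversely, turning LinCDE boosting's multinomial cell probabilities into CDF and quantile estimates amounts to a cumulative sum and an interpolation; a minimal sketch (our own helper, not part of the package):

```python
import numpy as np

def cdf_and_quantiles_from_cells(mids, probs, levels):
    """Turn fitted cell probabilities on bin mid-points into a step CDF and
    quantile estimates, as used for the CDF and quantile comparisons."""
    cdf = np.cumsum(probs)
    quantiles = np.interp(levels, cdf, mids)      # invert the step CDF
    return cdf, quantiles
```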
7.4 Results of Conditional Quantile Estimation

Here the comparisons are in the home court of quantile regression forests. We evaluate the conditional quantile estimates of the three methods. We compute the quantile losses at the {5%, 25%, 50%, 75%, 95%} levels (Table 1). For LinCDE boosting, we compute the multinomial cell probabilities (50 bins) and obtain the quantiles based on the cell probabilities. Despite the fact that quantile-based metrics should favor quantile-based methods, we observe that the performance of LinCDE boosting is similar. We also construct 50% and 90% prediction intervals based on the quantiles. In Table 1, we display the coverages and widths of the 90% prediction intervals. All methods produce fairly good coverages. Results for the 50% prediction intervals are consistent and can be found in Appendix G.

Table 1: Quantile losses (the smaller the better) at the {5%, 25%, 50%, 75%, 95%} levels, 90% prediction interval coverages (ideally 90%), and interval widths (the narrower the better) in the LGD and LGGMD settings. Standard deviations are in parentheses. PI stands for prediction interval.

| Data | Method | 5% | 25% | 50% | 75% | 95% | Coverage, 90% PI | Width, 90% PI |
|---|---|---|---|---|---|---|---|---|
| LGD | QRF | 0.058 (0.001) | 0.174 (0.02) | 0.218 (0.002) | 0.174 (0.02) | 0.056 (0.001) | 93.3% (0.9%) | 2.02 (0.048) |
| LGD | DB | 0.058 (0.001) | 0.176 (0.03) | 0.218 (0.003) | 0.174 (0.03) | 0.057 (0.002) | 92.5% (1.2%) | 1.95 (0.051) |
| LGD | LinCDE | 0.055 (0.001) | 0.168 (0.01) | 0.212 (0.001) | 0.169 (0.02) | 0.054 (0.001) | 91.9% (1.0%) | 1.84 (0.045) |
| LGGMD | QRF | 0.054 (0.001) | 0.180 (0.001) | 0.246 (0.002) | 0.181 (0.001) | 0.055 (0.001) | 89.5% (0.7%) | 1.76 (0.034) |
| LGGMD | DB | 0.054 (0.001) | 0.182 (0.002) | 0.246 (0.001) | 0.182 (0.002) | 0.055 (0.001) | 91.2% (0.9%) | 1.88 (0.038) |
| LGGMD | LinCDE | 0.053 (0.001) | 0.181 (0.001) | 0.246 (0.001) | 0.181 (0.001) | 0.055 (0.001) | 90.5% (0.7%) | 1.78 (0.032) |

7.5 Computation Time

We compare the computation time of the three methods.11 We normalize the training time by the number of trees. In Table 2, quantile regression forest is the fastest, followed by LinCDE boosting. LinCDE boosting takes about 5 seconds for 100 iterations.

11. The experiments are run on a personal computer with a dual-core CPU and 8GB memory.

Table 2: Computation time in seconds. We present the training time for n = 1000 samples and d = 20 features per 100 trees. Standard deviations are in parentheses.

| Time (s) | QRF | DB | LinCDE |
|---|---|---|---|
| LGD | 1.3 (0.018) | 13 (0.21) | 4.8 (0.36) |
| LGGMD | 1.4 (0.15) | 14 (1.0) | 5.1 (0.93) |

8. Real Data Analysis

In this section, we analyze real data sets with LinCDE boosting. The pipeline is as follows: first, we split the samples into training and test data sets; next, we perform 5-fold cross-validation on the training data set to select the hyper-parameters; finally, we apply the estimators with the selected hyper-parameters and evaluate multiple criteria on the test data set. We repeat the procedure 20 times and average the results. For centering, we use random forests as the conditional mean learner.
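A sketch of this evaluation pipeline, assuming scikit-learn utilities and placeholder `fit`/`test_loglik` callables for a conditional density estimator and its log-likelihood scorer:

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

def evaluate_once(X, y, fit, test_loglik, param_grid, seed=0):
    """Train/test split, 5-fold cross-validation over a hyper-parameter grid,
    then test-set evaluation. `fit(X, y, **params)` returns a fitted model and
    `test_loglik(model, X, y)` returns its mean test log-likelihood."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=seed)
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    best_params, best_score = None, -np.inf
    for params in param_grid:
        scores = [test_loglik(fit(X_tr[tr], y_tr[tr], **params), X_tr[va], y_tr[va])
                  for tr, va in cv.split(X_tr)]
        if np.mean(scores) > best_score:
            best_score, best_params = np.mean(scores), params
    final = fit(X_tr, y_tr, **best_params)
    return test_loglik(final, X_te, y_te), best_params
```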
8.1 Old Faithful Geyser Data

The Old Faithful Geyser data records eruptions of the Old Faithful geyser in Yellowstone National Park (Azzalini and Bowman, 1990) and represents continuous measurements from August 1 to August 15, 1985. The data consists of 299 observations and 2 variables: eruption time and waiting time for the eruption. We estimate the conditional distribution of the eruption time given the waiting time.

In Figure 11, we plot the eruption time versus the waiting time. There is a clear cutoff at 70 min: for any waiting time over 70 min, the distribution of the eruption time is bimodal, while for any waiting time below 70 min, the distribution is unimodal. In Figure 11, we also display the estimated conditional densities of LinCDE boosting at waiting times 85 min and 60 min. LinCDE boosting is capable of detecting the difference in modality.

Figure 11: LinCDE boosting applied to the Old Faithful Geyser data. On the left, we plot the eruption time versus the waiting time. On the right, we display the estimated conditional densities by LinCDE boosting at waiting times 60 min and 85 min.

In Table 3, we summarize the comparison between LinCDE boosting, distribution boosting, and quantile regression forest regarding the negative log-likelihoods and quantile losses. LinCDE boosting outperforms in log-likelihood and is competitive in quantile losses. We remark that the AAE and Cramér-von Mises distance cannot be computed since we do not have the true conditional distributions.

8.2 Human Height Data

The human height data is taken from the NHANES data set: a series of health and nutrition surveys collected by the US National Center for Health Statistics (NCHS). We estimate the conditional distribution of the standing height. We consider two subsets: 542 samples in the age range 14 to 17, and 1956 samples in the age range 14 to 40. In the smaller subset, we only consider two covariates: age and poverty; in the larger subset, we consider 9 covariates, including age, poverty, race, gender, etc. In the smaller subset, we tune by cross-validation; in the larger subset, we split the data set for validation, training, and test (proportion 2 : 1 : 1), and tune on the hold-out validation data.

The distribution of heights combining males and females is used as a typical illustration of bimodality (Devore and Peck, 2005). However, Schilling et al. (2002) point out that the separation between the heights of men and women is not large enough to produce bimodality. In Figure 12, we show the histogram of heights of white teenagers in the age range 15-19. The distribution of the combined data is slightly bimodal. Boys' heights are larger and more concentrated. We also provide LinCDE boosting's conditional density estimates obtained from the larger data set. Overall, the estimates accord with the histograms. The estimate without the covariate gender is on the borderline between unimodal and bimodal. The estimates with gender included explain the formation of the quasi-bimodality.

The comparisons of log-likelihood and quantile losses are summarized in Table 3. In both the larger and the smaller data sets, LinCDE boosting performs the best in terms of log-likelihood. The advantage is more significant in the smaller data set. The reason is that the larger data set has more covariates, so the conditional mean explains a larger proportion of the variation in the response; in the smaller data set, the conditional distribution after centering retains more structure, such as the bimodality, which LinCDE boosting can learn.
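The negative log-likelihoods and quantile losses reported in Table 3 below can be computed directly from binned conditional density estimates. The sketch that follows is illustrative only, with hypothetical inputs (cell_prob, breaks, test responses), and is not code from the LinCDE package.

# Sketch: test negative log-likelihood and quantile (pinball) losses from a
# binned conditional density estimate. cell_prob has one row per test point and
# one column per bin; breaks are the bin boundaries (assumed to cover y_test).
neg_loglik <- function(cell_prob, breaks, y_test) {
  bin_width <- diff(breaks)[1]
  bin_idx <- cut(y_test, breaks, labels = FALSE, include.lowest = TRUE)
  dens <- cell_prob[cbind(seq_along(y_test), bin_idx)] / bin_width   # estimated density at y_i
  -mean(log(pmax(dens, .Machine$double.eps)))                        # guard against log(0)
}

pinball_loss <- function(y_test, q_hat, u) {
  # quantile loss at level u; q_hat is the estimated u-quantile per test point
  mean(ifelse(y_test >= q_hat, u * (y_test - q_hat), (1 - u) * (q_hat - y_test)))
}

quantile_from_cells <- function(cell_prob_row, breaks, u) {
  cum_prob <- c(0, cumsum(cell_prob_row))                  # piecewise-linear CDF
  approx(cum_prob, breaks, xout = u, ties = "ordered")$y   # invert it at level u
}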
data                  method   -log-like      5%             25%            50%            75%            95%
Geyser                QRF      1.55 (0.12)    0.09 (0.01)    0.30 (0.02)    0.42 (0.03)    0.33 (0.02)    0.10 (0.01)
                      DB       1.28 (0.10)    0.09 (0.01)    0.27 (0.02)    0.37 (0.03)    0.30 (0.02)    0.09 (0.01)
                      LinCDE   1.16 (0.07)    0.09 (0.01)    0.28 (0.02)    0.37 (0.03)    0.30 (0.02)    0.09 (0.01)
Height (age 14-40)    QRF      3.30 (0.03)    0.63 (0.03)    1.72 (0.06)    2.19 (0.07)    1.77 (0.07)    0.69 (0.04)
                      DB       3.29 (0.04)    0.65 (0.03)    1.84 (0.05)    2.24 (0.06)    1.90 (0.07)    0.72 (0.05)
                      LinCDE   3.19 (0.03)    0.64 (0.04)    1.82 (0.05)    2.20 (0.06)    1.88 (0.07)    0.71 (0.04)
Height (age 14-17)    QRF      3.93 (0.17)    0.95 (0.12)    2.99 (0.24)    3.87 (0.25)    3.19 (0.23)    1.12 (0.20)
                      DB       4.21 (0.16)    1.04 (0.04)    3.62 (0.19)    4.90 (0.28)    4.39 (0.38)    1.77 (0.32)
                      LinCDE   3.61 (0.06)    0.84 (0.05)    2.92 (0.16)    3.79 (0.23)    3.05 (0.19)    0.96 (0.10)
Bone density          QRF      -1.67 (0.11)   0.004 (0.001)  0.012 (0.001)  0.015 (0.001)  0.013 (0.001)  0.004 (0.001)
                      DB       -1.49 (0.03)   0.007 (0.000)  0.015 (0.001)  0.014 (0.001)  0.016 (0.001)  0.007 (0.000)
                      LinCDE   -1.89 (0.06)   0.004 (0.000)  0.011 (0.001)  0.014 (0.001)  0.013 (0.001)  0.005 (0.001)

Table 3: Comparison of LinCDE boosting, QRF, and DB on real data sets. We display negative log-likelihoods and quantile losses at the {5%, 25%, 50%, 75%, 95%} levels. (For all metrics, the smaller the better.) Standard deviations are in parentheses.

8.3 Relative Spinal Bone Mineral Density Data

The relative spinal bone mineral density (spnbmd) data contains 485 observations on 261 North American adolescents. The response is the difference in spnbmd taken on two consecutive visits, divided by the average. There are three covariates: sex, race, and age (the average age over the two visits). We estimate the conditional distributions of the spnbmd. The comparisons of log-likelihood and quantile losses are summarized in Table 3. LinCDE boosting performs the best in terms of log-likelihood.

The scatterplot of spnbmd versus age demonstrates serious heteroscedasticity: the variance peaks around age 12 and decreases afterward. In Figure 13, we plot LinCDE boosting's estimates at ages 12, 15, and 20 for white females. The spreads of the conditional densities decrease as the age grows. At age 12, the spnbmd distribution is right-skewed, while those at ages 15 and 20 are approximately symmetric.

Figure 12: LinCDE boosting applied to the height data. On the left, we plot the histogram of heights of white teenagers in the age range 15-19. On the right, we plot the average estimated density by LinCDE boosting from the larger data set at age 17, race white, and other covariates fixed at the corresponding medians. The estimated conditional density without the covariate gender is on the borderline between unimodal and bimodal. Furthermore, we contrast the histograms of male and female heights against that of the combined data in the left plot, explaining the formation of the quasi-bimodality. We also depict the estimated densities of females and males (right panel), which accord with the histograms (left panel).
9. Discussion

In this paper, we propose LinCDE boosting for conditional density estimation. LinCDE boosting imposes no specific parametric assumptions on the density family. The estimates reflect a variety of distributional characteristics. In the presence of unrelated nuisance covariates, LinCDE boosting is able to focus on the influential ones.

So far, we have discussed only univariate responses. Multivariate responses emerge in multiple practical scenarios: locations on a 2D surface and joint distributions of health indices, to name a few. Lindsey's method, and thus LinCDE boosting, can be easily generalized to multivariate responses. Assume the responses are p-dimensional; multivariate LinCDE boosting considers the density (11) with sufficient statistics involving y(1) to y(p). As an illustrative example, if we use the sufficient statistics {y(i)} and {y(i)y(j)}, the resulting density will be a multivariate Gaussian (see Efron and Hastie, 2016, chap. 8.3 for the galaxy data example). The response discretization now divides the hyper-rectangular response support into equal-sized p-dimensional bins, and the rest of the LinCDE boosting procedure carries over. The cost of multivariate responses is the exponentially growing number of bins and sufficient statistics, which requires more samples as well as computational power. In contrast, conditional quantiles for multi-dimensional responses are less straightforward, though several promising proposals exist (Barnett, 1976; Chaudhuri, 1996; Kong and Mizera, 2012; Carlier et al., 2016). We refer readers to Koenker (2017) and references therein.

Figure 13: LinCDE boosting applied to the relative spinal bone mineral density data. On the left, we display the scatterplot of spnbmd versus age, which exhibits serious heteroscedasticity. On the right, we plot LinCDE boosting's estimates at ages 12, 15, and 20 for white females. The spreads of the conditional densities decrease as the age grows. At age 12, the spnbmd distribution is right-skewed, while those at ages 15 and 20 are approximately symmetric.

There are several exciting applications of LinCDE boosting.

Online learning. Online learning processes data that become available in sequential order, such as stock prices and online auctions. As opposed to batch learning techniques, which generate the best predictor by learning on the entire training data set at once, online learning updates the best predictor for future data at each step. Online updating of LinCDE boosting is simple: we input the previous conditional density estimates as offsets and modify them to fit the new data.

Conditional density ratio estimation. A stream of work studies the density ratio model (DRM), particularly the semi-parametric DRM. The density ratio can be used for importance sampling, two-sample testing, and outlier detection (see Sugiyama et al., 2012, for an extensive review). LinCDE boosting tilts the baseline estimate parametrically based on the smaller group.
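Returning to the multivariate extension discussed at the beginning of this section, the following sketch builds equal-sized two-dimensional bins and the Gaussian-type sufficient statistics {y(i)} and {y(i)y(j)} for a bivariate response. The grid range, number of bins, and helper name are illustrative choices and not part of the LinCDE package.

# Sketch: response discretization and sufficient statistics for a bivariate
# response. With the statistics y1, y2, y1^2, y2^2, and y1*y2, the fitted
# exponential family is a multivariate Gaussian.
n_bin <- 20                                          # bins per dimension (illustrative)
grid1 <- seq(-3, 3, length.out = n_bin + 1)          # bin boundaries for y(1)
grid2 <- seq(-3, 3, length.out = n_bin + 1)          # bin boundaries for y(2)
mid1 <- (grid1[-1] + grid1[-(n_bin + 1)]) / 2
mid2 <- (grid2[-1] + grid2[-(n_bin + 1)]) / 2

bins <- expand.grid(y1 = mid1, y2 = mid2)            # n_bin^2 bin midpoints
Z <- with(bins, cbind(y1, y2, y1^2, y2^2, y1 * y2))  # sufficient statistics per bin

# Count the responses falling in each 2D bin (y is an n x 2 matrix):
count_2d <- function(y, grid1, grid2, n_bin) {
  i <- cut(y[, 1], grid1, labels = FALSE, include.lowest = TRUE)
  j <- cut(y[, 2], grid2, labels = FALSE, include.lowest = TRUE)
  table(factor(i, levels = 1:n_bin), factor(j, levels = 1:n_bin))
}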
One future research direction is adding an adaptive ridge penalty to LinCDE boosting. Recall the fits in the LGGMD setting (Figure 8), where the conditional densities change from a relatively smooth Gaussian density to a curvy bimodal Gaussian mixture; the estimates get stuck in between: in the smooth Gaussian subregions, the estimates produce unnecessary curvature, while in the bumpy Gaussian mixture subregions, the estimates are not sufficiently wavy. The lack of fit can be attributed to the universal constraint on the degrees of freedom: the constraint may be too stringent in some subregions and too lenient in others.

10. Software

Software for LinCDE is made available as an R package at https://github.com/ZijunGao/LinCDE. The package can be installed from GitHub with

install.packages("devtools")
devtools::install_github("ZijunGao/LinCDE", build_vignettes = TRUE)

The package comes with a detailed vignette discussing hyper-parameter tuning and demonstrating a number of simulated and real data examples.

Acknowledgments

This research was partially supported by grants DMS-2013736 and IIS 1837931 from the National Science Foundation, and grant 5R01 EB 001988-21 from the National Institutes of Health.

Appendix A. Regularization Parameter and Degrees of Freedom

We discuss how the hyper-parameter λ relates to the degrees of freedom.12 According to Eq. (12.74) in Efron and Hastie (2016), the effective number of parameters in Poisson models takes the form

Σ_{b=1}^{B} cov(ˆη_b, n_b), (27)

where the n_b are responses and the ˆη_b are estimates of the natural parameter. In the following Proposition 7, we obtain an approximation of Eq. (27) in ridge Poisson regression via a quadratic expansion of the objective function.

Proposition 7 Assume the responses n_b are generated independently from the Poisson model with conditional means µ*_b = κ_b e^{z_b⊤β*} (z includes the intercept) and the corresponding natural parameters η*_b = log(µ*_b). Let λ̃ > 0 be the hyper-parameter of the ridge penalty.13 For arbitrary β, the second-order Taylor expansion at β* of the Poisson log-likelihood with ridge penalty,

Σ_{b=1}^{B} [ n_b (log(κ_b) + z_b⊤β) − κ_b e^{z_b⊤β} − log(n_b!) ] − λ̃ Σ_{j=1}^{k} ω_j β_j²,

is

−(1/2) (Zβ + K − ζ)⊤ W (Zβ + K − ζ) − λ̃ β⊤Ωβ + C, (28)

where Z_b = z_b⊤, K_b = log(κ_b), ζ_b = η*_b + (n_b − µ*_b)/µ*_b, W = diag(µ*_1, . . . , µ*_B), Ω = diag(ω_1, . . . , ω_k), and C is a constant independent of β. Let ˆβ maximize the quadratic approximation (28); then

Σ_{b=1}^{B} cov(log(κ_b) + z_b⊤ˆβ, y_b) = tr[ (H_{β*} + 2λ̃Ω)^{−1} H_{β*} ], (29)

where H_{β*} = Z⊤WZ is the Hessian matrix of the negative Poisson log-likelihood evaluated at β*.

We prove Proposition 7 in Appendix F. As a corollary, if Ω is a scalar multiple of the identity matrix, Eq. (29) agrees with the degrees of freedom formula Eq. (7.34) in Hastie et al. (2009). In practice, β* is unknown, and we plug ˆβ into Eq. (29) to compute the number of effective parameters. We remark that Eq. (29) includes one degree of freedom for the intercept, and in the LinCDE package we use Eq. (29) minus one as the degrees of freedom.

12. The degrees of freedom are derived under the Poisson approximation (7) of the original likelihood.
13. λ̃ = nλ for the λ in (8).
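Eq. (29) is a single trace computation once the quantities in Proposition 7 are available. The sketch below uses made-up inputs (a basis matrix Z, fitted bin means mu, a diagonal penalty Omega, and a penalty level lambda); it is not code from the LinCDE package, which additionally applies the intercept convention described above.

# Sketch: effective degrees of freedom of the ridge-penalized Poisson fit,
# tr[(H + 2*lambda*Omega)^{-1} H] with H = t(Z) %*% W %*% Z, following Eq. (29).
ridge_poisson_df <- function(Z, mu, Omega, lambda) {
  W <- diag(mu)                              # Poisson variances at the fitted means
  H <- t(Z) %*% W %*% Z                      # Hessian of the negative log-likelihood
  sum(diag(solve(H + 2 * lambda * Omega, H)))
}

# Toy illustration: the df shrinks from ncol(Z) towards the number of
# unpenalized directions of Omega as lambda grows.
set.seed(1)
Z <- cbind(1, poly(seq(0, 1, length.out = 40), degree = 5))
mu <- rexp(40)
Omega <- diag(c(0, rep(1, 5)))               # no penalty on the intercept
sapply(c(0, 1, 10, 100), function(l) ridge_poisson_df(Z, mu, Omega, l))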
Figure 14: Estimated log-densities by Lindsey's method under different degrees of freedom. We use 10 transformed natural cubic splines as basis functions and increase the degrees of freedom in Lindsey's method from 2 to 7 (left to right).

Figure 14 plots the estimated log-densities of a Gaussian mixture distribution from Lindsey's method under different degrees of freedom. As the degrees of freedom (excluding the intercept) increase from 2 to 7, the log-density curves evolve from approximately quadratic to significantly bimodal. At 7 degrees of freedom, the estimated log-density is reasonably close to the underlying truth.

Appendix B. Linear Transformation for Basis Construction

We expand on the linear transformation used in the basis construction in Section 3. Let the eigen-decomposition of Ω be UDU⊤, where U ∈ R^{k×k} is orthonormal and D ∈ R^{k×k} is a diagonal matrix with non-negative diagonal values ordered decreasingly. Define the linearly transformed cubic spline bases z̃(y) = U⊤z(y) and the corresponding coefficients β̃ = U⊤β. Then z̃(y)⊤β̃ = z(y)⊤β and Ω̃ = U⊤ΩU = U⊤UDU⊤U = D. Therefore, z(y) ↦ U⊤z(y) is the desired linear transformation.
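Numerically, the transformation is one symmetric eigen-decomposition. A minimal sketch follows, using an arbitrary symmetric positive semi-definite matrix as a stand-in for the spline penalty Ω.

# Sketch: transform the spline bases so that the penalty matrix becomes diagonal,
# as in Appendix B. Omega below is an arbitrary PSD stand-in for the spline penalty.
set.seed(1)
k <- 6
A <- matrix(rnorm(k * k), k, k)
Omega <- crossprod(A)                             # symmetric positive semi-definite

eig <- eigen(Omega, symmetric = TRUE)             # Omega = U D U'
U <- eig$vectors

transform_basis <- function(z) drop(t(U) %*% z)   # z(y) -> U' z(y)

Omega_tilde <- t(U) %*% Omega %*% U               # transformed penalty
max(abs(Omega_tilde - diag(eig$values)))          # numerically zero: the penalty is diagonal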
Appendix C. Approximation Performance of Proposition 2

In Figures 15 and 16, we empirically demonstrate the efficacy of the quadratic approximation suggested by Proposition 2. We let the conditional densities depend solely on x(1) and either jump (Figure 15) or vary continuously (Figure 16) in conditional variance, modality, or skewness. We observe that at candidate splits where the left child and the right child are similar, e.g., all the candidate splits based on x(2), the quadratic form and the exact log-likelihood difference are almost the same. At candidate splits where the left child and the right child are different, e.g., x(1) = 0 in Figure 15, the exact log-likelihood difference is large, and the quadratic form is sufficiently close to the difference to determine the optimal split. We remark that, to gain robustness, we set the quadratic approximation to zero if one of the candidate split's children contains fewer than 10 samples, which leads to the imperfect approximation at the boundary candidate splits.

Figure 15: Comparison of the log-likelihood difference (13) and the quadratic approximation in Proposition 2 for tree-structured conditional densities. There are two subregions: x(1) ≤ 0 and x(1) > 0. In different subregions, the conditional densities differ in conditional variance (σ²), modality, or skewness (each column corresponds to a type of difference); within the same subregion, the conditional density does not change. In each trial, we sample 100 observations and 5 covariates. We use 5 transformed natural cubic splines as basis functions and 30 candidate splits for each covariate, equally spread across their ranges. We plot the log-likelihood difference (red) and the quadratic approximation (black) for candidate splits of x(1) (influential) and x(2) (nuisance), respectively. The results are aggregated over 100 repetitions.

Appendix D. Stopping Criteria for LinCDE Trees

We discuss stopping criteria for LinCDE trees. There is no universally optimal choice. If we build a single tree learner, the preferred strategy, at least for regression and classification trees according to Breiman et al. (1984), is to grow a large tree and then prune it using cost-complexity pruning. If we train a random forest learner, then Breiman (2001) recommends stopping the splitting process only when some minimum node size, defaulting to 5 in the package randomForest (Liaw and Wiener, 2002), is reached. As for tree boosting, Friedman (2001) uses trees with the same number of terminal nodes; the number of terminal nodes is treated as a hyper-parameter of the boosting algorithm and tuned to maximize the performance on the data set at hand. The aforementioned stopping criteria generalize to LinCDE trees straightforwardly. Two options are currently available in the code: (1) stop when the tree depth reaches some prefixed level; (2) stop when the decrease in the objective fails to surpass a certain threshold, a greedy top-down approach. We do not recommend the criterion of stopping only when a terminal node is pure, because there would be insufficient samples at terminal nodes for density estimation.

Figure 16: Comparison of the log-likelihood difference (13) and the quadratic approximation in Proposition 2 for continuously varying conditional densities. The conditional densities vary continuously with |x(1)| in variance (σ²), modality, or skewness (each column corresponds to a type of difference). In each trial, we sample 100 observations and 5 covariates. We use 5 transformed natural cubic splines as basis functions and 30 candidate splits for each covariate, equally spread across their ranges. We plot the log-likelihood difference (red) and the quadratic approximation (black) for candidate splits of x(1) (influential) and x(2) (nuisance), respectively. The results are aggregated over 100 repetitions.

Appendix E. Approximation Error for LinCDE Boosting

We characterize the expressiveness of LinCDE boosting's function class (18) with splines z(y). Without loss of generality, we assume f_{Y|X}(y|x) is supported on [0, 1] × [0, 1]^d. Let S^s_2 be the Sobolev space of functions h(y) on [0, 1] for which h^{(s−1)} is absolutely continuous and ∫ (h^{(s)}(y))² dy is finite. Denote the space of tree boosting models with arbitrary depth and number of trees by N_1 := {Σ_{i=1}^{N} a_i 1{x(j) ∈ I_j} : N ≥ 1, I_j ⊆ [0, 1]}. Denote by Ω_{k,ζ} the space of splines of order ζ on [0, 1] with equally spaced knots {1, . . . , k − ζ + 1}·δ, δ = 1/(k − ζ + 2), ζ ≤ k/2. Let F_k be the exponential family where z_j(y) ∈ Ω_{k,ζ}, ζ ≥ s − 1, and the parameter functions β_j(x) ∈ N_1, 1 ≤ j ≤ k.

Proposition 8 Assume the log-conditional-density function h(y|x) := log(f_{Y|X}(y|x)) ∈ S^s_2, s ≤ k/2, and ‖h(·|x)‖_∞, ‖h^{(s)}(·|x)‖_2 ≤ B. Then

min_{f̃_{Y|X} ∈ F_k} E[ D( f_{Y|X}(y|X) || f̃_{Y|X}(y|X) ) ] ≤ C / k^{2s},

where C > 0 is a constant depending on B. The proof of Proposition 8 can be found in Section F.

Appendix F. Proofs

In this section, we provide proofs of the aforementioned propositions and claims.

F.1 Proof of Claim 1

Proof. For u ∈ R^k, u lies in the null space of Ω if and only if u⊤Ωu = 0. By the definition of Ω,

0 = u⊤Ωu = ∫ (z‴(y)⊤u)² dy,

which implies that (z(y)⊤u)‴ = z‴(y)⊤u = 0 almost everywhere. On one hand, if z(y)⊤u is linear or quadratic, then (z(y)⊤u)‴ is automatically zero everywhere. On the other hand, since z(y) are cubic spline bases, z(y)⊤u is piece-wise cubic, and (z(y)⊤u)‴ = 0 implies that z(y)⊤u is piece-wise linear or quadratic.
Because z (y) and z (y) are both continuous, z(y) u is also second-order continuous, z(y) u must be linear or quadratic in y. F.2 Proof of Proposition 7 Proof. For arbitrary µb, ηb such that ηb = log(µb), the second order Taylor expansion at µ b of the Poisson log-likelihood of the b-th sample ℓ(yb; ηb) is ℓ(yb; ηb) ℓ(yb; η b) + ηb ℓ(yb; η b)(ηb η b) + 1 η2 b ℓ(yb; η b)(ηb η b)2. (31) Gao and Hastie By definition, ℓ(yb; η b) = ybη b κbeη b + C, ηb ℓ(yb; η b) = yb κbeη b , 2 η2 b ℓ(yb; η b) = κbeη b , (32) for some constant C independent of η b. Plug Eq. (32) into Eq. (31), ℓ(yb; ηb) ybη b κbeη b + (yb κbeη b )(ηb η b) 1 2κbeη b (ηb η b)2 + C 2κbeη b ηb η b yb κbeη b 2 + r(yb, η b), (33) where the remainder r(yb, η b) is independent of ηb. Plug ηb = log(κb) + z b β, µ b = eη b in Eq. (33), we sum over all samples and finish the proof of Eq. (28). We different the quadratic approximation (28) with respect to β and obtain the score function Z W(Zβ + K ζ) 2λ Ωβ = 0. We solve the score function to obtain ˆβ = Z WZ + 2λ Ω 1 Z W(ζ K). Then plug ˆη = K + Z ˆβ into Eq. (27) and we get b=1 cov(ˆηb, yb) = tr cov Z Z WZ + 2λ Ω 1 Z W(ζ K), Y = tr Z Z WZ + 2λ Ω 1 Z Wcov(ζ, Y ) . Notice that cov(ζb, yb ) = cov(yb, yb )/µ b = 1{b=b }, then df = tr Z Z WZ + 2λ Ω 1 Z W = tr Z WZ + 2λ Ω 1 Z WZ . This gives Eq. (29). The Hessian of the negative Poisson log-likelihood at β is Hβ = 2 β|β=β b=1 ℓ(yb; ηb) = b=1 2 β|β=β κbez b β b=1 κbez b β zbz b = Z WZ . Lin CDE: Conditional Density Estimation via Lindsey s Method F.3 Proof of Proposition 2 Proof. We simplify the differences in the log-likelihood, ℓ(R, s) = X log(κ(yi)) + z(yi) ˆβRs,L ψ(ˆβRs,L) log(κ(yi)) + z(yi) ˆβRs,R ψ(ˆβRs,R) log(κ(yi)) + z(yi) ˆβR ψ(ˆβR) λn Rs,L ˆβ Rs,LΩˆβRs,L λn Rs,R ˆβ Rs,RΩˆβRs,R + λn R ˆβ RΩˆβR =n Rs,L z Rs,L(ˆβRs,L ˆβR) + n Rs,R z Rs,R(ˆβRs,R ˆβR) n Rs,L(ψ(ˆβRs,L) ψ(ˆβR)) n Rs,R(ψ(ˆβRs,R) ψ(ˆβR)) λn Rs,L ˆβ Rs,LΩˆβRs,L λn Rs,R ˆβ Rs,RΩˆβRs,R + λn R ˆβ RΩˆβR, where we use n Rs,L z Rs,L + n Rs,R z Rs,R = n z R. The score equation of ˆβR implies z R = ψ(ˆβR) + 2λΩˆβR. (35) and similarly for ˆβRs,L, ˆβRs,R. Plug Eq. (35) into Eq. (34), we obtain ℓ(R, s) =n Rs,L ψ(ˆβRs,L) (ˆβRs,L ˆβR) + n Rs,R ψ(ˆβRs,R) (ˆβRs,R ˆβR) n Rs,L(ψ(ˆβRs,L) ψ(ˆβR)) n Rs,R(ψ(ˆβRs,R) ψ(ˆβR)) + λn Rs,L ˆβRs,L ˆβR Ω ˆβRs,L ˆβR + λn Rs,R ˆβRs,R ˆβR Ω ˆβRs,R ˆβR . By the Taylor expansion of ψ(β) and ψ(β) at ˆβR, ψ(ˆβRs,L) (ˆβRs,L ˆβR) (ψ(ˆβRs,L) ψ(ˆβR)) = ψ(ˆβRs,L) (ˆβRs,L ˆβR) ψ(ˆβR) (ˆβRs,L ˆβR) 2(ˆβRs,L ˆβR) 2ψ(ˆβR)(ˆβRs,L ˆβR) + O( (ˆβRs,L ˆβR) 3 2) 2(ˆβRs,L ˆβR) 2ψ(ˆβR)(ˆβRs,L ˆβR) + O( (ˆβRs,L ˆβR) 3 2). Gao and Hastie Plug Eq. (37) into Eq. (36), we get 2n Rs,L(ˆβRs,L ˆβR) 2ψ(ˆβR)(ˆβRs,L ˆβR) 2n Rs,R(ˆβRs,R ˆβR) 2ψ(ˆβR)(ˆβRs,R ˆβR) + λn Rs,L ˆβRs,L ˆβR Ω ˆβRs,L ˆβR + λn Rs,R ˆβRs,R ˆβR Ω ˆβRs,R ˆβR + O (ˆβRs,L ˆβR) 3 2 + (ˆβRs,R ˆβR) 3 2 2n Rs,L(ˆβRs,L ˆβR) 2ψ(ˆβR) + 2λΩ (ˆβRs,L ˆβR) 2n Rs,R(ˆβRs,R ˆβR) 2ψ(ˆβR) + 2λΩ (ˆβRs,R ˆβR) + O (ˆβRs,L ˆβR) 3 2 + (ˆβRs,R ˆβR) 3 2 . Finally by Eq. (35) and the assumption that 2ψ(ˆβR) + 2λΩis invertible, ˆβRs,L ˆβR = 2ψ(ˆβR) + 2λΩ 1 ( z Rz,L z R) + O (ˆβRs,L ˆβR) 2 2 . (39) Then ˆβRs,L ˆβR z Rs,L z R and similarly for the right child. Plug Eq. (39) into Eq. (38), 2n Rs,L( z Rs,L z R) 2ψ(ˆβR) + 2λΩ 1 ( z Rs,L z R) 2n Rs,R( z Rs,R z R) 2ψ(ˆβR) + 2λΩ 1 ( z Rs,R z R) + O ( z Rs,L z R) 3 2 + ( z Rs,R z R) 3 2 =n Rs,Ln Rs,L 2n R ( z Rs,L z Rs,R) 2ψ(ˆβR) + 2λΩ 1 ( z Rs,L z Rs,R) + O ( z Rs,L z R) 3 2 + ( z Rs,R z R) 3 2 , and we finish the proof. 
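As a small numerical sanity check of the closed-form ridge solution derived in the proof of Proposition 7 above (in clean notation, ˆβ = (Z⊤WZ + 2λ̃Ω)^{−1} Z⊤W(ζ − K)), the following sketch verifies with made-up inputs that it maximizes the quadratic approximation (28). It is illustrative only and not part of the paper.

# Sanity check: beta_hat = (Z'WZ + 2*lambda*Omega)^{-1} Z'W(zeta - K) maximizes
# -(1/2)(Z beta + K - zeta)' W (Z beta + K - zeta) - lambda * beta' Omega beta.
set.seed(1)
B <- 30; k <- 4
Z <- cbind(1, matrix(rnorm(B * (k - 1)), B, k - 1))   # basis matrix with intercept
K <- rnorm(B); zeta <- rnorm(B)                       # offsets and working responses
W <- diag(rexp(B))                                    # positive weights
Omega <- diag(c(0, rep(1, k - 1))); lambda <- 0.5     # penalty (intercept unpenalized)

beta_hat <- solve(t(Z) %*% W %*% Z + 2 * lambda * Omega, t(Z) %*% W %*% (zeta - K))

quad_obj <- function(beta) {
  r <- Z %*% beta + K - zeta
  drop(-0.5 * t(r) %*% W %*% r - lambda * t(beta) %*% Omega %*% beta)
}
quad_obj(beta_hat) >= quad_obj(beta_hat + rnorm(k, sd = 0.01))   # TRUE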
F.4 Proof of Claim 9 Claim 9 Assume that in R, y is supported on the midpoints {yb}, then the covariance matrix approximation by Lindsey s method equals 2ψ(ˆβR). Proof. If y is supported on the midpoints {yb}, then the discretization in Lindsey s method is accurate. As a result, the estimator of Lindsey s method is the exact loglikelihood maximizer ˆβR. Furthermore, the multinomial cell probabilities based on the Linsey s method s estimator is indeed the response distribution indexed by ˆβR. Next, by Lehmann and Romano (2006), 2ψ(ˆβR) equals the population covariance matrix of the sufficient statistics generated from the distribution indexed by ˆβR. Therefore, the covariance approximation by Lindsey s method equals 2ψ(ˆβR). Lin CDE: Conditional Density Estimation via Lindsey s Method F.5 Proof of Proposition 3 Proof. We order the observations and count responses in each bin ( O(n R)). Next, we run Newton-Rapson algorithm. In each iteration, the computation of the gradient vector b=1 nb z(yb)(1 ez(yb) β+β0). is O(k B), where z = [z , 1]. The computation of the Hessian matrix b=1 nbez(yb) β+β0 z(yb) z(yb) takes O(k2B) operations. Finally, one Newton-Raphson update takes O(k3) operations. Newton-Raphson algorithm is superlinear (Boyd and Vandenberghe, 2004), thus we can regard the number of Newton-Raphson updates of constant order. The cell probabilities can be computed in O(k B) time, and the covariance matrix takes O(k2B) operations. To compute the quadratic approximation, {n Rs,L, n Rs,R, z Rs,L, z Rs,R} can be computed in O(dn Rk) by scanning through the samples once per coordinate. (If observations are not ordered beforehand, we will have O(dn Rk), where O denotes the order up to log terms.) Adding the diagonal matrix Ωis cheap, and the matrix inversion takes O(k3) operations. For a candidate split, the quadratic term in Proposition 2 further takes O(k2) time. Choosing the optimal split takes O(S). In summary, the complexity is O(dn Rk + k2B + k3 + Sk2). F.6 Proof of Proposition 8 The results are built on the expressiveness of logspline density models and tree boosting models. We first develop an intermediate property for unconditional densities, and then combine it with the representability of tree boosting models with arbitrary depth and number of trees. We will use C to denote constants depending on B and the concrete values may vary from line to line. For unconditional densities, let the true log-density be h(y) Ss 2 and h(s)( ) 2 B. By Barron and Sheu (1991), there exists a function hk Ωk,ζ such that h( ) hk( ) 2 Cδs h(s)( ) 2 C h( ) hk( ) Cδs 1/2 h(s)( ) 2 C ks 1/2 , (40) where we use δ = 1/(k ζ + 2) 2/k. Next, we turn to conditional densities. For |h(y | x)| Ss 2 and h( | x) B, by Barron and Sheu (1991, Lemma 3), there exists a unique vector solution β(x) to the minimization problem β(x) := arg min β(x) Rk D f Y |X(y | X = x) κ(y)ez(y) β(x) ψ( β(x)) . We next show β(x) is measurable on [0, 1]d. In fact, define the mapping Φ(β) := E h z(Y )κ(Y )ez(Y ) β ψ(β)i . Gao and Hastie Again by Barron and Sheu (1991, Lemma 3), Φ is invertible and thus the inverse Φ 1 is differentiable. By the optimality of β(x), we have Φ(β(x)) := E h z(Y )κ(Y )ez(Y ) β(x) ψ(β(x))i = E z(Y )f Y |X(Y | X = x) . Notice that log(f Y |X(Y | X = x)) Ss 2, then f Y |X(Y | X = x) is at least (s 1)-th order smooth and as a result Φ(β(x)) is differentiable with regard to x. Finally, since β(x) = Φ 1 Φ(β(x)), we conclude β(x) is differentiable thus measurable on [0, 1]d. 
Note that E z(Y )f Y |X(Y | X = x) is bounded since |h(y | x)| B, then β(x) Cβ for some Cβ depending on B. Now we are ready to approximate β(x) by tree boosting models. By the simple function approximation theorem, for any bounded measurable function on [0, 1], there exists a sequence of simple functions, i.e., linear combinations of indicator functions, such that converge to the target function point-wisely and and in L1. Recall the tree boosting models {PN i=1 ai1x(j) Ij, N 1, Ij [0, 1]}, then given a measurable function βj(x) supported on the hyper-cube [0, 1]d, for any ε > 0, there exists a tree boosting model βj,ε(x) N1 such that |βj,ε(x)| |β(x)| + ε on [0, 1]d and |βj(x) βj,ε(x)| ε, x Dε [0, 1]d, m(Dε) 1 ε, where Dε is some measurable set and m(Dε) denotes the Lebesgue measure of the subset Dε. By the triangle inequality and Cauchy-Schwarz inequality, h(y | x) z(y) βε(x) h(y | x) z(y) β(x) + z(y) β(x) z(y) βε(x) h(y | x) z(y) β(x) + z(y) 2 β(x) βε(x) 2. On Dε, β(x) βε(x) 2 h( | x) z( ) βε(x) 2 C 1 where we use (40). On Dc ε, since |h(y | x)| B and |βj,ε(x)| Cβ + ε, then h( | x) z( ) βε(x) h( | x) + z( ) βε(x) C. Let fε(y | x) := κ(y)ez(y) βε(x) ψ(βε(x)), then by Lemma 1 in Barron and Sheu (1991), E [D (f(y | X) fε(y | X))] = E [D (f(y | X) fε(y | X)) , X Dε] + E [D (f(y | X) fε(y | X)) , X Dc ε] k2s + kε2 z( ) 2 2 + ε . Since (41) is valid for arbitrary ε > 0, let ε 0 and we finish the proof. Lin CDE: Conditional Density Estimation via Lindsey s Method F.7 Proof of Proposition 4 Proof. Without loss of generality, assume R is the covariate space. For simplicity, we omit the superscript of βt(x) and denote the natural parameter functions by β(x). We first show γ = arg maxγ ℓ(R; β(x) + γ). For λ = 0, if γ = 0, the KKT condition of the last Poisson regression is b=1 nbz(yb) = b=1 pb(R; β(x) + γ)z(yb), (42) If in R, Y is supported on the midpoints {yb}, ψ(β(xi) + γ) = 1 b=1 pb(β(xi) + γ)z(yb). (43) Therefore, by Eq. (42) and Eq. (43), i=1 ψ(β(xi) + γ) = 1 b=1 pb(β(xi) + γ)z(yb) = b=1 z(yb) 1 i=1 pb(β(xi) + γ) b=1 pb(R; β(x) + γ)z(yb) = 1 b=1 nbz(yb) = 1 which implies the KKT condition of maximizing ℓ(R; β(X)+γ) is satisfied at γ . Since the log-likelihood ℓ(R; β(X) + γ) is strictly concave, then γ is the unique maximizer. We next prove the algorithm converges. Plug in the MLE estimate of the intercept of the Poisson regression into the log-likelihood and we have ℓpoisson(R; β(x) + γ + γ) =ℓpoisson(R; β(x) + γ) + b=1 nbz(yb) γ b=1 pb(R; β(x) + γ) exp n z(yb) γ o! | {z } :=(I) By Jensen s inequality, b=1 pb(β(xi) + γ) exp n z (yb) γ o! b=1 pb(β(xi) + γ) exp n z (yb) γ o! Gao and Hastie Notice that ψ(β(xi) + γ + γ) ψ(β(xi) + γ) PB b=1 ez(yb) (γ+ γ) PB b=1 ez(yb) γ b=1 pb(β(xi) + γ) exp n z(yb) γ o! Therefore, by Eq. (45), Eq. (46) and Eq. (47), ℓ(R; β(x) + γ + γ) ℓ(R; β(x) + γ) i=1 z(yi) γ log b=1 pb(β(xi) + γ) exp n z(yb) γ o! i=1 z(Yi) γ n (I) = b=1 nbz(yb) γ n (I) = ℓpoisson(R; β(x) + γ + γ) ℓpoisson(R; β(x) + γ). By the strong concavity of ℓpoisson, there exists a universal δ > 0 such that ℓpoisson(R; β(x) + γ + γ) ℓpoisson(R; β(x) + γ) δ γ 2 2. ℓ(R; β(x) + γ + γ) ℓ(R; β(x) + γ) ℓpoisson(R; β(x) + γ + γ) ℓpoisson(R; β(x) + γ) δ γ 2 2. Since the conditional log-likelihood ℓ(R; β(x)+γ) is bounded, thus can not increase linearly forever. Therefore, γ 2 0, i.e. the algorithm converges. Lin CDE: Conditional Density Estimation via Lindsey s Method F.8 Proof of Proposition 5 Proof. The proof follows that of Proposition 2. 
We simplify the difference of log-likelihoods, ℓt(R, s) = X z(yi) γt Rs,L ψ(βt(Xi) + γt Rs,L) z(yi) γt Rs,R ψ(βt(Xi) + γt Rs,R) z(yi) γt R ψ(βt(Xi) + γt R) λn Rs,Lγt Rs,LΩγt Rs,L λn Rs,Rγt Rs,RΩγt Rs,R + λn Rγt R Ωγt R =n Rs,L z Rs,L(γt Rs,L γt R) + n Rs,R z Rs,R(γt Rs,R γt R) xi Rs,L ψ(βt(Xi) + γt Rs,L) ψ(βt(Xi) + γt R) xi Rs,R ψ(βt(Xi) + γt Rs,R) ψ(βt(Xi) + γt R) λn Rs,Lγt Rs,LΩγt Rs,L λn Rs,Rγt Rs,RΩγt Rs,R + λn Rγt R Ωγt R. The score equation of γt R implies i=1 ψ(βt(Xi) + γt R) + 2λΩγt R, (49) and similarly for γt Rs,L, γt Rs,R. Plug Eq. (49) into Eq. (48), we obtain ℓ(R, s) = X xi Rs,L ψ(βt(Xi) + γt Rs,L) (γt Rs,L γt R) xi Rs,R ψ(βt(Xi) + γt Rs,R) (γt Rs,R γt R) ψ(βt(Xi) + γt Rs,L) ψ(βt(Xi) + γt R) ψ(βt(Xi) + γt Rs,R) ψ(βt(Xi) + γt R) + λn Rs,L γt Rs,L γt R Ω γt Rs,L γt R + λn Rs,R γt Rs,R γt R Ω γt Rs,R γt R . Gao and Hastie By the Taylor expansion of ψ(β) and ψ(β), ψ(βt(Xi) + γt Rs,L) (γt Rs,L γt R) (ψ(βt(Xi) + γt Rs,L) ψ(βt(Xi) + γt R)) = ψ(βt(Xi) + γt Rs,L) (γt Rs,L γt R) ψ(βt(Xi) + γt R) (γt Rs,L γt R) 2(γt Rs,L γt R) 2ψ(βt(Xi) + γt R)(γt Rs,L γt R) + O( (γt Rs,L γt R) 3 2) 2(γt Rs,L γt R) 2ψ(βt(Xi) + γt R)(γt Rs,L γt R) + O( (γt Rs,L γt R) 3 2). Plug Eq. (51) into Eq. (50), we get xi Rs,L (γt Rs,L γt R) 2ψ(βt(Xi) + γt R)(γt Rs,L γt R) xi Rs,R (γt Rs,R γt R) 2ψ(βt(Xi) + γt R)(γt Rs,R γt R) + λn Rs,L γt Rs,L γt R Ω γt Rs,L γt R + λn Rs,R γt Rs,R γt R Ω γt Rs,R γt R + O (γt Rs,L γt R) 3 2 + (γt Rs,R γt R) 3 2 2(γt Rs,L γt R) xi Rs,L 2ψ(βt(Xi) + γt R) + 2λn Rs,LΩ (γt Rs,L γt R) 2(γt Rs,R γt R) xi Rs,R 2ψ(βt(Xi) + γt R) + 2λn Rs,RΩ (γt Rs,R γt R) + O (γt Rs,L γt R) 3 2 + (γt Rs,R γt R) 3 2 . By Eq. (49), z Rs,L z Rs,R = z Rz,L z R xi Rs,L 2ψ(βt(Xi) + λt R)(λt Rs,L λt R) zt Rs,L zt Rs,R + O (γt Rs,L γt R) 2 2 , Lin CDE: Conditional Density Estimation via Lindsey s Method where zt Rs,L := P xi Rs,L ψ(βt(Xi) + λt R)/n Rs,L, and similarly for zt Rs,R. Finally, by the assumption of invertibility, γt Rs,L γt R xi Rs,L 2ψ(βt(Xi) + γt R) + 2λΩ n R z Rs,L z Rs,R zt Rs,L zt Rs,R + O (γt Rs,L γt R) 2 2 xi Rs,L 2ψ(βt(Xi) + γt R) + 2λΩ 1 rt Rz,L rt Rz,R + O (γt Rs,L γt R) 2 2 . Plug Eq. (53) into Eq. (52), = n Rs,Ln2 Rs,R 2n2 R ( rt Rs,L rt Rs,R) xi Rs,L 2ψ(βt(Xi) + γt R) + 2λΩ ( rt Rs,L rt Rs,R) + n2 Rs,Ln Rs,R 2n2 R ( rt Rs,L rt Rs,R) xi Rs,R 2ψ(βt(Xi) + γt R) + 2λΩ ( rt Rs,L rt Rs,R) + O ( rt Rs,L rt R) 3 2 + ( rt Rs,R rt R) 3 2 , and we finish the proof. F.9 Proof of Proposition 6 Proof. We first evaluate the complexity of the fitting step of Lin CDE boosting. The computation of offset is O(n Rk B), and we store the cell probabilities pb(βt(xi)). The first step of the fitting runs a penalized Lindsey s method and takes O(n R + k2B + k3) as shown in the proof of 3. The second step in the fitting takes O(n RB + k B), and we update the cell probabilities to pb(βt(xi) + γt R). It takes O(n Rk2B) to compute the surrogate normalization matrix Ψt(γt R) and an extra O(k3) for matrix inversion. It takes O(dn Rk B) to compute all average residuals rt R. Finally, the quadratic approximation for all candidate splits and choosing the optimal one takes O(Sk2). In summary, the time complexity is O(dn Rk B + n Rk2B + k3 + Sk2). Appendix G. Additional Figures In this section, we provide additional numerical results. Gao and Hastie truth Lindsey 0.0 0.2 0.4 0.6 0.8 truth Lin CDE tree Figure 17: Estimation of bimodal and skewed densities using our regularized Lindsey s method. In the left panel, the true density is a Gaussian mixture, and in the right panel, the true density is a beta distribution. 
In each trial, we sample 400 observations. We generate 10 natural cubic splines with knots equally spread across the range of observations, and tune the penalty parameter λ to achieve an effective 5 degrees of freedom. We repeat each setting 100 times and plot the average fits against the true densities.

G.1 Additional Simulations

Figure 17 displays the performance of Lindsey's method in two toy examples. One target density is bimodal, and the other is skewed. In Lindsey's method, we use natural cubic splines, arguably the most commonly used splines because they provide smooth, seamless fits and are easy to implement, and transform them as discussed. In both examples, the estimated densities of Lindsey's method match the true densities quite closely, except for tiny gaps at the boundaries and peaks.

Figure 18 presents the goodness-of-fit measure (24) of the three methods under the LGD and LGGMD settings with 50 bins. LinCDE boosting benefits from finer grids. In contrast, distribution boosting and quantile regression forest are hurt by larger numbers of bins due to higher variances, especially in the LGGMD setting where the densities themselves are bumpy.

Similar to the AAE, another commonly used CDF-based metric is the Cramér-von Mises (CVM) distance (Friedman, 2019):

(1/(nm)) Σ_{i=1}^{n} Σ_{j=1}^{m} ( ˆF(q(u_j | x_i) | x_i) − F(q(u_j | x_i) | x_i) )², (54)

where {u_j}_{j=1}^{m} is an evenly spaced grid on [0, 1] and q(u | x) denotes the u quantile at the covariate value x. Figure 19 depicts the Cramér-von Mises distance. In both settings, LinCDE boosting outperforms the other two methods. The results are consistent with those of the AAE.

Figure 18: Box plots of the goodness-of-fit measure (24) in the LGD (left panel) and LGGMD (right panel) settings. The densities are computed as (25) with 50 bins. QRF stands for quantile regression forest, DB for distribution boosting, and LinCDE for LinCDE boosting.

Figure 19: Box plots of the CVM distance (54) in the LGD (left panel) and LGGMD (right panel) settings. CVM is a metric naturally defined for conditional CDF estimates (Friedman, 2019), and a smaller value indicates a better estimate.

In Figure 20, we plot the coverages of the 50% prediction intervals. The results are consistent with those of the 90% prediction intervals.

Figure 20: Coverage of 50% prediction intervals (ideally 50%) in the LGD (left panel) and LGGMD (right panel) settings.
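As a concrete illustration of the kind of fit shown in Figure 17, the sketch below runs an unpenalized version of Lindsey's method on a Gaussian mixture sample: bin the responses, fit a Poisson GLM of the counts on a natural cubic spline basis, and normalize. The paper's version instead uses transformed bases with a ridge penalty tuned to a target degrees of freedom; the sample, bin count, and spline degrees of freedom below are illustrative.

# Sketch: Lindsey's method (unpenalized) for a bimodal target density.
library(splines)
set.seed(1)
y <- c(rnorm(200, -1.5, 0.5), rnorm(200, 1.5, 0.7))     # Gaussian mixture sample

B <- 50                                                  # number of bins
breaks <- seq(min(y), max(y), length.out = B + 1)
mid <- (breaks[-1] + breaks[-(B + 1)]) / 2               # bin midpoints
counts <- as.vector(table(cut(y, breaks, include.lowest = TRUE)))

fit <- glm(counts ~ ns(mid, df = 7), family = poisson)   # Poisson regression on bin counts
dens <- exp(predict(fit))                                # fitted means, proportional to the density
dens <- dens / (sum(dens) * diff(breaks)[1])             # normalize to integrate to one

# plot(mid, dens, type = "l"); rug(y)                    # compare with the sample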
G.2 Air Pollution Data

The air pollution data (Wu and Dominici, 2020) focuses on PM2.5 exposures in the United States.14 The responses are 3108 county-level PM2.5 exposures averaged from 2000 to 2018. We incorporate 16 weather, socio-economic, and demographic covariates, such as winter relative humidity, house value, and proportion of white people. The target is estimating the conditional density of the average PM2.5 exposure. We split the data into training, validation, and test sets (proportion 2:1:1) and tune on the hold-out validation data. We also identify influential covariates, which may help find the culprits of air pollution and manage regional air quality.

The PM2.5 exposure varies vastly across counties. For example, the average PM2.5 of west coast counties may soar up to 12 µg/m³ due to frequent wildfires, while those of rural areas in central America are typically below 8 µg/m³. The difference in PM2.5 levels induces the disjoint-support issue for LinCDE discussed in Section 6, and thus we employ the centering.

14. PM2.5: particulate matter 2.5 microns or less in diameter. The data represents 98% of the population of the United States.

The comparisons of log-likelihood and quantile losses are summarized in Table 4. Centered LinCDE performs the best in log-likelihood and is comparable in quantile losses. In Figure 21, we display how the estimated conditional densities change with respect to winter relative humidity and house value, the top influential covariates identified by LinCDE. Winter relative humidity affects the locations of the conditional densities: as the humidity goes up, the PM2.5 concentration first increases, then decreases. One hypothesis for the inverted U-shaped relationship is that in dry counties, wind speed, which is inversely correlated with the humidity, is a powerful factor for PM2.5, while in wet counties, moisture particles that accelerate the deposition process of PM2.5 are the driving force. House value is impactful to the scales of the conditional densities: more expensive houses are associated with more variable PM2.5 exposures. We conjecture that rural areas are generally low in PM2.5 while urban areas vary; higher house values suggest a higher proportion of urban areas and thus less homogeneous air quality.

data            method   -log-like     5%             25%            50%            75%            95%
Air pollution   QRF      0.95 (0.02)   0.077 (0.002)  0.189 (0.005)  0.229 (0.007)  0.206 (0.006)  0.087 (0.002)
                DB       1.27 (0.04)   0.099 (0.007)  0.247 (0.009)  0.300 (0.010)  0.244 (0.008)  0.093 (0.006)
                LinCDE   0.89 (0.03)   0.063 (0.003)  0.185 (0.006)  0.233 (0.007)  0.194 (0.007)  0.072 (0.004)

Table 4: Comparison of LinCDE boosting, QRF, and DB on the air pollution data. We display the negative log-likelihoods and quantile losses at the {5%, 25%, 50%, 75%, 95%} levels. Standard deviations are in parentheses.

Barry C. Arnold, Enrique Castillo, José-María Sarabia, and José M. Sarabia. Conditional Specification of Statistical Models. Springer, 1999.

Adelchi Azzalini and Adrian W. Bowman. A look at some data on the Old Faithful geyser. Journal of the Royal Statistical Society: Series C (Applied Statistics), 39(3):357-365, 1990.

Nicholas M. Ball, Robert J. Brunner, Adam D. Myers, Natalie E. Strand, Stacey L. Alberts, and David Tcheng. Robust machine learning applied to astronomical data sets. III. Probabilistic photometric redshifts for galaxies and quasars in the SDSS and GALEX. The Astrophysical Journal, 683(1):12, 2008.

Vic Barnett. The ordering of multivariate data (with comments). Journal of the Royal Statistical Society: Series A (General), 139(3):318-354, 1976.

Andrew R. Barron and Chyong-Hwa Sheu. Approximation of density functions by sequences of exponential families. The Annals of Statistics, 19(3):1347-1369, 1991.

Alexandre Belloni, Victor Chernozhukov, Denis Chetverikov, and Iván Fernández-Val. Conditional quantile processes based on series or many regressors. Journal of Econometrics, 213(1):4-29, 2019.

Figure 21: LinCDE boosting applied to the air pollution data.
We plot the estimated densities of Lin CDE boosting across different levels of top influential covariates: winter relative humidity (left panel) and house value (right panel). Other covariates are fixed at the corresponding medians. Winter relative humidity is influential to the locations of the conditional densities: as the humidity goes up, the PM2.5 concentration first increases, then decreases. House value is impactful to the scales of the conditional densities: more expensive houses are associated with more variable PM2.5 exposures. Christopher M Bishop. Mixture density networks. Technical report, 1994. Christopher M Bishop. Pattern Recognition and Machine Learning. Springer, 2006. Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge university press, 2004. Leo Breiman. Random forests. Machine learning, 45(1):5 32, 2001. Leo Breiman, Jerome Friedman, Richard A Olshen, and Charles J Stone. Classification and Regression Trees. CRC press, 1984. Guillaume Carlier, Victor Chernozhukov, and Alfred Galichon. Vector quantile regression: an optimal transport approach. The Annals of Statistics, 44(3):1165 1192, 2016. Probal Chaudhuri. Global nonparametric estimation of conditional quantile functions and their derivatives. Journal of multivariate analysis, 39(2):246 269, 1991a. Probal Chaudhuri. Nonparametric estimates of regression quantiles and their local bahadur representation. The Annals of statistics, 19(2):760 777, 1991b. Probal Chaudhuri. On a geometric notion of quantiles for multivariate data. Journal of the American Statistical Association, 91(434):862 872, 1996. Lin CDE: Conditional Density Estimation via Lindsey s Method Probal Chaudhuri and Wei-Yin Loh. Nonparametric estimation of conditional quantiles using quantile regression trees. Bernoulli, 8(5):561 576, 2002. Victor Chernozhukov, Iv an Fern andez-Val, and Blaise Melly. Inference on counterfactual distributions. Econometrica, 81(6):2205 2268, 2013. Carol De Santis, Jiemin Ma, Leah Bryan, and Ahmedin Jemal. Breast cancer statistics, 2013. CA: a cancer journal for clinicians, 64(1):52 62, 2014. Jay L Devore and Roxy Peck. Statistics: the Exploration and Analysis of Data. Brooks/Cole, Cengage Learning, 5 edition, 2005. Bradley Efron. Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. Journal of the American Statistical Association, 99(465):96 104, 2004. Bradley Efron and Trevor Hastie. Computer Age Statistical Inference, volume 5. Cambridge University Press, 2016. Jianqing Fan, Qiwei Yao, and Howell Tong. Estimation of conditional densities and sensitivity measures in nonlinear dynamical systems. Biometrika, 83(1):189 206, 1996. Silverio Foresi and Franco Peracchi. The conditional distribution of excess returns: An empirical analysis. Journal of the American Statistical Association, 90(430):451 466, 1995. Jerome Friedman and Balasubramanian Narasimhan. con Tree: Contrast Trees and Boosting, 2020. R package version 0.2-4. Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189 1232, 2001. Jerome H Friedman. Contrast trees and distribution boosting. ar Xiv preprint ar Xiv:1912.03785, 2019. Peter Hall, Rodney CL Wolff, and Qiwei Yao. Methods for estimating a conditional distribution function. Journal of the American Statistical association, 94(445):154 163, 1999. Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Prediction, Inference and Data Mining. Springer, New York, second edition, 2009. 
Xuming He, Pin Ng, and Stephen Portnoy. Bivariate quantile smoothing splines. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(3):537 550, 1998. Rafael Izbicki and Ann B Lee. Nonparametric conditional density estimation in a highdimensional regression setting. Journal of Computational and Graphical Statistics, 25(4): 1297 1316, 2016. Roger Koenker. Quantile regression: 40 years on. Annual Review of Economics, 9:155 176, 2017. Gao and Hastie Roger Koenker and Gilbert Bassett Jr. Regression quantiles. Econometrica: journal of the Econometric Society, 46(1):33 50, 1978. Roger Koenker, Pin Ng, and Stephen Portnoy. Quantile smoothing splines. Biometrika, 81 (4):673 680, 1994. Linglong Kong and Ivan Mizera. Quantile tomography: using quantiles with multivariate data. Statistica Sinica, 22:1589 1610, 2012. Charles Kooperberg and Charles J Stone. A study of logspline density estimation. Computational Statistics & Data Analysis, 12(3):327 347, 1991. Charles Kooperberg and Charles J Stone. Logspline density estimation for censored data. Journal of Computational and Graphical Statistics, 1(4):301 328, 1992. Erich L Lehmann and Joseph P Romano. Testing Statistical Hypotheses. Springer, 2006. Youjuan Li, Yufeng Liu, and Ji Zhu. Quantile regression in reproducing kernel hilbert spaces. Journal of the American Statistical Association, 102(477):255 268, 2007. Andy Liaw and Matthew Wiener. Classification and regression by randomforest. R News, 2(3):18 22, 2002. URL https://CRAN.R-project.org/doc/Rnews/. J. K. Lindsey. Comparison of probability distributions. Journal of the Royal Statistical Society: Series B (Methodological), 36(1):38 47, 1974. Benoˆıt R Mˆaacsse and Young K Truong. Conditional logspline density estimation. Canadian Journal of Statistics, 27(4):819 832, 1999. H. Markowitz. Portfolio selection. Investment under Uncertainty, 1959. URL https: //ci.nii.ac.jp/naid/10027840150/en/". Nicolai Meinshausen. Quantile regression forests. Journal of Machine Learning Research, 7(Jun):983 999, 2006. Nicolai Meinshausen. quantreg Forest: Quantile Regression Forests, 2017. URL https: //CRAN.R-project.org/package=quantreg Forest. R package version 1.3-7. Laura Moody, Suparna Mantha, Hong Chen, and Yuan-Xiang Pan. Computational methods to identify bimodal gene expression and facilitate personalized treatment in cancer patients. Journal of Biomedical Informatics: X, 1:100001, 2019. Panagis Moschopoulos and Joan G Staniswalis. Estimation given conditionals from an exponential family. The American Statistician, 48(4):271 275, 1994. Mark F Schilling, Ann E Watkins, and William Watkins. Is human height bimodal? The American Statistician, 56(3):223 229, 2002. Bernard W Silverman. On the estimation of a probability density function by the maximum penalized likelihood method. The Annals of Statistics, pages 795 810, 1982. Lin CDE: Conditional Density Estimation via Lindsey s Method Bernard W Silverman. Density Estimation for Statistics and Data Analysis, volume 26. CRC press, 1986. Charles J Stone. Consistent nonparametric regression. The Annals of Statistics, 5:595 645, 1977. Charles J Stone. Asymptotics for doubly flexible logspline response models. The annals of statistics, 19(4):1832 1854, 1991a. Charles J Stone. Multivariate log-spline conditional models. Technical report, University of California, Berkeley, 1991b. Masashi Sugiyama, Ichiro Takeuchi, Taiji Suzuki, Takafumi Kanamori, Hirotaka Hachiya, and Daisuke Okanohara. 
Conditional density estimation via least-squares density ratio estimation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 781 788, 2010. Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density Ratio Estimation in Machine Learning. Cambridge University Press, 2012. Grace Wahba. Smoothing noisy data with spline functions. Numerische mathematik, 24(5): 383 393, 1975. Grace Wahba. Optimal smoothing of density estimates, in classification and clustering. Academic Press, New York, pages 423 458, 1977. Grace Wahba. Spline Models for Observational Data. SIAM, 1990. Nethery R. C. Sabath M. B. Braun D. Wu, X. and F. Dominici. Air pollution and covid19 mortality in the united states: Strengths and limitations of an ecological regression analysis. Science advances, 6(45):p.eabd4049, 2020. Keming Yu and M. C. Jones. Local linear quantile regression. Journal of the American statistical Association, 93(441):228 237, 1998.