# projection_pursuit_density_ratio_estimation__0cb36585.pdf Projection Pursuit Density Ratio Estimation Meilin Wang 1 Wei Huang 2 Mingming Gong 2 Zheng Zhang 1 Density ratio estimation (DRE) is a paramount task in machine learning, for its broad applications across multiple domains, such as covariate shift adaptation, causal inference, independence tests and beyond. Parametric methods for estimating the density ratio possibly lead to biased results if models are misspecified, while conventional non-parametric methods suffer from the curse of dimensionality when the dimension of data is large. To address these challenges, in this paper, we propose a novel approach for DRE based on the projection pursuit (PP) approximation. The proposed method leverages PP to mitigate the impact of high dimensionality while retaining the model flexibility needed for the accuracy of DRE. We establish the consistency and the convergence rate for the proposed estimator. Experimental results demonstrate that our proposed method outperforms existing alternatives in various applications. 1. Introduction Density ratio estimation (DRE) is a fundamental concept in machine learning that focuses on directly estimating the ratio between two probability density functions (PDFs), avoiding the need to estimate each density function individually. DRE has widespread applications, such as covariate shift adaptation (Shimodaira, 2000; Sugiyama et al., 2007), outlier detection (Hido et al., 2011), independence test (Ai et al., 2024), mutual information estimation (Suzuki et al., 2009; Ai et al., 2024), importance sampling (Meng & Wong, 1996; Sinha et al., 2020), and treatment effect estimation in causal inference (Ai et al., 2021; Matsushita et al., 2023). See Sugiyama et al. (2012) for a comprehensive review of 1Center for Applied Statistics, Institute of Statistics and Big Data, Renmin University of China, Beijing, China 2School of Mathematics and Statistics, University of Melbourne, Melbourne, Australia. Correspondence to: Zheng Zhang . Proceedings of the 42𝑛𝑑International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s). DRE and its applications in machine learning. There are a variety of methods concerning the DRE problem in the existing literature, which can be divided into three paradigms: the parametric methods (Liu et al., 2013; 2017; Nagumo & Fujisawa, 2024), the nonparametric linear sieve methods (Kanamori et al., 2009; Sugiyama et al., 2008; 2010) and the nonparametric neural network methods (Nam & Sugiyama, 2015; Fang et al., 2020; Rhodes et al., 2020). The validity of the parametric modeling builds on the correct model specification for the density ratio. If the model is misspecified, the results may be severely biased. Nonparametric neural network methods (Nam & Sugiyama, 2015; Fang et al., 2020; Rhodes et al., 2020) can achieve superior performance without imposing parametric model assumptions; however, they require a substantial number of training samples to learn the extensive parameters involved. Furthermore, theoretical guarantees for neural network estimations remain insufficient. Nonparametric linear sieve estimation is a statistical technique that combines the flexibility of nonparametric methods with the simplicity of linear models (Chen, 2007b). The term linear sieve" refers to a sequence of increasingly complex linear models that are used to approximate the unknown function of interest. With different choices of criteria, e.g. 
squared-loss (Kanamori et al., 2009; Sugiyama et al., 2011; Yamada et al., 2013), Kullback Leibler divergence (Sugiyama et al., 2008; Tsuboi et al., 2009), the method of sieves is highly flexible in estimating complicated models. Unlike nonparametric neural network methods that can become excessively complex and computationally intensive, the sieve approach seeks to balance model complexity with interpretability and computational feasibility. Although the linear sieve methods enjoy computational convenience and theoretical support (Kanamori et al., 2010), they suffer from a significant challenge known as the curse of dimensionality. Specifically, as the dimension of the data increases, there is a notable and substantial deterioration in the performance of these linear sieve methods. This phenomenon is clearly evidenced by both ratio pattern visualization and estimation error in our experimental results. Conventional linear sieve methods, such as KLIEP (Sugiyama et al., 2007) and u LSIF (Kanamori et al., 2009) can correctly capture the pattern of density ratio functions when Projection Pursuit Density Ratio Estimation the dimension of variables is 2 (see in Figure 1). However, they experience a significant deterioration in performance when the dimension of variables increases to 10 (see Figure 2). The root mean squared logarithmic error (RMSLE) for the DRE based on conventional linear sieve methods further quantifies this performance decline, exhibiting a dramatic escalation as the dimension of variables increases (see in Figure 3). This deterioration in performance is attributed to the sparsity of the data in high-dimensional spaces, which hinders the ability of the linear sieve methods to effectively learn from the data and generalize to new, unseen instances. To address the curse of dimensionality, a variety of 𝐷3 (Direct DRE with Dimension reduction) methods have been proposed (Sugiyama et al., 2010; 2011; Yamada & Sugiyama, 2011) by incorporating a dimension reduction step prior to applying a linear sieve estimation. These dimension reduction techniques assume that the density ratio function is located in a low-dimensional intrinsic space characterized by a linear transform of the high-dimensional variables. However, this fundamental assumption can be restrictive in many practical scenarios. To overcome these limitations, in this paper, we propose a novel method for DRE based on the projection pursuit (PP) (Friedman & Tukey, 1974). The core idea is to approximate the density ratio function by a product of PP functions and estimate them iteratively. Each of the PP functions is a projection of the target function to a low-dimensional space. As a result, at each iteration, we only need to estimate a semiparametric single-index function based on a univariate linear sieve basis. It is known that the approximation error vanishes as the number of iterations approaches infinity, mitigating the curse of dimensionality when the target function can be expressed as many (or infinitely many) lowdimensional projections along different directions (Diaconis & Shahshahani, 1984). We further provide a theoretical justification for our method by establishing its consistency and convergence rate for the proposed estimator. In application, we apply the proposed method in causal inference, mutual information estimation and covariate shift adaptation, and find consistent improvements in performance. This paper is organized as follows. 
Section 2 discusses recent advancements in the field through an analysis of related work. Section 3 introduces the conventional methods for DRE from the perspective of model specification. In Section 4, we propose the projection pursuit DRE method, which achieves efficient estimation even in high-dimensional data settings, and we establish its theoretical properties. Section 5 demonstrates the superiority of our proposed method in various applications.

2. Related Work

While this work primarily addresses the curse of dimensionality in DRE, the field has also made significant progress on two related challenges: stabilizing estimators through effective regularization strategies, and addressing geometric disparities between distributions known as density-chasm effects.

Regularized Kernel Learning Methods. Regularization schemes within a reproducing kernel Hilbert space (RKHS) have been developed for the DRE problem. Que & Belkin (2013) reformulated DRE as an inverse problem in terms of an integral operator corresponding to a kernel, and then proposed a regularized estimation method with an RKHS norm penalty. Gizewski et al. (2022) applied regularized kernel methods to unsupervised domain adaptation under covariate shift and developed convergence rates. Gruber et al. (2024) proposed iterated regularization and derived improved error bounds, faster than the non-iterated bounds, under the Bregman distance and certain regularity conditions (e.g., a source condition and a capacity condition). Nguyen et al. (2024) established the pointwise convergence rate of the regularized estimator, taking into account both the smoothness of the density ratio and the capacity of the space in which it is estimated.

Density-Chasm Problem. Density ratio estimation faces an additional challenge known as the density-chasm problem, which occurs when the two distributions differ substantially (Rhodes et al., 2020; Choi et al., 2021; 2022). This phenomenon arises because samples are unlikely to be observed in the low-density region between the two distributions. To overcome this challenge, Choi et al. (2021) proposed an invertible parametric transform that maps the data onto a shared feature space, thereby bringing the transformed densities closer together. They then estimate the density ratio in the feature space, relying on the key property that the ratio remains invariant under such invertible transformations. Notably, this invertible transformation preserves the original data dimensionality; it therefore neither maps the data to a low-dimensional latent space nor addresses the curse of dimensionality, which is the focus of our paper.

3. Density Ratio Estimation

Let 𝑝(𝒙) and 𝑞(𝒙) be two probability density functions of the target and reference datasets, respectively, where 𝒙 ∈ R^𝑑 is a 𝑑-dimensional variable. The density ratio estimation (DRE) problem is to estimate 𝑟(𝒙) := 𝑝(𝒙)/𝑞(𝒙) based on two independently and identically distributed (i.i.d.) samples from the two distributions, i.e. {𝒙_𝑖^𝑝}_{𝑖=1}^{𝑛_𝑝} ~ i.i.d. 𝑝(𝒙) and {𝒙_𝑖^𝑞}_{𝑖=1}^{𝑛_𝑞} ~ i.i.d. 𝑞(𝒙). To make the density ratio function 𝑟(𝒙) well-defined, we assume that 𝑞(𝒙) dominates 𝑝(𝒙), i.e. 𝑝(𝒙) > 0 implies 𝑞(𝒙) > 0, and X ⊆ R^𝑑 denotes the support of 𝑟(𝒙). Below, we briefly summarize the existing methods for the DRE problem and highlight their limitations.
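Before turning to those methods, the following minimal Python sketch makes the two-sample setup concrete, using the toy pair of Gaussians that reappears in Section 5.1. The function names are illustrative only, and the true ratio is computable here solely because the toy densities are known; it is used only for evaluation.

```python
import numpy as np
from scipy.stats import multivariate_normal

d, n_p, n_q = 2, 5000, 5000
rng = np.random.default_rng(0)

# Target p(x) = N(0, I_d) and reference q(x) = N(0, 2 I_d), as in Section 5.1.
x_p = rng.multivariate_normal(np.zeros(d), np.eye(d), size=n_p)
x_q = rng.multivariate_normal(np.zeros(d), 2 * np.eye(d), size=n_q)

def true_ratio(x):
    """True density ratio r(x) = p(x) / q(x) for the two Gaussians above."""
    p = multivariate_normal(np.zeros(d), np.eye(d)).pdf(x)
    q = multivariate_normal(np.zeros(d), 2 * np.eye(d)).pdf(x)
    return p / q

# A DRE method only observes x_p and x_q; true_ratio is used purely for evaluation.
print(true_ratio(x_q[:5]))
```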
Parametric Methods The density ratio function is assumed to satisfy the following parametric model (Liu et al., 2013; 2017; Nagumo & Fujisawa, 2024): 𝑟 (𝒙) = 𝐶exp{𝜜 ℎ(𝒙)}, where 𝐶 R is a normalizing constant, ℎ(𝒙) : R𝑑 R𝑝 is the feature transformation function, 𝑝is a known fixed positive integer, and 𝜜 R𝑝is the parameter of interest to be estimated. See Nagumo & Fujisawa (2024, Appendix A) for more detailed discussion on this parametric formulation. The parametric formulation is particularly useful for sparse estimation in high-dimensional problems. However, it relies heavily on the correct model specification and may fail to capture non-linear relationships or interactions within the data as effectively as non-parametric methods. This limitation diminishes their adaptability and flexibility in handling complex or varying datasets. Nonparametric Linear Sieve Methods The density ratio function is approximated by a sequence of increasingly complex linear models (Kanamori et al., 2009; Sugiyama et al., 2008; 2012): 𝑟 (𝒙) 𝑟𝑙𝑚 𝑝(𝒙) := 𝜜 𝝍(𝒙), or log-linear models (Kanamori et al., 2010; Tsuboi et al., 2009): 𝑟 (𝒙) 𝑟𝑙𝑙𝑚 𝑝(𝒙) := 𝐶exp{𝜜 𝝍(𝒙)}, where 𝝍(𝒙) : R𝑑 R𝑝is a user-specified basis function (also called sieve" in statistical literature), 𝜜 R𝑝is the 𝑝-dimensional coefficient parameter, and 𝑝 N is unknown and will go to infinity at an appropriate rate as the sample size increases. The linear sieve methods adaptively learn the model parameters (𝜜, 𝑝) from the data by minimizing the empirical discrepancy between the true density ratio function and its linear sieve estimators under some distance measure (e.g. the Kullback-Leibler divergence, the squared distance, and the general Bregman distance). The linear sieve methods seek to balance model complexity with interpretability and computational feasibility. However, the performance of linear sieve methods will significantly deteriorate as the dimension of 𝒙becomes large. Such a curse of dimensionality can be explained as follows. The linear sieve approximation error for the density ratio function is of order 𝑝 𝑠/𝑑(Lorentz, 1986, Theorem 8), where 𝑠is the Hölder-smoothness of 𝑟 (𝒙), thus it requires 𝑝larger than [1/𝜖]𝑑/𝑠to achieve an 𝜖-error approximation accuracy. On the other hand, a large 𝑝will significantly enlarge the variance of these estimators; indeed, the variance of the linear sieve estimator is of order 𝑝/𝑛(Li & Racine, 2023, Theorem 15.1). Deep Neural Network Methods Several deep neural network-based methods have been proposed for the DRE problem (Nam & Sugiyama, 2015; Fang et al., 2020; Rhodes et al., 2020; Kato & Teshima, 2021; Choi et al., 2022), with the aim of improving performance in high-dimensional data. Despite their superior performance, these methods tend to lack good interpretability and theoretical guarantees. Moreover, the effectiveness of deep neural network estimation is intrinsically dependent on the availability of substantial training datasets, and its computational burden is much heavier than the parametric and linear sieve methods. 4. Projection Pursuit DRE Projection pursuit (PP) is a statistical technique used for multidimensional data analysis. The key idea is to approximate a high-dimensional function by progressively projecting it onto efficient low-dimensional spaces that capture the most significant features of the data structure. 
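As a simple, fully worked instance of this idea, consider the toy pair of Gaussian densities used later in Section 5.1, 𝑝(𝒙) = N(0_𝑑, 𝑰_𝑑) and 𝑞(𝒙) = N(0_𝑑, 2𝑰_𝑑). Their ratio factorizes exactly into 𝑑 univariate functions of one-dimensional projections along the coordinate directions 𝒆_𝑘:

𝑟(𝒙) = 𝑝(𝒙)/𝑞(𝒙) = 2^{𝑑/2} exp(−‖𝒙‖^2/2 + ‖𝒙‖^2/4) = ∏_{𝑘=1}^{𝑑} √2 · exp{−(𝒆_𝑘^⊤𝒙)^2/4},

so in this particular case a product of 𝑑 projected univariate functions recovers the ratio exactly. For ratios that do not factorize exactly, such products can still approximate the target arbitrarily well as the number of projections grows (Diaconis & Shahshahani, 1984).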
Projection pursuit was first developed by Friedman & Tukey (1974) for exploratory data analysis and has since been applied to a variety of problems, such as PP-classification (Lee et al., 2005; da Silva et al., 2021), PP-regression (Friedman & Stuetzle, 1981; Zhan et al., 2025), and PP-density estimation (Friedman et al., 1984; Aladjem, 2005). Inspired by the strength of PP in mitigating the curse of dimensionality (Huber, 1985), we propose a PP-based method for estimating the density ratio 𝑟(𝒙) and develop valid asymptotic theories.

4.1. Projection Pursuit Approximation

We propose to approximate the density ratio function 𝑟(𝒙) by the multiplicative projection pursuit:

𝑟(𝒙) ≈ 𝑟_𝐊(𝒙) = ∏_{𝑘=1}^{𝐊} 𝑓_𝑘(𝒂_𝑘^⊤𝒙),  (1)

where 𝐊 ∈ N denotes the number of projections, {𝒂_𝑘}_{𝑘=1}^{𝐊} are unit 𝑑-dimensional vectors indicating the projection directions, and {𝑓_𝑘}_{𝑘=1}^{𝐊} are unknown univariate pursuit functions. That is, the PP approximation (1) converts the estimation of 𝑟(𝒙), whose direct estimation is challenging when 𝒙 is high-dimensional, into the simpler tasks of estimating the projected univariate functions {𝑓_𝑘(𝒂_𝑘^⊤𝒙)}_{𝑘=1}^{𝐊}.

Estimation of {𝑓_𝑘(𝒂_𝑘^⊤𝒙)}_{𝑘=1}^{𝐊} can be carried out iteratively based on the relation 𝑟_𝑘(𝒙) = 𝑟_{𝑘−1}(𝒙) 𝑓_𝑘(𝒂_𝑘^⊤𝒙) for 𝑘 ∈ {1,...,𝐊}, where 𝑟_0(𝒙) ≡ 1. Specifically, let E_𝑞[·] and E_𝑝[·] denote the expectations taken with respect to (w.r.t.) the probability densities 𝑞(·) and 𝑝(·), respectively. Suppose that the first 𝑘−1 terms are given, that is, 𝑟_{𝑘−1}(𝒙) = ∏_{𝑚=1}^{𝑘−1} 𝑓_𝑚(𝒂_𝑚^⊤𝒙) is given. The 𝑘th projection direction 𝒂_𝑘 and the 𝑘th pursuit function 𝑓_𝑘 can then be determined by minimizing the 𝐿_2 distance E_𝑞{[𝑟(𝒙) − 𝑟_{𝑘−1}(𝒙) 𝑓(𝒂^⊤𝒙)]^2}. We show in Appendix E that minimizing this 𝐿_2 distance w.r.t. 𝑓 or 𝒂 is equivalent to minimizing

𝐻(𝑓, 𝒂) := E_𝑞[𝑟_{𝑘−1}^2(𝒙) 𝑓^2(𝒂^⊤𝒙)] − 2 E_𝑝[𝑟_{𝑘−1}(𝒙) 𝑓(𝒂^⊤𝒙)].  (2)

For model identification, we assume 𝒂_𝑘 ∈ S_𝑑^+ as in Wang & Yang (2009), where S_𝑑^+ is the 𝑑-dimensional upper unit hemisphere, i.e. S_𝑑^+ = {𝒂 = (𝑎_1,...,𝑎_𝑑)^⊤ ∈ R^𝑑 : ‖𝒂‖ = 1, 𝑎_1 > 0}, and ‖·‖ denotes the Euclidean norm, i.e. for a vector 𝒗, ‖𝒗‖ := (𝒗^⊤𝒗)^{1/2}.

4.2. Estimation Procedure

Since the function space over which 𝑓_𝑘 is sought is infinite-dimensional, direct optimization based on the sample analogue of (2) is impossible. We therefore seek the estimator of the pursuit function 𝑓_𝑘(𝑧), for 𝑧 ∈ {𝒂^⊤𝒙 : 𝒂 ∈ S_𝑑^+, 𝒙 ∈ X}, within the linear sieve class

F_{𝐽_𝑘} := { 𝑓(𝑧) = Σ_{𝑗=1}^{𝐽_𝑘} 𝛽_𝑗 𝜙_𝑗(𝑧) }, 𝐽_𝑘 ∈ N,

where {𝜙_𝑗}_{𝑗=1}^{𝐽_𝑘} is a sequence of univariate basis functions, 𝚽_𝑘(𝑧) = (𝜙_1(𝑧),...,𝜙_{𝐽_𝑘}(𝑧))^⊤, and 𝜷 = (𝛽_1,...,𝛽_{𝐽_𝑘})^⊤ collects the approximation coefficients. The rationale is that any continuous function can be approximated arbitrarily well by a linear sieve in F_{𝐽_𝑘} as 𝐽_𝑘 goes to infinity (see e.g., Chen, 2007b).

For 𝑘 = 1,...,𝐊, based on (2) and the linear sieve class, we define the estimator of (𝑓_𝑘, 𝒂_𝑘) in the 𝑘th iteration by

ˆ𝑓_𝑘(𝑧) := ˆ𝜷_𝑘^⊤𝚽_𝑘(𝑧),  (ˆ𝒂_𝑘, ˆ𝜷_𝑘) := argmin_{𝒂,𝜷} ˆL_𝑘(𝒂, 𝜷; 𝜆),  (3)

where

ˆL_𝑘(𝒂, 𝜷; 𝜆) := (1/𝑛_𝑞) Σ_{𝑖=1}^{𝑛_𝑞} ˆ𝑟_{𝑘−1}^2(𝒙_𝑖^𝑞) [𝜷^⊤𝚽_𝑘(𝒂^⊤𝒙_𝑖^𝑞)]^2 − (2/𝑛_𝑝) Σ_{𝑖=1}^{𝑛_𝑝} ˆ𝑟_{𝑘−1}(𝒙_𝑖^𝑝) 𝜷^⊤𝚽_𝑘(𝒂^⊤𝒙_𝑖^𝑝) + 𝜆‖𝜷‖^2  (4)

is the regularized empirical loss, ˆ𝑟_{𝑘−1}(𝒙) := ∏_{𝑚=0}^{𝑘−1} ˆ𝑓_𝑚(ˆ𝒂_𝑚^⊤𝒙), ‖𝜷‖^2 := 𝜷^⊤𝜷 is the ℓ_2-penalty on the model complexity, and 𝜆 > 0 is a tuning parameter. Here, we use the ℓ_2-penalty to circumvent over-fitting while maintaining a closed-form solution in 𝜷 for problem (4) when 𝒂 is kept fixed.

Proposition 4.1. For every fixed 𝒂 ∈ R^𝑑, we have

ˆ𝜷_𝑘(𝒂) := argmin_𝜷 ˆL_𝑘(𝒂, 𝜷; 𝜆) = ( 𝒁_𝑘(𝒂)^⊤𝒁_𝑘(𝒂)/𝑛_𝑞 + 𝜆𝑰_{𝐽_𝑘} )^{−1} r̄_𝑘(𝒂),

where

𝒁_𝑘(𝒂) := {ˆ𝑟_{𝑘−1}(𝒙_𝑖^𝑞) 𝚽_𝑘(𝒂^⊤𝒙_𝑖^𝑞)}_{𝑖=1}^{𝑛_𝑞} ∈ R^{𝑛_𝑞×𝐽_𝑘},  r̄_𝑘(𝒂) := (1/𝑛_𝑝) Σ_{𝑖=1}^{𝑛_𝑝} ˆ𝑟_{𝑘−1}(𝒙_𝑖^𝑝) 𝚽_𝑘(𝒂^⊤𝒙_𝑖^𝑝) ∈ R^{𝐽_𝑘},

and 𝑰_{𝐽_𝑘} is the identity matrix of size 𝐽_𝑘 × 𝐽_𝑘.

Proof.
Using the above notation, ˆL_𝑘(𝒂, 𝜷; 𝜆) can be written as a quadratic function of 𝜷:

ˆL_𝑘(𝒂, 𝜷; 𝜆) = (1/𝑛_𝑞) Σ_{𝑖=1}^{𝑛_𝑞} [𝜷^⊤ˆ𝑟_{𝑘−1}(𝒙_𝑖^𝑞) 𝚽_𝑘(𝒂^⊤𝒙_𝑖^𝑞)]^2 − (2/𝑛_𝑝) Σ_{𝑖=1}^{𝑛_𝑝} ˆ𝑟_{𝑘−1}(𝒙_𝑖^𝑝) 𝜷^⊤𝚽_𝑘(𝒂^⊤𝒙_𝑖^𝑝) + 𝜆𝜷^⊤𝜷 = 𝜷^⊤( 𝒁_𝑘(𝒂)^⊤𝒁_𝑘(𝒂)/𝑛_𝑞 + 𝜆𝑰_{𝐽_𝑘} )𝜷 − 2𝜷^⊤r̄_𝑘(𝒂).

Differentiating with respect to 𝜷 and setting the derivative to zero gives the desired result.

By Proposition 4.1 and (3), the estimators of 𝑓_{𝒂,𝑘}(𝑧) and 𝒂_𝑘 can be represented respectively as

ˆ𝑓_{𝒂,𝑘}(𝑧) := ˆ𝜷_𝑘(𝒂)^⊤𝚽_𝑘(𝑧),  (5)

ˆ𝒂_𝑘 = argmin_{𝒂∈R^𝑑} ˆL_𝑘(𝒂, ˆ𝜷_𝑘(𝒂); 𝜆) = argmin_{𝒂∈R^𝑑} { (1/𝑛_𝑞)‖𝒁_𝑘(𝒂)ˆ𝜷_𝑘(𝒂)‖^2 − 2 ˆ𝜷_𝑘(𝒂)^⊤r̄_𝑘(𝒂) + 𝜆 ˆ𝜷_𝑘(𝒂)^⊤ˆ𝜷_𝑘(𝒂) }.

Then we have ˆ𝑓_𝑘(𝑧) = ˆ𝑓_{ˆ𝒂_𝑘,𝑘}(𝑧), the estimators of {𝑟_𝑘(𝒙)}_{𝑘=1}^{𝐊} are defined by ˆ𝑟_𝑘(𝒙) = ˆ𝑟_{𝑘−1}(𝒙) ˆ𝑓_𝑘(ˆ𝒂_𝑘^⊤𝒙), 𝑘 ∈ {1,...,𝐊}, and the estimator of the density ratio function 𝑟(𝒙) is given by ˆ𝑟_𝐊(𝒙). In practice, to ensure non-negativity of the estimated density ratio function, we truncate the negative values of ˆ𝑓_𝑘 to the minimum positive estimated value at every iteration. The complete process of our approach, given tuned hyper-parameters, is summarized in Algorithm 1.

In the following theorem, we focus on each iteration 𝑘 and show that, given the estimate ˆ𝑟_{𝑘−1}(𝒙) of 𝑟_{𝑘−1}(𝒙), as the sample sizes 𝑛_𝑝, 𝑛_𝑞 → ∞, ˆ𝑓_{𝒂,𝑘} and ˆ𝒂_𝑘 converge respectively to

𝑓_{𝒂,𝑘} := argmin_𝑓 𝐻(𝑓, 𝒂),  (6)

and

𝒂_𝑘 := argmin_𝒂 𝐻(𝑓_{𝒂,𝑘}, 𝒂).  (7)

Theorem 4.2. Suppose that Assumptions F.1 to F.4 in Appendix F hold. Then, for each 𝑘 = 1,...,𝐊, if 𝑛_𝑝^{−1} Σ_{𝑖=1}^{𝑛_𝑝} {ˆ𝑟_{𝑘−1}(𝒙_𝑖^𝑝) − 𝑟_{𝑘−1}(𝒙_𝑖^𝑝)}^2 = 𝑂_𝑝(𝜉_{𝑛,𝑘−1}), we have

sup_{𝒂∈A} sup_{𝑧∈Z} |ˆ𝑓_{𝒂,𝑘}(𝑧) − 𝑓_{𝒂,𝑘}(𝑧)| = 𝑂_𝑝( 𝐽_𝑘^{−𝑠}𝜁_0(𝐽_𝑘) + 𝜉_{𝑛,𝑘−1}𝜁_0(𝐽_𝑘)^2 + 𝐽_𝑘𝜁_0(𝐽_𝑘)^2 ⋯ + 𝐽_𝑘^{−(𝑠−1)} + ⋯ ),

where A ⊆ S_𝑑^+ is a compact set containing 𝒂_𝑘, Z := {𝒂^⊤𝒙 : 𝒂 ∈ A and 𝒙 ∈ X}, 𝑠 denotes the number of continuous derivatives that 𝑓_{𝒂,𝑘}(𝑧) possesses w.r.t. 𝑧 ∈ Z for any 𝒂 ∈ A, 𝜁_0(𝐽_𝑘) is a sequence of constants such that sup_{𝑧∈Z} ‖𝚽_𝑘(𝑧)‖ ≤ 𝜁_0(𝐽_𝑘), 𝜁_1(𝐽_𝑘) is a sequence of constants such that the maximum eigenvalue of E[𝚽_𝑘^{(1)}(𝒂^⊤𝒙)𝚽_𝑘^{(1)}(𝒂^⊤𝒙)^⊤] is bounded by 𝜁_1(𝐽_𝑘) uniformly in 𝒂 ∈ A, and 𝑛_𝑞 ∧ 𝑛_𝑝 = min(𝑛_𝑞, 𝑛_𝑝). Finally, we have

sup_{𝒙∈X} |ˆ𝑟_𝐊(𝒙) − 𝑟_𝐊(𝒙)| = 𝑂_𝑝( ⋯ { 𝐽_ℓ^{−(𝑠−1)} + 𝜁_1(𝐽_𝑖) 𝜁_0^2(𝐽_𝑖) ⋯ } ),

where 𝜁_1(𝐽_𝑘) ∨ 𝜁_0(𝐽_𝑘) = max{𝜁_1(𝐽_𝑘), 𝜁_0(𝐽_𝑘)}.

The proof is available in Appendix F. Assumption F.1 imposes compactness conditions on X and the parameter space A. Assumption F.2 requires 𝑓_{𝒂,𝑘}(𝑧) to be bounded and bounded away from 0, to be 𝑠-times continuously differentiable w.r.t. 𝑧 ∈ Z, and to have a bounded derivative w.r.t. 𝒂 ∈ A; these are common regularity conditions in the literature. Assumption F.3 excludes near multicollinearity among the basis functions 𝚽_𝑘(𝑧) and regulates the rate of 𝐽_𝑘 relative to 𝑛_𝑝 and 𝑛_𝑞 to guarantee the consistency of our estimators. Such conditions are standard in sieve regression. Assumption F.4 requires the Hessian matrix of the sieve approximation of 𝐻(𝑓, 𝒂) to be positive definite at its minimum w.r.t. 𝒂, which is met when the minimum is in the interior of A.

Algorithm 1 pp DRE
1: Input: samples {𝒙_𝑖^𝑝}_{𝑖=1}^{𝑛_𝑝} and {𝒙_𝑖^𝑞}_{𝑖=1}^{𝑛_𝑞}; number of basis functions 𝐽_𝑘, learning rate 𝛿, ridge penalty parameter 𝜆, maximum number of steps 𝐊.
2: Initialize: ˆ𝑟_0(𝒙) ← 1
3: for 𝑘 = 1 to 𝐊 do
4:   Initialize 𝒂_𝑘^{(0)}, 𝜸_𝑘^{(0)} randomly, and set 𝑡 = 1.
5:   while not converged do
6:     Calculate 𝒁_𝑘^{(𝑡)} ← {ˆ𝑟_{𝑘−1}(𝒙_𝑖^𝑞) 𝚽_𝑘(𝒂_𝑘^{(𝑡−1)⊤}𝒙_𝑖^𝑞; 𝜸_𝑘^{(𝑡−1)})}_{𝑖=1}^{𝑛_𝑞} and r̄_𝑘^{(𝑡)} ← (1/𝑛_𝑝) Σ_{𝑖=1}^{𝑛_𝑝} ˆ𝑟_{𝑘−1}(𝒙_𝑖^𝑝) 𝚽_𝑘(𝒂_𝑘^{(𝑡−1)⊤}𝒙_𝑖^𝑝; 𝜸_𝑘^{(𝑡−1)}).
7:     Compute 𝜷_𝑘^{(𝑡)} ← ( 𝒁_𝑘^{(𝑡)⊤}𝒁_𝑘^{(𝑡)}/𝑛_𝑞 + 𝜆𝑰_{𝐽_𝑘} )^{−1} r̄_𝑘^{(𝑡)}.
8:     Evaluate the loss function (1/𝑛_𝑞)‖𝒁_𝑘^{(𝑡)}𝜷_𝑘^{(𝑡)}‖^2 − 2 𝜷_𝑘^{(𝑡)⊤}r̄_𝑘^{(𝑡)} + 𝜆 𝜷_𝑘^{(𝑡)⊤}𝜷_𝑘^{(𝑡)}.
9:     Update (𝒂_𝑘^{(𝑡)}, 𝜸_𝑘^{(𝑡)}) with a stochastic gradient descent algorithm, such as Adam, with learning rate 𝛿.
10:    Update 𝑡 ← 𝑡 + 1.
11:  end while
12:  Update ˆ𝑟_𝑘(𝒙) ← ˆ𝑟_{𝑘−1}(𝒙) · 𝜷_𝑘^{(𝑡)⊤}𝚽_𝑘(𝒂_𝑘^{(𝑡)⊤}𝒙; 𝜸_𝑘^{(𝑡)}).
13: end for
14: return ˆ𝑟_𝐊(𝒙).
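To make the single-iteration update concrete, the sketch below implements the closed-form ridge step of Proposition 4.1 for a fixed direction 𝒂 and, purely for illustration, replaces the gradient-based direction update of Algorithm 1 with a crude random search over the hemisphere while keeping the Gaussian basis locations fixed. All names are illustrative and not part of any released implementation; negative pursuit values are floored at a small constant rather than at the minimum positive estimate used in the paper.

```python
import numpy as np

def gaussian_basis(z, centers):
    # Basis vector from Remark 4.3: one Gaussian bump per location parameter.
    return np.exp(-0.5 * (z[:, None] - centers[None, :]) ** 2)

def ridge_beta(a, x_p, x_q, r_prev_p, r_prev_q, centers, lam):
    """Closed-form coefficients of Proposition 4.1 for a fixed direction a."""
    n_q = x_q.shape[0]
    Z = r_prev_q[:, None] * gaussian_basis(x_q @ a, centers)            # n_q x J_k
    r_bar = (r_prev_p[:, None] * gaussian_basis(x_p @ a, centers)).mean(axis=0)
    A = Z.T @ Z / n_q + lam * np.eye(len(centers))
    return np.linalg.solve(A, r_bar)

def loss(a, beta, x_p, x_q, r_prev_p, r_prev_q, centers, lam):
    # Regularized empirical loss (4).
    fq = gaussian_basis(x_q @ a, centers) @ beta
    fp = gaussian_basis(x_p @ a, centers) @ beta
    return np.mean(r_prev_q**2 * fq**2) - 2 * np.mean(r_prev_p * fp) + lam * beta @ beta

def pp_dre_step(x_p, x_q, r_prev_p, r_prev_q, centers, lam, n_dir=200, seed=0):
    """One projection-pursuit iteration: choose a direction and its pursuit function."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_dir):                      # crude random search over directions
        a = rng.standard_normal(x_p.shape[1])
        a /= np.linalg.norm(a)
        if a[0] < 0:                            # identification: keep a in S_d^+
            a = -a
        beta = ridge_beta(a, x_p, x_q, r_prev_p, r_prev_q, centers, lam)
        val = loss(a, beta, x_p, x_q, r_prev_p, r_prev_q, centers, lam)
        if best is None or val < best[0]:
            best = (val, a, beta)
    _, a, beta = best
    f_p = np.maximum(gaussian_basis(x_p @ a, centers) @ beta, 1e-6)     # enforce positivity
    f_q = np.maximum(gaussian_basis(x_q @ a, centers) @ beta, 1e-6)
    return a, beta, r_prev_p * f_p, r_prev_q * f_q
```

Starting from r_prev_p = r_prev_q = 1 and calling pp_dre_step repeatedly, then multiplying the returned factors, reproduces the multiplicative approximation (1).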
Remark 4.3 (Selection of Tuning Parameters). In practice, we suggest using cross-validation (CV) to determine the number of projections 𝐊 as well as the other tuning parameters. Specifically, in the experiments and applications, we adopt the Gaussian basis 𝜙_𝑗(𝑧) = 𝜙(𝑧; 𝛾_𝑗) = exp{−(𝑧 − 𝛾_𝑗)^2/2}, where 𝛾_𝑗 is the location parameter of the Gaussian basis. For each 𝑘 ∈ {1,...,𝐊}, we determine {ˆ𝛾_𝑗}_{𝑗=1}^{𝐽_𝑘}, together with ˆ𝒂_𝑘 and ˆ𝜷_𝑘, jointly by minimizing the loss function ˆL_𝑘. We monitor the minimal value of the loss function ˆL_𝐊 on a validation set for a gradually growing 𝐊, and fix the optimal 𝐊 once no further improvement is observed.

[Figure 1. 2-D DRE Experiment: (a) estimated ˆ𝑟(𝒙); (b) contour plot of ˆ𝑟(𝒙). The warmer color represents a higher estimated density ratio value. The top-left plot in each panel shows the true density ratio, while the remaining plots illustrate the estimates obtained by the various methods.]

5. Experiments and Applications

In this section, we compare our proposed projection pursuit density ratio estimation (pp DRE) method with existing alternatives using experimental and real-world data. The compared methods include two classical linear sieve methods, KLIEP (Sugiyama et al., 2007) and u LSIF (Kanamori et al., 2009); a linear sieve method with dimension reduction, D3-LHSS (Sugiyama et al., 2011); a probabilistic classification approach (Qin, 1998; Bickel et al., 2007) based on machine learning classifiers; two recent baselines, f DRE (Choi et al., 2021) and RRND (Nguyen et al., 2024); and a neural network-based density ratio estimation (nn DRE) method, which is a variant of our framework in the sense that it replaces the projection pursuit with a feedforward neural network to model the density ratio. Detailed descriptions of these baselines and implementation specifics are provided in Appendix A and Appendix B.

5.1. 2-D DRE Experiment

We first consider a toy example where 𝑝(𝒙) = N(0_𝑑, 𝑰_𝑑) and 𝑞(𝒙) = N(0_𝑑, 2𝑰_𝑑), where N(𝜇, Σ) denotes a multivariate Gaussian probability density with mean 𝜇 and covariance matrix Σ, and 𝑰_𝑑 denotes the identity matrix of size 𝑑. We consider a low-dimensional case with 𝑑 = 2, which facilitates visualization of the estimated results. The sample sizes are 𝑛_𝑝 = 𝑛_𝑞 = 5000. Estimates of this density ratio are shown in Figure 1a, and the corresponding contour plot is presented in Figure 1b. These figures demonstrate that the proposed pp DRE method and the u LSIF method significantly outperform the other baseline approaches, yielding estimates that closely align with the true density ratio.

[Figure 2. Stabilized weights estimation (𝑑_𝑋 = 10) for (a) 𝒄 = 0.5𝒆_1 and (b) 𝒄 = 2𝒆_1. The y-axis represents the treatment 𝑡 and the x-axis represents 𝒄^⊤𝒙. Each point represents a sample data point, and the color indicates the magnitude of the estimated density ratio. The top-left plot in each panel shows the true density ratio, while the other plots illustrate estimates using various methods.]

5.2. Application in Causal Inference

We apply our proposed pp DRE method to estimate continuous treatment effects in the framework of Ai et al. (2021). Let 𝑇 ∈ R denote the observed continuous treatment status variable, with PDF 𝑓_𝑇(𝑡). Let 𝑌(𝑡) denote the potential response when treatment 𝑡 is assigned, and let 𝑌 = 𝑌(𝑇) denote the observed response. Let 𝑿 ∈ R^{𝑑_𝑋} denote a vector of observable covariates.
To identify the causal effect, we assume the unconfoundedness condition that 𝑌(𝑡) and 𝑇 are conditionally independent given 𝑿. We assume a parametric model 𝑔(𝑡; 𝜃*), called the general dose-response function, for the potential outcome 𝑌(𝑡):

𝜃* := argmin_{𝜃∈R^𝑝} ∫ E[𝐿(𝑌(𝑡) − 𝑔(𝑡; 𝜃))] 𝑓_𝑇(𝑡) 𝑑𝑡 = argmin_{𝜃∈R^𝑝} E[𝜋_0(𝑇, 𝑿) 𝐿(𝑌 − 𝑔(𝑇; 𝜃))],  (10)

where 𝐿(·) is a user-specified loss function, the second equality holds by the unconfoundedness condition, and 𝜋_0(𝑡, 𝒙) denotes the stabilized weights, defined by

𝜋_0(𝑡, 𝒙) := 𝑓_𝑇(𝑡) 𝑓_𝑿(𝒙) / 𝑓_{𝑇,𝑿}(𝑡, 𝒙) = 𝑓_𝑇(𝑡) / 𝑓_{𝑇|𝑿}(𝑡|𝒙),

where 𝑓_{𝑇,𝑿} is the joint PDF of 𝑇 and 𝑿, 𝑓_{𝑇|𝑿} is the conditional PDF of 𝑇 given 𝑿, and 𝑓_𝑇 and 𝑓_𝑿 are the marginal PDFs of 𝑇 and 𝑿, respectively. The causal model (10) encompasses a variety of continuous treatment effect parameters of interest. For example, with 𝐿(𝑣) = 𝑣^2, model (10) gives 𝑔(𝑡; 𝜃*) = E{𝑌(𝑡)}, the average dose-response function (ADRF). With 𝐿(𝑣) = 𝑣{𝜏 − 𝐼(𝑣 ≤ 0)} for some 𝜏 ∈ (0,1), model (10) gives the 𝜏th quantile dose-response function (QDRF), 𝑔(𝑡; 𝜃*) = 𝐹_{𝑌(𝑡)}^{−1}(𝜏) = inf{𝑞 : P{𝑌(𝑡) ≤ 𝑞} ≥ 𝜏}. Note that, by definition, the stabilized weights 𝜋_0(𝑡, 𝒙) can be viewed as a density ratio. We compare our method with different estimators of 𝜋_0(𝑡, 𝒙) based on simulated data sets in Section 5.2.1. In Section 5.2.2, we investigate the application of DRE in both ADRF and QDRF analysis based on a semi-synthetic data set.

5.2.1. STABILIZED WEIGHTS ESTIMATION

In this simulation study, we design the treatment assignment model as 𝑇_𝑖 = 𝒄^⊤𝑿_𝑖 + 𝜖_𝑖, where 𝒄 ∈ R^{𝑑_𝑋} is specified below, 𝑿_𝑖 ~ i.i.d. N(0_{𝑑_𝑋}, 𝑰_{𝑑_𝑋}) and 𝜖_𝑖 ~ i.i.d. N(0,1). In all simulation scenarios, the sample size is fixed at 𝑛 = 5000. We first investigate the visual performance of the estimators for a fixed dimension of covariates, 𝑑_𝑋 = 10. To facilitate visualization, we consider two choices of the coefficient, 𝒄 = 0.5𝒆_1 and 𝒄 = 2𝒆_1, where 𝒆_1 is the vector with 1 in the first component and zeros elsewhere. In both scenarios, only the first component of 𝒙 affects the treatment assignment. The estimated density ratios are visualized in Figure 2, where the y-axis represents the treatment 𝑡 and the x-axis is 𝒄^⊤𝒙. In both scenarios, the density ratio estimates produced by our pp DRE method show the closest alignment with the true density ratios.

[Figure 3. Average RMSLE over 50 replicates for stabilized weights estimation with varying dimensions of covariates (𝑑_𝑋 = 2, 10, 30, 50, 100); curves are shown for u LSIF, KLIEP, Classification, D3-LHSS, pp DRE, and nn DRE.]

We further investigate the estimators' performances in terms of the root mean squared logarithmic error (RMSLE), defined in Appendix C.1, for varying dimensions of covariates 𝑑_𝑋 ∈ {2, 10, 30, 50, 100}. We set 𝒄 = 0.5 · 1_{𝑑_𝑋}, a 𝑑_𝑋-dimensional vector whose components are all 0.5. Figure 3 presents the average RMSLE values of the various estimators over 50 replications. The RMSLE of all estimators increases as the dimension 𝑑_𝑋 grows, but our proposed pp DRE method consistently outperforms its competitors. Moreover, the advantage of pp DRE is more pronounced for larger covariate dimensions. Additional experimental results can be found in Appendix C.1.

5.2.2. DOSE RESPONSE FUNCTION ESTIMATION

This section investigates the performance of estimating both ADRF and QDRF using a semi-synthetic variant of the Infant Health and Development Program (IHDP) dataset. The original IHDP dataset (Hill, 2011) consists of 747 observations, each characterized by 𝑑_𝑋 = 25 covariates. Following Nie et al. (2020) and Gao et al.
(2023), we generate the semi-synthetic IHDP-continuous dataset by leveraging the real-world covariates from the original IHDP dataset to simulate continuous dosages and responses. A detailed description of the data generation process can be found in Appendix D.1. Specifically, the randomly assigned treatment 𝑇is generated from 𝑿by (12), and the potential outcome 𝑌 (𝑡) = ℎ(𝑡, 𝑿) + 0.5𝜖, where ℎ(𝑡, 𝑿) is defined in (13) and 𝜖 N (0,1). Then the true ADRF is 𝑔 (𝑡) = E𝑿[ℎ(𝑡, 𝑿)], and the true 𝜏-th QDRF is 𝑔 (𝑡) = inf[𝑞: P{ℎ(𝑡, 𝑿) +0.5𝜖 𝑞} 𝜏]. We estimate ADRF and QDRFs at various quantile levels 𝜏= {0.1, ,0.25, 0.5, 0.75, 0.9} using a parametric model 𝑔(𝑡;𝜜) := 𝜃0+𝜃1𝑡+𝜃2𝑡2+𝜃3𝑡3+𝜃4𝑡4, and obtain ˆ𝜜by solving the empirical version of the optimization problem (10). To measure the accuracy of the estimated dose-response functions across the full dosage range, we compute the average squared error (ASE): 𝑔(𝑇𝑖; ˆ𝜜) 𝑔 (𝑇𝑖) 2 . The mean and standard error of the ASE computed based on 100 replicates are reported in Table 1. The numerical results indicate that our pp DRE method attains the lowest mean ASE in most cases, demonstrating its superior performance compared to the other alternatives. 5.3. Application in Mutual Information Estimation The mutual information (MI) is a measure of the mutual dependence between two continuous random vectors 𝑌and 𝑜, which is defined by MI𝑌,𝑜= 𝑓𝑌,𝑜(𝒖,𝒗) log 𝑓𝑌,𝑜(𝒖,𝒗) 𝑓𝑌(𝒖) 𝑓𝑜(𝒗) 𝑑𝒖𝑑𝒗. Given a sample from 𝑓𝑌,𝑜(𝒖,𝒗), we can create another sample following the distribution 𝑓𝑌(𝒖) 𝑓𝑜(𝒗) by permuting the 𝒗vectors across the dataset. This enables the application of DRE methods to estimate the MI by 𝑖=1 log ˆ𝑟(𝒖𝑖,𝒗𝑖), Projection Pursuit Density Ratio Estimation Table 1. ASE in dose response function estimation with IHDP-continuous dataset. Method ARDF QDRF 𝜏= 0.1 𝜏= 0.25 𝜏= 0.5 𝜏= 0.75 𝜏= 0.9 u LSIF 0.112(0.022) 0.338(0.058) 0.130(0.027) 0.045(0.016) 0.090(0.022) 0.278(0.064) KLIEP 0.112(0.022) 0.339(0.059) 0.130(0.027) 0.045(0.016) 0.089(0.022) 0.278(0.064) Classification 0.107(0.028) 0.341(0.065) 0.130(0.032) 0.045(0.018) 0.088(0.025) 0.276(0.066) D3-LHSS 0.110(0.024) 0.336(0.058) 0.128(0.030) 0.045(0.016) 0.088(0.022) 0.279(0.068) f DRE 0.106(0.026) 0.331(0.056) 0.127(0.029) 0.045(0.016) 0.088(0.021) 0.267(0.061) RRND 0.117(0.022) 0.307(0.053) 0.125(0.026) 0.043(0.015) 0.090(0.021) 0.408(0.082) nn DRE 0.107(0.022) 0.333(0.059) 0.126(0.027) 0.043(0.015) 0.089(0.021) 0.297(0.066) pp DRE 0.091(0.025) 0.338(0.066) 0.124(0.029) 0.041(0.016) 0.079(0.022) 0.266(0.060) Table 2. MAE in mutual information estimation for varying dimension 𝑝and correlation coefficient 𝜌. Method 𝑝= 2 𝑝= 10 𝑝= 20 𝜌= 0.2 𝜌= 0.8 𝜌= 0.2 𝜌= 0.8 𝜌= 0.2 𝜌= 0.8 u LSIF 0.044(0.001) 0.573(0.015) 0.200(0.008) 5.093(0.031) 0.415(0.023) 10.242(0.088) KLIEP 0.044(0.000) 1.099(0.011) 0.204(0.007) 4.885(0.019) 0.516(0.032) 9.991(0.091) Classification 0.270(0.006) 0.169(0.019) 0.381(0.015) 2.579(0.062) 0.272(0.021) 7.185(0.044) D3-LHSS 0.339(0.077) 0.248(0.162) 0.282(0.084) 4.080(0.201) 0.164(0.132) 9.241(0.198) f DRE 0.043(0.000) 0.907(0.048) 0.199(0.007) 4.283(0.014) 0.487(0.036) 8.710(0.059) RRND 0.061(0.001) 0.926(0.006) 0.200(0.008) 5.092(0.031) 0.414(0.022) 10.241(0.089) nn DRE 0.038(0.037) 1.064(0.275) 1.183(1.725) 3.838(1.912) 1.671(2.202) 8.629(2.577) pp DRE 0.051(0.044) 0.069(0.015) 0.164(0.038) 0.272(0.183) 0.074(0.098) 2.966(1.089) with 𝑝(𝒖,𝒗) = 𝑓𝑌,𝑜(𝒖,𝒗) and 𝑞(𝒖,𝒗) = 𝑓𝑌(𝒖) 𝑓𝑜(𝒗) for 𝑟 (𝒖,𝒗) = 𝑝(𝒖,𝒗)/𝑞(𝒖,𝒗). We adopt the experimental setting in Belghazi et al. (2018); Rhodes et al. (2020); Choi et al. (2022). 
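As a concrete illustration of the permutation construction described above, the sketch below forms the paired sample from 𝑓_{𝑌,𝑍} and the permuted sample from 𝑓_𝑌 𝑓_𝑍, and then averages log ˆ𝑟 over the paired sample. Here estimate_ratio stands in for any of the DRE methods compared in this paper; it is not a specific released implementation.

```python
import numpy as np

def mi_via_dre(u, v, estimate_ratio, seed=0):
    """Estimate MI(Y, Z) from paired samples via density ratio estimation.

    u, v : arrays of shape (n, p) holding paired observations from the joint density.
    estimate_ratio : callable taking (x_p, x_q) and returning a function
        x -> estimated ratio p(x)/q(x); any DRE method can be plugged in.
    """
    rng = np.random.default_rng(seed)
    x_p = np.hstack([u, v])                          # sample from the joint density
    v_perm = v[rng.permutation(len(v))]              # break the dependence between Y and Z
    x_q = np.hstack([u, v_perm])                     # sample from the product of marginals
    r_hat = estimate_ratio(x_p, x_q)
    return np.mean(np.log(np.clip(r_hat(x_p), 1e-12, None)))
```

For the Gaussian setting of Table 2, the ground-truth value is MI = −(𝑝/2) log(1 − 𝜌^2), which is what the reported MAE compares against.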
Specifically, we consider two standard multivariate Gaussian random vectors, 𝑌= (𝑈1,...,𝑈𝑝) R𝑝and 𝑜= (𝑉1,...,𝑉𝑝) R𝑝, with component-wise correlation, corr(𝑈𝑖,𝑉𝑗) = 𝛿𝑖𝑗𝜌, where 𝜌 ( 1,1) and 𝛿𝑖𝑗is Kronecker s delta. Performance is measured by the mean absolute error, MAE := | c MI𝑌,𝑜 MI𝑌,𝑜|. We consider three replicates with sample size 𝑛= 5000. The MAE averaged over three runs for various values of 𝑝and 𝜌are reported in Table 2. Overall, our proposed pp DRE method demonstrates superior or comparable performance in all circumstances, indicating its robustness and effectiveness in mutual information estimation. 5.4. Application in Covariate Shift Adaptation Covariate shift refers to the change in the distribution of the input variables in the training and the test data sets, i.e. 𝑝 te(𝒙) 𝑝 tr(𝒙). In the case of covariate shift, learning a parameter, 𝜜in a model 𝑓(𝒙;𝜜) regarding the probability distribution of 𝑊given 𝒙, using standard learning techniques such as empirical risk minimization (ERM) can become biased (Sugiyama et al., 2012). To mitigate this issue, the importance-weighted ERM is widely used. The core idea is to re-weight the training samples in order to learn a model that minimizes the loss on the test dataset: E(𝒙,𝑊) 𝑝 te(𝒙,𝑊) [𝐿( 𝑓(𝒙;𝜜), 𝑊)] = E(𝒙,𝑊) 𝑝 tr(𝒙,𝑊) 𝑝 te(𝒙) 𝑝 tr(𝒙) 𝐿( 𝑓(𝒙;𝜜), 𝑊) , where 𝐿( ) is a user-specified loss function, and the density ratio 𝑟 (𝒙) = 𝑝 te(𝒙)/𝑝 tr(𝒙) is referred to as the importance, acting as an adjusting weight during the training process. To simulate the covariate shift setting, we adopt a biased sampling scheme as described by Stojanov et al. (2019), on various classical benchmark regression datasets. In this setup, we introduce a sample selection variable 𝑠 {0,1}, and the probability of a sample being selected for the train- Projection Pursuit Density Ratio Estimation Table 3. NMSE under covariate shift adaptation on various benchmark datasets. Method Abalone Billboard Spotify Cancer Mortality Computer Activity Diamond Prices Unweighted 0.544(0.057) 0.582(0.044) 0.704(0.069) 0.278(0.050) 0.231(0.082) u LSIF 0.525(0.057) 0.582(0.044) 0.704(0.069) 0.668(0.360) 0.239(0.068) KLIEP 0.531(0.055) 0.565(0.028) 0.696(0.058) 0.302(0.049) 0.228(0.080) Classification 0.535(0.049) 0.579(0.042) 0.691(0.067) 0.273(0.051) 0.183(0.056) D3-LHSS 0.520(0.054) 0.564(0.031) 0.672(0.066) 0.318(0.052) 0.220(0.070) f DRE 0.586(0.136) 0.615(0.084) 0.664(0.064) 0.294(0.053) 0.233(0.078) RRND 0.574(0.144) 0.616(0.087) 0.655(0.069) 0.279(0.051) 0.232(0.082) nn DRE 0.806(0.829) 0.822(0.861) 0.660(0.063) 0.301(0.033) 0.360(0.429) pp DRE 0.514(0.049) 0.559(0.026) 0.661(0.048) 0.199(0.019) 0.133(0.024) ing set is given by: P(𝑠= 1|𝒙) = 𝑒𝑣 (1+ 𝑒𝑣) , with 𝑣= 4 𝜔 (𝒙 𝒙) where 𝒙is the sample mean of the covariates, 𝜔is a random projection vector uniformly selected from the interval [ 1,1]𝑑and 𝜎𝜔 (𝒙 𝒙) is the standard deviation. Subsequently, we learn a kernel ridge regression model by minimizing the importance-weighted empirical risk: ˆ𝜜= argmin 𝜜 𝑖=1 ˆ𝑟(𝒙tr 𝑖) n 𝑓(𝒙tr 𝑖;𝜜) 𝑊tr 𝑖 2 +𝜆 𝜜 2o , where 𝑓(𝒙;𝜜) = Í𝑛tr 𝑖=1 𝜃𝑖𝐟(𝒙,𝒙tr 𝑖), 𝐟(𝒙,𝒙 ) = exp{ 𝒙 𝒙 2/𝑑} is the kernel basis, and ˆ𝑟is an estimator of 𝑟 based on DRE methods. To evaluate the performance, we compute the normalized mean squared error (NMSE) on the test dataset: (𝑊te 𝑖 ˆ𝑊te 𝑖)2 where 𝑛te is the number of test samples, 𝑊te 𝑖is the true value for the 𝑖th sample, ˆ𝑊te 𝑖is the predicted value, and 𝜎2 𝑊is the variance of the response variable on the test dataset. 
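To illustrate the importance-weighted learning step described above, the following sketch fits the Gaussian-kernel ridge regression with estimated density ratios as weights and computes the NMSE. It is a minimal version that assumes the standard closed-form solution of the weighted ridge objective and the NMSE definition given above; ratio weights can be supplied by any DRE method, and all names are illustrative.

```python
import numpy as np

def gaussian_kernel(X, Y, d):
    # K(x, x') = exp(-||x - x'||^2 / d), as specified in Section 5.4.
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / d)

def weighted_kernel_ridge(x_tr, y_tr, weights, lam):
    """Importance-weighted kernel ridge regression.

    weights : estimated density ratios r_hat(x_i^tr) = p_te(x_i) / p_tr(x_i).
    Returns a prediction function for new inputs.
    """
    d = x_tr.shape[1]
    K = gaussian_kernel(x_tr, x_tr, d)
    W = np.diag(weights)
    # Minimizer of sum_i w_i (K theta - y)_i^2 + lam ||theta||^2.
    theta = np.linalg.solve(K @ W @ K + lam * np.eye(len(x_tr)), K @ W @ y_tr)
    return lambda x_new: gaussian_kernel(x_new, x_tr, d) @ theta

def nmse(y_true, y_pred):
    # Normalized mean squared error on the test set.
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)
```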
Average NMSE values across 15 random replicates are reported in Table 3. More details on the datasets can be found in Appendix D.2. As shown in Table 3, the unweighted regression model performs poorly across all datasets, underscoring the necessity for adjustments for the covariate shift problem. In contrast, our proposed pp DRE method consistently outperforms or matches the performance of the baseline methods across various datasets. This highlights the effectiveness of pp DRE in mitigating the impact of covariate shift, positioning it as a promising approach for regression tasks affected by this issue. 6. Conclusion We propose a novel projection pursuit-based method for estimating the density ratio function, which does not require parametric assumptions, enjoys computational convenience, and can alleviate the curse of dimensionality. The asymptotic consistency and the convergence rates are established to guarantee the validity of the proposed method. Numerical experiments demonstrate that our method outperforms existing alternatives in a variety of applications. Density ratio estimation based on projection pursuit admits many exciting directions for future work. One particularly promising direction is the extension of pp DRE to independence testing. However, this extension requires rigorous theoretical development, particularly in establishing the estimator s limiting distributions under the null hypothesis of independence. Furthermore, the method s iterative optimization process may still face challenges with computational complexity as the number of projections 𝐟increases. Developing adaptive stopping criteria could alleviate the computational burden by avoiding tuning 𝐟with cross validation. Acknowledgements The research of Zheng Zhang is supported by the funds from the National Key R&D Program of China [grant number 2022YFA1008300], the Fundamental Research Funds for the Central Universities, and the Research Funds of Renmin University of China [project number 23XNA025]. Mingming Gong is supported by ARC DE210101624, ARC DP240102088, and WIS-MBZUAI 142571. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning in high-dimensional data. There are many potential societal consequences of our work, none Projection Pursuit Density Ratio Estimation which we feel must be specifically highlighted here. Ai, C., Linton, O., Motegi, K., and Zhang, Z. A unified framework for efficient estimation of general treatment models. Quantitative Economics, 12(3):779 816, 2021. Ai, C., Sun, L.-H., Zhang, Z., and Zhu, L. Testing unconditional and conditional independence via mutual information. Journal of Econometrics, 240(2):105335, 2024. Aladjem, M. Projection pursuit mixture density estimation. IEEE Transactions on Signal Processing, 53(11):4376 4383, 2005. Belghazi, M. I., Baratin, A., Rajeshwar, S., Ozair, S., Bengio, Y., Courville, A., and Hjelm, D. Mutual information neural estimation. In Proceedings of the 35th International Conference on Machine Learning, pp. 531 540. PMLR, 2018. Bickel, S., BrÃŒckner, M., and Scheffer, T. Discriminative learning for differing training and test distributions. In Proceedings of the 24th International Conference on Machine Learning, pp. 81 88, Corvalis Oregon USA, 2007. ACM. Chen, X. Large sample sieve estimation of seminonparametric models. Handbook of Econometrics, 6 (B):5549 5632, 2007a. Chen, X. Large sample sieve estimation of seminonparametric models. Handbook of Econometrics, Elsevier, 2007b. 
Choi, K., Liao, M., and Ermon, S. Featurized density ratio estimation. In Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, pp. 172 182. PMLR, 2021. Choi, K., Meng, C., Song, Y., and Ermon, S. Density ratio estimation via infinitesimal classification. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, pp. 2552 2573. PMLR, 2022. da Silva, N., Cook, D., and Lee, E.-K. A projection pursuit forest algorithm for supervised classification. Journal of Computational and Graphical Statistics, 30(4):1168 1180, 2021. Diaconis, P. and Shahshahani, M. On nonlinear functions of linear combinations. SIAM Journal on Scientific and Statistical Computing, 5(1):175 191, 1984. Fang, T., Lu, N., Niu, G., and Sugiyama, M. Rethinking importance weighting for deep learning under distribution shift. In Advances in Neural Information Processing Systems, volume 33, pp. 11996 12007. Curran Associates, Inc., 2020. Friedman, J. and Tukey, J. A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers, C-23(9):881 890, 1974. Friedman, J. H. and Stuetzle, W. Projection pursuit regression. Journal of the American Statistical Association, 76 (376):817 823, 1981. Friedman, J. H., Stuetzle, W., and Schroeder, A. Projection pursuit density estimation. Journal of the American Statistical Association, 79(387):599 608, 1984. Gao, E., Bondell, H., Huang, W., and Gong, M. A variational framework for estimating continuous treatment effects with measurement error. In The Twelfth International Conference on Learning Representations, 2023. Gizewski, E. R., Mayer, L., Moser, B. A., Nguyen, D. H., Pereverzyev, S., Pereverzyev, S. V., Shepeleva, N., and Zellinger, W. On a regularization of unsupervised domain adaptation in rkhs. Applied and Computational Harmonic Analysis, 57:201 227, 2022. ISSN 1063-5203. Gruber, L., Holzleitner, M., Lehner, J., Hochreiter, S., and Zellinger, W. Overcoming saturation in density ratio estimation by iterated regularization. In Proceedings of the 41st International Conference on Machine Learning, pp. 16502 16529. PMLR, 2024. Hido, S., Tsuboi, Y., Kashima, H., Sugiyama, M., and Kanamori, T. Statistical outlier detection using direct density ratio estimation. Knowledge and Information Systems, 26(2):309 336, 2011. Hill, J. L. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217 240, 2011. Huber, P. J. Projection pursuit. The Annals of Statistics, 13 (2):435 475, 1985. Kanamori, T., Hido, S., and Sugiyama, M. A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10:1391 1445, 2009. Kanamori, T., Suzuki, T., and Sugiyama, M. Theoretical analysis of density ratio estimation. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E93-A(4):787 798, 2010. Kato, M. and Teshima, T. Non-negative bregman divergence minimization for deep direct density ratio estimation. In Proceedings of the 38th International Conference on Machine Learning, pp. 5320 5333. PMLR, 2021. Projection Pursuit Density Ratio Estimation Lee, E.-K., Cook, D., Klinke, S., and Lumley, T. Projection pursuit for exploratory supervised classification. Journal of Computational and Graphical Statistics, 14(4):831 846, 2005. Li, Q. and Racine, J. S. Nonparametric econometrics: theory and practice. Princeton University Press, 2023. Liu, S., Yamada, M., Collier, N., and Sugiyama, M. 
Changepoint detection in time-series data by relative density-ratio estimation. Neural Networks, 43:72 83, 2013. Liu, S., Suzuki, T., Relator, R., Sese, J., Sugiyama, M., and Fukumizu, K. Support consistency of direct sparsechange learning in Markov networks. The Annals of Statistics, 45(3), 2017. Lorentz, G. Approximation of functions. Chelsea Publishing Company, 1986. Matsushita, Y., Otsu, T., and Takahata, K. Estimating density ratio of marginals to joint: Applications to causal inference. Journal of Business & Economic Statistics, 41 (2):467 481, 2023. Meng, X.-L. and Wong, W. H. Simulating ratios of normalizing constants via a simple identity: a theoretical exploration. Statistica Sinica, 1996. Nagumo, R. and Fujisawa, H. Density ratio estimation with doubly strong robustness. In Proceedings of the 41st International Conference on Machine Learning, pp. 37260 37276. PMLR, 2024. Nam, H. and Sugiyama, M. Direct density ratio estimation with convolutional neural networks with application in outlier detection. IEICE Transactions on Information and Systems, E98.D(5):1073 1079, 2015. Newey, W. K. and Mc Fadden, D. Large sample estimation and hypothesis testing. Handbook of Econometrics, 4: 2111 2245, 1994. Nguyen, D. H., Zellinger, W., and Pereverzyev, S. On regularized radon-nikodym differentiation. Journal of Machine Learning Research, 25(1), January 2024. ISSN 1532-4435. Nie, L., Ye, M., Liu, Q., and Nicolae, D. VCNet and functional targeted regularization for learning causal effects of continuous treatments. In International Conference on Learning Representations, 2020. Qin, J. Inferences for case-control and semiparametric twosample density ratio models. Biometrika, 85(3):619 630, 1998. Que, Q. and Belkin, M. Inverse density as an inverse problem: the fredholm equation approach. In Burges, C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. (eds.), Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. Rhodes, B., Xu, K., and Gutmann, M. U. Telescoping density-ratio estimation. In Advances in Neural Information Processing Systems, volume 33, pp. 4905 4916. Curran Associates, Inc., 2020. Shimodaira, H. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227 244, 2000. Sinha, A., O Kelly, M., Tedrake, R., and Duchi, J. C. Neural bridge sampling for evaluating safety-critical autonomous systems. In Advances in Neural Information Processing Systems, volume 33, pp. 6402 6416. Curran Associates, Inc., 2020. Stojanov, P., Gong, M., Carbonell, J., and Zhang, K. Lowdimensional density ratio estimation for covariate shift correction. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, pp. 3449 3458. PMLR, 2019. Sugiyama, M., Nakajima, S., Kashima, H., Buenau, P., and Kawanabe, M. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007. Sugiyama, M., Suzuki, T., Nakajima, S., Kashima, H., Von BÃŒnau, P., and Kawanabe, M. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4):699 746, 2008. Sugiyama, M., Kawanabe, M., and Chui, P. L. Dimensionality reduction for density ratio estimation in highdimensional spaces. Neural Networks, 23(1):44 59, 2010. 
Sugiyama, M., Yamada, M., Von BÃŒnau, P., Suzuki, T., Kanamori, T., and Kawanabe, M. Direct density-ratio estimation with dimensionality reduction via least-squares hetero-distributional subspace search. Neural Networks, 24(2):183 198, 2011. Sugiyama, M., Suzuki, T., and Kanamori, T. Density ratio estimation in machine learning. Cambridge University Press, 1 edition, 2012. Suzuki, T., Sugiyama, M., and Tanaka, T. Mutual information approximation via maximum likelihood estimation of density ratio. In 2009 IEEE International Symposium on Information Theory, pp. 463 467, 2009. doi: 10.1109/ISIT.2009.5205712. Projection Pursuit Density Ratio Estimation Tsuboi, Y., Kashima, H., Hido, S., Bickel, S., and Sugiyama, M. Direct density ratio estimation for large-scale covariate shift adaptation. Journal of Information Processing, 17:138 155, 2009. Vaart, A. W. and Wellner, J. A. Weak convergence and empirical processes. Springer, 1996. Van der Vaart, A. W. Asymptotic statistics, volume 3. Cambridge University Press, 2000. Wang, L. and Yang, L. Spline estimation of single-index models. Statistica Sinica, 19(2):765 783, 2009. ISSN 10170405, 19968507. Yamada, M. and Sugiyama, M. Direct density-ratio estimation with dimensionality reduction via heterodistributional subspace analysis. Proceedings of the AAAI Conference on Artificial Intelligence, 25(1):549 554, 2011. Yamada, M., Suzuki, T., Kanamori, T., Hachiya, H., and Sugiyama, M. Relative density-ratio estimation for robust distribution comparison. Neural Computation, 25(5): 1324 1370, 2013. Zhan, H., Zhang, M., and Xia, Y. Ensemble projection pursuit for general nonparametric regression. The Annals of Statistics, 2025. Projection Pursuit Density Ratio Estimation A. Baseline Methods This section presents the baseline methods used in our numerical studies. u LSIF The unconstrained Least-Squares Importance Fitting (u LSIF) method, proposed by Kanamori et al. (2009), models the density ratio 𝑟 (𝒙) as follows: 𝑟 (𝒙) 𝜜 𝜓(𝒙) = ℓ=1 𝜃ℓ𝐟(𝒙,𝒙𝑝 ℓ), (11) where 𝐟(𝒙,𝒙 ) is the Gaussian kernel basis, and the definitions of 𝜓(𝒙) and 𝜜are straightforward. The model parameter 𝜜 is learned by minimizing the squared loss with a ridge penalty. For implementation, we employ the functions provided by the Python package densratio1. KLIEP The Kullback-Leibler importance estimation procedure (KLIEP) method, proposed by Sugiyama et al. (2008), uses the same Gaussian kernel model for approximating 𝑟 (𝒙) as that in the u LSIF method (11), but the model parameter 𝜜 is identified by minimizing the unnormalized Kullback-Leibler (UKL) divergence: UKL (𝑟) = E𝑞[𝑟(𝒙)] E𝑝[log𝑟(𝒙)]. With the fact that E𝑞[𝑟(𝒙)] = 1, the estimation is implemented as follows: 𝑖=1 log(𝜜 𝝍(𝒙𝑝 𝑖)) s.t. 1 𝑛𝑝 𝑖=1 𝜜 𝝍(𝒙𝑞 𝑖) = 1 and 𝜜 0, where the inequality for vectors is applied in the element-wise manner. The matlab code of the KLIEP method is available online2. Probabilistic Classification By assigning the label 𝑊= +1 to the sample from 𝑝(𝒙) and 𝑊= 1 to the sample from 𝑞(𝒙) respectively (Qin, 1998; Bickel et al., 2007), the density ratio function 𝑟 (𝒙) = 𝑝(𝒙)/𝑞(𝒙) can be expressed by 𝑟 (𝒙) = P(𝒙|𝑊= +1) P(𝒙|𝑊= 1) = P(𝑊= 1)P(𝑊= +1|𝒙) P(𝑊= +1)P(𝑊= 1|𝒙) . Given an estimator of the posterior probability, ˆ𝑝(𝑊|𝒙), the density ratio estimator ˆ𝑟(𝒙) can be constructed as 𝑛𝑝 ˆ𝑝(𝑊= +1|𝒙) ˆ𝑝(𝑊= 1|𝒙) . In our experiments, we use the Light GBM model as the classifier, which is an ensemble learning algorithm based on gradient boosting decision trees. 
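The classification-based estimator above admits a short implementation. The sketch below uses scikit-learn's gradient boosting classifier as a stand-in for the LightGBM model mentioned in the text; it is a minimal illustration of the ratio formula, not the exact configuration used in the experiments.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def classification_dre(x_p, x_q):
    """Density ratio r(x) = p(x)/q(x) via probabilistic classification.

    Label samples from p(x) with y = +1 and samples from q(x) with y = -1, then
    r_hat(x) = (n_q / n_p) * P(y=+1 | x) / P(y=-1 | x).
    """
    n_p, n_q = len(x_p), len(x_q)
    X = np.vstack([x_p, x_q])
    y = np.concatenate([np.ones(n_p), -np.ones(n_q)])
    clf = GradientBoostingClassifier().fit(X, y)

    def r_hat(x):
        proba = clf.predict_proba(x)            # columns follow classes_ = [-1, +1]
        pos, neg = proba[:, 1], np.clip(proba[:, 0], 1e-12, None)
        return (n_q / n_p) * pos / neg

    return r_hat
```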
D3-LHSS The Direct Density-ratio estimation with Dimensionality reduction via Least-squares Heterodistributional Subspace Search (D3-LHSS) method is proposed by Sugiyama et al. (2011). This method assumes that the density ratio can be identified in a low-dimensional space specified by a projection matrix 𝑌 R𝑚 𝑑. Given the projection matrix ˆ𝑌obtained by the LHSS algorithm, the estimator of the density ratio is given by ℓ=1 ˆ𝜃ℓ𝜓ℓ( ˆ𝑌𝒙), where { ˆ𝜃ℓ}𝑏 ℓ=1 are the learned using the u LSIF method on the heterodistributional subspace corresponding to ˆ𝑌. For more details, we refer to Sugiyama et al. (2011), whose matlab implementation is available online2. 1https://github.com/hoxo-m/densratio_py 2https://www.ms.k.u-tokyo.ac.jp/sugi/software.html Projection Pursuit Density Ratio Estimation f DRE The featurized density ratio estimation (f DRE) framework, proposed by Choi et al. (2021), addresses distributional discrepancy challenges through feature learning. By employing a normalizing flow model 𝑓𝜃: X Z with invertible transformation properties, this method embeds heterogeneous data distributions into a unified feature space where density ratios remain preserved. The crucial invariance property is formally expressed as: 𝑟 (𝒙) = 𝑝(𝒙) 𝑞(𝒙) = 𝑝 ( 𝑓𝜃(𝒙)) 𝑞 ( 𝑓𝜃(𝒙)) , where 𝑝 ,𝑞 are the densities of 𝑓𝜃(𝒙𝑝) and 𝑓𝜃(𝒙𝑞) respectively, 𝒙𝑝 𝑝(𝒙), and 𝒙𝑞 𝑞(𝒙). This preservation enables direct application of classical density ratio estimators, such as KLIEP, in the transformed space Z, often achieving superior numerical stability compared to native space estimation. RRND The regularized Radon-Nikodym differentiation estimation (RRND) method, proposed by Nguyen et al. (2024), introduces a kernel-based regularization framework within reproducing kernel Hilbert spaces (RKHS) for accurate pointwise estimation of Radon-Nikodym derivatives, which are equivalent to density ratio functions. This approach can extend the conventional kernel u LSIF through an iterative Lavrentiev regularization scheme, in which the core algorithm operates through successive approximations as follows: 𝛜𝜆,0 𝑿= 0, 𝛜𝜆,𝑙 𝑿= (𝜆𝑰+ 𝑆 𝑋𝑝𝑆𝑋𝑝)(𝑆 𝑋𝑞𝑆𝑋𝑞1+𝜆𝛜𝜆,𝑙 1 𝑿 ), 𝑙 N, where 𝛜𝜆,𝑙 𝑿is the 𝑙-th iteration of the approximation of 𝑟 (𝒙). The methodology leverages two sample operators: 𝑆𝑋𝑝𝑓= { 𝑓(𝒙𝑝 1 ),..., 𝑓(𝒙𝑝 𝑛𝑝)}, 𝑆𝑋𝑞𝑓= { 𝑓(𝒙𝑞 1),..., 𝑓(𝒙𝑞 𝑛𝑞)}, and two adjoint operators defined through kernel embeddings: 𝑆 𝑋𝑝𝑢( ) = 1 𝑗=1 𝐟( ,𝒙𝑝 𝑗)𝑢𝑗, 𝑢 R𝑛𝑝, 𝑆 𝑋𝑞𝑣( ) = 1 𝑗=1 𝐟( ,𝒙𝑞 𝑗)𝑣𝑗, 𝑣 R𝑛𝑞. For complete theoretical analysis and convergence properties, readers are directed to the original work by Nguyen et al. (2024). nn DRE We present a neural network-based density ratio estimation (nn DRE) method. It differs from our proposed pp DRE method in that it utilizes a feedforward neural network to model the density ratio, i.e. 𝑟 (𝒙) 𝑟nn(𝒙;𝜜) = exp{𝐹(𝒙|𝜜)}, where 𝐹( |𝜜) represents the neural network with parameter 𝜜. The estimation is proceeded by minimizing the squared loss based on the neural networks: SQ (𝜜) = E𝑞{[𝑟nn(𝒙;𝜜)]2} 2E𝑝{𝑟nn(𝒙;𝜜)}. B. Implementation Details and Hyperparameter Selection Process To ensure rigorous and fair comparisons, we implemented the following protocols. For our proposed pp DRE method, as mentioned in Remark 4.3, the optimal hyperparameters are identified by the minimal validation loss in CV. For clarity, we detail our approach as follows: across all experiments, we utilized 5-fold CV with random sampling. A grid search was conducted over a predefined set of parameter ranges, which are outlined in Table 4. 
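A minimal sketch of this selection procedure is given below; ppdre_fit and ppdre_validation_loss are hypothetical placeholders for the estimation routine of Algorithm 1 and its empirical loss on held-out data, and a single number of basis functions is shared across iterations for simplicity. The grid matches Table 4.

```python
import itertools
import numpy as np
from sklearn.model_selection import KFold

# Search grid from Table 4 (K: projections, J: basis functions, lam: ridge, lr: step size).
grid = {"K": [5, 10, 15], "J": [20, 50, 70, 100, 150],
        "lam": [0.5, 1, 5, 10], "lr": [0.001, 0.01, 0.1]}

def select_hyperparameters(x_p, x_q, ppdre_fit, ppdre_validation_loss, n_splits=5, seed=0):
    """5-fold cross-validated grid search over the pp DRE tuning parameters."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    best = (np.inf, None)
    for K, J, lam, lr in itertools.product(*grid.values()):
        losses = []
        for (tr_p, va_p), (tr_q, va_q) in zip(kf.split(x_p), kf.split(x_q)):
            model = ppdre_fit(x_p[tr_p], x_q[tr_q], K=K, J=J, lam=lam, lr=lr)
            losses.append(ppdre_validation_loss(model, x_p[va_p], x_q[va_q]))
        if np.mean(losses) < best[0]:
            best = (np.mean(losses), {"K": K, "J": J, "lam": lam, "lr": lr})
    return best[1]
```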
For open-source baseline methods (e.g., KLIEP, u LSIF, 𝐷3-LHSS), we utilized their official implementations (e.g., Python densratio package for u LSIF) and default hyperparameter search grids as recommended in their original papers or standard toolkits. For instance, u LSIF s hyperparameters (𝜎, 𝜆) were selected from the grid 1e-3:10:1e9, consistent with its standard implementation. These methods inherently incorporate CV-based hyperparameter selection in their standard workflows. We preserved these built-in CV mechanisms without modification. For probabilistic classification approach and nn DRE, we implemented 5-fold CV (aligned with our pp DRE method) to identify the optimal hyperparameters within the search spaces outlined in Table 5. Projection Pursuit Density Ratio Estimation Table 4. Search Grid for pp DRE Parameter Description Search Space 𝐟 Number of PP iterations {5, 10, 15} 𝐜𝑘 Number of basis functions {20, 50, 70, 100, 150} 𝜆 ℓ2-regularization strength {0.5, 1, 5, 10} 𝛿 Gradient descent learning rate {0.001, 0.01, 0.1} Table 5. Search Grid for Probabilistic Classification approach and nn DRE Method Parameter Search Space Classification n_estimators {100, 300} learning_rate {0.0001, 0.001, 0.01} num_leaves {20, 30} depth {2, 3} width {8, 32, 64} learning_rate {0.0001, 0.001, 0.01} For f DRE, we have implemented a version that employs KLIEP as the second-stage DRE method. Adhering to the guidelines provided by the official open-source code3 and the original research paper (Choi et al., 2021), we trained the masked autoregressive flow (MAF) models with the configurations presented in Table 6. For the KLIEP method in the second stage, we have retained the original hyperparameter settings as per the open-source implementation. Table 6. Hyperparameter setting for the normalizing flow model in f DRE Dataset n_blocks n_hidden hidden_size n_epochs IHDP 5 1 100 100 Regression Benchmarks 5 1 100 100 MI Gaussians 5 1 100 200 For RRND, in the absence of open-source code, we implemented the algorithm by adhering to the implementation details outlined in the Numerical Illustrations section in Nguyen et al. (2024). The kernel function is assigned as 𝐟(𝑥,𝑥 ) = 1+exp{ (𝑥 𝑥 )2/2}. Utilizing a 5-fold CV, the hyperparameter 𝜆is chosen based on the quasi-optimality criterion, and the optimal iteration step 𝑘is determined by minimizing squared distance loss function in the validation set. Following the configurations in Nguyen et al. (2024), the search grids are 𝑘 {1,2,...,10} and 𝜆 {𝜆ℓ= 𝜆0𝜌ℓ,ℓ= 1,...,9} with 𝜆0 = 0.9. The decay factor 𝜌= (𝜆𝑀/𝜆0)1/𝑙is derived from the lower bound 𝜆𝑀= (𝑛 1/2 +𝑚 1/2), where 𝑛,𝑚are the sample sizes of the numerator and denominator distributions respectively. C. Additional Numerical Results C.1. Stabilized Weights Estimation Metrics We use the mean squared error (RMSE) and the root mean squared logarithmic error (RMSLE) to evaluate the performance, which are defined as follows: 𝑖=1 (ˆ𝑟(𝒙𝑖) 𝑟 (𝒙𝑖))2, RMSLE = 𝑖=1 (log ˆ𝑟(𝒙𝑖) log𝑟 (𝒙𝑖))2. 3https://github.com/ermongroup/f-dre Projection Pursuit Density Ratio Estimation Results We present additional numerical results, including the box plot of RMSLE in Figure 4 and the RMSE values in Table 7. The box plot of RMSLE in Figure 4 shows the variation of all methods. In Table 7, due to the extremely large RMSE in some data replications, we report the median of RMSE values in 50 replications, which offers evidence supporting the superior performance of the pp DRE method. 
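For reference, the two evaluation metrics just defined can be computed as follows; this is a minimal sketch assuming the standard root-mean-square forms implied by their names.

```python
import numpy as np

def rmse(r_hat, r_true):
    # Root mean squared error between estimated and true density ratios.
    return np.sqrt(np.mean((r_hat - r_true) ** 2))

def rmsle(r_hat, r_true):
    # Root mean squared logarithmic error; both ratios must be positive.
    return np.sqrt(np.mean((np.log(r_hat) - np.log(r_true)) ** 2))
```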
2 10 30 50 100 Dimension of X u LSIF KLIEP Classification D3-LHSS nn DRE pp DRE Figure 4. Boxplot of RMSLE in stabilized weights estimation for different covariate dimension settings 𝑑𝑋 {2,10,30,50,100} estimation across 50 data replications Table 7. Median RMSE in stabilized weights estimation for different covariate dimension settings 𝑑𝑋 {2,10,30,50,100}. (Median) RMSE Dimension of Covariates Method 2 10 30 50 100 u LSIF 1.864 4.216 2.452 1.875 1.321 KLIEP 1.827 4.222 2.464 1.898 1.39 Classification 1.93 4.217 2.413 1.809 1.208 D3-LHSS 2.604 8.793 2.637 2.454 2.054 nn DRE 2.181 4.739 2.772 2.227 1.549 pp DRE 1.742 3.835 1.777 1.12 0.9 C.2. Computational Cost Evaluation Through controlled numerical experimentation, we systematically evaluate the computational costs of all competing methodologies. Maintaining the experimental protocol established in Section 5.2.1, we conduct dimensional scalability analyses over 𝑑𝑋 {2,10,30,50,100} and measure absolute wall-clock execution times. The experiments were carried out on a computing node utilizing dual AMD EPYC 7713 processors, providing a total of 128 CPU cores. Table 8 presents the mean computational duration across three statistically independent trials, with temporal measurements recorded in minutes. We obtain the following key observations: (i) when the dimension of the data is low or moderately large (𝑑𝑋= 2,10), the computation time of our PPDRE method is comparable to that of the u LSIF, KLIEP, and classification methods, which are suitable for low-dimensional data. Our computation time is much less than that of the f DRE, RRND, nn DRE and 𝐷3-LHSS methods, some of which are also designed for high-dimensional data. (ii) when the dimension of the data is large (𝑑𝑋= 30,50,100), the methods for low-dimensional data fail to work due to the curse of dimensionality. Our PPDRE Projection Pursuit Density Ratio Estimation Table 8. Computation Time Comparison Method 𝑑= 2 𝑑= 10 𝑑= 30 𝑑= 50 𝑑= 100 u LSIF 0.38 0.35 0.38 0.43 0.46 KLIEP 0.36 0.39 0.49 0.57 0.72 Classification 0.21 0.16 0.19 0.23 0.40 f DRE 3.78 3.88 3.92 4.28 7.25 RRND 3.78 3.87 3.77 3.82 3.81 nn DRE 5.91 3.84 8.76 6.02 8.63 𝐷3-LHSS 5.95 11.25 10.60 10.79 10.97 pp DRE 0.32 0.91 5.64 10.78 15.22 method consistently outperforms all baseline methods in estimation accuracy, and the computation time is comparable to that of existing methods for high-dimensional data. The observed increase in computation time is expected, as higher dimensions naturally require more iterations to reach convergence. D. Datasets D.1. Dataset in Dose Response Function Estimation IHDP-continuous Dataset The original IHDP dataset contains 25 covariates with binary treatments and continuous outcomes. We follow similar procedure as employed by Nie et al. (2020) and Gao et al. (2023), disregard the treatments and outcomes, and use the covariates to generate continuous dosages and treatments. Let 𝑋1,..., 𝑋𝑑𝑋denote the first to the last element of 𝑿, where 𝑑𝑋= 25. The generating procedure for the semi-synthetic IHDP-continuous dataset is as follows: Assigned treatment: 𝑇= (1+exp( 𝑇)) 1 , (12) where 𝑇= 𝑋1 1+ 𝑋2 + max{𝑋3, 𝑋4, 𝑋5} 0.2+min{𝑋3, 𝑋4, 𝑋5} +tanh 5 Í 𝑖 𝐌1 𝑋𝑖 Potential outcome: 𝑌 (𝑡) = ℎ(𝑡, 𝑿) +0.5𝜖, where ℎ(𝑡, 𝑿) = (1.2 𝑡2) sin(2𝜋𝑡 2) 0.5tanh 5 Í 𝑖 𝐌2 𝑋𝑖 +1.5exp 0.2(𝑋1 𝑋5) 0.1+min{𝑋2, 𝑋3, 𝑋4} where 𝜖 N (0,1), 𝐌1 = {3,6,7,8,9,10,11,12,13,14} and 𝐌2 = {15,16,17,18,19,20,21,22,23,24}. D.2. 
D.2. Datasets in Covariate Shift Adaptation

In this section, we introduce the basic information and the regression tasks of the benchmark datasets used in the covariate shift adaptation experiments. These datasets are readily accessible via the Ready Tensor platform. The numbers of observations and features for each dataset are given in Table 9.

Table 9. Observation and Feature Numbers of Regression Benchmark Datasets
Dataset | Observations | Features
Abalone | 4177 | 8
Billboard Spotify | 8930 | 18
Cancer Mortality | 3047 | 31
Computer Activity | 8192 | 21
Diamond Prices | 6000 | 7

Abalone. The Abalone dataset, accessible online⁴, is a popular dataset used in machine learning and statistics to predict the age of abalone from physical measurements. Abalone age is determined by cutting the shell, staining it, and counting the number of growth rings. However, this is a destructive process, so a non-destructive method based on physical measurements is desirable.

⁴ https://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html

Billboard Spotify. The Billboard Spotify dataset is a comprehensive collection of audio features for songs that appeared on various Billboard charts spanning the years 1961 through 2022. For each song, the corresponding audio features were sourced from the Spotify API. The regression task for the Billboard Spotify dataset is to predict the danceability of a song, which describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity.

Cancer Mortality. The Cancer Mortality dataset, aggregated from authoritative sources including the American Community Survey, clinicaltrials.gov, and cancer.gov, is designed for regression tasks to predict cancer mortality rates in US counties from 2010 to 2016. It incorporates 2013 census data, including county-level features such as population, income, households, and other demographic attributes.

Computer Activity. The Computer Activity database consists of a collection of metrics related to the activity of computer systems. The data were collected from a Sun SPARCstation 20/712 with 128 Mbytes of memory running in a multi-user university department. This dataset, available online⁴, is used to estimate the proportion of time during which central processing units (CPUs) operate in user mode, based on the recorded system activity measures.

Diamond Prices. The Diamond Prices dataset is accessible via the PyCaret library within the Python programming environment. The objective of this dataset is to predict diamond prices based on a set of attributes, including carat weight, cut, color, clarity, polish, symmetry, and the report issued by the grading agency that assessed the diamond's qualities.

E. $L_2$-distance Minimization

In this section, we show that minimizing the $L_2$-distance $\mathbb{E}_q\{[r^*(\boldsymbol{x}) - r_{k-1}(\boldsymbol{x}) f(\boldsymbol{a}^\top \boldsymbol{x})]^2\}$ w.r.t. $f$ or $\boldsymbol{a}$ is equivalent to minimizing
$$H(f, \boldsymbol{a}) := \mathbb{E}_q\{r_{k-1}^2(\boldsymbol{x}) f^2(\boldsymbol{a}^\top \boldsymbol{x})\} - 2\,\mathbb{E}_p\{r_{k-1}(\boldsymbol{x}) f(\boldsymbol{a}^\top \boldsymbol{x})\}.$$
Note that, by definition, $r^*(\boldsymbol{x}) = p(\boldsymbol{x})/q(\boldsymbol{x})$. Then,
$$\begin{aligned}
\mathbb{E}_q\{[r^*(\boldsymbol{x}) - r_{k-1}(\boldsymbol{x}) f(\boldsymbol{a}^\top \boldsymbol{x})]^2\}
&= \mathbb{E}_q\{[r^*(\boldsymbol{x})]^2\} + \mathbb{E}_q\{[r_{k-1}(\boldsymbol{x}) f(\boldsymbol{a}^\top \boldsymbol{x})]^2\} - 2\int_{\mathcal{X}} r_{k-1}(\boldsymbol{x}) f(\boldsymbol{a}^\top \boldsymbol{x})\, r^*(\boldsymbol{x})\, q(\boldsymbol{x})\, d\boldsymbol{x} \\
&= \mathbb{E}_q\{[r^*(\boldsymbol{x})]^2\} + \mathbb{E}_q\{[r_{k-1}(\boldsymbol{x}) f(\boldsymbol{a}^\top \boldsymbol{x})]^2\} - 2\int_{\mathcal{X}} r_{k-1}(\boldsymbol{x}) f(\boldsymbol{a}^\top \boldsymbol{x})\, p(\boldsymbol{x})\, d\boldsymbol{x} \\
&= \mathbb{E}_q\{[r^*(\boldsymbol{x})]^2\} + \mathbb{E}_q\{[r_{k-1}(\boldsymbol{x}) f(\boldsymbol{a}^\top \boldsymbol{x})]^2\} - 2\,\mathbb{E}_p\{r_{k-1}(\boldsymbol{x}) f(\boldsymbol{a}^\top \boldsymbol{x})\}.
\end{aligned}$$
Since the first term $\mathbb{E}_q\{[r^*(\boldsymbol{x})]^2\}$ does not depend on either $f$ or $\boldsymbol{a}$, minimizing $\mathbb{E}_q\{[r^*(\boldsymbol{x}) - r_{k-1}(\boldsymbol{x}) f(\boldsymbol{a}^\top \boldsymbol{x})]^2\}$ w.r.t. $f$ or $\boldsymbol{a}$ is equivalent to minimizing $H(f, \boldsymbol{a})$.
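In practice, $H(f, \boldsymbol{a})$ is replaced by its sample analogue, which involves only the two samples and the previous-stage ratio estimate $\hat{r}_{k-1}$ and requires no knowledge of $r^*$. The following is a minimal sketch with illustrative names and a toy Gaussian check; it is not the implementation used in our experiments.

```python
import numpy as np

def empirical_H(f, a, Xp, Xq, r_prev):
    """Sample analogue of H(f, a) = E_q[r_{k-1}^2 f^2(a'x)] - 2 E_p[r_{k-1} f(a'x)].

    Xp, Xq : samples from the numerator density p and denominator density q.
    r_prev : callable returning the previous-stage ratio estimate per row.
    f      : candidate univariate ridge function; a : projection direction.
    """
    zq, zp = Xq @ a, Xp @ a
    term_q = np.mean((r_prev(Xq) ** 2) * f(zq) ** 2)
    term_p = np.mean(r_prev(Xp) * f(zp))
    return term_q - 2.0 * term_p

# Toy check: with r_prev = 1, the true ridge ratio along a gives a smaller
# value of the criterion than a flat candidate, as the equivalence predicts.
rng = np.random.default_rng(0)
Xq = rng.normal(0.0, 1.0, size=(5000, 2))   # q = N(0, I)
Xp = rng.normal(0.5, 1.0, size=(5000, 2))   # p shifts the first coordinate
a = np.array([1.0, 0.0])
r_prev = lambda X: np.ones(len(X))
f_true = lambda z: np.exp(0.5 * z - 0.125)  # p/q along the first coordinate
f_flat = lambda z: np.ones_like(z)
print(empirical_H(f_true, a, Xp, Xq, r_prev) < empirical_H(f_flat, a, Xp, Xq, r_prev))
```

F. Proof of Theorem 4.2

F.1.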
Notations and Assumptions To prove the theorem, we require the following notations and assumptions. Notations of derivative. For any univariate function 𝑓, we let 𝑓( 𝑗) denote its 𝑗th derivative. For any multivariate function 𝑔(𝜷, 𝒂), we denote 1𝑔(𝜷, 𝒂) to be its partial derivative with respect to w.r.t. the first argument 𝜷, and 2𝑔(𝜷, 𝒂) to be its partial derivative (w.r.t.) the second argument 𝒂. Notations regarding matrices. Consider 𝑗= 0,1,2. We let Ω( 𝑗) 𝐜𝑘(𝒂) := E[𝚜( 𝑗) 𝑘(𝒂 𝒙)𝚜( 𝑗) 𝑘(𝒂 𝒙) ], for 𝑗= 0,1,2, where the expectation can be taken w.r.t. the probability density 𝑝( ) or 𝑞( ). For the corresponding empirical version, let {𝒙1,...,𝒙𝑛} be i.i.d. from either the probability density 𝑝( ) (resp. 𝑛= 𝑛𝑝) or 𝑞( ) (resp. 𝑛= 𝑛𝑞), we define 𝑷( 𝑗) (𝒂) = {𝚜( 𝑗) 𝑘(𝒂 𝒙1),...,𝚜( 𝑗) 𝑘(𝒂 𝒙𝑛)} R𝑛 𝐜𝑘, and ˆΩ( 𝑗) 𝐜𝑘(𝒂) = 𝑛 1𝑷( 𝑗) (𝒂) 𝑷( 𝑗) (𝒂). Similarly, we define Σ𝐜𝑘(𝒂) = E𝑞 𝑟2 𝑘 1(𝒙)𝚜𝑘(𝒂 𝒙)𝚜𝑘(𝒂 𝒙) , ˆΣ𝐜𝑘(𝒂) = 𝑛 1 𝑞𝒁𝑘(𝒂) 𝒁𝑘(𝒂), ˆΣ𝐜𝑘,𝜆(𝒂) = ˆΣ𝐜𝑘(𝒂) +𝜆𝑰𝐜𝑘, where 𝒁𝑘(𝒂) is defined in Proposition 4.1. Projection Pursuit Density Ratio Estimation Moreover, we let 𝜂min(Ω) and 𝜂max(Ω) denote the minimum and maximum eigenvalues of any matrix Ω, respectively. Notations regarding minimization criteria. Recall the definitions of 𝐻( 𝑓, 𝒂) and ˆL𝑘(𝒂, 𝜷;𝜆) from (2) and (4), respectively. By definition, for any 𝒂 A, ˆ𝑓𝒂,𝑘(𝒂 𝒙) = ˆ𝜷𝑘(𝒂) 𝚜𝑘(𝒂 𝒙) , and ˆ𝜷𝑘(𝒂) = { ˆΣ𝐜𝑘,𝜆(𝒂)} 1 1 𝑖=1 ˆ𝑟𝑘 1(𝒙𝑝 𝑖)𝚜𝑘(𝒂 𝒙𝑝 𝑖) . (14) We define 𝐻 (𝜷, 𝒂) := E𝑞[𝑟2 𝑘 1(𝒙) {𝜷 𝚜𝑘(𝒂 𝒙)}2] 2E𝑝[𝑟𝑘 1(𝒙) 𝜷 𝚜𝑘(𝒂 𝒙)] to be the approximation of 𝐻( 𝑓, 𝒂) and the theoretical counterpart of ˆL𝑘. We then define 𝜷 𝑘(𝒂) := argmin 𝜷 𝐻 (𝜷, 𝒂) for any 𝒂 A, 𝒂 𝑘= argmin 𝒂 A 𝐻 (𝜷 𝑘(𝒂), 𝒂), and 𝑓 𝒂,𝑘(𝒂 𝒙) = 𝜷 𝑘(𝒂) 𝚜𝑘(𝒂 𝒙) for any 𝒙 X. Assumption F.1. (a) The support X of 𝒙is a compact subset of R𝑑. (b) The parameters {𝒂𝑘}𝐟 𝑘=1 defined in (7) are in the interior of a compact set A S+ 𝑑. Assumption F.2. (a) For every 𝒂 A, there exists a 𝜷𝑘, 𝑗(𝒂) R𝐜𝑘such that sup 𝒂 A sup 𝑧 Z 𝑓( 𝑗) 𝒂,𝑘(𝑧) {𝜷𝑘, 𝑗(𝒂)} 𝚜( 𝑗) 𝑘(𝑧) < 𝐶0, 𝑗𝐜 (𝑠 𝑗) 𝑘 , (15) for some constants 𝐶0, 𝑗,𝑠> 0, and 0 𝑗< 𝑠, where Z := {𝒂 𝒙: 𝒂 A and 𝒙 X}; (b) For every 𝑧 Z, 𝑓𝒂,𝑘(𝑧) is continuously differentiable w.r.t. 𝒂, and sup𝑧 Z | 𝒂𝑓𝒂,𝑘(𝑧)| 𝒂=𝒂𝑘 < ; (c) 𝑓𝒂,𝑘(𝑧) are uniformly bounded and bounded away from 0. Assumption F.3. (a) The eigenvalues of Ω(0) 𝐜𝑘(𝒂) are bounded and bounded away from zero uniformly in 𝐜𝑘and 𝒂 A. (b) There exist sequences of constants 𝜁𝑗(𝐜𝑘), 𝑗= 1,2, such that the eigenvalues of Ω( 𝑗) 𝐜𝑘(𝒂) are bounded by 𝜁𝑗(𝐜𝑘) uniformly in 𝐜𝑘and 𝒂 A (c) There exist sequences of constants 𝜁𝑗(𝐜𝑘), 𝑗= 0,1,2, such that sup 𝑗 𝑚sup𝑧 Z 𝚜( 𝑗) 𝑘(𝑧) 𝜁𝑚(𝐜𝑘) for 𝑚= 0,1,2 and all 𝑘. (d) As 𝑛𝑝,𝑛𝑞 , 𝐜𝑘 , 𝜁0(𝐜𝑘)2 𝐜𝑘/(𝑛𝑝 𝑛𝑞) 0, 𝐜 (𝑠 1) 𝑘 max{𝜁2(𝐜𝑘),𝜁0(𝐜𝑘)2} 0, 𝜁2(𝐜𝑘) 𝐜𝑘/(𝑛𝑝 𝑛𝑞) 0 and 𝜆= 𝑂( Assumption F.4. The minimum eigenvalue of 2 𝒂𝐻 (𝜷 𝑘(𝒂), 𝒂) 𝒂=𝒂 𝑘is bounded away from 0 uniformly in 𝐜𝑘. Assumption F.1 imposes compactness conditions on the densities and parameter spaces. This condition is convenient for deriving uniform convergence rates. Assumption F.2 includes regularity conditions. Assumption F.2 (a) can be satisfied if 𝑓𝒂,𝑘(𝑧) is 𝑠-times continuously differentiable w.r.t. 𝑧 Z for any 𝒂 A (Lorentz, 1986). Assumption F.3 rules out near multicollinearity in the approximating basis functions. This condition is familiar in the sieve regression literature (Chen, 2007a). Assumption F.4 requires the Hessian matrix of the approximation of the theoretical minimization criteria to be positive definite at its minimum, 𝒂 𝑘, which is satisfied if 𝒂 𝑘is in the interior of A. F.2. 
Outline of the proof We prove the results in an inductive way. Note that the initial estimate 𝑓0(𝒙; 𝜷0) 1, 𝑟𝑘(𝒙) = 1 Π𝑘 𝑚=1 𝑓𝒂𝑚,𝑚(𝒂 𝑚𝒙) and ˆ𝑟𝑘(𝒙) = 𝑓0(𝒙, 𝜷0)Π𝑘 𝑚=1 ˆ𝑓ˆ𝒂𝑚,𝑚( ˆ𝒂 𝑚𝒙), and ˆ𝑓𝒂,𝑘is estimated based on ˆ𝑟𝑘 1. Let 𝑟0(𝒙) = 1 and ˆ𝑟0(𝒙) = 𝑓0(𝒙; 𝜷0) = 1. The following results hold for 𝑘= 1. 𝑖=1 |ˆ𝑟𝑘 1(𝒙𝑖) 𝑟𝑘 1(𝒙𝑖)|2 = 𝑂𝑝(𝜉𝑛,𝑘 1) . (16) We shall show that, for any 𝑘 {1,2,...,𝐟}, given (16) holds, we have sup 𝒙 X | ˆ𝑓ˆ𝒂𝑘,𝑘( ˆ𝒂 𝑘𝒙) 𝑓𝒂𝑘,𝑘(𝒂 𝑘𝒙)| = 𝑂𝑃 {𝐜 (𝑠 1) 𝑘 + 𝐜𝑘/(𝑛𝑝 𝑛𝑞)} { 𝜁1(𝐜𝑘) 𝜁2 0 (𝐜𝑘)} , (17) where 𝜁1(𝐜𝑘) is the rate of the maximum eigenvalue of Ω(1) 𝐜𝑘(𝒂) and 𝜁1(𝐜𝑘) 𝜁0(𝐜𝑘) = max{ 𝜁1(𝐜𝑘),𝜁0(𝐜𝑘)}. Then, using the fact that 𝜉𝑛,0 = 0, we can inductively derive that sup 𝒙 X |ˆ𝑟𝐟(𝒙) 𝑟𝐟(𝒙)| = 𝑂𝑝 𝐜 (𝑠 1) ℓ + 𝜁1(𝐜𝑖) 𝜁2 0 (𝐜𝑖) o#! Projection Pursuit Density Ratio Estimation which establishes the last statement of Theorem 4.2. Proof of (17). To prove (17), we will need to establish the results in (8) and (9), which are relegated to Sections F.3 and F.4, respectively. Specifically, we decompose ˆ𝑓ˆ𝒂𝑘,𝑘( ˆ𝒂 𝑘𝒙) 𝑓𝒂𝑘,𝑘(𝒂 𝑘𝒙) = ˆ𝑓ˆ𝒂𝑘,𝑘( ˆ𝒂 𝑘𝒙) 𝑓ˆ𝒂𝑘,𝑘( ˆ𝒂 𝑘𝒙) (18) + 𝑓ˆ𝒂𝑘,𝑘(𝒂 𝑘𝒙) 𝑓𝒂𝑘,𝑘(𝒂 𝑘𝒙) (19) + 𝑓ˆ𝒂𝑘,𝑘( ˆ𝒂 𝑘𝒙) 𝑓ˆ𝒂𝑘,𝑘(𝒂 𝑘𝒙) . (20) For (18), we have sup𝒙 X | ˆ𝑓ˆ𝒂𝑘,𝑘( ˆ𝒂 𝑘𝒙) 𝑓ˆ𝒂𝑘,𝑘( ˆ𝒂 𝑘𝒙)| sup𝒂 A sup𝒙 X | ˆ𝑓𝒂,𝑘(𝒂 𝒙) 𝑓𝒂,𝑘(𝒂 𝒙)|. For (19), under Assumption F.2 (b), we can use Taylor s expansion to expand 𝑓ˆ𝒂𝑘,𝑘around 𝑓𝒂𝑘,𝑘and obtain sup𝒙 X | 𝑓ˆ𝒂𝑘,𝑘(𝒂 𝑘𝒙) 𝑓𝒂𝑘,𝑘(𝒂 𝑘𝒙)| = 𝑂𝑝( ˆ𝒂𝑘 𝒂𝑘 ). For (20), we can first decompose it as (20) = 𝑓𝒂𝑘,𝑘( ˆ𝒂 𝑘𝒙) 𝑓𝒂𝑘,𝑘(𝒂 𝑘𝒙) + 𝑓ˆ𝒂𝑘,𝑘( ˆ𝒂 𝑘𝒙) 𝑓𝒂𝑘,𝑘( ˆ𝒂 𝑘𝒙) + 𝑓𝒂𝑘,𝑘(𝒂 𝑘𝒙) 𝑓ˆ𝒂𝑘,𝑘(𝒂 𝑘𝒙) . The supremum norm of the first term can be bounded by 𝑂𝑝( ˆ𝒂𝑘 𝒂𝑘 ), using Taylor s expansion expanding 𝑓𝒂𝑘,𝑘( ˆ𝒂 𝑘𝒙) around 𝑓𝒂𝑘,𝑘(𝒂 𝑘𝒙). The second and the third term can also be bounded by 𝑂𝑝( ˆ𝒂𝑘 𝒂𝑘 ) using the same argument for (19). Thus, we have sup 𝒙 X | ˆ𝑓ˆ𝒂𝑘,𝑘( ˆ𝒂 𝑘𝒙) 𝑓𝒂𝑘,𝑘(𝒂 𝑘𝒙)| = 𝑂𝑝 sup 𝒂 A sup 𝒙 X | ˆ𝑓𝒂𝑘,𝑘(𝒂 𝑘𝒙) 𝑓𝒂𝑘,𝑘(𝒂 𝑘𝒙)| + ˆ𝒂𝑘 𝒂𝑘 . Then, using (8) and (9), and noting that 𝜁0(𝐜𝑘) 𝐜𝑘, we then have (17) holds. The outline of the proof is completed. F.3. Proof of (8) Recalling the definitions of 𝑓𝒂,𝑘in (6) and 𝑓 𝒂,𝑘in Section F.1, we decompose sup 𝒂 A sup 𝑧 Z ˆ𝑓𝒂,𝑘(𝑧) 𝑓𝒂,𝑘(𝑧) sup 𝒂 A sup 𝑧 Z 𝑓 𝒂,𝑘(𝑧) 𝑓𝒂,𝑘(𝑧) (21) + sup 𝒂 A sup 𝑧 Z ˆ𝑓𝒂,𝑘(𝑧) 𝑓 𝒂,𝑘(𝑧) . (22) We derive the convergence rates for (21) and (22). Rate of (21). We obtain the rate (21) = 𝑂(𝐜 𝑠 𝑘𝜁0(𝐜𝑘)) from Lemma F.9. Rate for (22). By definition, we have 𝜷 𝑘(𝒂) = {Σ𝐜𝑘(𝒂)} 1E𝑝[𝑟𝑘 1(𝒙)𝚜𝑘(𝒂 𝒙)]. Recalling the representation of ˆ𝑓𝒂,𝑘in (14), we decompose sup 𝒂 A sup 𝒙 X | ˆ𝑓𝒂,𝑘(𝒂 𝒙) 𝑓 𝒂,𝑘(𝒂 𝒙)| = sup 𝒂 A sup 𝒙 X 𝚜𝑘(𝒂 𝒙) { ˆΣ𝐜𝑘,𝜆(𝒂)} 1 1 𝑖=1 ˆ𝑟𝑘 1(𝒙𝑝 𝑖)𝚜𝑘(𝒂 𝒙𝑝 𝑖) 𝚜𝑘(𝒂 𝒙) {Σ𝐜𝑘(𝒂)} 1E𝑝[𝑟𝑘 1(𝒙)𝚜𝑘(𝒂 𝒙)] sup 𝒂 A sup 𝒙 X 𝚜𝑘(𝒂 𝒙) { ˆΣ𝐜𝑘,𝜆(𝒂)} 1 1 𝑖=1 {ˆ𝑟𝑘 1(𝒙𝑝 𝑖) 𝑟𝑘 1(𝒙𝑝 𝑖)}𝚜𝑘(𝒂 𝒙𝑝 𝑖) + sup 𝒂 A sup 𝒙 X 𝚜𝑘(𝒂 𝒙) { ˆΣ𝐜𝑘,𝜆(𝒂)} 1 1 𝑖=1 𝑟𝑘 1(𝒙𝑝 𝑖)𝚜𝑘(𝒂 𝒙𝑝 𝑖) 𝚜𝑘(𝒂 𝒙) { ˆΣ𝐜𝑘,𝜆(𝒂)} 1E𝑝[𝑟𝑘 1(𝒙)𝚜𝑘(𝒂 𝒙)] (24) + sup 𝒂 A sup 𝒙 X 𝚜𝑘(𝒂 𝒙) { ˆΣ𝐜𝑘,𝜆(𝒂)} 1 {Σ𝐜𝑘(𝒂)} 1 E𝑝[𝑟𝑘 1(𝒙)𝚜𝑘(𝒂 𝒙)] . (25) Projection Pursuit Density Ratio Estimation We derive the rate for (23) to (25) one by one. Rate for (23). Note that the term in the absolute value sign in (23) is the 𝐿2-projection of {ˆ𝑟𝑘 1(𝒙𝑝) 𝑟𝑘 1(𝒙𝑝)} to the linear space spanned by the sieve basis Ί𝑘(𝒂 𝒙). Given that A and X are compact sets and the sieve basis 𝚜𝑘is a local basis, we have (23) sup𝒂 A sup𝒙 X | 𝚜𝑘(𝒂 𝒙) Í𝑛𝑝 𝑖=1{ˆ𝑟𝑘 1(𝒙𝑝 𝑖) 𝑟𝑘 1(𝒙𝑝 𝑖)}2/𝑛𝑝= 𝑂𝑝(𝜁0(𝐜𝑘) Rate for (24). Define 𝑏𝒂(𝑧) = 𝚜𝑘(𝑧) { ˆΣ𝐜𝑘,𝜆(𝒂)} 1/ 𝚜𝑘(𝑧) { ˆΣ𝐜𝑘,𝜆(𝒂)} 1 and S𝐜𝑘to be the 𝐜𝑘dimensional unit ball S𝐜𝑘:= {𝑏 R𝐜𝑘: 𝑏 = 1}. 
Since, under Assumption F.3, sup 𝒂 A sup 𝒙 X 𝚜𝑘(𝑧) { ˆΣ𝐜𝑘,𝜆(𝒂)} 1 = 𝑂𝑃(𝜁0(𝐜𝑘)), the term (24) can be bounded as follows: (24) =𝑂𝑃{𝜁0(𝐜𝑘)} sup 𝒂 A sup 𝒙 X 𝑖=1 𝑏𝒂(𝒂 𝒙) 𝑟𝑘 1(𝒙𝑝 𝑖)𝚜𝑘(𝒂 𝒙𝑝 𝑖) E𝑝[𝑏𝒂(𝒂 𝒙) 𝑟𝑘 1(𝒙)𝚜𝑘(𝒂 𝒙)] 𝑂𝑃{𝜁0(𝐜𝑘)} sup (𝑏,𝒂) S𝐜𝑘 A 𝑖=1 𝑏 𝑟𝑘 1(𝒙𝑝 𝑖)𝚜𝑘(𝒂 𝒙𝑝 𝑖) E𝑝[𝑏 𝑟𝑘 1(𝒙)𝚜𝑘(𝒂 𝒙)] Consider the measurable function class H𝑘:= {𝒙 𝑏 𝚜𝑘(𝒂 𝒙)} : (𝑏, 𝒂) S𝐜𝑘 A} with its envelope denoted by 𝐻𝑘(𝒙) := sup(𝑏,𝒂) S𝐜𝑘 A 𝑏 𝚜𝑘(𝒂 𝒙) = sup𝒂 A 𝚜𝑘(𝒂 𝒙) . We denote 𝐻 𝑓:= { 𝐻(𝑥)2 𝑓(𝑥)𝑑𝑥}1/2 for any function 𝐻and density 𝑓. Then 𝐻𝑘 𝑝= 𝑂(𝜁0(𝐜𝑘)). Note that by the maximal inequality (e.g., Vaart & Wellner, 1996, Theorem 2.14.2 ), sup (𝑏,𝒂) S𝐜𝑘 A 𝑖=1 𝑏 𝑟𝑘 1(𝒙𝑝 𝑖)𝚜𝑘(𝒂 𝒙𝑝 𝑖) E𝑝[𝑏 𝑟𝑘 1(𝒙)𝚜𝑘(𝒂 𝒙)] 1+log𝑁[](𝜖 𝐻𝑘 𝑝,H𝑘, 𝑝)𝑑𝜖 𝐻𝑘 𝑝, where 𝑁[](𝜖 𝐻𝑘 𝑝,H𝑘, 𝑝) is the minimum number of 𝜖 𝐻𝑘 𝑝-brackets needed to cover H𝑘(Vaart & Wellner, 1996, Section 2.1). Since 𝐻𝑘 𝑝 𝐻𝑘 , we have 𝑁[](𝜖,H𝑘, 𝑝) 𝑁[](𝜖,H𝑘, ). (27) Then, by Lemma F.5 we deduce that 1+log𝑁[](𝜖 𝐻𝑘 𝑝,H𝑘, 𝑝)𝑑𝜖 1 𝐜𝑘log 𝜖 𝐻𝑘 𝑝 1 𝐜𝑘 log(𝜖)𝑑𝜖= 𝑂( sup (𝑏,𝒂) S𝐜𝑘 A 𝑖=1 𝑏 𝑟𝑘 1(𝒙𝑝 𝑖)𝚜𝑘(𝒂 𝒙𝑝 𝑖) E𝑝[𝑏 𝑟𝑘 1(𝒙)𝚜𝑘(𝒂 𝒙)] Consequently, by (26), we have Rate for (25). Projection Pursuit Density Ratio Estimation Note that, under Assumption F.3 and recalling the definition of 𝜂max( ) and the results in Lemma F.6, (25) 𝜁0(𝐜𝑘) sup 𝒂 A 𝜂max { ˆΣ𝐜,𝜆(𝒂)} 1{Σ𝐜𝑘(𝒂)} 1 Σ𝐜(𝒂) ˆΣ𝐜,𝜆(𝒂) E𝑝 𝑟𝑘 1(𝒙)𝚜𝑘(𝒂 𝒙) = 𝑂𝑝(𝜁0(𝐜𝑘)) sup 𝒂 A 𝜂max Σ𝐜(𝒂) ˆΣ𝐜𝑘(𝒂) 2 +𝜆 # E𝑝 𝑟𝑘 1(𝒙)𝚜𝑘(𝒂 𝒙) 𝑂𝑝(𝜁0(𝐜𝑘)) sup 𝒂 A 𝜆+ Σ𝐜(𝒂) ˆΣ𝐜𝑘(𝒂) E𝑝 𝑟𝑘 1(𝒙)𝚜𝑘(𝒂 𝒙) 𝜉𝑛,𝑘 1𝜁0(𝐜𝑘)2 + 𝐜𝑘𝜁0(𝐜𝑘)2 E𝑝 𝑟𝑘 1(𝒙)𝚜𝑘(𝒂 𝒙) , (29) where the last equality follows from Lemma F.6 and the rate of 𝜆in Assumption F.3. Let 𝑉(𝒂) := E𝑝 E𝑝{𝑟𝑘 1(𝒙)|𝒂 𝒙}𝚜𝑘(𝒂 𝒙) . Under Assumption F.3, we have the eigenvalues of Ω𝐜,𝑝:= E𝑝{𝚜𝑘(𝒂 𝒙)𝚜𝑘(𝒂 𝒙) } are bounded and bounded away from 0 uniformly in 𝒂 A. Then sup 𝒂 A 𝑉(𝒂) 2 = sup 𝒂 A 𝑉(𝒂) 𝑉(𝒂) =𝑂(1) sup 𝒂 A 𝑉(𝒂) Ω𝐜,𝑝(𝒂) 1Ω𝐜,𝑝(𝒂)Ω𝐜,𝑝(𝒂) 1𝑉(𝒂) =𝑂(1) sup 𝒂 A E𝑝 {𝑉(𝒂) Ω𝐜,𝑝(𝒂) 1𝚜𝑘(𝒂 𝒙)}2 , which is the 𝐿2-projection of E𝑝{𝑟𝑘 1(𝒙)|𝒂 𝒙} to the space linearly spanned by 𝚜𝑘(𝒂 𝒙). Therefore, sup 𝒂 A 𝑉(𝒂) 2 E𝑝[E𝑝 𝑟𝑘 1(𝒙)|𝒂 𝒙 2] = 𝑂(1) . Combining the results of (23) to (25), we have 𝜉𝑛,𝑘 1𝜁2 0 (𝐜𝑘) + 𝜁2 0 (𝐜𝑘) 𝐜𝑘 𝑛𝑝 𝑛𝑞 Consequently, we have (8). F.4. Proof of (9) Consistency. We first show ˆ𝒂𝑘 𝑝 𝒂𝑘as 𝑛𝑝,𝑛𝑞 . By definition, ˆ𝒂𝑘(resp. 𝒂𝑘) is the unique minimizer of ˆL𝑘(𝒂, ˆ𝜷(𝒂);𝜆) (resp. 𝐻( 𝑓𝒂,𝑘, 𝒂)). From the theory of 𝑀-estimation (Van der Vaart, 2000, Theorem 5.7), if the following condition holds: ˆL𝑘(𝒂, ˆ𝜷(𝒂);𝜆) 𝐻( 𝑓𝒂,𝑘, 𝒂) 𝑝 0 as 𝑛𝑝,𝑛𝑞 , then we have ˆ𝒂𝑘 𝑝 𝒂𝑘. Since ˆL𝑘(𝒂, ˆ𝜷𝑘(𝒂),𝜆) = 1 𝑖=1 ˆ𝑟2 𝑘 1(𝒙𝑞 𝑖) ˆ𝑓2 𝒂,𝑘(𝒂 𝒙𝑞 𝑖) 2 𝑖=1 ˆ𝑟𝑘 1(𝒙𝑝 𝑖) ˆ𝑓𝒂,𝑘(𝒂 𝒙𝑝 𝑖) +𝜆ˆ𝜷𝑘(𝒂) ˆ𝜷𝑘(𝒂) , Projection Pursuit Density Ratio Estimation we can decompose ˆL𝑘(𝒂, ˆ𝜷(𝒂);𝜆) 𝐻( 𝑓𝒂,𝑘, 𝒂) 𝑖=1 𝑟2 𝑘 1(𝒙𝑞 𝑖) 𝑓2 𝒂,𝑘(𝒂 𝒙𝑞 𝑖) E𝑞[𝑟2 𝑘 1(𝒙) 𝑓2 𝒂,𝑘(𝒂 𝒙)] 𝑖=1 𝑟𝑘 1(𝒙𝑝 𝑖) 𝑓𝒂,𝑘(𝒂 𝒙𝑝 𝑖) E𝑝[𝑟𝑘 1(𝒙) 𝑓𝒂,𝑘(𝒂 𝒙)] ˆ𝑟2 𝑘 1(𝒙𝑞 𝑖) ˆ𝑓2 𝒂,𝑘(𝒂 𝒙𝑞 𝑖) 𝑟2 𝑘 1(𝒙𝑞 𝑖) 𝑓2 𝒂,𝑘(𝒂 𝒙𝑞 𝑖) ˆ𝑟𝑘 1(𝒙𝑝 𝑖) ˆ𝑓𝒂,𝑘(𝒂 𝒙𝑝 𝑖) 𝑟𝑘 1(𝒙𝑝 𝑖) 𝑓𝒂,𝑘(𝒂 𝒙𝑝 𝑖) (31) +𝜆sup 𝒂 A ˆ𝜷𝑘(𝒂) ˆ𝜷𝑘(𝒂) . (32) Rate for (30). As imposed in Assumptions F.1 and F.2, A is compact and 𝑓𝒂,𝑘(𝒂 𝒙) is continuous in 𝒂, applying the uniform law of large numbers (Newey & Mc Fadden, 1994, Lemma 2.4), we have (30) = 𝑂𝑝(1/ 𝑛𝑞+1/ 𝑛𝑝) = 𝑜𝑝(1) . Rate for (31). Note that 1 𝑛𝑞 ˆ𝑟2 𝑘 1(𝒙𝑞 𝑖) ˆ𝑓2 𝒂,𝑘(𝒂 𝒙𝑞 𝑖) 𝑟2 𝑘 1(𝒙𝑞 𝑖) 𝑓2 𝒂,𝑘(𝒂 𝒙𝑞 𝑖) {ˆ𝑟2 𝑘 1(𝒙𝑞 𝑖) 𝑟2 𝑘 1(𝒙𝑞 𝑖)} ˆ𝑓2 𝒂,𝑘(𝒂 𝒙𝑞 𝑖) 𝑟2 𝑘 1(𝒙𝑞 𝑖){ ˆ𝑓2 𝒂,𝑘(𝒂 𝒙𝑞 𝑖) 𝑓2 𝒂,𝑘(𝒂 𝒙𝑞 𝑖)} . We can apply the same decomposition to the second term of (31). 
For any 𝑘 {1,2,...,𝐟}, given (16) that 𝑛 1 𝑞 Í𝑛𝑞 𝑖=1 |ˆ𝑟𝑘 1(𝒙𝑞 𝑖) 𝑟𝑘 1(𝒙𝑞 𝑖)|2 = 𝑂𝑝(𝜉𝑛,𝑘 1), using (8) and Cauchy-Schwarz inequality, we can conclude that (31) = 𝑂𝑝 𝐜 𝑠 𝑘𝜁0(𝐜𝑘) + 𝜉𝑛,𝑘 1𝜁2 0 (𝐜𝑘) + 𝜁2 0 (𝐜𝑘) 𝐜𝑘/(𝑛𝑞 𝑛𝑝) = 𝑜𝑝(1) . Rate for (32). Using the rate of 𝜆in Assumption F.3 and Lemma F.11, we have (32) = 𝑜𝑝(1). Consequently, we have ˆ𝒂𝑘 𝒂𝑘 𝑝 0. Convergence Rate. Let ˆℓ𝑘(𝒂) = ˆL𝑘(𝒂, ˆ𝜷𝑘(𝒂);𝜆). Since ˆ𝒂𝑘is the unique global minimizer of the context function ˆℓ𝑘(𝒂), by the first order condition, we have 𝒂ˆℓ𝑘( ˆ𝒂𝑘) = 0. By applying the mean value theorem, we obtain 0 = 𝒂ˆℓ𝑘( ˆ𝒂𝑘) = 𝒂ˆℓ𝑘(𝒂𝑘) + 2 𝒂ˆℓ𝑘( 𝒂𝑘)( ˆ𝒂𝑘 𝒂𝑘), where 𝒂lies between ˆ𝒂𝑘and 𝒂𝑘. Hence, ˆ𝒂𝑘 𝒂𝑘= 2 𝒂ˆℓ𝑘( 𝒂𝑘) 1 𝒂ˆℓ𝑘(𝒂𝑘). (33) Equation (33) implies that the convergence rate of ˆ𝒂𝑘 𝒂𝑘is determined by the Hessian matrix 2 𝒂ˆℓ𝑘( 𝒂𝑘) and the score function 𝒂ˆℓ𝑘(𝒂𝑘). We derive the rate of these terms one by one. Recalling the definition of ˆℓ𝑘(𝒂) at the beginning of the subsection and applying the chain rule, we obtain 𝒂ˆℓ𝑘(𝒂) = 1 ˆL𝑘(𝒂, ˆ𝜷𝑘(𝒂);𝜆) + 2 ˆL𝑘(𝒂, ˆ𝜷𝑘(𝒂);𝜆) 𝒂ˆ𝜷𝑘(𝒂) = 1 ˆL𝑘(𝒂, ˆ𝜷𝑘(𝒂);𝜆) , (34) Projection Pursuit Density Ratio Estimation where for any bivariate function 𝑔(𝑥, 𝑊), 1𝑔(𝑥, 𝑊) and 2𝑔(𝑥, 𝑊) denotes the partial derivative of 𝑔w.r.t. to the first and the second arguments, 𝑥and 𝑊, respectively. The second equality follows from the fact that ˆ𝜷𝑘(𝒂) is the unique global minimizer of the convex function ˆL𝑘(𝒂, ˆ𝜷𝑘(𝒂);𝜆) for any 𝒂 A, so 2 ˆL𝑘(𝒂, ˆ𝜷𝑘(𝒂);𝜆) = 0 for any 𝒂 A. Then, we have 2 𝒂ˆℓ𝑘(𝒂) = 2 1 ˆL𝑘(𝒂, ˆ𝜷𝑘(𝒂);𝜆) + 2 1 ˆL𝑘(𝒂, ˆ𝜷𝑘(𝒂);𝜆) 𝒂ˆ𝜷𝑘(𝒂) . (35) Asymptotics of the Hessian matrix 2 𝒂ˆℓ𝑘( 𝒂𝑘). Taking derivative on both sides of 2 ˆL𝑘(𝒂, ˆ𝜷𝑘(𝒂);𝜆) = 0 w.r.t. 𝒂yields 1 2 ˆL𝑘(𝒂, ˆ𝜷𝑘(𝒂);𝜆) + 2 2 ˆL𝑘(𝒂, ˆ𝜷𝑘(𝒂);𝜆) 𝒂ˆ𝜷𝑘(𝒂) = 0. Since 2 2 ˆL𝑘(𝒂, ˆ𝜷𝑘(𝒂);𝜆) = ˆΣ𝐜𝑘,𝜆(𝒂), substituting this expression into the above equation, we obtain 𝒂ˆ𝜷𝑘(𝒂) = ˆΣ𝐜𝑘,𝜆(𝒂) 1 1 2 ˆL𝑘(𝒂, ˆ𝜷𝑘(𝒂);𝜆) . (36) 2 𝒂ˆℓ𝑘(𝒂) = 2 1 ˆL𝑘(𝒂, ˆ𝜷𝑘(𝒂);𝜆) 2 1 ˆL𝑘(𝒂, ˆ𝜷𝑘(𝒂);𝜆) ˆΣ𝐜𝑘,𝜆(𝒂) 1 1 2 ˆL𝑘(𝒂, ˆ𝜷𝑘(𝒂);𝜆) . 2 1 ˆL𝑘(𝒂, ˆ𝜷𝑘(𝒂);𝜆) = 2 𝑖=1 ˆ𝑟2 𝑘 1(𝒙𝑞 𝑖){ ˆ𝑓(1) 𝒂,𝑘(𝒂 𝒙𝑞 𝑖)}2𝒙𝑞 𝑖(𝒙𝑞 𝑖) 𝑖=1 ˆ𝑟2 𝑘 1(𝒙𝑞 𝑖) ˆ𝑓𝒂,𝑘(𝒂 𝒙𝑞 𝑖) ˆ𝑓(2) 𝒂,𝑘(𝒂 𝒙𝑞 𝑖)𝒙𝑞 𝑖(𝒙𝑞 𝑖) 2 𝑖=1 ˆ𝑟𝑘 1(𝒙𝑝 𝑖) ˆ𝑓(2) 𝒂,𝑘(𝒂 𝒙𝑝 𝑖)𝒙𝑝 𝑖(𝒙𝑝 𝑖) , 2 1 ˆL𝑘(𝒂, ˆ𝜷𝑘(𝒂);𝜆) = 1 2 ˆL𝑘(𝒂, ˆ𝜷𝑘(𝒂);𝜆) = 2 𝑖=1 ˆ𝑟2 𝑘 1(𝒙𝑞 𝑖) ˆ𝑓(1) 𝒂,𝑘(𝒂 𝒙𝑞 𝑖)𝚜𝑘(𝒂 𝒙𝑞 𝑖)(𝒙𝑞 𝑖) 𝑖=1 ˆ𝑟2 𝑘 1(𝒙𝑞 𝑖) ˆ𝑓𝒂,𝑘(𝒂 𝒙𝑞 𝑖)𝚜(1) 𝑘(𝒂 𝒙𝑞 𝑖)(𝒙𝑞 𝑖) 2 𝑖=1 ˆ𝑟𝑘 1(𝒙𝑝 𝑖)𝚜(1) 𝑘(𝒂 𝒙𝑝 𝑖)(𝒙𝑝 𝑖) . Note that 2 1 ˆL𝑘(𝒂, ˆ𝜷𝑘(𝒂);𝜆) = 1 2 ˆL𝑘(𝒂, ˆ𝜷𝑘(𝒂);𝜆). Using calculations similar to those in the derivation for (31), the results of Lemmas F.6 and F.10, and the rate of 𝜆in Assumption F.3, we have 2 1 ˆL𝑘(𝒂, ˆ𝜷𝑘(𝒂);𝜆) 2E𝑞 h 𝑟2 𝑘 1(𝒙){ 𝑓 (1) 𝒂,𝑘(𝒂 𝒙)}2𝒙(𝒙) i +2E𝑞 h 𝑟2 𝑘 1(𝒙) 𝑓 𝒂,𝑘(𝒂 𝒙) 𝑓 (2) 𝒂,𝑘(𝒂 𝒙)𝒙(𝒙) i 2E𝑝 h 𝑟𝑘 1(𝒙) 𝑓 (2) 𝒂,𝑘(𝒂 𝒙)𝒙(𝒙) i 1 2 ˆL𝑘(𝒂, ˆ𝜷𝑘(𝒂);𝜆) 2E𝑞 h 𝑟2 𝑘 1(𝒙) 𝑓 (1) 𝒂,𝑘(𝒂 𝒙)𝚜𝑘(𝒂 𝒙)(𝒙) i (37) +2E𝑞 h 𝑟2 𝑘 1(𝒙) 𝑓 𝒂,𝑘(𝒂 𝒙)𝚜(1) 𝑘(𝒂 𝒙)(𝒙) i 2E𝑝 h 𝑟𝑘 1(𝒙)𝚜(1) 𝑘(𝒂 𝒙)(𝒙) i and ˆΣ𝐜𝑘,𝜆(𝒂) Σ𝐜𝑘(𝒂) = 𝑜𝑝(1) . Projection Pursuit Density Ratio Estimation Consequently, we have, 2 𝒂ˆℓ𝑘( 𝒂) = 2 𝒂𝐻 (𝜷 𝑘( 𝒂), 𝒂) + 𝑜𝑝(1) = 2 𝒂𝐻 (𝜷 𝑘(𝒂𝑘), 𝒂𝑘) + 𝑜𝑝(1) , where the last equality follows from the fact that 𝒂lies between ˆ𝒂𝑘and 𝒂𝑘, and ˆ𝒂𝑘 𝒂𝑘 𝑝 0 and the continuous mapping theory. Using arguments similar to the proof of the consistency of ˆ𝒂𝑘to 𝒂𝑘, we have 𝒂 𝑘 𝒂𝑘 0. Thus, 2 𝒂ˆℓ𝑘( 𝒂) = 2 𝒂𝐻 (𝜷 𝑘(𝒂 𝑘), 𝒂 𝑘) + 𝑜𝑝(1) . Thus, under Assumption F.4, 2 𝒂ˆℓ𝑘( 𝒂) is asymptotically invertible and 2 𝒂ˆℓ𝑘( 𝒂) 1 = 𝑂𝑝(1). Asymptotics of the score function 𝒂ˆℓ𝑘(𝒂𝑘). Note that 𝒂ˆℓ𝑘(𝒂𝑘) = 1 ˆL𝑘(𝒂𝑘, ˆ𝜷𝑘(𝒂𝑘);𝜆) 𝑖=1 ˆ𝑟2 𝑘 1(𝒙𝑞 𝑖) ˆ𝑓𝒂𝑘,𝑘(𝒂 𝑘𝒙𝑞 𝑖) ˆ𝑓(1) 𝒂𝑘,𝑘(𝒂 𝑘𝒙𝑞 𝑖)(𝒙𝑞 𝑖) 2 𝑖=1 ˆ𝑟𝑘 1(𝒙𝑝 𝑖) ˆ𝑓(1) 𝒂𝑘,𝑘(𝒂 𝑘𝒙𝑝 𝑖)(𝒙𝑝 𝑖) . 
Using (16), Lemmas F.9 and F.10, and the arguments similar to those in the proof of Lemma F.8, we have 𝒂ˆℓ𝑘(𝒂𝑘) =2E𝑞 h 𝑟2 𝑘 1(𝒙) 𝑓𝒂𝑘,𝑘(𝒂 𝑘𝒙) 𝑓(1) 𝒂𝑘,𝑘(𝒂 𝑘𝒙)(𝒙) i 2E𝑝 h 𝑟𝑘 1(𝒙) 𝑓(1) 𝒂𝑘,𝑘(𝒂 𝑘𝒙)(𝒙) i (38) {𝐜 (𝑠 1) 𝑘 + 𝜁1(𝐜𝑘) 𝐜𝑘/(𝑛𝑝 𝑛𝑞) . By the definition in (6), we have, for any 𝒂 A, 2E𝑞[𝑟2 𝑘 1(𝒙) 𝑓𝒂(𝒂 𝒙)𝜑(𝒂 𝒙)] 2E𝑝[𝑟𝑘 1(𝒙)𝜑(𝒂 𝒙)] = 0 (39) for all integrable 𝜑: R R. Moreover, by the definition in (7) and chain rule, we have 1𝐻( 𝑓𝒂, 𝒂𝑘) 𝒂𝑓𝒂 𝒂=𝒂𝑘+ 2𝐻( 𝑓𝒂𝑘, 𝒂) 𝒂=𝒂𝑘= 0. (40) Note that, for any 𝒂 A, 1𝐻( 𝑓𝒂, 𝒂𝑘) 𝒂𝑓𝒂= 2E𝑞[𝑟2 𝑘 1(𝒙) 𝑓𝒂(𝒂 𝑘𝒙) 𝒂𝑓𝒂(𝒂 𝑘𝒙)] 2E𝑝[𝑟𝑘 1(𝒙) 𝒂𝑓𝒂(𝒂 𝑘𝒙)] . Using (39), we have 1𝐻( 𝑓𝒂, 𝒂𝑘) 𝒂𝑓𝒂 𝒂=𝒂𝑘= 0. Then (40) implies that 2𝐻( 𝑓𝒂𝑘, 𝒂) 𝒂=𝒂𝑘= 2E𝑞 h 𝑟2 𝑘 1(𝒙) 𝑓𝒂𝑘,𝑘(𝒂 𝑘𝒙) 𝑓(1) 𝒂𝑘,𝑘(𝒂 𝑘𝒙)(𝒙) i 2E𝑝 h 𝑟𝑘 1(𝒙) 𝑓(1) 𝒂𝑘,𝑘(𝒂 𝑘𝒙)(𝒙) i = 0, which, combined with (38), implies that the score function 𝒂ˆℓ𝑘(𝒂𝑘) = 𝑂𝑝 {𝐜 (𝑠 1) 𝑘 + 𝜁1(𝐜𝑘) 𝐜𝑘/(𝑛𝑝 𝑛𝑞) . Then we have ˆ𝒂𝑘 𝒂𝑘 = 𝑂𝑝 {𝐜 (𝑠 1) 𝑘 + 𝜁1(𝐜𝑘) 𝐜𝑘/(𝑛𝑝 𝑛𝑞) . Projection Pursuit Density Ratio Estimation F.5. Technical Lemmas Lemma F.5. For any 𝑘 {1,2,...,𝐟}, log𝑁[](𝜖,H𝑘, ) = 𝑂 𝐜𝑘log 𝜖 1/𝜁0(𝐜𝑘) . Proof. For a fixed 𝜖> 0, we choose 𝜖/{2𝜁0(𝐜𝑘)}-balls centering at 𝑏1,...,𝑏𝑀1 S𝐜𝑘such that S𝐜𝑘can be covered, and 𝜖/{2sup𝒙 X 𝒙 𝜁1(𝐜𝑘)}-balls centering at 𝒂1,..., 𝒂𝑀2 such that A can be covered. Using the triangle inequality and the mean value theorem, we deduce that, for any (𝑏, 𝒂) S𝐜𝑘 A, there exist 𝑗 {1, , 𝑀1} and ℓ {1, , 𝑀2}, such that 𝑏 𝚜𝑘(𝒂 𝒙) 𝑏 𝑗𝚜𝑘(𝒂 ℓ𝒙) 𝑏 𝑏𝑗 𝚜𝑘(𝒂 𝒙) + 𝜁1(𝐜𝑘) 𝒙 𝒂 𝒂ℓ 𝑏 𝑏𝑗 𝜁0(𝐜𝑘) + 𝜁1(𝐜𝑘) 𝒙 𝒂 𝒂ℓ 𝜖. Hence, {[𝑏 𝑗𝚜𝑘(𝒂 ℓ𝒙) 𝜖,𝑏 𝑗𝚜𝑘(𝒂 ℓ𝒙) +𝜖]} 𝑗,ℓform a set of 𝜖-brackets that cover the space H𝑘. Because 𝑗ranges from 1 to 𝑀1, and ℓranges from 1 to 𝑀2, we have a total of 𝑀1 𝑀2 brackets. Therefore, 𝑁[] (𝜖,H𝑘, ) 𝑀1 𝑀2 =𝑁(𝜖/{2𝜁0(𝐜𝑘)},S𝐜𝑘, ) 𝑁(𝜖/{2 sup 𝒙 X 𝒙 𝜁1(𝐜𝑘)},A, ). Since S𝐜𝑘is a compact set in R𝐜𝑘and A is a compact set in R𝑑, 𝑁(𝜖,S𝐜𝑘, ) 𝐶𝜖 𝐶𝐜𝑘and 𝑁(𝜖,A, ) 𝐶𝜖 𝐶𝑑. Hence log𝑁[](𝜖,H𝑘, ) =𝑂[ 𝑑log{𝜖/𝜁1(𝐜𝑘)} 𝐜𝑘log{𝜖/𝜁0(𝐜𝑘)}] =𝑂[ 𝐜𝑘log{𝜖/𝜁0(𝐜𝑘)}] . Lemma F.6. For any 𝑘 {1,2,...,𝐟}, suppose (16) holds. Under Assumption F.3, for 𝑗= 0,1,2, we have sup 𝒂 A ˆΩ( 𝑗) 𝐜𝑘(𝒂) Ω( 𝑗) 𝐜𝑘,𝑝(𝒂) = 𝑂𝑝(𝜁𝑗(𝐜𝑘) Similarly, we have sup 𝒂 A ˆΣ𝐜𝑘(𝒂) Σ𝐜𝑘(𝒂) = 𝑂𝑝 𝜉𝑛,𝑘 1𝜁0(𝐜𝑘) + 𝜁0(𝐜𝑘) 𝐜𝑘 𝑛𝑝 𝑛𝑞 Furthermore, we have 𝜂min{ ˆΩ( 𝑗) 𝐜𝑘(𝒂)} 𝜂min{Ω( 𝑗) 𝐜𝑘(𝒂)} 𝑂𝑝(𝜁𝑗(𝐜𝑘) 𝜂min{ ˆΣ𝐜𝑘(𝒂)} 𝜂min{Σ𝐜𝑘(𝒂)} 𝑂𝑝 𝜉𝑛,𝑘 1𝜁0(𝐜𝑘) + 𝜁0(𝐜𝑘) 𝐜𝑘 𝑛𝑝 𝑛𝑞 𝜂min{ ˆΣ𝐜𝑘,𝜆(𝒂)} 𝜂min{Σ𝐜𝑘(𝒂)} 𝑂𝑝 𝜉𝑛,𝑘 1𝜁0(𝐜𝑘) + 𝜁0(𝐜𝑘) 𝐜𝑘 𝑛𝑝 𝑛𝑞 𝜂max{ ˆΩ( 𝑗) 𝐜𝑘(𝒂)} 𝜂max{Ω( 𝑗) 𝐜𝑘(𝒂)} +𝑂𝑝(𝜁𝑗(𝐜𝑘) 𝜂max{ ˆΣ𝐜𝑘(𝒂)} 𝜂max{Σ𝐜𝑘(𝒂)} +𝑂𝑝 𝜉𝑛,𝑘 1𝜁0(𝐜𝑘) + 𝜁0(𝐜𝑘) 𝐜𝑘 𝑛𝑝 𝑛𝑞 𝜂max{ ˆΣ𝐜𝑘,𝜆(𝒂)} 𝜂max{Σ𝐜𝑘(𝒂)} +𝑂𝑝 𝜉𝑛,𝑘 1𝜁0(𝐜𝑘) + 𝜁0(𝐜𝑘) 𝐜𝑘 𝑛𝑝 𝑛𝑞 uniformly in 𝒂 A. Projection Pursuit Density Ratio Estimation Proof. Under Assumption F.3, we have the eigenvalues of Ω( 𝑗) 𝐜𝑘are bounded for 𝑗= 0,1,2 uniformly in 𝒂 A. We can then derive that sup 𝒂 A E h ˆΩ( 𝑗) 𝐜𝑘(𝒂) Ω( 𝑗) 𝐜𝑘(𝒂) 2i = sup 𝒂 A 𝑖=1 𝜙( 𝑗) 𝑗 (𝒂 𝒙𝑖)𝜙( 𝑗) ℓ (𝒂 𝒙𝑖) E[𝜙( 𝑗) 𝑗 (𝒂 𝒙)𝜙( 𝑗) ℓ (𝒂 𝒙)] 1 𝑛E[{𝜙( 𝑗) 𝑗 (𝒂 𝒙)}2{𝜙( 𝑗) ℓ (𝒂 𝒙)}2] 𝑗=1 {𝜙( 𝑗) 𝑗 (𝒂 𝒙)}2 𝐜𝑘 ℓ=1 {𝜙( 𝑗) ℓ (𝒂 𝒙)}2 1 𝑛E h 𝚜( 𝑗) 𝑘(𝒂 𝒙) 𝚜( 𝑗) 𝑘(𝒂 𝒙)𝚜( 𝑗) 𝑘(𝒂 𝒙) 𝚜( 𝑗) 𝑘(𝒂 𝒙) i 𝑛 sup 𝒂 A tr E h 𝚜( 𝑗) 𝑘(𝒂 𝒙)𝚜( 𝑗) 𝑘(𝒂 𝒙) i = 𝑂 Thus, sup 𝒂 A ˆΩ( 𝑗) 𝐜𝑘(𝒂) Ω( 𝑗) 𝐜𝑘(𝒂) = 𝑂𝑝(𝜁𝑗(𝐜𝑘) It then follows from the definition of the maximum and minimum eigenvalues that 𝜂min{ ˆΩ( 𝑗) 𝐜𝑘(𝒂)} = min 𝜷 𝜷 =1{𝜷 Ω( 𝑗) 𝐜𝑘(𝒂)𝜷+ 𝜷 [ ˆΩ( 𝑗) 𝐜𝑘(𝒂) Ω( 𝑗) 𝐜𝑘(𝒂)]𝜷} 𝜂min{Ω( 𝑗) 𝐜𝑘(𝒂)} 𝑂𝑝(𝜁𝑗(𝐜𝑘) and 𝜂max{ ˆΩ( 𝑗) 𝐜𝑘(𝒂)} 𝜂max{Ω( 𝑗) 𝐜𝑘(𝒂)} +𝑂𝑝(𝜁𝑗(𝐜𝑘) uniformly in 𝒂 A. For ˆΣ𝐜𝑘(𝒂), we first define Σ𝐜𝑘(𝒂) := 1 𝑖=1 𝑟𝑘 1(𝒙𝑞 𝑖)𝚜𝑘(𝒂 𝒙𝑞 𝑖)𝚜𝑘(𝒂 𝒙𝑞 𝑖) . 
Then, we have ˆΣ𝐜𝑘(𝒂) Σ𝐜𝑘(𝒂) =ˆΣ𝐜𝑘(𝒂) Σ𝐜𝑘(𝒂) + Σ𝐜𝑘(𝒂) Σ𝐜𝑘(𝒂) 𝑖=1 {ˆ𝑟𝑘 1(𝒙𝑞 𝑖) 𝑟𝑘 1(𝒙𝑞 𝑖)}𝚜𝑘(𝒂 𝒙𝑞 𝑖)𝚜𝑘(𝒂 𝒙𝑞 𝑖) (41) 𝑖=1 𝑟𝑘 1(𝒙𝑞 𝑖)𝚜𝑘(𝒂 𝒙𝑞 𝑖)𝚜𝑘(𝒂 𝒙𝑞 𝑖) E𝑞[𝑟𝑘 1(𝒙)𝚜𝑘(𝒂 𝒙)𝚜𝑘(𝒂 𝒙) ] . (42) For (41), using (16) and Cauchy-Schwarz inequality, we can derive sup 𝒂 A (41) 𝑖=1 {ˆ𝑟𝑘 1(𝒙𝑞 𝑖) 𝑟𝑘 1(𝒙𝑞 𝑖)}2 sup 𝒂 A 𝜂max( ˆΩ𝐜𝑘(𝒂)) sup 𝒂 A sup 𝒙 X 𝚜𝑘(𝒂 𝒙) 𝜉𝑛,𝑘 1 𝜁0(𝐜𝑘) . For (42), using the same arguments for ˆΩ𝐜𝑘(𝒂), we have sup 𝒂 A (42) = 𝑂𝑝(𝜁0(𝐜𝑘) Projection Pursuit Density Ratio Estimation The result then follows. The results for ˆΣ𝐜𝑘,𝜆(𝒂) follows from the fact that ˆΣ𝐜𝑘,𝜆(𝒂) = ˆΣ𝐜𝑘(𝒂) +𝜆𝑰𝐜𝑘and the rate of 𝜆in Assumption F.3. Lemma F.7. For any 𝑘 {1,2,...,𝐟}, suppose Assumptions F.2 F.4 and (16) hold. Then, sup 𝒂 A 𝜷 𝑘(𝒂) 𝜷𝑘,0(𝒂) = 𝑂(𝐜 𝑠 𝑘) . Proof. Note that 𝐻 (𝜷, 𝒂) is convex in 𝜷under Assumption F.3, and has a unique global minimizer 𝜷 𝑘(𝒂). Given a constant 𝑑> 𝐶 𝐜 𝑠 𝑘 for some constant 𝐶> 0 (to be chosen later), for any 𝜷satisfying 𝜷 𝜷𝑘,0(𝒂) = 𝑑, we have 1 𝐶𝐜 𝑠 𝑘 𝑑 𝐻 (𝜷𝑘,0(𝒂), 𝒂) + 𝐶𝐜 𝑠 𝑘 𝑑 𝐻 (𝜷, 𝒂) 𝐻 𝜷𝑘,0(𝒂) 𝐶𝐜 𝑠 𝑘 𝑑 {𝜷𝑘,0(𝒂) 𝜷}, 𝒂 . Then, we have 𝐻 (𝜷, 𝒂) 𝐻 (𝜷𝑘,0(𝒂), 𝒂) 𝐻 𝜷𝑘,0(𝒂) 𝐶𝐜 𝑠 𝑘 𝑑 {𝜷𝑘,0(𝒂) 𝜷}, 𝒂 𝐻 (𝜷𝑘,0(𝒂), 𝒂) =𝐻 𝑓𝒂,𝑘 𝐶𝐜 𝑠 𝑘 𝑑 {𝜷𝑘,0(𝒂) 𝜷} 𝚜𝑘, 𝒂 𝐻( 𝑓𝒂,𝑘, 𝒂) 𝜉 𝐶𝐜 𝑠 𝑘 𝑑 {𝜷𝑘,0(𝒂) 𝜷} +𝜉(0) , where 𝜉(𝜜) := 𝐻 𝑓𝒂,𝑘 𝜜 𝚜𝑘, 𝒂 𝐻 (𝜷𝑘,0(𝒂) 𝜜, 𝒂) for any 𝜜 R𝐜𝑘. Note that 𝑓𝒂,𝑘is the minimizer of 𝐻( 𝑓, 𝒂). Under Assumption F.3, 𝐻( 𝑓𝒂,𝑘 𝜜 𝚜𝑘, 𝒂) is globally convex in 𝜜and attains the minimum at 𝜜= 0. Applying the Taylor s expansion of 𝐻( 𝑓𝒂,𝑘 𝜜 𝚜𝑘, 𝒂) around 𝜜= 0, we have inf {𝜷: 𝜷 𝜷𝑘,0(𝒂) =𝑑} 𝐻 𝑓𝒂,𝑘 𝐶𝐜 𝑠 𝑘 𝑑 {𝜷𝑘,0(𝒂) 𝜷} 𝚜𝑘, 𝒂 𝐻( 𝑓𝒂,𝑘, 𝒂) = inf {𝜷: 𝜷 𝜷𝑘,0(𝒂) =𝑑} 𝐶2𝐜 2𝑠 𝑘 𝑑2 {𝜷𝑘,0(𝒂) 𝜷} E[𝑟2 𝑘 1(𝒙)𝚜𝑘(𝒂 𝒙)𝚜𝑘(𝒂 𝒙) ]{𝜷𝑘,0(𝒂) 𝜷} 𝜂1𝐶2𝐜 2𝑠 𝑘 , where 𝜂1 > 0 is the minimum eigenvalue of E[𝑟2 𝑘 1(𝒙)𝚜𝑘(𝒂 𝒙)𝚜𝑘(𝒂 𝒙) ] under Assumptions F.2 and F.3. By the definition of 𝜉(𝜜), we can derive that, for 𝜷 𝜷𝑘,0(𝒂) = 𝑑, 𝜉 𝐶𝐜 𝑠 𝑘 𝑑 {𝜷𝑘,0(𝒂) 𝜷} 𝜉(0) = 2𝐶𝐜 𝑠 𝑘 𝑑 {𝜷𝑘,0(𝒂) 𝜷} E𝑞 𝑟2 𝑘 1(𝑥)𝚜𝑘(𝒂 𝒙){ 𝑓𝒂,𝑘(𝒂 𝒙) 𝜷𝑘,0(𝒂) 𝚜𝑘(𝒂 𝒙)} 2𝜂2𝐶𝐶0𝐜 2𝑠 𝑘 , uniformly in 𝒂 A and 𝒙 X, where 𝜂2 is the maximum eigenvalue of E[𝑟2 𝑘 1(𝒙)𝚜𝑘(𝒂 𝒙)𝚜𝑘(𝒂 𝒙) ] under Assumptions F.2 and F.3. By choosing 𝐶 2𝜂2𝐶0/𝜂1, we have 𝐻 (𝜷, 𝒂) 𝐻 (𝜷𝑘,0(𝒂), 𝒂) 0 for all 𝜷 𝜷𝑘,0(𝒂) > 𝐶 𝐜 𝑠 𝑘 uniformly in 𝒂 A and 𝒙 X, which implies that 𝐻 (𝜷, 𝒂) has a local minimum for 𝜷 𝜷𝑘,0(𝒂) 𝐶 𝐜 𝑠 𝑘. Since 𝜷 𝑘(𝒂) is the unique global minimizer of 𝐻 (𝜷, 𝒂), we have sup 𝒂 A 𝜷 𝑘(𝒂) 𝜷𝑘,0(𝒂) 𝐶 𝐜 𝑠 𝑘. (43) Lemma F.8. For any 𝑘 {1,2,...,𝐟}, suppose Assumptions F.2 F.4 and (16) hold. For any 𝒂 A, we have ˆ𝜷𝑘(𝒂) 𝜷 𝑘(𝒂) = 𝑂𝑝 𝐜 (𝑠 1) 𝑘 + 𝜉𝑛,𝑘 1 + 𝐜𝑘 𝑛𝑞 𝑛𝑝 Projection Pursuit Density Ratio Estimation Proof. Recall that 𝜷 𝑘(𝒂) = {Σ𝐜𝑘(𝒂)} 1E𝑝[𝑟𝑘 1(𝒙)𝚜𝑘(𝒂 𝒙)] and ˆ𝜷𝑘(𝒂) = { ˆΣ𝐜𝑘,𝜆(𝒂)} 1 1 𝑖=1 ˆ𝑟𝑘 1(𝒙𝑝 𝑖)𝚜𝑘(𝒂 𝒙𝑝 𝑖) . We decompose ˆ𝜷𝑘(𝒂) 𝜷 (𝒂) (44) { ˆΣ𝐜𝑘,𝜆(𝒂)} 1 1 𝑖=1 ˆ𝑟𝑘 1(𝒙𝑝 𝑖)𝚜𝑘(𝒂 𝒙𝑝 𝑖) 𝜷 𝑘(𝒂) { ˆΣ𝐜𝑘,𝜆(𝒂)} 1 1 𝑖=1 {ˆ𝑟𝑘 1(𝒙𝑝 𝑖) 𝑟𝑘 1(𝒙𝑝 𝑖)}𝚜𝑘(𝒂 𝒙𝑝 𝑖) { ˆΣ𝐜𝑘,𝜆(𝒂)} 1 1 𝑖=1 {ˆ𝑟2 𝑘 1(𝒙𝑞 𝑖) 𝑟2 𝑘 1(𝒙𝑞 𝑖)}𝚜𝑘(𝒂 𝒙𝑞 𝑖) 𝑓 𝒂(𝒂 𝒙𝑞 𝑖) { ˆΣ𝐜𝑘,𝜆(𝒂)} 1 1 𝑖=1 𝑟𝑘 1(𝒙𝑝 𝑖)𝚜𝑘(𝒂 𝒙𝑝 𝑖) { ˆΣ𝐜𝑘,𝜆(𝒂)} 1 1 𝑖=1 𝑟2 𝑘 1(𝒙𝑞 𝑖)𝚜𝑘(𝒂 𝒙𝑞 𝑖)𝚜𝑘(𝒂 𝒙𝑞 𝑖) 𝜷 𝑘(𝒂) +𝜆 𝜷 𝑘(𝒂) . (48) For (45). Defining ˆ𝒓= {ˆ𝑟𝑘 1(𝒙𝑝 1 ),..., ˆ𝑟𝑘 1(𝒙𝑝 𝑛𝑝)} R𝑛𝑝and 𝒓= {𝑟𝑘 1(𝒙𝑝 1 ),...,𝑟𝑘 1(𝒙𝑝 𝑛𝑝)} R𝑛𝑝, by Lemma F.6, we have 𝜂max({ ˆΣ𝐜𝑘,𝜆(𝒂)} 1) = 𝑂(1). Thus, (45) 2 =𝑛 2 𝑝tr { ˆΣ𝐜𝑘,𝜆(𝒂)} 2𝑷(𝒂) (ˆ𝒓 𝒓)(ˆ𝒓 𝒓) 𝑷(𝒂) 𝑂𝑝(𝑛 2 𝑝)tr 𝑷(𝒂) (ˆ𝒓 𝒓)(ˆ𝒓 𝒓) 𝑷(𝒂) =𝑂𝑝(𝑛 1 𝑝)tr 𝑷(𝒂) (ˆ𝒓 𝒓)(ˆ𝒓 𝒓) 𝑷(𝒂)(𝑷(𝒂) 𝑷(𝒂)) 1 =𝑂𝑝(𝑛 1 𝑝)tr (ˆ𝒓 𝒓)(ˆ𝒓 𝒓) 𝑷(𝒂)(𝑷(𝒂) 𝑷(𝒂)) 1𝑷(𝒂) 𝑂𝑝(𝑛 1 𝑝) ˆ𝒓 𝒓 2 = 𝑂𝑃 𝐜 2(𝑠 1) 𝑘 +𝜉𝑛,𝑘 1 + 𝐜𝑘/(𝑛𝑝 𝑛𝑞) , where the last inequality follows from the fact that 𝑷(𝒂)(𝑷(𝒂) 𝑷(𝒂)) 1𝑷(𝒂) is a projection matrix with maximum eigenvalue 1, and last equality follows from (16). 
Using exactly the same arguments and the fact that sup𝒂,𝒙| 𝑓 𝒂(𝒂 𝒙)| = 𝑂(1), we have (46) = 𝑂𝑝((45)). For (47), we have { ˆΣ𝐜𝑘,𝜆(𝒂)} 1 ( 1 𝑛𝑝 𝑖=1 𝑟𝑘 1(𝒙𝑝 𝑖)𝚜𝑘(𝒂 𝒙𝑝 𝑖) 1 𝑖=1 𝑟2 𝑘 1(𝒙𝑞 𝑖)𝚜𝑘(𝒂 𝒙𝑞 𝑖)𝚜𝑘(𝒂 𝒙𝑞 𝑖) 𝜷 𝑘(𝒂) = { ˆΣ𝐜𝑘,𝜆(𝒂)} 1 ( 1 𝑛𝑝 𝑖=1 𝑟𝑘 1(𝒙𝑝 𝑖)𝚜𝑘(𝒂 𝒙𝑝 𝑖) E𝑝{𝑟𝑘 1(𝒙)𝚜𝑘(𝒂 𝒙)} + { ˆΣ𝐜𝑘,𝜆(𝒂)} 1 ( E𝑞{𝑟2 𝑘 1(𝒙)𝚜𝑘(𝒂 𝒙)𝚜𝑘(𝒂 𝒙) 𝜷 𝑘(𝒂)} 1 𝑖=1 𝑟2 𝑘 1(𝒙𝑞 𝑖)𝚜𝑘(𝒂 𝒙𝑞 𝑖)𝚜𝑘(𝒂 𝒙𝑞 𝑖) 𝜷 𝑘(𝒂) + { ˆΣ𝐜𝑘,𝜆(𝒂)} 1 E𝑝{𝑟𝑘 1(𝒙)𝚜𝑘(𝒂 𝒙)} E𝑞{𝑟2 𝑘 1(𝒙)𝚜𝑘(𝒂 𝒙)𝚜𝑘(𝒂 𝒙) 𝜷 𝑘(𝒂)} . (51) Projection Pursuit Density Ratio Estimation 𝑖=1 𝑟𝑘 1(𝒙𝑝 𝑖)𝚜𝑘(𝒂 𝒙𝑝 𝑖) E𝑝{𝑟𝑘 1(𝒙)𝚜𝑘(𝒂 𝒙) 𝑛𝑝 E𝑝 h 𝑟𝑘 1(𝒙)𝚜𝑘(𝒂 𝒙) E𝑝{𝑟𝑘 1(𝒙)𝚜𝑘(𝒂 𝒙) 2i 𝑛𝑝 E𝑝 h 𝑟𝑘 1(𝒙)𝚜𝑘(𝒂 𝒙) 2i 𝑛𝑝 tr E𝑝[𝑟2 𝑘 1(𝒙)𝚜𝑘(𝒂 )𝚜𝑘(𝒂 ) ] = 𝑂 𝐜𝑘 where the last equality follows from Assumptions F.2 and F.3. Thus, by Chebyshev s inequality, we have (49) = 𝑂𝑝( 𝐜𝑘 𝑛𝑝 ) . Similarly, we have (50) = 𝑂𝑝( 𝐜𝑘 𝑛𝑞 ) . For (51), noting that 𝜷 (𝒂) is the globally unique minimizer of 𝐻 (𝒂, 𝜷) for any 𝒂 A, by the first order condition, we have 0 = 1𝐻 (𝜷 𝑘(𝒂), 𝒂) = 2 E𝑞{𝑟2 𝑘 1(𝒙)𝚜𝑘(𝒂 𝒙)𝚜𝑘(𝒂 𝒙) 𝜷 𝑘(𝒂)} E𝑝{𝑟𝑘 1(𝒙)𝚜𝑘(𝒂 𝒙)} . Thus, we have (51) = 0. Consequently, we have (47) = 𝑂𝑝( 𝐜𝑘 𝑛𝑝 𝑛𝑞 ) . Finally, under Assumption F.3, we have (48) = 𝑂𝑝( 𝐜𝑘/𝑛𝑞). Thus, we obtain the result. Lemma F.9. For any 𝑘 {1,2,...,𝐟}, suppose Assumptions F.1 F.4 and (16) hold. For any 𝒂 A and 𝑗= 0,1, sup 𝒂 A sup 𝒙 X 𝑓( 𝑗) 𝒂,𝑘(𝒂 𝒙) 𝑓 ( 𝑗) 𝒂,𝑘(𝒂 𝒙) = 𝑂 𝐜 (𝑠 𝑗) 𝑘 𝜁𝑗(𝐜𝑘) , E h | 𝑓( 𝑗) 𝒂,𝑘(𝒂 𝒙) 𝑓 ( 𝑗) 𝒂,𝑘(𝒂 𝒙)|2i = 𝑂 𝐜 2(𝑠 𝑗) 𝑘 𝜁𝑗(𝐜𝑘) , 𝑖=1 | 𝑓( 𝑗) 𝒂,𝑘(𝒂 𝒙𝑖) 𝑓 ( 𝑗) 𝒂,𝑘(𝒂 𝒙𝑖)|2 = 𝑂𝑃 𝐜 2(𝑠 𝑗) 𝑘 𝜁𝑗(𝐜𝑘) . Proof. Note that by Lorentz (1986, Theorem 8) and Assumption F.2, for 𝑗< 𝑠1, there exists a 𝜷𝑘, 𝑗(𝒂) R𝐜𝑘such that sup 𝒂 A sup 𝑧 Z 𝑓( 𝑗) 𝒂,𝑘(𝑧) {𝜷𝑘, 𝑗(𝒂)} 𝚜( 𝑗) 𝑘(𝑧) < 𝐶0, 𝑗𝐜 (𝑠 𝑗) 𝑘 , (52) for some constant 𝐶0, 𝑗> 0. Then, we have sup 𝒂 A sup 𝑧 Z 𝑓 ( 𝑗) 𝒂,𝑘(𝑧) 𝑓( 𝑗) 𝒂,𝑘(𝑧) = sup 𝒂 A sup 𝑧 Z 𝜷 𝑘(𝑎) 𝚜( 𝑗) 𝑘(𝑧) 𝑓( 𝑗) 𝒂 (𝑧) sup 𝒂 A sup 𝑧 Z (𝜷 𝑘(𝒂) 𝜷𝑘, 𝑗(𝒂)) 𝚜( 𝑗) 𝑘(𝑧) + sup 𝒂 A sup 𝑧 Z 𝜷𝑘, 𝑗(𝒂) 𝚜( 𝑗) 𝑘(𝑧) 𝑓( 𝑗) 𝒂,𝑘(𝑧) sup 𝒂 A 𝜷 𝑘(𝒂) 𝜷𝑘, 𝑗(𝒂) 𝜁𝑗(𝐜𝑘) +𝑂(𝐜 (𝑠 𝑗) 𝑘 ) . (53) Projection Pursuit Density Ratio Estimation For 𝑗= 0, we have (53) = 𝑂(𝐜 𝑠 𝑘𝜁0(𝐜𝑘)) from Lemma F.7. For 𝑗= 1, since Z is a compact set, using the fundamental theorem of calculus, (52) implies sup 𝒂 A sup 𝑧 Z 𝑓𝒂,𝑘(𝑧) 𝜷𝑘, 𝑗(𝒂) 𝚜𝑘(𝑧) = 𝑂 𝐜 (𝑠 𝑗) 𝑘 . (54) Then, using a similar argument in Lemma F.7, we have the following result 𝜷 𝑘(𝒂) 𝜷𝑘, 𝑗(𝒂) = 𝑂 𝐜 (𝑠 𝑗) 𝑘 . (55) Then we have (53) = 𝑂(𝐜 2(𝑠 𝑗) 𝑘 𝜁𝑗(𝐜𝑘) 𝜁𝑗(𝐜𝑘)) . We next prove E[| 𝑓( 𝑗) 𝒂,𝑘(𝒂 𝒙) 𝑓 ( 𝑗) 𝒂,𝑘(𝒂 𝒙)|2] = 𝑂(𝐜 2(𝑠 𝑗) 𝑘 𝜁𝑗(𝐜𝑘)). E (𝜷𝑘, 𝑗(𝒂) 𝜷 𝑘(𝒂)) 𝚜( 𝑗) 𝑘(𝒂 𝒙) 2 = (𝜷𝑘, 𝑗(𝒂) 𝜷 𝑘(𝒂)) E h 𝚜( 𝑗) 𝑘(𝒂 𝒙)𝚜( 𝑗) 𝑘(𝒂 𝒙) i (𝜷𝑘, 𝑗(𝒂) 𝜷 𝑘(𝒂)) (𝜷𝑘, 𝑗(𝒂) 𝜷 𝑘(𝒂)) 2 𝑂( 𝜁𝑗(𝐜𝑘)) = 𝑂 𝜷𝑘, 𝑗(𝒂) 𝜷 𝑘(𝒂) 2 𝜁𝑗(𝐜𝑘) = 𝑂 𝐜 2(𝑠 𝑗) 𝑘 𝜁𝑗(𝐜𝑘) . (56) Thus, using similar decomposition as (53), we obtain E[| 𝑓( 𝑗) 𝒂,𝑘(𝒂 𝒙) 𝑓 ( 𝑗) 𝒂,𝑘(𝒂 𝒙)|2] 2E 𝑓( 𝑗) 𝒂,𝑘(𝒂 𝒙) 𝜷𝑘, 𝑗(𝒂) 𝚜( 𝑗) 𝑘(𝒂 𝒙) 2 +E (𝜷𝑘, 𝑗(𝒂) 𝜷 𝑘(𝒂)) 𝚜( 𝑗) 𝑘(𝒂 𝒙) 2 = 𝑂 𝐜 2(𝑠 𝑗) 𝑘 𝜁𝑗(𝐜𝑘) +𝑂 𝐜 2(𝑠 𝑗) 𝑘 𝜁𝑗(𝐜𝑘) = 𝑂 𝐜 2(𝑠 𝑗) 𝑘 𝜁𝑗(𝐜𝑘) . (57) We finally prove 𝑛 1 Í𝑛 𝑖=1 | 𝑓( 𝑗) 𝒂,𝑘(𝒂 𝒙𝑖) 𝑓 ( 𝑗) 𝒂,𝑘(𝒂 𝒙𝑖)|2 = 𝑂𝑃 𝐜 2(𝑠 𝑗) 𝑘 𝜁𝑗(𝐜𝑘) . Note that, for large enough 𝐜𝑘, E h | 𝑓( 𝑗) 𝒂,𝑘(𝒂 𝒙) 𝑓 ( 𝑗) 𝒂,𝑘(𝒂 𝒙)|2i 𝐶 𝐜 2(𝑠 𝑗) 𝑘 𝜁𝑗(𝐜𝑘), for some constant 𝐶. Choosing 𝐶(𝜖) = 2𝐶/𝜖and using the Markov s inequality, we have 𝑖=1 | 𝑓( 𝑗) 𝒂,𝑘(𝒂 𝒙𝑖) 𝑓 ( 𝑗) 𝒂,𝑘(𝒂 𝒙𝑖)|2 > 𝐶 𝐜 2(𝑠 𝑗) 𝑘 𝜁𝑗(𝐜𝑘) E h | 𝑓( 𝑗) 𝒂,𝑘(𝒂 𝒙) 𝑓 ( 𝑗) 𝒂,𝑘(𝒂 𝒙)|2i 𝐶(𝜖) 𝐜 2(𝑠 𝑗) 𝑘 𝜁𝑗(𝐜𝑘) 𝜖 which implies the third result. Lemma F.10. For any 𝑘 {1,2,...,𝐟}, suppose Assumptions F.1 F.4 and (16) hold. 
For any 𝒂 A and 𝑗= 0,1,2, ˆ𝑓( 𝑗) 𝒂,𝑘(𝒂 𝒙) 𝑓 ( 𝑗) 𝒂,𝑘(𝒂 𝒙) = 𝑂𝑝 {𝐜 (𝑠 1) 𝑘 + 𝜉𝑛,𝑘 1}𝜁𝑗(𝐜𝑘) + 𝜁𝑗(𝐜𝑘) 𝐜𝑘 𝑛𝑞 𝑛𝑝 E h | ˆ𝑓( 𝑗) 𝒂,𝑘(𝒂 𝒙) 𝑓 ( 𝑗) 𝒂,𝑘(𝒂 𝒙)|2i = 𝑂𝑝 {𝐜 2(𝑠 1) 𝑘 +𝜉𝑛,𝑘 1} 𝜁𝑗(𝐜𝑘) + 𝜁𝑗(𝐜𝑘)𝐜𝑘 𝑖=1 | ˆ𝑓( 𝑗) 𝒂,𝑘(𝒂 𝒙𝑖) 𝑓 ( 𝑗) 𝒂,𝑘(𝒂 𝒙𝑖)|2 = 𝑂𝑝 {𝐜 2(𝑠 1) 𝑘 +𝜉𝑛,𝑘 1} 𝜁𝑗(𝐜𝑘) + 𝜁𝑗(𝐜𝑘)𝐜𝑘 Projection Pursuit Density Ratio Estimation Proof. Under Assumption F.3 and by Lemma F.8, we have sup 𝒙 X | ˆ𝑓( 𝑗) 𝒂,𝑘(𝒂 𝒙) 𝑓 ( 𝑗) 𝒂,𝑘(𝒂 𝒙)| { ˆ𝜷𝑘(𝒂) 𝜷 𝑘(𝒂)} 𝚜( 𝑗) 𝑘(𝒂 𝒙) ˆ𝜷𝑘(𝒂) 𝜷 𝑘(𝒂) sup 𝑧 Z 𝚜( 𝑗) 𝑘(𝑡,𝑧) 𝜁𝑗(𝐜𝑘){𝐜 (𝑠 1) 𝑘 + 𝜉𝑛,𝑘 1} + 𝜁𝑗(𝐜𝑘) Similar to (56), by Lemma F.8, we deduce that X | ˆ𝑓( 𝑗) 𝒂,𝑘(𝒂 𝒙) 𝑓 ( 𝑗) 𝒂,𝑘(𝒂 𝒙)|2𝑑𝐹𝑋(𝒙) ˆ𝜷𝑘(𝒂) 𝚜( 𝑗) 𝑘(𝒂 𝒙) 𝜷 𝑘(𝒂) 𝚜𝑘(𝒙 𝒂) 2 𝑑𝐹𝑋(𝒙) = ( ˆ𝜷𝑘(𝒂) 𝜷 𝑘(𝒂)) E h 𝚜( 𝑗) 𝑘(𝒂 𝒙)𝚜( 𝑗) 𝑘(𝒂 𝒙) i ( 𝜷𝑘(𝒂) 𝜷 ( 𝑗) 𝑘 (𝒂)) = 𝑂𝑃 ˆ𝜷𝑘(𝒂) 𝜷 𝑘(𝒂) 2 𝜁𝑗(𝐜𝑘) = 𝑂𝑃 {𝐜 2(𝑠 1) 𝑘 +𝜉𝑛,𝑘 1} 𝜁𝑗(𝐜𝑘) + 𝜁𝑗(𝐜𝑘)𝐜𝑘 Similar to the end of the proof of Lemma F.9, using Markov s inequality, we also have ˆ𝜷𝑘(𝒂) 𝚜( 𝑗) 𝑘(𝒂 𝒙𝑖) 𝜷 𝑘(𝒂) 𝚜( 𝑗) 𝑘(𝒙 𝑖𝒂) 2 ˆ𝜷𝑘(𝒂) 𝚜( 𝑗) 𝑘(𝒂 𝒙) 𝜷 𝑘(𝒂) 𝚜( 𝑗) 𝑘(𝒙 𝒂) 2 𝑑𝐹𝑋(𝒙) {𝐜 2(𝑠 1) 𝑘 +𝜉𝑛,𝑘 1} 𝜁𝑗(𝐜𝑘) + 𝜁𝑗(𝐜𝑘)𝐜𝑘 Lemma F.11. For any 𝑘 {1,2,...,𝐟}, suppose Assumptions F.1 F.4 and (16) hold. We have sup 𝒂 A ˆ𝜷𝑘(𝒂) = 𝑂𝑝(1) and 𝒂ˆ𝜷𝑘(𝒂) = 𝑂𝑝(1) , for any 𝒂 A . Proof. First, from (8), we have X ˆ𝑓2 𝒂,𝑘(𝒂 𝒙) 𝑓2 𝒂,𝑘(𝒂 𝒙)𝑝(𝒙) 𝑑𝒙 = 𝑂𝑝 𝐜 (𝑠 1) 𝑘 𝜁0(𝐜𝑘) + 𝜁0(𝐜𝑘)2 𝐜𝑘/ 𝑛𝑞 𝑛𝑝 = 𝑜𝑝(1) . Then, by Lemma F.6, sup 𝒂 A ˆ𝜷𝑘(𝒂) 2 = sup 𝒂 A ˆ𝜷𝑘(𝒂) ˆ𝜷𝑘(𝒂) sup 𝒂 A 𝜂min(Ω(0) 𝐜𝑘(𝒂)) 1 sup 𝒂 A ˆ𝜷𝑘(𝒂) Ω(0) 𝐜𝑘(𝒂) ˆ𝜷𝑘(𝒂) =𝑂(1) sup 𝒂 A X ˆ𝑓2 𝒂,𝑘(𝒂 𝒙)𝑑𝐹𝑋(𝒙) 𝑑𝒙 𝑂(1) sup 𝒂 A X 𝑓2 𝒂,𝑘(𝒂 𝒙)𝑑𝐹𝑋(𝒙) 𝑑𝒙+ 𝑜𝑃(1) Projection Pursuit Density Ratio Estimation From (36) and (37), using Lemma F.6, we have 𝒂ˆ𝜷𝑘(𝒂) Σ𝐜𝑘(𝒂) 1𝑀 = 𝑜𝑝(1) , 𝑀=2E𝑞 𝑟2 𝑘 1(𝒙) 𝑓(1) (𝒂 𝒙)𝚜𝑘(𝒂 𝒙)(𝒙) +2E𝑞 h 𝑟2 𝑘 1(𝒙) 𝑓(𝒂 𝒙)𝚜(1) 𝑘(𝒂 𝒙)(𝒙) i 2E𝑝 h 𝑟𝑘 1(𝒙)𝚜(1) 𝑘(𝒂 𝒙)(𝒙) i =:𝑀1 + 𝑀2 + 𝑀3 , where the definition of 𝑀1, 𝑀2 and 𝑀3 are clear. By triangle inequality, Σ𝐜𝑘(𝒂) 1𝑀 Σ𝐜𝑘(𝒂) 1𝑀1 + Σ𝐜𝑘(𝒂) 1𝑀2 + Σ𝐜𝑘(𝒂) 1𝑀3 . Consider Σ𝐜𝑘(𝒂) 1𝑀1 . Define 𝑉(𝒂 𝒙) = E𝑞 𝑟2 𝑘 1(𝒙) 𝑓(1) (𝒂 𝒙)𝚜𝑘(𝒂 𝒙)E𝑞(𝒙|𝒂 𝒙) Σ𝐜𝑘(𝒂) 1𝚜𝑘(𝒂 𝒙) , which is the 𝐿2 projection of 𝑟𝑘 1(𝒙) 𝑓(1) (𝒂 𝒙)E𝑞(𝒙|𝒂 𝒙) onto the space linearly spanned by 𝚜𝑘(𝒂 ). Then we have Σ𝐜𝑘(𝒂) 1𝑀1 2 =tr 𝑀 1 Σ𝐜𝑘(𝒂) 2𝑀1 𝜂max(Σ𝐜𝑘(𝒂) 1)tr 𝑀 1 Σ𝐜𝑘(𝒂) 1Σ𝐜𝑘(𝒂)Σ𝐜𝑘(𝒂) 1𝑀1 𝑂 E𝑞{𝑉(𝒂 𝒙)𝑉(𝒂 𝒙) } 𝑂 E𝑞{𝑟2 𝑘 1(𝒙){ 𝑓(1) (𝒂 𝒙)}2E𝑞(𝒙|𝒂 𝒙)E𝑞(𝒙|𝒂 𝒙) } Similarly, under Assumption F.3, we have Σ𝐜𝑘(𝒂)𝑀2 = 𝑂(1) and Σ𝐜𝑘(𝒂)𝑀3 = 𝑂(1). The result follows.