# Deep Nonparametric Quantile Regression under Covariate Shift

Journal of Machine Learning Research 25 (2024) 1-49. Submitted 6/24; Revised 12/24; Published 12/24.

Xingdong Feng, feng.xingdong@mail.shufe.edu.cn
School of Statistics and Data Science & Institute of Data Science and Statistics, Shanghai University of Finance and Economics, Shanghai, China

Xin He, he.xin17@mail.shufe.edu.cn
School of Statistics and Data Science, Shanghai University of Finance and Economics, Shanghai, China

Yuling Jiao, yulingjiaomath@whu.edu.cn
School of Artificial Intelligence, Hubei Key Laboratory of Computational Science, Wuhan University, Wuhan, China

Lican Kang, kanglican@whu.edu.cn
Institute for Math and AI, Wuhan University, Wuhan, China

Caixing Wang, wang.caixing@stu.sufe.edu.cn
School of Statistics and Data Science, Shanghai University of Finance and Economics, Shanghai, China

Editor: Xiaotong Shen

This work focuses on addressing the challenges posed by covariate shift in nonparametric quantile regression using deep neural networks. We propose a two-stage pre-training reweighted method that leverages importance weighting to mitigate the effects of distribution shift. In the first stage, density ratios are estimated with a neural network by minimizing a least-squares criterion. In the second stage, a deep neural network estimator is obtained using the pre-trained weights. Theoretical analysis is provided, offering non-asymptotic error bounds for the unweighted, reweighted, and pre-training reweighted estimators. We consider scenarios with both bounded and unbounded density ratios. Notably, we employ a novel proof technique to bound the generalization error, characterized by the size and weights bound of ReLU neural networks. This enables us to establish fast rates of convergence under the adaptive self-calibration condition, distinguishing our approach from those relying on local Rademacher complexity techniques.
Additionally, we derive the approximation error with weight bounds for ReLU neural networks approximating the Hölder class. Our theoretical findings provide valuable insights for the pre-training process and highlight the efficacy of reweighted techniques. Numerical experiments are conducted to further validate the theoretical findings and demonstrate the effectiveness of our proposed method.

Keywords: Deep neural networks, distribution shift, non-asymptotic error bounds, robust estimation

Xin He and Lican Kang are the corresponding authors. All authors contributed equally to this paper, and their names are listed in alphabetical order.

©2024 Xingdong Feng, Xin He, Yuling Jiao, Lican Kang and Caixing Wang. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/24-0906.html.

1 Introduction

Quantile regression (Koenker and Bassett Jr, 1978) occupies a pivotal position in statistical modeling. It differs substantially from traditional mean regression by offering a more comprehensive picture of the relationship between the response and the covariates. By estimating conditional quantiles of the response, quantile regression makes it possible to explore how different quantiles of the response react to variations in the covariate space, thereby providing a deeper understanding of the underlying data structure. In particular, quantile regression not only complements but also improves on conventional mean regression models when the data exhibit heteroscedasticity, outliers, or other non-normal features (Bondell et al., 2010; Wang et al., 2012).
Nonparametric quantile regression, a specific subset of quantile regression, proves invaluable when the relationships between covariates and response are too complex for a straightforward parametric characterization. Its primary objective is to obtain nonparametric estimates of the underlying regression function in the context of quantile regression. It is notably flexible, accommodating diverse data distributions without relying on predetermined functional forms. Research on nonparametric quantile regression has developed substantially over the past decades (White, 1992; Koenker et al., 1994; He and Shi, 1994; He and Ng, 1999; Takeuchi et al., 2006; Sangnier et al., 2016). This extensive literature covers diverse methodological approaches grounded in reproducing kernels, smoothing splines, and shallow neural networks.

Recent attention in statistical modeling has been drawn towards the application of deep neural networks, particularly within the framework of nonparametric estimation. Deep nonparametric quantile regression, a specialized area within this field, has also garnered significant interest (Padilla et al., 2022a; Madrid Padilla and Chatterjee, 2022; Shen et al., 2021). Yet the existing literature often lacks a comprehensive exploration of, and effective solutions for, the challenge posed by distribution shift in quantile regression. Distribution shift refers to situations where there is a substantial divergence between the distributions of the training and testing data sets. This phenomenon is pervasive across practical modeling scenarios.
Notably, distribution shift arises in a range of real-world domains, including, but not limited to, image analysis (Taori et al., 2020; Guan and Liu, 2021), natural language processing (Jiang and Zhai, 2007; Hassan et al., 2013), and recommender systems (Carroll et al., 2022; Tan et al., 2016). As illustrated in Daume III and Marcu (2006) and Torralba and Efros (2011), a model that performs well on the data it was trained on may falter when exposed to new data characterized by distribution shift. This discrepancy arises because models learn from the distribution of the training data, and their effectiveness is contingent on the persistence of the learned patterns in the testing data. Therefore, distribution shift stands as a formidable challenge in statistics and machine learning.

Distribution shift encompasses various types, such as covariate shift, concept shift, marginal shift, conditional shift, label shift, domain shift, and mode shift. Among them, covariate shift (Quinonero-Candela et al., 2008; Sugiyama and Kawanabe, 2012) is a specific and significant case that is prevalent in practical scenarios. Covariate shift arises when the distribution of the input covariates in the training data set deviates from that in the testing data set, while the conditional distribution of the response given the input covariates remains unaltered. In other words, the distribution of the covariates changes between training and testing, but the underlying relationship between the covariates and the response is invariant. Covariate shift is typically addressed with density-ratio reweighting techniques, often referred to as importance weighting.
This approach has garnered significant attention and is well studied in Shimodaira (2000); Huang et al. (2006); Sugiyama et al. (2007b); Bickel et al. (2009); Kanamori et al. (2009); Fang et al. (2020). Furthermore, some works provide related theoretical analysis, see Cortes et al. (2008, 2010); Xu et al. (2022), although the convergence rates given in Cortes et al. (2008, 2010) are suboptimal. Recently, the covariate shift problem has been studied in Ma et al. (2023) and Feng et al. (2023) for nonparametric regression in a reproducing kernel Hilbert space framework, which also provides some theoretical insights into the challenges induced by covariate shift. Nevertheless, these works rely on the overly stringent assumption that the density ratio is known, which is impractical since only observed data from the source and target distributions are available in real-world applications. Moreover, we want to emphasize that all the aforementioned methods focus only on the covariate shift problem in the context of mean regression. Despite its practical importance, the covariate shift problem in nonparametric quantile regression remains largely underexplored, particularly when the density ratio is unknown, which is challenging from both methodological and theoretical points of view.

In the literature, estimating the true density ratio is a key challenge in covariate shift adaptation. A straightforward approach is to estimate the source and target densities separately using kernel density estimation (Sugiyama and Müller, 2005; Baktashmotlagh et al., 2014), and then to calculate their ratio. Yet it is worth pointing out that such a procedure is inefficient and time-consuming, especially in the high-dimensional case. An alternative approach in the literature is to estimate the density ratio directly instead of estimating the densities individually.
These works mainly focus on minimizing some discrepancy measure between distributions, including kernel mean matching (Huang et al., 2006; Gretton et al., 2009), the Kullback-Leibler divergence (Sugiyama et al., 2007a,b, 2012) and the non-negative Bregman divergence (Kato and Teshima, 2021). To further improve computational efficiency, Kanamori et al. (2009); Sugiyama et al. (2012) formulate the direct importance estimation problem as a least-squares function fitting problem, which can be solved efficiently using a standard quadratic program. It performs comparably to the Kullback-Leibler importance estimation method (Sugiyama et al., 2007b). We want to point out that although there exist some methods on nonparametric density estimation, investigations on integrating the estimated density ratio into the problem of covariate shift are still lacking in the literature, even for mean regression. To the best of our knowledge, it is still unknown how the estimation accuracy of the density ratio affects the convergence rate of the final estimated quantile function when covariate shift occurs.

In this paper, we investigate deep nonparametric quantile regression under covariate shift. We design a novel two-stage pre-training procedure that incorporates a reweighting mechanism. In the first stage, we obtain a deep estimator for the density ratio, constructed with the least-squares method (Kanamori et al., 2009; Sugiyama et al., 2012). In the second stage, we use the density ratio estimates acquired in the first stage to derive a pre-training reweighted estimator for the underlying regression function using deep neural networks. To underscore the merits of our approach more comprehensively, we further introduce two supplementary estimators for comparative analysis, referred to as the unweighted and reweighted estimators and introduced in Section 3.2.
Comparing the pre-training reweighted estimator with these two alternative estimators demonstrates the advantage of our method and illustrates the pivotal role of the pre-training and reweighting techniques in our framework.

In summary, the contributions of this work are outlined as follows.

(i) We introduce a novel two-stage pre-training reweighted method to address the issue of covariate shift in deep nonparametric quantile regression. We thus propose a pre-training reweighted estimator using deep neural networks for the underlying regression function. To the best of our knowledge, this is the first work to consider estimated density ratios for covariate shift problems.

(ii) We provide a non-asymptotic error analysis for the unweighted, reweighted, and pre-training reweighted quantile regression estimators. This analysis is primarily achieved by decomposing the error into statistical and approximation errors, followed by establishing error bounds via trade-off considerations. Under the adaptive self-calibration condition, the resulting error bounds are of order $O\big(n^{-\frac{2\zeta}{d+2\zeta}}\big)$, where $n$ is the sample size, $d$ is the data dimension, and $\zeta$ is a smoothness parameter, and thus attain minimax optimal rates in nonparametric regression. Furthermore, our theoretical development simultaneously explores scenarios involving both bounded and unbounded density ratios under weaker conditions than those commonly considered in the existing literature. Our theoretical findings provide prior guidance for pre-training and underscore the significance of reweighted techniques.

(iii) Our technical novelty lies in an alternative proof method for bounding the statistical error, which enables us to obtain a sharp rate. This approach is simpler than, and differs from, the conventional local Rademacher complexity techniques (Bartlett et al., 2005). Additionally, we derive the approximation error with weight bounds for ReLU neural networks approximating the Hölder class.
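To give a feel for the rate in contribution (ii), the following minimal sketch evaluates $n^{-2\zeta/(d+2\zeta)}$ for a few dimensions. Constants and any logarithmic factors are ignored, and the function name `rate` as well as the chosen values of $n$, $d$ and $\zeta$ are illustrative assumptions rather than quantities taken from the paper.

```python
def rate(n, d, zeta):
    """Return n ** (-2 * zeta / (d + 2 * zeta)), ignoring constants and log factors."""
    return n ** (-2.0 * zeta / (d + 2.0 * zeta))


# Illustrative values only: smoothness zeta = 2 and three hypothetical dimensions.
for d in (1, 5, 20):
    values = [rate(n, d, zeta=2.0) for n in (10**3, 10**4, 10**5)]
    print(d, [round(v, 4) for v in values])
```

For fixed $n$ and $\zeta$ the value grows with $d$, which reflects the usual curse of dimensionality in nonparametric rates.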
1.1 Outlines

The rest of the paper is organized as follows. In Section 2, we introduce the standard nonparametric quantile regression and the definition of covariate shift. In Section 3, we formally formulate the deep nonparametric quantile estimation problem under covariate shift and propose a two-stage pre-training reweighted method involving the reweighted estimator with the pre-trained density ratio. We also consider the naive unweighted estimator and the reweighted estimator with the true density ratio. In Section 4, we provide the details of the error analysis for the unweighted, reweighted, and pre-training reweighted estimators, and establish theoretical guarantees for these estimators under some conditions on the density ratio. Numerical experiments on synthetic examples are provided in Section 5. Concluding remarks are then given in Section 6. All the technical proofs of lemmas and theorems and additional numerical experiments are deferred to the Appendix.

1.2 Notations

For any $a, b \in \mathbb{R}$, we use $\lfloor a \rfloor$ to denote the largest integer less than $a$, and write $a \vee b = \max\{a, b\}$. We write $a \lesssim b$ and $a = O(b)$ if there exists some positive constant $C$ such that $a \le Cb$, and $a \gtrsim b$ and $a = \Omega(b)$ if there exists some positive constant $C$ such that $a \ge Cb$. We use $\mathbb{N}_0$, $\mathbb{N}$ and $\mathbb{R}$ to denote the sets of non-negative integers, strictly positive integers, and real numbers, respectively. For a multi-index $s = (s_1, \ldots, s_d) \in \mathbb{N}_0^d$, the symbol $\partial^s$ denotes the partial differential operator $\partial^s = \big(\frac{\partial}{\partial x_1}\big)^{s_1} \cdots \big(\frac{\partial}{\partial x_d}\big)^{s_d}$, and we use the convention that $\partial^s$ is the identity operator when $s = 0$. Let $\nu$ be a probability distribution over $\mathbb{R}^d$ and $f$ be a measurable function from $\mathbb{R}^d$ to $\mathbb{R}$. We use $L^p(\nu) = \{f : \int |f(x)|^p \nu(dx) < \infty\}$ for $p \ge 1$ to denote the space of $L^p$-integrable functions with respect to $\nu$, equipped with the norm $\|f\|_{L^p(\nu)} = \{\int |f(x)|^p \nu(dx)\}^{1/p}$.
2 Preliminaries

In this section, we introduce the standard nonparametric quantile regression model and the definition of covariate shift.

2.1 The Standard Nonparametric Quantile Regression

Given a univariate response $Y \in \mathcal{Y} \subseteq \mathbb{R}$ and a $d$-dimensional covariate vector $X = (X_1, \ldots, X_d) \in \mathcal{X} \subseteq \mathbb{R}^d$, $d \in \mathbb{N}$, the $\tau$-th quantile $Q_\tau(Y \mid X)$ of $Y$ given $X$ is

$$Q_\tau(Y \mid X) = f_0(X, \tau). \tag{1}$$

According to (1), we have $\mathbb{P}(Y - f_0(X, \tau) \le 0) = \tau$ for any given $\tau \in (0, 1)$. If we define the error term as $\varepsilon = Y - f_0(X, \tau)$, then model (1) becomes

$$Y = f_0(X, \tau) + \varepsilon, \tag{2}$$

where $\mathbb{P}(\varepsilon \le 0 \mid X = x) = \tau$ for any $x \in \mathcal{X}$. For notational simplicity, we suppress $\tau$ and denote $f_0(\cdot) := f_0(\cdot, \tau)$ since we only focus on one specific quantile level $\tau$. Suppose that a random training sample $D := \{(X_i, Y_i)\}_{i=1}^n$ is composed of independent copies of the random pair $(X, Y)$ drawn from some unknown joint distribution $P_{X,Y}$. The standard quantile regression (Koenker and Bassett Jr, 1978) estimates $f_0$ by

$$\hat{f} \in \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \rho_\tau(Y_i - f(X_i)), \tag{3}$$

where $\rho_\tau(u) = u(\tau - I\{u < 0\})$ denotes the $\tau$-th quantile loss and $\mathcal{F}$ denotes a specific class of measurable functions. It is known that the quantile loss $\rho_\tau(\cdot)$ is Lipschitz continuous with Lipschitz constant $\lambda := \max\{\tau, 1 - \tau\}$, that is, $|\rho_\tau(u_1) - \rho_\tau(u_2)| \le \lambda |u_1 - u_2|$ for any $u_1, u_2 \in \mathbb{R}$. Note that the optimization task (3) is the empirical version of the following learning problem, which aims to find $\tilde{f} : \mathbb{R}^d \to \mathbb{R}$ over all measurable functions satisfying

$$\tilde{f} = \arg\min_{f} \mathbb{E}_{(X,Y) \sim P_{X,Y}} \big[\rho_\tau(Y - f(X)) - \rho_\tau(Y - f_0(X))\big], \tag{4}$$

where $\mathbb{E}_{(X,Y) \sim P_{X,Y}}$ denotes the expectation with respect to $P_{X,Y}$. It is worth pointing out that the sample version of the term $\mathbb{E}_{(X,Y) \sim P_{X,Y}}[\rho_\tau(Y - f_0(X))]$ does not depend on $f$; it can be regarded as a constant and is thus omitted in (3) and elsewhere in this paper. Moreover, under the assumption that the $\tau$-th conditional quantile of $\varepsilon$ given $X$ is zero, the global minimizer $\tilde{f}$ in (4) coincides with the underlying regression function $f_0$ in (2).
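The quantile loss and the empirical problem (3) can be illustrated with a minimal sketch in which the function class consists of constant predictors only; this restriction, the helper names and the toy responses are assumptions made for illustration. In that case a minimizer of the empirical risk is a sample quantile, which can be found by searching over the observed responses because the risk is convex and piecewise linear with kinks at the data points.

```python
def quantile_loss(u, tau):
    """Quantile (check) loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def empirical_risk(c, ys, tau):
    """Average quantile loss of the constant predictor f(x) = c."""
    return sum(quantile_loss(y - c, tau) for y in ys) / len(ys)


def best_constant(ys, tau):
    """Minimize the empirical risk over constants by searching the data points."""
    return min(ys, key=lambda c: empirical_risk(c, ys, tau))


ys = [1.0, 2.0, 3.0, 4.0, 5.0]   # toy responses, made up for illustration
median_fit = best_constant(ys, tau=0.5)
lower_fit = best_constant(ys, tau=0.25)
print(median_fit, lower_fit)
```

Lowering $\tau$ moves the fitted constant towards smaller responses, as expected for a lower quantile.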
2.2 Phenomenon of Covariate Shift

In the literature, it is commonly assumed that the source (training) and target (testing) data originate from the same distribution $P_{X,Y}$ defined over the joint space $\mathcal{X} \times \mathcal{Y}$. Moreover, the joint distribution can be decomposed as $P_{X,Y} = P_{Y|X} P_X$, where $P_{Y|X}$ denotes the conditional distribution determined by the model given in (2) and $P_X$ denotes the marginal distribution of $X$. Yet, in many real applications, the source and target data may come from different joint distributions. Specifically, we assume that the target data are drawn from some other joint distribution $Q_{X,Y}$, which is also defined over the joint space $\mathcal{X} \times \mathcal{Y}$ and can be decomposed as $Q_{X,Y} = Q_{Y|X} Q_X$. To be more precise, in this paper we consider the special case in which the conditional distributions of the source and target data are the same, $P_{Y|X} = Q_{Y|X}$, representing the invariance of the underlying regression model (2), while the marginal distributions of the source data $P_X$ (source distribution) and the target data $Q_X$ (target distribution) are significantly different. Note that assuming $P_{Y|X} = Q_{Y|X}$ implicitly requires that the conditional distribution of $\varepsilon$ given $X$ remains invariant, regardless of whether $X$ is generated from $P_X$ or $Q_X$. This scenario is often referred to as covariate shift.

Under the standard quantile regression without covariate shift, the prediction performance of the estimator $\hat{f}$ in (3) can be evaluated under the $L^2(P_X)$ norm (Padilla et al., 2022b),

$$\|\hat{f} - f_0\|^2_{L^2(P_X)} = \mathbb{E}_{X \sim P_X}\big[(\hat{f}(X) - f_0(X))^2\big],$$

where $\mathbb{E}_{X \sim P_X}$ is the expectation over $P_X$, and it is well known that $\hat{f}$ is a consistent estimator under the $L^2(P_X)$ norm. However, once covariate shift occurs, this metric becomes problematic, because we primarily aim to construct an estimator whose prediction error on the target data is small under the $L^2(Q_X)$ norm,

$$\|\hat{f} - f_0\|^2_{L^2(Q_X)} = \mathbb{E}_{X \sim Q_X}\big[(\hat{f}(X) - f_0(X))^2\big],$$

where $\mathbb{E}_{X \sim Q_X}$ is the expectation over $Q_X$.
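The gap between the two metrics can be seen in a small numerical sketch. Everything in it is a hypothetical choice for illustration: a one-dimensional covariate on $[0, 1]$, an estimation error $\hat{f}(x) - f_0(x) = x$, a uniform source density and a target density $2x$ that puts more mass where the error is large. The integrals are approximated with a midpoint rule.

```python
def squared_l2_error(err, density, grid_size=10_000):
    """Midpoint-rule approximation of the integral of err(x)^2 * density(x) over [0, 1]."""
    h = 1.0 / grid_size
    total = 0.0
    for k in range(grid_size):
        x = (k + 0.5) * h
        total += err(x) ** 2 * density(x) * h
    return total


def err(x):
    """Hypothetical estimation error f_hat(x) - f_0(x)."""
    return x


def p_x(x):
    """Assumed source density: uniform on [0, 1]."""
    return 1.0


def q_x(x):
    """Assumed target density: more mass near 1."""
    return 2.0 * x


err_source = squared_l2_error(err, p_x)   # approximates 1/3
err_target = squared_l2_error(err, q_x)   # approximates 1/2
print(err_source, err_target)
```

In this toy setting the same estimator has a larger squared error under the target distribution than under the source distribution, so a small $L^2(P_X)$ error alone says little about the $L^2(Q_X)$ error.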
It is thus clear that the evaluation of the discrepancy between the source and target distributions plays a crucial role in tackling the problem of covariate shift.

3 Deep Nonparametric Quantile Regression under Covariate Shift

In this section, we provide a deep nonparametric estimation procedure using feedforward neural networks (FNNs) with Rectified Linear Unit (ReLU) activation. We first give a brief introduction to ReLU FNNs and the Hölder class in Section 3.1, and then propose the unweighted and reweighted estimators in Section 3.2 and the novel pre-training reweighted estimator in Section 3.3.

3.1 ReLU FNNs and Hölder class

Recently, deep neural networks have attracted tremendous attention and achieved success in many applications (Bauer and Kohler, 2019; Schmidt-Hieber, 2020; Lu et al., 2021; Jiao et al., 2023). Neural network functions have demonstrated their effectiveness in approximating high-dimensional functions. In this paper, we aim to estimate the regression function within the function class of ReLU FNNs. Precisely, we use $\mathcal{F}_{d_1,d_2}(W, D, S, B)$ to denote the set of functions $\phi$ that can be parameterized by ReLU FNNs with width $W$, depth $D$, size $S$ and weights bound $B$, where $d_1$ and $d_2$ denote the input and output dimensions of the ReLU FNNs. Specifically, the ReLU FNN $\phi : \mathbb{R}^{d_1} \to \mathbb{R}^{d_2}$ can be expressed as

$$\phi(x) = A_D \phi_D(x) + b_D,$$

where

$$\phi_0(x) = x, \qquad \phi_{\ell+1}(x) = \sigma(A_\ell \phi_\ell(x) + b_\ell), \quad \ell = 0, \ldots, D - 1,$$

$A_\ell \in \mathbb{R}^{N_{\ell+1} \times N_\ell}$ denotes the weight matrix, $N_\ell \in \mathbb{N}$ denotes the width of the $\ell$-th hidden layer with $N_0 = d_1$ and $N_{D+1} = d_2$, $b_\ell \in \mathbb{R}^{N_{\ell+1}}$ denotes the bias vector, and $\sigma(x) = \max(0, x)$ is the ReLU activation function applied to each element of $x$. Therefore, the parameters of the ReLU FNN $\phi(\cdot)$ can be denoted as

$$\theta = \big((A_0, b_0), \ldots, (A_{D-1}, b_{D-1}), (A_D, b_D)\big).$$
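The recursion defining $\phi$ can be written out directly. The sketch below is a plain-Python illustration, not the implementation used in the paper: matrices are lists of rows, the helper names are hypothetical, and the small network `theta` is made up. The last helper also counts the non-zero parameters and the largest absolute parameter value, the two quantities that the size $S$ and the weights bound $B$ refer to.

```python
def relu(v):
    """Apply sigma(x) = max(0, x) to each element."""
    return [max(0.0, t) for t in v]


def affine(A, b, v):
    """Compute A v + b for a matrix A given as a list of rows."""
    return [sum(a * t for a, t in zip(row, v)) + bias for row, bias in zip(A, b)]


def forward(theta, x):
    """Evaluate phi(x) = A_D phi_D(x) + b_D with phi_{l+1} = relu(A_l phi_l + b_l)."""
    h = list(x)
    for A, b in theta[:-1]:       # hidden layers l = 0, ..., D - 1
        h = relu(affine(A, b, h))
    A_out, b_out = theta[-1]      # output layer: affine map only
    return affine(A_out, b_out, h)


def network_stats(theta):
    """Return (width, depth, number of non-zero parameters, largest absolute parameter)."""
    entries = [t for A, _ in theta for row in A for t in row]
    entries += [t for _, b in theta for t in b]
    width = max(len(b) for _, b in theta[:-1])
    depth = len(theta) - 1
    size = sum(1 for t in entries if t != 0)
    bound = max(abs(t) for t in entries)
    return width, depth, size, bound


# Made-up network with d1 = 2, one hidden layer of width 2, and d2 = 1.
theta = [
    ([[1.0, -1.0], [0.0, 2.0]], [0.0, -1.0]),   # (A_0, b_0)
    ([[1.0, 1.0]], [0.5]),                      # (A_1, b_1)
]
output = forward(theta, [1.0, 2.0])
stats = network_stats(theta)
print(output, stats)
```

Because only non-zero entries are counted, a sparsely connected network can have a size well below the number of entries of its weight matrices and bias vectors.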
Furthermore, we denote the number of non-zero elements and the largest absolute value of the parameters in $\theta$ as

$$\|\theta\|_0 = \sum_{\ell=0}^{D} \big(\|\mathrm{vec}(A_\ell)\|_0 + \|b_\ell\|_0\big), \quad \text{and} \quad \|\theta\|_\infty = \max\Big\{\max_{\ell \in \{0,\ldots,D\}} \|\mathrm{vec}(A_\ell)\|_\infty, \ \max_{\ell \in \{0,\ldots,D\}} \|b_\ell\|_\infty\Big\},$$

respectively. Note that $\mathrm{vec}(A)$ transforms the matrix $A$ into the corresponding vector by concatenating its column vectors. Then, the width $W$, the size $S$, and the bound $B$ are defined by $W = \max\{N_1, \ldots, N_D\}$, $S = \|\theta\|_0$, and $\|\theta\|_\infty \le B$, respectively. It is thus clear that the ReLU FNN need not be fully connected, so $S$ can be much smaller than in the fully connected case. In the rest of this paper, we focus on the ReLU FNNs $\mathcal{F}_{d_1,d_2}(W, D, S, B)$ with $d_1 = d$ and $d_2 = 1$.

Furthermore, we introduce the definition of the Hölder class, which is a generalization of Lipschitz continuity and is commonly used to characterize the smoothness of functions (Bauer and Kohler, 2019; Schmidt-Hieber, 2020; Lu et al., 2021; Jiao et al., 2023).

Definition 1 (Hölder class). We denote the Hölder class $\mathcal{H}^\zeta([0, 1]^d, B)$ as

$$\mathcal{H}^\zeta([0, 1]^d, B) := \Big\{ f : \mathbb{R}^d \to \mathbb{R}, \ \max_{\|s\|_1 \le t} \|\partial^s f\|_\infty \le B, \ \max_{\|s\|_1 = t} \sup_{x \ne y} \frac{|\partial^s f(x) - \partial^s f(y)|}{\|x - y\|^{s}} \le B \Big\},$$

for some $B, \zeta > 0$ with $\zeta = t + s$, $t \in \mathbb{N}_0$ and $s \in (0, 1]$, and $d \in \mathbb{N}$. (Inside $\partial^s$ and $\|s\|_1$ the symbol $s$ is a multi-index, while in the exponent of $\|x - y\|$ it is the fractional part of $\zeta$.)

Here we assume that the underlying function $f_0$, defined in (2), is Hölder continuous with parameters $B$ and $\zeta$, and thus belongs to the Hölder class $\mathcal{H}^\zeta([0, 1]^d, B)$ in Definition 1.

3.2 Unweighted and Reweighted Estimators

Based on the training sample $D := \{(X_i, Y_i)\}_{i=1}^n$, traditional nonparametric quantile regression with a standard empirical risk minimization (ERM) framework leads to the naive unweighted estimator given by

$$\hat{f}_D \in \arg\min_{f \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^n \rho_\tau(Y_i - f(X_i)), \tag{5}$$

where $\mathcal{G}$ denotes a function class of ReLU FNNs as defined in Section 3.1. Note that the right side of (5) is an unbiased estimator of $\mathbb{E}_{(X,Y) \sim P_{X,Y}}[\rho_\tau(Y - f_0(X))]$ in the absence of covariate shift. However, when covariate shift happens, this term becomes biased, potentially resulting in an inaccurate predictive estimator.
To tackle the issue of covariate shift, it is crucial to measure the discrepancy between the source and target distributions, and thus we introduce the concept of the density ratio. Specifically, we denote by $p_X$ and $q_X$ the probability density functions of $P_X$ and $Q_X$, respectively. Then, the density ratio is defined as

$$r(x) = \frac{q_X(x)}{p_X(x)},$$

which quantifies the dissimilarity between $P_X$ and $Q_X$. Once the density ratio $r(x)$ is known, we can address the covariate shift problem by considering the following reweighted ERM task:

$$\hat{f}_{r,D} \in \arg\min_{f \in \mathcal{J}} \frac{1}{n} \sum_{i=1}^n r(X_i) \rho_\tau(Y_i - f(X_i)), \tag{6}$$

where $\mathcal{J}$ denotes a function class of ReLU FNNs. Here, we use different notation for the function classes for theoretical simplicity. The reweighted procedure ensures an unbiased estimator of $\mathbb{E}_{(X,Y) \sim Q_{X,Y}}[\rho_\tau(Y - f_0(X))]$, since a change of measure with $r = q_X / p_X$ and $P_{Y|X} = Q_{Y|X}$ gives

$$\mathbb{E}_{(X,Y) \sim P_{X,Y}}[r(X) \rho_\tau(Y - f_0(X))] = \mathbb{E}_{(X,Y) \sim Q_{X,Y}}[\rho_\tau(Y - f_0(X))].$$

Similar treatments can be found in Ma et al. (2023) and Feng et al. (2023) under the reproducing kernel Hilbert space (RKHS) framework. Yet it is unrealistic to have the exact density ratio as prior information in practice, and most existing studies (Ma et al., 2023; Feng et al., 2023) only investigate the statistical properties of the reweighted estimator under the squared loss function and Lipschitz continuous loss functions with known density ratios. Thus, it is still an open and fundamental problem to propose a reweighted estimator with a plug-in estimated ratio and to investigate its statistical properties.

3.3 Pre-training Reweighted Estimator

When the exact density ratio is not available, it is natural to consider an alternative problem akin to (6) with a plug-in density ratio estimator. Inspired by this, we propose a two-step pre-training reweighted estimator.
In the first step, a density ratio estimator is obtained through least-squares density ratio fitting using a deep neural network (Kanamori et al., 2009; Sugiyama et al., 2012). Subsequently, we use this plug-in density ratio estimator to compute the final pre-training reweighted estimator. To the best of our knowledge, the theoretical results established in Section 4 are the first in the literature to analyze a pre-trained density ratio estimator rather than simply assuming the true density ratio.

In the first step, we are motivated by the fact that the exact density ratio $r$ is the minimizer of $\min_u \frac{1}{2} \mathbb{E}_{X \sim P_X}[(u(X) - r(X))^2]$ over all measurable functions $u$. Expanding the square, using $\mathbb{E}_{X \sim P_X}[u(X) r(X)] = \mathbb{E}_{X \sim Q_X}[u(X)]$, and removing the term $\frac{1}{2}\mathbb{E}_{X \sim P_X}[r(X)^2]$, which does not depend on $u$, the optimization becomes

$$r = \arg\min_u L(u) = \arg\min_u \ \frac{1}{2} \mathbb{E}_{X \sim P_X}[u(X)^2] - \mathbb{E}_{X \sim Q_X}[u(X)]. \tag{7}$$

Suppose that we obtain extra unlabeled samples drawn from $P_X$ and $Q_X$, denoted by $S_P := \{X_i^P\}_{i=1}^m$ and $S_Q := \{X_i^Q\}_{i=1}^m$, respectively. Consequently, the empirical version of (7) can be formulated as

$$\hat{r}_S \in \arg\min_{u \in \mathcal{U}} \hat{L}_S(u) = \arg\min_{u \in \mathcal{U}} \ \frac{1}{2m} \sum_{i=1}^m u(X_i^P)^2 - \frac{1}{m} \sum_{i=1}^m u(X_i^Q), \tag{8}$$

where $\hat{L}_S(\cdot)$ denotes the empirical pre-training risk and $\mathcal{U}$ also denotes a function class of ReLU FNNs as defined in Section 3.1.

In the second step, by substituting the true density ratio $r$ in (6) with its estimator $\hat{r}_S$ obtained from (8), we obtain the final pre-training reweighted estimator over a hypothesis function class $\mathcal{M}$, defined as

$$\hat{f}_{\hat{r}_S, D} \in \arg\min_{f \in \mathcal{M}} \frac{1}{n} \sum_{i=1}^n \hat{r}_S(X_i) \rho_\tau(Y_i - f(X_i)). \tag{9}$$

Here we want to emphasize that the density ratio $\hat{r}_S$ is estimated using the unlabeled samples $S_P$ and $S_Q$, which are independent of the data $D$ used in solving (9). This independence assumption is common in the context of covariate shift. It allows us to obtain an accurate estimate of the density ratio $r$ using a large volume of independent unlabeled data. Additionally, this assumption plays a crucial role in our theoretical analysis.
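The two steps in (8) and (9) can be sketched on a toy problem. The sketch deliberately swaps the ReLU network classes for much simpler ones, which is an assumption made only to keep the example self-contained: the density ratio is fitted over functions that are constant on fixed bins (for which (8) has a closed-form minimizer), and the regression function is fitted over constant predictors, with the quantile loss applied to the residual $Y_i - f(X_i)$ as in (3). All helper names and data values are hypothetical.

```python
def fit_ratio_histogram(xs_p, xs_q, edges):
    """Minimize (8) over functions that are constant on each bin of `edges`.

    Assumes equal sample sizes and that all points lie in [edges[0], edges[-1]].
    """
    def bin_of(x):
        for k in range(len(edges) - 1):
            if edges[k] <= x < edges[k + 1]:
                return k
        return len(edges) - 2   # x == edges[-1] goes to the last bin

    counts_p = [0] * (len(edges) - 1)
    counts_q = [0] * (len(edges) - 1)
    for x in xs_p:
        counts_p[bin_of(x)] += 1
    for x in xs_q:
        counts_q[bin_of(x)] += 1
    # Per bin, theta -> theta^2 * n_P / (2m) - theta * n_Q / m is minimized at n_Q / n_P.
    # Bins without source points get 0.0 here, a toy convention.
    values = [cq / cp if cp > 0 else 0.0 for cp, cq in zip(counts_p, counts_q)]
    return lambda x: values[bin_of(x)]


def quantile_loss(u, tau):
    """Quantile (check) loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def fit_weighted_constant(xs, ys, ratio, tau):
    """Minimize the weighted empirical quantile loss over constant predictors."""
    ws = [ratio(x) for x in xs]

    def risk(c):
        return sum(w * quantile_loss(y - c, tau) for w, y in zip(ws, ys)) / len(ys)

    return min(ys, key=risk)   # a minimizer lies at one of the observed responses


# Step 1: unlabeled samples (made up); the target puts more mass on [0.5, 1].
ratio = fit_ratio_histogram([0.1, 0.2, 0.3, 0.6], [0.1, 0.7, 0.8, 0.9], [0.0, 0.5, 1.0])

# Step 2: labeled source data (made up); responses are larger where x >= 0.5.
xs, ys = [0.1, 0.2, 0.3, 0.6], [1.0, 1.0, 1.0, 5.0]
f_weighted = fit_weighted_constant(xs, ys, ratio, tau=0.5)
f_unweighted = fit_weighted_constant(xs, ys, lambda x: 1.0, tau=0.5)
print(f_weighted, f_unweighted)
```

In this toy data the unweighted fit follows the region where the source sample is dense, while the weighted fit is pulled towards the region favoured by the target sample.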
We summarize our proposed two-step pre-training deep nonparametric quantile regression algorithm in Algorithm 1. Note that the first step of Algorithm 1 adopts unsupervised learning to estimate the density ratio $r$ based on the unlabeled data $S_P$ and $S_Q$. The second step of Algorithm 1 integrates the estimated pre-trained density ratio $\hat{r}_S$ to estimate the underlying regression function $f_0$ using the labeled data $D$, and thus combines information derived from the estimated density ratio and the labeled data. The proposed algorithm thus captures the essence of the pre-training reweighted approach, which strategically combines unsupervised and supervised learning paradigms to address the challenges of covariate shift in the regression setting.

Algorithm 1 The two-step pre-training deep nonparametric quantile regression algorithm
1. Input: Unlabeled data $S_P = \{X_i^P\}_{i=1}^m$ from $P_X$, $S_Q = \{X_i^Q\}_{i=1}^m$ from $Q_X$, and labeled data $D = \{(X_j, Y_j)\}_{j=1}^n$ sampled from $P_{X,Y}$.
2. Pre-train the density ratio: Obtain $\hat{r}_S$ by solving (8) using $S_P$ and $S_Q$.
3. Reweighted nonparametric quantile regression: Obtain $\hat{f}_{\hat{r}_S, D}$ by solving (9) using $\hat{r}_S$ and $D$.
4. Return: the pre-training reweighted estimate $\hat{f}_{\hat{r}_S, D}$.

It is worth pointing out that the difficulty of estimating the density ratio $r$ depends on the shapes of the distributions. Specifically, if $p_X$ and $q_X$ have different tail behaviors, for example, one is heavy-tailed and the other is light-tailed, then estimating $r$ may become challenging. This is largely because, in the tail regions, $r$ may be very large or even unbounded. Directly using the original density ratio estimate may then significantly increase the variance, since $\hat{r}_S$ may take very large values at some covariates.
To address this issue, we consider the truncated density ratio estimator $\hat{r}_{\xi,S}$, that is,

$$\hat{r}_{\xi,S} \in \arg\min_{u \in T_\xi \mathcal{U}} \hat{L}_S(u), \tag{10}$$

where $T_\xi \mathcal{U} = \{T_\xi u; u \in \mathcal{U}\}$ denotes a truncated set of $\mathcal{U}$ with

$$T_\xi u = \begin{cases} u, & \text{if } u \le \xi, \\ \xi, & \text{otherwise}, \end{cases}$$

and $\xi$ is some previously chosen threshold. Note that the truncated function class $T_\xi \mathcal{U}$ can be realized by adding an extra activation function $\sigma_\xi(x) = \min(x, \xi)$ to the original ReLU FNN function class $\mathcal{U}$. Thus, the ratio estimator $\hat{r}_S$ in Algorithm 1 is replaced with its truncated counterpart $\hat{r}_{\xi,S}$ to obtain the following estimator:

$$\hat{f}_{\hat{r}_{\xi,S}, D} \in \arg\min_{f \in \mathcal{M}} \frac{1}{n} \sum_{i=1}^n \hat{r}_{\xi,S}(X_i) \rho_\tau(Y_i - f(X_i)). \tag{11}$$

Note that the estimation accuracies of $\hat{r}_S$ and $\hat{r}_{\xi,S}$ affect the performance of $\hat{f}_{\hat{r}_S, D}$ and $\hat{f}_{\hat{r}_{\xi,S}, D}$, respectively. In practice, unlabeled data are typically much cheaper and more readily available than labeled data, so a large amount of unlabeled data can usually be collected to ensure the estimation accuracy. For example, collecting response variables is laborious and costly in electronic health records (EHR) data sets, while the covariates are quite easy to obtain from the databases (Gronsbell and Cai, 2018). Numerically, we use $m = 1000$ unlabeled data points to estimate the density ratio in all our simulated examples, which yields satisfactory numerical performance.

4 Theoretical Analysis

In this section, we provide the theoretical analysis of the non-asymptotic error bounds for the unweighted (Section 4.1), reweighted (Section 4.2), and pre-training reweighted estimators (Section 4.3), respectively.

4.1 Theoretical Analysis for Unweighted Estimator

We establish the non-asymptotic error bounds for the unweighted estimator $\hat{f}_D$ defined in (5) (Theorems 8 and 9) under two broad scenarios of covariate shift, where the density ratio is either bounded or unbounded.
To start with, we define the ball centered at $f(x)$ with radius $c \ge 0$ as

$$B(f(x), c) = \{y : |y - f(x)| \le c\},$$

and denote the conditional density function and the cumulative distribution function of $Y$ given $X$ by $p_{Y|X}$ and $F_{Y|X}(\cdot)$, respectively. The following technical assumptions are introduced for our theoretical development.

Assumption 1. The density ratio $r$ is well-defined, that is, the source density satisfies $p_X(x) > 0$ whenever the target density satisfies $q_X(x) > 0$.

Assumption 2. Whether $X$ is distributed according to $P_X$ or $Q_X$, the $\tau$-th quantile of $\varepsilon$ given $X$ is zero and $\varepsilon$ given $X$ shares the same conditional distribution.

Assumption 3. There exist some positive constants $\xi_1$, $\xi_2$ and $\kappa$ such that for any $|\delta| \le \xi_1$ and $y \in B(f_0(x), \xi_2)$, there holds

$$\big|F_{Y|X=x}(y + \delta) - F_{Y|X=x}(y)\big| \ge \kappa |\delta|,$$

almost surely. Moreover, for some absolute positive constant $c$, there holds $\sup_{t \in \mathbb{R}} p_{Y|X=x}(t) \le c$.

Assumption 1 requires that the density ratio $r(\cdot) = q_X(\cdot)/p_X(\cdot)$ always exists, implying that the two densities must have overlapping support or that the support of the source covariates must cover that of the target covariates. A similar assumption is commonly required in the literature on importance weighting (Cortes et al., 2010) and covariate shift (Ma et al., 2023; Feng et al., 2023). Assumption 2 ensures the exact equivalence of $\tilde{f}$ in (4) to the underlying regression function $f_0$ in (2), as well as the requirement that $P_{Y|X} = Q_{Y|X}$. This assumption is naturally satisfied if $\varepsilon$ is independent of $X$. Assumption 3 can be regarded as an adaptive self-calibration condition governing the conditional distribution of $Y$ given $X$, and differs significantly from the self-calibration condition commonly assumed in the literature (Shen et al., 2021; Madrid Padilla and Chatterjee, 2022), where only the specific case $\xi_2 = 0$ is considered. Note that Assumption 3 requires that, for every $y$ in the neighborhood $B(f_0(x), \xi_2)$, the change in $F_{Y|X}(\cdot)$ is at least proportional to the perturbation of $y$.
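As a concrete illustration of Assumption 3, consider the hypothetical case where $\varepsilon$ is standard normal, independent of $X$, and $\tau = 0.5$, so that $F_{Y|X=x}(y) = \Phi(y - f_0(x))$. The sketch below, in which all names and parameter values are assumptions made for illustration, checks the inequality $|F(y + \delta) - F(y)| \ge \kappa |\delta|$ on a grid after centring at $f_0(x) = 0$. The candidate constant $\kappa$ is the smallest value of the normal density on $[-(\xi_1 + \xi_2), \xi_1 + \xi_2]$, which lower-bounds the average density over any interval between $y$ and $y + \delta$.

```python
import math


def normal_cdf(t):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def normal_pdf(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def check_self_calibration(xi1, xi2, kappa, steps=200):
    """Grid check of |F(y + delta) - F(y)| >= kappa * |delta| for |delta| <= xi1, |y| <= xi2."""
    for i in range(steps + 1):
        y = -xi2 + 2.0 * xi2 * i / steps
        for j in range(steps + 1):
            delta = -xi1 + 2.0 * xi1 * j / steps
            gap = abs(normal_cdf(y + delta) - normal_cdf(y))
            if gap < kappa * abs(delta) - 1e-12:   # small tolerance for rounding
                return False
    return True


xi1, xi2 = 0.5, 0.5                      # hypothetical radii
kappa = normal_pdf(xi1 + xi2)            # smallest density value on the relevant interval
holds = check_self_calibration(xi1, xi2, kappa)
too_large = check_self_calibration(xi1, xi2, kappa=0.5)
print(holds, too_large)
```

The second call uses a constant above the peak of the normal density, for which the inequality cannot hold; the density bound in Assumption 3 is met in this example by any $c \ge 1/\sqrt{2\pi}$.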
It is crucial to emphasize that Assumption 3 plays a vital role in deriving the refined error decomposition in Lemma 4, and consequently leads to the minimax optimal result in Theorem 8. By Assumption 3, the approximation error within our decomposition in Lemma 4 is quantified by the term $\inf_{f \in \mathcal{G}} \|f - f_0\|_\infty^2$. This is in sharp contrast to the commonly assumed self-calibration condition in the literature, where the approximation error is quantified by the term $\inf_{f \in \mathcal{G}} \|f - f_0\|_\infty$ within the error decomposition using the Lipschitz continuity of the quantile loss, and thus leads to a suboptimal theoretical order of the error bound.

Remark 2. Note that Assumption 1 does not hold if the two densities have non-overlapping supports, since there will be regions where $p_X(x) = 0$ and $q_X(x) > 0$, leading to a division by 0 and the failure of the definition. In these cases, performing direct density ratio estimation becomes problematic, and some different treatments should be considered. If the non-overlapping supports are known in advance, one possible treatment is to restrict the estimation to the region where both densities have overlapping support. That is, we only estimate the density ratio in the overlapping region and assign the ratio to be zero outside the support of $p_X$. If the non-overlapping supports are unknown, the reweighted framework applied in this paper is no longer applicable, and some new framework needs to be developed (Mallinar et al., 2024), which may use some other divergence measures such as the total variation distance (Rényi, 1961; Liese and Vajda, 2006) to evaluate the difference between distributions. Yet, we want to point out that these methods and their theoretical investigation are beyond the scope of this paper, and we leave this interesting problem as future work.

Remark 3. It is worth pointing out that under the special case that $\xi_2 = 0$, Assumption 3 is similar to the assumptions required in Shen et al.
(2021); Madrid Padilla and Chatterjee (2022), and is much weaker than assumptions required in He and Shi (1994); Belloni and Chernozhukov (2011); Padilla et al. (2022a). Precisely, Condition 2 in He and Shi (1994) assumes that the density function of $Y$ is lower bounded by some positive constant; Condition D.1 in Belloni and Chernozhukov (2011) requires that the conditional density of $Y$ given $X = x$ must be both continuously differentiable and bounded away from zero uniformly across all quantile levels and for any $x \in \mathcal{X}$; Assumption 2 in Padilla et al. (2022a) also assumes that the conditional density of $Y$ given $X = x$ is upper-bounded by a positive constant.

4.1.1 Error Decomposition and Estimates for Error Terms

In this section, we provide a novel error decomposition of the unweighted estimator including the approximation error and statistical error terms, which is important for the non-asymptotic error analysis. We define $f^*$ as the best approximation of $f_0$ within some function space $\mathcal{G}$ that is uniformly bounded by $B$ with $B \ge 1$, where the approximation quality is measured by the norm $\|\cdot\|_\infty$, namely,

$$f^* \in \arg\min_{f \in \mathcal{G}} \|f - f_0\|_\infty.$$

Together with Assumptions 2 and 3, the $L^2(P_X)$ error of the unweighted estimator $\widehat{f}_D$ can be decomposed into two distinct components as stated in the following lemma.

Lemma 4. Suppose that Assumptions 2 and 3 are satisfied, and the function space $\mathcal{G}$ is also uniformly bounded by $B$ with $B \ge 1$. Then, the unweighted estimator $\widehat{f}_D$ defined in (5) satisfies

$$\|\widehat{f}_D - f_0\|^2_{L^2(P_X)} \lesssim B\,\mathbb{E}_{(X,Y) \sim P_{X,Y}}\big[\rho_\tau(Y - \widehat{f}_D(X)) - \rho_\tau(Y - f^*(X))\big] + B^2 \inf_{f \in \mathcal{G}} \|f - f_0\|_\infty^2.$$

Lemma 4 decomposes the upper bound of $\|\widehat{f}_D - f_0\|^2_{L^2(P_X)}$ into two terms, and it suffices to separately bound the statistical error term $\mathbb{E}_{(X,Y) \sim P_{X,Y}}[\rho_\tau(Y - \widehat{f}_D(X)) - \rho_\tau(Y - f^*(X))]$ and the approximation error term $\inf_{f \in \mathcal{G}} \|f - f_0\|_\infty$.
For the approximation error term, we use the technical tools and results in Yarotsky (2017); Petersen and Voigtlaender (2018) to show in Theorem 5 that any Hölder continuous function can be approximated arbitrarily well by a ReLU FNN with properly chosen depth, size, and weights bound. Note that for mathematical simplicity, we assume that the support of $X$ is $\mathcal{X} = [0, 1]^d$, which can be easily relaxed.

Theorem 5. Suppose that $f \in \mathcal{H}^\zeta([0, 1]^d, B)$. For any $\epsilon \in (0, 1)$, there exists a ReLU FNN function $\psi$ with the depth $\mathcal{D} \le O(\log(1/\epsilon))$, size $\mathcal{S} \le O(\epsilon^{-d/\zeta} \log(1/\epsilon))$, and weights bound $\mathcal{B} \le O(B\epsilon^{-d/\zeta})$ such that $\|f - \psi\|_\infty \le B\epsilon$.

Theorem 5 establishes the upper bound of the approximation error, which is related to the depth $\mathcal{D}$, size $\mathcal{S}$ and weights bound $\mathcal{B}$ of the considered ReLU FNN. Specifically, it demonstrates that the upper bound of the approximation error is of order $O(\mathcal{S}^{-\zeta/d})$ with an omitted logarithmic term, which is consistent with the results in Yarotsky (2017); Petersen and Voigtlaender (2018). It is worth pointing out that the proof of Theorem 5 partially follows the technical treatment in Yarotsky (2017) and also adopts the technical tools used in Petersen and Voigtlaender (2018). This deliberate choice facilitates the derivation of the weights bound, aligning with the rate presented in Petersen and Voigtlaender (2018). The complete proof of Theorem 5 is provided in Section A.1.1.

For the statistical error, we use the empirical process techniques (Vaart and Wellner, 2023; Van der Vaart, 2000; Van de Geer and van de Geer, 2000; Vershynin, 2018) to derive the error bound in Theorem 7, which is characterized by the complexity of the considered function class $\mathcal{G}$, including measures such as the covering number (Anthony et al., 1999).

Definition 6. (Covering number) For $\delta > 0$, the covering number $N(\delta, \mathcal{F}, d)$ related to a semi-metric $d$ on the set $\mathcal{F}$ is defined as

$$N(\delta, \mathcal{F}, d) = \min\Big\{\kappa : \text{there are } g_1, \ldots, g_\kappa \text{ such that } \min_{1 \le j \le \kappa} d(f, g_j) \le \delta \text{ for any } f \in \mathcal{F}\Big\}.$$
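To make Definition 6 concrete, the following minimal sketch builds a $\delta$-cover of a finite point set with a greedy rule. It is only an illustration under stated assumptions: the function names `greedy_cover` and `is_cover`, the integer grid, and the absolute-value distance are hypothetical choices that do not appear in the paper. The size of any valid cover is an upper bound on the covering number $N(\delta, \mathcal{F}, d)$; the greedy cover is generally not minimal.

```python
from typing import Callable, List, Sequence


def greedy_cover(points: Sequence[float], delta: float,
                 dist: Callable[[float, float], float]) -> List[float]:
    """Return centers such that every point lies within delta of some center."""
    centers: List[float] = []
    for p in points:
        # Promote p to a center only if no existing center already covers it.
        if all(dist(p, c) > delta for c in centers):
            centers.append(p)
    return centers


def is_cover(points: Sequence[float], centers: Sequence[float], delta: float,
             dist: Callable[[float, float], float]) -> bool:
    """Check the defining property: min_j d(f, g_j) <= delta for every f."""
    return all(min(dist(p, c) for c in centers) <= delta for p in points)


def abs_dist(a: float, b: float) -> float:
    return abs(a - b)


# Hypothetical example: the integers 0..100 with radius delta = 10.
points = list(range(101))
centers = greedy_cover(points, 10, abs_dist)
print(len(centers), is_cover(points, centers, 10, abs_dist))
```

The greedy centers are pairwise more than $\delta$ apart, so the same construction also yields a packing; this is why it can overshoot the minimal cover size.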
With the help of empirical process techniques, the explicit upper bound of the statistical error is provided in the following theorem.

Theorem 7. Suppose that Assumption 3 holds, the considered function class $\mathcal{G}$ is a ReLU FNN as defined in Section 3.1, and $\|f^*\|_\infty \le B$. Then we have

$$\mathbb{E}_D\,\mathbb{E}_{(X,Y) \sim P_{X,Y}}\big[\rho_\tau(Y - \widehat{f}_D(X)) - \rho_\tau(Y - f^*(X))\big] \le O\Big(\frac{B\,\mathcal{S}\mathcal{D} \log(n\mathcal{B}\mathcal{W}\mathcal{D})}{n}\Big),$$

where $\mathbb{E}_D$ is the expectation taken over the training sample $D$.

Theorem 7 provides the upper bound of the statistical error, which is related to the depth $\mathcal{D}$, size $\mathcal{S}$ and weights bound $\mathcal{B}$ of the considered ReLU FNN. Note that if we keep the other terms fixed and omit the logarithmic term, the rate of the upper bound becomes $O(\mathcal{S}\mathcal{D}/n)$, and the fastest rate $O(1/n)$ can also be achieved if we further assume that $\mathcal{S}$ and $\mathcal{D}$ are fixed. We want to emphasize that an innovative proof technique is adopted for analyzing the statistical error, where we first derive the tail probability of $\rho_\tau(Y - \widehat{f}_D(X)) - \rho_\tau(Y - f^*(X))$ by using Bernstein's inequality, and then characterize the statistical error by utilizing the covering number under the norm $\|\cdot\|_\infty$. This technical treatment is motivated by the properties of the quantile loss, which exhibits Lipschitz continuity, boundedness, and local strong convexity over the target function. Consequently, we can establish a tight bound for the covering number of the ReLU FNN under the norm $\|\cdot\|_\infty$ and leverage the local convexity of $\rho_\tau(\cdot)$, which yields the sharp bound of $O(\mathcal{S}\mathcal{D}/n)$ for the statistical error in Theorem 7. Clearly, the technical tools adopted in our theoretical analysis significantly differ from and simplify the commonly used local Rademacher complexity techniques (Bartlett et al., 2005). More details are provided in Section A.1.2.

Now we are ready to investigate the theoretical performance of the unweighted estimator under covariate shift.
Specifically, we establish the upper bound of $\|\widehat{f}_D - f_0\|_{L^2(Q_X)}$ under two different scenarios where the density ratio is either uniformly bounded or unbounded but has a finite second moment. We first introduce the uniformly bounded assumption for the density ratio below.

Assumption 4 (Uniformly bounded). The density ratio $r(x) = q_X(x)/p_X(x)$ is upper-bounded, that is, $\Gamma := \sup_{x \in \mathcal{X}} r(x) < \infty$.

Assumption 4 is a commonly used condition in the literature of covariate shift (Cortes et al., 2010; Ma et al., 2023; Feng et al., 2023). Note that $\Gamma$ is typically assumed to be larger than 1, since $\Gamma = 1$ indicates that $P_X$ and $Q_X$ are the same. Under Assumption 4, it is easy to verify that

$$\|\widehat{f}_D - f_0\|^2_{L^2(Q_X)} \le \Gamma\,\|\widehat{f}_D - f_0\|^2_{L^2(P_X)}.$$

Then, along with the approximation and statistical errors established in Theorems 5 and 7, we can provide the non-asymptotic error bound for $\widehat{f}_D$ with the proper choices of parameters, including the size $\mathcal{S}$, depth $\mathcal{D}$, and weights bound $\mathcal{B}$ of the considered ReLU FNN class.

Theorem 8. Suppose that Assumptions 2-4 are satisfied. If the function class $\mathcal{G}$ is a ReLU FNN function class bounded by $B \ge 1$ with the size $\mathcal{S} = O(n^{\frac{d}{d+2\zeta}} \log n)$, the depth $\mathcal{D} = O(\log n)$, and the weights bound $\mathcal{B} = O(Bn^{\frac{d}{d+2\zeta}})$, then we have

$$\mathbb{E}_D\|\widehat{f}_D - f_0\|^2_{L^2(Q_X)} \le O\big(\Gamma B^2 n^{-\frac{2\zeta}{d+2\zeta}} (\log n)^3\big).$$

Theorem 8 establishes the non-asymptotic error bound for the unweighted estimator $\widehat{f}_D$ under the uniformly bounded ratio case, and the obtained bound concurs with the minimax optimal rate in the literature of nonparametric regression (Stone, 1982; Györfi et al., 2002; Tsybakov, 2009). This implies that even when covariate shift occurs, the unweighted estimator is still optimal when the density ratio is uniformly bounded. The proof sketch of Theorem 8 is provided below, which may give some intuition on why the minimax optimal rate can still be achieved under the bounded case.
Note that in the proof of Theorem 8, we first show that the convergence rate of $\widehat{f}_D$ with respect to the source distribution aligns with the minimax lower bound in the literature of nonparametric regression (Stone, 1982; Tsybakov, 2009), i.e.,

$$\mathbb{E}_D\|\widehat{f}_D - f_0\|^2_{L^2(P_X)} \le O\big(B^2 n^{-\frac{2\zeta}{d+2\zeta}} (\log n)^3\big).$$

Then, when covariate shift occurs and the density ratio is assumed to be bounded by $\sup_{x \in \mathcal{X}} r(x) \le \Gamma$, we have

$$\mathbb{E}_D\|\widehat{f}_D - f_0\|^2_{L^2(Q_X)} \le \Gamma\,\mathbb{E}_D\|\widehat{f}_D - f_0\|^2_{L^2(P_X)} \le O\big(\Gamma B^2 n^{-\frac{2\zeta}{d+2\zeta}} (\log n)^3\big).$$

This implies that the convergence rate of $\widehat{f}_D$ with respect to the target distribution is also minimax optimal (ignoring the constants and the log terms). Moreover, since the density ratio is assumed to be uniformly bounded, the shift between the source and target distributions cannot be too large, and thus may be well controlled. Therefore, the unweighted estimator may still achieve good prediction performance. However, such a uniform boundedness condition is somewhat restrictive in practice, and the following assumption introduces a relaxation, assuming that the density ratio has a finite second moment.

Assumption 5 (Finite second moment). The density ratio $r$ has a finite second moment with respect to $P_X$, that is, $V := \mathbb{E}_{X \sim P_X}[r^2(X)] < \infty$.

Note that Assumption 5 is much weaker than Assumption 4, since the finite second moment is implied by the uniformly bounded condition with $V \le \Gamma$ (Ma et al., 2023; Feng et al., 2023). To see this, we have $\mathbb{E}_{X \sim P_X}[r^2(X)] = \mathbb{E}_{X \sim Q_X}[r(X)] \le \Gamma$. Interestingly, these two conditions on the density ratio are related to the Rényi divergence $D_l(q_X \| p_X)$ (Rényi, 1961) between the source and target densities $p_X$ and $q_X$. Specifically, the conditions in Assumptions 4 and 5 are equivalent to requiring $D_\infty(q_X \| p_X)$ and $D_2(q_X \| p_X)$ to be bounded (Cortes et al., 2010), respectively. The following theorem establishes the non-asymptotic error bound for the unweighted estimator $\widehat{f}_D$ under the second moment bounded case.
Theorem 9. Suppose that Assumptions 2, 3, and 5 are satisfied. If the function class $\mathcal{G}$ is a ReLU FNN function class bounded by $B \ge 1$ with the size $\mathcal{S} = O(n^{\frac{d}{d+2\zeta}} \log n)$, the depth $\mathcal{D} = O(\log n)$, and the weights bound $\mathcal{B} = O(Bn^{\frac{d}{d+2\zeta}})$, then we have

$$\mathbb{E}_D\|\widehat{f}_D - f_0\|^2_{L^2(Q_X)} \le O\big(V B^2 n^{-\frac{\zeta}{d+2\zeta}} (\log n)^{\frac{3}{2}}\big).$$

Although the unweighted estimator $\widehat{f}_D$ is still consistent for the true quantile function $f_0$, it is clear that its convergence rate of order $O(n^{-\frac{\zeta}{d+2\zeta}})$ is suboptimal compared to the minimax rate in Stone (1982); Györfi et al. (2002); Tsybakov (2009). Consequently, some additional reweighted adjustments are required to tackle this issue.

4.2 Theoretical Analysis for Reweighted Estimator

In this part, we give the non-asymptotic error bounds for the reweighted estimator using the exact density ratio. The following theorem provides the non-asymptotic error bound for the reweighted estimator $\widehat{f}_{r,D}$ defined in (6) under the case that the density ratio is uniformly bounded.

Theorem 10. Suppose that all the assumptions in Lemma 4 as well as Assumptions 1 and 4 are satisfied. If the function class $\mathcal{J}$ is a ReLU FNN function class bounded by $B \ge 1$, with the size $\mathcal{S} = O(n^{\frac{d}{d+2\zeta}} \log n)$, the depth $\mathcal{D} = O(\log n)$, and the weights bound $\mathcal{B} = O(Bn^{\frac{d}{d+2\zeta}})$, then we have

$$\mathbb{E}_D\|\widehat{f}_{r,D} - f_0\|^2_{L^2(Q_X)} \le O\big(\Gamma^2 B^2 n^{-\frac{2\zeta}{d+2\zeta}} (\log n)^3\big).$$

In Theorem 10, we obtain a convergence rate of order $O(\Gamma^2 B^2 n^{-2\zeta/(d+2\zeta)} (\log n)^3)$, which includes an additional factor $\Gamma$ compared to the rate in Theorem 8. This is due to the influence of the density ratio in the reweighted estimation procedure in (6) and, in fact, the derived rate in Theorem 10 is exactly the same as that in Theorem 8 if $\Gamma$ is a constant.

When the density ratio is unbounded but has a finite second moment, as noted in Section 4.3, we also apply a truncated density ratio. Specifically, let $\xi > 0$ be the pre-specified truncation level; then the truncated density ratio is defined as

$$T_\xi r(x) = \begin{cases} r(x), & \text{if } r(x) \le \xi, \\ \xi, & \text{otherwise}. \end{cases}$$
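As a small illustration of the truncation operator $T_\xi$, the sketch below clips a Gaussian density ratio at a chosen level. The helper names (`gaussian_pdf`, `density_ratio`, `truncate`), the Gaussian parameters, and the level $\xi = 5$ are assumptions made purely for illustration and are not part of the paper's procedure.

```python
import math


def gaussian_pdf(x: float, mean: float, var: float) -> float:
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def density_ratio(x: float, src=(0.0, 0.3), tgt=(1.0, 0.5)) -> float:
    """r(x) = q_X(x) / p_X(x) for hypothetical Gaussian source and target."""
    return gaussian_pdf(x, *tgt) / gaussian_pdf(x, *src)


def truncate(value: float, xi: float) -> float:
    """T_xi applied to a single ratio value: keep it if <= xi, else return xi."""
    return min(value, xi)


xi = 5.0  # illustrative truncation level
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0]
raw = [density_ratio(x) for x in xs]
clipped = [truncate(r, xi) for r in raw]
for x, r, t in zip(xs, raw, clipped):
    print(f"x={x:+.1f}  r(x)={r:12.4f}  T_xi r(x)={t:.4f}")
```

Because the target variance exceeds the source variance in this hypothetical setting, the raw ratio grows without bound in the tails, while the truncated ratio never exceeds $\xi$.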
Thus, the reweighted estimator with a truncated density ratio can be obtained by solving the following optimization task

$$\widehat{f}_{T_\xi r,D} \in \arg\min_{f \in \mathcal{K}} \frac{1}{n}\sum_{i=1}^{n} T_\xi r(X_i)\,\rho_\tau\big(f(X_i) - Y_i\big), \tag{12}$$

where $\mathcal{K}$ refers to a certain hypothesis class of measurable functions. The non-asymptotic error bound for the reweighted estimator $\widehat{f}_{T_\xi r,D}$ is established in the following theorem with an appropriate choice of the truncation level $\xi$.

Theorem 11. Suppose that all the assumptions in Lemma 4 as well as Assumptions 1 and 5 are satisfied. If the function class $\mathcal{K}$ is a ReLU FNN function class bounded by $B \ge 1$, with the size $\mathcal{S} = O(n^{\frac{d}{d+6\zeta}} \log n)$, the depth $\mathcal{D} = O(\log n)$, and the weights bound $\mathcal{B} = O(Bn^{\frac{d}{d+6\zeta}})$, then we have

$$\mathbb{E}_D\Big[\|\widehat{f}_{T_\xi r,D} - f_0\|^2_{L^2(Q_X)}\Big] \le O\big(V^{\frac{4}{3}} B^2 n^{-\frac{2\zeta}{d+6\zeta}} \log n\big).$$

Note that the obtained convergence rate of order $O(n^{-\frac{2\zeta}{d+6\zeta}})$ is suboptimal. In the proof of Theorem 11, we divide the statistical error with the truncated ratio into two terms: the error with the true density ratio and their difference. With an argument similar to that in the proof of Theorem 7 and Markov's inequality, we can show that the upper bounds of the two terms are of orders $O\big(\frac{B\mathcal{S}\mathcal{D}\xi^2 \log(n\mathcal{B}\mathcal{W}\mathcal{D})}{n}\big)$ and $O\big(\frac{B^2 V^2}{\xi}\big)$, respectively. The trade-off between the two upper bounds leads to such a suboptimal rate. However, it can be further improved with extra conditions on the density ratio, as given in the next assumption.

Assumption 6. There exists some constant $\delta > 0$ such that the density ratio $r$ has a finite $(1 + \delta)$-th moment with respect to $P_X$, that is, $U := \mathbb{E}_{X \sim P_X}[r^{1+\delta}(X)] < \infty$.

Note that Assumption 6 is more general than Assumption 5, where the latter assumption is a special case of Assumption 6 when $\delta = 1$. We can then obtain the convergence rate of $\widehat{f}_{T_\xi r,D}$ in the following corollary.

Corollary 12. Suppose that all the assumptions in Lemma 4 as well as Assumptions 1 and 6 are satisfied.
If the function class $\mathcal{K}$ is a ReLU FNN function class bounded by $B \ge 1$, with the size $\mathcal{S} = O(n^{\frac{d}{d+(2+4/\delta)\zeta}} \log n)$, the depth $\mathcal{D} = O(\log n)$, and the weights bound $\mathcal{B} = O(Bn^{\frac{d}{d+(2+4/\delta)\zeta}})$, it follows that

$$\mathbb{E}_D\Big[\|\widehat{f}_{T_\xi r,D} - f_0\|^2_{L^2(Q_X)}\Big] \le O\big(U^{2\delta/(2+\delta)} B^2 n^{-\frac{2\zeta}{d+(2+4/\delta)\zeta}} \log n\big).$$

When $\delta = 1$, corresponding to the second moment bounded case, the convergence rate obtained in Corollary 12 reduces to $O(n^{-\frac{2\zeta}{d+6\zeta}})$, which coincides with that of Theorem 11. When $\delta \to \infty$, the convergence rate approaches the minimax optimal rate of $O(n^{-\frac{2\zeta}{d+2\zeta}})$ in nonparametric regression.

4.3 Theoretical Analysis for Pre-training Reweighted Estimator

In this part, we present the most pivotal result of this paper, focusing on the convergence rate of the pre-training reweighted estimator. The established results are closely aligned with, yet extend beyond, the findings for both the unweighted and reweighted estimators. Furthermore, we concurrently consider the cases where the density ratio is either bounded or unbounded but with certain bounded moments.

For the uniformly bounded case, as described in Algorithm 1, obtaining the pre-training reweighted estimator first requires a density ratio estimator, which is obtained by least-squares density ratio fitting. This density ratio estimator is subsequently incorporated into the empirical reweighted risk to derive the pre-training reweighted estimator. To reach this result, we need to introduce some additional conditions on the density ratio, as outlined in Assumptions 7 and 8. To proceed, we first establish the non-asymptotic error bound of the density ratio estimator in Theorem 13. Subsequently, we derive the non-asymptotic error bound of the pre-training reweighted estimator in Theorem 14.

Assumption 7. The density ratio $r$ is Hölder continuous, that is, $r \in \mathcal{H}^\alpha([0, 1]^d, \Gamma)$ for some positive $\alpha$.
Assumption 7 is a standard condition restricting the underlying function space of the density ratio $r$, which is important to establish the estimation consistency of the pre-training weight $\widehat{r}_S$.

Theorem 13. Suppose that Assumptions 1, 4 and 7 are satisfied. If the function class $\mathcal{U}$ is a ReLU FNN function space bounded by $\Gamma$, with the size $\mathcal{S} = O(m^{\frac{d}{d+2\alpha}} \log m)$, the depth $\mathcal{D} = O(\log m)$, and the weights bound $\mathcal{B} = O(\Gamma m^{\frac{d}{d+2\alpha}})$, then we have

$$\mathbb{E}_S\|\widehat{r}_S - r\|^2_{L^2(P_X)} \le O\big(\Gamma^3 m^{-\frac{2\alpha}{d+2\alpha}} (\log m)^3\big).$$

Theorem 13 establishes a non-asymptotic error bound for the density ratio estimator $\widehat{r}_S$ under the uniformly bounded case. Ignoring the constants and log terms, the obtained convergence rate becomes $O(m^{-\frac{2\alpha}{d+2\alpha}})$, which concurs with the minimax optimal rate in the literature of nonparametric estimation.

With the obtained convergence rate of $\widehat{r}_S$ and the following lower bounded assumption, we are now ready to give the non-asymptotic error bound of the final pre-training reweighted estimator $\widehat{f}_{\widehat{r}_S,D}$ defined in (9).

Assumption 8. The density ratio $r$ is bounded away from zero, that is, $\Upsilon := \inf_{x \in \mathcal{X}} r(x) > 0$.

Assumption 8 further requires that the density ratio is uniformly lower bounded, which excludes some extreme cases.

Theorem 14. Suppose that Assumptions 1-4, 7 and 8 are satisfied. If the function class $\mathcal{M}$ is a ReLU FNN function class bounded by $B \ge 1$, with the size $\mathcal{S} = O(n^{\frac{d}{d+2\zeta}} \log n)$, the depth $\mathcal{D} = O(\log n)$, and the weights bound $\mathcal{B} = O(Bn^{\frac{d}{d+2\zeta}})$, we have

$$\mathbb{E}_{S,D}\|\widehat{f}_{\widehat{r}_S,D} - f_0\|^2_{L^2(Q_X)} \le O\big(B^2\Gamma^2 n^{-\frac{2\zeta}{2\zeta+d}} (\log n)^3\big) + O\Big(\frac{B^2\Gamma^3}{\Upsilon}\, m^{-\frac{2\alpha}{d+2\alpha}} (\log m)^3\Big).$$

Moreover, if $m \ge \Omega\big(\Gamma^{\frac{\zeta(d+2\alpha)}{\alpha(d+2\zeta)}}\big)$, we have

$$\mathbb{E}_{S,D}\|\widehat{f}_{\widehat{r}_S,D} - f_0\|^2_{L^2(Q_X)} \le O\big(B^2\Gamma^2 n^{-\frac{2\zeta}{d+2\zeta}} (\log n)^3\big).$$

Theorem 14 establishes the convergence rates of the pre-training reweighted estimator $\widehat{f}_{\widehat{r}_S,D}$ under the uniformly bounded case, which consists of two terms representing the estimation accuracies of the underlying function and the density ratio, respectively.
Clearly, Theorem 14 provides valuable theoretical insights for determining an appropriate pre-training sample size $m$, namely $m$ of order $\Gamma^{\frac{\zeta(d+2\alpha)}{\alpha(d+2\zeta)}}$, for which the obtained rate is of the same order as that of the reweighted estimator in Theorem 10. This further illustrates the effectiveness of the combination of pre-training and reweighted techniques in our proposed method.

When the density ratio is unbounded, the following assumption is needed to establish the sharp rate for the truncated density ratio estimator.

Assumption 9. There exists some constant $\delta > 0$ such that the density ratio $r$ has a finite $(2 + \delta)$-th moment with respect to $P_X$, that is, $\Xi := \mathbb{E}_{X \sim P_X}[r^{2+\delta}(X)] < \infty$.

Although Assumption 9 is slightly stronger than Assumption 5, it is necessary to provide a sharp bound for the difference between the true density ratio $r$ and its truncated version $T_\xi r$ in terms of $L^2(P_X)$, denoted as $\|T_\xi r - r\|^2_{L^2(P_X)}$. More details regarding this issue are deferred to Appendix A.4. Now we are ready to establish the convergence rate of the truncated density ratio estimator under Assumption 9.

Theorem 15. Suppose that Assumptions 1 and 9 are satisfied, and $T_\xi r \in \mathcal{H}^\alpha([0, 1]^d, \xi)$ holds for some $\alpha > 0$. If the function space $\mathcal{U}$ is a ReLU FNN function class bounded by $\xi$, with the size $\mathcal{S} = O(m^{\frac{d}{d+(2+6/\delta)\alpha}})$, the depth $\mathcal{D} = O(\log m)$, the weights bound $\mathcal{B} = O(m^{\frac{\delta d+2\alpha}{\delta d+(6+2\delta)\alpha}})$ for the ReLU DNNs, and the truncation level $\xi = O(m^{\frac{2\alpha}{\delta d+(6+2\delta)\alpha}})$, then we have

$$\mathbb{E}_S\Big[\|\widehat{r}_{\xi,S} - r\|^2_{L^2(P_X)}\Big] \le O\big(m^{-\frac{2\alpha}{d+(2+6/\delta)\alpha}} (\log m)^2\big).$$

Theorem 15 ensures that under some technical assumptions, the truncated density ratio estimator can achieve a sharp rate of convergence, even in the case of unbounded density ratios. In particular, as $\delta \to \infty$, this convergence rate approaches the minimax optimal nonparametric rate of $O(m^{-\frac{2\alpha}{d+2\alpha}})$.

Remark 16.
If we further strengthen Assumption 9 such that the square of the density ratio is sub-exponential with respect to $P_X$, i.e., $\mathbb{E}_{X \sim P_X}[\exp(\sigma r^2(X))] < \infty$, where $\sigma$ is some positive constant, the convergence rate of $\widehat{r}_{\xi,S}$ in Theorem 15 can be improved to be minimax optimal, which corresponds to the case where $\delta \to \infty$. It is worth noting that under the nonparametric regression setting, the finite $(2 + \delta)$-th moment condition and the sub-exponential tail condition usually lead to different non-asymptotic error bounds (Han and Wellner, 2019; Schmidt-Hieber, 2020; Farrell et al., 2021).

Finally, we obtain the convergence rate for the pre-training reweighted estimator $\widehat{f}_{\widehat{r}_{\xi,S},D}$ in the following theorem.

Theorem 17. Suppose that Assumptions 1-3, 8 and 9 are satisfied. If the function space $\mathcal{M}$ is a ReLU FNN function class bounded by $B \ge 1$, with the size $\mathcal{S} = O(n^{\frac{d}{d+(2+4/\delta)\zeta}})$, the depth $\mathcal{D} = O(\log n)$, the weights bound $\mathcal{B} = O(Bn^{\frac{d}{d+(2+4/\delta)\zeta}})$, and the truncation level $\xi = O(n^{\frac{2\zeta}{\delta d+(4+2\delta)\zeta}})$, then we have

$$\mathbb{E}_{S,D}\|\widehat{f}_{\widehat{r}_{\xi,S},D} - f_0\|^2_{L^2(Q_X)} \lesssim B^2 n^{-\frac{2\zeta}{d+(2+4/\delta)\zeta}} (\log n)^2 + B^2 m^{-\frac{2\alpha}{d+(2+6/\delta)\alpha}} (\log m)^2.$$

Moreover, if $m \ge \Omega\big(n^{\frac{[\delta d+(6+2\delta)\alpha]\zeta}{[\delta d+(4+2\delta)\zeta]\alpha}}\big)$, we obtain that

$$\mathbb{E}_{S,D}\|\widehat{f}_{\widehat{r}_{\xi,S},D} - f_0\|^2_{L^2(Q_X)} \le O\big(B^2 n^{-\frac{2\zeta}{d+(2+4/\delta)\zeta}} (\log n)^2\big).$$

Theorem 17 provides a non-asymptotic error bound for $\widehat{f}_{\widehat{r}_{\xi,S},D}$, which approaches the minimax optimal rate in the field of nonparametric regression when $\delta \to \infty$. As shown in Corollary 12, some sharp upper bounds can be similarly obtained if the density ratio $r$ is known in advance under a weaker condition than Assumption 5 commonly used in the literature. As discussed in Remark 16, if we further assume that $r^2$ is sub-exponential, these bounds can achieve the minimax optimal rate. Notably, this result is particularly appealing as it is established by considering the pre-training density ratio estimator. To the best of our knowledge, Theorem 17 is the first result on the problem of covariate shift in the context of nonparametric quantile regression.
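The way the moment parameter $\delta$ enters the rates of Corollary 12, Theorem 15 and Theorem 17 can be checked numerically. The sketch below evaluates the exponents stated in those results; the function names and the example values of $\zeta$, $\alpha$ and $d$ are hypothetical, and constants and logarithmic factors are ignored.

```python
def minimax_exponent(zeta: float, d: int) -> float:
    """Exponent a in the minimax rate n^{-a}, with a = 2*zeta / (d + 2*zeta)."""
    return 2.0 * zeta / (d + 2.0 * zeta)


def reweighted_exponent(zeta: float, d: int, delta: float) -> float:
    """Exponent in Corollary 12 / Theorem 17: 2*zeta / (d + (2 + 4/delta)*zeta)."""
    return 2.0 * zeta / (d + (2.0 + 4.0 / delta) * zeta)


def ratio_exponent(alpha: float, d: int, delta: float) -> float:
    """Exponent in Theorem 15: 2*alpha / (d + (2 + 6/delta)*alpha)."""
    return 2.0 * alpha / (d + (2.0 + 6.0 / delta) * alpha)


def pretraining_size_exponent(zeta: float, alpha: float, d: int, delta: float) -> float:
    """Exponent e in the Theorem 17 requirement that m grows at least like n^e."""
    num = (delta * d + (6.0 + 2.0 * delta) * alpha) * zeta
    den = (delta * d + (4.0 + 2.0 * delta) * zeta) * alpha
    return num / den


# Hypothetical smoothness levels and dimension, for illustration only.
zeta, alpha, d = 2.0, 2.0, 3
for delta in (1.0, 4.0, 1e6):
    e = pretraining_size_exponent(zeta, alpha, d, delta)
    print(f"delta={delta:g}: n-exponent={reweighted_exponent(zeta, d, delta):.4f}, "
          f"m-exponent={ratio_exponent(alpha, d, delta):.4f}, required e={e:.4f}")
print(f"minimax n-exponent={minimax_exponent(zeta, d):.4f}")
```

With $m = n^{e}$, the density ratio term $m^{-2\alpha/(d+(2+6/\delta)\alpha)}$ matches the regression term $n^{-2\zeta/(d+(2+4/\delta)\zeta)}$, which is the balance behind the sample size condition in Theorem 17.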
5 Numerical Experiments

In this section, we evaluate the numerical performance of the pre-training reweighted estimators (PWDQR) in (9) and (11) for both bounded and unbounded ratios. We compare them to some state-of-the-art methods, including the reweighted estimator (WDQR) as defined in (6) and (12) for the bounded and unbounded ratios, respectively, and the unweighted estimator (DQR) as defined in (5). The implementation details of all the considered methods are provided below.

DQR: we implement it in PyTorch using stochastic gradient descent (SGD) (Bottou, 2012) with Nesterov momentum of 0.9 and an initial learning rate of 0.1 with rate decay 0.5. We consider a fixed-width neural network consisting of ReLU activated multilayer perceptrons with three hidden layers.

WDQR: the implementation details of WDQR are almost identical to those of DQR, except for using the weighted quantile loss function instead of the unweighted one, where the true density ratio $r$ is plugged in.

PWDQR: the implementation details of PWDQR are exactly the same as those of WDQR, except that we replace the true density ratio with the (truncated) pre-trained density ratio $\widehat{r}_S$. For the estimation of $\widehat{r}_S$, we solve (8) by a neural network using PyTorch, which consists of ReLU activated multilayer perceptrons with two hidden layers. The optimization algorithm is Adam (Kingma and Ba, 2017) with a learning rate of $10^{-4}$. For the truncated pre-trained density ratio $\widehat{r}_{\xi,S}$, we additionally use an activation function $\sigma_\xi(x) = \min(x, \xi)$ for the last layer of the neural network.

We consider two generating scenarios including the univariate and multivariate cases, and two covariate shift settings with bounded and unbounded density ratios, respectively.
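The loss underlying the three implementations above can be sketched in a few lines of plain Python. This is not the authors' PyTorch code: the names `pinball_loss` and `weighted_quantile_risk`, the residual convention $u = y - f(x)$, and the toy data are assumptions made for illustration. Setting all weights to one gives the unweighted (DQR) objective, while plugging in density ratio values gives the weighted (WDQR or PWDQR) objective.

```python
from typing import Sequence


def pinball_loss(u: float, tau: float) -> float:
    """Quantile (check) loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def weighted_quantile_risk(preds: Sequence[float], ys: Sequence[float],
                           weights: Sequence[float], tau: float) -> float:
    """Empirical risk (1/n) * sum_i w_i * rho_tau(y_i - pred_i)."""
    n = len(ys)
    return sum(w * pinball_loss(y - p, tau)
               for p, y, w in zip(preds, ys, weights)) / n


# Toy responses; we search over constant predictors to keep the example tiny.
ys = [1.0, 2.0, 3.0, 4.0, 100.0]


def best_constant(weights: Sequence[float], tau: float) -> float:
    """Constant predictor (chosen among observed responses) with the smallest risk."""
    return min(ys, key=lambda c: weighted_quantile_risk([c] * len(ys), ys, weights, tau))


unweighted_fit = best_constant([1.0] * 5, tau=0.5)               # sample median
reweighted_fit = best_constant([1.0, 1.0, 1.0, 1.0, 10.0], tau=0.5)  # weighted median
print(unweighted_fit, reweighted_fit)
```

The second call shows how a large importance weight on one observation pulls the fitted quantile toward the region that the weights emphasize.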
In this simulation study, we only use the truncated ratio for the unbounded case in WDQR and PWDQR, and the truncation levels $\xi$ for WDQR and PWDQR are suggested to be $c_1(n/\log n)^{1/3}$ and $c_2 \log m$ for some constants $c_1, c_2 > 0$ as indicated in the theorems, respectively. For each simulated scenario, we generate the training data $\{X_i^{tr}, Y_i^{tr}\}_{i=1}^{n_{tr}}$ with sample size $n_{tr}$ from the source distribution to train the three nonparametric quantile regression models at five quantile levels $\tau \in \{0.05, 0.25, 0.5, 0.75, 0.95\}$. To evaluate each model, we generate the target data $\{X_i^{ta}, Y_i^{ta}\}_{i=1}^{n_{ta}}$ with sample size $n_{ta}$ from the target distribution. For notational simplicity, we denote $\widehat{f}^{\tau}_{n_{tr}}$ and $f^{\tau}_0$ as the estimated and true quantile functions at the specific quantile level $\tau \in (0, 1)$, respectively. We evaluate the performance of these methods based on two norms between $\widehat{f}^{\tau}_{n_{tr}}$ and $f^{\tau}_0$ as given by

$$L_1: \quad \|\widehat{f}^{\tau}_{n_{tr}} - f^{\tau}_0\|_{L^1(\nu)} = \frac{1}{n_{ta}}\sum_{i=1}^{n_{ta}} \big|\widehat{f}^{\tau}_{n_{tr}}(X_i^{ta}) - f^{\tau}_0(X_i^{ta})\big|,$$

$$L_2: \quad \|\widehat{f}^{\tau}_{n_{tr}} - f^{\tau}_0\|_{L^2(\nu)} = \Big(\frac{1}{n_{ta}}\sum_{i=1}^{n_{ta}} \big|\widehat{f}^{\tau}_{n_{tr}}(X_i^{ta}) - f^{\tau}_0(X_i^{ta})\big|^2\Big)^{1/2}.$$

To estimate the pre-training density ratio, we also independently generate extra training data $\{\widetilde{X}_i^{tr}, \widetilde{Y}_i^{tr}\}_{i=1}^{m}$ and target data $\{\widetilde{X}_i^{ta}, \widetilde{Y}_i^{ta}\}_{i=1}^{m}$ with the same sample size $m$. In our study, we fix $n_{ta} = 10000$ and $m = 1000$, and we report the averaged $L_1$ and the square of $L_2$ distances together with their corresponding standard errors over 100 independent repetitions under different scenarios.

Figure 1: The quantile curves of model (13) for the bounded case (a) and unbounded case (b), where the training and target data are represented as blue and orange dots, respectively, and the target quantile functions at levels $\tau = 0.05$ (blue line), $\tau = 0.25$ (orange line), $\tau = 0.5$ (green line), $\tau = 0.75$ (red line), $\tau = 0.95$ (purple line) are plotted as curves.
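The evaluation metrics and the suggested truncation levels described above are simple to compute. The following sketch uses hypothetical helper names (`l1_error`, `l2_squared_error`, `truncation_levels`) and sets the unspecified constants $c_1$ and $c_2$ to 1 purely as placeholders; it is not the authors' evaluation script.

```python
import math
from typing import Sequence, Tuple


def l1_error(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Empirical L1 distance: mean absolute difference over the target sample."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)


def l2_squared_error(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Squared empirical L2 distance: mean squared difference over the target sample."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)


def truncation_levels(n: int, m: int, c1: float = 1.0, c2: float = 1.0) -> Tuple[float, float]:
    """Suggested levels: c1 * (n / log n)^(1/3) for WDQR and c2 * log m for PWDQR."""
    return c1 * (n / math.log(n)) ** (1.0 / 3.0), c2 * math.log(m)


# Toy predictions against toy true quantile values.
pred = [1.0, 2.0, 3.5]
truth = [0.0, 4.0, 3.5]
print(l1_error(pred, truth), l2_squared_error(pred, truth))
print(truncation_levels(n=512, m=1000))
```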
5.1 Univariate Model We generate the data from the following univariate model X6 + σε, (13) where ε N(0, 1) and σ = 0.05. Here, the true quantile function at quantile level τ is fτ 0 (X) = e 1 X6 + σΦ 1(τ), where Φ( ) is the cumulative distribution function of the standard normal random error ε. The source and target covariates are drawn from the normal distributions with mean µ1 and variance σ2 1, and mean µ2 and variance σ2 2, respectively. According to Cortes et al. (2010) and Feng et al. (2023), we can show that the density ratio r( ) is uniformly bounded if and only if σ2 1 σ2 2, and second moment bounded if and only if σ2 2 σ2 1 σ2 2/2. Consequently, we choose µ1 = 0, σ2 1 = 0.4, µ2 = 0.5, σ2 2 = 0.3 for the uniform bounded case and µ1 = 0, σ2 1 = 0.3, µ2 = 1, σ2 2 = 0.5 for the second moment bounded case, respectively. Figure 1 shows the univariate data generation model in the bounded case and unbounded case and their corresponding conditional quantile curves at τ = 0.05, 0.25, 0.5, 0.75, 0.95. Performance of different estimators based on L1 and L2 2 of prediction errors are summarized in Tables 1-2 for those cases with bounded and unbounded ratios, respectively. From the results in Table 1, we observe that the prediction errors of the estimator DQR are very close to those of the estimator WDQR in all scenarios, aligning with our theoretical findings that the weighted and unweighted estimators can both achieve the minimax optimal rate for the uniformly bounded case. As the sample size n increases, the performance of PWDQR tends to coincide with that of WDQR, which is expected since the pre-trained weight function converges to its true counterpart when n is large. For the case with unbounded ratios, as shown in Table 2, both WDQR and PWDQR significantly outperform DQR in all scenarios. 
This phenomenon is consistent with our theoretical findings. Specifically, the unweighted estimator is sub-optimal when the importance ratio is second moment bounded. In contrast, both weighted and pre-trained weighted estimators can achieve the optimal rates. These results further validate the necessity and effectiveness of the reweighted procedure and our pre-training algorithm under covariate shift.

Table 1: Averaged $L_1$ and $L_2^2$ errors ($\times 10^{-1}$) based on testing data with the corresponding standard deviations in brackets for DQR, WDQR and PWDQR for Model (13) with the bounded density ratio.

| $\tau$ | Method | $L_1$ ($n = 512$) | $L_2^2$ ($n = 512$) | $L_1$ ($n = 2048$) | $L_2^2$ ($n = 2048$) |
|---|---|---|---|---|---|
| 0.05 | DQR | 0.545(0.219) | 0.086(0.126) | 0.264(0.065) | 0.021(0.009) |
| | WDQR | 0.541(0.332) | 0.084(0.118) | 0.259(0.107) | 0.020(0.019) |
| | PWDQR | 0.555(0.243) | 0.087(0.110) | 0.270(0.088) | 0.025(0.044) |
| 0.25 | DQR | 0.249(0.108) | 0.023(0.027) | 0.140(0.035) | 0.006(0.003) |
| | WDQR | 0.249(0.088) | 0.022(0.016) | 0.138(0.049) | 0.006(0.004) |
| | PWDQR | 0.255(0.173) | 0.031(0.091) | 0.146(0.012) | 0.007(0.004) |
| 0.5 | DQR | 0.213(0.081) | 0.019(0.012) | 0.129(0.028) | 0.004(0.002) |
| | WDQR | 0.211(0.104) | 0.017(0.014) | 0.128(0.039) | 0.004(0.003) |
| | PWDQR | 0.224(0.179) | 0.025(0.037) | 0.131(0.039) | 0.005(0.003) |
| 0.75 | DQR | 0.245(0.084) | 0.023(0.012) | 0.145(0.036) | 0.004(0.002) |
| | WDQR | 0.241(0.179) | 0.021(0.034) | 0.145(0.049) | 0.005(0.003) |
| | PWDQR | 0.269(0.288) | 0.029(0.067) | 0.147(0.052) | 0.006(0.004) |
| 0.95 | DQR | 0.634(0.354) | 0.106(0.321) | 0.275(0.054) | 0.014(0.005) |
| | WDQR | 0.627(0.626) | 0.103(0.423) | 0.274(0.088) | 0.015(0.007) |
| | PWDQR | 0.684(0.697) | 0.126(0.589) | 0.276(0.094) | 0.016(0.009) |

Table 2: Averaged $L_1$ and $L_2^2$ errors ($\times 10^{-1}$) based on testing data with the corresponding standard deviations in brackets for DQR, WDQR and PWDQR for Model (13) with the unbounded density ratio.

| $\tau$ | Method | $L_1$ ($n = 512$) | $L_2^2$ ($n = 512$) | $L_1$ ($n = 2048$) | $L_2^2$ ($n = 2048$) |
|---|---|---|---|---|---|
| 0.05 | DQR | 1.420(0.738) | 0.610(0.689) | 0.678(0.634) | 0.185(0.062) |
| | WDQR | 1.192(0.561) | 0.369(0.450) | 0.618(0.237) | 0.100(0.107) |
| | PWDQR | 1.276(0.624) | 0.377(0.482) | 0.620(0.198) | 0.103(0.064) |
| 0.25 | DQR | 1.018(0.446) | 0.443(0.374) | 0.475(0.185) | 0.096(0.085) |
| | WDQR | 0.731(0.489) | 0.237(0.428) | 0.408(0.138) | 0.066(0.061) |
| | PWDQR | 0.809(0.321) | 0.299(0.455) | 0.428(0.157) | 0.068(0.061) |
| 0.5 | DQR | 0.912(0.396) | 0.427(0.361) | 0.402(0.185) | 0.095(0.112) |
| | WDQR | 0.779(0.505) | 0.298(0.464) | 0.374(0.143) | 0.066(0.064) |
| | PWDQR | 0.783(0.336) | 0.306(0.445) | 0.396(0.162) | 0.067(0.065) |
| 0.75 | DQR | 0.928(0.398) | 0.455(0.384) | 0.520(0.207) | 0.101(0.123) |
| | WDQR | 0.762(0.534) | 0.330(0.499) | 0.410(0.161) | 0.072(0.049) |
| | PWDQR | 0.798(0.405) | 0.345(0.460) | 0.443(0.180) | 0.073(0.070) |
| 0.95 | DQR | 1.527(0.591) | 0.672(0.542) | 0.714(0.208) | 0.154(0.077) |
| | WDQR | 1.389(0.814) | 0.501(0.599) | 0.594(0.283) | 0.116(0.096) |
| | PWDQR | 1.427(0.545) | 0.479(0.542) | 0.634(0.243) | 0.132(0.162) |

Table 3: Averaged $L_1$ and $L_2^2$ errors based on testing data with the corresponding standard deviations in brackets for DQR, WDQR and PWDQR for model (14) under the case with bounded density ratio.
| $\tau$ | Method | $L_1$ ($n = 512$) | $L_2^2$ ($n = 512$) | $L_1$ ($n = 2048$) | $L_2^2$ ($n = 2048$) |
|---|---|---|---|---|---|
| 0.05 | DQR | 0.348(0.064) | 0.156(0.057) | 0.215(0.031) | 0.059(0.017) |
| | WDQR | 0.348(0.074) | 0.154(0.062) | 0.213(0.023) | 0.058(0.011) |
| | PWDQR | 0.350(0.069) | 0.158(0.063) | 0.219(0.024) | 0.063(0.012) |
| 0.25 | DQR | 0.149(0.031) | 0.038(0.017) | 0.103(0.017) | 0.020(0.005) |
| | WDQR | 0.150(0.054) | 0.037(0.015) | 0.102(0.012) | 0.019(0.004) |
| | PWDQR | 0.151(0.030) | 0.041(0.015) | 0.104(0.013) | 0.021(0.004) |
| 0.5 | DQR | 0.114(0.013) | 0.028(0.006) | 0.078(0.007) | 0.014(0.003) |
| | WDQR | 0.112(0.017) | 0.027(0.007) | 0.076(0.005) | 0.013(0.002) |
| | PWDQR | 0.120(0.015) | 0.027(0.006) | 0.077(0.005) | 0.014(0.003) |
| 0.75 | DQR | 0.145(0.028) | 0.038(0.012) | 0.102(0.016) | 0.019(0.005) |
| | WDQR | 0.144(0.032) | 0.036(0.010) | 0.101(0.013) | 0.019(0.004) |
| | PWDQR | 0.149(0.025) | 0.039(0.010) | 0.103(0.012) | 0.019(0.004) |
| 0.95 | DQR | 0.320(0.052) | 0.141(0.025) | 0.215(0.027) | 0.059(0.007) |
| | WDQR | 0.318(0.060) | 0.139(0.050) | 0.213(0.023) | 0.059(0.010) |
| | PWDQR | 0.321(0.054) | 0.145(0.054) | 0.214(0.024) | 0.060(0.010) |

5.2 Multivariate Model

In this section, we consider the following additive multivariate model

$$Y = \sin(2\pi X_1) + 0.5e^{X_2} + 1.5\,|(X_3 - 0.4)(X_3 - 0.6)| + \sigma X_2\varepsilon, \tag{14}$$

where $\varepsilon \sim t(3)$ and $\sigma = 0.1$. Here, the true quantile function at quantile level $\tau$ is

$$f^{\tau}_0(X) = \sin(2\pi X_1) + 0.5e^{X_2} + 1.5\,|(X_3 - 0.4)(X_3 - 0.6)| + \sigma X_2 F_t^{-1}(\tau, 3),$$

where $F_t^{-1}(\cdot, 3)$ is the quantile function (the inverse of the cumulative distribution function) of the Student's $t$ random error $\varepsilon$ with 3 degrees of freedom. We assume that the three covariates $X_1$, $X_2$ and $X_3$ are independent, and $X_2$, $X_3$ are generated from the uniform distribution on $[0, 1]$ for both source and target distributions. The source and target data of $X_1$ are drawn from Beta distributions with parameters $(\alpha_1, \beta_1)$ and $(\alpha_2, \beta_2)$, respectively. It is easy to verify that the importance ratio $r(X)$ is uniformly bounded if and only if $\alpha_2 \ge \alpha_1$ and $\beta_2 \ge \beta_1$, and second moment bounded (but not uniformly bounded) if and only if $\alpha_2 < \alpha_1$, $2\alpha_2 > \alpha_1$, $2\beta_2 > \beta_1$ or $\beta_2 < \beta_1$, $2\alpha_2 > \alpha_1$, $2\beta_2 > \beta_1$.
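The conditions on the Beta density ratio can be verified directly, since $r(x) \propto x^{\alpha_2 - \alpha_1}(1 - x)^{\beta_2 - \beta_1}$ on $(0, 1)$ and $\mathbb{E}_{X \sim P_X}[r^2(X)]$ has a closed form in terms of Beta functions. The sketch below is an illustration under these standard facts; the function names are hypothetical, and the two parameter settings are the ones used for the bounded and second moment bounded cases in this study.

```python
import math


def log_beta(a: float, b: float) -> float:
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def ratio_is_bounded(a1: float, b1: float, a2: float, b2: float) -> bool:
    """r is bounded on (0, 1) iff both exponents a2 - a1 and b2 - b1 are nonnegative."""
    return a2 >= a1 and b2 >= b1


def second_moment(a1: float, b1: float, a2: float, b2: float) -> float:
    """E_P[r^2] = B(2*a2 - a1, 2*b2 - b1) * B(a1, b1) / B(a2, b2)^2, or inf if divergent."""
    if not (2 * a2 > a1 and 2 * b2 > b1):
        return math.inf
    return math.exp(log_beta(2 * a2 - a1, 2 * b2 - b1)
                    + log_beta(a1, b1) - 2.0 * log_beta(a2, b2))


# (alpha1, beta1, alpha2, beta2) for the two simulation settings.
bounded_case = (2.5, 1.5, 3.0, 4.0)
unbounded_case = (4.0, 1.0, 3.0, 6.0)

for name, params in (("bounded", bounded_case), ("unbounded", unbounded_case)):
    print(name, ratio_is_bounded(*params), second_moment(*params))
```

In the second setting the ratio is unbounded near $x = 0$ because $\alpha_2 < \alpha_1$, yet its second moment under the source distribution remains finite.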
In our study, we choose $\alpha_1 = 2.5$, $\beta_1 = 1.5$, $\alpha_2 = 3$, $\beta_2 = 4$ for the uniformly bounded case and $\alpha_1 = 4$, $\beta_1 = 1$, $\alpha_2 = 3$, $\beta_2 = 6$ for the second moment bounded case, respectively. Results on the performance of different estimators are summarized in Tables 3-4 for the cases with bounded and unbounded ratios, respectively. As indicated in Tables 3-4, the conclusions for the multivariate model are very similar to those of the univariate model. The errors of the three estimators are very close for the uniformly bounded case, and both WDQR and PWDQR perform better than DQR for the second moment bounded case.

Table 4: Averaged $L_1$ and $L_2^2$ errors based on testing data with the corresponding standard deviations in brackets for DQR, WDQR and PWDQR for model (14) under the case with unbounded density ratio.

| τ | Method | $L_1$ (n = 512) | $L_2^2$ (n = 512) | $L_1$ (n = 2048) | $L_2^2$ (n = 2048) |
|---|---|---|---|---|---|
| 0.05 | DQR | 0.390(0.093) | 0.217(0.098) | 0.290(0.037) | 0.124(0.029) |
| | WDQR | 0.345(0.165) | 0.187(0.242) | 0.265(0.038) | 0.094(0.024) |
| | PWDQR | 0.363(0.362) | 0.195(0.262) | 0.270(0.049) | 0.099(0.045) |
| 0.25 | DQR | 0.267(0.043) | 0.143(0.055) | 0.216(0.037) | 0.108(0.044) |
| | WDQR | 0.246(0.064) | 0.109(0.060) | 0.161(0.020) | 0.051(0.014) |
| | PWDQR | 0.254(0.134) | 0.118(0.152) | 0.165(0.027) | 0.053(0.018) |
| 0.5 | DQR | 0.245(0.050) | 0.147(0.072) | 0.209(0.047) | 0.122(0.057) |
| | WDQR | 0.209(0.047) | 0.095(0.040) | 0.146(0.021) | 0.054(0.019) |
| | PWDQR | 0.219(0.071) | 0.100(0.064) | 0.150(0.029) | 0.055(0.023) |
| 0.75 | DQR | 0.272(0.070) | 0.174(0.089) | 0.239(0.063) | 0.150(0.070) |
| | WDQR | 0.254(0.053) | 0.129(0.055) | 0.173(0.034) | 0.069(0.025) |
| | PWDQR | 0.263(0.072) | 0.132(0.075) | 0.182(0.039) | 0.072(0.031) |
| 0.95 | DQR | 0.426(0.107) | 0.302(0.138) | 0.369(0.083) | 0.243(0.101) |
| | WDQR | 0.386(0.127) | 0.274(0.167) | 0.295(0.053) | 0.135(0.043) |
| | PWDQR | 0.397(0.136) | 0.285(0.178) | 0.308(0.057) | 0.142(0.054) |

6 Conclusion and Discussion

In this work, we leverage deep nonparametric quantile regression as the foundational framework to systematically explore the phenomenon of covariate shift within quantile regression. We propose a two-stage method, leading to the development of a pre-training reweighted estimator for the target quantile function. In our theoretical analysis, we present rigorous non-asymptotic error bounds for the unweighted, reweighted, and pre-training reweighted estimators. This analysis simultaneously considers scenarios in which the density ratio is either bounded or unbounded under moment conditions weaker than those commonly considered in the existing literature. Importantly, we introduce a novel approach to bound the generalization error, simplify the tools of local Rademacher complexities, and derive an approximation error bound with a weight bound for ReLU neural networks approximating the Hölder continuous function class. Furthermore, our theoretical insights provide prior guidance on selecting an appropriate sample size for the pre-training strategy. These results underscore the importance of the pre-training and reweighting techniques in mitigating the challenges posed by covariate shift.
Several directions for future work are worth exploring. One is to further investigate the impact of the lower and upper bounds of the density ratio on the convergence rates. In addition, generative adversarial networks and diffusion models have both recently been proven effective in generating high-quality samples from complex distributions, and they may provide promising alternatives. Another direction is therefore to integrate generative models into the proposed framework, either by using them to model the conditional distribution directly or by using the generative model as a pre-processing step.

Acknowledgments and Disclosure of Funding

We would like to express our sincere gratitude to the editor, the action editor, and two anonymous reviewers for their constructive comments, which have greatly contributed to the significant improvement of the manuscript. This research was partly supported by the National Natural Science Foundation of China grant (12371441, 12371270, 12001356, U24A2002), Shanghai Science and Technology Development Funds (23JC1402100), Natural Science Foundation of Shanghai (24ZR1421400), Shanghai Research Center for Data Science and Decision Technology, and the Fundamental Research Funds for the Central Universities.

Appendix A. Proof of the Main Results

In Section A.1, we first prove the error decomposition, which comprises the approximation error and the statistical error, and provide tight estimates of each error bound in Sections A.1.1 and A.1.2, respectively. Based on the error decomposition, in Section A.2, we provide the proofs of the non-asymptotic error bounds for the unweighted estimator under the uniformly bounded case and the second moment bounded case in Sections A.2.1 and A.2.2, respectively.
In Section A.3, with a slightly modified error decomposition in Lemma 21 and estimates for the statistical error term in Lemma 22, we derive the non-asymptotic error bounds for the reweighted estimator under the uniformly bounded case and the second moment bounded case in Sections A.3.1 and A.3.2, respectively. In Section A.4, we provide the proofs of the non-asymptotic error bounds for the pre-training density ratio and the pre-training reweighted estimator under the uniformly bounded case and exponential moment bounded case in Sections A.4.1 and A.4.2, respectively.

A.1 Proof of the Error Terms

In this section, we give the proof of Lemma 4, which gives the error decomposition of the $L^2(P_X)$ norm of the difference between the true quantile function $f_0$ and its estimate $\hat f_D$.

Proof. Recall Knight's identity (Belloni and Chernozhukov, 2011): for any $u, v \in \mathbb{R}$, there holds

$$\rho_\tau(u - v) - \rho_\tau(u) = -v\,(\tau - I\{u \le 0\}) + \int_0^{v} \big(I\{u \le z\} - I\{u \le 0\}\big)\,dz.$$

Then, for any $f \in \mathcal{G}$, by taking $u = Y - f^*(X)$ and $v = f(X) - f^*(X)$, we have

$$\begin{aligned}
\rho_\tau(Y - f(X)) - \rho_\tau(Y - f^*(X))
&= -\big(f(X) - f^*(X)\big)\big(\tau - I\{Y \le f^*(X)\}\big) \\
&\quad + \int_0^{f(X) - f^*(X)} \Big[I\{Y \le f^*(X) + z\} - I\{Y \le f^*(X)\}\Big]\,dz \\
&= -\big(f(X) - f^*(X)\big)\big(\tau - I\{Y \le f_0(X)\}\big) \\
&\quad - \big(f(X) - f^*(X)\big)\big(I\{Y \le f_0(X)\} - I\{Y \le f^*(X)\}\big) \\
&\quad + \int_0^{f(X) - f^*(X)} \Big[I\{Y \le f^*(X) + z\} - I\{Y \le f^*(X)\}\Big]\,dz.
\end{aligned}$$
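Knight's identity is easy to sanity-check numerically. The sketch below is an illustration only: the function names are hypothetical, the check loss is taken as $\rho_\tau(u) = u(\tau - I\{u < 0\})$, and the integral is approximated with a midpoint rule, so agreement holds only up to the step size.

```python
import random


def check_loss(u, tau):
    """Quantile check loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def knight_lhs(u, v, tau):
    """Left-hand side: rho_tau(u - v) - rho_tau(u)."""
    return check_loss(u - v, tau) - check_loss(u, tau)


def knight_rhs(u, v, tau, steps=5000):
    """Right-hand side: -v (tau - 1{u <= 0}) + int_0^v (1{u <= z} - 1{u <= 0}) dz.

    The integral uses a midpoint rule with a signed step, so v < 0 is handled too.
    """
    ind0 = 1.0 if u <= 0 else 0.0
    h = v / steps
    integral = 0.0
    for k in range(steps):
        z = (k + 0.5) * h
        integral += ((1.0 if u <= z else 0.0) - ind0) * h
    return -v * (tau - ind0) + integral


if __name__ == "__main__":
    random.seed(0)
    worst = 0.0
    for _ in range(100):
        u, v = random.uniform(-3, 3), random.uniform(-3, 3)
        tau = random.uniform(0.05, 0.95)
        worst = max(worst, abs(knight_lhs(u, v, tau) - knight_rhs(u, v, tau)))
    print("largest discrepancy:", worst)
```

The integrand is a step function with a single jump at $z = u$, so the quadrature error is at most one step length, which is the only source of discrepancy between the two sides.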
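By taking expectations, we note that $\mathbb{E}_{Y|X \sim P_{Y|X}}[(\tau - I\{Y \le f_0(X)\}) \mid X] = 0$ due to the fact that $P(Y \le f_0(X) \mid X) = \tau$, and using Fubini's theorem, there holds

$$\begin{aligned}
&\mathbb{E}_{(X,Y) \sim P_{X,Y}}\big[\rho_\tau(Y - f(X)) - \rho_\tau(Y - f^*(X))\big] \\
&= -\mathbb{E}_{X \sim P_X}\Big[\big(f(X) - f^*(X)\big)\,\mathbb{E}_{Y|X \sim P_{Y|X}}\big[(\tau - I\{Y \le f_0(X)\}) \mid X\big]\Big] \\
&\quad - \mathbb{E}_{X \sim P_X}\Big[\big(f(X) - f^*(X)\big)\,\mathbb{E}_{Y|X \sim P_{Y|X}}\big[(I\{Y \le f_0(X)\} - I\{Y \le f^*(X)\}) \mid X\big]\Big] \\
&\quad + \mathbb{E}_{X \sim P_X}\Big[\int_0^{f(X) - f^*(X)} \Big(\mathbb{E}_{Y|X \sim P_{Y|X}}[I\{Y \le f^*(X) + z\} \mid X] - \mathbb{E}_{Y|X \sim P_{Y|X}}[I\{Y \le f^*(X)\} \mid X]\Big)\,dz\Big] \\
&\ge -C_1\,\mathbb{E}_{X \sim P_X}\big[|f(X) - f^*(X)|\,|f_0(X) - f^*(X)|\big] + C_2\,\mathbb{E}_{X \sim P_X}\big[D^2(f(X) - f^*(X))\big] \\
&\ge -C_1 \sqrt{\mathbb{E}_{X \sim P_X}|f(X) - f^*(X)|^2}\,\sqrt{\mathbb{E}_{X \sim P_X}|f_0(X) - f^*(X)|^2} + C_2\,\mathbb{E}_{X \sim P_X}\big[D^2(f(X) - f^*(X))\big],
\end{aligned}$$

where $C_1, C_2$ are two absolute positive constants and $D^2(t) := \min\{|t|, |t|^2\}$, $t \in \mathbb{R}$. The first inequality follows from Assumptions 2 and 3 and Lemma 13 in Madrid Padilla and Chatterjee (2022), and the last inequality follows from the Cauchy-Schwarz inequality. Consequently, for any $\beta > 0$, there holds

$$\begin{aligned}
C_2\,\mathbb{E}_{X \sim P_X}\big[D^2(f(X) - f^*(X))\big]
&\le \mathbb{E}_{(X,Y) \sim P_{X,Y}}\big[\rho_\tau(Y - f(X)) - \rho_\tau(Y - f^*(X))\big] \\
&\quad + C_1 \sqrt{\mathbb{E}_{X \sim P_X}|f(X) - f^*(X)|^2}\,\sqrt{\mathbb{E}_{X \sim P_X}|f_0(X) - f^*(X)|^2} \\
&\le \frac{C_1}{4\beta}\,\mathbb{E}_{X \sim P_X}|f(X) - f^*(X)|^2 + C_1 \beta\,\mathbb{E}_{X \sim P_X}|f_0(X) - f^*(X)|^2 \\
&\quad + \mathbb{E}_{(X,Y) \sim P_{X,Y}}\big[\rho_\tau(Y - f(X)) - \rho_\tau(Y - f^*(X))\big].
\end{aligned}$$

Note that since $f \in \mathcal{G}$, we have $\|f\|_\infty \le B$ with $B \ge 1$. Then, for any $\|f^*\|_\infty \le B$, we have

$$D^2(f(X) - f^*(X)) = \min\{|f(X) - f^*(X)|, |f(X) - f^*(X)|^2\} \ge 2B \min\Big\{\frac{|f(X) - f^*(X)|}{2B}, \frac{|f(X) - f^*(X)|^2}{(2B)^2}\Big\} = \frac{|f(X) - f^*(X)|^2}{2B},$$

where the last equality follows from $|f(X) - f^*(X)| \le \|f\|_\infty + \|f^*\|_\infty \le 2B$. By setting $\beta = \frac{C_1 B}{C_2}$, there holds

$$\|f - f^*\|^2_{L^2(P_X)} \le \frac{4B}{C_2}\,\mathbb{E}_{(X,Y) \sim P_{X,Y}}\big[\rho_\tau(Y - f(X)) - \rho_\tau(Y - f^*(X))\big] + \frac{4 C_1^2 B^2}{C_2^2}\,\|f_0 - f^*\|^2_{L^2(P_X)}.$$

Note that $\hat f_D \in \mathcal{G}$, so $\|\hat f_D\|_\infty \le B$, and it follows directly by the triangle inequality that

$$\begin{aligned}
\|\hat f_D - f_0\|^2_{L^2(P_X)}
&\le 2\|\hat f_D - f^*\|^2_{L^2(P_X)} + 2\|f^* - f_0\|^2_{L^2(P_X)} \\
&\lesssim B\,\mathbb{E}_{(X,Y) \sim P_{X,Y}}\big[\rho_\tau(Y - \hat f_D(X)) - \rho_\tau(Y - f^*(X))\big] + B^2 \|f^* - f_0\|^2_{L^2(P_X)} \\
&\le B\,\mathbb{E}_{(X,Y) \sim P_{X,Y}}\big[\rho_\tau(Y - \hat f_D(X)) - \rho_\tau(Y - f^*(X))\big] + B^2 \|f^* - f_0\|^2_{\infty}.
\end{aligned}$$

This completes the proof of Lemma 4.

A.1.1 Proof of Estimates for the Approximation Error

We give the proof of the upper bound for the approximation error as stated in Theorem 5. This proof aligns with the methodology employed in previous works (Yarotsky, 2017; Petersen and Voigtlaender, 2018).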
Diverging from the approach in Yarotsky (2017), our contribution lies in deriving bounds for the weights through the techniques introduced in Petersen and Voigtlaender (2018). To start with, we first introduce two preliminary lemmas. Lemma 18. Given M > 0 and δ (0, 1), there exists a Re LU neural network h satisfies: (i). for all x, y [ M, M], we have |xy h(x, y)| δ; (ii). if x = 0 or y = 0, then h(x, y) = 0; (iii). the depth D and the size S in h are less than c1 ln(1 δ) + c2, and weights bound B is not larger than c3 δ , where c1 is an absolute constant, and c2, c3 are two constants depending on M. The proof of Lemma 18 can be completed by combining Proposition 3 in Yarotsky (2017) and Lemma A.3 in Petersen and Voigtlaender (2018), and thus is omitted here. Lemma 19. Let f1 Fd1,k1(W1, D1, S1, B1) and f2 Fd2,k2(W2, D2, S2, B2), and then the following statements hold. (i). (Composition) If k1 = d2, then f2 f1 Fd1,k2(max{W1, W2}, D1 +D2, S1 +S2, B1 B2 max{W1, W2}). Moreover, if A Rd2 d1, b Rd2 and define the function f(x) = f2(Ax + b) for x Rd1, then there holds f Fd1,k2(W2, D2, S2, d2B2 (A, b) ). (ii). (Parallelization) If d1 = d2, denote f(x) = (f1(x), f2(x)), then there holds f Fd1,k1+k2(W1 + W2, max{D1, D2}, S1 + S2, max{B1, B2}). (iii). (Linear Combination) Let c1, c2 R. If d1 = d2 and k1 = k2, then we have c1f1 + c2f2 Fd1,k1(W1 + W2, max{D1, D2}, S1 + S2, max {B1, B2, |c1|B1 + |c2|B2}). Proof. To start with, we first denote fi as Re LU neural networks with parameters θi = (A(i) 0 , b(i) 0 ), . . . , (A(i) Di, b(i) Di) , fori = 1, 2. For (i), without loss of generality, we assume that W1 = W2 and then, f2 f1 can be parameterized by (A(1) 0 , b(1) 0 ), . . . , (A(1) D1 1, b(1) D1 1), (A(2) 0 A(1) D1, A(2) 0 b(1) D1 + b(2) 0 ), (A(2) 1 , b(2) 1 ), . . . , (A(2) D2, b(2) D2) . 
Deep Nonparametric Quantile Regression under Covariate Shift Note that it holds true that (A(2) 0 A(1) D1, A(2) 0 b(1) D1 + b(2) 0 ) max (A(2) 0 A(1) D1) , (A(2) 0 b(1) D1 + b(2) 0 ) Then, we have f2 f1 Fd1,k2 max{W1, W2}, D1 + D2, S1 + S2, B1 B2 max{W1, W2} . Following a similar treatment, we can conclude that f Fd1,k2(W2, D2, S2, d2B2 (A, b) ) for the function f(x) = f2(Ax + b) since it is a composition of f2 with f1(x) = Ax + b, which can be regarded as a neural network with depth zero. For (ii), without loss of generality, we assume that D1 = D2 and then, f can be parameterized by the parameters ((A0, b0), . . . , (AD1, b D1)) with Then, the derived result directly follows from A(1) ℓ 0 b(1) ℓ 0 A(2) ℓ b(2) ℓ = max (A(1) ℓ, b(1) ℓ) , (A(2) ℓ, b(2) ℓ) . For (iii), directly replacing the matrix (AD1, b D1) in (ii) with (c1A(1) D1, c2A(2) D2, c1b(1) D1 + c2b(2) D2), the derived result directly follows from (c1A(1) D1, c2A(2) D1, c1b(1) D1 + c2b(2) D2) |c1| (A(1) D1, b(1) D1) + |c2| (A(2) D1, b(2) D1) |c1|B1 + |c2|B2. Thus, we have c1f1 + c2f2 Fd1,k1(W1 + W2, max{D1, D2}, S1 + S2, max {B1, B2, |c1|B1 + |c2|B2}). This completes the proof of Lemma 19. Combining Lemmas 18 and 19, we are now ready to complete the proof of Theorem 5. Proof. We denote the function ψ( ) by ψ(t) := σ(1 |t|) = σ(1 σ(t) σ( t)) [0, 1], t R. Notice that ψ is a two-layer neural network contained in F(2, 2, 6, 1). Let N N, for any n = (n1, . . . , nd) {0, 1, . . . , N}d, we define i=1 ψ (Nxi ni) , x = (x1, . . . , xd) Rd. Then, ψn is supported on x Rd : x n N , and (N + 1)d functions {ψn}n form a partition of unity of the domain [0, 1]d, i.e., n {0,1,...,N}d ψn(x) = ni=0 ψ (Nxi ni) 1, x [0, 1]d. Feng and He and Jiao and Kang and Wang Let cn,s := sf n N /s! be the Taylor coefficients of f at n N , where s := (s1, . . . , sd) Nd with s 1 t = ζ . We denote by pn,s(x) := ψn(x) x n N s. Then, it follows that pn,s is supported on x Rd : x n N . Denote by n {0,1,...,N}d s 1 t cn,spn,s(x). 
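The partition-of-unity property of the bumps $\psi_n$ used in this construction can be verified directly. The following is a minimal sketch with hypothetical function names: it writes the hat function $\psi(t) = \sigma(1 - \sigma(t) - \sigma(-t))$ with ReLUs only, forms the tensor-product bumps $\psi_n(x) = \prod_i \psi(N x_i - n_i)$, and sums them over the grid $\{0, 1, \ldots, N\}^d$.

```python
import itertools


def relu(t):
    """ReLU activation sigma(t) = max(t, 0)."""
    return max(t, 0.0)


def psi(t):
    """Hat function psi(t) = sigma(1 - |t|), written with ReLUs only."""
    return relu(1.0 - relu(t) - relu(-t))


def psi_n(x, n, N):
    """Tensor-product bump psi_n(x) = prod_i psi(N * x_i - n_i)."""
    value = 1.0
    for x_i, n_i in zip(x, n):
        value *= psi(N * x_i - n_i)
    return value


def partition_sum(x, N):
    """Sum of psi_n(x) over all grid indices n in {0, ..., N}^d."""
    d = len(x)
    return sum(psi_n(x, n, N) for n in itertools.product(range(N + 1), repeat=d))


if __name__ == "__main__":
    for point in [(0.0, 0.0), (0.3, 0.77), (1.0, 0.5), (0.123, 0.999)]:
        print(point, partition_sum(point, N=4))
```

Each bump vanishes once any coordinate of $x$ is farther than $1/N$ from the grid point $n/N$, so only the grid points adjacent to $x$ contribute to the sum at any given location.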
Using Taylor expansion in Lemma A.8 in Petersen and Voigtlaender (2018), we have |f(x) p(x)| = n ψn(x)f(x) X s 1 t cn,s x n s 1 t cn,s x n s 1 t cn,s x n N = ϵ 2d+1dt for some ϵ > 0, then |f(x) p(x)| Bϵ 2 . Hence, it remains to construct a neural network approximating p(x) with the approximation error Bϵ 2 . Equivalently, it aims to construct neural networks approximating the product pn,s(x) = ψn(x) x n N s. Let δ > 0, then we can recursively define fn,s(x) = h (ψ(Nx1 n1), h(ψ(Nx2 n2), . . . , h(x1 n1/N, . . .), . . .) , where h is defined in Lemma 18. Using similar arguments in the proof of Theorem 1 in Yarotsky (2017), it holds that fn,s can be implemented by a Re LU network with the depth and size not larger than c1(d + t) ln(1/δ) for some constant c1 = c1(d, t), and |fn,s(x) pn,s(x)| (d + t)δ. (16) Consequently, we establish the desired neural network n {0,1,...,N}d s 1 t cn,sfn,s(x). Deep Nonparametric Quantile Regression under Covariate Shift Therefore, for any x [0, 1]d, we have |p(x) f(x)| = s 1 t cn,spn,s(x) X s 1 t cn,sfn,s(x) s 1 t |cn,s pn,s(x) fn,s(x)| s 1 t |pn,s(x) ϕn,s(x)| B(t + 1)dt X |pn,s(x) ϕn,s(x)| B(t + 1)dt2d max n: x n N |pn,s(x) ϕn,s(x)| B(t + 1)dt+12d+1δ, where the third inequality follows from P s 1 t 1 = Pt j=0 P s 1=j 1 Pt j=0 dj (t + 1)dt, and the last inequality holds by (16). Setting δ = ϵ (t + 1)dt+12d+2 , (17) we have |f(x) f(x)| |f(x) p(x)| + |p(x) f(x)| Bϵ. Moreover, f is a liner combination of dt(N +1)d neural networks fn,s, we can conclude that the neural network f has not more than c1 ln(1/δ) + 1 layers and dt(N + 1)d(c1 ln(1/δ) + 1) weights by Theorem 1 of Yarotsky (2017). With δ given by (17) and N given by (15), it holds that f has the depth at most c2(ln(1/ϵ)+1) and at most c2ϵ d/ζ(ln(1/ϵ)+1) weights, where c2 = c2(d, ζ) is a constant depending d, ζ. 
Using Lemma 19 and the techniques in proof of Theorem A.9 in Petersen and Voigtlaender (2018), the weights of f can be upper-bounded by B = c3Bϵ d/ζ, where c3 = c3(d, ζ) is a constant depending on d, ζ. This completes the proof. A.1.2 Proof of Estimates for the Statistical Error In this section, we give the proof of Theorem 7. We first introduce the following lemma bounding the covering number of Re LU FNNs in the uniform norm. Lemma 20. Let F be the Re LU FNNs with width W, depth D, and size S. Assume that the parameters of F are bounded by a constant B > 0, then for each δ > 0, log N(F, δ, ) O(SD log(BWD/δ)). Proof. For each ϕθ F, we have ϕθ(x) = (ADσ( ) + b D) . . . (A2σ( ) + b2) (A1x + b1) , Feng and He and Jiao and Kang and Wang where x [0, 1]d, and θ B. For different parameters θ and θ, we denote ϕθ(x) = (ADσ( ) + b D) . . . (A2σ( ) + b2) (A1x + b1) = ϕD ϕD 1 . . . ϕ1 ϕ0(x), ϕ θ(x) = ADσ( ) + b D . . . A2σ( ) + b2 A1x + b1 = ϕD ϕD 1 . . . ϕ1 ϕ0(x). By adding and subtracting one item, it follows that |ϕθ(x) ϕ θ(x)| = |ϕD ϕD 1 . . . ϕ1 ϕ0(x) ϕD ϕD 1 . . . ϕ1 ϕ0(x)| |ϕD ϕD 1 . . . ϕ1 ϕ0(x) ϕD ϕD 1 . . . ϕ1 ϕ0(x)| + | ϕD ϕD 1 . . . ϕ1 ϕ0(x) ϕD ϕD 1 . . . ϕ1 ϕ0(x)| + + | ϕD ϕD 1 . . . ϕ1 ϕ0(x) ϕD ϕD 1 . . . ϕ1 ϕ0(x)| + | ϕD ϕD 1 . . . ϕ1 ϕ0(x) ϕD ϕD 1 . . . ϕ1 ϕ0(x)|. Therefore, it suffices to bound D + 1 terms on the right hand of the above inequality. We provide detailed proofs for obtaining the upper bound of the first term, and the remaining D terms can be controlled in a similar manner. Through elementary algebraic calculations, we have |ϕD ϕD 1 . . . ϕ1 ϕ0(x) ϕD ϕD 1 . . . ϕ1 ϕ0(x)| =|(AD AD)σ(g D 1(x)) + b D b D| ( g D 1(x) 1 + 1) θ θ (W g D 1(x) + 1) θ θ , (18) where g D 1(x) := AD 1σ(g D 2(x)) + b D 1. By mathematical recursion, we can bound g D 1(x) as follows: g D 1(x) = AD 1σ(g D 2(x)) + b D 1 AD 1σ(g D 2(x)) + b D 1 WB g D 2(x) + WB (WB)D 1 + . . . + WB (D 1)(WB)D 1. (19) Combining (18) and (19) gives that |ϕD ϕD 1 . . . ϕ1 ϕ0(x) ϕD ϕD 1 . . . 
ϕ1 ϕ0(x)| =|(AD AD)σ(g D 1(x)) + b D b D| ( g D 1(x) 1 + 1) θ θ (W g D 1(x) + 1) θ θ D(WB)D θ θ . Hence, we can conclude that |ϕθ(x) ϕ θ(x)| L θ θ , Deep Nonparametric Quantile Regression under Covariate Shift where L = O D2(BW)D . Utilizing Lemma 5.13 and Problem 5.5 in van Handel (2016), for each δ > 0, we have log N(F, δ, ) O(SD log(BWD/δ)). With Lemma 20, we can prove Theorem 7 as follows. Proof. For each f G, we denote L(f) := E(X,Y ) PX,Y [ρτ(Y f(X)) ρτ(Y f (X))] , b LD(f) := 1 i=1 (ρτ(Yi f(Xi)) ρτ(Yi f (Xi))) . Then, for each f G, we have L( bf D) = L( bf D) L(f ) = L(f ) 2b LD( bf D) + L( bf D) + 2b LD( bf D) 2L(f ) L(f ) 2b LD( bf D) + L( bf D) + 2b LD(f) 2L(f ). Taking expectations followed by taking infimum about f over G in the above equation, we have ED h L( bf D) i ED h L(f ) 2b LD( bf D) + L( bf D) i . Therefore, it remains to derive the upper bound of the statistical error Esta := ED h L(f ) 2b LD( bf D) + L( bf D) i . Denote by ℓ(f; ω) := ρτ(y f(x)) ρτ(y f (x)) with ω := (x, y). It is easy to check that ℓ(f; ω) is Lipschitz continuous over f, i.e., for each ω, we have | ℓ(f1; ω) ℓ(f2; ω)| λ|f1(x) f2(x)|, with λ = 1. Let D := {U 1, . . . , U n} be an i.i.d ghost sample independent of D = {U1, . . . , Un} with Ui := (Xi, Yi), i = 1, . . . , n. Let G(f; U) := ED h ℓ(f; U ) 2 ℓ(f; U) i . Then, we have 2 ℓ bf D; Ui + ED h ℓ bf D; U i i # i=1 G bf D; Ui # Let N (G, δ, ) be the covering number of G with cover C = {f1, . . . , f N }. f G, there exists a f C such that | ℓ(f; ω) ℓ( f; ω)| λ f f λδ, G(f; ω) G( f; ω) + 3λδ. By Assumption 3, it follows that L is strongly convex at f , i.e., for each f, L(f) L(f ) c f f 2 L2(PX) , Feng and He and Jiao and Kang and Wang where c > 0 is an absolute constant. For each f C, we have i=1 G f; Ui > t i=1 ED h ℓ( f; U i) i 2 i=1 ℓ( f; Ui) > t i=1 ℓ( f; Ui) i=1 ℓ( f; Ui) > t/2 + ED i=1 ℓ( f; Ui) It is easy to show that | ℓ( f; Ui)| 2λB, and | ℓ( f; Ui) E ℓ( f; Ui)| 4λB := b. 
Denote by σ2 := V ar( ℓ( f, Ui)), then we have σ2 E[ ℓ( f; Ui)2] λ2 f f 2 L2(PX) λ2 c E[ ℓ( f; Ui)]. Thus, we have E[ ℓ( f; Ui)] cσ2 λ2 . Let v := t bλ , then we have σ2 bλv 2Bc, and v t/2. By Bernstein s inequality and (20), we have i=1 G f; Ui > t i=1 ℓ( f; Ui) i=1 ℓ( f; Ui) > t/2 + cσ2 i=1 ℓ( f; Ui) i=1 ℓ( f; Ui) > v exp( nv2/2(σ2 + bv)) exp( nv/b(2 + λ/Bc)) exp( nt/(8λB + 4λ2/c)). Hence, t > 3λδ, we have i=1 G bf D; Ui > t max f G 1 n i=1 G (f; Ui) > t i=1 G f; Ui > t 3λδ N exp n(t 3λδ) 8λB + 4λ2/c By Lemma 20 and setting a = 3λδ + c0 with δ = 1/n and c0 = (8λB + 4λ2/c) log N/n yield that a N exp n(t 3λδ) 8λB + 4λ2/c a + N exp c0n 8λB + 4λ2/c 4λB + 4λ2/c 8λB + 4λ2/c (log N + 1) + 3λ n O BSD log(n BWD) Deep Nonparametric Quantile Regression under Covariate Shift This completes the proof. A.2 Proof of the Main Results for the Unweighted Estimators A.2.1 Proof of Error Bounds for the Unweighted Estimators Under the Uniformly Bounded Case In this section, we give the proof of Theorem 8, which provides the non-asymptotic error bound for the unweighted estimators under the uniformly bounded case. The proof highly relies on the nice property that bf D f0 L2(QX) Γ bf D f0 L2(PX) given the assumption that supx X r(x) Γ. Proof. By Lemma 4 and taking the expectation on the training data D, we have ED h bf D f0 2 L2(PX) i BEDE(X,Y ) PX,Y ρτ(Y bf D(X)) ρτ(Y f (X)) + B2 inf f G f f0 2 For the statistical error term, from Theorem 7, we have EDE(X,Y ) PX,Y ρτ(Y bf D(X)) ρτ(Y f (X)) O BSD log(n BWD) O Bn 2ζ d+2ζ (log n)3 , where the second inequality is from the fact that the function class G is a Re LU FNN F bounded by B 1, with the size S = O(n d d+2ζ log n) and the depth D = O(log n), and weights bound B = O(Bn d d+2ζ ). For the approximation error term, this can be bounded by using Theorem 5. By combining these two error terms, we have ED bf D f0 2 L2(PX) O B2n 2ζ d+2ζ (log n)3 . 
(21) Then by Assumption 4 that supx X r(x) Γ, there holds ED bf D f0 2 L2(QX) ΓED bf D f0 2 L2(PX) O ΓB2n 2ζ d+2ζ (log n)3 . This completes the proof of Theorem 8. A.2.2 Proof of Error Bounds for the Unweighted Estimators Under the Bounded Second Moment Case In this section, we give the proof of Theorem 9 , which provides the non-asymptotic error bound for the unweighted estimators under the bounded second moment case. In this case, the property that bf D f0 L2(QX) V 2 bf D f0 L2(PX) does not hold for the bounded second moment case. By using the Cauchy-Schwarz inequality, we can only obtain a suboptimal rate. Feng and He and Jiao and Kang and Wang Proof. By the definition of the density ratio r, for each f F, we have f f0 2 L2(QX) = EX PX r(X)(f(X) f0(X))2 EX PX r2(X) 1 2 EX PX (f(X) f0(X))4 1 V 1 2 EX PX (f(X) f0(X))4 1 (4B2V ) 1 2 n f f0 2 L2(PX) o 1 where the first inequality follows from the Cauchy-Schwarz inequality, the second inequality is from Assumption 5 that EX PX[r2(X)] V 2, and the last inequality is due to that fact that f B and f0 B. Combining this result with (21), we have ED bf D f0 2 L2(QX) (4B2V ) 1 2 ED n f f0 2 L2(PX) o 1 2 O B2V 1 2 n ζ d+2ζ (log n) 3 2 . This completes the proof. A.3 Proof of the Main results for the Reweighted estimators A.3.1 Proof of Error Bounds for the Reweighted Estimators Under the Uniformly Bounded Case In this section, we give the proof of Theorem 10, which provides the non-asymptotic error bound for the reweighted estimators under the uniformly bounded case. This proof follows a similar approach to that in Theorem 8. Specifically, it incorporates a modified error decomposition as detailed in Lemma 21 and includes estimates for the statistical error term from Lemma 22. Lemma 21. Suppose that Assumptions 2 and 3 are satisfied, and the function space J is also uniformly bounded by B with B 1. 
Then, the reweighted estimator bfr,D defined in (6) satisfies bfr,D f0 2 L2(QX) BE(X,Y ) PX,Y h r(X) ρτ(Y bfr,D(X)) ρτ (Y f (X)) i + B2 f0 f 2 . Deep Nonparametric Quantile Regression under Covariate Shift Proof. By the Knight identity and Fubini s theorem, we can employ the same proof procedure as detailed in Lemma 4 in Section 4.1.1, which shows that E(X,Y ) PX,Y [r(X)(ρτ(Y f(X)) ρτ(Y f (X)))] =EX PX h r(X) f(X) f (X) EY |X PY |X (τ I{Y f0(X)}) | X i EX PX h r(X) f(X) f (X) EY |X PY |X (I{Y f0(X)} I{Y f (X)}) | X i + EX PX h Z f(X) f (X) 0 r(X) EY |X PY |X[I{Y f (X) + z} | X] EY |X PY |X[I{Y f (X)} | X] dz i C1EX PX r(X)|f(X) f (X)||f0(X) f (X)| + C2EX PX r(X)D2(f(X) f (X)) EX PX r(X)|f(X) f (X)|2 q EX PX r(X)|f0(X) f (X)|2 + C2EX PX r(X)D2(f(X) f (X)) , where C1, C2 are two absolute positive constants. Then, for any β > 0, C2EX PX r(X)D2(f(X) f (X)) EX PX r(X)|f(X) f (X)|2 q EX PX r(X)|f0(X) f (X)|2 + E(X,Y ) PX,Y r(X) ρτ(Y f(X)) ρτ(Y f (X)) 4β EX PX r(X)|f(X) f (X)|2 + C1βEX PX r(X)|f0(X) f (X)|2 + E(X,Y ) PX,Y r(X) ρτ(Y f(X)) ρτ(Y f (X)) . By setting β = C1B C2 and applying the inequality D2(f(X) f (X)) |f(X) f (X)|2 2B , which holds almost surely for B 1, there holds f f 2 L2(QX) 4B E(X,Y ) PX,Y r(X) ρτ(Y f(X)) ρτ(Y f (X)) C2 2 f0 f 2 L2(QX). Using the triangle inequality yields that bfr,D f0 2 L2(QX) 2 bfr,D f 2 L2(QX) + 2 f f0 2 L2(QX) BE(X,Y ) PX,Y h r(X) ρτ(Y bfr,D(X)) ρτ (Y f (X)) i + B2 f0 f 2 L2(QX) BE(X,Y ) PX,Y h r(X) ρτ(Y bfr,D(X)) ρτ (Y f (X)) i + B2 f0 f 2 . This completes the proof. Feng and He and Jiao and Kang and Wang Lemma 22. Given the reweighted estimator bfr,D in (6) and the considered function class J is Re LU FNN F, we have EDE(X,Y ) PX,Y h r(X) ρτ(Y bfr,D(X)) ρτ(Y f (X)) i O BSDΓ2 log(n BWD) Proof. This proof is similar to that of Theorem 7, requiring solely the redefinition of L, b LD, and ℓ. 
Specifically, we reformulate them as L(f) := E(X,Y ) PX,Y [r(X)(ρτ(Y f(X)) ρτ(Y f (X)))] , b LD(f) := 1 i=1 r(Xi) (ρτ(Yi f(Xi)) ρτ(Yi f (Xi))) , where f J . Then, we define ℓ(f; ω) := r(x) (ρτ(y f(x)) ρτ(y f (x))) with ω := (x, y). It is easy to check that ℓ(f; ω) is Lipschitz continuous over f, i.e., for each ω, we have | ℓ(f1; ω) ℓ(f2; ω)| r(x)|f1(x) f2(x)| λ|f1(x) f2(x)|, with λ = Γ. Using Assumption 3, we can conclude that L is strongly convex at f , i.e., for each f, L(f) L(f ) c f f 2 L2(QX) , where c > 0 is an absolute constant. Therefore, employing analogous arguments to the proof of Theorem 7 gives the desired upper bound. This completes the proof. With Theorem 5, and Lemmas 21-22, we proceed the proof of Theorem 10 as given below. Proof. By Lemma 4 and taking the expectation on the training data D, we have ED h bf D f0 2 L2(PX) i BEDE(X,Y ) PX,Y r(X) ρτ(Y bf D(X)) ρτ(Y f (X)) + B2 inf f G f f0 2 . We set the function class J to be a Re LU FNN bounded by B 1, with the size S = O(n d d+2ζ log n) and the depth D = O(log n), and weights bound B = O(Bn d d+2ζ ). Using Lemma 22 and Theorem 5, then we obtain the desired bound. A.3.2 Proof of Error Bounds for the Reweighted Estimators Under the Bounded Second Moment Case In this section, we give the proof of Theorem 11, which provides the non-asymptotic error bound for the truncated reweighted estimators in the case of a bounded second moment. We consider a more general assumption that EX PX[r1+δ(X)] = U < , for some δ 0, and second moment bounded is a special case when δ = 1. Proof. We denote by R r(f) := E(X,Y ) PX,Y r(X)(ρτ(f(X) Y ) ρτ(f (X) Y )) , R Tξr(f) := E(X,Y ) PX,Y Tξr(X)(ρτ(f(X) Y ) ρτ(f (X) Y )) , Deep Nonparametric Quantile Regression under Covariate Shift where f K. 
Then, we have R r(f) R Tξr(f) = E(X,Y ) PX,Y [(r(X) ξ)I(r(X) ξ)(ρτ(f(X) Y ) ρτ(f (X) Y ))] = E(X,Y ) PX,Y [r(X)I(r(X) ξ)(ρτ(f(X) Y ) ρτ(f (X) Y ))] E(X,Y ) PX,Y [ξI(r(X) ξ)(ρτ(f(X) Y ) ρτ(f (X) Y ))] where the inequality holds by the 1+δ moment bounded assumption and Markov inequality. According to the proof of Lemma 21, for each f K, f f0 2 L2(QX) BE(X,Y ) PX,Y [r(X) (ρτ(Y f(X)) ρτ (Y f (X)))] + B2 f0 f 2 L2(QX) BR Tξr(f) + B2U ξδ + B2 f0 f 2 . Similar to the proof of Lemma 22, it holds that ED[R Tξr( bf Tξr,D)] O BSDξ2 log(n BWD) Setting ξ = n U SD log(n BWD) 1 2+δ yields that ED bf Tξr,D f0 2 L2(QX) U2δ/(2+δ)B2(SD log(n BWD)/n)1/(1+δ) + B2 f0 f 2 . Using Theorem 5 and setting the function class K as a Re LU FNN bounded by B 1, with the size S = O n d d+(2+4/δ)ζ log n , the depth D = O (log n), and weights bound B = d d+(2+4/δ)ζ , it follows that ED h bf Tξr,D f0 2 L2(QX) i O U2δ/(2+δ)B2n 2ζ d+(2+4/δ)ζ log n . Let δ = 1 and U = V 2, we complete the proof. A.4 Proof of the Main results for the Pre-training Reweighted estimators A.4.1 Proof of Error Bounds for the Pre-training Reweighted Estimators Under the Uniformly Bounded Case In this section, we give the proof of Theorem 13 and Theorem 14, which provide the nonasymptotic error bound for the pre-training density ratio and the pre-training reweighted estimators under the uniformly bounded case. Feng and He and Jiao and Kang and Wang Proof. For each u U, we have L(br S) L(r) = L(r) 2b LS(br S) + L(br S) + 2b LS(br S) 2L(r) L(r) 2b LS(br S) + L(br S) + 2b LS(u) 2L(r). Next, we bound the statistical error Esta = ES h L(r) 2b LS(br S) + L(br S) i and approximation error 2b LS(u) 2L(r). This approximation error can be controlled by Theorem 5. Therefore, it remains to derive the upper bound of the statistical error. This can be done by following the proof of Theorem 7. We denote by ℓ(u; x) := 1 2u2(xp) u(xq), with x := (xp, xq). Then, we define ℓ(u; x) := ℓ(u; x) ℓ(r; x). 
It is easy to check that ℓ(u; x) is Lipschitz continuous over u, i.e., for each x, we have | ℓ(u1; x) ℓ(u2; x)| λ (|u1(xp) u2(xp)| + |u1(xq) u2(xq)|) , with λ = Γ. Moreover, L is strongly convex at r, i.e., for each u, L(u) L(r) c u r 2 L2(PX) , c := 1. Then, using the similar arguments in the proof of Theorem 7 yields that 32λΓ + 8(2 + Γ)λ2/c (log N + 1) m Γ3SD log(m BWD) Then, setting the size S = O m d d+2α log m , the depth D = O (log m), and weights bound B = O Γm d d+2α gives the desired result. By using Theorem 13, we are ready to prove Theorem 14. Proof. Using Lemma 21, we can similarly get ES,D h bfbr S,D f0 2 L2(QX) i BES,DE(X,Y ) PX,Y [r(X)(ρτ(Y bfbr S,D(X)) ρτ(Y f (X)))] + B2 f f0 2 L2(QX) BES,DE(X,Y ) PX,Y [(r(X) br S(X))(ρτ(Y bfbr S,D(X)) ρτ(Y f (X)))] + BES,DE(X,Y ) PX,Y [br S(X)(ρτ(Y bfbr S,D(X)) ρτ(Y f (X)))] + B2 f f0 2 L2(QX). (22) Deep Nonparametric Quantile Regression under Covariate Shift For any β > 0, we have ES,DE(X,Y ) PX,Y [(r(X) br S(X))(ρτ(Y bfbr S,D(X)) ρτ(Y f (X)))] ES h EX PX |(r(X) br S(X))|2 /2β i E(X,Y ) PX,Y ρτ(Y bfbr S,D(X)) ρτ(Y f (X)) 2 β/2 ES h r br S 2 L2(PX) i /2β + ES,D bfbr S,D f 2 ES h r br S 2 L2(PX) i /2β + β bfbr S,D f0 2 Υ f0 f 2 L2(QX) . (23) Combining (22)-(23) and setting β = Υ 2B yield that ES,D h bfbr S,D f0 2 L2(QX) i BES,DE(X,Y ) PX,Y [br S(X)(ρτ(Y bfbr S,D(X)) ρτ(Y f (X)))] + B2 f f0 2 + B2 Υ ES h r br S 2 L2(PX) i . (24) In (24), we can deduce that ES,DE(X,Y ) PX,Y [br S(X)(ρτ(Y bfbr S,D(X)) ρτ(Y f (X)))] O BSDΓ2 log(n BWD) n by employing similar arguments in Lemma 22, f f0 2 can be bounded by Theorem 5, and ES h r br S 2 L2(PX) i can be bounded using Theorem 13. Therefore, setting the size S = O n d d+2ζ log n , the depth D = O (log n), and weights bound B = O Bn d d+2ζ , we have ES,D h bfbr S,D f0 2 L2(QX) i O B2Γ2n 2ζ 2ζ+d (log n)3 + O B2Γ3m 2α d+2α (log m)3 Moreover, if m Ω Γ ζ(d+2α) α(d+2ζ) , we have ES,D bfbr S,D f0 2 L2(QX) O B2Γ2n 2ζ d+2ζ (log n)3 . This completes the proof. 
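The least-squares criterion used for the pre-trained density ratio in the proof of Theorem 13, $\ell(u; x) = \tfrac{1}{2}u^2(x_p) - u(x_q)$, can be illustrated on a toy example. The sketch below is an illustration under simplifying assumptions that are not from the paper: the covariate takes finitely many values, so functions are plain lists indexed by support point, and all function names are hypothetical. The truncation $T_\xi r = \min\{r, \xi\}$ is read off from the identity $r - T_\xi r = (r - \xi) I(r \ge \xi)$ in Section A.3.2.

```python
def density_ratio(p, q):
    """True ratio r = q / p on a finite support (p: source, q: target)."""
    return [qi / pi for pi, qi in zip(p, q)]


def population_loss(u, p, q):
    """L(u) = E_P[u(X)^2 / 2] - E_Q[u(X)] on a finite support."""
    second_moment = sum(pi * ui ** 2 for pi, ui in zip(p, u))
    target_mean = sum(qi * ui for qi, ui in zip(q, u))
    return 0.5 * second_moment - target_mean


def empirical_loss(u, source_sample, target_sample):
    """Sample analogue: mean of u(x_p)^2 / 2 over source draws minus mean of u(x_q) over target draws."""
    source_term = sum(0.5 * u[x] ** 2 for x in source_sample) / len(source_sample)
    target_term = sum(u[x] for x in target_sample) / len(target_sample)
    return source_term - target_term


def squared_l2_distance(u, r, p):
    """Squared L2(P_X) distance between u and r."""
    return sum(pi * (ui - ri) ** 2 for pi, ui, ri in zip(p, u, r))


def truncate(r, xi):
    """Truncated ratio T_xi r = min(r, xi), applied pointwise."""
    return [min(ri, xi) for ri in r]


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]  # toy source probabilities (assumption)
    q = [0.2, 0.3, 0.5]  # toy target probabilities (assumption)
    r = density_ratio(p, q)
    u = [1.0, 1.0, 1.0]
    excess = population_loss(u, p, q) - population_loss(r, p, q)
    print(excess, 0.5 * squared_l2_distance(u, r, p))
    print(truncate(r, 2.0))
```

In this toy setting the excess risk of any candidate $u$ equals half its squared $L^2(P_X)$ distance to the true ratio, so the true ratio is the unique minimizer of the population criterion.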
A.4.2 Proof of Error Bounds for the Pre-training Reweighted Estimators under the 2 + δ Moment Bounded Case In this section, we give the proof of Theorem 15 and Theorem 17 density, which provide the non-asymptotic error bound for the pre-training density ratio and the pre-training reweighted estimators under the 2 + δ moment bounded case. The proof of Theorem 15 is similar to that of Theorem 13, except for an extra error term for the truncation Tξr Feng and He and Jiao and Kang and Wang r 2 L2(PX) which can be well bounded with Assumption 9. And with Theorem 15, the proof of Theorem 17 is exactly the same as that of Theorem 14. Proof. For each u TξU, we have L(brξ,S) L(r) = L(r) 2b LS(brξ,S) + L(brξ,S) + 2b LS(brξ,S) 2L(r) L(r) 2b LS(brξ,S) + L(brξ,S) + 2b LS(u) 2L(r). Next, we bound the statistical error Esta = ES h L(r) 2b LS(brξ,S) + L(brξ,S) i and approxima- tion error 2b LS(u) 2L(r). Using the triangle inequality, the approximation error satisfies inf u TξU ES[b LS(u) L(r)] 2 inf u TξU u Tξr 2 L2(PX) + 2 Tξr r 2 L2(PX). (25) On the right hand of (25), the first term can be bounded by Theorem 5. In terms of the second term, it can be bounded by using Assumption 9. Specifically, there holds that Tξr r 2 L2(PX) r I(r > ξ) 2 L2(PX) = EX PX[r2+δ(X)] where the last inequality follows from Assumption 9. Moreover, similar to the argument of the proof of Theorem 13, the statistical error Esta can be bounded by O ξ3SD log(m BWD) Then, by setting the size S = O m δd δd+(6+2δ)α , the depth D = O (log m), weights bound δd δd+(6+2δ)α for the Re LU DNNs, and the truncation level ξ = O m 2α δd+(6+2δ)α , we obtain the desired result. 
If we further assume that the square of the density ratio r2 is sub-exponential with respect to PX, i.e., EX PX exp(σr2(X)) < , for some positive constant σ, then Tξr r 2 L2(PX) r I(r > ξ) 2 L2(PX) 2 (r2(X) ξ2) 2 ξ2 EX PX exp σr2(X) Then, setting the size S = O m d d+2α log m , the depth D = O (log m), weights bound B = O ξm d d+2α for the Re LU DNNs, and the truncation level ξ = O log m gives the convergence rate of O(m 2α d+2α (log m) 9 2 ). Deep Nonparametric Quantile Regression under Covariate Shift Proof. Following a similar procedure in the proof of Theorem 14, we have ES,D h bfbrξ,S,D f0 2 L2(QX) i BES,DE(X,Y ) PX,Y [r(X)(ρτ(Y bfbrξ,S,D(X)) ρτ(Y f (X)))] + B2 f f0 2 L2(QX) BES,DE(X,Y ) PX,Y [(r(X) brξ,S(X))(ρτ(Y bfbrξ,S,D(X)) ρτ(Y f (X)))] + BES,DE(X,Y ) PX,Y [brξ,S(X)(ρτ(Y bfbrξ,S,D(X)) ρτ(Y f (X)))] + B2 f f0 2 L2(QX). For the first part of the above equation, for any β > 0, we have ES,DE(X,Y ) PX,Y [(r(X) brξ,S(X))(ρτ(Y bfbrξ,S,D(X)) ρτ(Y f (X)))] ES h EX PX |(r(X) brξ,S(X))|2 /2β i E(X,Y ) PX,Y ρτ(Y bfbrξ,S,D(X)) ρτ(Y f (X)) 2 β/2 ES h r brξ,S 2 L2(PX) i /2β + ES,D bfbrξ,S,D f 2 ES h r brξ,S 2 L2(PX) i /2β + β bfbrξ,S,D f0 2 Υ f0 f 2 L2(QX) , where the second inequality follows from the fact that ρτ( ) is Lipschitz continuous and Assumption 4.9 that Υ = infx X r(x) > 0. Plugging in this result and setting β = Υ 2B yield that ES,D h bfbrξ,S,D f0 2 L2(QX) i B2 f f0 2 + B2 Υ ES h r brξ,S 2 L2(PX) i + BES,DE(X,Y ) PX,Y [brξ,S(X)(ρτ(Y bfbrξ,S,D(X)) ρτ(Y f (X)))]. Note that brξ,S ξ, then we employ the similar arguments in Lemma 22 and obtain that ES,DE(X,Y ) PX,Y [brξ,S(X)(ρτ(Y bfbrξ,S,D(X)) ρτ(Y f (X)))] O BSDξ2 log(n BWD) Set the size S = O(n δd δd+(4+2δ)ζ ), the depth D = O(log n), weights bound B = O(Bn δd δd+(4+2δ)ζ) ), and the truncation level ξ = O(n 2ζ δd+(4+2δ)ζ ). By Theorem 5, we can get inf f M f f0 2 O(Bn 2δζ δd+(4+2δ)ζ ). At last, the term ES h r brξ,S 2 L2(PX) i can be bounded using Theorem 15. 
Therefore, by combining the three error terms, we have

$$
\mathbb{E}_{S,D}\Big[\big\|\hat f_{\hat r_{\xi,S},D} - f_0\big\|^2_{L_2(Q_X)}\Big] \le O\Big(B^2 n^{-\frac{2\delta\zeta}{\delta d + (4+2\delta)\zeta}}(\log n)^2\Big) + O\Big(B^2 m^{-\frac{2\delta\alpha}{\delta d + (6+2\delta)\alpha}}(\log m)^2\Big).
$$

Moreover, if $m \ge \Omega\Big(n^{\frac{[\delta d + (6+2\delta)\alpha]\zeta}{[\delta d + (4+2\delta)\zeta]\alpha}}\Big)$, we have

$$
\mathbb{E}_{S,D}\Big[\big\|\hat f_{\hat r_{\xi,S},D} - f_0\big\|^2_{L_2(Q_X)}\Big] \le O\Big(B^2 n^{-\frac{2\delta\zeta}{\delta d + (4+2\delta)\zeta}}(\log n)^2\Big).
$$

If we further assume that the square of the density ratio $r^2$ is sub-exponential with respect to $P_X$, i.e., $\mathbb{E}_{X \sim P_X}\exp(\sigma r^2(X)) < \infty$ for some positive constant $\sigma$, then

$$
\mathbb{E}_{S,D}\mathbb{E}_{(X,Y) \sim P_{X,Y}}\Big[\hat r_{\xi,S}(X)\big(\rho_\tau(Y - \hat f_{\hat r_{\xi,S},D}(X)) - \rho_\tau(Y - f^*(X))\big)\Big] \le O\!\left(\frac{B\,\mathcal{S}\mathcal{D}\,\xi^2\log(n\mathcal{B}\mathcal{W}\mathcal{D})}{n}\right) \le O\Big(B n^{-\frac{2\zeta}{d+2\zeta}}(\log n)^3\log m\Big),
$$

with the size $\mathcal{S} = O\big(n^{\frac{d}{d+2\zeta}}\log n\big)$, the depth $\mathcal{D} = O(\log n)$, the weights bound $\mathcal{B} = O\big(B n^{\frac{d}{d+2\zeta}}\big)$, and the truncation level $\xi = O(\sqrt{\log m})$. By Theorem 5, we can get $\inf_{f \in \mathcal{M}}\|f - f_0\|^2 \le O\big(B n^{-\frac{2\zeta}{d+2\zeta}}\big)$. At last, the term $\mathbb{E}_S\big[\|r - \hat r_{\xi,S}\|^2_{L_2(P_X)}\big]$ can be bounded using Theorem 17. Therefore, by combining the three error terms, we have

$$
\mathbb{E}_{S,D}\Big[\big\|\hat f_{\hat r_{\xi,S},D} - f_0\big\|^2_{L_2(Q_X)}\Big] \le O\Big(B^2 n^{-\frac{2\zeta}{d+2\zeta}}(\log n)^3\log m\Big) + O\Big(B^2 m^{-\frac{2\alpha}{d+2\alpha}}(\log m)^{9/2}\Big).
$$

Moreover, if $m \ge \Omega\Big(n^{\frac{(d+2\alpha)\zeta}{(d+2\zeta)\alpha}}\Big)$, we have

$$
\mathbb{E}_{S,D}\Big[\big\|\hat f_{\hat r_{\xi,S},D} - f_0\big\|^2_{L_2(Q_X)}\Big] \le O\Big(B^2 n^{-\frac{2\zeta}{d+2\zeta}}(\log n)^4\Big).
$$

This completes the proof.

Appendix B. Additional Numerical Experiments

In this part, we provide some additional numerical results. Specifically, the generating scheme is the same as that in Section 5.1, except that we consider two different scenarios where the covariates are drawn from different distribution families. In the first case, the target covariate is drawn from $N(0, 1)$ and the source covariate is drawn from the standard Cauchy distribution, so the density ratio is uniformly bounded. In the second case, the target covariate is drawn from a Pareto distribution with scale parameter 0.2 and shape parameter 2, and the source covariate is drawn from Student's t-distribution with 3 degrees of freedom, so the density ratio is unbounded but has a bounded second moment.
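The bounded and unbounded behaviour of the density ratio in these two scenarios can be checked directly from the closed-form densities. The sketch below is a minimal illustration, assuming the density ratio is defined as the target density divided by the source density; the helper names (`ratio_bounded`, `ratio_unbounded`, and the density functions) are hypothetical and are not taken from the experiments' code.

```python
import math


def normal_pdf(x):
    """Density of N(0, 1)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def cauchy_pdf(x):
    """Density of the standard Cauchy distribution."""
    return 1.0 / (math.pi * (1.0 + x * x))


def pareto_pdf(x, scale=0.2, shape=2.0):
    """Density of the Pareto distribution; zero below the scale parameter."""
    if x < scale:
        return 0.0
    return shape * scale**shape / x ** (shape + 1.0)


def student_t_pdf(x, df=3.0):
    """Density of Student's t-distribution with df degrees of freedom."""
    const = math.gamma((df + 1.0) / 2.0) / (math.sqrt(df * math.pi) * math.gamma(df / 2.0))
    return const * (1.0 + x * x / df) ** (-(df + 1.0) / 2.0)


def ratio_bounded(x):
    """Case 1: target N(0, 1) over source Cauchy."""
    return normal_pdf(x) / cauchy_pdf(x)


def ratio_unbounded(x):
    """Case 2: target Pareto(0.2, 2) over source t with 3 degrees of freedom."""
    return pareto_pdf(x) / student_t_pdf(x)


# Case 1: the ratio stays below a fixed constant on a fine grid over [-10, 10].
sup_case1 = max(ratio_bounded(k / 100.0) for k in range(-1000, 1001))

# Case 2: the ratio keeps growing as x moves into the right tail.
tail_case2 = [ratio_unbounded(10.0**k) for k in range(0, 5)]

print(sup_case1)
print(tail_case2)
```

In the second case the Pareto density decays like $x^{-3}$ while the $t_3$ density decays like $x^{-4}$, so the ratio grows roughly linearly in $x$, which is why it is unbounded while its second moment under the source distribution remains finite.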
The averaged performances of all the estimators, in terms of $L_1$ and $L_2^2$ prediction errors under the different scenarios, are summarized in Tables 5 and 6. It is clear from Tables 5 and 6 that the obtained numerical results are consistent with those presented in Section 5. Under the uniformly bounded case, the averaged errors of all three estimators are very similar, even when the source and target distributions belong to different families. Under the unbounded case, where the target distribution has heavier tails than the source distribution, both WDQR and PWDQR clearly outperform DQR in most settings, particularly in terms of the $L_2^2$ error. These numerical results further validate our theoretical findings as discussed in Section 4.

Table 5: Averaged $L_1$ and $L_2^2$ errors ($\times 10^{-1}$) based on testing data, with the corresponding standard deviations in brackets, for DQR, WDQR and PWDQR for Model (13) with the bounded density ratio.

| τ | Method | $L_1$ (n = 512) | $L_2^2$ (n = 512) | $L_1$ (n = 2048) | $L_2^2$ (n = 2048) |
|---|---|---|---|---|---|
| 0.05 | DQR | 1.374(0.169) | 0.246(0.062) | 1.102(0.104) | 0.153(0.027) |
| 0.05 | WDQR | 1.373(0.192) | 0.246(0.068) | 1.092(0.093) | 0.151(0.022) |
| 0.05 | PWDQR | 1.428(0.232) | 0.254(0.091) | 1.195(0.194) | 0.162(0.045) |
| 0.25 | DQR | 0.584(0.080) | 0.049(0.013) | 0.540(0.049) | 0.044(0.007) |
| 0.25 | WDQR | 0.562(0.071) | 0.050(0.012) | 0.539(0.047) | 0.044(0.007) |
| 0.25 | PWDQR | 0.598(0.094) | 0.053(0.032) | 0.570(0.072) | 0.049(0.012) |
| 0.5 | DQR | 0.449(0.023) | 0.031(0.004) | 0.421(0.010) | 0.028(0.002) |
| 0.5 | WDQR | 0.446(0.025) | 0.032(0.004) | 0.419(0.013) | 0.027(0.002) |
| 0.5 | PWDQR | 0.468(0.043) | 0.034(0.009) | 0.429(0.031) | 0.029(0.005) |
| 0.75 | DQR | 0.607(0.079) | 0.055(0.013) | 0.555(0.044) | 0.043(0.007) |
| 0.75 | WDQR | 0.605(0.079) | 0.055(0.012) | 0.553(0.050) | 0.044(0.008) |
| 0.75 | PWDQR | 0.613(0.083) | 0.057(0.031) | 0.571(0.063) | 0.047(0.015) |
| 0.95 | DQR | 1.333(0.154) | 0.215(0.043) | 1.044(0.094) | 0.137(0.021) |
| 0.95 | WDQR | 1.334(0.180) | 0.217(0.054) | 1.047(0.091) | 0.141(0.022) |
| 0.95 | PWDQR | 1.459(0.213) | 0.233(0.078) | 1.121(0.176) | 0.151(0.045) |

Table 6: Averaged $L_1$ and $L_2^2$ errors ($\times 10^{-1}$) based on testing data, with the corresponding standard deviations in brackets, for DQR, WDQR and PWDQR for Model (13) with the unbounded density ratio.

| τ | Method | $L_1$ (n = 512) | $L_2^2$ (n = 512) | $L_1$ (n = 2048) | $L_2^2$ (n = 2048) |
|---|---|---|---|---|---|
| 0.05 | DQR | 1.332(0.184) | 0.305(0.262) | 1.089(0.112) | 0.157(0.034) |
| 0.05 | WDQR | 1.205(0.344) | 0.184(0.112) | 0.987(0.109) | 0.124(0.023) |
| 0.05 | PWDQR | 1.223(0.210) | 0.189(0.052) | 1.009(0.143) | 0.128(0.029) |
| 0.25 | DQR | 0.586(0.090) | 0.130(0.260) | 0.551(0.049) | 0.061(0.055) |
| 0.25 | WDQR | 0.515(0.055) | 0.046(0.007) | 0.508(0.043) | 0.043(0.009) |
| 0.25 | PWDQR | 0.549(0.102) | 0.048(0.015) | 0.525(0.060) | 0.043(0.008) |
| 0.5 | DQR | 0.460(0.039) | 0.112(0.260) | 0.423(0.159) | 0.044(0.057) |
| 0.5 | WDQR | 0.424(0.021) | 0.031(0.007) | 0.409(0.011) | 0.028(0.006) |
| 0.5 | PWDQR | 0.433(0.029) | 0.033(0.006) | 0.418(0.034) | 0.031(0.009) |
| 0.75 | DQR | 0.585(0.075) | 0.131(0.259) | 0.542(0.055) | 0.061(0.061) |
| 0.75 | WDQR | 0.523(0.087) | 0.046(0.015) | 0.514(0.055) | 0.042(0.011) |
| 0.75 | PWDQR | 0.539(0.103) | 0.048(0.016) | 0.521(0.059) | 0.043(0.012) |
| 0.95 | DQR | 1.257(0.152) | 0.269(0.259) | 1.046(0.097) | 0.153(0.067) |
| 0.95 | WDQR | 1.234(0.394) | 0.211(0.138) | 1.037(0.115) | 0.135(0.027) |
| 0.95 | PWDQR | 1.238(0.219) | 0.215(0.056) | 1.110(0.353) | 0.156(0.053) |

References

Martin Anthony and Peter L Bartlett. Neural Network Learning: Theoretical Foundations, volume 9. Cambridge University Press, Cambridge, 1999.

Mahsa Baktashmotlagh, Mehrtash T Harandi, Brian C Lovell, and Mathieu Salzmann. Domain adaptation on the statistical manifold. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2481–2488, 2014.

Peter L Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.

Benedikt Bauer and Michael Kohler. On deep learning as a remedy for the curse of dimensionality in nonparametric regression.
The Annals of Statistics, 47(4):2261–2285, 2019.

Alexandre Belloni and Victor Chernozhukov. ℓ1-penalized quantile regression in high-dimensional sparse models. The Annals of Statistics, 39(1):82–130, 2011.

Steffen Bickel, Michael Brückner, and Tobias Scheffer. Discriminative learning under covariate shift. The Journal of Machine Learning Research, 10:2137–2155, 2009.

Howard D Bondell, Brian J Reich, and Huixia Wang. Noncrossing quantile regression curve estimation. Biometrika, 97(4):825–838, 2010.

Léon Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade: Second Edition, pages 421–436. Springer, 2012.

Micah D Carroll, Anca Dragan, Stuart Russell, and Dylan Hadfield-Menell. Estimating and penalizing induced preference shifts in recommender systems. In International Conference on Machine Learning, pages 2686–2708. PMLR, 2022.

Corinna Cortes, Mehryar Mohri, Michael Riley, and Afshin Rostamizadeh. Sample selection bias correction theory. In International Conference on Algorithmic Learning Theory, pages 38–53. Springer, 2008.

Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Learning bounds for importance weighting. Advances in Neural Information Processing Systems, 23, 2010.

Hal Daume III and Daniel Marcu. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26:101–126, 2006.

Tongtong Fang, Nan Lu, Gang Niu, and Masashi Sugiyama. Rethinking importance weighting for deep learning under distribution shift. Advances in Neural Information Processing Systems, 33:11996–12007, 2020.

Max H Farrell, Tengyuan Liang, and Sanjog Misra. Deep neural networks for estimation and inference. Econometrica, 89(1):181–213, 2021.

Xingdong Feng, Xin He, Caixing Wang, Chao Wang, and Jingnan Zhang. Towards a unified analysis of kernel-based methods under covariate shift. Advances in Neural Information Processing Systems, 36:73839–73851, 2023.

Arthur Gretton, Alex Smola, Jiayuan Huang, Marcel Schmittfull, Karsten Borgwardt, and Bernhard Schölkopf. Covariate shift by kernel mean matching. Dataset Shift in Machine Learning, 3:131–160, 2009.

Jessica L Gronsbell and Tianxi Cai. Semi-supervised approaches to efficient evaluation of model prediction performance. Journal of the Royal Statistical Society Series B: Statistical Methodology, 80(3):579–594, 2018.

Hao Guan and Mingxia Liu. Domain adaptation for medical image analysis: a survey. IEEE Transactions on Biomedical Engineering, 69(3):1173–1185, 2021.

Laszlo Gyorfi, Michael Kohler, Adam Krzyzak, and Harro Walk. A Distribution-free Theory of Nonparametric Regression, volume 1. Springer, 2002.

Qiyang Han and Jon A Wellner. Convergence rates of least squares regression estimators with heavy-tailed errors. The Annals of Statistics, 47(4):2286–2319, 2019.

Ali Hassan, Robert Damper, and Mahesan Niranjan. On acoustic emotion recognition: compensating for covariate shift. IEEE Transactions on Audio, Speech, and Language Processing, 21(7):1458–1468, 2013.

Xuming He and Pin Ng. Quantile splines with several covariates. Journal of Statistical Planning and Inference, 75(2):343–352, 1999.

Xuming He and Peide Shi. Convergence rate of B-spline estimators of nonparametric conditional quantile functions. Journal of Nonparametric Statistics, 3(3-4):299–308, 1994.

Jiayuan Huang, Arthur Gretton, Karsten Borgwardt, Bernhard Schölkopf, and Alex Smola. Correcting sample selection bias by unlabeled data. Advances in Neural Information Processing Systems, 19:601–608, 2006.

Jing Jiang and ChengXiang Zhai. Instance weighting for domain adaptation in NLP. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 264–271, 2007.

Yuling Jiao, Guohao Shen, Yuanyuan Lin, and Jian Huang. Deep nonparametric regression on approximate manifolds: Nonasymptotic error bounds with polynomial prefactors. The Annals of Statistics, 51(2):691–716, 2023.

Takafumi Kanamori, Shohei Hido, and Masashi Sugiyama. A least-squares approach to direct importance estimation. The Journal of Machine Learning Research, 10:1391–1445, 2009.

Masahiro Kato and Takeshi Teshima. Non-negative Bregman divergence minimization for deep direct density ratio estimation. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 5320–5333. PMLR, 2021.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2017.

Roger Koenker and Gilbert Bassett Jr. Regression quantiles. Econometrica: Journal of the Econometric Society, 46:33–50, 1978.

Roger Koenker, Pin Ng, and Stephen Portnoy. Quantile smoothing splines. Biometrika, 81(4):673–680, 1994.

Friedrich Liese and Igor Vajda. On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10):4394–4412, 2006.

Jianfeng Lu, Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation for smooth functions. SIAM Journal on Mathematical Analysis, 53(5):5465–5506, 2021.

Cong Ma, Reese Pathak, and Martin J Wainwright. Optimally tackling covariate shift in RKHS-based nonparametric regression. The Annals of Statistics, 51(2):738–761, 2023.

Oscar Hernan Madrid Padilla and Sabyasachi Chatterjee. Risk bounds for quantile trend filtering. Biometrika, 109(3):751–768, 2022.

Neil Rohit Mallinar, Austin Zane, Spencer Frei, and Bin Yu. Minimum-norm interpolation under covariate shift. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 34543–34585. PMLR, 2024.

Oscar Hernan Madrid Padilla, Wesley Tansey, and Yanzhen Chen. Quantile regression with ReLU networks: Estimators and minimax rates. The Journal of Machine Learning Research, 23(1):11251–11292, 2022a.

Oscar Hernan Madrid Padilla, Wesley Tansey, and Yanzhen Chen. Quantile regression with ReLU networks: Estimators and minimax rates. The Journal of Machine Learning Research, 23(1):11251–11292, 2022b.

Philipp Petersen and Felix Voigtlaender. Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Networks, 108:296–330, 2018.

Joaquin Quinonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. Dataset Shift in Machine Learning. MIT Press, 2008.

Alfréd Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 4, pages 547–562, 1961.

Maxime Sangnier, Olivier Fercoq, and Florence d'Alché-Buc. Joint quantile regression in vector-valued RKHSs. Advances in Neural Information Processing Systems, 29:3700–3708, 2016.

Johannes Schmidt-Hieber. Nonparametric regression using deep neural networks with ReLU activation function. The Annals of Statistics, 48(4):1875–1897, 2020.

Guohao Shen, Yuling Jiao, Yuanyuan Lin, Joel L Horowitz, and Jian Huang. Deep quantile regression: Mitigating the curse of dimensionality through composition. arXiv preprint arXiv:2107.04907, 2021.

Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.

Charles J Stone. Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, 10:1040–1053, 1982.

Masashi Sugiyama and Motoaki Kawanabe. Machine Learning in Non-stationary Environments: Introduction to Covariate Shift Adaptation. MIT Press, 2012.

Masashi Sugiyama and Klaus-Robert Müller. Input-dependent estimation of generalization error under covariate shift. Statistics & Decisions, 23(4):249–279, 2005.

Masashi Sugiyama, Matthias Krauledat, and Klaus-Robert Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8(35):985–1005, 2007a.

Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul Buenau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. Advances in Neural Information Processing Systems, 20:1433–1440, 2007b.

Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density-ratio matching under the Bregman divergence: a unified framework of density-ratio estimation. Annals of the Institute of Statistical Mathematics, 64:1009–1044, 2012.

Ichiro Takeuchi, Quoc Le, Timothy Sears, Alexander Smola, et al. Nonparametric quantile estimation. The Journal of Machine Learning Research, 7:1231–1264, 2006.

Yong Kiam Tan, Xinxing Xu, and Yong Liu. Improved recurrent neural networks for session-based recommendations. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pages 17–22, 2016.

Rohan Taori, Achal Dave, Vaishaal Shankar, Nicholas Carlini, Benjamin Recht, and Ludwig Schmidt. Measuring robustness to natural distribution shifts in image classification. Advances in Neural Information Processing Systems, 33:18583–18599, 2020.

A Torralba and AA Efros. Unbiased look at dataset bias. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, pages 1521–1528, 2011.

Alexandre B Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.

AW van der Vaart and Jon A Wellner. Empirical Processes. Springer, 2023.

Sara A van de Geer. Empirical Processes in M-estimation. Cambridge University Press, 2000.

Aad W Van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.

Ramon van Handel. Probability in High Dimension. APC 550 Lecture Notes, Princeton University, December 2016.

Roman Vershynin. High-dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.

Huixia Judy Wang, Deyuan Li, and Xuming He. Estimation of high conditional quantiles for heavy-tailed distributions. Journal of the American Statistical Association, 107(500):1453–1464, 2012.

Halbert White. Nonparametric estimation of conditional quantiles using neural networks. In Computing Science and Statistics, pages 190–199. Springer, 1992.

Renzhe Xu, Xingxuan Zhang, Zheyan Shen, Tong Zhang, and Peng Cui. A theoretical analysis on independence-driven importance weighting for covariate-shift generalization. In Proceedings of the 39th International Conference on Machine Learning, volume 162, pages 24803–24829. PMLR, 2022.

Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.