# distributionfree_modelagnostic_regression_calibration_via_nonparametric_methods__fa967fff.pdf

Distribution-Free Model-Agnostic Regression Calibration via Nonparametric Methods

Shang Liu Imperial College Business School Imperial College London s.liu21@imperial.ac.uk

Zhongze Cai Imperial College Business School Imperial College London z.cai22@imperial.ac.uk

Xiaocheng Li Imperial College Business School Imperial College London xiaocheng.li@imperial.ac.uk

In this paper, we consider the uncertainty quantification problem for regression models. Specifically, we consider an individual calibration objective for characterizing the quantiles of the prediction model. While such an objective is well-motivated from downstream tasks such as newsvendor cost, the existing methods have been largely heuristic and lack of statistical guarantee in terms of individual calibration. We show via simple examples that the existing methods focusing on populationlevel calibration guarantees such as average calibration or sharpness can lead to harmful and unexpected results. We propose simple nonparametric calibration methods that are agnostic of the underlying prediction model and enjoy both computational efficiency and statistical consistency. Our approach enables a better understanding of the possibility of individual calibration, and we establish matching upper and lower bounds for the calibration error of our proposed methods. Technically, our analysis combines the nonparametric analysis with a covering number argument for parametric analysis, which advances the existing theoretical analyses in the literature of nonparametric density estimation and quantile bandit problems. Importantly, the nonparametric perspective sheds new theoretical insights into regression calibration in terms of the curse of dimensionality and reconciles the existing results on the impossibility of individual calibration. To our knowledge, we make the first effort to reach both individual calibration and finite-sample guarantee with minimal assumptions in terms of conformal prediction. Numerical experiments show the advantage of such a simple approach under various metrics, and also under covariates shift. We hope our work provides a simple benchmark and a starting point of theoretical ground for future research on regression calibration.

1 Introduction

Modern machine learning methods have witnessed great success on a wide range of tasks in the past decade for their high accuracy in dealing with various kinds of complicated data. Uncertainty quantification has played an important role in the interpretation and risk assessment of machine learning models for downstream tasks such as optimization and decision-making. People observe

Equal contribution.

37th Conference on Neural Information Processing Systems (Neur IPS 2023).

that the output of deep learning models tends to be over over-confident [Guo et al., 2017, Amodei et al., 2016], which inspires many recent works on uncertainty calibration for classification problems [Kumar et al., 2019, Wenger et al., 2020, Luo et al., 2022]. Comparatively, there has been less systematic theoretical understanding of the uncertainty calibration for regression problems. For a regression problem, the output of a prediction model can be the prediction of the conditional mean (by minimizing mean squared error) or of the conditional median (by minimizing mean absolute error). But such a single-point prediction cannot characterize the uncertainty, which motivates the development of uncertainty calibration methods for regression problems. Some efforts have been made in order to estimate the entire distribution, which includes Bayesian ways [Mac Kay, 1992, Damianou and Lawrence, 2013], the frequentists ways [Kendall and Gal, 2017, Lakshminarayanan et al., 2017, Cui et al., 2020, Zhao et al., 2020], and the nonparametric ways [Lei and Wasserman, 2014, Lei et al., 2018, Bilodeau et al., 2021, Song et al., 2019]. Another easier task is to predict the quantiles rather than the whole distribution. Existing ways mainly focus on completing the task of quantile prediction in one go [Pearce et al., 2018, Thiagarajan et al., 2020, Chung et al., 2021, Takeuchi et al., 2006, Stainwart and Christmann, 2011]. However, all those methods suffer from either statistical inconsistency under model misspecification or computational intractability during the training phase, or sometimes even both (see the detailed review in Appendix A).

In this paper, we suggest that previous ways of training a new quantile prediction model from scratch while discarding the pre-trained (mean) regression model may not be the best way for a quantile calibration objective, because the pre-trained model (though designed for a mean estimation objective) can be helpful for the quantile calibration. We propose a simple, natural, but theoretically non-trivial method that divides the whole quantile calibration task into two steps: (i) train a good regression model and (ii) estimate the conditional quantiles of its prediction error. Although a similar idea is applied in split conformal prediction[Papadopoulos et al., 2002, Vovk et al., 2005, Lei et al., 2018], the theoretical justification is still in a lack since the conformal prediction only requires an average calibration guarantee (see Definition 1) that can also be achieved without this first training step at all (where the detailed review is given in Appendix A). By a careful analysis of the individual calibration objective (see Definition 3), we capture the intuition behind the two-step procedure and formalize it in a mathematical guarantee. After a comprehensive numerical experiment on real-world datasets against existing benchmark algorithms, we suggest that one neither needs to estimate the whole distribution nor to train a new quantile model from scratch, while a pre-trained regression model and a split-and-subtract suffice, both theoretically and empirically. Our contribution can be summarized below:

First, we propose a simple algorithm that can estimate all percent conditional quantiles simultaneously. We provide the individual consistency of our algorithm and prove the minimax optimal convergence rate with respect to the mean squared error (Theorem 1 and 2). Our analysis is new and it largely relaxes the assumptions in the existing literature on kernel density estimation and order statistics. By showing the necessity of the Lipschitz assumption (Theorem 6), our result uses minimal assumptions to reach both finite sample guarantee and individual calibration, and our paper is the first to keep the latter two goals simultaneously up to our knowledge.

Second, we propose a two-step procedure of estimating mean + quantile of error rather than directly estimating the conditional quantiles, which enables a faster convergence rate both theoretically and empirically. Specifically, our convergence rate is of order O(L 2d d+2 n 2 d+2 ), where L is the Lipschitz constant of the conditional quantile with respect to the features, n is the number of samples, and d is the dimension of feature. Since the conditional mean function and the conditional quantile function are highly correlated, one can greatly reduce the Lipschitz constant by subtracting the mean from the quantile.

Moreover, we construct several simple examples to show the unexpected behavior of the existing calibration methods, suggesting that a population-level criterion such as sharpness or MMD can be misleading. As our analysis works as a positive result on individual calibration, we also provide a detailed discussion about the existing results on the impossibility of individual calibration, illustrating that their definitions of individual calibration are either impractical or too conservative.

2 Problem Setup

Consider the regression calibration problem for a given pre-trained model ˆf. We are given a dataset {(Xi, Yi)}n i=1 X Y independent of the original data that trains ˆf(x). Here X [0, 1]d Rd denotes the covariate/feature space and Y R denotes the response/label space. The i.i.d. samples {(Xi, Yi)}n i=1 follow an unknown distribution P on X Y. We aim to characterize the uncertainty of the regression error Ui := Yi ˆf(Xi) in a model agnostic manner, i.e., without any further restriction on the choice of the underlying prediction model ˆf.

Now we introduce several common notions of regression calibration considered in the literature. We first state all the definitions with respect to the error Ui and then establish equivalence with respect to the original Yi. Define the τ -th quantile of a random variable Z as Qτ(Z) := inf{t : F(t) τ} for any τ [0, 1] where F is the c.d.f. of Z. Accordingly, the quantile of the conditional distribution U|X = x is denoted by Qτ(U|X = x). A quantile prediction/calibration model is denoted by ˆQτ(x) : X R which gives the τ-quantile prediction of U given X = x.

Definition 1 (Marginal/Average Calibration). The model ˆQτ(x) is marginally calibrated if

P(U ˆQτ(X)) = τ

for any τ [0, 1]. Here the probability distribution is with respect to the joint distribution of (U, X).

As noted by existing works, the marginal calibration requirement is too weak to achieve the goal of uncertainty quantification. A predictor that always predicts the marginal quantile of U for any X = x, i.e., ˆQτ(x) Qτ(U), is always marginally calibrated, but such a model does not capture the heteroscedasticity of the output variable Y or the error U with respect to X.

Definition 2 (Group Calibration). For some pre-specified partition X = X1 XK, the model ˆQτ(x) is group calibrated if P(U ˆQτ(X)|X Xk) = τ

for any τ [0, 1] and k = 1, ..., K.

Group calibration is a stronger notion of calibration than marginal calibrations. It is often considered in a related but different problem called conformal prediction [Vovk, 2012, Lei and Wasserman, 2014, Alaa et al., 2023] where the goal is to give a sharp covering set ˆC(X) (or covering band if unimodality is assumed) such that P(Y ˆC(X)|X Xk) 1 α, k. Foygel Barber et al. [2021] consider the case where the guarantee is made for all P(X Xk) δ for some δ > 0.

Definition 3 (Individual Calibration). The model ˆQτ(x) is individually calibrated if

P(U ˆQτ(X)|X = x) = τ

for any τ [0, 1] and x X. Here the probability distribution is with respect to the conditional distribution U|X.

In this paper, we consider the criteria of individual calibration, an even stronger notion of calibration than the previous two. It requires the calibration condition to hold with respect to any x X in a pointwise manner. The difference between the conformal prediction problem with group calibration and the regression calibration problem with individual calibration is that conformal prediction does not make any assumptions on the data distribution, while our lower bound result of Theorem 6 suggests that without any assumption, the convergence rate of the conditional quantile estimator can be arbitrarily slow. The conformal prediction literature considers a weaker notion of calibration than the individual calibration setting considered in this paper where some mild assumptions are made.

Finally, the following proposition says that any quantile prediction model on the error U is equivalent to a corresponding quantile prediction on Y . The result is stated for the individual calibration, while a similar statement can be made for marginal calibration and group calibration as well.

Proposition 1. For any predictor ˆQτ(x) on the τ -th quantile of U|X = x, we have

P Y ˆf(X) + ˆQτ(X) X = x = P U ˆQτ(X) X = x .

0 2 4 6 8 10 12 14

Observations

Predictions

95\% Interval

0 2 4 6 8 10 12 14

Deep En%e ble

Ob%ervation%

Prediction%

95\% Interval

0 2 4 6 8 10 12 14

Our Algorith NRC

Ob%ervation%

Prediction%

95\% Interval

Figure 1: Synthetic data where the underlying distribution is obtained by a combination of sine functions. The solid lines denote predicted means, the shaded area denotes predicted intervals between 97.5% and 2.5% quantiles, and the yellow dots denote a subset of real observations. The leftmost plot gives the real mean as well as oracle quantile values, while the rest two plots are predictions from different calibration models. The middle plot is produced by a Deep Ensemble [Lakshminarayanan et al., 2017] of 5 HNNs trained with 40,000 samples, which is both a common benchmark and a building block for several regression calibration methods. The rightmost plot is produced by our proposed nonparametric calibration method Algorithm 2 NRC of which the base regression model is an ordinary feed-forward regression network. The detailed setup is given in Appendix E.2.

2.1 Motivating examples

In addition to the above calibration objective, existing literature also considers sharpness (the average length of confidence intervals [Zhao et al., 2020, Chung et al., 2021]) and maximum mean discrepancy (MMD) principle [Cui et al., 2020]. To the best of our knowledge, all the existing works consider population-level objectives such as marginal calibration, sharpness, and MMD when training the calibration model, while we are the first work that directly aims for individual calibration. Here we use a simple example to illustrate that such population-level objectives may lead to undesirable calibration results, and in Appendix B, we elaborate with more examples on the inappropriateness of existing methods for regression calibration.

Example 1. Consider X Unif[0, 1] and the conditional distribution Y |X = x Unif[0, x]. Then if one aims for τ = 90%, the outcome from sharpness maximization subject to marginal calibration is to ˆQτ(x) = x for x [0, 0.9] and ˆQτ(x) = 0 for x (0.9, 1]. Consequently, the quantile prediction has a 100% coverage over x [0, 0.9] but 0% coverage over x (0.9, 1].

Generally, the population-level criteria such as sharpness or MMD may serve as performance metrics to monitor the quality of a calibration outcome. However, the inclusion of such criteria in the objective function will encourage the calibration result to be over-conservative for the low variance region (x [0, 0.9]) while giving up the high variance (x (0.9, 1]), which is highly undesirable for risk-sensitive applications and/or fairness considerations.

Besides, individual calibration is sometimes also necessitated by the downstream application. For example, in some decision-making problems, regression calibration serves as a downstream task where loss is measured by the pinball loss; say, the newsvendor problem [Kuleshov et al., 2018, Ban and Rudin, 2019] in economics and operations research uses pinball loss to trade-off the backorder cost and the inventory cost.

Proposition 2. The pinball loss with respect to τ is defined by

lτ(u1, u2) := (1 τ)(u1 u2) 1{u1 > u2} + τ(u2 u1) 1{u1 < u2}.

We have Qτ(Y |X = x) = ˆf(x) + Qτ(U|X = x) arg min y R E[lτ(y, Y )|X = x].

Proposition 2 serves as the original justification of pinball loss for quantile regression. More importantly, it says that a calibration method that is inconsistent with respect to individual calibration can be substantially suboptimal for downstream tasks such as the newsvendor problem.

3 Main Results

3.1 Generic nonparametric quantile estimator

In this subsection, we present and analyze a simple algorithm of a nonparametric quantile estimator, which can be of independent interest itself and will be used as a building block for the regression calibration. The algorithm takes the dataset {(Xi, Ui)}n i=1 as input and outputs a τ-quantile estimation of the conditional distribution U|X = x for any τ and x.

Algorithm 1 Simple Nonparametric Quantile Estimator

Input: τ (0, 1), dataset {(Xi, Ui)}n i=1, kernel choice κh( , ) 1: Estimate the distribution of U|X = x by (where δu is a point mass distribution at u)

ˆPU|X=x = Pn i=1 κh(x, Xi)δUi Pn i=1 κh(x, Xi) .

2: Output the conditional τ -th quantile by minimizing the pinball loss on ˆPU|X=x

ˆQSNQ τ (x) = arg min u EU ˆ PU|X=x[lτ(u, U)]. (1)

The minimum of the optimization problem in Algorithm 1 is indeed achieved by the τ-quantile of the empirical distribution ˆPU|X=x. In the following, we analyze the theoretical properties of the algorithm under the naive kernel

κh(x1, x2) = 1{ x1 x2 h}.

where h is the hyper-parameter for window size, and x1 x2 denotes Euclidean distance.

Assumption 1. We assume the following on the joint distribution (U, X) and τ (0, 1) of interest:

(a) (Lipschitzness). The conditional quantile function is L-Lipschitz with respect to x,

|Qτ(U|X = x1) Qτ(U|X = x2)| L x1 x2 , x1, x2 X.

(b) (Boundedness). The quantile Qτ(U|X = x) is bounded within [ M, M] for all x X.

(c) (Density). There exists a density function px(u) for the conditional probability distribution U|X = x, and the density function is uniformly bounded away from zero in a neighborhood of the interested quantile. That is, there exist constants p and r,

px(u) p, x X and |u Qτ(U|X = x)| r.

In Assumption 1, part (a) and part (b) impose Lipschitzness and boundedness for the conditional quantile, respectively. Part (c) requires the existence of a density function and a lower bound for the density function around the quantile of interest. Its aim is to ensure a locally strong convexity for the expected risk EU PU|X=x[lτ(u, U)]. Under Assumption 1, we establish consistency and convergence rate for Algorithm 1.

Theorem 1. Under Assumption 1, Algorithm 1 is statistically consistent, i.e., for any ϵ > 0,

lim n P ˆQSNQ τ (X) Qτ(U|X) ϵ = 0.

Furthermore, by choosing h = Θ(L 2 d+2 n 1 d+2 ), we have for sufficiently large n C, when L > 0,

E ˆQSNQ τ (X) Qτ(U|X) 2 C L 2d d+2 n 2 d+2 ,

where C and C depends polynomially on 1

r, M, and log(n). In addition, whens L = 0, the

right-hand-side becomes O( 1

While Algorithm 1 seems to be a natural algorithm for nonparametric quantile regression, Theorem 1 provides the first convergence rate for such a nonparametric quantile estimator, to the best of our knowledge, in the literature of nonparametric regression; also, it is the first theoretical guarantee towards individual calibration for regression calibration. The standard analysis of nonparametric mean estimation [Györfi et al., 2002] cannot be applied here directly because the algorithm involves a local optimization (1). Even more challenging, the optimization problem (1) aims to optimize the quantile of the conditional distribution U|X = x but the samples used in (1) are from distributions U|X = xi, which causes a non-i.i.d. issue. To resolve these issues, we combine the idea of bias and variance decomposition in nonparametric analysis with a covering concentration argument for the local optimization problem. The detailed proof is deferred to Appendix H.

Theorem 1 also complements the existing results on finite-sample analysis for quantile estimators. One line of literature [Szorenyi et al., 2015, Altschuler et al., 2019] establishes the quantile convergence from the convergence of empirical processes, and this requires additional assumptions on the density function and does not permit the non-i.i.d. structure here. The quantile bandits problem also entails the convergence analysis of quantile estimators; for example, Zhang and Ong [2021] utilize the analysis of order statistics [Boucheron and Thomas, 2012], and the analysis inevitably requires a non-decreasing hazard rate for the underlying distribution. Other works [Bilodeau et al., 2021] that follow the kernel density estimation approach require even stronger conditions such as realizability. We refer to Appendix D for more detailed discussions. In comparison to these existing analyses, our analysis is new, and it only requires a minimal set of assumptions. The necessity of the assumptions can be justified from the following lower bound.

Theorem 2 rephrases the lower bound result in the nonparametric (mean) regression literature [Györfi et al., 2002] for the quantile estimation problem. It states that the convergence rate in Theorem 1 is minimax optimal (up to poly-logarithmic terms). It is easy to verify that the class PL satisfies Assumption 1. In this light, Theorem 1 and Theorem 2 establish the criticalness of Assumption 1. Furthermore, in Theorem 6 in Appendix C, we establish that the convergence rate of any estimator can be arbitrarily slow without any Lipschitzness assumption on the conditional statistics. Theorem 2. Let PL be the class of distributions of (X, U) such that X Uniform[0, 1]d, U = µ(X) + N, and µ(x) is L-Lipschitz where N is an independent standard Gaussian random variable. For any algorithm, there exists a distribution in the class PL such that the convergence rate of the algorithm is Ω(L 2d d+2 n 2 d+2 ).

Our result here provides a guarantee for the estimation of Qτ(U|X) and thus a positive result for individual calibration with respect to a specific τ. We note that the result does not contradict the negative results on individual calibration [Zhao et al., 2020, Lei and Wasserman, 2014]. Zhao et al. [2020] measure the quality of individual calibration via the closeness of the distribution ˆFX(Y ) to the uniform distribution. In fact, such a measurement is only meaningful for a continuous distribution of Y |X, but they prove the impossibility result based on a discrete distribution of Y |X. So, their negative result on individual calibration does not exclude the possibility of individual calibration for a continuous Y |X. Lei and Wasserman [2014] require an almost surely guarantee based on finite observations and prove an impossibility result. This is a very strong criterion; our results in Theorem 1 are established for two weaker but more practical settings: either the strong consistency in the asymptotic case as n or a mean squared error guarantee under finite-sample. We defer more discussions to Appendix C.

3.2 Regression calibration

Now we return to regression calibration and build a calibrated model from scratch. Specifically, Algorithm 2 splits the data into two parts; it uses the first part to train a prediction model and the second part to calibrate the model with Algorithm 1. The first part can be skipped if there is already a well-trained model ˆf where it becomes a recalibration problem.

The theoretical property of Algorithm 2 can be established by combining the results from Section 3.1 with Proposition 1. In Algorithm 2, the quantile calibration allows full flexibility in choosing the regression model ˆf and does not require an associated uncertainty model to proceed. The motivation for calibrating the error Ui instead of the original Yi is two-fold: First, if one treats Yi itself as Ui and applies Algorithm 1, then it essentially restricts the prediction model to be a nonparametric one. Second, the reduction from Yi to Ui gives a better smoothness of the conditional distribution (a

Algorithm 2 Nonparametric Regression Calibration

Input: Dataset {(Xi, Yi)}n i=1, kernel choice κh( , ), τ 1: Split the dataset into half: D1 = {(Xi, Yi)}n1 i=1, D2 = {(Xi, Yi)}n i=n1+1.

2: Use D1 to train a regression model ˆf. 3: Calculate the estimation error of ˆf on D2: Ui = Yi ˆf(Xi), i = n1 + 1, , n. 4: Run Algorithm 1 on the data {(Xi, Ui)}n i=n1+1 and obtain ˆQSNQ τ ( ).

5: Return ˆf( ) + ˆQSNQ τ ( ).

smaller Lipschitz constant in Assumption 1 (a)). And thus it will give a faster convergence rate. We remark that Algorithm 2 induces an independence between the training of the prediction model and that of the calibration model. The design is intuitive and important because otherwise, it may result in predicting smaller confidence intervals both theoretically and empirically.

3.3 Implications from the nonparametric perspective

The nonparametric approach gives a positive result in terms of the possibility of individual calibration, but it pays a price with respect to the dimensionality d. The following theorem states that such a dimensionality problem can be addressed under conditional independence. And more generally, it provides a guideline on which variables one should use to perform calibration tasks.

Theorem 3. Suppose U = Y ˆf(X). Suppose Z = m(X) Z, where m is some measurable function, and Z X is a d0-dimensional subspace of X. If X and U are mutually independent conditioned on Z (i.e. X Z U is a Markov chain), then it is lossless to perform calibration using only (Ui, Zi) s. In particular, applying Algorithm 1 on (Ui, Zi) s will yield the same consistency but with a faster convergence rate of O(n 2 d0+2 ).

Theoretically, one can identify such Z by independence test for each component of X and Y ˆf(X) on the validation set. Some other practical methods such as random projection and correlation screening selection are shown in Section 4. When the conditional independence in Theorem 3 does not hold, it is still not the end of the world if one calibrates (Ui, Zi). The following theorem tells that conditional calibration with respect to Z, as an intermediate between marginal calibration and individual calibration, can be achieved.

Theorem 4. Suppose U = Y ˆf(X) and Z is a sub-feature of X. Without loss of generality, we assume that Z = Πd0(X), where Πd0 projects from X onto a d0-dimensional subspace Z. Then

P(U Qτ(U|Z) Z) = τ,

which implies that P(Y ˆf(X) + Qτ(U|Z) Z) = τ.

A notable implication from the above theorem is that if we want to calibrate the model against certain sub-features such as age, gender, geo-location, etc., we can summarize these features in Z and use Algorithm 1 with respect to (Ui, Zi).

However, if we return to the original pinball loss, the following theorem inherits from Proposition 2 and states that the inclusion of more variables will always be helpful in terms of improving the pinball loss. It highlights a trade-off between the representation ability of the quantile calibration model (inclusion of more variables)and its generalization ability (slowing down the convergence rate with higher dimensionality).

Theorem 5. Suppose U = Y ˆf(X). Suppose Z(d1) and Z(d2) are the first d1 and d2 components of X. Suppose Qτ(FZ(di)) is the true τ -th quantile of U conditioned on Z(di), i = 1, 2. Suppose Qτ(Fmargin) is the true τ -th marginal quantile of U. If 0 < d1 < d2 < d, then

E[lτ( ˆf(X) + Qτ(U), Y )] E[lτ( ˆf(X) + Qτ(U|Z(d1)), Y )]

E[lτ( ˆf(X) + Qτ(U|Z(d2)), Y )] E[lτ( ˆf(X) + Qτ(U|X), Y )].

4 Numerical Experiments

In this section, we evaluate our methods against a series of benchmarks. We first introduce the evaluation metrics that are widely considered in the literature of uncertainty calibration and conformal prediction, including Mean Absolute Calibration Error (MACE) which calculates the average absolute error for quantile predictions from 0.01 to 0.99; Adversarial Group Calibration Error (AGCE) which finds the sub-group of the test data with the largest MACE; Check Score which shows the empirical pinball loss; Length which measures the average length for the constructed 0.05-0.95 intervals; and Coverage which reflects the empirical coverage rate for those 0.05-0.95 intervals. Formal definitions of these calibration measurements are provided in Appendix E.1.

The benchmark methods are listed as follows (the details can be found in Appendix E.4). For the Gaussian-based methods, we implement the vanilla Heteroscedastic Neural Network (HNN) [Kendall and Gal, 2017], Deep Ensembles (Deep Ensemble) [Lakshminarayanan et al., 2017], and Maximum Mean Discrepancy method (MMD) [Cui et al., 2020]. We also implement MC-Dropout (MCDrop) [Gal and Ghahramani, 2016] and Deep Gaussian Process model (DGP) [Damianou and Lawrence, 2013]. For models that do not rely on Gaussian assumption, we implement the combined calibration loss (CCL) introduced in Sec. 3.2 of [Chung et al., 2021]. We also implement post-hoc calibration methods, such as isotonic regression (ISR) suggested by [Kuleshov et al., 2018] with a pre-trained HNN model. Another post-hoc method named Model Agnostic Quantile Regression (MAQR) method in Sec. 3.1 of Chung et al. [2021] is also implemented. For conformal prediction algorithms, we implement Conformalized Quantile Regression (CQR) [Romano et al., 2019] and Orthogonal Quantile Regression (OQR) [Feldman et al., 2021].

As our proposed nonparametric regression calibration (NRC) algorithm is agnostic with respect to the underlying prediction model, we implement a feed-forward neural network and a random forest model as the prediction model for NRC. Also, we apply the dimension reduction idea (called NRC-DR) including random projection and covariate selection where all technical details are given in Appendix E.3 and E.5. All codes are available at https://github.com/Zhongze Cai/NRC.

In the following subsections, we summarize our results on UCI datasets, time series, and highdimensional datasets. Additional experiments on covariate shift are deferred to E.7.

4.1 UCI dataset experiments

We evaluate our NRC algorithms on the standard 8 UCI datasets Dua and Graff [2017]. Part of the experiment results on representative benchmark models is given in Table 1, and the full results are presented in Appendix E.5 due to space limit. Additional details on dataset description, hyperparameter selection, and early stopping criteria are given in Appendix E.5.

The result supports the competitiveness of our algorithm. In terms of MACE, our methods achieve the minimum loss on 6 out of 8 datasets, with the non-optimal results close to the best performance seen on other benchmarks. The rest metrics also witness our algorithms outperforming the benchmark models, with a superior result on 6 datasets in terms of AGCE, and 4 in terms of Check Score. Note that even for the rest 4 datasets where our NRC algorithm does not achieve the best Check Score (the empirical pinball loss), our performance is still competitive compared to the best score, while for some specific datasets (such as Energy and Yacht) we achieve a significant improvement over the Check Score, which verifies the potential of our methods for the related downstream tasks. As for the Length metric (which evaluates the average interval length of a symmetric 90% confidence interval), it is not easy to trade-off between the sharpness requirement and the coverage rate due to the fact that one can trivially sacrifice one for the other, so we omit the direct comparison. However, we can still observe that our NRC algorithm behaves reasonably well, with a stable coverage rate performance and a relatively sharp interval length amongst the rest benchmarks.

4.2 Bike-sharing time series dataset

This experiment explores the potential of our algorithm to quantify the uncertainty of the time series objective, where independent variables are aggregated in a sliding window fashion from the past few days. The Bike-Sharing dataset provided in Fanaee-T and Gama [2014] is used to visually evaluate our proposed algorithm. Due to the limitations in the representation ability of the feed-forward structure, we instead deploy LSTM networks for the time series data. We design LSTM-HNN,

Table 1: Experiments on the standard UCI datasets. Each experiment (for a combination of dataset and method) is repeated 5 times in order to estimate the range of value fluctuation of each metric. On each dataset, the minimum loss for each metric is marked in bold and red, while we omit the comparisons on the Length and the Coverage due to the difficulty of trading off between sharper intervals and better coverage intervals. Full experiment details are deferred to Appendix E.

Dataset Metric MCDrop Deep Ensemble ISR MAQR CQR NRC NRC-DR

MACE 0.14 0.03 0.06 0.04 0.057 0.03 0.051 0.02 0.049 0.02 0.048 0.02 0.055 0.02 AGCE 0.22 0.04 0.27 0.07 0.27 0.08 0.25 0.06 0.24 0.06 0.19 0.05 0.19 0.05 Check Score 1.9 0.2 1 0.3 0.95 0.2 1.1 0.2 0.72 0.1 1.1 0.2 1.1 0.2 Length(Coverage) 8.6 1(95%) 8.5 1(90%) 7.9 1(88%) 9.1 1(88%) 7.1 0.7(82%) 10 1(94%) 9.7 1(91%)

MACE 0.078 0.02 0.053 0.03 0.044 0.02 0.049 0.03 0.059 0.03 0.047 0.02 0.038 0.03 AGCE 0.2 0.04 0.18 0.03 0.19 0.05 0.18 0.06 0.23 0.05 0.18 0.05 0.19 0.04 Check Score 3.6 0.2 1.9 0.3 2.3 0.6 2 0.2 1.72 0.6 1.4 0.2 2 0.4 Length(Coverage) 25 4(95%) 19 1(93%) 20 0.8(89%) 17 2(79%) 20 0.7(89%) 21 0.8(90%) 19 0.9(92%)

MACE 0.16 0.02 0.065 0.04 0.055 0.03 0.063 0.02 0.075 0.03 0.037 0.008 0.04 0.01 AGCE 0.21 0.01 0.19 0.04 0.21 0.07 0.21 0.04 0.18 0.008 0.22 0.06 0.24 0.08 Check Score 1.4 0.3 0.64 0.09 0.6 0.1 0.6 0.2 0.14 0.03 0.18 0.03 0.69 0.2 Length(Coverage) 9 4(93%) 7.7 0.6(94%) 7.1 2(95%) 6.2 3(84%) 1.8 0.3(95%) 4.5 1(88%) 1.5 0.1(89%)

MACE 0.034 0.01 0.077 0.05 0.013 0.005 0.032 0.02 0.041 0.02 0.013 0.004 0.017 0.007 AGCE 0.075 0.01 0.12 0.06 0.072 0.02 0.072 0.02 0.075 0.02 0.067 0.02 0.056 0.02 Check Score 0.045 0.002 0.027 0.002 0.031 0.001 0.029 0.001 0.025 0.0009 0.032 0.002 0.032 0.002 Length(Coverage) 0.5 0.02(95%) 0.29 0.006(93%) 0.28 0.007(90%) 0.27 0.01(87%) 0.26 0.005(89%) 0.25 0.003(90%) 0.52 0.007(90%)

MACE 0.1 0.03 0.12 0.05 0.011 0.005 0.043 0.02 0.077 0.07 0.012 0.005 0.017 0.007 AGCE 0.15 0.05 0.14 0.06 0.054 0.009 0.081 0.01 0.098 0.05 0.052 0.008 0.052 0.008 Check Score 0.0018 0.0006 0.00064 0.0002 0.00079 0.0003 0.0021 0.0007 0.0011 0.0009 0.00088 0.0001 0.002 0.0007 Length(Coverage) 0.02 0.002(94%) 0.015 0.0003(100%) 0.0078 0.0008(90%) 0.017 0.003(84%) 0.015 0.002(98%) 0.0097 0.002(91%) 0.0088 0.0006(91%)

MACE 0.29 0.01 0.045 0.03 0.011 0.006 0.029 0.02 0.049 0.005 0.0086 0.003 0.0086 0.003 AGCE 0.3 0.01 0.075 0.03 0.065 0.02 0.06 0.01 0.071 0.02 0.057 0.007 0.053 0.008 Check Score 24 3 1.3 0.1 1.3 0.1 1.2 0.03 1.1 0.03 1 0.03 1.2 0.05 Length(Coverage) 20 3(97%) 16 0.5(97%) 13 0.8(90%) 12 0.8(88%) 13 0.7(93%) 13 0.3(91%) 11 0.1(90%)

MACE 0.036 0.02 0.049 0.03 0.016 0.009 0.029 0.02 0.056 0.007 0.016 0.01 0.017 0.006 AGCE 0.1 0.03 0.12 0.05 0.091 0.02 0.094 0.03 0.1 0.06 0.075 0.03 0.075 0.02 Check Score 0.2 0.01 0.2 0.009 0.2 0.01 0.2 0.02 0.21 0.02 0.19 0.01 0.21 0.01 Length(Coverage) 2.4 0.09(90%) 2.2 0.02(88%) 2.2 0.03(87%) 2.3 0.1(84%) 2.3 0.04(89%) 2.3 0.04(90%) 2.2 0.02(90%)

MACE 0.11 0.03 0.049 0.02 0.06 0.03 0.088 0.04 0.073 0.05 0.096 0.05 0.087 0.04 AGCE 0.29 0.08 0.35 0.06 0.34 0.09 0.31 0.07 0.34 0.04 0.34 0.07 0.31 0.04 Check Score 1.9 0.6 1.1 0.4 1.5 0.6 0.32 0.09 0.19 0.07 0.18 0.03 0.4 0.2 Length(Coverage) 18 5(94%) 10 1(91%) 10 0.9(88%) 3.3 1(83%) 1.7 0.4(72%) 5.3 1(93%) 3.2 1(91%)

which is an ordinary LSTM network followed by a fully connected layer with 2 output units; and LSTM-NRC, whose regression model shares the same LSTM setting as LSTM-HNN except for the ending layer having only one output unit (without variance/uncertainty unit). For more details, see Appendix E.6.

0 25 50 75 100 125 150 175 200

Observations

Predictions

95\% Interval

0 25 50 75 100 125 150 175 200

Observations

Predictions

95\% Interval

Figure 2: Experiments on the Bike-Sharing dataset. The demonstrated time stamps are consecutively taken from the test set, thus temporally separated from the training set. The top figure gives the predicted 95% confidence intervals of an LSTM-HNN model, and the bottom figure is produced from a random projection variant of our NRC method (LSTM-NRC). The target variable shows strong periodicity, which both two models capture. Nevertheless, our NRC method delivers sharper confidence intervals and is better adapted to the underlying distribution shift. In particular, our method also has a better coverage rate on the extreme values at time stamps between 100 and 200.

Figure 2 visually compares the two models on the 95% confidence interval prediction task on the test set (see Appendix E.6 for how the test set is split out). A first observation is that our LSTM-NRC is producing sharper and better calibrated 95% intervals. Also, in spite of the satisfying coverage rate of both the methods on time range 0 to 100 (time stamp -1 is exactly the terminating point of the training set, thus time range 0 to 100 roughly denotes the near future of the historical events), on time range 100 to 200 the behaviors of the two models diverge. After an evolution of 100 time stamps the target variable (rental count per hour) is under a shift in distribution: the peak values get higher than historical records would foresee. Confidence intervals produced by LSTM-HNN fail to cover these "extremer" peaks, while our LSTM-NRC method could easily adapt to it, taking advantage of a high-quality regression model. Note that both LSTM-HNN and LSTM-NRC share the same regression architecture, we argue that the additional variance unit of LSTM-HNN may hurt the mean prediction performance of the model.

4.3 Higher-dimensional datasets

One major drawback of the non-parametric type methods is its unsatisfying dependence on the dimension of the data. Our paper attacks such an obstacle from two aspects simultaneously: on the one hand, we subtract a regression model from the original data to gain a smaller Lipschitz constant L; on the other hand, we apply several dimensional reduction methods to reduce the data dimension manually. In this paper, we mainly implement three of them: random projection, covariate selection, and second-to-last layer feature extraction, of which the details can be found in Appendix E.3. We call the variants of our NRC algorithm NRC-RP, NRC-Cov, and NRC-Embed, respectively. We examine them against several high-dimensional datasets: the medical expenditure panel survey (MEPS) datasets [panel 19, 2017, panel 20, 2017, panel 21, 2017], as suggested in [Romano et al., 2019]. The datasets details are also deferred to Appendix E.3.

Dataset Metric Deep Ensemble ISR CQR OQR NRC-RP NRC-Embed

MACE 0.26 0.04 0.12 0.05 0.05 0.01 0.055 0.02 0.024 0.01 0.0087 0.006 AGCE 0.26 0.04 0.14 0.04 0.079 0.01 0.091 0.03 0.062 0.02 0.04 0.02 Check Score 25 7 10 6 2.4 0.2 2.5 0.2 3.2 0.4 2.4 0.7 Length(Coverage) 230 140(99%) 50 6(93%) 28 2(89%) 25 2(85%) 35 8(90%) 24 0.5(90%)

MACE 0.25 0.02 0.16 0.08 0.046 0.01 0.057 0.024 0.0052 0.004 0.0058 0.005 AGCE 0.25 0.02 0.18 0.08 0.061 0.01 0.08 0.03 0.033 0.01 0.041 0.007 Check Score 23 8 7.4 1 2.5 0.1 2.5 0.1 3 0.4 2.4 0.08 Length(Coverage) 190 50(99%) 69 38(94%) 28 2(89%) 27 4(85%) 32 5(90%) 27 3(91%)

MACE 0.25 0.02 0.13 0.1 0.05 0.02 0.0087 0.004 0.012 0.002 0.0088 0.005 AGCE 0.26 0.02 0.15 0.1 0.063 0.008 0.075 0.02 0.045 0.02 0.041 0.01 Check Score 36 20 23 10 2.5 0.3 2.6 0.3 3.5 0.7 2.7 1 Length(Coverage) 160 40(99%) 43 20(93%) 27 3(86%) 29 3(89%) 39 3(90%) 24 2(91%)

Table 2: Experiments on MEPS Datasets

In Table 2 we summarize the dimension reduction variants of our NRC algorithm against several benchmarks. The result sees a substantial advantage of our method over the selected benchmark methods, while most of the unshown benchmarks either are too computationally expensive to run or do not deliver competitive results on the high-dimensional dataset.

5 Conclusion

In this paper, we propose simple nonparametric methods for regression calibration, and through its lens, we aim to gain a better theoretical understanding of the problem. While numerical experiments show the advantage, we do not argue for a universal superiority of our methods. Rather, we believe the contribution of our work is first to give a positive result for individual calibration and an awareness of its importance, and second to show the promise of returning to simple statistical estimation approaches than designing further complicated loss functions without theoretical understanding. Future research includes (i) How to incorporate the uncertainty output from a prediction model with our nonparametric methods; (ii) While we do not observe the curse of dimensionality for our methods numerically and we explain it by the reduction of Lipschitzness constant, it deserves more investigation on how to adapt the method for high-dimensional data.

Ahmed M Alaa, Zeshan Hussain, and David Sontag. Conformalized unconditional quantile regression. In International Conference on Artificial Intelligence and Statistics, pages 10690 10702. PMLR, 2023.

Jason M Altschuler, Victor-Emmanuel Brunel, and Alan Malek. Best arm identification for contaminated bandits. J. Mach. Learn. Res., 20(91):1 39, 2019.

Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety. ar Xiv preprint ar Xiv:1606.06565, 2016.

Gah-Yi Ban and Cynthia Rudin. The big data newsvendor: Practical insights from machine learning. Operations Research, 67(1):90 108, 2019.

Peter L Bartlett, Olivier Bousquet, and Shahar Mendelson. Local rademacher complexities. The Annals of Statistics, 33(4):1497 1537, 2005.

Blair Bilodeau, Dylan J Foster, and Daniel M Roy. Minimax rates for conditional density estimation via empirical entropy. ar Xiv preprint ar Xiv:2109.10461, 2021.

Stéphane Boucheron and Maud Thomas. Concentration inequalities for order statistics. Electronic Communications in Probability, 17(none):1 12, 2012.

Asaf Cassel, Shie Mannor, and Assaf Zeevi. A general approach to multi-armed bandits under risk criteria. In Conference on learning theory, pages 1295 1306. PMLR, 2018.

Xiangli Chen, Mathew Monfort, Anqi Liu, and Brian D Ziebart. Robust covariate shift regression. In Artificial Intelligence and Statistics, pages 1270 1279. PMLR, 2016.

Youngseog Chung, Willie Neiswanger, Ian Char, and Jeff Schneider. Beyond pinball loss: Quantile methods for calibrated uncertainty quantification. Advances in Neural Information Processing Systems, 34:10971 10984, 2021.

Peng Cui, Wenbo Hu, and Jun Zhu. Calibrated reliable regression using maximum mean discrepancy. Advances in Neural Information Processing Systems, 33:17164 17175, 2020.

Andreas Damianou and Neil D Lawrence. Deep gaussian processes. In Artificial intelligence and statistics, pages 207 215. PMLR, 2013.

Sanjoy Dasgupta. Experiments with random projection. ar Xiv preprint ar Xiv:1301.3849, 2013.

Dheeru Dua and Casey Graff. Uci machine learning repository, 2017. URL http://archive.ics. uci.edu/ml.

Aryeh Dvoretzky, Jack Kiefer, and Jacob Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, pages 642 669, 1956.

Hadi Fanaee-T and Joao Gama. Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence, 2:113 127, 2014.

Shai Feldman, Stephen Bates, and Yaniv Romano. Improving conditional coverage via orthogonal quantile regression. Advances in neural information processing systems, 34:2060 2071, 2021.

Rina Foygel Barber, Emmanuel J Candes, Aaditya Ramdas, and Ryan J Tibshirani. The limits of distribution-free conditional predictive inference. Information and Inference: A Journal of the IMA, 10(2):455 482, 2021.

Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050 1059. PMLR, 2016.

Isaac Gibbs, John J Cherian, and Emmanuel J Candès. Conformal prediction with conditional guarantees. ar Xiv preprint ar Xiv:2305.12616, 2023.

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International conference on machine learning, pages 1321 1330. PMLR, 2017.

László Györfi, Michael Köhler, Adam Krzy zak, and Harro Walk. A distribution-free theory of nonparametric regression, volume 1. Springer, 2002.

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359 366, 1989.

Steven R Howard and Aaditya Ramdas. Sequential estimation of quantiles with applications to a/b testing and best-arm identification. Bernoulli, 28(3):1704 1728, 2022.

Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? Advances in neural information processing systems, 30, 2017.

Ravi Kumar Kolla, LA Prashanth, Sanjay P Bhat, and Krishna Jagannathan. Concentration bounds for empirical conditional value-at-risk: The unbounded case. Operations Research Letters, 47(1): 16 20, 2019.

Vladimir Koltchinskii. Local rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, pages 2593 2656, 2006.

Volodymyr Kuleshov, Nathan Fenner, and Stefano Ermon. Accurate uncertainties for deep learning using calibrated regression. In International conference on machine learning, pages 2796 2804. PMLR, 2018.

Ananya Kumar, Percy S Liang, and Tengyu Ma. Verified uncertainty calibration. Advances in Neural Information Processing Systems, 32, 2019.

Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems, 30, 2017.

Jing Lei and Larry Wasserman. Distribution-free prediction bands for non-parametric regression. Journal of the Royal Statistical Society: Series B: Statistical Methodology, pages 71 96, 2014.

Jing Lei, Max G Sell, Alessandro Rinaldo, Ryan J Tibshirani, and Larry Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523): 1094 1111, 2018.

Rachel Luo, Aadyot Bhatnagar, Yu Bai, Shengjia Zhao, Huan Wang, Caiming Xiong, Silvio Savarese, Stefano Ermon, Edward Schmerling, and Marco Pavone. Local calibration: metrics and recalibration. In Uncertainty in Artificial Intelligence, pages 1286 1295. PMLR, 2022.

David JC Mac Kay. A practical bayesian framework for backpropagation networks. Neural computation, 4(3):448 472, 1992.

Pascal Massart. The tight constant in the dvoretzky-kiefer-wolfowitz inequality. The annals of Probability, pages 1269 1283, 1990.

Andreas Maurer and Massimiliano Pontil. Empirical bernstein bounds and sample variance penalization. ar Xiv preprint ar Xiv:0907.3740, 2009.

Medical expenditure panel survey panel 19, 2017. URL https://meps.ahrq.gov/mepsweb/ data_stats/download_data_files_detail.jsp?cbo Puf Number=HC-181.

Medical expenditure panel survey panel 20, 2017. URL https://meps.ahrq.gov/mepsweb/ data_stats/download_data_files_detail.jsp?cbo Puf Number=HC-181.

Medical expenditure panel survey panel 21, 2017. URL https://meps.ahrq.gov/mepsweb/ data_stats/download_data_files_detail.jsp?cbo Puf Number=HC-192.

Harris Papadopoulos, Kostas Proedrou, Volodya Vovk, and Alex Gammerman. Inductive confidence machines for regression. In Machine Learning: ECML 2002: 13th European Conference on Machine Learning Helsinki, Finland, August 19 23, 2002 Proceedings 13, pages 345 356. Springer, 2002.

Tim Pearce, Alexandra Brintrup, Mohamed Zaki, and Andy Neely. High-quality prediction intervals for deep learning: A distribution-free, ensembled approach. In International conference on machine learning, pages 4075 4084. PMLR, 2018.

Joaquin Quinonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. Dataset shift in machine learning. Mit Press, 2008.

Yaniv Romano, Evan Patterson, and Emmanuel Candes. Conformalized quantile regression. Advances in neural information processing systems, 32, 2019.

Hugh Salimbeni and Marc Deisenroth. Doubly stochastic variational inference for deep gaussian processes. Advances in neural information processing systems, 30, 2017.

Hao Song, Tom Diethe, Meelis Kull, and Peter Flach. Distribution calibration for regression. In International Conference on Machine Learning, pages 5897 5906. PMLR, 2019.

Ingo Stainwart and Andreas Christmann. Estimating conditional quantiles with the help of the pinball loss. Bernoulli, 17(1):211 225, 2011.

Balazs Szorenyi, Róbert Busa-Fekete, Paul Weng, and Eyke Hüllermeier. Qualitative multi-armed bandits: A quantile-based approach. In International Conference on Machine Learning, pages 1660 1668. PMLR, 2015.

Ichiro Takeuchi, Quoc V Le, Timothy D Sears, and Alexander J Smola. Nonparametric quantile estimation. Journal of Machine Learning Research, 7(45):1231 1264, 2006.

Jayaraman J Thiagarajan, Bindya Venkatesh, Prasanna Sattigeri, and Peer-Timo Bremer. Building calibrated deep models via uncertainty matching with auxiliary interval predictors. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 6005 6012, Apr. 2020.

Léonard Torossian, Aurélien Garivier, and Victor Picheny. X-armed bandits: Optimizing quantiles, cvar and other risks. In Asian Conference on Machine Learning, pages 252 267. PMLR, 2019.

Vladimir Vovk. Conditional validity of inductive conformal predictors. In Asian conference on machine learning, pages 475 490. PMLR, 2012.

Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic learning in a random world, volume 29. Springer, 2005.

Junfeng Wen, Chun-Nam Yu, and Russell Greiner. Robust learning under uncertain test distributions: Relating covariate shift to model misspecification. In International Conference on Machine Learning, pages 631 639. PMLR, 2014.

Jonathan Wenger, Hedvig Kjellström, and Rudolph Triebel. Non-parametric calibration for classification. In International Conference on Artificial Intelligence and Statistics, pages 178 190. PMLR, 2020.

Jia Yuan Yu and Evdokia Nikolova. Sample complexity of risk-averse bandit-arm selection. In IJCAI, pages 2576 2582, 2013.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107 115, 2021.

Mengyan Zhang and Cheng Soon Ong. Quantile bandits for best arms identification. In International Conference on Machine Learning, pages 12513 12523. PMLR, 2021.

Shengjia Zhao, Tengyu Ma, and Stefano Ermon. Individual calibration with randomized forecasting. In International Conference on Machine Learning, pages 11387 11397. PMLR, 2020.

A Related Literature

A.1 Estimating the whole distribution

The most straightforward thought is to estimate the whole distribution which accomplishes the task of mean regression and uncertainty quantification simultaneously. Generally, there are two ways to fit a distribution: either fitting a parametric model or turning to nonparametric methods. Representative parametric methods are Bayesian Neural Networks [Mac Kay, 1992] and Deep Gaussian Process (DGP) [Damianou and Lawrence, 2013], of which the latter assumes the data is generated according to a Gaussian process of which the parameters are further generated by another Gaussian process. The major drawbacks of these Bayesian-type methods are their high computation cost during the training phase and statistical inconsistency under model misspecification. To resolve the computational issue, [Gal and Ghahramani, 2016] proposes a simple MC-Dropout method that captures the model uncertainty without changing the network structure. Despite the view of Bayesian, the task of distribution estimation also falls into the range of frequentists: by assuming the underlying distribution family (say, Gaussian), one can fulfill the task by estimating the moments or maximizing the likelihood. For example, the Heteroscedastic Neural Network (HNN) [Kendall and Gal, 2017] gives an estimation of both mean and variance at the final layer by assuming the Gaussian distribution. An ensemble method called Deep Ensemble [Lakshminarayanan et al., 2017] is constructed on the basis of HNN. [Cui et al., 2020] designs an algorithm that finds the best Gaussian distribution to minimize the empirical Maximum Mean Discrepancy (MMD) from the observed data so as to ensure the marginal calibration property of quantiles. [Zhao et al., 2020] raises a randomized forecasting principle that minimizes the randomized calibration error with a negative log-likelihood regularization term to ensure the so-called sharpness. Although they claim the construction of (randomized) individual calibration (named PAIC) from the existence of m PAIC predictors, they do not provide any non-trivial m PAIC predictor except for the one that permanently predicts the uniform distribution.

The other approach to estimating the entire distribution is the nonparametric way without assuming the underlying distribution. For example, Kernel Density Estimation (KDE) utilizes kernel methods to estimate the conditional probability density function. Although the method enjoys certain theoretical consistency guarantees [Lei and Wasserman, 2014, Lei et al., 2018, Bilodeau et al., 2021], it is often computationally expensive and not sample-efficient in high dimensional cases, which makes it impractical for uncertainty quantification. A similar idea is applied in [Song et al., 2019] via a post-hoc way. The method is to improve on some given distribution estimation via the Gaussian process and Beta link function. However, unlike the KDE literature, the method in [Song et al., 2019] lacks a theoretical guarantee of its consistency.

A.2 Estimating quantiles

Due to the excessive and sometimes unnecessary difficulties in estimating the distribution, some researchers suggest estimating the quantiles only, since a satisfying quantile estimation suffices in many downstream tasks. The quantile prediction is now regarded as a new task, where one natural idea is to combine the average calibration error on quantiles with a sharpness regularization term. [Pearce et al., 2018] appends a calibration error (with respect to coverage and sharpness) to the loss functions to obtain the calibrated prediction intervals at a pre-specified confidence level. [Thiagarajan et al., 2020] induces an extra interval predictor with a similar combined loss. But their methods require the pre-determined confidence or quantile level and the whole model needs to be retrained if one wants another level of prediction. Such methods in [Pearce et al., 2018, Thiagarajan et al., 2020] are only evaluated empirically without theoretical guarantee for even the marginal distribution. [Chung et al., 2021] proposes an average calibration loss combined with a similar sharpness regularization term, which achieves an average (but not individually, as we will discuss later) calibration. Along with these models that aim to give a quantile prediction out of scratch, [Kuleshov et al., 2018] develops a post-hoc way of calibrating (on average) a pre-trained quantile prediction model via isotonic regression methods. The aim is to choose a specific level of quantile prediction that reaches the desired marginal calibration of which the consistency relies solely on the pre-trained model in contrast to our model that applies a model-agnostic post-hoc calibration.

Another line of research is to use Kernelized Support Vector Machines (KSVM) to predict the conditional quantiles in one step, which fixes the inconsistency issue [Takeuchi et al., 2006, Stainwart and Christmann, 2011]. Essentially, methods from both papers search for a quantile prediction

function over the Reproducing Kernel Hilbert Space (RKHS), say H. Theoretically, Takeuchi et al. [2006] derive a performance guarantee for the conditional quantile estimator based on the assumption of finite H-norm, which is in parallel with our Lipschitz class (Assumption 1 (a)). The Lipschitz function class and the bounded H-norm function class overlap, but neither one is contained in the other. Stainwart and Christmann [2011] analyze the same algorithm but adopt a different approach. Instead of imposing a bounded H-norm, they impose a decay rate for the eigenvalues of the kernel integral operator, alongside some other assumptions on the underlying distribution. Those works focus on the RKHS space while we focus on the Lipschitz space. For the Lipschitz function class, to the best of our knowledge, our work is the first result to provide an individual guarantee. From the practical point of view, both papers rely on solving a kernelized learning problem that cannot scale well with the sample size or dimension, while our algorithm is simple to implement and of low computational cost.

A.3 Conformal prediction

A very closely related field is conformal prediction, where the goal is to give a coverage set that contains the true label with a certain probability. Theoretically, the goal of conformal prediction is no harder than our quantile prediction task since correct quantiles induce desired coverage sets, while empirically different measurements are considered: the conformal prediction cares about giving the sharpest covering set rather than specifying which quantiles are estimated. For the regression problem, there is an impossible triangle summarized in a concurrent work [Gibbs et al., 2023]: (i) a conditional/individual coverage (as our Definition 3) (ii) no assumptions on the underlying distribution (iii) a finite-sample guarantee and asymptotic consistency. Such a triangle is shown by Vovk [2012], Foygel Barber et al. [2021] to be impossible to achieve simultaneously. Some works in the conformal prediction literature aim to reconcile the impossibility: Romano et al. [2019] consider the marginal/average case (as defined in Definition 1), which greatly relaxes (i). Gibbs et al. [2023] violates (i) in a milder but different way, focusing on a midway between marginal coverage and conditional coverage defined by the linear function class.

The most related paper to ours may be Feldman et al. [2021]: both theirs and ours attempt to address the minimum cost of keeping the requirement (i). Before diving into the detailed comparison, we summarize the main difference first: Feldman et al. [2021] relaxes the remaining two, while our paper keeps (iii) and relaxes (ii) in a much milder way. Feldman et al. [2021] achieves a theoretical guarantee that the true conditional quantile is the minimizer of the regularized population loss under a realizability assumption (that is, the true conditional quantiles lie in the function class considered) and an assumption that the conditional distribution is continuous with respect to features. However, its theoretical results have several limitations. First, it achieves only asymptotic consistency compared to our finite-sample guarantee in Theorem 1, which means that Feldman et al. [2021] violates requirement (iii). Besides, it only achieves a necessary condition that the true conditional quantile is the minimizer of the population loss but not vice versa. Second, Feldman et al. [2021] violates (ii) to a much deeper content than ours. Note that the key assumptions made in Feldman et al. [2021] are the zero-approximation-error assumption and the continuity assumption. The former means that there should be no model misspecification at all, which is not a weak one in the statistical learning theory and limits the guarantee from a desirable distribution-free property. On the contrary, our algorithm can have a finite-sample guarantee with only three very general assumptions summarized in Assumption 1. Furthermore, our Lipschitz assumption is made only to achieve a finite-sample guarantee (of which the necessity is supported by Theorem 6), while it can still be relaxed to a weaker one than Feldman et al. [2021] to reach the same asymptotic result. To see that, our result only requires the conditional quantile to be continuous, while Feldman et al. [2021] requires continuity for the whole conditional distribution. Third, the regularization term in Feldman et al. [2021] is related to a zero-one indicator, which is discontinuous for gradient descent. The authors twist their algorithm by replacing it with a smooth approximation to overcome the issue without any formal guarantee on the twisted one. In contrast, our theory accompanies the exact same algorithm that we propose.

Our paper can also serve as a starting point for understanding the empirical success of the split procedure that is common among the conformal prediction algorithms. As summarized before, most conformal prediction methods theoretically abandon requirement (i) with only a marginal coverage guarantee. But a tricky point is that if one only needs the marginal coverage, it suffices to merely use the empirical quantile without the splitting step at all. Why does the learner bother to split ? Empirically, the splitting procedure helps to obtain a more homogeneous coverage across the whole

feature space, while our paper is the first theoretical justification up to our knowledge: to reduce the Lipschitz constant L and smooth the target. Such an intuition is justified by Theorem 1 and Theorem 2 from both the upper and the lower bounds.

B Failure of Existing Algorithms in Individual Calibration

In this subsection, we will give a brief discussion of the existing methods in regression calibration literature. We give a close inspection of the existing calibration algorithms, proving that they cannot achieve a desirable result (or even avoid naive and meaningless ones).

Some methods assume that the underlying distribution is Gaussian [Cui et al., 2020, Zhao et al., 2020]. The most straightforward drawback one can think of is that these Gaussian-based methods are not consistent once the Gaussian assumption is violated. However, we will prove with some counter-examples that they cannot prevent the naive or meaningless prediction even if the underlying model is Gaussian.

Cui et al. [2020] use a two-step procedure to estimate the conditional distribution, where at each step the underlying model is a neural network predicting the mean µ(x) and variance σ2(x). The first step is simply fitting the observed {(Xi, Yi)}n i=1 into the model, getting an initial guess of µ and σ. The second step utilizes the maximum mean discrepancy (MMD) principle. To be more specific, the second step iterates the following steps until convergence: first generate N new samples { ˆYi}N i=1 from the estimated Gaussian model N(ˆµ(Xi), ˆσ2(Xi)), then update the model estimation (ˆµ, ˆσ2 to minimize the empirical MMD 1

N PN i=1 ϕ(Yi) 1

N PN i=1 ϕ( ˆYi) . But the problem is: the corresponding relation between each {(Xi, Yi)} pair is wiped out from the model by calculating the empirical MMD. In other words, the procedure produces the same result even if one permute n estimated conditional distributions {(ˆµ(Xi), ˆσ(Xi))} arbitrarily. This can lead to a failure in predicting the true conditional distribution even under the Gaussian condition, especially when they use multi-layer neural networks which are well-known universal approximators that can fit any continuous function [Hornik et al., 1989].

Zhao et al. [2020] try to give a randomized prediction of the cdf (in their paper denoted by h[x, r]( ), where r is some Unif[0, 1] random variable). The idea behind this is that if one gets a good cdf estimation, then ˆFx(Y ) should be close to the uniform distribution. So the target is to minimize the empirical absolute difference between h[x, r](y) and r (denoted by L1). If L1 < ϵ with probability at least 1 δ, then they call it (ϵ, δ)-probably approximately individual calibrated (PAIC). But minimizing L1 alone cannot avoid being trapped by a naïve and meaningless calibration: h[x, r](y) r. So they combine the loss with another maximum likelihood type loss L2 (negative logarithm of likelihood, by assuming the underlying distribution of Y is Gaussian). But combining that loss cannot guarantee a meaningful calibration. To illustrate this, one just needs to consider the continuous distribution case, where each Xi = xi is only observed once almost surely. A global minimizer for the empirical loss L1 = Pn i=1 1 n|h[xi, ri](yi) ri| is obtained if one predicts h[xi, ri]( ) to be N(yi σΦ 1(ri), σ2), where N is the Gaussian and Φ( ) is the cdf of standard Gaussian. It can be easily checked that this Gaussian distribution s r-th i quantile is exactly yi). The maximum likelihood type loss L2 is now 1

2 log(σ2) + 1

n Pn i=1 Φ 1(ri)2 = 1

2 log(σ2) + constant. Minimizing it means that the σ2 term will asymptotically go to zero, leading to an overfit and meaningless result of h[xi, ri]( ) δyi. Again, this overfitting problem can be exacerbated by the fact that they train the model via a neural network, which is commonly believed to be highly overparametrized that can fit into any data nowadays. We must also point out that although Zhao et al. [2020] claims that they prove the existence of an (ϵ, δ)-PAIC predictor, the existence is only guaranteed by the existence of another monotonically PAIC (where monotonicity requires h[x, r](y) is monotonically increasing with respect to r) predictor (see their Theorem 1). But they do not provide the existence of any non-trivial m PAIC predictor other than the trivial solution h[x, r](y) r. Therefore, we think their argument is at least incomplete, if not suspicious.

For those existing methods that do not assume Gaussian distributions, unfortunately, they are still inconsistent from an individual calibration perspective [Kuleshov et al., 2018, Chung et al., 2021].

Kuleshov et al. [2018] develop an isotonic regression way to calibrate the quantiles in the regression setting, which is inspired by the widely used isotonic regression in the classification setting, hence is a recalibration method based on some initial estimation of all quantiles (say { Q}τ = ˆF 1(τ)).

Then the authors propose an isotonic regression method that learns a monotonically increasing function g(τexpected) = τobserved, where τobserved = 1 n Pn i=1 1{Yi Qτexpected}. The final output is ˆQτ = Qg 1(τ). The major drawback of their method is that it heavily relies on the initial calibration Qτ. Hence the final marginal (or individual) recalibration result is only valid when the function class { Qτ : τ [0, 1]} contains a marginal (or individual) calibrated prediction. Despite the marginal case, such a realizability condition is too strong to obtain in the individual calibration set, leading to the failure of their recalibration methods in the individual case.

Chung et al. [2021] propose a combined loss in their Section 3.2 with respect to both average calibration and sharpness. The first term minimizes the calibration error: L1 = 1{ 1

n Pn i=1 1{Yi ˆQτ} < τ} 1

n Pn i=1[(Yi ˆQτ(Xi))1{Yi > ˆQτ(Xi)}] + 1{ 1

n Pn i=1 1{Yi ˆQτ} > τ} 1

n Pn i=1[( Yi + ˆQτ(Xi))1{Yi < ˆQτ(Xi)}]. Note that L1 is zero if the prediction ˆQτ is marginally calibrated. The second term is L2 = 1 n Pn i=1 ˆQτ(Xi) ˆQ1 τ(Xi), called sharpness regularization term. But that combined loss L = (1 λ)L1 + λL2 will fail even for calibrating such a simple instance: X Unif[0, 1], u Unif[ 1, 1], and Y = u X. Any τ -th(τ > 0.5) quantile should be Qτ(Fx) = (τ 0.5)x, while minimizing their proposed loss for λ [0, λ0] ends up with ˆQτ = x for those x < τ 0.5 and ˆQτ = 0 for those x τ 0.5. For larger λ [λ0, 1], minimizing their proposed loss will lead to an all-zero meaningless prediction ˆQτ = 0. The detailed calculation can be found in Appendix.

Finally, we point out another issue that using the training residuals as errors to estimate the error distribution will lead to a biased (usually over-confident) estimation. For example, the Section 3.1 of Chung et al. [2021] proposes a combined method of first training a regression model ˆf on {(Xi, Yi)}n i=1 and then train a regression model on resi = Yi ˆf(Xi). But they use the same set of data to do both steps of training, leading to a statistically inconsistent estimator. The inconsistency becomes especially severe for modern highly overparametrized machine learning models that can fit into even random data [Zhang et al., 2021]. Remark 1. Many researchers [Chung et al., 2021, Zhao et al., 2020] argue that combining a sharpness regularization term with the average calibration loss will help to force the calibration model to stay away from marginal calibration. However, in this section, we show that those combining methods fail to provide a meaningful individually calibrated quantile prediction. In conclusion, a sharpness regularization term may help to achieve a marginal calibration, while it does no good for the goal of individual calibration.

C Discussions on Negative/Impossibility Results for Individual Calibration

Before elaborating on those impossibility results, we adopt the lower bound result from the nonparametric (mean) estimation literature Györfi et al. [2002], showing that any estimator may suffer from an arbitrarily slow convergence rate without additional assumptions on the underlying estimation, even in a noiseless setting. The original result is established for the conditional mean regression, and the same result holds for the conditional quantile estimator in our case. Theorem 6 (Theorem 3.1 in Györfi et al. [2002], rephrased). For any sequence of conditional quantile estimators { ˆQτ,n} and any arbitrarily slow convergence rate {an} with limn an = 0 and a1 a2 > 0, P, s.t. X Unif[0, 1], U|X δg(X) (hence Qτ(U|X = x) = g(x), τ > 0), where g(X) { 1, +1}, and

E h ˆQτ,n(X) g(X) 2i

Such a negative result further justifies our Assumption 1 in a way that one cannot achieve any meaningful convergence rate guarantee without making any assumptions on the underlying distribution.

Discussions on Zhao et al. [2020]: The negative result obtained by Zhao et al. [2020] (their Proposition 1) is for the target of estimating the conditional distribution s cdf. The authors argue that the quality of an estimation ˆFx( ) should be measured via how close the distribution ˆFx(Y ) is to the uniform distribution Unif[0, 1]. The measurement of closeness is defined by the Wasserstein-1 distance.

Their counter-example is constructed in a way that PY |X=x is a singleton (say, a delta distribution on the point g(x), where g(x) is chosen adversarially by the nature). They prove that any learner cannot distinguish between the true distribution and the adversarially chosen distribution from the observed data, thus any algorithm that claims a small closeness between ˆFx(Y ) and Unif[0, 1] cannot guarantee that the observed data are not sampled from the adversarial distribution. Since the Wasserstein-1 distance between a delta distribution and Unif[0, 1] is at least 1

4, the closeness claim is false for at least the adversarial distribution. We need to note that the measurement they take is a little suspicious: if the conditional distribution itself is a delta distribution, then the closeness of Fx(Y ) to the uniform distribution cannot be regarded as a good measurement of distribution calibration, since even the oracle will also suffer an at least 1

4 Wasserstein distance. Although a certain negative result when Y |X δg(X) is also made in our Theorem 6, we need to note that the closeness of ˆFx(Y ) to Unif[0, 1] can only be a good calibration measurement if the conditional distribution is continuous. On the contrary, for the general continuous conditional distributions (with only those mild assumptions), our methods can achieve a minimax optimal rate individual calibration guarantee with respect to any quantile (Theorem 1).

Discussions on Lei and Wasserman [2014], Vovk [2012]: Vovk [2012] s impossibility result (their Proposition 4) is a version of Lemma 1 in Lei and Wasserman [2014]. Thus we will only discuss Lemma 1 in Lei and Wasserman [2014] here. The impossibility result is established for those distributions which have continuous marginal distributions PX. They focus on a conditional coverage condition, meaning that one should obtain a (1 p) coverage set ˆC(Xtest) such that P(Ytest ˆC(Xtest)|Xtest = x) 1 p for any x X almost surely. Lei and Wasserman [2014] prove that for any distribution P and its any continuous point x0, for any finite n, such an almost sure conditional coverage must result in an infinite expected length of covering set with respect to Lebesgue measure. This negative result seems to be disappointing for individual calibration at first glance. However, an almost sure guarantee in a finite sample case is usually too strong in practice. Our results are established for two weaker but more practical settings: either the strong consistency in the asymptotic case as n or a mean squared error guarantee at the finite sample case. A similar spirit of relaxing the requirement in order to get a positive result can also be found in Lei et al. [2018], where the authors get an asymptotic result that the covering band will converge to the oracle under certain assumptions. Note that their requirement is to get the sharpest covering band, meaning that they need to estimate the probability density function, which leads to their more restricted assumptions and different analysis, compared to ours.

D Comparison with Other Results

In this subsection, we make a comparison of the technical contributions with other related literature such as concentration of order statistics and Kernel Density Estimation (KDE).

Current quantile concentration analyses are all based on the i.i.d. assumption, which is not applicable to the continuous case conditional quantile estimation problem, where the learner almost surely does not observe identical features. But to complete our discussions, we give a brief introduction to the existing quantile concentration methods. There are two lines of research, which we call vertical and horizontal correspondingly. Vertical concentration guarantee is based on curving the convergence rate of the Smirnov process (the process of the empirical cdf converges to the true cdf) and applying the Dvoretzky-Kiefer-Wolfowitz inequality [Dvoretzky et al., 1956, Massart, 1990] to get a sharp coverage along the vertical line (in regards to cdf function value), that is, the get a sharp δ > 0 such that ˆQτ δ Qτ ˆQτ+δ with high probability [Szorenyi et al., 2015, Altschuler et al., 2019]. But turning such a vertical guarantee that guarantees the coverage probability into a horizontal one that guarantees the estimation error is highly non-trivial. By assuming the cdf to be strictly increasing and continuous, Szorenyi et al. [2015], Torossian et al. [2019], Kolla et al. [2019], Howard and Ramdas [2022] get a horizontal concentration of order O(1/ n). Cassel et al. [2018] assume a Lipschitz cdf (hence an upper bound for pdf) and get a rate of O(1/ n). Altschuler et al. [2019] directly assumes a function (which they call R) of the cdf to transform the vertical concentration into a horizontal one. Another line of quantile concentration is to directly get a horizontal guarantee. Yu and Nikolova [2013] use Chebyshev s inequality to prove the horizontal concentration while remaining coarse-grained compared to Hoeffding type guarantee. Zhang and Ong [2021] utilize an alternative assumption and analysis of Boucheron and Thomas [2012] that relates to a certain

type of distribution with non-decreasing hazard rate (or monotone hazard rate, MHR). Although the MHR assumption is common in survival analysis, such an assumption is no longer appropriate for many other distributions, such as Gaussian. In our proof, we make a combination of the finite sample theory of the M-estimator and the local strong convexity of the conditional expected pinball loss to derive a new proof for the concentration of empirical conditional quantile around the true conditional quantile. We do not assume the upper bound or lower bound on the pdf, the strict increasing cdf, or the non-decreasing hazard rate. We also generalize the i.i.d. setting of quantile concentration to the independent but not identical setting, which is the first one to our knowledge in the quantile concentration literature.

Another important line of research is the kernel density estimation (KDE) literature focuses on giving a full characterization of the conditional distribution. Such a goal is aligned with the need for the full characterization of pdf in some downstream tasks. For example, if one wants to find the sharpest covering set (or covering band, if one further assumes the conditional distribution is univariate), then one must identify the probability density function so that thresholding can thus be constructed [Lei and Wasserman, 2014, Lei et al., 2018]. But for a simpler and probably wider class of downstream tasks, such as decision-making based on quantiles (newsvendor problem) and evaluating and minimizing the quantiles (robust optimization), an accurate estimation of the conditional quantiles should suffice.

From the point of developing practical algorithms, KDE methods are often computationally intractable for conditional quantile estimations. To get a quantile estimation, one has to integrate the full probability density function. The whole procedure is highly costly even if one discretized the integration. Another point is that theoretical guarantees in the KDE literature are often established with respect to the mean integrated squared error (MISE), while it does not simultaneously guarantee convergence with respect to some quantile. Consider a simple instance where the estimated probability density function ˆp(y) differs from true probability density function p(y) by a factor of O 1 n 1 |y| .

The MISE converges with a rate of O 1

n . However, the cumulative pdf estimation error (denoted by R y |ˆp(t) p(t)|dt may diverge (since R 0 1 |t|dt is a divergent integration). The divergence is caused by the non-exchangeability between the square and the integration.

Since the final goal is to get an estimation of the conditional quantile, we can relax the assumptions in the KDE literature largely. For those nonparametric methods such as Lei and Wasserman [2014], they require the smoothness (to be more specific, Hölderness) of the underlying conditional pdf with respect to y and the Lipschitzness with respect to x in the infinite norm, which is a far stronger assumption than our case. Instead, we only assume that the conditional quantile (rather than the full pdf with respect to the sup norm) is Lipschitz. For those methods that establish the KDE convergence results in a generalization bound via complexity arguments, our assumption is still much weaker. For example, Bilodeau et al. [2021] make a realizability assumption (that the true pdf must lie in the hypothesis class) to achieve a generalization bound, while our assumption just requires the true conditional quantile lies in a bounded set [ M, M] R.

In a word, our analysis combines the idea of decomposing the error into bias and variance terms as in nonparametric regression methods and the spirit of the covering number arguments in the analysis of parametric models. Such a combination is novel and non-trivial, which is critical for the relaxation of the assumptions.

E Numerical Experiment Details

In this part, we provide detailed experiment settings for Section 4.

E.1 Evaluation metrics

Mean Absolute Calibration Error (MACE): for (expected) quantile level τ and a quantile prediction model Q(X, τ), define "observed quantile level" by

i=1 1{Yi Q(Xi, τ)}.

For pre-specified quantile levels 0 τ1 < < τM 1, MACE is calculated by:

MACE({(Xi, Yi)}n i=1, Q) := 1

τj τ obs j .

In practice, we set M = 100 and pick τ s from 0.01 to 0.99 with equal step size. Adversarial Group Calibration Error (AGCE): introduced in Zhao et al. [2020], AGCE is calculated in practice by first sub-sampling (possibly with replacement) B1, , Br {(Xi, Yi)}n i=1 and then retrieve the maximum MACE from the pool:

AGCE({Bk}r k=1, Q) := max k {MACE(Bk, Q)}.

Check Score: check score is an empirical version of the expected pinball loss. For prespecified quantile levels 0 τ1 < < τm 1 (in our experiments τ s are set in the same way as in MACE), the check score is defined as:

Check Score({(Xi, Yi)}n i=1, Q) := 1

j=1 ρτj (Q(X, τj), Y ) ,

with ρτ(Q(X, τ), Y ) := 1

n Pn i=1 lτ(Q(Xi, τ), Yi). Length: our length metric gives the average interval length of a predicted 90% confidence interval. The interval is given as the area between the lower quantile (at level τlow = 0.05) and the upper quantile (at level τup = 0.95):

Length({(Xi, Yi)}n i=1, Q) := 1

i=1 (Q(X, τup) Q(X, τlow)) .

Coverage: to encourage a fairer comparison, we always attach the empirical coverage rate to the length metric. Coverage rate evaluates the percentage of real data that is covered by the given 90% confidence intervals:

Coverage({(Xi, Yi)}n i=1, Q) := 1

i=1 1{Q(Xi, τlow) Yi Q(Xi, τup)}.

We make a brief explanation for the selection of the performance metrics as follows. The former three metrics are typical measurements in the literature of quantile regression and uncertainty calibration that measure the quality of the calibrations, while the latter two are classic metrics in the field of conformal prediction. Although the quantile calibration task is very closely related to the conformal prediction task, the goals are still different. Conformal prediction aims to provide as sharp as possible prediction sets that cover the true outcomes with desired rates. While a precise quantile prediction guarantees the coverage rate, such a covering band may not be the sharpest. As is discussed in Appendix B, the additional sharpness regularization term could harm the goal of precise quantile prediction. For example, people may select 0.05 and 0.95 quantiles to construct a 0.9 coverage band, but the sharpest covering band probably would deviate from such quantiles. However, a highquality quantile prediction alone is of independent interest for many downstream tasks, such as the newsvendor problem. This points to a difference between the objectives of calibration and conformal prediction. As a result, the popular performance metrics applied in conformal prediction such as the average interval length and the coverage rate are not direct measurements for the quantile calibration problem. Nevertheless, we still present the Length and the Coverage metrics for completeness. Some other measurements, for example, the independence between the coverage indicator and the band length in Feldman et al. [2021], are only a necessary condition of the quantile calibration and are therefore not considered here.

E.2 Synthetic Dataset

For an example of heteroscedastic Gaussian noises, consider an underlying generation scheme of (X, Y ) R2 from X Unif[0, 15] and Y |X N(µ0(X), σ2 0(X)), with µ0(X) = 4 sin( 2π

15 X) and σ0(X) = max{0.2X |sin(X)|, 0.1}. We generate ntrain = 40000 samples for training and

ntest = 1000 for testing. For our NRC method, we randomly select n1 = 36000 for training a regression model and ntrain n1 = 4000 samples for the quantile calibration step. All neural networks used in this example have hidden layers of size [100, 50]. As shown in Figure 1, our NRC method successfully captures the actual fluctuation of conditional variance, while the traditional benchmark model Deep Ensemble which relies on 5 heteroscedastic neural networks cannot capture the individual dependence on the feature of the noise variance.

E.3 Dimension reduction implementation

As given in Theorem 1, for high dimensional data (i.e., large d) the upper bound guarantee may get too loose as a theoretical guarantee. Motivated by this dilemma, we propose applying dimensionreduction techniques when doing nonparametric approximation. We summarize them as the NRC-DR algorithms. There are loads of choices for dimension reduction, in this work we only implement three of them: random projection, covariate selection, and second-to-last layer feature extraction.

Random Projection Dasgupta [2013]: we use the Gaussian random projection method, in which we project the d-dimensional inputs on the column space of a randomly generated d0 d matrix (assuming that d0 < d, otherwise original input will be retained), whose elements are independently generated from N(0, 1

d). Covariate Selection: we select the most relevant features from the input vector. In this experiment, we select d0 out of d features that have the largest absolute value of Pearson correlation coefficient with the target variable. Second-to-Last Layer Embedding: The second-to-last layer of a trained neural network has been widely used for feature extraction when the input dimension is high. We add an additional layer of size d0 to the regression network ˆf( ) in Algorithm 2 and directly use this layer as the feature embedding.

NRC algorithms implemented with each of these dimension-reduction techniques are called NRC-RP, NRC-Cov, and NRC-Embed respectively. In the UCI-8 experiments we implement NRC-RP and NRC-Cov, with the full result presented in Table 4, while for high-dimensional datasets we choose the medical expenditure panel survey (MEPS) datasets [panel 19, 2017, panel 20, 2017, panel 21, 2017] as suggested in [Romano et al., 2019] to test out our NRC-RP and NRC-Embed variants. The sizes and feature dimensions of these datasets are listed in Table 3. Due to the relative abundance of training data, we set the hidden layers of neural networks to [64, 64], the rest hyperparameters are pre-determined by grid search. For the results presented in Table 2, we reduce the dimension to d0 = 30.

E.4 Benchmark methods

We explain the benchmark models and uncertainty quantification methods we are competing against in Section 4. The first type of method is based on the Gaussian assumption. They assume that the conditional distribution of Y given X can be fully determined by the conditional mean and variance. These models are of the same output structure of two neurons: one for prediction of µ(X), another for σ(X). At the implementation of the experiments, we assume that µ(X) and σ(X) share the same overall network structure, with two hidden layers of size [20, 20] for small datasets and [64, 64] for large or high dimensional datasets. Amongst them, MC-Dropout (MCDrop) Gal and Ghahramani [2016], Deep Ensembles (Deep Ensemble) Lakshminarayanan et al. [2017] and Heteroscedastic Neural Network (HNN) Kendall and Gal [2017] are trained by minimizing the negative log-likelihood (NLL) error. The Maximum Mean Discrepancy method (MMD) Cui et al. [2020] trains by first minimizing the NLL loss of an HNN model, and then re-train the model to minimize the MMD between the model prediction and the same training set at the second phase. The method introduced in section 3.2 of Chung et al. [2021] introduces a combined calibration loss (CCL) with an average calibration loss and a sharpness regularization term, and the model is trained by minimizing it. Apart from these feed-forward structures, there is also a Deep Gaussian Process model (DGP) Damianou and Lawrence [2013] which assumes that the data are generated from a Gaussian process, and is trained by the doubly stochastic variational inference algorithm Salimbeni and Deisenroth [2017]. The DGP model is of high training cost, we thus only test it on small datasets. We also run experiments on post-hoc calibration methods, such as isotonic regression (ISR) suggested by Kuleshov et al. [2018]. In our experiments, we evaluate the post-hoc ISR method by attaching it

to a pre-trained HNN model. We also implement another post-hoc method named Model Agnostic Quantile Regression (MAQR) method given in section 3.1 of Chung et al. [2021]. Their algorithm is similar to ours in terms of training a regression model first, but different in two aspects: first, MAQR uses the same dataset to do both regression training and quantile calibration; second, MAQR obtains a quantile result at the second phase via training a new quantile regression model, which in our implementation is a neural network. For the algorithms coming from the conformal prediction society, we test the Conformalized Quantile Regression (CQR) method introduced by Romano et al. [2019] and the Orthogonal Quantile Regression (OQR) approach from Feldman et al. [2021]. The CQR algorithm also takes a training step followed by a calibration step, with the training step learning the target quantiles by minimizing the pinball loss and the calibration step further calibrating the quantile predictions by conformal prediction. Based on the CQR paradigm, the OQR algorithm further improves conditional coverage by introducing an additional regularization term to the vanilla pinball loss used for quantile regression. The regularization term encourages the independence between interval length and the violation (falling out of the predicted interval) event of the response variable.

E.5 Experiment details on the 8 UCI datasets

We have run extensive experiments to compare our methods against multiple state-of-the-art benchmark models. The full results are listed in Table 4. A brief review of all the benchmarks has been provided in Section E.4, and realizations of algorithms NRC-RP and NRC-Cov are explained in Appendix E.3.

A short summary of the 8 UCI datasets is listed in 3. For all the network structures used in Section 4 we fix the size of the hidden layers to [20, 20] (and [10] for the DGP model). For our dimension reduction algorithms, the target dimension that we reduce to is set to 4. The rest hyperparameters (including the learning rate, minibatch size, kernel width, etc) are pre-determined by running a grid search. The learning rate is searched within [10 2, 5 10 3, 10 3], the minibatch size is searched within [10, 64, 128], and the rest hyperparameters exclusive to each model are searched in the same fashion. For each possible combination of model and dataset, we save its optimal hyperparameters into separate configuration files.

During the main experiment, for each combination of model and dataset, we load its optimal hyperparameter configurations, on which we repeat the whole training process 5 times. Each time we randomly split out 10% of the whole sample as the testing set. On the remaining set, for our NRC algorithms we additionally separate out 30% for recalibration, and for the rest algorithms, no additional partitioning is required. The rest data is then fed into the whole train-validation loop, and we set up an early stopping scheme with a patience count of 20. That is, if for 20 epochs no decrease in loss is observed on the validation set, then the training is early stopped.

Table 3: Descriptions of Datasets. Each row gives the number of examples contained in the corresponding dataset and the number of features in the input variable.

Dataset # Examples # Features

Boston 506 13

Concrete 1030 8

Energy 768 8

Kin8nm 8192 8

Naval 11934 17

Power 9568 4

Wine 4898 11

Yacht 308 6

Dataset # Examples # Features

Meps_19 15785 139

Meps_20 17541 139

Meps_21 15656 139

Bike-Sharing 17379 11

Wine-Red 1599 11

Auto-MPG 392 6

Table 4: Full table of experiment results on 8 UCI datasets

Dataset Metric HNN MCDrop Deep Ensemble CCL ISR DGP MMD

MACE 0.065 0.04 0.14 0.03 0.06 0.04 0.11 0.04 0.057 0.03 0.11 0.01 0.18 0.03 AGCE 0.25 0.09 0.22 0.04 0.27 0.07 0.34 0.1 0.27 0.08 0.36 0.1 0.41 0.09 Check Score 0.96 0.2 1.9 0.2 1 0.3 1.1 0.4 0.95 0.2 1.1 0.3 1.3 0.4 Length(Coverage) 8.9 1(88%) 8.6 1(95%) 8.5 1(90%) 5.3 0.7(90%) 7.9 1(88%) 4.6 0.1(75%) 10 1(97%)

MACE 0.038 0.02 0.078 0.02 0.053 0.03 0.11 0.03 0.044 0.02 0.17 0.02 0.098 0.05 AGCE 0.18 0.03 0.2 0.04 0.18 0.03 0.22 0.03 0.19 0.05 0.25 0.04 0.24 0.07 Check Score 2.2 0.6 3.6 0.2 1.9 0.3 2.3 0.6 2.3 0.6 1.7 0.2 3.9 1 Length(Coverage) 19 1(89%) 25 4(95%) 19 1(93%) 17 1(77%) 20 0.8(89%) 5.1 0.1(50%) 23 3(94%)

MACE 0.07 0.04 0.16 0.02 0.065 0.04 0.13 0.07 0.055 0.03 0.085 0.01 0.13 0.05 AGCE 0.2 0.07 0.21 0.01 0.19 0.04 0.29 0.1 0.21 0.07 0.2 0.03 0.28 0.1 Check Score 0.61 0.1 1.4 0.3 0.64 0.09 0.79 0.1 0.6 0.1 0.25 0.02 0.71 0.2 Length(Coverage) 7.8 2(88%) 9 4(93%) 7.7 0.6(94%) 2.1 0.2(60%) 7.1 2(95%) 4.2 0.4(97%) 8 0.8(92%)

MACE 0.049 0.04 0.034 0.01 0.077 0.05 0.1 0.02 0.013 0.005 0.035 0.01 0.13 0.01 AGCE 0.11 0.04 0.075 0.01 0.12 0.06 0.13 0.02 0.072 0.02 0.08 0.03 0.18 0.02 Check Score 0.031 0.001 0.045 0.002 0.027 0.002 0.03 0.002 0.031 0.001 0.023 0.001 0.035 0.003 Length(Coverage) 0.26 0.005(88%) 0.5 0.02(95%) 0.29 0.006(93%) 0.25 0.003(80%) 0.28 0.007(90%) 0.28 0.002(94%) 0.37 0.002(86%)

MACE 0.12 0.05 0.1 0.03 0.12 0.05 0.18 0.1 0.011 0.005 0.12 0.06 0.2 0.007 AGCE 0.17 0.06 0.15 0.05 0.14 0.06 0.22 0.1 0.054 0.009 0.15 0.05 0.2 0.005 Check Score 0.00086 0.0002 0.0018 0.0006 0.00064 0.0002 0.0021 0.0006 0.00079 0.0003 0.0026 0.0006 0.006 0.0007 Length(Coverage) 0.013 0.0008(99%) 0.02 0.002(94%) 0.015 0.0003(100%) 0.016 0.0007(84%) 0.0078 0.0008(90%) 0.05 0.004(97%) 0.0054 0.0003(77%)

MACE 0.045 0.02 0.29 0.01 0.045 0.03 0.16 0.07 0.011 0.006 0.13 0.01 0.21 0.03 AGCE 0.1 0.02 0.3 0.01 0.075 0.03 0.21 0.08 0.065 0.02 0.16 0.007 0.23 0.03 Check Score 1.3 0.09 24 3 1.3 0.1 1.5 0.2 1.3 0.1 1.3 0.009 1.7 0.2 Length(Coverage) 13 1(87%) 20 3(97%) 16 0.5(97%) 14 1(80%) 13 0.8(90%) 17 0.1(86%) 13 1(85%)

MACE 0.05 0.04 0.036 0.02 0.049 0.03 0.057 0.009 0.016 0.009 0.047 0.03 0.13 0.02 AGCE 0.11 0.05 0.1 0.03 0.12 0.05 0.12 0.03 0.091 0.02 0.11 0.05 0.16 0.02 Check Score 0.2 0.01 0.2 0.01 0.2 0.009 0.21 0.01 0.2 0.01 0.2 0.01 0.24 0.02 Length(Coverage) 2.4 0.07(89%) 2.4 0.09(90%) 2.2 0.02(88%) 2.3 0.06(86%) 2.2 0.03(87%) 2.3 0.008(89%) 2.4 0.08(92%)

MACE 0.075 0.05 0.11 0.03 0.049 0.02 0.11 0.04 0.06 0.03 0.11 0.03 0.14 0.05 AGCE 0.34 0.1 0.29 0.08 0.35 0.06 0.45 0.05 0.34 0.09 0.29 0.05 0.42 0.08 Check Score 1.4 0.6 1.9 0.6 1.1 0.4 0.49 0.2 1.5 0.6 0.54 0.2 1.1 0.3 Length(Coverage) 8.1 2(70%) 18 5(94%) 10 1(91%) 7.8 1(86%) 10 0.9(88%) 5.3 0.2(85%) 9.2 2(90%)

Dataset Metric MAQR CQR OQR NRC NRC-RF NRC-RP NRC-Cov

MACE 0.051 0.02 0.049 0.02 0.057 0.03 0.055 0.02 0.048 0.02 0.055 0.02 0.055 0.02 AGCE 0.25 0.06 0.24 0.06 0.19 0.05 0.19 0.05 0.25 0.08 0.26 0.07 0.19 0.05 Check Score 1.1 0.2 0.72 0.1 0.73 0.2 1.1 0.2 1.2 0.4 1.1 0.2 1.1 0.2 Length(Coverage) 9.1 1(88%) 7.1 0.7(82%) 8 1(90%) 10 1(94%) 9.7 1(91%) 11 1(94%) 9.6 1(90%)

MACE 0.049 0.03 0.059 0.03 0.048 0.02 0.058 0.02 0.047 0.02 0.038 0.03 0.045 0.02 AGCE 0.18 0.06 0.23 0.05 0.18 0.05 0.19 0.05 0.18 0.05 0.19 0.04 0.2 0.06 Check Score 2 0.2 1.72 0.6 1.57 0.2 1.9 0.4 1.4 0.2 2.6 0.7 2 0.4 Length(Coverage) 17 2(79%) 20 0.7(89%) 17 0.8(89%) 21 0.8(90%) 19 0.9(92%) 21 1(89%) 21 1(89%)

MACE 0.063 0.02 0.075 0.03 0.058 0.01 0.04 0.01 0.037 0.008 0.04 0.01 0.04 0.01 AGCE 0.21 0.04 0.18 0.008 0.17 0.006 0.26 0.1 0.22 0.06 0.24 0.08 0.26 0.1 Check Score 0.6 0.2 0.14 0.03 0.31 0.2 0.69 0.2 0.18 0.03 0.69 0.2 0.69 0.2 Length(Coverage) 6.2 3(84%) 1.8 0.3(95%) 3.7 0.8(94%) 4.5 1(88%) 1.5 0.1(89%) 8.5 3(91%) 7.3 2(91%)

MACE 0.032 0.02 0.041 0.02 0.053 0.01 0.017 0.007 0.013 0.004 0.017 0.007 0.017 0.007 AGCE 0.072 0.02 0.075 0.02 0.099 0.01 0.07 0.02 0.067 0.02 0.056 0.02 0.07 0.02 Check Score 0.029 0.001 0.025 0.0009 0.036 0.001 0.032 0.002 0.044 0.001 0.032 0.002 0.032 0.002 Length(Coverage) 0.27 0.01(87%) 0.26 0.005(89%) 0.45 0.005(93%) 0.25 0.003(90%) 0.52 0.007(90%) 0.28 0.006(89%) 0.27 0.009(90%)

MACE 0.043 0.02 0.077 0.07 0.24 0.2 0.017 0.007 0.012 0.005 0.017 0.007 0.017 0.007 AGCE 0.081 0.01 0.098 0.05 0.27 0.1 0.052 0.008 0.054 0.01 0.063 0.02 0.052 0.008 Check Score 0.0021 0.0007 0.0011 0.0009 0.0032 0.001 0.002 0.0007 0.00088 0.0001 0.002 0.0007 0.002 0.0007 Length(Coverage) 0.017 0.003(84%) 0.015 0.002(98%) 0.023 0.004(63%) 0.0097 0.002(91%) 0.0088 0.0006(91%) 0.012 0.002(90%) 0.01 0.002(91%)

MACE 0.029 0.02 0.049 0.005 0.057 0.007 0.0086 0.003 0.014 0.008 0.01 0.004 0.0086 0.003 AGCE 0.06 0.01 0.071 0.02 0.084 0.008 0.057 0.007 0.058 0.01 0.053 0.008 0.057 0.007 Check Score 1.2 0.03 1.1 0.03 1.2 0.08 1.2 0.05 1 0.03 1.3 0.07 1.2 0.05 Length(Coverage) 12 0.8(88%) 13 0.7(93%) 14 1(95%) 13 0.3(91%) 11 0.1(90%) 14 1(91%) 13 1(92%)

MACE 0.029 0.02 0.056 0.007 0.05 0.004 0.017 0.006 0.016 0.01 0.017 0.006 0.017 0.006 AGCE 0.094 0.03 0.1 0.06 0.13 0.08 0.075 0.03 0.077 0.02 0.078 0.03 0.075 0.02 Check Score 0.2 0.02 0.21 0.02 0.21 0.02 0.21 0.01 0.19 0.01 0.21 0.01 0.21 0.01 Length(Coverage) 2.3 0.1(84%) 2.3 0.04(89%) 2.3 0.03(89%) 2.3 0.04(90%) 2.2 0.02(90%) 2.3 0.05(90%) 2.4 0.06(90%)

MACE 0.088 0.04 0.073 0.05 0.14 0.08 0.12 0.06 0.096 0.05 0.087 0.04 0.093 0.05 AGCE 0.31 0.07 0.34 0.04 0.38 0.05 0.34 0.07 0.42 0.08 0.31 0.08 0.31 0.04 Check Score 0.32 0.09 0.19 0.07 0.41 0.2 0.44 0.2 0.18 0.03 0.54 0.2 0.4 0.2 Length(Coverage) 3.3 1(83%) 1.7 0.4(72%) 3.6 0.8(88%) 5.3 1(93%) 3.2 1(91%) 4.5 1(93%) 5.5 1(91%)

E.6 Experiment details on the Bike-Sharing dataset

The Bike-Sharing dataset is a time series dataset that records the number of bike rentals as well as the value of 11 related features at each hour. As briefly described in Section 4, for this experiment we design the LSTM-HNN and LSTM-NRC models, each having 2 layers, 40 hidden neurons for each layer, and a sliding window of size 5. For the LSTM-NRC model, in the recalibration step we are required to calculate (as part of the whole kernel matrix calculation) the distance between to input variables, and we retrieve the input vector by concatenating all the features within the history window (into a 55-dimension vector). As the dimension is high, we adopt dimension-reduction techniques and use random projection to map the concatenated feature vector to a 4-dimensional space.

Most experiment details, in terms of hyperparameter selection, early stopping criteria, etc, as similar to the settings described in Appendix E.5. One difference lies in the splitting of the dataset. Due to the temporal property, we are only allowed to predict "the future" from "the past", thus we split the starting 80% of time stamps for training (and further split for recalibration, if required), and the ending 20% of time stamps for testing.

E.7 Robustness to Covariate Shift

Covariate shift [Quinonero-Candela et al., 2008] describes a frequently observed phenomenon that the distribution of the covariates (also called the independent variable, feature variable, etc) that differ between the training and testing datasets. In the presence of covariate shift, a calibration model attained from the training set that only achieves average calibration (i.e., naive predictions of quantiles) may no longer suit well for the testing dataset, while an individually calibrated model maintains its optimality in terms of all the metrics mentioned above.

Following the experiment design in Chen et al. [2016] and Wen et al. [2014], we experiment on both real datasets and half-synthetic ones (all of which are available on Dua and Graff [2017]). The experiment design is almost the same as in Section 4, except for a difference in the processing of testing sets. The experiment results on a subset of benchmarks are presented in Table 5, which further validates the robustness of our algorithm to covariate shift, especially from the dominating performance on the half-synthetic datasets. A few remarks on the datasets: the Wine dataset and Auto-MPG dataset both induce a natural interpretation of covariate shift, with the training set and the testing set having different colors in the experiment on the Wine dataset, and different origin cities on the Auto-MPG dataset; and on the rest two datasets we synthetically simulate a covariate shift for the testing set, as recommended in Chen et al. [2016]. To be specific, we introduce a covariate shift into the testing set by following steps:

1. Randomly split out 10% of the whole dataset as the "testing pool".

2. On the training set XT , Calculate empirical mean and variance ˆµ = XT , ˆΣ = Cov(XT ).

3. Sample Xseed N(ˆµ, ˆΣ). 4. Resample (with replacement) 1000 examples from the testing pool, following density N(Xseed, 0.3ˆΣ).

Table 5: Experiments on covariate shift datasets. The first three rows are real datasets, while the last two are half-synthetic ones. For the Auto-MPG dataset, models are always trained on data samples from city 1, and tested on dataset of either city 2 or city 3, denoted by "MPG2" and "MPG3" respectively. Our algorithm has an outstanding performance, especially so on the two half-synthetic datasets.

Dataset Metric HNN MCDrop Deep Ensemble ISR DGP NRC

Wine MACE 0.32 0.1 0.22 0.07 0.32 0.06 0.31 0.09 0.26 0.06 0.2 0.03 AGCE 0.33 0.1 0.23 0.07 0.34 0.06 0.33 0.08 0.28 0.05 0.23 0.03 Check Score 0.51 0.2 0.29 0.06 0.43 0.1 0.48 0.2 0.31 0.06 0.31 0.09

MPG12 MACE 0.065 0.03 0.12 0.01 0.07 0.03 0.059 0.02 0.17 0.01 0.069 0.008 AGCE 0.21 0.05 0.22 0.04 0.19 0.03 0.21 0.06 0.30 0.05 0.29 0.06 Check Score 1.2 0.2 1.5 0.2 1.1 0.04 1.1 0.06 1.7 0.2 1.2 0.05

MPG13 MACE 0.1 0.02 0.16 0.02 0.11 0.02 0.1 0.03 0.23 0.03 0.08 0.03 AGCE 0.2 0.009 0.2 0.03 0.2 0.02 0.24 0.06 0.37 0.07 0.26 0.1 Check Score 1.1 0.04 1.6 0.3 1.2 0.08 1 0.05 1.4 0.07 0.9 0.06

Concrete MACE 0.1 0.04 0.17 0.09 0.12 0.03 0.059 0.01 0.19 0.02 0.05 0.008 AGCE 0.13 0.04 0.18 0.08 0.15 0.04 0.093 0.02 0.21 0.01 0.07 0.01 Check Score 4.11 0.2 4.8 0.8 4.3 0.2 4 0.3 2.3 0.27 2.5 0.1

Boston MACE 0.24 0.08 0.33 0.1 0.2 0.01 0.11 0.02 0.08 0.03 0.05 0.02 AGCE 0.26 0.07 0.34 0.1 0.21 0.01 0.14 0.02 0.1 0.03 0.075 0.02 Check Score 2.1 0.7 3.2 1.2 1.5 0.05 1.1 0.1 0.71 0.3 0.61 0.05

F Some Basic Probability Results

In this section, we present several well-known probability theory results. Lemma 1 (Bernstein s Inequality). Let ξ1, . . . , ξn be N independent random variables. Suppose |ξi| M0 almost surely. Then t > 0,

i=1 (ξi E[ξi])

1 2Nt2 PN i=1 1 N Var(ξi) + 1

Lemma 2 (Hoeffding s Inequality). Let ξ1, . . . , ξn be N independent random variables. Suppose ai ξi bi almost surely. Then t > 0,

i=1 (ξi E[ξi]) t

2t2 PN i=1(bi ai)2

i=1 (E[ξi] ξi) t

2t2 PN i=1(bi ai)2

Lemma 3 (Poisson Approximation). Let Sn = Pn k=1 ξn,k, where for each n the random variables ξn,k, k [n] are mutually independent, each taking value in the set of non-negative integers. Suppose that pn,k = P(ξn,k = 1) and ϵn,k = P(ξn,k 2) are such that as n ,

(a) Pn k=1 pn,k λ < ;

(b) maxk [n]{pn,k} 0;

(c) Pn k=1 ϵn,k 0.

Then, Sn D Poisson(λ) of a Poisson distribution with parameter λ as n .

G Proofs in Section 2

Proof of Proposition 1. From the fact that Y = U + ˆf(X), one can replace the Y term in each probability with U + ˆf(X), proving that the LHS and RHS probabilities are identical.

Proof of Proposition 2. By Lemma 6, we know that

Qτ(x) arg min u R E lτ(u, U)|X = x .

Since lτ(u1, u2) = lτ(u1 + c, u2 + c), c R, we have

Qτ(x) arg min u R E lτ(u + ˆf(X), U + ˆf(X))|X = x ,

which completes the proof.

H Proofs in Section 3.1

We start the analysis on a fixed point x X. We first analyze the case where n(x) > 0, i.e. i [n], Xi x h. We denote those Xi x h by {Xik}n(x) k=1 . We denote FU|Xik by

Fk. We denote the average distribution of Fk by F = Pn(x) k=1 1 n(x)Fk, and denote its τ -th quantile by

Qτ( F). We abbreviate the estimation ˆQSNQ τ obtained by Algorithm 1 to be ˆQτ in this section. To bound the difference ˆQτ(x) Qτ(U|X = x) , we factorize it into two terms, as an analogy to the bias-variance factorization in nonparametric analysis: ˆQτ(x) Qτ(U|X = x) Qτ( F) Qτ(U|X = x) + ˆQτ(x) Qτ( F) ,

where the former term is called the bias term and the latter the variance term. We now deal with them one by one.

Lemma 4. Under Assumption 1 (a) with Lipschitz constant L, we have Qτ( F) Qτ(U|X = x) Lh.

Proof of Lemma 4. By Assumption 1 (a), k [n(x)],

|Qτ(U|X = Xik) Qτ(U|X = x)| Lh,

which implies that

Fk Qτ(U|X = x) + Lh + = Fk Qτ(U|X = x) + Lh τ,

Fk Qτ(U|X = x) + Lh τ.

F Qτ(U|X = x)+Lh + = F Qτ(U|X = x)+Lh =

1 n(x)Fk Qτ(U|X = x)+Lh τ,

F Qτ(U|X = x) + Lh =

1 n(x)Fk Qτ(U|X = x) + Lh τ,

which justifies that Qτ( F) Qτ(U|X = x) Lh.

Lemma 4 controls the bias term. For the variance term, we need a more careful analysis. We start with inspecting the properties of the expected pinball loss function. We will show that under Assumption 1 (b) and (c), the empirical risk minimization solution obtained at the second step of Algorithm 1 is actually close to the true quantile of F by two steps:

Step 1. The empirical risk minimizer ˆQτ(x) concentrates around its corresponding population risk s unique minimizer for sufficiently large n(x) C(p, r, M), where the minimizer is denoted by u . Step 2. The unique population risk minimizer u in Step 1 is identical to the population risk minimizer of EU F [lτ(u, U)].

We prove the latter Step 2 first due to its simplicity. The following Lemma 5 proves Step 2.

Lemma 5. We define the population risk stated in Step 1 by lτ(u) := EUk Fk[Pn(x) k=1 1 n(x)lτ(u, Uk)]. Suppose Assumption 1 hold for some L, p, r, M. Then

(a) lτ is twicely differentiable and convex;

(b) lτ is identical to EU F [lτ(u, U)] up to a constant;

(c) For sufficiently small h r 2L, r0 r 2, s.t. the minimizer of lτ is unique (denoted by u ), u [ M, M], and lτ is p-strongly convex within a ball of radius r0 around its minimizer u .

To prove Lemma 5, we introduce the following Lemma 6, whose proof is postponed to Appendix J. Lemma 6. For any distribution with cdf F(u) = P(U u), the expected pinball loss with respect to F is semi-derivative with respect to u, and

EU F [lτ(u, U)] = F(u ) τ, +EU F [lτ(u, U)] = F(u) τ.

Furthermore, for those F(u ) = F(u), the derivative exists.

Proof of Lemma 5. By Lemma 6, we have

u lτ(u) = 1 n(x)

k=1 u EU Fk[lτ(u, U)] = 1 n(x)

k=1 Fk(u) τ.

Since we assume that (X, U) follows a continuous distribution, and the conditional pdf exists, then

2 lτ(u) = 1 n(x)

where pk is the pdf of Fk, k [n(x)], which proves part (a).

For part (b), one only needs to notice that F has the cdf of the form Pn(x) k=1 1 n(x)Fk, which implies that EU F [lτ(u, U)] has the same derivative as lτ.

For part (c), by Assumption 1 (c), we have n(x) balls B(Qτ(U|X = Xik), r) k [n(x)], such that

pk(u) p, u B Qτ(U|X = Xik), r

. At the same time, from the proof of Lemma 4, we know that Qτ( F) Qτ(U|X = Xik) Lh 1

which means that

max k {Qτ(U|X = Xik) r} + 1

2 r min k Qτ(U|X = Xik)

Qτ( F) max k Qτ(U|X = Xik)

min k {Qτ(U|X = Xik) + r} 1

Such an inequality implies that for at least a ball of radius r

2EU F [lτ(u, U)] = 1 n(x)

k=1 pk(u) p, u B(Qτ( F), r

which proves the uniqueness of u and the local strong convexity.

Since mink Qτ(U|X = Xik) Qτ( F) maxk Qτ(U|X = Xik) and Qτ(U|X = Xik) [ M, M], we have u = Qτ( F) [ M, M].

Now we go back to prove Step 1. To get the optimal convergence rate, our arguments share the same spirits of Bartlett et al. [2005], Koltchinskii [2006] of getting fast convergence rates for the M-estimators of low variance. But for simplicity, we derive our proof of uniform convergence via Bernstein s inequality and covering number arguments, which is more similar to the way of Maurer and Pontil [2009].

We are now interested in ξk(u) := lτ(u, Uik) lτ(Qτ( F), Uik), u [ M, M], Uik U|X = Xik, k [n(x)]. A direct computation shows that Lemma 7. u [ M, M], we have

|ξk(u)| max{τ, 1 τ} u Qτ( F) u Qτ( F) ,

Var(ξk(u)) 1

3 max{τ 2, (1 τ)2} u Qτ( F) 2 1

u Qτ( F) 2 .

Proof of Lemma 7. Note that the pinball loss can be expressed in an alternative way: lτ(u, U) = max{(1 τ)(u U), τ(U u)},

lτ(Qτ( F), U) = max (1 τ) Qτ( F) U , τ U Qτ( F) . From the fact that (1 τ)(u U) (1 τ)(Qτ( F) U) (1 τ) u Qτ( F)

and that τ(U u) τ U Qτ( F) τ|u Qτ( F)|, we prove the first statement via Lemma 15.

For the second claim, one just need to notice that any almost surely bounded by [a, b] random variables have at most a variance of 1 12(b a)2.

Combining Lemma 1 and Lemma 7, we directly prove the following Lemma 8 (Pointwise Convergence). δ > 0, n(x) > 0, u [ M, M], with probability at least 1 δ the following holds

k=1 E[ξk(u)] 1 n(x)

u Qτ( F) log 1

u Qτ( F) 2 log 1

Proof. Direct corollary from Lemma 1 and Lemma 7.

To reach a uniform convergence guarantee from pointwise convergence, we utilize the covering number arguments. Definition 4 (ϵ-Covering Numbers). ϵ > 0, a function class F is said to have an ϵ-covering number of N (ϵ, F, ) := inf{|F0|: F0 F, f F, f0 F0, s.t. f f0 ϵ}.

We compute the function class formed up by the functions lτ(u, ) lτ(Qτ( F), ). Lemma 9 (Covering Number). Assume Assumption 1 (b) holds with parameter M. Define L[ M,M] := {lτ(u, ) lτ(Qτ( F), ): u [ M, M]}. Then ϵ > 0,

N ϵ, L[ M,M], diam([ M, M])

Proof of Lemma 9. By a similar argument to that in Lemma 7, we have lτ(u1, ) lτ(Qτ( F), ) lτ(u2, ) lτ(Qτ( F), ) |u1 u2|,

where the left-hand-side is the function infinity form by itself.

Combining the above fact with another that |u1 u2| sup u 1,u 2 [ M,M] |u 1 u 2| = diam([ M, M]),

we complete the proof.

Since there will be many coefficients dependence, we abbreviate any constant C that depends polynomially on some other factors (c1, , cq) by C(c1, , cq). Now we provide the uniform convergence result by combining the pointwise convergence and the covering number. Lemma 10 (Uniform Convergence). ϵ, δ > 0, assume n(x) is sufficiently large such that n(x) 2 3 log( 2M

ϵδ ), we have with probability at least 1 δ,

k=1 E[ξk(u)] 1 n(x)

4ϵ + 2 3n(x)

u Qτ( F) log 2M

u Qτ( F) 2 log 2M

holds for u [ M, M].

Specifically, if we select ϵ = 1 n(x), δ = 1 n(x)M , we have n(x) C1(log(M)) := 6 log(max{2M, 10}),

k=1 E[ξk(u)] 1 n(x)

3 log(2n(x)M)

u Qτ( F) 2 log(2n(x)M)

holds for u [ M, M].

Proof of Lemma 10. ϵ > 0, we can select an ϵ-cover of the function class L[ M,M], denoted by

S = {L[uj] = lτ(uj, ) lτ(Qτ( F), )} N(ϵ,L[ M,M], ) j=1 .

ξk(u) = lτ(u, Uik) lτ(Qτ( F), Uik) = L[u](Uik), u [ M, M],

which means that uj S, s.t.

|ξk(u) ξk(uj)| = |L[u](Uik) L[uj](Uik)| L[u] L[uj] ϵ.

Now δ > 0, we apply Lemma 8 on {ξk(uj)}m k=1, uj S with probability guarantee of at least 1 δ N(ϵ,L[ M,M], ), we have uj S, 1 m

k=1 E[ξk(uj)] 1

uj Qτ( F) log N(ϵ, L[ M,M], )

uj Qτ( F) 2 log N(ϵ, L[ M,M], )

Combining above inequality with the fact that |ξk(u) ξk(uj)| ϵ and the way we select the cover in Lemma 9 such that |u uj| ϵ, we complete the proof.

To verify the specific conclusion, one just need to notice that our choice of n(x) ensures that n(x) 4

3 log(2n(x)M) for ϵ = 1 n(x), δ = 1 n(x)M .

Lemma 11 (Generalization Bound for ˆQτ). If we take ˆQτ(x) = arg minu Pn(x) k=1 1 n(x)lτ(u, Uik) as we do in Algorithm 1, then for sufficiently large m C2(p 1, r 1, M), we have with probability at least 1 1 n(x)M ,

0 lτ( ˆQτ(x)) lτ(Qτ( F)) C3(p 1, M, log(n(x)M))

n(x) = O 1 n(x)

where C3 := 8 + 8M

3 log(n(x)M) + 16 log(n(x)M)

More specifically, with probability at least 1 1 n(x)M ,

ˆQτ(x) Qτ( F) C4(p 1, M, log(n(x)M)) p

where C4 := q

Proof of Lemma 11. We first claim that for sufficiently large n(x) C2, the empirical minimizer ˆQτ(x) must fall into the neighborhood B(Qτ( ˆF), r

2) stated in Lemma 5 (c). In fact, one just needs to verify that

3 log(2n(x)M) + 1 p

3 M 2 log(2n(x)M) p r2

which will definitely hold for sufficiently large n(x) C2(p 1, r 1, M).

From the fact that minimizing Pn(x) k=1 1 n(x)lτ(u, Uik) is equivalent to minimizing it minus some constant, we know that n(x) X

1 n(x)ξk( ˆQτ(x)) 0.

By Lemma 5 and Lemma 10, we have

0 lτ( ˆQτ(x)) lτ(Qτ( F))

k=1 E[ξk( ˆQτ(x)]

3 log(2n(x)M) +

16 3n(x)| ˆQτ(x) Qτ( F)|2 log(2n(x)M)

3 log(2n(x)M) +

lτ( ˆQτ(x)) lτ(Qτ( F)) log(2n(x)M),

where the first inequality follows from Lemma 5 (a), the second inequality from Lemma 10, and the last from Lemma 5 (c).

Solve the quadratic inequality

w2 α n(x) +

where α = 4 + 8M

3 log(2n(x)M), β = 32 log(2n(x)M)

3p , we have

β n(x) + 4α n(x) 2 1 p

where w is exactly q

lτ( ˆQτ(x)) lτ(Qτ( F)).

lτ( ˆQτ(x)) lτ(Qτ( F)) 1 n(x) (2β + 2α) = 1 n(x)

3 log(2n(x)M) + 64 log(2n(x)M)

The latter conclusion follows from the local strong convexity of lτ again, as is shown in Lemma 5 (c).

Lemma 11 proves our Step 1, finally. We can combine Step 1 and Step 2 now to get an upper bound that is critical for the following analysis:

Lemma 12 (Bounding ˆQτ(x) Qτ(U|X = x) ). Suppose Assumption 1 holds. For sufficiently

small h r 2L, we have

ˆQτ(x) Qτ(U|X = x) Lh + 1{n(x) < C2}M

+ 1{n(x) C2}

M1{Bounds in Lemma 11 fail} + C4 p

where C2 = C2(p 1, r 1, M), C4 = C4(p 1, M, log(n(x)M)) in Lemma 11, and the failing probability in Lemma 11 is no larger than 1 n(x)M provided n(x) C2.

Proof of Lemma 12. A direct corollary from Lemma 4 and Lemma 11.

Proof of the pointwise consistency part of Theorem 1. Since the O 1

term and the failing

probability term in Lemma 12 tend to zero as n(x) , we only need to verify that we can select h such that as n , we have

(b) C > 0, P(n(x) C) 1.

We will show that both conditions are true for any x0 where the pdf p(x) is positive and continuous at x = x0, if we select h = C5n 1 d+2 .

The first argument automatically holds as n . We turn to prove the second one.

Since p(x) is continuous at p(x0), we can find sufficiently large n N0, such that

2 , x B(x0, h).

Then the probability of 1{Xi B(x0, h)} can be lower bounded by

P(Xi B(x0, h)) p(x0)

2 V hd = C6n d d+2 ,

where V is the volume of the unit ball in X Rd, C6 := p(x0)

C > 0, j N+, we can define a sequence of parameters {λj} j=1, such that for any ηj Poisson(λj),

P(ηj C) 1 1

j , j = 1, 2, . . . ,

where we assume without loss of generality that λj is non-decreasing.

For each j, we can define an auxiliary sequence of independent random variables {νl,q,j} for l = [q], q = 1, 2, . . . , j = 1, 2, . . . , such that each random variable νl,q,j Bernoulli( λj

By Lemma 3, we can find a sequence of {qj} j=1, such that

l=1 νl,q,j C

j , q qj, j = 1, 2, . . . ,

where we select qj without loss of generality that λj

qj is non-increasing.

As {1{Xi B(x0, h)}}n i=1 is also an independent Bernoulli random variable sequence with parameter C6n d d+2 , we can select a non-decreasing sequence of {Nj} j=1, such that n Nj,

C6n d d+2 λj

Then n Nj, we have

i=1 1{Xi B(x0, h)} C

i=1 νi,n,j C

where the first inequality follows from the result of stochastically dominant of 1{Xi B(x0, h)} over νl,n,j.

ϵ > 0, we can select sufficiently large j 2

ϵ , then n max{N0, Nj},

P(n(x0) C) 1 ϵ,

which completes the proof.

To prove the mean squared error part of Theorem 1, we state a lemma on the Binomial random variable that will be proved later.

Lemma 13. For any ζ Binomial(n, p), we have

(a) E h 1 ζ 1{ζ > 0} i 2 (n+1)p;

(b) r N, r < np, we have P(ζ r) (n r)p (np r)2 .

Now we return to the proof of the mean squared error part of Theorem 1.

Proof of the mean squared error part of Theorem 1. By direct computation from Lemma 12, we have

E h | ˆQτ(Xtest) Qτ(U|X = Xtest)|2i

= EXtest h EX1:n EU1:n[| ˆQτ(Xtest) Qτ(U|X = Xtest)|2] i

+ h4C4(p 1, M, log(n(Xtest)M))2

n(Xtest) + 4M 2( 1 n(Xtest)M )2i 1{n(Xtest) C2}

+ 4M 21{n(Xtest) < C2} #

4L2h2 + EXtest

C 4(p 1, M, log(n M))

n(Xtest) 1{n(Xtest) C2} + 4M 21{n(Xtest) < C2} #

where the first inequality comes from the fact that (a + b + c + d)2 4(a2 + b2 + c2 + d2), and the second from that n(Xtest) n. Note that we replace the original C4 that depends on the logarithmic term of n(Xtest) with a new C 4 that does not depend logarithmically on n(Xtest) but logarithmically on the entire n. From the definition of C4 in Lemma 11, we recall that the dependence on the logarithmic term is polynomial.

By Lemma 13 (a), we have

1 n(Xtest)1{n(Xtest C2)} EX1:n

1 n(Xtest)1{n(Xtest > 0)}

2 (n + 1)P(Xi B(Xtest, h)).

We try to bound EX[ 1 P(Xi B(X,h))].

Since X [0, 1]d, we can generate an h

2 -cover of X, denoted by S = {xj}Cd j=1, where

where C is some intrinsic constant with respect to the space Rd equipped with metric .

1 P(Xi B(X, h))

1 P(B(x, h))d P

1{x B(xj, h

2 )} P(B(x, h)) d P

1{x B(xj, h

where P is the probability measure with respect to X, and the first inequality follows from that S is a h 2 -cover of X, the second from the fact that B(xj, h

2 ) B(x, h) for those x xj h

2 , and the inequality from the fact that R 1{x B(xj, h

2 )}d P = P(B(xj, h

We now try to bound EXtest h EX1:n 1{n(Xtest) < C2} i .

If n P(B(x, h)) 2 C2 , then by Lemma 13 (b), we have

P(n(x) < C2) n C2 (n P(B(x, h)) C2 )2 4 n P(B(x, h)).

If n P(B(x, h)) < 2 C2 , then for sufficiently large n 4 C2 1, by direct calculation we have,

P(B(x, h)) 1 P(B(x, h)) 2 C2 2 n 1 4

3 C2 3 n 2 C2 4 n 3 C2 C2 n + 1 C2 ,

which implies that

Cl n 1 P(B(x, h)) n l P(B(x, h)) l 2 4

3 C2 C0 n 1 P(B(x, h)) n , l [ C2 ],

where Cl n is the number of combinations of (n, l). Hence

P(n(x) < C2) P(n(x) C2 )

l=0 Cl n 1 P(B(x, h)) n l P(B(x, h)) l

1 P(B(x, h)) n

3 exp( n P(B(x, h))).

Since exp( a) 1

a maxb{b exp( b)} = 1 ea, we have

P(n(x) < C2) C7 n P(B(x, h)),

where C7 = 8(C2)2

Therefore, we can conclude that (w.l.o.g. we assume C7 4)

P(n(x) < C2) max{C7, 4}

n P(B(x, h)) = C7 n P(B(x, h)).

Hence, we can finally give the bound by

E h | ˆQτ(Xtest) Qτ(U|X = Xtest)|2i

4L2h2 + EXtest

C 4(p 1, M, log(n M))

n(Xtest) 1{n(Xtest) C2} + 4M 21{n(Xtest) < C2} #

4L2h2 + C 4Ch d 1

n + C7Ch d 1

= 4L2h2 + C 2(p 1, r 1, M, log(n M))h d 1

For the special case of L = 0, we can select h = Θ(M), hence reaching a fast rate of convergence of

order O C 2 n .

If L > 0, by selecting h = n 1 d+2 L 2 d+2 (d C 2) 1 d+2 , we get the convergence result that

E h | ˆQτ(FX) Qτ(FX)|2i L 2d d+2 n 2 d+2 d C 2 L, p 1, r 1, M, log(n) 2 d+2 ,

where the dependence of C 2 on log(n) is polynomial. We conclude that

E h | ˆQτ(FX) Qτ(FX)|2i = O L 2d d+2 n 2 d+2 .

Remark 2. A for large d cases, the left behind term (d C 2) 2 d+2 is almost 1, compared to other terms.

The major contribution to the error upper bound is made by the L 2d d+2 n 2 d+2 term. If L is replaced by k L, then n must be at least kdn to keep the upper bound to remain the same, which implies that the contribution of L is much more significant than n.

As for the lower bound, Theorem 2 follows immediately from Theorem 3.2 in Györfi et al. [2002] with Hölderness parameter (1, L). Note that the class PL we construct has a conditional distribution identical to the Gaussian of the unit variance, which automatically fulfills our Assumption 1 (b) and (c). The only additional requirement in PL compared to their D(1,L) is that we require the existence of µ(x) = 0 for some x [0, 1]d. In their proof, they construct a sub-class based on the division of [0, 1]d into smaller cubes, where the mean function µ(x) is (1, L)-Hölder (i.e. L-Lipschitz) on each cube while being zero on the boundary of those cubes. Hence our additional requirement of x [0, 1]d such that µ(x) = 0 does not affect the result.

Lemma 14 (Theorem 3.2 in Györfi et al. [2002]). Consider a conditional mean estimation problem (i.e. regression problem) with the estimator sequence {ˆµn} and true conditional mean µ. For the class PL, the sequence an = L 2d d+2 n 2 d+2

is a lower minimax rate of convergence. In particular,

lim inf n inf ˆµn sup (X,Y ) P,P PL E ˆµn µ 2

L 2d d+2 n 2 d+2 C8 > 0,

for some constant C8 independent of L.

Proof of Theorem 2. Since for any Gaussian distribution with known variance, estimating its mean is equivalent to estimating its any τ -th quantile, Theorem 2 is a direct corollary from Lemma 14, once one verifies that PL satisfies Assumption 1.

Verifying Assumption 1 (a): Since the τ -th conditional quantile is of the type µ(x) + Φ(τ) and µ(X) is L-Lipschitz, the conditional quantile is also of course L-Lipschitz.

Verifying Assumption 1 (b): Because µ(x) is L-Lipschitz and X is bounded, its image µ(X) is bounded. Furthermore, as we know 0 µ(X), we can bound µ(X) by [ L

d], leading to the boundedness of the conditional quantile set µ(X) + Φ(r).

Verifying Assumption 1 (c): The pdf around µ(x) + Φ(τ) is of the same shape as that of standard Gaussian at Φ(τ), which is definitely lower bounded from zero in a neighborhood.

I Proofs in Section 3.2

Proof of Theorem 3. By the definition of mutual independence, we have A F(A), B F(U),

P(U B|Z z(A))P(X A|Z z(A)) = P(U B, X A|Z z(A)).

P(U B|Z z(A)) = P(U B, X A|Z z(A))

P(X A|Z z(A))

= P(U B, X A, Z z(A)) / P(Z z(A))

P(X A, Z z(A)) / P(Z z(A))

= P(U B, X A, Z z(A))

P(X A, Z z(A))

= P(U B, X A)

P(X A) = P(U B|X A),

leading to the conclusion that FU|X=x = FU|Z=z(x).

Thus, if we have any consistency guarantee or mean squared error guarantee on

| ˆQτ(U|Z = z(x)) Qτ(U|Z = z(x))|,

then we also get the same result for

| ˆQτ(U|Z = z(x)) Qτ(U|X = x)|.

Notice that the dimension of Z is d0 d, we can conclude the proof.

Proof of Theorem 4. By definition of quantiles, we know

P U Qτ(U|Z) Z = FU|Z(Qτ(U|Z)) τ.

Since we assume that FU|Z is absolutely continuous with respect to the Lebesgue measure, the Radon-Nikodym derivative exists, which is the conditional probability density function. Thus FU|Z must be locally Lipschitz. Then for any fixed bounded neighborhood I containing Qτ(U|Z), FU|Z is Lipschitz on I with some Lipschitz constant L( I). Therefore ϵ > 0 and small enough such that B(Qτ(U|Z), ϵ) I,

FU|Z(Qτ(U|Z) ϵ) FU|Z(Qτ(U|Z)) L(I)ϵ.

By the definition of quantiles, we have

FU|Z(Qτ(U|Z) ϵ) < τ.

Hence FU|Z(Qτ(U|Z)) < L(I)ϵ + τ. Taking ϵ 0, we have FU|Z(Qτ(U|Z)) τ. Combining it with the fact that FU|Z(Qτ(U|Z)) τ, we prove that

P U Qτ(U|Z) Z = FU|Z(Qτ(U|Z)) = τ.

Proof of Theorem 5. We only prove the middle inequality, since the remaining two hold from similar arguments.

By a similar argument as Proposition 2, if one knows Z(d2), then the best single point decision they can make with respect to conditional expectation of pinball loss is

Qτ(U|Z(d2)) arg min u R E[lτ(u, U)|Z(d2)].

Since Z(d1) F(Z(d2)), if another learner makes a decision based on the first d1 components of Z(d2), say, Qτ(U|Z(d1)), then they must suffer a pinball loss no smaller than that Z(d2) learner, which is E[lτ(Qτ(U|Z(d2)), U)|Z(d2)] E[lτ(Qτ(U|Z(d1)), U)|Z(d2)]. By taking the expectation with respect to Z(d2), the tower property of the conditional expectation tells us that

E h lτ(Qτ(U|Z(d2)), U) i = E h E[lτ(Qτ(U|Z(d2)), U)|Z(d2)] i

E h E[lτ(Qτ(U|Z(d1)), U)|Z(d2)] i

= E h lτ(Qτ(U|Z(d1)), U) i .

Noticing that Y = U + ˆf(X) and lτ(a + c, b + c) = lτ(a, b), we have

lτ(u + ˆf(X), Y ) = lτ(u, U),

which verifies the proof finally.

Remark 3. One may wonder if one can conduct the proof of Theorem 5 in a simpler way via Jensen s inequality since lτ( , ) is convex with respect to its first component. But unfortunately, such an argument does not work for the quantile case, since in general,

E[Qτ(U|X)] = Qτ(U).

J Proofs of Auxiliary Lemmas in Appendix H

Proof of Lemma 6. Denote the probability measurement by P. From the definition of pinball loss and expectation, we have

EU[lτ(u, U)] = Z u

(τ 1)(U u)d P + Z

u τ(U u)d P

(τ 1)(U u)d P + Z

u τ(U u)d P

For a small but positive > 0, we have

EU[lτ(u , U)] EU[lτ(u, U)]

(τ 1)(U u + )d P + Z

u τ(U u + )d P

(τ 1)(U u)d P Z

u τ(U u)d P

(τ 1)(U u + )d P + Z

u τ(U u + )d P

(τ 1)(U u)d P Z

u τ(U u)d P

u (τ 1)(U u + )d P + Z u

u τ(U u + )d P

(τ 1)d P + Z

u (U u + )d P

= (τ F(u )) + Z u

u (U u + )d P.

Note that any cdf is càdlàg (right continuous with left limits). Thus, ϵ > 0, δ > 0, s.t. < δ,

|F(u ) F(u )| ϵ.

Then we have

u (U u + )d P

(τ F(u )) EU[lτ(u , U)] EU[lτ(u, U)] (τ F(u ) + ϵ).

τ F(u ) lim inf 0+ EU[lτ(u , U)] EU[lτ(u, U)]

lim sup 0+ EU[lτ(u , U)] EU[lτ(u, U)]

τ F(u ) + ϵ.

Since ϵ can be arbitrarily small, the limit exists, and

EU F [lτ(u, U)] = lim 0+ EU[lτ(u , U)] EU[lτ(u, U)]

The other side of semi-derivative can be derived in a similar way, which we omit for simplicity.

Lemma 15. For any a1, a2, b1, b2 R, we have

| max{a1, b1} max{a2, b2}| max{|a1 a2|, |b1 b2|}.

Proof of Lemma 15. If ai bi (or ai bi) simultaneously for i = 1, 2, the claim is straightforward. Without loss of generality, we assume a1 < b1 < b2 < a2. Then

LHS = a2 b1 a2 a1 = RHS.

Proof of Lemma 13. As for part (a), we notice that

1 j + 1Cj npj(1 p)n j

= 1 (n + 1)p

j=0 Cj+1 n+1pj+1(1 p)n j

j=0 Cj n+1pj(1 p)n j+1

= 1 (n + 1)p(p + 1 p)n+1 = 1 (n + 1)p,

where Cj n is the number of combinations of (n, l).

k 2 k+1 for any k 1, we have

ζ 1{ζ > 0} E 2 1 + ζ

2 (n + 1)p.

For part (b), we prove its symmetric argument that r > np,

P(ζ r) rq (r np)2 ,

where q = 1 p.

The probability ratio of two adjacent value is

P(ζ = r + 1)

P(ζ = r) = (n r)p

Hence for any k r, we have

P(ζ = k) P(ζ = r) 1 r np

Summing all these k s from r to , we have

k=r P(ζ = k)

= P(ζ = r) 1

= P(ζ = r) rq r np.

From a similar argument of inspecting the probability ratio between np k r, we know that

P(ζ = k) P(ζ = r).

There are at least r np + 1 r np such k s (including r itself), so we have

1 (r np)P(ζ = r),

which leads to the conclusion that

P(ζ r) rq (r np)2 , r > np.

If we have r < np, then we repeat the above arguments by replacing r with n r and reversing p and q, we can prove

P(ζ r) (n r)p

(np r)2 , r < np.