# Calibrated Reliable Regression using Maximum Mean Discrepancy

Peng Cui 1,2, Wenbo Hu 1,2, Jun Zhu 1
1 Dept. of Comp. Sci. & Tech., Institute for AI, BNRist Center, Tsinghua-Bosch Joint ML Center, THBI Lab, Tsinghua University, Beijing, 100084 China
2 RealAI
xpeng.cui@gmail.com, wenbo.hu@realai.ai, dcszj@tsinghua.edu.cn

J. Zhu is the corresponding author. 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

## Abstract

Accurate quantification of uncertainty is crucial for real-world applications of machine learning. However, modern deep neural networks still produce unreliable predictive uncertainty, often yielding over-confident predictions. In this paper, we are concerned with obtaining well-calibrated predictions in regression tasks. We propose a calibrated regression method that minimizes a kernel embedding measure, the maximum mean discrepancy (MMD). Theoretically, the calibration error of our method asymptotically converges to zero when the sample size is large enough. Experiments on non-trivial real datasets show that our method can produce well-calibrated and sharp prediction intervals, outperforming the related state-of-the-art methods.

## 1 Introduction

Deep learning has achieved significant progress on a wide range of complex tasks [22], mainly in terms of prediction accuracy. However, high accuracy alone is often not sufficient to characterize performance in real applications, where uncertainty is pervasive because of various factors such as incomplete knowledge, ambiguities, and contradictions. Accurate quantification of uncertainty is crucial to derive a robust prediction rule. For example, an accurate uncertainty estimate can reduce the occurrence of accidents in medical diagnosis [23], warn users in time in self-driving systems [27], reject low-confidence predictions [6], and better meet consumers' order needs for internet services, especially during special events [39].

In general, there are two main types of uncertainty: aleatoric uncertainty and epistemic uncertainty [10]. Aleatoric uncertainty captures inherent data noise (e.g., sensor noise), while epistemic uncertainty is caused by model parameters and structure and can be reduced by providing enough data. Though important, it is highly nontrivial to properly characterize uncertainty. Deep neural networks (DNNs) typically produce point estimates of parameters and predictions, and are insufficient to characterize uncertainty because of their deterministic functions [11]. It has been widely observed that modern neural networks are not properly calibrated and often tend to produce over-confident predictions [3, 15].

An effective approach to uncertainty estimation is to directly model the predictive distribution with the observed data in a Bayesian style [26]. But performing Bayesian inference on deep networks is still a very challenging task, since the networks define highly nonlinear functions and are often over-parameterized [38, 34]. The uncertainty estimates of Bayesian neural networks (BNNs) may be inaccurate because of either model misspecification or the use of approximate inference [19]. Besides, BNNs are computationally more expensive and slower to train in practice compared to non-Bayesian NNs. For example, MC-Dropout is a simple method that directly captures uncertainty without changing the network structure [12].
But the uncertainty quantification of MC-Dropout can be inaccurate, as will be seen in the empirical results in this paper. Apart from BNNs, some methods incorporate a variance term into NNs to estimate the predictive uncertainty. For instance, [17] proposed a heteroscedastic neural network (HNN) that combines model uncertainty and data uncertainty simultaneously, producing the mean and variance through two outputs in the last layer of the network. Based on HNN, [21] described a simple and scalable method for estimating predictive uncertainty from ensembled HNNs, named deep ensembles. But the ensembled model is usually computationally expensive, especially when the model structure is complex.

An alternative way to obtain accurate predictive uncertainty is to calibrate the inaccurate uncertainties. Early attempts used scaling and isotonic regression techniques to calibrate the supervised learning predictions of traditional models, such as SVMs, neural networks, and decision trees [32, 29]. For regression tasks, the prediction intervals are calibrated based on the proportion of ground truths they cover. Recently, [15, 19] adopted a post-processing step to adjust the output probabilities of modern neural networks based on temperature scaling and non-parametric isotonic regression techniques. Such post-processing methods can be directly applied to both BNNs and DNNs without model modifications, but they need to train an auxiliary model and rely on an additional validation dataset. Moreover, isotonic regression tends to overfit, especially on small datasets [35]. [31, 36] directly incorporated a calibration error into the loss function to obtain calibrated prediction intervals at a specific confidence level. Predetermining the specific confidence level can be regarded as point calibration, and the calibration model needs to be retrained when the confidence level changes. [35] proposed an extension to the post-processing procedure of isotonic regression, using Gaussian Processes (GPs) and Beta link functions. This method improves calibration at the distribution level compared to existing post-processing methods, but is computationally expensive because of the GPs.

In this paper, we propose a new way to obtain calibrated predictive uncertainty for regression tasks at the global quantile level: it derives a distribution matching strategy and obtains a well-calibrated distribution that can output predictive uncertainties at all confidence levels. Specifically, we minimize the maximum mean discrepancy (MMD) [13] to reduce the distance between the predictive distribution and the true one. We show that the calibration error of our model asymptotically converges to zero when the sample size is sufficiently large. Extensive empirical results on regression and time-series forecasting tasks show the effectiveness and flexibility of our method.

## 2 Preliminaries

In this section, we introduce some preliminary knowledge of the calibrated regressor and maximum mean discrepancy, as well as the notation used in the sequel.

### 2.1 Calibrated Regressor

Let us denote a predictive regression model as $f: \mathbf{x} \mapsto y$, where $\mathbf{x} \in \mathbb{R}^d$ and $y \in \mathbb{R}$ are random variables. We use $\Theta$ to denote the parameters of $f$. We learn the proposed regression model given a labeled dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ with $N$ samples.
To obtain a more detailed characterization of uncertainty in the output distribution, a calibrated regressor outputs the cumulative distribution function (CDF) $F_i$ of the predictive distribution for each input $\mathbf{x}_i$. When evaluating the calibration of regressors, the inverse of the CDF, $F_i^{-1}: [0, 1] \to \mathbb{R}$, is used to denote the quantile function:
$$F_i^{-1}(p) = \inf \{y : p \leq F_i(y)\}. \tag{1}$$
Intuitively, the calibrated regressor should produce calibrated prediction intervals (PIs). For example, given the probability 95%, the calibrated regressor should output a prediction interval that approximately covers 95% of the ground truths in the long run. Formally, a regressor is well calibrated [8, 19] if the following condition holds for all $p \in [0, 1]$:
$$\frac{\sum_{i=1}^{N} \mathbb{I}\{y_i \leq F_i^{-1}(p)\}}{N} \to p, \quad \text{when } N \to \infty, \tag{2}$$
where $\mathbb{I}(\cdot)$ is the indicator function, equal to 1 if the predicate holds and 0 otherwise. More generally, for a prediction interval $[F_i^{-1}(p_1), F_i^{-1}(p_2)]$, there is a similar definition of two-sided calibration:
$$\frac{\sum_{i=1}^{N} \mathbb{I}\{F_i^{-1}(p_1) \leq y_i \leq F_i^{-1}(p_2)\}}{N} \to p_2 - p_1 \quad \text{for all } p_1, p_2 \in [0, 1]. \tag{3}$$
For this task, previous methods applied post-processing techniques [19, 15] or added a regularized loss [31, 36]. But when we want PIs at different confidence levels, the model must be retrained because the confidence level is predetermined in the loss function. In contrast, we argue that the key challenge for calibrated regression is obtaining a well-calibrated distribution. Based on this principle, our method adopts a distribution matching strategy and aims to directly learn a calibrated predictive distribution, which can naturally output well-calibrated CDFs and PIs for each input $\mathbf{x}_i$.

### 2.2 Maximum Mean Discrepancy

Our method adopts the maximum mean discrepancy (MMD) to perform distribution matching. Specifically, MMD is defined via the Hilbert space embedding of distributions, known as the kernel mean embedding [13]. Formally, given a probability distribution, the kernel mean embedding represents it as an element in a reproducing kernel Hilbert space (RKHS). An RKHS $\mathcal{F}$ on $\mathcal{X}$ with kernel function $k$ is a Hilbert space of functions $g: \mathcal{X} \to \mathbb{R}$. We use $\phi(x) = k(x, \cdot)$ to denote the feature map of $x$. The kernel mean embedding is the expectation of the feature map:
$$\mu_X := \mathbb{E}_X[\phi(X)] = \int_{\Omega} \phi(x) \, P(dx). \tag{4}$$
This kernel mean embedding can be used for density estimation and two-sample tests [13]. Based on the Hilbert space embedding, the maximum mean discrepancy (MMD) estimator was developed to distinguish two distributions $P$ and $Q$ [13]. Formally, the MMD measure is defined as follows:
$$L_m(P, Q) = \left\| \mathbb{E}_{X \sim P}[\phi(X)] - \mathbb{E}_{X \sim Q}[\phi(X)] \right\|_{\mathcal{F}}. \tag{5}$$
The MMD estimator is guaranteed to be unbiased and has nearly minimal variance among unbiased estimators [25]. Moreover, it was shown that $L_m(P, Q) = 0$ if and only if $P = Q$ [13]. We conduct a hypothesis test with the null hypothesis $H_0: P = Q$ and the alternative hypothesis $H_1: P \neq Q$, rejecting $H_0$ if $L_m(P, Q) > c_\alpha$ for some chosen threshold $c_\alpha > 0$. With a characteristic kernel function (e.g., the popular RBF kernels), the MMD measure can distinguish two different distributions and has been applied to generative modeling [24, 25]. In practice, the MMD objective can be estimated using empirical kernel mean embeddings:
$$\hat{L}_m^2(P, Q) = \left\| \frac{1}{N_1} \sum_{i=1}^{N_1} \phi(x_{1i}) - \frac{1}{N_2} \sum_{j=1}^{N_2} \phi(x_{2j}) \right\|_{\mathcal{F}}^2, \tag{6}$$
where $x_{1i}$ and $x_{2j}$ are independent random samples drawn from the distributions $P$ and $Q$, respectively.
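To make Eqn. (6) concrete, the following is a minimal NumPy sketch of an empirical (biased, V-statistic) MMD estimate under a mixture of RBF kernels. The function names, bandwidth grid, and use of one-dimensional samples are illustrative assumptions, not the paper's released code.

```python
# Minimal sketch of a (biased) empirical squared-MMD estimate with a mixture
# of RBF kernels; names and bandwidths are illustrative assumptions.
import numpy as np

def rbf_kernel(a, b, sigma):
    # k_sigma(a, b) = exp(-(a - b)^2 / (2 sigma^2)), computed pairwise for 1-d inputs.
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd_mixture_rbf(x1, x2, sigmas=(2.0, 4.0, 8.0, 16.0, 32.0)):
    """Biased estimate of the squared MMD between samples x1 ~ P and x2 ~ Q."""
    mmd2 = 0.0
    for s in sigmas:
        k_xx = rbf_kernel(x1, x1, s).mean()
        k_yy = rbf_kernel(x2, x2, s).mean()
        k_xy = rbf_kernel(x1, x2, s).mean()
        mmd2 += k_xx + k_yy - 2.0 * k_xy
    return mmd2

# Two-sample illustration: samples from the same Gaussian give a small value,
# samples from shifted Gaussians give a larger one.
rng = np.random.default_rng(0)
print(mmd_mixture_rbf(rng.normal(0, 1, 500), rng.normal(0, 1, 500)))
print(mmd_mixture_rbf(rng.normal(0, 1, 500), rng.normal(2, 1, 500)))
```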
## 3 Calibrated Regression with Maximum Mean Discrepancy

We now present uncertainty calibration with the maximum mean discrepancy and then plug it into the proposed calibrated regression model. We also give a theoretical guarantee to show the effectiveness of our uncertainty calibration strategy.

### 3.1 Uncertainty Calibration with Distribution Matching

In this part, we use $P$ and $Q$ to represent the unknown true distribution and the predictive distribution of our regression model, respectively. The distribution matching strategy of our uncertainty calibration model is to directly minimize the kernel embedding measure defined by MMD in Eqn. (6). The goal is to let the predictive distribution $Q$ converge asymptotically to the unknown target distribution $P$ so that we obtain calibrated CDFs $\{F_i\}$. The strategy is to minimize the MMD distance between the regression ground-truth targets $\{y_1, \ldots, y_N\}$ and random samples $\{\hat{y}_1, \ldots, \hat{y}_N\}$ from the predictive distribution $Q$. The specific form of the MMD distance loss $L_m$ is:
$$L_m^2(P, Q) := \left\| \frac{1}{N} \sum_{i=1}^{N} \phi(y_i) - \frac{1}{N} \sum_{i=1}^{N} \phi(\hat{y}_i) \right\|_{\mathcal{F}}^2. \tag{7}$$
We use a mixture of $K$ kernels spanning multiple ranges in our experiments:
$$k(x, x') = \sum_{i=1}^{K} k_{\sigma_i}(x, x'), \tag{8}$$
where $k_{\sigma_i}$ is an RBF kernel and the bandwidth parameter $\sigma_i$ can be set to simple values such as 2, 4, 8, etc. This mixture kernel has been proved to be characteristic, and it can maximize the two-sample test power while keeping the test error low [14]. In general, a mixture of five or more kernels is sufficient to obtain good results. With the incorporation of this MMD loss, we learn a calibrated predictive probability distribution, and the obtained uncertainties generalize to arbitrary confidence levels without retraining.

In theory, under $H_0: P = Q$, the predictive distribution $Q$ converges asymptotically to the true distribution $P$ as the sample size $N \to \infty$, which is why minimizing the MMD loss is effective for uncertainty calibration. Leveraging our distribution matching strategy, uncertainty calibration is achieved by narrowing the gap between $P$ and $Q$. Formally, we have the following theoretical result:

Theorem 1. Suppose that the predictive distribution $Q$ has sufficient capacity to approximate the true unknown distribution $P$ and the data are i.i.d. Then, by minimizing the MMD loss $L_m = \|\mu_{x_1} - \mu_{x_2}\|_{\mathcal{F}}$ in our proposed method, Eqn. (9) holds as the sample size $N \to \infty$:
$$\frac{\sum_{i=1}^{N} \mathbb{I}\{y_i \leq F_i^{-1}(p)\}}{N} \to p \quad \text{for all } p \in [0, 1]. \tag{9}$$

Proof. $L_m(P, Q) = 0$ if and only if $P = Q$ when $\mathcal{F}$ is the unit ball in a universal RKHS [13]. Under $H_0: P = Q$, the predictive distribution $Q(x)$ converges asymptotically to the unknown true distribution $P(x)$ as the sample size $N \to \infty$ by minimizing the MMD loss $L_m$. Further, Eqn. (9) holds for the obtained predictive distribution, because the confidence level $p$ is then exactly the proportion of samples $\{y_1, \ldots, y_N\}$ covered by the corresponding prediction interval.

This theoretical result can be generalized to the two-sided calibration condition defined in Eqn. (3); we defer the details to Appendix A.

### 3.2 Calibrated Regression with MMD

To represent the model uncertainty, we use a heteroscedastic neural network (HNN) as the predictive model, which outputs the predicted mean $\mu(\mathbf{x})$ and variance $\sigma^2(\mathbf{x})$ in the final layer and can combine epistemic uncertainty and aleatoric uncertainty in one model [30, 17]. Based on this representation model, we use a two-stage learning framework that optimizes two objectives one by one, namely the negative log-likelihood loss $L_h$ and the uncertainty calibration loss $L_m$.
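As a reference point for the base model, here is a minimal PyTorch sketch of a heteroscedastic network with a shared trunk and two output heads for $\mu(\mathbf{x})$ and $\log \sigma^2(\mathbf{x})$. The class name, layer widths, and depth are illustrative assumptions and not necessarily the authors' exact architecture.

```python
# Minimal sketch of a heteroscedastic network (HNN) with two output heads;
# layer sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

class HeteroscedasticNet(nn.Module):
    def __init__(self, in_dim, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean_head = nn.Linear(hidden, 1)    # predicts mu(x)
        self.logvar_head = nn.Linear(hidden, 1)  # predicts log sigma^2(x)

    def forward(self, x):
        h = self.trunk(x)
        return self.mean_head(h).squeeze(-1), self.logvar_head(h).squeeze(-1)
```

Predicting the log-variance rather than the variance keeps the output unconstrained, matching the parameterization $s_i = \log \sigma_\Theta^2(\mathbf{x}_i)$ used in Eqn. (11) below.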
In the first stage, the optimal model parameters are learned by minimizing the negative log-likelihood (NLL) loss:
$$L_h(\Theta) = \sum_{i=1}^{N} \left( \frac{\log \sigma_\Theta^2(\mathbf{x}_i)}{2} + \frac{(y_i - \mu_\Theta(\mathbf{x}_i))^2}{2 \sigma_\Theta^2(\mathbf{x}_i)} \right) + \text{constant}. \tag{10}$$
In practice, to improve numerical stability, we optimize the following equivalent form:
$$L_h(\Theta) = \sum_{i=1}^{N} \left( \frac{1}{2} \exp(-s_i) \, (y_i - \mu_\Theta(\mathbf{x}_i))^2 + \frac{1}{2} s_i \right), \quad s_i := \log \sigma_\Theta^2(\mathbf{x}_i). \tag{11}$$
Although the Gaussian assumption above is a bit restrictive, we found that the method performs satisfactorily in our experiments. In the second stage, we minimize the uncertainty calibration loss with MMD, i.e., Eqn. (7). The overall procedure is two-stage:
$$\text{step 1: } \min_\Theta L_h(\Theta; y, f(\mathbf{x})), \qquad \text{step 2: } \min_\Theta L_m(\Theta; y, f(\mathbf{x})), \tag{12}$$
where $L_m$ is the loss function of the distribution matching objective and $L_h$ is the loss function of distribution estimation. We detail the whole procedure in Algorithm 1. The main merits of the two-stage learning are to 1) utilize the representation capability of the HNN model in the first stage and 2) learn the calibrated predictive distribution via the distribution matching strategy in the second stage. Compared with the bi-level learning algorithm used in [19], which iterates between the two stages several times, our method runs the two stages only once, which reduces the computation cost of the kernel-based MMD component.

Comparison with Post-processing Calibration Methods. The previous post-processing methods [19, 36, 31] calibrate the uncertainty outputs on the input dataset without any model modifications and need to be retrained when the confidence level is changed. In contrast, the proposed distribution matching with MMD, albeit also regarded as a post-processing procedure, learns a calibrated predictive model, which means practitioners are not required to retrain the model to enjoy the calibration performance.

Algorithm 1: Deep calibrated reliable regression model.
Input: labeled training data and kernel bandwidth parameters
Output: trained mean $\mu(\mathbf{x}_i)$ and variance $\sigma(\mathbf{x}_i)$ of the predictive distribution
1: while not converged do
2:   Compute $\mu(\mathbf{x}_i)$ and $\log \sigma(\mathbf{x}_i)$
3:   Compute the NLL loss $L_h$ by Eqn. (11)
4:   Update model parameters $\Theta = \arg\min_\Theta L_h(\Theta; y, f(\mathbf{x}))$ by SGD
5: end while
6: while not converged do
7:   Compute $\mu(\mathbf{x}_i)$ and $\log \sigma(\mathbf{x}_i)$; randomly sample $\{\hat{y}_i\}_{i=1}^{N}$ from the predictive distribution
8:   Compute the MMD loss $L_m$ by Eqn. (7)
9:   Update model parameters $\Theta = \arg\min_\Theta L_m(\Theta; y, f(\mathbf{x}))$ by SGD
10: end while
11: return the trained model $f(\mathbf{x})$

## 4 Experiments

In this section, we compare the proposed method with several strong baselines on regression and time-series forecasting tasks in terms of predictive uncertainty. The time-series forecasting task models multiple regression sub-problems in sequence, and the tendency along the sliding windows can be used to examine the obtained predictive uncertainty. We then show the sensitivity analysis and the time efficiency of our proposed method.

### 4.1 Datasets and Experimental Settings

Baselines. We compare with several competitive baselines, including MC-Dropout (MCD) [12], Heteroscedastic Neural Network (HNN) [17], Deep Ensembles (Deep-ens) [21], Ensembled Likelihood (ELL), MC-Dropout Likelihood (MC NLL), Deep Gaussian Processes (DGP) [33], and the post-hoc calibration method using isotonic regression (ISR) [19]. ELL and MC NLL are our proposed variants inspired by Deep Ensembles. The variance of ELL is computed from the predictions of multiple networks during the training phase, and the variance of MC NLL is computed from multiple random predictions based on MC-Dropout during the training phase.
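As a concrete companion to Algorithm 1 above, the following is a minimal PyTorch sketch of the two-stage procedure, reusing the HeteroscedasticNet and mixture-kernel bandwidths from the earlier sketches. The loss forms follow Eqns. (7), (8), and (11); the optimizer settings, epoch counts, the data loader interface, and the use of the reparameterization trick for drawing $\hat{y}_i$ are illustrative assumptions rather than the authors' exact implementation.

```python
# Sketch of the two-stage procedure of Algorithm 1, assuming a model such as
# HeteroscedasticNet above; hyperparameters and names are assumptions.
import torch

def gaussian_nll(y, mu, logvar):
    # Eqn. (11): 0.5 * exp(-s) * (y - mu)^2 + 0.5 * s, with s = log sigma^2.
    return (0.5 * torch.exp(-logvar) * (y - mu) ** 2 + 0.5 * logvar).mean()

def mmd_loss(y, y_hat, sigmas=(2.0, 4.0, 8.0, 16.0, 32.0)):
    # Biased estimate of the squared MMD of Eqn. (7) with the mixture kernel
    # of Eqn. (8); y and y_hat are 1-d tensors.
    def k(a, b):
        d2 = (a[:, None] - b[None, :]) ** 2
        return sum(torch.exp(-d2 / (2.0 * s ** 2)) for s in sigmas)
    return k(y, y).mean() + k(y_hat, y_hat).mean() - 2.0 * k(y, y_hat).mean()

def train_two_stage(model, loader, epochs_nll=100, epochs_mmd=50, lr=1e-3):
    # Stage 1: fit mu(x) and log sigma^2(x) by minimizing the NLL (Eqn. (11)).
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs_nll):
        for x, y in loader:
            mu, logvar = model(x)
            loss = gaussian_nll(y, mu, logvar)
            opt.zero_grad(); loss.backward(); opt.step()
    # Stage 2: calibrate by distribution matching with the MMD loss (Eqn. (7)).
    # y_hat is drawn from N(mu, sigma^2); the reparameterization trick below is
    # one differentiable way to do this (an assumption; the paper only states
    # that samples are drawn from the predictive distribution).
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs_mmd):
        for x, y in loader:
            mu, logvar = model(x)
            y_hat = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
            loss = mmd_loss(y, y_hat)
            opt.zero_grad(); loss.backward(); opt.step()
    return model
```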
Details of these compared methods can be found in Appendix B.1.

Hyperparameters. For all experimental results, we report the averaged results and standard errors obtained from 5 random trials. The details of the hyperparameter settings can be found in Appendix B.2.

Datasets. We use several public datasets from the UCI repository [2] and Kaggle [1]: 1) for the time-series task: Pickups, Bike-sharing, PM2.5, Metro-traffic, and Air-quality; 2) for the regression task: Power Plant, Protein Structure, Naval Propulsion, and Wine. The details of the datasets can be found in Appendix B.3.

Evaluation Metrics. We evaluate the performance using two metrics: 1) RMSE for the prediction precision; and 2) the calibration error. The calibration error is the absolute difference between the true confidence and the empirical coverage probability. We use two variants: the expectation of the coverage probability error (ECPE) and the maximum coverage probability error (MCPE). The detailed definitions of the metrics are given in Appendix C.

### 4.2 Results of Time-series Forecasting Tasks

For time-series forecasting tasks, we construct an LSTM model with two hidden layers (128 and 64 hidden units, respectively) and a linear layer for making the final predictions. The size of the sliding window is 5 and the forecasting horizon is 1. Taking the Bike-sharing dataset as an example, the bike sharing data of the past five hours are used to predict the data one hour into the future. All datasets are split into 70% training data and 30% test data.

| Dataset | Metric | MCD | HNN | Deep-ens | MC NLL |
|---|---|---|---|---|---|
| Metro-traffic | ECPE | 0.304 ± 0.005 | 0.102 ± 0.002 | 0.100 ± 0.001 | 0.142 ± 0.010 |
| | MCPE | 0.505 ± 0.011 | 0.162 ± 0.003 | 0.160 ± 0.002 | 0.235 ± 0.014 |
| | RMSE | 523.6 ± 6.725 | 556.3 ± 3.332 | 508.9 ± 1.288 | 631.6 ± 14.23 |
| Bike-sharing | ECPE | 0.258 ± 0.011 | 0.054 ± 0.002 | 0.038 ± 0.001 | 0.119 ± 0.013 |
| | MCPE | 0.432 ± 0.020 | 0.089 ± 0.004 | 0.066 ± 0.008 | 0.206 ± 0.022 |
| | RMSE | 38.86 ± 0.141 | 40.71 ± 0.542 | 37.60 ± 0.355 | 61.57 ± 1.624 |
| Pickups | ECPE | 0.246 ± 0.017 | 0.078 ± 0.001 | 0.064 ± 0.006 | 0.088 ± 0.016 |
| | MCPE | 0.408 ± 0.025 | 0.117 ± 0.005 | 0.098 ± 0.011 | 0.136 ± 0.010 |
| | RMSE | 350.3 ± 6.562 | 359.8 ± 3.421 | 336.4 ± 1.653 | 526.8 ± 9.214 |
| PM2.5 | ECPE | 0.331 ± 0.013 | 0.022 ± 0.001 | 0.026 ± 0.003 | 0.081 ± 0.010 |
| | MCPE | 0.550 ± 0.025 | 0.050 ± 0.004 | 0.060 ± 0.004 | 0.151 ± 0.027 |
| | RMSE | 70.95 ± 2.629 | 58.81 ± 0.372 | 60.24 ± 0.114 | 66.77 ± 3.613 |
| Air-quality | ECPE | 0.329 ± 0.005 | 0.058 ± 0.003 | 0.045 ± 0.001 | 0.111 ± 0.004 |
| | MCPE | 0.561 ± 0.008 | 0.091 ± 0.006 | 0.072 ± 0.002 | 0.178 ± 0.004 |
| | RMSE | 81.16 ± 0.111 | 79.60 ± 0.254 | 80.03 ± 0.236 | 87.12 ± 0.971 |

| Dataset | Metric | ELL | DGP | ISR | proposed |
|---|---|---|---|---|---|
| Metro-traffic | ECPE | 0.048 ± 0.017 | 0.115 ± 0.007 | 0.032 ± 0.002 | 0.017 ± 0.001 |
| | MCPE | 0.075 ± 0.027 | 0.192 ± 0.013 | 0.051 ± 0.003 | 0.036 ± 0.002 |
| | RMSE | 613.5 ± 18.63 | 646.4 ± 0.302 | 556.3 ± 3.332 | 545.5 ± 4.225 |
| Bike-sharing | ECPE | 0.027 ± 0.003 | 0.121 ± 0.003 | 0.042 ± 0.002 | 0.006 ± 0.001 |
| | MCPE | 0.048 ± 0.055 | 0.213 ± 0.005 | 0.066 ± 0.005 | 0.019 ± 0.002 |
| | RMSE | 52.50 ± 2.901 | 55.39 ± 0.397 | 40.71 ± 0.542 | 37.93 ± 0.334 |
| Pickups | ECPE | 0.018 ± 0.008 | 0.098 ± 0.003 | 0.049 ± 0.002 | 0.008 ± 0.001 |
| | MCPE | 0.038 ± 0.016 | 0.160 ± 0.005 | 0.075 ± 0.004 | 0.023 ± 0.001 |
| | RMSE | 325.9 ± 11.23 | 440.3 ± 3.469 | 359.8 ± 3.421 | 346.9 ± 4.652 |
| PM2.5 | ECPE | 0.080 ± 0.007 | 0.061 ± 0.006 | 0.023 ± 0.002 | 0.010 ± 0.000 |
| | MCPE | 0.119 ± 0.011 | 0.149 ± 0.014 | 0.057 ± 0.006 | 0.035 ± 0.003 |
| | RMSE | 61.09 ± 0.434 | 61.44 ± 2.113 | 58.81 ± 0.372 | 57.43 ± 0.332 |
| Air-quality | ECPE | 0.018 ± 0.005 | 0.102 ± 0.002 | 0.030 ± 0.001 | 0.010 ± 0.001 |
| | MCPE | 0.04 ± 0.008 | 0.181 ± 0.003 | 0.044 ± 0.005 | 0.026 ± 0.001 |
| | RMSE | 90.01 ± 0.566 | 86.05 ± 0.210 | 79.60 ± 0.254 | 80.69 ± 0.292 |

Table 1: The forecast and calibration error scores of each method on the time-series datasets (mean ± std. error over 5 random trials). Each row gives the results for a particular metric on a dataset; columns correspond to methods.
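For concreteness, the following is a hedged sketch of how coverage-based calibration errors of this kind can be computed for a Gaussian predictive distribution: ECPE averages the absolute gap between empirical coverage and the expected confidence level over a grid of levels, and MCPE takes the maximum. The exact definitions used in the paper are in Appendix C, so the confidence-level grid and function names here are assumptions.

```python
# Hedged sketch of coverage-based calibration errors (ECPE, MCPE) for a
# Gaussian predictive model; the level grid and names are assumptions.
import numpy as np
from scipy.stats import norm

def coverage_errors(y, mu, sigma, levels=np.linspace(0.05, 0.95, 19)):
    """y, mu, sigma: 1-d arrays of targets and predicted means / std devs."""
    errors = []
    for p in levels:
        quantile = mu + sigma * norm.ppf(p)   # F_i^{-1}(p) for N(mu_i, sigma_i^2)
        coverage = np.mean(y <= quantile)     # empirical coverage at level p
        errors.append(abs(coverage - p))
    errors = np.array(errors)
    return errors.mean(), errors.max()        # (ECPE, MCPE)
```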
Figure 1 ((a) Dataset: Air Quality; (b) Dataset: Bike Sharing): For the time-series forecasting task, we plot the expected confidence level vs. the observed confidence level for all methods. The closer to the diagonal line, the better the uncertainty calibration. The results for other datasets can be found in the Appendix.

Figure 2 ((a) Dataset: Air Quality; (b) Dataset: Bike Sharing): Calibrated forecasting: prediction intervals (PIs) obtained at the 95% confidence level by our proposed method on a time series. As shown in the figure, the prediction intervals are sharp while accurately covering the ground truth. The results for other datasets can be found in the Appendix.

Table 1 presents the results of all the methods, including the forecast and calibration errors. Our method with the MMD distribution matching strategy achieves forecasting accuracy on par with the strong baselines in terms of RMSE², but significantly outperforms the baselines in uncertainty calibration, in terms of ECPE and MCPE, on all datasets. Besides, we prefer prediction intervals that are as tight as possible while accurately covering the ground truth in regression tasks. We measure sharpness using the width of the prediction intervals, as detailed in Appendix C; our method also obtains relatively tighter prediction intervals, as shown by the sharpness results reported in Table 4 in the Appendix. In addition, the ensemble method is second only to ours, owing to the power of ensembling multiple networks. But when the network complexity is greater than the data complexity, the computation of the ensemble method is quite expensive, whereas our method can also be applied to more complex NNs. Figure 1 shows the proportion of ground truths covered by the PIs at different confidence levels. The result of our model is closest to the diagonal line, which indicates the best uncertainty calibration among all methods. Figure 2 shows the predictions and the corresponding 95% prediction intervals. The intervals are visually sharp and accurately cover the ground truths.

² In Table 5 in the Appendix, we also show the results in other metrics, such as R², SMAPE, etc., which lead to a similar conclusion.

### 4.3 Results of Regression Tasks

For regression tasks, we use a fully connected neural network with two hidden layers (256 hidden units each), where each layer has a ReLU activation function. The size of our networks is close to the previous works [12, 19, 21] on regression problems. We evaluate on four UCI datasets varying in size from 4,898 to 45,730 samples. We randomly split 80% of each dataset for training and the rest for testing. Table 2 presents the results of all methods, from which we can draw similar conclusions as in the time-series forecasting tasks. The forecast results of our method are competitive in terms of RMSE, and the calibration error of our method is significantly smaller than that of existing methods. Figure 3 shows the uncertainty calibration performance of each method at different confidence levels in general regression tasks, and we find that our method significantly improves calibration.
| Dataset | Metric | MCD | HNN | Deep-ens | MC NLL |
|---|---|---|---|---|---|
| Power Plant | ECPE | 0.235 ± 0.021 | 0.094 ± 0.002 | 0.084 ± 0.004 | 0.095 ± 0.004 |
| | MCPE | 0.386 ± 0.038 | 0.151 ± 0.007 | 0.142 ± 0.001 | 0.153 ± 0.004 |
| | RMSE | 3.792 ± 0.171 | 3.843 ± 0.165 | 3.945 ± 0.150 | 3.936 ± 0.158 |
| Protein Structure | ECPE | 0.365 ± 0.011 | 0.042 ± 0.006 | 0.049 ± 0.001 | 0.086 ± 0.005 |
| | MCPE | 0.635 ± 0.021 | 0.071 ± 0.005 | 0.084 ± 0.002 | 0.138 ± 0.005 |
| | RMSE | 4.088 ± 0.014 | 4.337 ± 0.021 | 4.255 ± 0.010 | 4.574 ± 0.018 |
| Naval Propulsion | ECPE | 0.175 ± 0.077 | 0.038 ± 0.006 | 0.270 ± 0.016 | 0.216 ± 0.042 |
| | MCPE | 0.283 ± 0.116 | 0.065 ± 0.007 | 0.431 ± 0.025 | 0.344 ± 0.047 |
| | RMSE | 0.001 ± 0.000 | 0.001 ± 0.000 | 0.001 ± 0.000 | 0.001 ± 0.001 |
| Wine | ECPE | 0.235 ± 0.021 | 0.041 ± 0.003 | 0.012 ± 0.001 | 0.046 ± 0.006 |
| | MCPE | 0.386 ± 0.038 | 0.082 ± 0.013 | 0.034 ± 0.004 | 0.095 ± 0.011 |
| | RMSE | 0.732 ± 0.041 | 0.705 ± 0.038 | 0.672 ± 0.040 | 0.683 ± 0.064 |

| Dataset | Metric | ELL | DGP | ISR | proposed |
|---|---|---|---|---|---|
| Power Plant | ECPE | 0.019 ± 0.025 | 0.094 ± 0.005 | 0.062 ± 0.003 | 0.007 ± 0.001 |
| | MCPE | 0.035 ± 0.037 | 0.158 ± 0.008 | 0.105 ± 0.003 | 0.024 ± 0.003 |
| | RMSE | 4.186 ± 0.184 | 4.181 ± 0.009 | 3.843 ± 0.165 | 3.819 ± 0.112 |
| Protein Structure | ECPE | 0.038 ± 0.009 | 0.020 ± 0.002 | 0.014 ± 0.006 | 0.006 ± 0.000 |
| | MCPE | 0.075 ± 0.016 | 0.036 ± 0.004 | 0.027 ± 0.010 | 0.024 ± 0.002 |
| | RMSE | 4.519 ± 0.019 | 4.950 ± 0.011 | 4.337 ± 0.021 | 4.556 ± 0.012 |
| Naval Propulsion | ECPE | 0.059 ± 0.034 | 0.115 ± 0.007 | 0.021 ± 0.003 | 0.012 ± 0.001 |
| | MCPE | 0.117 ± 0.051 | 0.192 ± 0.012 | 0.036 ± 0.010 | 0.030 ± 0.004 |
| | RMSE | 0.002 ± 0.001 | 0.001 ± 0.000 | 0.001 ± 0.000 | 0.001 ± 0.000 |
| Wine | ECPE | 0.073 ± 0.009 | 0.178 ± 0.003 | 0.083 ± 0.006 | 0.008 ± 0.002 |
| | MCPE | 0.103 ± 0.011 | 0.300 ± 0.006 | 0.127 ± 0.008 | 0.024 ± 0.004 |
| | RMSE | 0.684 ± 0.061 | 0.754 ± 0.031 | 0.705 ± 0.038 | 0.705 ± 0.035 |

Table 2: The calibration error scores of the uncertainty evaluation and RMSE for each method on the regression datasets (mean ± std. error over 5 random trials). Each row gives the results for a particular metric on a dataset; columns correspond to methods. Our method improves calibration and outperforms all baselines.

Figure 3 ((a) Dataset: Naval Propulsion; (b) Dataset: Protein Structure): For the regression task, we plot the expected confidence level vs. the observed confidence level for all methods. The closer to the diagonal line, the better the uncertainty calibration. The results for other datasets can be found in the Appendix.

### 4.4 Computation Efficiency

We analyze the time complexity of the methods that generate an uncertainty distribution and are therefore relatively computationally expensive: DGP, Deep Ensembles, ELL, and our proposed method. For the regression task, these four methods use the same network structure, a fully connected neural network with 256 hidden units in each hidden layer. The training and inference of DGP are performed using a doubly stochastic variational inference algorithm [33]. As can be seen in Figure 4, DGP is the most time-consuming; its training time increases almost linearly as the number of network layers increases. Our method requires the least computation time among these methods when the model complexity becomes higher, while also keeping the calibration error low. This result supports our argument that our method is not computationally expensive compared to the baseline methods.
Figure 4: The computation time (log seconds) of four methods during the model training phase (left) and the calibration error of different models (right) on the Wine dataset, on a GTX 1080 Ti. We can see that our method is also efficient in both computation and calibration for more complex models.

## 5 Conclusion and Discussions

We present a flexible and effective uncertainty calibration method with the MMD distribution matching strategy for regression and time-series forecasting tasks. Our method is guaranteed to produce well-calibrated predictions given sufficient data under mild assumptions. Extensive experimental results show that our method produces reliable predictive distributions and obtains well-calibrated and sharp prediction intervals.

There are several directions for future investigation. Firstly, the Gaussian likelihood may be too restrictive sometimes, and one could use a mixture distribution or a more complex network, e.g., a mixture density network [4], as the base model. Secondly, our calibration strategy can be extended to classification tasks, but the challenge to overcome is the impact of batch size and binning on the performance of MMD. Thirdly, the kernels used in the MMD definition can be defined on other data structures, such as graphs and time series [16]. Finally, it is interesting to investigate the required sample size for a given task. Specifically, we provide an asymptotic analysis of well-calibration, while in practice we only have finite data. [13] shows that MMD has performance guarantees at finite sample sizes, based on uniform convergence bounds: regardless of whether or not $P = Q$, the empirical MMD converges in probability at rate $O((m + n)^{-1/2})$ to its population value, where $m$ and $n$ respectively denote the number of samples drawn from $P$ and $Q$. So a further investigation of finite-sample bounds for our method is worth considering in future work.

## Statement of Potential Broader Impact

Uncertainty exists in many aspects of our daily life and plays a critical role in the application of modern machine learning methods. Unreliable uncertainty quantification may bring safety and reliability issues to applications such as medical diagnosis, autonomous driving, and demand forecasting. Although deep learning has achieved impressive accuracy on many tasks, NNs are often poor at providing accurate predictive uncertainty. Machine learning models should provide accurate confidence bounds (i.e., uncertainty estimates) for these safety-critical tasks. This paper aims to address the problem of inaccurate predictive uncertainty quantification in regression models. Our method produces a well-calibrated predictive distribution while achieving high-precision forecasting for regression tasks, and it naturally generates reliable prediction intervals at any confidence level we need. Our proposal has a positive impact on a variety of tasks that use regression models. For example, our proposed model produces more accurate demand forecasts based on historical sales data for a retail company, which can be used to calculate safety stock and avoid losing customers. We believe that it is necessary to consider uncertainty calibration for many machine learning models, which will improve the safety and reliability of machine learning and deep learning methods.

## Acknowledgement

We would like to thank the anonymous reviewers for their useful comments, especially Reviewers 1 and 3. Part of this work was done when the first two authors were working at RealAI.
This work was supported by the National Key Research and Development Program of China (No. 2017YFA0700904), NSFC Projects (Nos. 61620106010, U19B2034, U1811461), Beijing Academy of Artificial Intelligence (BAAI), Tsinghua-Huawei Joint Research Program, a grant from Tsinghua Institute for Guo Qiang, Tiangong Institute for Intelligent Computing, and the NVIDIA NVAIL Program with GPU/DGX Acceleration.

## References

[1] NYC Uber pickups with weather and holidays | Kaggle. https://www.kaggle.com/yannisp/uber-pickups-enriched. Accessed: 2020-06-04.
[2] UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed: 2020-06-04.
[3] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.
[4] Christopher M. Bishop. Mixture density networks. 1994.
[5] Glenn W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3, 1950.
[6] Justin Cosentino and Jun Zhu. Generative well-intentioned networks. In Advances in Neural Information Processing Systems, pages 13098–13109, 2019.
[7] Andreas Damianou and Neil Lawrence. Deep Gaussian processes. In Artificial Intelligence and Statistics, pages 207–215, 2013.
[8] A. Philip Dawid. The well-calibrated Bayesian. Journal of the American Statistical Association, 77(379):605–610, 1982.
[9] Morris H. DeGroot and Stephen E. Fienberg. The comparison and evaluation of forecasters. Journal of the Royal Statistical Society: Series D (The Statistician), 32(1-2):12–22, 1983.
[10] Armen Der Kiureghian and Ove Ditlevsen. Aleatory or epistemic? Does it matter? Structural Safety, 31(2):105–112, 2009.
[11] Yarin Gal. Uncertainty in Deep Learning. PhD thesis, University of Cambridge, 2016.
[12] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
[13] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
[14] Arthur Gretton, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, Kenji Fukumizu, and Bharath K. Sriperumbudur. Optimal kernel choice for large-scale two-sample tests. In Advances in Neural Information Processing Systems, pages 1205–1213, 2012.
[15] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1321–1330. JMLR.org, 2017.
[16] Thomas Hofmann, Bernhard Schölkopf, and Alexander J. Smola. Kernel methods in machine learning. The Annals of Statistics, pages 1171–1220, 2008.
[17] Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems, pages 5574–5584, 2017.
[18] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[19] Volodymyr Kuleshov, Nathan Fenner, and Stefano Ermon. Accurate uncertainties for deep learning using calibrated regression. In International Conference on Machine Learning, pages 2796–2804, 2018.
[20] Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu. Modeling long- and short-term temporal patterns with deep neural networks. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 95–104, 2018.
[21] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413, 2017.
[22] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[23] Christian Leibig, Vaneeda Allken, Murat Seçkin Ayhan, Philipp Berens, and Siegfried Wahl. Leveraging uncertainty information from deep neural networks for disease detection. Scientific Reports, 7(1):1–14, 2017.
[24] Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos. MMD GAN: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, pages 2203–2213, 2017.
[25] Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. In International Conference on Machine Learning, pages 1718–1727, 2015.
[26] David J. C. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.
[27] Rhiannon Michelmore, Matthew Wicker, Luca Laurenti, Luca Cardelli, Yarin Gal, and Marta Kwiatkowska. Uncertainty quantification with statistical guarantees in end-to-end autonomous driving control. In International Conference on Robotics and Automation, 2020.
[28] Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[29] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, pages 625–632, 2005.
[30] David A. Nix and Andreas S. Weigend. Estimating the mean and variance of the target probability distribution. In Proceedings of the 1994 IEEE International Conference on Neural Networks (ICNN'94), volume 1, pages 55–60. IEEE, 1994.
[31] Tim Pearce, Alexandra Brintrup, Mohamed Zaki, and Andy Neely. High-quality prediction intervals for deep learning: A distribution-free, ensembled approach. In International Conference on Machine Learning, pages 4075–4084, 2018.
[32] John Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999.
[33] Hugh Salimbeni and Marc Deisenroth. Doubly stochastic variational inference for deep Gaussian processes. In Advances in Neural Information Processing Systems, pages 4588–4599, 2017.
[34] Jiaxin Shi, Shengyang Sun, and Jun Zhu. Kernel implicit variational inference. In International Conference on Learning Representations, 2018.
[35] Hao Song, Tom Diethe, Meelis Kull, and Peter Flach. Distribution calibration for regression. In International Conference on Machine Learning, pages 5897–5906, 2019.
[36] Jayaraman J. Thiagarajan, Bindya Venkatesh, Prasanna Sattigeri, and Peer-Timo Bremer. Building calibrated deep models via uncertainty matching with auxiliary interval predictors. In AAAI Conference on Artificial Intelligence, 2019.
[37] Chris Tofallis. A better measure of relative prediction accuracy for model selection and model estimation. Journal of the Operational Research Society, 66(8):1352–1362, 2015.
[38] Ziyu Wang, Tongzheng Ren, Jun Zhu, and Bo Zhang. Function space particle optimization for Bayesian neural networks. In International Conference on Learning Representations, 2019.
[39] Lingxue Zhu and Nikolay Laptev. Deep and confident prediction for time series at Uber. In 2017 IEEE International Conference on Data Mining Workshops (ICDMW), pages 103–110. IEEE, 2017.