# CALIBRATION TESTS BEYOND CLASSIFICATION

David Widmann
Department of Information Technology
Uppsala University, Sweden
david.widmann@it.uu.se

Fredrik Lindsten
Division of Statistics and Machine Learning
Linköping University, Sweden
fredrik.lindsten@liu.se

Dave Zachariah
Department of Information Technology
Uppsala University, Sweden
dave.zachariah@it.uu.se

ABSTRACT

Most supervised machine learning tasks are subject to irreducible prediction errors. Probabilistic predictive models address this limitation by providing probability distributions that represent a belief over plausible targets, rather than point estimates. Such models can be a valuable tool in decision-making under uncertainty, provided that the model output is meaningful and interpretable. Calibrated models guarantee that the probabilistic predictions are neither over- nor under-confident. In the machine learning literature, different measures and statistical tests have been proposed and studied for evaluating the calibration of classification models. For regression problems, however, research has been focused on a weaker condition of calibration based on predicted quantiles for real-valued targets. In this paper, we propose the first framework that unifies calibration evaluation and tests for general probabilistic predictive models. It applies to any such model, including classification and regression models of arbitrary dimension. Furthermore, the framework generalizes existing measures and provides a more intuitive reformulation of a recently proposed framework for calibration in multi-class classification. In particular, we reformulate and generalize the kernel calibration error, its estimators, and hypothesis tests using scalar-valued kernels, and evaluate the calibration of real-valued regression problems.¹

1 INTRODUCTION

We consider the general problem of modelling the relationship between a feature X and a target Y in a probabilistic setting, i.e., we focus on models that approximate the conditional probability distribution $\mathbb{P}(Y \mid X)$ of target Y for a given feature X. The use of probabilistic models that output a probability distribution instead of a point estimate demands guarantees on the predictions beyond accuracy, enabling meaningful and interpretable predicted uncertainties. One such statistical guarantee is calibration, which has been studied extensively in the meteorological and statistical literature (De Groot & Fienberg, 1983; Murphy & Winkler, 1977). A calibrated model ensures that almost every prediction matches the conditional distribution of targets given this prediction. Loosely speaking, in a classification setting a predicted distribution of the model is called calibrated (or reliable) if the empirically observed frequencies of the different classes match the predictions in the long run, were the same class probabilities to be predicted repeatedly. A classical example is a weather forecaster who predicts each day whether it is going to rain on the next day. If she predicts rain with probability 60% for a long series of days, her forecasting model is calibrated for predictions of 60% if it actually rains on 60% of these days. If this property holds for almost every probability distribution that the model outputs, then the model is considered to be calibrated.

¹The source code of the experiments is available at https://github.com/devmotion/Calibration_ICLR2021.
Calibration is an appealing property of a probabilistic model since it provides safety guarantees on the predicted distributions even in the common case when the model does not predict the true distributions $\mathbb{P}(Y \mid X)$. Calibration, however, does not guarantee accuracy (or refinement): a model that always predicts the marginal probabilities of each class is calibrated but probably inaccurate and of limited use. On the other hand, accuracy does not imply calibration either, since the predictions of an accurate model can be too over-confident and hence miscalibrated, as observed, e.g., for deep neural networks (Guo et al., 2017).

In the field of machine learning, calibration has been studied mainly for classification problems (Bröcker, 2009; Guo et al., 2017; Kull et al., 2017; 2019; Kumar et al., 2018; Platt, 2000; Vaicenavicius et al., 2019; Widmann et al., 2019; Zadrozny, 2002) and for quantiles and confidence intervals of models for regression problems with real-valued targets (Fasiolo et al., 2020; Ho & Lee, 2005; Kuleshov et al., 2018; Rueda et al., 2006; Taillardat et al., 2016). In our work, however, we do not restrict ourselves to these problem settings but instead consider calibration for arbitrary predictive models. Thus, we generalize the common notion of calibration as:

Definition 1. Consider a model $P_X := P(Y \mid X)$ of a conditional probability distribution $\mathbb{P}(Y \mid X)$. Then model P is said to be calibrated if and only if
$$\mathbb{P}(Y \mid P_X) = P_X \quad \text{almost surely}. \tag{1}$$

If P is a classification model, Definition 1 coincides with the notion of (multi-class) calibration by Bröcker (2009); Kull et al. (2019); Vaicenavicius et al. (2019). Alternatively, in classification some authors (Guo et al., 2017; Kumar et al., 2018; Naeini et al., 2015) study the strictly weaker property of confidence calibration (Kull et al., 2019), which only requires
$$\mathbb{P}(Y = \arg\max P_X \mid \max P_X) = \max P_X \quad \text{almost surely}. \tag{2}$$
This notion of calibration corresponds to calibration according to Definition 1 for a reduced problem with binary targets $\tilde{Y} := \mathbb{1}(Y = \arg\max P_X)$ and Bernoulli distributions $\tilde{P}_X := \mathrm{Ber}(\max P_X)$ as probabilistic models.

For real-valued targets, Definition 1 coincides with the so-called distribution-level calibration by Song et al. (2019). Distribution-level calibration implies that the predicted quantiles are calibrated, i.e., the outcomes for all real-valued predictions of the, e.g., 75% quantile are actually below the predicted quantile with 75% probability (Song et al., 2019, Theorem 1). Conversely, although quantile-based calibration is a common approach for real-valued regression problems (Fasiolo et al., 2020; Ho & Lee, 2005; Kuleshov et al., 2018; Rueda et al., 2006; Taillardat et al., 2016), it provides weaker guarantees on the predictions. For instance, the linear regression model in Fig. 1 empirically shows quantiles that appear close to being calibrated albeit being uncalibrated according to Definition 1.

Figure 1: Illustration of a conditional distribution $\mathbb{P}(Y \mid X)$ with scalar feature and target (right panel: cumulative probability plotted against quantile level). We consider a Gaussian predictive model P, obtained by ordinary least squares regression with 100 training data points (orange dots). Empirically the predicted quantiles on 50 validation data points appear close to being calibrated, although model P is uncalibrated according to Definition 1. Using the framework in this paper, on the same validation data a statistical test allows us to reject the null hypothesis that model P is calibrated at a significance level of α = 0.05 (p < 0.05). See Appendix A.1 for details.
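To make the quantile-based check concrete, the following minimal sketch (Python; the sinusoidal data-generating process is an assumption chosen for illustration, since the one underlying Figure 1 is not specified here) fits a homoscedastic Gaussian model by ordinary least squares and compares predicted quantile levels with their empirical coverage on validation data.

```python
# Minimal sketch of a quantile-calibration check (assumed data-generating
# process; not the experiment behind Figure 1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical nonlinear ground truth with Gaussian noise.
x_train = rng.uniform(-1.0, 1.0, 100)
y_train = np.sin(np.pi * x_train) + 0.2 * rng.standard_normal(100)
x_val = rng.uniform(-1.0, 1.0, 50)
y_val = np.sin(np.pi * x_val) + 0.2 * rng.standard_normal(50)

# Gaussian predictive model from ordinary least squares: N(a + b*x, sigma^2).
b, a = np.polyfit(x_train, y_train, 1)
sigma = (y_train - (a + b * x_train)).std(ddof=2)

# Empirical coverage of the predicted quantiles on validation data.
pit = stats.norm.cdf(y_val, loc=a + b * x_val, scale=sigma)
levels = np.linspace(0.05, 0.95, 19)
coverage = np.array([(pit <= q).mean() for q in levels])
# Quantile calibration only requires coverage ≈ levels; it does not imply
# calibration in the sense of Definition 1.
```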
Figure 1 also raises the question of how to assess calibration for general target spaces in the sense of Definition 1, without having to rely on visual inspection. In classification, measures of calibration such as the commonly used expected calibration error (ECE) (Guo et al., 2017; Kull et al., 2019; Naeini et al., 2015; Vaicenavicius et al., 2019) and the maximum calibration error (MCE) (Naeini et al., 2015) try to capture the average and maximal discrepancy between the distributions on the left-hand side and the right-hand side of Eq. (1) or Eq. (2), respectively. These measures can be generalized to other target spaces (see Definition B.1), but unfortunately estimating these calibration errors from observations of features and corresponding targets is problematic. Typically, the predictions are different for (almost) all observations, and hence estimation of the conditional probability $\mathbb{P}(Y \mid P_X)$, which is needed in the estimation of ECE and MCE, is challenging even for low-dimensional target spaces and usually leads to biased and inconsistent estimators (Vaicenavicius et al., 2019).

Kernel-based calibration errors such as the maximum mean calibration error (MMCE) (Kumar et al., 2018) and the kernel calibration error (KCE) (Widmann et al., 2019) for confidence and multi-class calibration, respectively, can be estimated without first estimating the conditional probability and hence avoid this issue. They are defined as the expected value of a weighted sum of the differences of the left- and right-hand side of Eq. (1) for each class, where the weights are given as a function of the predictions (of all classes) and chosen such that the calibration error is maximized. A reformulation with matrix-valued kernels (Widmann et al., 2019) yields unbiased and differentiable estimators without explicit dependence on $\mathbb{P}(Y \mid P_X)$, which simplifies the estimation and makes it possible to explicitly account for calibration in the training objective (Kumar et al., 2018). Additionally, the kernel-based framework allows the derivation of reliable statistical hypothesis tests for calibration in multi-class classification (Widmann et al., 2019). However, both the construction as a weighted difference of the class-wise distributions in Eq. (1) and the reformulation with matrix-valued kernels require finite target spaces and hence cannot be applied to regression problems. To be able to deal with general target spaces, we present a new and more general framework of calibration errors without these limitations.

Our framework can be used to reason about and test for calibration of any probabilistic predictive model. As explained above, this is in stark contrast with existing methods that are restricted to simple output distributions, such as classification and scalar-valued regression problems. A key contribution of this paper is a new framework that is applicable to multivariate regression, as well as situations when the output is of a different (e.g., discrete ordinal) or more complex (e.g., graph-structured) type, with clear practical implications. Within this framework a KCE for general target spaces is obtained. We want to highlight that for multi-class classification problems its formulation is more intuitive and simpler to use than the measure proposed by Widmann et al. (2019) based on matrix-valued kernels.
To ease the application of the KCE we derive several estimators of the KCE with subquadratic sample complexity and their asymptotic properties in tests for calibrated models, which improve on existing estimators and tests in the two-sample test literature by exploiting the special structure of the calibration framework. Using the proposed framework, we numerically evaluate the calibration of neural network models and ensembles of such models.

2 CALIBRATION ERROR: A GENERAL FRAMEWORK

In classification, the distributions on the left- and right-hand side of Eq. (1) can be interpreted as vectors in the probability simplex. Hence ultimately the distance measure for ECE and MCE (see Definition B.1) can be chosen as a distance measure of real-valued vectors. The total variation, Euclidean, and squared Euclidean distances are common choices (Guo et al., 2017; Kull et al., 2019; Vaicenavicius et al., 2019). However, in a general setting, measuring the discrepancy between $\mathbb{P}(Y \mid P_X)$ and $P_X$ cannot necessarily be reduced to measuring distances between vectors. The conditional distribution $\mathbb{P}(Y \mid P_X)$ can be arbitrarily complex, even if the predicted distributions are restricted to a simple class of distributions that can be represented as real-valued vectors. Hence in general we have to resort to dedicated distance measures of probability distributions. Additionally, the estimation of conditional distributions $\mathbb{P}(Y \mid P_X)$ is challenging, even more so than in the restricted case of classification, since in general these distributions can be arbitrarily complex.

To circumvent this problem, we propose to use the following construction: we define a random variable $Z_X \sim P_X$ obtained from the predictive model and study the discrepancy between the joint distributions of the two pairs of random variables $(P_X, Y)$ and $(P_X, Z_X)$, respectively, instead of the discrepancy between the conditional distributions $\mathbb{P}(Y \mid P_X)$ and $P_X$. Since $(P_X, Y) \stackrel{d}{=} (P_X, Z_X)$ if and only if $\mathbb{P}(Y \mid P_X) = P_X$ almost surely, model P is calibrated if and only if the distributions of $(P_X, Y)$ and $(P_X, Z_X)$ are equal.

The random variable pairs $(P_X, Y)$ and $(P_X, Z_X)$ take values in the product space $\mathcal{P} \times \mathcal{Y}$, where $\mathcal{P}$ is the space of predicted distributions $P_X$ and $\mathcal{Y}$ is the space of targets $Y$. For instance, in classification, $\mathcal{P}$ could be the probability simplex and $\mathcal{Y}$ the set of all class labels, whereas in the case of Gaussian predictive models for scalar targets $\mathcal{P}$ could be the space of normal distributions and $\mathcal{Y}$ be $\mathbb{R}$.

The study of the joint distributions of $(P_X, Y)$ and $(P_X, Z_X)$ motivates the definition of a generally applicable calibration error as an integral probability metric (Müller, 1997; Sriperumbudur et al., 2009; 2012) between these distributions. In contrast to common f-divergences such as the Kullback-Leibler divergence, integral probability metrics do not require that one distribution is absolutely continuous with respect to the other, which cannot be guaranteed in general.

Definition 2. Let $\mathcal{Y}$ denote the space of targets $Y$, and $\mathcal{P}$ the space of predicted distributions $P_X$. We define the calibration error with respect to a space of functions $\mathcal{F}$ of the form $f \colon \mathcal{P} \times \mathcal{Y} \to \mathbb{R}$ as
$$\mathrm{CE}_{\mathcal{F}} := \sup_{f \in \mathcal{F}} \big| \mathbb{E}_{P_X, Y} f(P_X, Y) - \mathbb{E}_{P_X, Z_X} f(P_X, Z_X) \big|. \tag{3}$$

By construction, if model P is calibrated, then $\mathrm{CE}_{\mathcal{F}} = 0$ regardless of the choice of $\mathcal{F}$. However, the converse statement is not true for arbitrary function spaces $\mathcal{F}$.
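As a concrete illustration of this construction (a minimal sketch assuming univariate Gaussian predictions represented by their parameters; the data-generating process and model below are hypothetical), one draws $Z_X \sim P_X$ for every prediction and collects the two samples whose joint distributions are compared in Eq. (3).

```python
# Sketch of the two samples compared in Definition 2 for Gaussian predictions
# P_X = N(mu(X), sigma(X)^2), encoded as (mu, sigma, target) triples.
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical data: Y | X ~ N(X, 0.5^2).
X = rng.uniform(-1.0, 1.0, n)
Y = X + 0.5 * rng.standard_normal(n)

# A miscalibrated model: correct mean, underestimated standard deviation.
mu_hat = X
sigma_hat = np.full(n, 0.25)

# Artificial targets Z_X ~ P_X drawn from the model's own predictions.
Z = mu_hat + sigma_hat * rng.standard_normal(n)

sample_PY = np.column_stack([mu_hat, sigma_hat, Y])  # realizations of (P_X, Y)
sample_PZ = np.column_stack([mu_hat, sigma_hat, Z])  # realizations of (P_X, Z_X)
```

Any two-sample discrepancy between `sample_PY` and `sample_PZ`, for instance the MMD used in Section 3, then provides an estimate of the calibration error $\mathrm{CE}_{\mathcal{F}}$.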
From the theory of integral probability metrics (see, e.g., Müller, 1997; Sriperumbudur et al., 2009; 2012), we know that for certain choices of $\mathcal{F}$ the calibration error in Eq. (3) is a well-known metric on the product space $\mathcal{P} \times \mathcal{Y}$, which implies that $\mathrm{CE}_{\mathcal{F}} = 0$ if and only if model P is calibrated. Prominent examples include the maximum mean discrepancy² (MMD) (Gretton et al., 2007), the total variation distance, the Kantorovich distance, and the Dudley metric (Dudley, 1989, p. 310). As pointed out above, Definition 2 generalizes the definition for multi-class classification proposed by Widmann et al. (2019), which is based on vector-valued functions and only applicable to finite target spaces, to any probabilistic predictive model. In Appendix E we show this explicitly and discuss the special case of classification problems in more detail. Previous results (Widmann et al., 2019) imply that in classification MMCE and, for common distance measures $d(\cdot, \cdot)$ such as the total variation and squared Euclidean distance, $\mathrm{ECE}_d$ and $\mathrm{MCE}_d$ are special cases of $\mathrm{CE}_{\mathcal{F}}$. In Appendix G we show that our framework also covers natural extensions of $\mathrm{ECE}_d$ and $\mathrm{MCE}_d$ to countably infinite discrete target spaces, which to our knowledge have not been studied before and occur, e.g., in Poisson regression.

The literature on integral probability metrics suggests that we can resort to estimating $\mathrm{CE}_{\mathcal{F}}$ from i.i.d. samples from the distributions of $(P_X, Y)$ and $(P_X, Z_X)$. For the MMD, the Kantorovich distance, and the Dudley metric, tractable strongly consistent empirical estimators exist (Sriperumbudur et al., 2012). Here the empirical estimator for the MMD is particularly appealing since, compared with the other estimators, it is computationally cheaper, the empirical estimate converges at a faster rate to the population value, and the rate of convergence is independent of the dimension $d$ of the space (for $\mathcal{S} = \mathbb{R}^d$) (Sriperumbudur et al., 2012). Our specific design of $(P_X, Z_X)$ can be exploited to improve on these estimators. If $\mathbb{E}_{Z_x \sim P_x} f(P_x, Z_x)$ can be evaluated analytically for a fixed prediction $P_x$, then $\mathrm{CE}_{\mathcal{F}}$ can be estimated empirically with reduced variance by marginalizing out $Z_X$. Otherwise $\mathbb{E}_{Z_x \sim P_x} f(P_x, Z_x)$ has to be estimated, but in contrast to the common estimators of the integral probability metrics discussed above, the artificial construction of $Z_X$ allows us to approximate it by numerical integration methods such as (quasi) Monte Carlo integration or quadrature rules with arbitrarily small error and variance. Monte Carlo integration preserves statistical properties of the estimators such as unbiasedness and consistency.

²As we discuss in Section 3, the MMD is a metric if and only if the employed kernel is characteristic.

3 KERNEL CALIBRATION ERROR

For the remaining parts of the paper we focus on the MMD formulation of $\mathrm{CE}_{\mathcal{F}}$ due to the appealing properties of the common empirical estimator mentioned above. We derive calibration-specific analogues of results for the MMD that exploit the special structure of the distribution of $(P_X, Z_X)$ to improve on existing estimators and tests in the MMD literature. To the best of our knowledge these variance-reduced estimators and tests have not been discussed in the MMD literature.

Let $k \colon (\mathcal{P} \times \mathcal{Y}) \times (\mathcal{P} \times \mathcal{Y}) \to \mathbb{R}$ be a measurable kernel with corresponding reproducing kernel Hilbert space (RKHS) $\mathcal{H}$, and assume that $\mathbb{E}_{P_X, Y}\, k^{1/2}\big((P_X, Y), (P_X, Y)\big) < \infty$ and $\mathbb{E}_{P_X, Z_X}\, k^{1/2}\big((P_X, Z_X), (P_X, Z_X)\big) < \infty$. We discuss how such kernels can be constructed in a generic way in Section 3.1 below.

Definition 3.
Let $\mathcal{F}_k$ denote the unit ball in $\mathcal{H}$, i.e., $\mathcal{F}_k := \{f \in \mathcal{H} \colon \|f\|_{\mathcal{H}} \le 1\}$. Then the kernel calibration error (KCE) with respect to kernel k is defined as
$$\mathrm{KCE}_k := \mathrm{CE}_{\mathcal{F}_k} = \sup_{f \in \mathcal{F}_k} \big| \mathbb{E}_{P_X, Y} f(P_X, Y) - \mathbb{E}_{P_X, Z_X} f(P_X, Z_X) \big|.$$

As known from the MMD literature, a more explicit formulation can be given for the squared kernel calibration error $\mathrm{SKCE}_k := \mathrm{KCE}_k^2$ (see Lemma B.2). A similar explicit expression for $\mathrm{SKCE}_k$ was obtained by Widmann et al. (2019) for the special case of classification problems. However, their expression relies on $\mathcal{Y}$ being finite and is based on matrix-valued kernels over the finite-dimensional probability simplex $\mathcal{P}$. A key difference to the expression in Lemma B.2 is that we instead propose to use real-valued kernels defined on the product space of predictions and targets. This construction is applicable to arbitrary target spaces and does not require $\mathcal{Y}$ to be finite.

3.1 CHOICE OF KERNEL

The construction of the product space $\mathcal{P} \times \mathcal{Y}$ suggests the use of tensor product kernels $k = k_{\mathcal{P}} \otimes k_{\mathcal{Y}}$, where $k_{\mathcal{P}} \colon \mathcal{P} \times \mathcal{P} \to \mathbb{R}$ and $k_{\mathcal{Y}} \colon \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ are kernels on the spaces of predicted distributions and targets, respectively.³ By definition, so-called characteristic kernels guarantee that $\mathrm{KCE} = 0$ if and only if the distributions of $(P_X, Y)$ and $(P_X, Z_X)$ are equal (Fukumizu et al., 2004; 2008). Many common kernels such as the Gaussian and Laplacian kernels on $\mathbb{R}^d$ are characteristic (Fukumizu et al., 2008).⁴ Szabó & Sriperumbudur (2018, Theorem 4) showed that a tensor product kernel $k_{\mathcal{P}} \otimes k_{\mathcal{Y}}$ is characteristic if $k_{\mathcal{P}}$ and $k_{\mathcal{Y}}$ are characteristic, continuous, bounded, and translation-invariant kernels on $\mathbb{R}^d$, but the implication does not hold for general characteristic kernels (Szabó & Sriperumbudur, 2018, Example 1). For calibration evaluation, however, it is sufficient to be able to distinguish between the conditional distributions $\mathbb{P}(Y \mid P_X)$ and $\mathbb{P}(Z_X \mid P_X) = P_X$. Therefore, in contrast to the regular MMD setting, it is sufficient that kernel $k_{\mathcal{Y}}$ is characteristic and kernel $k_{\mathcal{P}}$ is non-zero almost surely to guarantee that $\mathrm{KCE} = 0$ if and only if model P is calibrated.

This suggests constructing kernels on general spaces of predicted distributions as
$$k_{\mathcal{P}}(p, p') = \exp\big(-\lambda\, d_{\mathcal{P}}^{\nu}(p, p')\big), \tag{4}$$
where $d_{\mathcal{P}}(\cdot, \cdot)$ is a metric on $\mathcal{P}$ and $\nu, \lambda > 0$ are kernel hyperparameters. The Wasserstein distance is a widely used metric for distributions from optimal transport theory that allows one to lift a ground metric on the target space and possesses many important properties (see, e.g., Peyré & Cuturi, 2019, Chapter 2.4). In general, however, it does not lead to valid kernels $k_{\mathcal{P}}$, apart from the notable exception of elliptically contoured distributions such as normal and Laplace distributions (Peyré & Cuturi, 2019, Chapter 8.3).

³As mentioned above, our framework rephrases and generalizes the construction used by Widmann et al. (2019). The matrix-valued kernels that they employ can be recovered by setting $k_{\mathcal{P}}$ to a Laplacian kernel on the probability simplex and $k_{\mathcal{Y}}(y, y') = \delta_{y, y'}$.
⁴For a general discussion about characteristic kernels and their relation to universal kernels we refer to the paper by Sriperumbudur et al. (2011).

In machine learning, common probabilistic predictive models output parameters of distributions such as the mean and variance of normal distributions. Naturally these parameterizations give rise to injective mappings $\varphi \colon \mathcal{P} \to \mathbb{R}^d$ that can be used to define a Hilbertian metric $d_{\mathcal{P}}(p, p') = \|\varphi(p) - \varphi(p')\|_2$. For such metrics, $k_{\mathcal{P}}$ in Eq. (4) is a valid kernel for all $\lambda > 0$ and $\nu \in (0, 2]$ (Berg et al., 1984, Corollary 3.3.3, Proposition 3.2.7).
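As an example (an illustrative choice under the above assumptions, not a kernel prescribed by the paper), for univariate Gaussian predictions the 2-Wasserstein distance reduces to the Euclidean distance between the (mean, standard deviation) pairs and yields a valid kernel of the form of Eq. (4), which can be combined with a Gaussian kernel on the real-valued targets.

```python
# Sketch of a tensor product kernel k = k_P ⊗ k_Y for univariate Gaussian
# predictions p = (mu, sigma); lam, nu, gamma are hypothetical hyperparameters.
import numpy as np

def k_p(p, q, lam=1.0, nu=1.0):
    # Kernel of the form of Eq. (4) based on the 2-Wasserstein distance between
    # N(p[0], p[1]^2) and N(q[0], q[1]^2), which equals
    # sqrt((mu - mu')^2 + (sigma - sigma')^2) for univariate normals.
    d = np.hypot(p[0] - q[0], p[1] - q[1])
    return np.exp(-lam * d**nu)

def k_y(y, y_prime, gamma=1.0):
    # Gaussian kernel on the target space R; characteristic, as required.
    return np.exp(-gamma * (y - y_prime) ** 2)

def k(p, y, q, y_prime):
    # Tensor product kernel on the product space P x Y.
    return k_p(p, q) * k_y(y, y_prime)
```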
In Appendix D.3 we show that for many mixture models, and hence model ensembles, Hilbertian metrics between model components can be lifted to Hilbertian metrics between mixture models. This construction is a generalization of the Wasserstein-like distance for Gaussian mixture models proposed by Chen et al. (2019; 2020); Delon & Desolneux (2020).

3.2 ESTIMATION

Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be a data set of features and targets which are i.i.d. according to the law of $(X, Y)$. Moreover, for notational brevity, for $(p, y), (p', y') \in \mathcal{P} \times \mathcal{Y}$ we let
$$
\begin{aligned}
h\big((p, y), (p', y')\big) := {}& k\big((p, y), (p', y')\big) - \mathbb{E}_{Z \sim p}\, k\big((p, Z), (p', y')\big) \\
& - \mathbb{E}_{Z' \sim p'}\, k\big((p, y), (p', Z')\big) + \mathbb{E}_{Z \sim p,\, Z' \sim p'}\, k\big((p, Z), (p', Z')\big).
\end{aligned}
$$
Note that in contrast to the regular MMD we marginalize out $Z$ and $Z'$. Similar to the MMD, there exist consistent estimators of the SKCE, both biased and unbiased.

Lemma 1. The plug-in estimator of $\mathrm{SKCE}_k$ is non-negatively biased. It is given by
$$\widehat{\mathrm{SKCE}}_k = \frac{1}{n^2} \sum_{i,j=1}^{n} h\big((P_{X_i}, Y_i), (P_{X_j}, Y_j)\big).$$

Inspired by the block tests for the regular MMD (Zaremba et al., 2013), we define the following class of unbiased estimators. Note that in contrast to $\widehat{\mathrm{SKCE}}_k$ they do not include terms of the form $h\big((P_{X_i}, Y_i), (P_{X_i}, Y_i)\big)$.

Lemma 2. The block estimator of $\mathrm{SKCE}_k$ with block size $B \in \{2, \ldots, n\}$, given by
$$\widehat{\mathrm{SKCE}}_{k,B} := \left\lfloor \frac{n}{B} \right\rfloor^{-1} \sum_{b=1}^{\lfloor n/B \rfloor} \binom{B}{2}^{-1} \sum_{(b-1)B < i < j \le bB} h\big((P_{X_i}, Y_i), (P_{X_j}, Y_j)\big),$$
is an unbiased estimator of $\mathrm{SKCE}_k$.
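As a sketch of how these estimators can be computed in practice (illustrative only, not the authors' implementation: predictions are again (mean, standard deviation) pairs of univariate Gaussians, `kernel` is assumed to follow the signature of the tensor product kernel sketched after Section 3.1, and the expectations over $Z \sim p$ and $Z' \sim p'$ are approximated by Monte Carlo as discussed in Section 2), one can evaluate $h$ pairwise and average it within blocks.

```python
# Sketch of the block estimator of SKCE_k with Monte Carlo approximation of the
# expectations appearing in h.
import numpy as np

rng = np.random.default_rng(1)

def h_mc(p, y, q, y_prime, kernel, m=100):
    # Monte Carlo approximation of h((p, y), (q, y')) for Gaussian predictions
    # p = (mu, sigma), q = (mu', sigma'); `kernel` has signature kernel(p, y, q, y').
    z = p[0] + p[1] * rng.standard_normal(m)        # samples Z ~ p
    z_prime = q[0] + q[1] * rng.standard_normal(m)  # samples Z' ~ q
    return (kernel(p, y, q, y_prime)
            - np.mean([kernel(p, zi, q, y_prime) for zi in z])
            - np.mean([kernel(p, y, q, zj) for zj in z_prime])
            + np.mean([kernel(p, zi, q, zj) for zi, zj in zip(z, z_prime)]))

def skce_block(predictions, targets, kernel, B=2):
    # Average of within-block U-statistics of h (block size B >= 2); excludes
    # the diagonal terms h((P_Xi, Yi), (P_Xi, Yi)) of the plug-in estimator.
    n_blocks = len(targets) // B
    block_estimates = []
    for b in range(n_blocks):
        idx = range(b * B, (b + 1) * B)
        pairs = [h_mc(predictions[i], targets[i], predictions[j], targets[j], kernel)
                 for i in idx for j in idx if i < j]
        block_estimates.append(np.mean(pairs))
    return float(np.mean(block_estimates))
```

Smaller blocks give cheaper (down to linear-time for B = 2) but higher-variance estimates, while B = n recovers a single U-statistic over all pairs of observations.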