# CALIBRATION TESTS BEYOND CLASSIFICATION

David Widmann
Department of Information Technology
Uppsala University, Sweden
david.widmann@it.uu.se

Fredrik Lindsten
Division of Statistics and Machine Learning
Linköping University, Sweden
fredrik.lindsten@liu.se

Dave Zachariah
Department of Information Technology
Uppsala University, Sweden
dave.zachariah@it.uu.se

ABSTRACT

Most supervised machine learning tasks are subject to irreducible prediction errors. Probabilistic predictive models address this limitation by providing probability distributions that represent a belief over plausible targets, rather than point estimates. Such models can be a valuable tool in decision-making under uncertainty, provided that the model output is meaningful and interpretable. Calibrated models guarantee that the probabilistic predictions are neither over- nor under-confident. In the machine learning literature, different measures and statistical tests have been proposed and studied for evaluating the calibration of classification models. For regression problems, however, research has been focused on a weaker condition of calibration based on predicted quantiles for real-valued targets. In this paper, we propose the first framework that unifies calibration evaluation and tests for general probabilistic predictive models. It applies to any such model, including classification and regression models of arbitrary dimension. Furthermore, the framework generalizes existing measures and provides a more intuitive reformulation of a recently proposed framework for calibration in multi-class classification. In particular, we reformulate and generalize the kernel calibration error, its estimators, and hypothesis tests using scalar-valued kernels, and evaluate the calibration of real-valued regression problems.¹

1 INTRODUCTION

We consider the general problem of modelling the relationship between a feature X and a target Y in a probabilistic setting, i.e., we focus on models that approximate the conditional probability distribution $\mathbb{P}(Y \mid X)$ of target Y for a given feature X. The use of probabilistic models that output a probability distribution instead of a point estimate demands guarantees on the predictions beyond accuracy, enabling meaningful and interpretable predicted uncertainties. One such statistical guarantee is calibration, which has been studied extensively in the meteorological and statistical literature (De Groot & Fienberg, 1983; Murphy & Winkler, 1977). A calibrated model ensures that almost every prediction matches the conditional distribution of targets given this prediction. Loosely speaking, in a classification setting a predicted distribution of the model is called calibrated (or reliable) if the empirically observed frequencies of the different classes match the predictions in the long run, were the same class probabilities to be predicted repeatedly. A classical example is a weather forecaster who predicts each day whether it is going to rain on the next day. If she predicts rain with probability 60% for a long series of days, her forecasting model is calibrated for predictions of 60% if it actually rains on 60% of these days. If this property holds for almost every probability distribution that the model outputs, then the model is considered to be calibrated.

¹The source code of the experiments is available at https://github.com/devmotion/Calibration_ICLR2021.
Calibration is an appealing property of a probabilistic model since it provides safety guarantees on the predicted distributions even in the common case when the model does not predict the true distributions $\mathbb{P}(Y \mid X)$. Calibration, however, does not guarantee accuracy (or refinement): a model that always predicts the marginal probabilities of each class is calibrated but probably inaccurate and of limited use. On the other hand, accuracy does not imply calibration either, since the predictions of an accurate model can be too over-confident and hence miscalibrated, as observed, e.g., for deep neural networks (Guo et al., 2017).

In the field of machine learning, calibration has been studied mainly for classification problems (Bröcker, 2009; Guo et al., 2017; Kull et al., 2017; 2019; Kumar et al., 2018; Platt, 2000; Vaicenavicius et al., 2019; Widmann et al., 2019; Zadrozny, 2002) and for quantiles and confidence intervals of models for regression problems with real-valued targets (Fasiolo et al., 2020; Ho & Lee, 2005; Kuleshov et al., 2018; Rueda et al., 2006; Taillardat et al., 2016). In our work, however, we do not restrict ourselves to these problem settings but instead consider calibration for arbitrary predictive models. Thus, we generalize the common notion of calibration as:

Definition 1. Consider a model $P_X := P(Y \mid X)$ of a conditional probability distribution $\mathbb{P}(Y \mid X)$. Then model P is said to be calibrated if and only if
$$\mathbb{P}(Y \mid P_X) = P_X \quad \text{almost surely}. \tag{1}$$

If P is a classification model, Definition 1 coincides with the notion of (multi-class) calibration by Bröcker (2009); Kull et al. (2019); Vaicenavicius et al. (2019). Alternatively, in classification some authors (Guo et al., 2017; Kumar et al., 2018; Naeini et al., 2015) study the strictly weaker property of confidence calibration (Kull et al., 2019), which only requires
$$\mathbb{P}(Y = \arg\max P_X \mid \max P_X) = \max P_X \quad \text{almost surely}. \tag{2}$$
This notion of calibration corresponds to calibration according to Definition 1 for a reduced problem with binary targets $\tilde{Y} := \mathbb{1}(Y = \arg\max P_X)$ and Bernoulli distributions $\tilde{P}_X := \mathrm{Ber}(\max P_X)$ as probabilistic models.

For real-valued targets, Definition 1 coincides with the so-called distribution-level calibration by Song et al. (2019). Distribution-level calibration implies that the predicted quantiles are calibrated, i.e., the outcomes for all real-valued predictions of the, e.g., 75% quantile are actually below the predicted quantile with 75% probability (Song et al., 2019, Theorem 1). Conversely, although quantile-based calibration is a common approach for real-valued regression problems (Fasiolo et al., 2020; Ho & Lee, 2005; Kuleshov et al., 2018; Rueda et al., 2006; Taillardat et al., 2016), it provides weaker guarantees on the predictions. For instance, the linear regression model in Fig. 1 empirically shows quantiles that appear close to being calibrated albeit being uncalibrated according to Definition 1.

Figure 1: Illustration of a conditional distribution $\mathbb{P}(Y \mid X)$ with scalar feature and target (right panel: cumulative probability plotted against quantile level). We consider a Gaussian predictive model P, obtained by ordinary least squares regression with 100 training data points (orange dots). Empirically the predicted quantiles on 50 validation data points appear close to being calibrated, although model P is uncalibrated according to Definition 1. Using the framework in this paper, on the same validation data a statistical test allows us to reject the null hypothesis that model P is calibrated at a significance level of α = 0.05 (p < 0.05). See Appendix A.1 for details.
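To make the quantile-based check concrete, the following minimal sketch (Python; the sinusoidal data-generating process is an assumption chosen for illustration, since the one underlying Figure 1 is not specified here) fits a homoscedastic Gaussian model by ordinary least squares and compares predicted quantile levels with their empirical coverage on validation data.

```python
# Minimal sketch of a quantile-calibration check (assumed data-generating
# process; not the experiment behind Figure 1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical nonlinear ground truth with Gaussian noise.
x_train = rng.uniform(-1.0, 1.0, 100)
y_train = np.sin(np.pi * x_train) + 0.2 * rng.standard_normal(100)
x_val = rng.uniform(-1.0, 1.0, 50)
y_val = np.sin(np.pi * x_val) + 0.2 * rng.standard_normal(50)

# Gaussian predictive model from ordinary least squares: N(a + b*x, sigma^2).
b, a = np.polyfit(x_train, y_train, 1)
sigma = (y_train - (a + b * x_train)).std(ddof=2)

# Empirical coverage of the predicted quantiles on validation data.
pit = stats.norm.cdf(y_val, loc=a + b * x_val, scale=sigma)
levels = np.linspace(0.05, 0.95, 19)
coverage = np.array([(pit <= q).mean() for q in levels])
# Quantile calibration only requires coverage ≈ levels; it does not imply
# calibration in the sense of Definition 1.
```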
Figure 1 also raises the question of how to assess calibration for general target spaces in the sense of Definition 1, without having to rely on visual inspection. In classification, measures of calibration such as the commonly used expected calibration error (ECE) (Guo et al., 2017; Kull et al., 2019; Naeini et al., 2015; Vaicenavicius et al., 2019) and the maximum calibration error (MCE) (Naeini et al., 2015) try to capture the average and maximal discrepancy between the distributions on the left-hand side and the right-hand side of Eq. (1) or Eq. (2), respectively. These measures can be generalized to other target spaces (see Definition B.1), but unfortunately estimating these calibration errors from observations of features and corresponding targets is problematic. Typically, the predictions are different for (almost) all observations, and hence estimation of the conditional probability $\mathbb{P}(Y \mid P_X)$, which is needed in the estimation of ECE and MCE, is challenging even for low-dimensional target spaces and usually leads to biased and inconsistent estimators (Vaicenavicius et al., 2019).

Kernel-based calibration errors such as the maximum mean calibration error (MMCE) (Kumar et al., 2018) and the kernel calibration error (KCE) (Widmann et al., 2019) for confidence and multi-class calibration, respectively, can be estimated without first estimating the conditional probability and hence avoid this issue. They are defined as the expected value of a weighted sum of the differences of the left- and right-hand side of Eq. (1) for each class, where the weights are given as a function of the predictions (of all classes) and chosen such that the calibration error is maximized. A reformulation with matrix-valued kernels (Widmann et al., 2019) yields unbiased and differentiable estimators without explicit dependence on $\mathbb{P}(Y \mid P_X)$, which simplifies the estimation and makes it possible to explicitly account for calibration in the training objective (Kumar et al., 2018). Additionally, the kernel-based framework allows the derivation of reliable statistical hypothesis tests for calibration in multi-class classification (Widmann et al., 2019). However, both the construction as a weighted difference of the class-wise distributions in Eq. (1) and the reformulation with matrix-valued kernels require finite target spaces and hence cannot be applied to regression problems. To be able to deal with general target spaces, we present a new and more general framework of calibration errors without these limitations.

Our framework can be used to reason about and test for calibration of any probabilistic predictive model. As explained above, this is in stark contrast with existing methods that are restricted to simple output distributions, such as classification and scalar-valued regression problems. A key contribution of this paper is a new framework that is applicable to multivariate regression, as well as situations when the output is of a different (e.g., discrete ordinal) or more complex (e.g., graph-structured) type, with clear practical implications. Within this framework a KCE for general target spaces is obtained. We want to highlight that for multi-class classification problems its formulation is more intuitive and simpler to use than the measure proposed by Widmann et al. (2019) based on matrix-valued kernels.
To ease the application of the KCE we derive several estimators of the KCE with subquadratic sample complexity and their asymptotic properties in tests for calibrated models, which improve on existing estimators and tests in the two-sample test literature by exploiting the special structure of the calibration framework. Using the proposed framework, we numerically evaluate the calibration of neural network models and ensembles of such models.

2 CALIBRATION ERROR: A GENERAL FRAMEWORK

In classification, the distributions on the left- and right-hand side of Eq. (1) can be interpreted as vectors in the probability simplex. Hence ultimately the distance measure for ECE and MCE (see Definition B.1) can be chosen as a distance measure of real-valued vectors. The total variation, Euclidean, and squared Euclidean distances are common choices (Guo et al., 2017; Kull et al., 2019; Vaicenavicius et al., 2019). However, in a general setting, measuring the discrepancy between $\mathbb{P}(Y \mid P_X)$ and $P_X$ cannot necessarily be reduced to measuring distances between vectors. The conditional distribution $\mathbb{P}(Y \mid P_X)$ can be arbitrarily complex, even if the predicted distributions are restricted to a simple class of distributions that can be represented as real-valued vectors. Hence in general we have to resort to dedicated distance measures of probability distributions. Additionally, the estimation of conditional distributions $\mathbb{P}(Y \mid P_X)$ is challenging, even more so than in the restricted case of classification, since in general these distributions can be arbitrarily complex.

To circumvent this problem, we propose to use the following construction: we define a random variable $Z_X \sim P_X$ obtained from the predictive model and study the discrepancy between the joint distributions of the two pairs of random variables $(P_X, Y)$ and $(P_X, Z_X)$, respectively, instead of the discrepancy between the conditional distributions $\mathbb{P}(Y \mid P_X)$ and $P_X$. Since $(P_X, Y) \stackrel{d}{=} (P_X, Z_X)$ if and only if $\mathbb{P}(Y \mid P_X) = P_X$ almost surely, model P is calibrated if and only if the distributions of $(P_X, Y)$ and $(P_X, Z_X)$ are equal.

The random variable pairs $(P_X, Y)$ and $(P_X, Z_X)$ take values in the product space $\mathcal{P} \times \mathcal{Y}$, where $\mathcal{P}$ is the space of predicted distributions $P_X$ and $\mathcal{Y}$ is the space of targets $Y$. For instance, in classification, $\mathcal{P}$ could be the probability simplex and $\mathcal{Y}$ the set of all class labels, whereas in the case of Gaussian predictive models for scalar targets $\mathcal{P}$ could be the space of normal distributions and $\mathcal{Y}$ be $\mathbb{R}$.

The study of the joint distributions of $(P_X, Y)$ and $(P_X, Z_X)$ motivates the definition of a generally applicable calibration error as an integral probability metric (Müller, 1997; Sriperumbudur et al., 2009; 2012) between these distributions. In contrast to common f-divergences such as the Kullback-Leibler divergence, integral probability metrics do not require that one distribution is absolutely continuous with respect to the other, which cannot be guaranteed in general.

Definition 2. Let $\mathcal{Y}$ denote the space of targets $Y$, and $\mathcal{P}$ the space of predicted distributions $P_X$. We define the calibration error with respect to a space of functions $\mathcal{F}$ of the form $f \colon \mathcal{P} \times \mathcal{Y} \to \mathbb{R}$ as
$$\mathrm{CE}_{\mathcal{F}} := \sup_{f \in \mathcal{F}} \big| \mathbb{E}_{P_X, Y} f(P_X, Y) - \mathbb{E}_{P_X, Z_X} f(P_X, Z_X) \big|. \tag{3}$$

By construction, if model P is calibrated, then $\mathrm{CE}_{\mathcal{F}} = 0$ regardless of the choice of $\mathcal{F}$. However, the converse statement is not true for arbitrary function spaces $\mathcal{F}$.
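As a concrete illustration of this construction (a minimal sketch assuming univariate Gaussian predictions represented by their parameters; the data-generating process and model below are hypothetical), one draws $Z_X \sim P_X$ for every prediction and collects the two samples whose joint distributions are compared in Eq. (3).

```python
# Sketch of the two samples compared in Definition 2 for Gaussian predictions
# P_X = N(mu(X), sigma(X)^2), encoded as (mu, sigma, target) triples.
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical data: Y | X ~ N(X, 0.5^2).
X = rng.uniform(-1.0, 1.0, n)
Y = X + 0.5 * rng.standard_normal(n)

# A miscalibrated model: correct mean, underestimated standard deviation.
mu_hat = X
sigma_hat = np.full(n, 0.25)

# Artificial targets Z_X ~ P_X drawn from the model's own predictions.
Z = mu_hat + sigma_hat * rng.standard_normal(n)

sample_PY = np.column_stack([mu_hat, sigma_hat, Y])  # realizations of (P_X, Y)
sample_PZ = np.column_stack([mu_hat, sigma_hat, Z])  # realizations of (P_X, Z_X)
```

Any two-sample discrepancy between `sample_PY` and `sample_PZ`, for instance the MMD used in Section 3, then provides an estimate of the calibration error $\mathrm{CE}_{\mathcal{F}}$.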
From the theory of integral probability metrics (see, e.g., Müller, 1997; Sriperumbudur et al., 2009; 2012), we know that for certain choices of $\mathcal{F}$ the calibration error in Eq. (3) is a well-known metric on the product space $\mathcal{P} \times \mathcal{Y}$, which implies that $\mathrm{CE}_{\mathcal{F}} = 0$ if and only if model P is calibrated. Prominent examples include the maximum mean discrepancy² (MMD) (Gretton et al., 2007), the total variation distance, the Kantorovich distance, and the Dudley metric (Dudley, 1989, p. 310). As pointed out above, Definition 2 generalizes the definition for multi-class classification proposed by Widmann et al. (2019), which is based on vector-valued functions and only applicable to finite target spaces, to any probabilistic predictive model. In Appendix E we show this explicitly and discuss the special case of classification problems in more detail. Previous results (Widmann et al., 2019) imply that in classification MMCE and, for common distance measures $d(\cdot, \cdot)$ such as the total variation and squared Euclidean distance, $\mathrm{ECE}_d$ and $\mathrm{MCE}_d$ are special cases of $\mathrm{CE}_{\mathcal{F}}$. In Appendix G we show that our framework also covers natural extensions of $\mathrm{ECE}_d$ and $\mathrm{MCE}_d$ to countably infinite discrete target spaces, which to our knowledge have not been studied before and occur, e.g., in Poisson regression.

The literature on integral probability metrics suggests that we can resort to estimating $\mathrm{CE}_{\mathcal{F}}$ from i.i.d. samples from the distributions of $(P_X, Y)$ and $(P_X, Z_X)$. For the MMD, the Kantorovich distance, and the Dudley metric, tractable strongly consistent empirical estimators exist (Sriperumbudur et al., 2012). Here the empirical estimator for the MMD is particularly appealing since, compared with the other estimators, it is computationally cheaper, the empirical estimate converges at a faster rate to the population value, and the rate of convergence is independent of the dimension $d$ of the space (for $\mathcal{S} = \mathbb{R}^d$) (Sriperumbudur et al., 2012). Our specific design of $(P_X, Z_X)$ can be exploited to improve on these estimators. If $\mathbb{E}_{Z_x \sim P_x} f(P_x, Z_x)$ can be evaluated analytically for a fixed prediction $P_x$, then $\mathrm{CE}_{\mathcal{F}}$ can be estimated empirically with reduced variance by marginalizing out $Z_X$. Otherwise $\mathbb{E}_{Z_x \sim P_x} f(P_x, Z_x)$ has to be estimated, but in contrast to the common estimators of the integral probability metrics discussed above, the artificial construction of $Z_X$ allows us to approximate it by numerical integration methods such as (quasi) Monte Carlo integration or quadrature rules with arbitrarily small error and variance. Monte Carlo integration preserves statistical properties of the estimators such as unbiasedness and consistency.

²As we discuss in Section 3, the MMD is a metric if and only if the employed kernel is characteristic.

3 KERNEL CALIBRATION ERROR

For the remaining parts of the paper we focus on the MMD formulation of $\mathrm{CE}_{\mathcal{F}}$ due to the appealing properties of the common empirical estimator mentioned above. We derive calibration-specific analogues of results for the MMD that exploit the special structure of the distribution of $(P_X, Z_X)$ to improve on existing estimators and tests in the MMD literature. To the best of our knowledge these variance-reduced estimators and tests have not been discussed in the MMD literature.

Let $k \colon (\mathcal{P} \times \mathcal{Y}) \times (\mathcal{P} \times \mathcal{Y}) \to \mathbb{R}$ be a measurable kernel with corresponding reproducing kernel Hilbert space (RKHS) $\mathcal{H}$, and assume that $\mathbb{E}_{P_X, Y}\, k^{1/2}\big((P_X, Y), (P_X, Y)\big) < \infty$ and $\mathbb{E}_{P_X, Z_X}\, k^{1/2}\big((P_X, Z_X), (P_X, Z_X)\big) < \infty$. We discuss how such kernels can be constructed in a generic way in Section 3.1 below.

Definition 3.
Let $\mathcal{F}_k$ denote the unit ball in $\mathcal{H}$, i.e., $\mathcal{F}_k := \{f \in \mathcal{H} \colon \|f\|_{\mathcal{H}} \le 1\}$. Then the kernel calibration error (KCE) with respect to kernel k is defined as
$$\mathrm{KCE}_k := \mathrm{CE}_{\mathcal{F}_k} = \sup_{f \in \mathcal{F}_k} \big| \mathbb{E}_{P_X, Y} f(P_X, Y) - \mathbb{E}_{P_X, Z_X} f(P_X, Z_X) \big|.$$

As known from the MMD literature, a more explicit formulation can be given for the squared kernel calibration error $\mathrm{SKCE}_k := \mathrm{KCE}_k^2$ (see Lemma B.2). A similar explicit expression for $\mathrm{SKCE}_k$ was obtained by Widmann et al. (2019) for the special case of classification problems. However, their expression relies on $\mathcal{Y}$ being finite and is based on matrix-valued kernels over the finite-dimensional probability simplex $\mathcal{P}$. A key difference to the expression in Lemma B.2 is that we instead propose to use real-valued kernels defined on the product space of predictions and targets. This construction is applicable to arbitrary target spaces and does not require $\mathcal{Y}$ to be finite.

3.1 CHOICE OF KERNEL

The construction of the product space $\mathcal{P} \times \mathcal{Y}$ suggests the use of tensor product kernels $k = k_{\mathcal{P}} \otimes k_{\mathcal{Y}}$, where $k_{\mathcal{P}} \colon \mathcal{P} \times \mathcal{P} \to \mathbb{R}$ and $k_{\mathcal{Y}} \colon \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ are kernels on the spaces of predicted distributions and targets, respectively.³ By definition, so-called characteristic kernels guarantee that $\mathrm{KCE} = 0$ if and only if the distributions of $(P_X, Y)$ and $(P_X, Z_X)$ are equal (Fukumizu et al., 2004; 2008). Many common kernels such as the Gaussian and Laplacian kernels on $\mathbb{R}^d$ are characteristic (Fukumizu et al., 2008).⁴ Szabó & Sriperumbudur (2018, Theorem 4) showed that a tensor product kernel $k_{\mathcal{P}} \otimes k_{\mathcal{Y}}$ is characteristic if $k_{\mathcal{P}}$ and $k_{\mathcal{Y}}$ are characteristic, continuous, bounded, and translation-invariant kernels on $\mathbb{R}^d$, but the implication does not hold for general characteristic kernels (Szabó & Sriperumbudur, 2018, Example 1). For calibration evaluation, however, it is sufficient to be able to distinguish between the conditional distributions $\mathbb{P}(Y \mid P_X)$ and $\mathbb{P}(Z_X \mid P_X) = P_X$. Therefore, in contrast to the regular MMD setting, it is sufficient that kernel $k_{\mathcal{Y}}$ is characteristic and kernel $k_{\mathcal{P}}$ is non-zero almost surely to guarantee that $\mathrm{KCE} = 0$ if and only if model P is calibrated.

This suggests constructing kernels on general spaces of predicted distributions as
$$k_{\mathcal{P}}(p, p') = \exp\big(-\lambda\, d_{\mathcal{P}}^{\nu}(p, p')\big), \tag{4}$$
where $d_{\mathcal{P}}(\cdot, \cdot)$ is a metric on $\mathcal{P}$ and $\nu, \lambda > 0$ are kernel hyperparameters. The Wasserstein distance is a widely used metric for distributions from optimal transport theory that allows one to lift a ground metric on the target space and possesses many important properties (see, e.g., Peyré & Cuturi, 2019, Chapter 2.4). In general, however, it does not lead to valid kernels $k_{\mathcal{P}}$, apart from the notable exception of elliptically contoured distributions such as normal and Laplace distributions (Peyré & Cuturi, 2019, Chapter 8.3).

³As mentioned above, our framework rephrases and generalizes the construction used by Widmann et al. (2019). The matrix-valued kernels that they employ can be recovered by setting $k_{\mathcal{P}}$ to a Laplacian kernel on the probability simplex and $k_{\mathcal{Y}}(y, y') = \delta_{y, y'}$.
⁴For a general discussion about characteristic kernels and their relation to universal kernels we refer to the paper by Sriperumbudur et al. (2011).

In machine learning, common probabilistic predictive models output parameters of distributions such as the mean and variance of normal distributions. Naturally these parameterizations give rise to injective mappings $\varphi \colon \mathcal{P} \to \mathbb{R}^d$ that can be used to define a Hilbertian metric $d_{\mathcal{P}}(p, p') = \|\varphi(p) - \varphi(p')\|_2$. For such metrics, $k_{\mathcal{P}}$ in Eq. (4) is a valid kernel for all $\lambda > 0$ and $\nu \in (0, 2]$ (Berg et al., 1984, Corollary 3.3.3, Proposition 3.2.7).
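As an example (an illustrative choice under the above assumptions, not a kernel prescribed by the paper), for univariate Gaussian predictions the 2-Wasserstein distance reduces to the Euclidean distance between the (mean, standard deviation) pairs and yields a valid kernel of the form of Eq. (4), which can be combined with a Gaussian kernel on the real-valued targets.

```python
# Sketch of a tensor product kernel k = k_P ⊗ k_Y for univariate Gaussian
# predictions p = (mu, sigma); lam, nu, gamma are hypothetical hyperparameters.
import numpy as np

def k_p(p, q, lam=1.0, nu=1.0):
    # Kernel of the form of Eq. (4) based on the 2-Wasserstein distance between
    # N(p[0], p[1]^2) and N(q[0], q[1]^2), which equals
    # sqrt((mu - mu')^2 + (sigma - sigma')^2) for univariate normals.
    d = np.hypot(p[0] - q[0], p[1] - q[1])
    return np.exp(-lam * d**nu)

def k_y(y, y_prime, gamma=1.0):
    # Gaussian kernel on the target space R; characteristic, as required.
    return np.exp(-gamma * (y - y_prime) ** 2)

def k(p, y, q, y_prime):
    # Tensor product kernel on the product space P x Y.
    return k_p(p, q) * k_y(y, y_prime)
```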
In Appendix D.3 we show that for many mixture models, and hence model ensembles, Hilbertian metrics between model components can be lifted to Hilbertian metrics between mixture models. This construction is a generalization of the Wasserstein-like distance for Gaussian mixture models proposed by Chen et al. (2019; 2020); Delon & Desolneux (2020).

3.2 ESTIMATION

Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be a data set of features and targets which are i.i.d. according to the law of $(X, Y)$. Moreover, for notational brevity, for $(p, y), (p', y') \in \mathcal{P} \times \mathcal{Y}$ we let
$$
\begin{aligned}
h\big((p, y), (p', y')\big) := {}& k\big((p, y), (p', y')\big) - \mathbb{E}_{Z \sim p}\, k\big((p, Z), (p', y')\big) \\
& - \mathbb{E}_{Z' \sim p'}\, k\big((p, y), (p', Z')\big) + \mathbb{E}_{Z \sim p,\, Z' \sim p'}\, k\big((p, Z), (p', Z')\big).
\end{aligned}
$$
Note that in contrast to the regular MMD we marginalize out $Z$ and $Z'$. Similar to the MMD, there exist consistent estimators of the SKCE, both biased and unbiased.

Lemma 1. The plug-in estimator of $\mathrm{SKCE}_k$ is non-negatively biased. It is given by
$$\widehat{\mathrm{SKCE}}_k = \frac{1}{n^2} \sum_{i,j=1}^{n} h\big((P_{X_i}, Y_i), (P_{X_j}, Y_j)\big).$$

Inspired by the block tests for the regular MMD (Zaremba et al., 2013), we define the following class of unbiased estimators. Note that in contrast to $\widehat{\mathrm{SKCE}}_k$ they do not include terms of the form $h\big((P_{X_i}, Y_i), (P_{X_i}, Y_i)\big)$.

Lemma 2. The block estimator of $\mathrm{SKCE}_k$ with block size $B \in \{2, \ldots, n\}$, given by
$$\widehat{\mathrm{SKCE}}_{k,B} := \left\lfloor \frac{n}{B} \right\rfloor^{-1} \sum_{b=1}^{\lfloor n/B \rfloor} \binom{B}{2}^{-1} \sum_{(b-1)B < i < j \le bB} h\big((P_{X_i}, Y_i), (P_{X_j}, Y_j)\big),$$
is an unbiased estimator of $\mathrm{SKCE}_k$.
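As a sketch of how these estimators can be computed in practice (illustrative only, not the authors' implementation: predictions are again (mean, standard deviation) pairs of univariate Gaussians, `kernel` is assumed to follow the signature of the tensor product kernel sketched after Section 3.1, and the expectations over $Z \sim p$ and $Z' \sim p'$ are approximated by Monte Carlo as discussed in Section 2), one can evaluate $h$ pairwise and average it within blocks.

```python
# Sketch of the block estimator of SKCE_k with Monte Carlo approximation of the
# expectations appearing in h.
import numpy as np

rng = np.random.default_rng(1)

def h_mc(p, y, q, y_prime, kernel, m=100):
    # Monte Carlo approximation of h((p, y), (q, y')) for Gaussian predictions
    # p = (mu, sigma), q = (mu', sigma'); `kernel` has signature kernel(p, y, q, y').
    z = p[0] + p[1] * rng.standard_normal(m)        # samples Z ~ p
    z_prime = q[0] + q[1] * rng.standard_normal(m)  # samples Z' ~ q
    return (kernel(p, y, q, y_prime)
            - np.mean([kernel(p, zi, q, y_prime) for zi in z])
            - np.mean([kernel(p, y, q, zj) for zj in z_prime])
            + np.mean([kernel(p, zi, q, zj) for zi, zj in zip(z, z_prime)]))

def skce_block(predictions, targets, kernel, B=2):
    # Average of within-block U-statistics of h (block size B >= 2); excludes
    # the diagonal terms h((P_Xi, Yi), (P_Xi, Yi)) of the plug-in estimator.
    n_blocks = len(targets) // B
    block_estimates = []
    for b in range(n_blocks):
        idx = range(b * B, (b + 1) * B)
        pairs = [h_mc(predictions[i], targets[i], predictions[j], targets[j], kernel)
                 for i in idx for j in idx if i < j]
        block_estimates.append(np.mean(pairs))
    return float(np.mean(block_estimates))
```

Smaller blocks give cheaper (down to linear-time for B = 2) but higher-variance estimates, while B = n recovers a single U-statistic over all pairs of observations.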