# testing_whether_a_learning_procedure_is_calibrated__5d0e947f.pdf

Journal of Machine Learning Research 23 (2022) 1-36 Submitted 9/21; Revised 3/22; Published 5/22

Testing Whether a Learning Procedure is Calibrated

Jon Cockayne jon.cockayne@soton.ac.uk Mathematical Sciences University of Southampton Highﬁeld Southampton, SO17 1BJ, UK

Matthew M. Graham m.graham@ucl.ac.uk Centre for Advanced Research Computing University College London Gower Street London, WC1E 6BT, UK

Chris J. Oates chris.oates@ncl.ac.uk School of Mathematics, Statistics & Physics Newcastle University Newcastle Upon Tyne, NE1 7RU, UK

T. J. Sullivan t.j.sullivan@warwick.ac.uk Mathematics Institute and School of Engineering University of Warwick Coventry, CV4 7AL, UK

Onur Teymur o@teymur.uk School of Mathematics, Statistics & Actuarial Science University of Kent Cantebury, CT2 7NZ, UK

Editor: Anthony Lee

A learning procedure takes as input a dataset and performs inference for the parameters θ of a model that is assumed to have given rise to the dataset. Here we consider learning procedures whose output is a probability distribution, representing uncertainty about θ after seeing the dataset. Bayesian inference is a prime example of such a procedure, but one can also construct other learning procedures that return distributional output. This paper studies conditions for a learning procedure to be considered calibrated, in the sense that the true data-generating parameters are plausible as samples from its distributional output. A learning procedure whose inferences and predictions are systematically overor under-conﬁdent will fail to be calibrated. On the other hand, a learning procedure that is calibrated need not be statistically eﬃcient. A hypothesis-testing framework is developed in order to assess, using simulation, whether a learning procedure is calibrated. Several vignettes are presented to illustrate diﬀerent aspects of the framework. Keywords: calibratedness, credible sets, uncertainty quantiﬁcation

2022 Jon Cockayne, Matthew M. Graham, Chris J. Oates, T. J. Sullivan, Onur Teymur.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/21-1065.html.

Cockayne, Graham, Oates, Sullivan, Teymur

1. Introduction

Given a parametric model and a dataset purported to be generated from the model, the modern workﬂow for parameter inference ﬁrst identiﬁes a statistical paradigm (e.g. Bayesian inference), performs any required numerical computation using an appropriate numerical method, then inspects the results and reﬁnes the approach until some desiderata (e.g. posterior predictive checks, or a convergence diagnostic for a Markov chain Monte Carlo method) are satisﬁed. This paper takes a holistic perspective and refers to the overall workﬂow as a learning procedure. Our focus is on learning procedures that produce distributional output, examples of which include workﬂows based on Bayesian and generalised Bayesian inference (Bissiri et al., 2016), fractional posteriors (Bhattacharya et al., 2019), empirical Bayes (Casella, 1985), variational Bayes (Blei et al., 2017), approximate Bayesian computation (Beaumont et al., 2002), Bayesian synthetic likelihood (Price et al., 2018), and also approaches that have a non-Bayesian motivation, such as the maximum entropy approach (Jaynes, 1982). It is natural to hope that a learning procedure is calibrated, in the sense that the true data-generating parameters are plausible as samples from the distributional output. Indeed, a learning procedure that is not calibrated can produce inferences and predictions that are either biased or over/under-conﬁdent, and lead users to draw spurious conclusions in model selection problems. The consequences of over-conﬁdence, in particular, could be dire when those inferences are used in safety-critical applications. This point has been discussed at length in the literature, such as in investigating frequentist coverage of credible sets in Bayesian inference and in calibrating probabilistic forecasts. However, the literature appears to lack a deﬁnition of calibration that is suﬃciently general to be applied to an arbitrary learning procedure that produces distributional output. The aim of this paper is to introduce a general deﬁnition of calibration and accompany this with a methodology for testing whether a learning procedure is calibrated. The term calibration is unfortunately overloaded in the statistical literature. It is also used to refer to the parameter inference task in applications that involve a computer model. For example, Kennedy and O Hagan (2001) write that the process of ﬁtting the model to the observed data by adjusting the parameters is known as calibration . For avoidance of doubt, we use the standard terminology of parameter inference to refer to the task of estimating parameters of a model. The term calibration is also used in the literature on forecast assessment. There the useage is close to the notions proposed in this paper, though in that literature the focus is on testing calibration at the level of the data rather than at the level of the parameters. This is discussed further in Sections 2.3.2 and 2.4.2. We reserve the term calibration for the speciﬁc notions proposed in this paper. The outline of the paper is as follows: Section 2 presents our proposed deﬁnitions, where we identify both strong and weak senses in which a learning procedure can be said to be calibrated. To ensure our deﬁnitions are precise in a mathematical sense, we conceptualise a learning procedure as a mathematical object in Section 2.1 and impose mild regularity assumptions on this object in Section 2.2. In Section 2.3 our notion of strong calibration is presented, illustrated by examples in Section 2.3.1, and compared to existing deﬁnitions in the literature in Section 2.3.2. Likewise, in Section 2.4 our notion of weak calibration is presented, illustrated by examples in Section 2.4.1, and compared to existing deﬁnitions in

Testing Whether a Learning Procedure is Calibrated

the literature in Section 2.4.2. Several vignettes are provided in Section 3, showing through simulations that our proposed deﬁnitions of calibration both accord with intuition and can be tested for. A brief discussion concludes the paper in Section 4.

1.1 Notation

For a measurable space S, P(S) will denote the set of probability measures on S. For s S let δ(S) P(s) denote the Dirac distribution on s. For a measurable function f : S R, a measurable set A R, and a probability measure ν P(S), let f 1(A) := {x S | f(x) A} denote the preimage of A and recall that the pushforward measure f#ν P(R) is deﬁned as (f#ν)(A) := ν(f 1(A)).

2. What it Means for a Learning Procedure to be Calibrated

This section sets out our proposed deﬁnitions of strong and weak calibration, provides examples of learning procedures that are strongly and weakly calibrated, and relates our deﬁnitions to existing work.

Let Θ be a measurable space, which will play the role of the parameter space in this work. It is assumed that there is a unique true parameter θ Θ and we consider the parameter inference task of estimating θ based on a dataset. Let Y be a measurable space in which datasets are realised.

Deﬁnition 1 (Learning Procedure). A learning procedure is a function

µ: P(Θ) Y P(Θ)

(µ0, y) 7 µ(µ0, y).

Here µ0 is interpreted as an initial belief distribution, quantifying uncertainty about the parameter θ before any data have been observed, and y denotes a dataset. The distributional output µ(µ0, y) is interpreted as a quantiﬁcation of the uncertainty associated with the parameter θ, after the data y have been observed.

The standard example of a learning procedure is Bayesian inference, wherein µ0 is the prior distribution and µ(µ0, y) is the posterior distribution, this being determined by the prior, the observed data y, and a likelihood function that must be speciﬁed. However, Deﬁnition 1 is general enough to accommodate any workﬂow that produces distributional output. In particular, Deﬁnition 1 does not pre-suppose that a data-generating model exists or is known to the user, so that the deﬁnition of a learning procedure may be applied even in the M-open setting (Bernardo and Smith, 1994, 6.1.2). Further, one may consider that computational procedures such as variational inference or Monte Carlo form part of the learning procedure, and in this sense a myriad of diﬀerent learning procedures can be considered. Note that we call µ0 a belief distribution following Bissiri et al. (2016) and reserve the term prior for use only in the Bayesian context. We also emphasise that a learning

Cockayne, Graham, Oates, Sullivan, Teymur

procedure need not depend upon the initial belief distribution µ0; for example, in the maximum entropy approach (Jaynes, 1982) a distributional output is produced that does not explicitly depend on any initial belief, so that eﬀectively µ(µ0, y) µ(y). In the next section we will introduce the mathematical facts required for our notions of strong and weak calibration in Sections 2.3 and 2.4.

2.2 A Mathematical Characterisation

The deﬁnitions that we will present rely on cumulative distribution functions (cdfs) and their inverses, and we therefore impose regularity conditions to ensure that such inverse cdfs are well-deﬁned. That is, we impose suﬃcient regularity to restrict our attention in the sequel to inverse cdfs that are well-deﬁned functions, as opposed to dealing with generalised functions that are set-valued.

Deﬁnition 2 (Regular Distribution). Let Θ be a measurable space equipped with a reference measure λ. A distribution ν P(Θ) is regular (with respect to λ) if it admits a probability density function (pdf) pν := dν/dλ such that pν > 0 on Θ (i.e. the measures ν and λ are equivalent). The set of all regular distributions will be denoted Pr(Θ).

When Θ is a Borelor Lebesgue-measurable subset of Euclidean space, the reference measure λ will be assumed to be Lebesgue measure. For a < b and a univariate distribution γ P((a, b)), we let Fγ : (a, b) [0, 1] denote the associated cdf Fγ(x) := γ((a, x]) = R x a dγ. Our ﬁrst result, Lemma 3, is classical (e.g. Rosenblatt, 1952) and underpins methods for simulation of univariate random variables using inverse cdfs. This result establishes that the level of regularity in Deﬁnition 2 is suﬃcient for the inverse cdf approach to simulation of such distributions to be applied. It also ensures that our subsequent constructions that depend on Deﬁnition 2 are well-deﬁned.

Lemma 3. For a < b and ν Pr((a, b)) we have that Fν(X) U(0, 1) whenever X ν.

Proof Since ν admits a pdf pν on (a, b), the fundamental theorem of calculus implies that Fν is diﬀerentiable with DFν(θ) = pν(θ). In particular, since pν > 0 we have that Fν is continuous and strictly increasing and therefore the sets F 1 ν (z) are singletons for all z [0, 1]. Let X ν and Z := Fν(X). Then, from the change of variables formula, Z admits a pdf q(z) on [0, 1] with

θ:Fν(θ)=z pν(θ)|DFν(θ)| 1 = pν(θ) 1 pν(θ) = 1,

which is indeed the pdf of U(0, 1).

The random variable Fν(X) is sometimes called the probability integral transform; see e.g. Dawid (1984); Diebold et al. (1997). When Θ R, the cdf of a distribution ν Pr(Θ) is not in general well-deﬁned. To characterise such distributions analogously to the above, consider a set of test functions of the form f : Θ (a, b), with the property that each univariate marginal f#ν does admit an invertible cdf. We next establish that regular distributions are characterised by a certain (large) set of such statistics.

Testing Whether a Learning Procedure is Calibrated

Deﬁnition 4 (Test Functions FΘ). Consider measurable functions of the form f : Θ (a, b) for some a < b . Then the test functions FΘ are the set of all such f for which f#ν Pr((a, b)) whenever ν Pr(Θ).

Intuitively, FΘ rules out functions f that take a constant value on a non-null set, in order to avoid the situation where f#ν contains an atom and the cdf Ff#ν : (a, b) [0, 1] is not invertible. In the univariate case Θ = R, the set FΘ contains functions f for which the gradient exists and is nonzero almost everywhere and, moreover, the preimages f 1(z) have cardinality n such that 0 < n < for each z (a, b). Indeed, in this case f#ν admits an everywhere positive (Lebesgue) pdf on (a, b) of the form

pf#ν(z) = X

θ f 1(z) pν(θ) |Df(θ)| 1 . (1)

Since by assumption f#ν is regular on (a, b), from Lemma 3 we have that Ff#ν(f(θ)) U(0, 1) whenever X ν. For the multivariate case Θ = Rd, by the co-area formula the (Lebesgue) pdf of f#ν is

pf#ν(z) = Z

f 1(z) pν(θ)|det(Df(θ)Df(θ)T)| 1

2 Hd 1(dθ), (2)

where Hd 1 indicates the (d 1) dimensional Hausdorﬀmeasure on Θ (Diaconis et al., 2013, Proposition 2). In this case, the requirement on the Jacobian determinant is that det(Df Df T) = 0 almost everywhere. As H0 is equivalent to the counting measure, (2) collapses back to (1) when d = 1. The restriction of attention to FΘ is essentially without loss of generality, as evidenced by the following result, whose proof is contained in Section A.1:

Lemma 5 (Regular Distributions are Characterised by FΘ). Let Θ = Rd for some d N. Suppose that µ, ν Pr(Θ) and R f dµ = R f dν for all f FΘ. Then µ = ν.

Now we have the mathematical tools to deﬁne what it means for a learning procedure to be calibrated. In Section 2.3 we introduce a strong notion of calibration, which clariﬁes the sense in which the true parameter can be considered plausible as a sample from the distributional output. Then, in Section 2.4, we consider a strictly weaker notion of calibration that is more easily tested.

2.3 Strongly Calibrated Learning Procedures

To assess whether a learning procedure is calibrated we must specify what it is calibrated against, and this requires a data-generating model. Thus, the assessment framework we present exists in the M-complete setting (Bernardo and Smith, 1994, 6.1.2).

Deﬁnition 6 (Data-Generating Model). A data-generating model is a function

P : Θ P(Y )

where Pθ carries the interpretation of a statistical model from which data are generated.

Cockayne, Graham, Oates, Sullivan, Teymur

In this section we present a strong notion of what it means for a learning procedure to be calibrated to a data-generating model. It simpliﬁes matters to restrict to learning procedures that produce regular distributional output:

Deﬁnition 7 (Regular Learning Procedure). A learning procedure µ: P(Θ) Y P(Θ) is regular if µ(µ0, y) Pr(Θ) for all µ0 Pr(Θ) and all y Y .

Deﬁnition 8 (Strongly Calibrated). Let B Pr(Θ) denote a set of belief distributions and P a data-generating model. A regular learning procedure µ is said to be strongly calibrated to (B, P) if

θ µ0 y | θ Pθ

= Ff#µ(µ0,y)(f(θ)) U(0, 1) (3)

for all f FΘ and for all µ0 B. If the set B contains a single element, µ0, then we say simply that µ is strongly calibrated to (µ0, P).

The assumption that both the belief distribution and learning procedure are regular excludes some important learning procedures. For example, in Bayesian inference one sometimes uses an improper, uninformative prior such as p(θ) 1, which would not be regular unless Θ0 is bounded. To study such a learning procedure in the framework of Deﬁnition 8 one could consider constructing an artiﬁcial learning procedure that took a regular distribution µ0 as input, but ignored this for the purposes of inference and instead used an improper prior though, one would still need to ensure that the learning procedure itself returned a regular output, which is not guaranteed for an improper prior. In addition to this, any application of Bayesian inference for which the support of the posterior is a strict subset of Θ (e.g. procedures with truncated likelihoods) will fail to be regular. The distributional output of approximate Bayesian computation (abc) may not be regular for similar reasons. This motivates the introduction of weakly calibrated learning procedures in Section 2.4, for which the regularity assumption can be relaxed. To gain intuition for Deﬁnition 8, notice that the unknown data-generating parameter θ is statistically identical to a sample from the distributional output µ(µ0, y) when the learning procedure is strongly calibrated. This intuition is clariﬁed in the following remark:

Remark 9 (Correct Coverage for Credible Sets). Suppose that the learning procedure µ is strongly calibrated to (µ0, P). If the distribution µ(µ0, y) is used to construct a 1 α probability credible set for θ, then this interval will indeed contain θ with probability 1 α under the hierarchical data-generating model θ µ0, y | θ Pθ.

Thus, the distributional output from a strongly calibrated learning procedure can be meaningfully related to the parameter inference task. Note, however, that even a small degree of misspeciﬁcation can lead to failure of calibration. Thus strong calibration captures the absence of systematic errors, similar to the notion of an unbiased estimator. Next we present an actionable test for the hypothesis that a learning procedure is strongly calibrated. We emphasise that this test can in theory be applied to any learning procedure (i.e. any workﬂow used for parameter inference that returns distributional output), providing that the regularity requirements are satisﬁed and that one is able to simulate from the data-generating model.

Testing Whether a Learning Procedure is Calibrated

Remark 10 (Testing whether a Learning Procedure is Strongly Calibrated). Fix µ0 Pr(Θ) and let

yi | θi iid Pθi

Then we can test whether a (regular) learning procedure µ is strongly calibrated to (µ0, P) by picking a test function f FΘ and using any goodness-of-ﬁt test for the hypothesis

Ff#µ(µ0,yi)(f(θi)) iid U(0, 1).

Such a test will not have power against all alternatives unless, for example, d = 1 and f(θ) = θ. To increase the power of the test in higher dimensions, multiple f should be simultaneously considered. Methodology for selecting a suitable test function is proposed in Section 3.4.

Remark 11. For simplicity we have assumed that each θi is associated with exactly one yi. In practice this need not be the case; each parameter could be associated with many pieces of data. For example in some applications a sample from µ0 may be more diﬃcult to obtain than repeated measurements y1, . . . , yn Pθi. However we note that this will violate the independence assumption in Remark 10, and would require a more complicated test to be used.

Remark 12 (Quantiﬁcation of Strong Calibration). The departure from uniformity of the law of Ff#µ(µ0,y)(f(θ)) under θ µ0, y | θ Pθ can be used to assess the nature and extent to which the learning procedure fails to be strongly calibrated. Histograms can provide an intuitive visualisation; see Section 3.3.

In the next section we illustrate Deﬁnition 8 with some examples for which strong calibration can be veriﬁed. Then, in Section 2.3.2 we discuss the relationship between Deﬁnition 8 and earlier work.

2.3.1 Examples of Strongly Calibrated Learning Procedures

Our ﬁrst example conﬁrms the intuition that the Bayesian framework is strongly calibrated to the prior and the data-generating model.

Example 1 (Bayes is Strongly Calibrated). If θ µ0 and y | θ Pθ then (θ, y) can be considered to be a sample from the joint distribution of the parameters and dataset. In the Bayesian framework (with the data-generating model P correctly speciﬁed), µ(µ0, y) is deﬁned as the conditional distribution of the parameters given the data, and thus θ | y µ(µ0, y). Thus if µ0 and µ are regular, it follows from Lemma 3 that Ff#µ(µ0,y)(f(θ)) U(0, 1) for all f FΘ. Thus Bayesian inference is strongly calibrated to (Pr(Θ), P).

The following example1 shows that strongly calibrated learning procedures do not necessarily yield accurate estimators:

1. This example is similar in spirit to the climatological forecaster in Example 2 of Gneiting et al. (2007), who uses only historical frequencies to predict tomorrow s weather, agnostic of any recent data that may have been obtained.

Cockayne, Graham, Oates, Sullivan, Teymur

Example 2 (Data-Agnostic Learning Procedure is Strongly Calibrated). The trivial learning procedure that takes µ(µ0, y) := µ0 is strongly calibrated to (Pr(Θ), P), since for θ µ0 and µ0 Pr(Θ), Ff#µ(µ0,y)(f(θ)) = Ff#µ0(f(θ)) U(0, 1)

for all f FΘ.

The implication of Example 2 is that strong calibration alone is not suﬃcient to justify the practical application of a learning procedure, and additional desiderata, such as statistical eﬃciency, will typically also need to be taken into account.2 This paper focusses on calibration and does not attempt to discuss other desiderata and how they should be balanced in the applied context. One can consider situations between the two extremes of Example 1 and Example 2:

Example 3 (Partial Posteriors are Strongly Calibrated). A partial posterior corresponds to performing full Bayesian inference using only summary statistics s: Y S of the dataset. These have recently been proposed as a tool for compensating for model misspeciﬁcation (Lewis et al., 2021). For the partial posterior learning procedure, µ(µ0, y) is the conditional distribution of the parameters given the summarised data s(y) and θ | s(y) µ(µ0, y). When both the prior and the partial posterior learning procedure are regular, it follows from Lemma 3 that Ff#µ(µ0,y)(f(θ)) U(0, 1) for all f FΘ. Thus partial posteriors are strongly calibrated to (Pr(Θ), P).

Next we present an example that is a clear departure from the Bayesian framework, in that it clearly does not return a posterior distribution and yet is provably strongly calibrated:

Example 4 (Probabilistic Stationary Iterative Methods are Strongly Calibrated). Let Θ = Rd and consider the data-generating model Pθ = δ(Aθ) that returns a Dirac distribution on y = Aθ, where A is a non-singular matrix. An ideal learning procedure would return µ(µ0, y) = δ(A 1y) = δ(θ), but in many practical scenarios the exact action of A 1 on y cannot be computed, either due to poor conditioning of the matrix A or due to the O(d3) computational cost associated with inverting A. This motivates the use of an alternative procedure, called a probabilistic iterative method, recently proposed in Cockayne et al. (2021) and based on classical iterative methods for solving linear systems (see e.g. Saad, 2003). To describe the procedure, let Ry : Θ Θ be an map, constructed using y, such that θ is a solution of the ﬁxed point equation θ = Ry(θ). For example, the choice Ry(θ) = (I ϵA)θ+ϵy, ϵ > 0, corresponds to a classical iterative method called Richardson s method. Consider then the learning procedure µ(µ0, y) := Ry #µ0, whose output is conjugate under a Gaussian input µ0, being an aﬃne transform, and can be exactly computed at cost O(d2). Cockayne et al. (2021) proved that, under mild conditions, the iterative application of Ry produces a sequence of distributions on Θ that contract to δ(θ), and that this procedure is strongly calibrated to (G(Rd), P), where G is the set of all Gaussian distributions

2. For example, maximizing the sharpness of the predictive distributions subject to calibration was proposed in Gneiting et al. (2007), although their use of the term calibration is distinct from the present paper, being focussed on forecast assessment. See Section 2.3.2 for further discussion of the literature on forecast assessment.

Testing Whether a Learning Procedure is Calibrated

supported on Rd. This example speaks to one potential use of Deﬁnition 8, in providing theoretical justiﬁcation for non-traditional learning procedures which nevertheless produce meaningful distributional output.

Next our attention turns to the relationship between Deﬁnition 8 and existing concepts in the literature.

2.3.2 Relation to Existing Concepts

Here we compare and contrast our notion of strong calibration with concepts appearing in earlier work and in related ﬁelds.

Frequentist Coverage: There is a rich literature that aims to assess learning procedures according to frequentist desiderata. In particular, one can ask whether credible sets have correct frequentist coverage, which is analogous to ﬁxing θ = θ0 and asking if y Pθ0 implies Ff#µ(µ0,y)(f(θ0)) U(0, 1); i.e. the only randomness is introduced during generation of the dataset. This diﬀers to our notion of strong calibration in that we sample θ from µ0 while, in the frequentist assessment, θ is ﬁxed. In particular, it is possible to prove certain learning procedures are strongly calibrated, but no learning procedure can be expected to attain correct frequentist coverage in general. The literature on frequentist assessment therefore focuses on weaker notions of coverage, such as asymptotically correct frequentist coverage, where the data are of the form y = (y1, . . . , yn) and credible sets are required to have correct frequentist coverage in the n limit. In ﬁnite-dimensional Bayesian analyses where a Bernstein von Mises theorem holds, asymptotically correct frequentist coverage is guaranteed (Freedman, 1999). Results on frequentist coverage have also been established in ﬁnite dimensions for variational Bayes (Wang and Blei, 2019). In inﬁnitedimensional settings, a Bayesian learning procedure can fail to have even asymptotically correct frequentist coverage (Cox, 1993; Freedman, 1999). An active area of research is to establish suﬃcient conditions for asymptotically correct frequentist coverage, and recent results have been established that hold uniformly over a set of values for θ0; for results in this direction see Szab o et al. (2015) and references therein.

Forecast Assessment: Dawid (1984) refers to the question of whether a probabilistic forecasting system is in some sense good as the fundamental question of prequential statistics . Our notion of strong calibration is closely related to a concept developed in that literature to answer this question, for which the term probabilistic calibration is used (Dawid, 1982; Diebold et al., 1997; Gneiting et al., 2007; Gneiting and Ranjan, 2013). An important distinction between forecast assessment and the present paper is the sense in which probabilistic calibration is applied; here we estimate a true parameter θ, which is not a random variable, whereas in forecast assessment there remains inherent randomness in the quantities being predicted. In the econometrics literature, Diebold et al. (1997) considered a sequence of forecasts (Qi)n i=1 Pr(R), representing predictions for corresponding quantities (qi)n i=1 R. The authors advocated a visual diagnostic, called a correlogram, to assess whether {FQi(qi)}n i=1 are plausible as an independent random sample from U(0, 1); see also Christoﬀersen (1998); Berkowitz (2001). In the statistics community, Gneiting and Ranjan (2013) proposed to compare the variance of the {FQi(qi)}n i=1 to 1/12, the variance of a U(0, 1) random variable,

Cockayne, Graham, Oates, Sullivan, Teymur

with the sequence of forecasts being called overdispersed if this variance is smaller than 1/12, and underdispersed if it is larger; see the review of Gneiting and Katzfuss (2014). This literature contains elements that are similar in spirit to our notion of strong calibration, except that a parametric statistical model is not explicitly involved; an important distinction that we require when assessing whether a learning procedure is calibrated. In the meteorology literature, the calibration of probabilistic forecasts is routinely assessed using rank histograms (Anderson, 1996; Talagrand et al., 1997; Hamill and Colucci, 1997; Hamill, 2001). For computational reasons, a forecast is typically represented by a discrete distribution µ(µ0, y) 1 M PM m=1 δ(θm), produced based on initial belief µ0 and after observing data y, assumed to have arisen from a data-generating model P. To assess the forecast, an ensemble of synthetic datasets {ym}M m=1 is simulated as ym Pθm. For a test function f FY , the rank statistic

r({f(ym)}M m=1, f(y)) :=

m=1 I[f(ym) < f(y)]

will be uniformly distributed on {0, 1, . . . , M} if the forecast is calibrated. This is assessed empirically by producing a histogram of rank statistics for a collection of T ensembles of synthetic datasets {{ym t }M m=1}T t=1 and corresponding real datasets {yt}T t=1, where t = 1, . . . , T may index distinct times, spatial locations, or both. Denoting the empirical measure associated with an ensemble of synthetic datasets as ˆνt = 1 M PM m=1 δ(ym t ), the rank statistic r({f(ym t )}M m=1, f(yt)) is related to the cdf of ˆνt by

Ff#ˆνt(f(yt)) = 1

M r({f(ym t )}M m=1, f(yt)).

Checking for rank histogram uniformity is therefore similar in spirit to the test for strong calibration in Remark 10, with relaxations to allow for the fact that the learning procedure produces an empirical distribution output and that the true parameters {θt}T t=1 that gave rise to the real datasets {yt}T t=1 are unknown, so that testing occurs in the data domain Y rather than in the parameter domain Θ.

Signal Processing: An important goal in signal processing is to estimate a time-dependent latent state {θt}T t=1, θt Rd, based on time-series data {yt}T t=1. For Gaussian ﬁltering algorithms, such as the extended Kalman ﬁlter (see Law et al., 2015, p84), the output of the learning procedure is a sequence of Gaussian distributions N(mt, Σt). These serve to quantify uncertainty as to the unknown value of the parameter θt, t = 1, . . . , T. Such a ﬁltering algorithm is considered to be calibrated if the Z-score Σ 1/2 t (θt mt) is plausible as a sample from N(0, 1). The average normalised estimation error squared (anees) (Bar-Shalom and Birmiwal, 1983; Drummond et al., 1998)

t=1 (θt mt) Σ 1 t (θt mt)

attempts to quantify this property, with values of anees close to 1 when the learning procedure is calibrated. Li et al. (2002) argued against the use of anees on the grounds

Testing Whether a Learning Procedure is Calibrated

that it penalises optimism much more severely than pessimism .3 These authors then proposed the non-credibility index (nci)

(θt mt) Σ 1 t (θt mt) (θt mt) Σ 1 t (θt mt)

where Σt is the covariance matrix of the random vector θt mt, where the randomness here refers to the generation of the dataset. The nci, which is also called the inclusion indicator in Li and Zhao (2006), takes values close to 0 if the ﬁltering algorithm is calibrated and is quite widely used (e.g. Pr uher et al., 2020). Further discussion can be found in Li et al. (2011). The anees is similar in spirit to our Deﬁnition 8, but it is adapted to learning procedures that produce Gaussian output and to a temporal data-generating model.

Validation of Algorithms for Bayesian Computation: Cook et al. (2006) observed that Bayesian inference is strongly calibrated to the prior and the data-generating model4

and presented the argument used in Example 1. Their interest was in validating software for Bayesian inference, and general learning procedures were not considered. They proposed a goodness-of-ﬁt test for the case Θ = Rd that corresponds to Remark 10, using a test statistic of the form

i=1 (F 1 N(0,1)(Ff#µ(µ0,yi)(f(θi))))2 (4)

for some f FΘ. If the null hypothesis holds and the learning procedure is strongly calibrated, then T χ2 n. Cook et al. (2006) focused on software that uses Markov chain Monte Carlo (mcmc), meaning that cdfs are not exactly computed, and advocated an empirical approximation to the cdf based on approximate samples {θm i }M m=1 from µ(µ0, yi) generated using mcmc. A similar approach was used to analyse abc in Wegmann et al. (2009), who performed a Kolmogorov Smirnov (ks) test for uniformity, and in Prangle et al. (2014) who used the name coverage property and advocated a visual diagnostic plot. In more recent work, Lee et al. (2019); Xing et al. (2019) proposed the use of credible sets to circumvent access to cdfs; this is similar in spirit to taking f to be an indicator function in Deﬁnition 8. In Talts et al. (2018) the authors modiﬁed the approach of Cook et al. (2006) to address issues surrounding empirical approximation of the cdf, such as discretisation artefacts when displayed as a histogram if an appropriate continuity correction or binning scheme is not used. Talts et al. (2018) showed that, for iid samples {θm i }M m=1 from the posterior given yi, rank statistics r({f(θm i )}M m=1, f(θi)) for a test function f FΘ will follow a discrete uniform distribution on {0, 1, . . . M}, and proposed to use this to test calibration rather than checking the (continuous) uniformity of estimated quantiles. Further, Talts et al. (2018) proposed to alleviate departures from uniformity in the rank statistics arising from the use of dependent mcmc rather than iid samples by thinning the mcmc samples using a heuristic based on the estimated chain autocorrelations.

3. It is unclear to us whether this is a problem, since in most statistical applications estimates that are conservative are generally preferred to estimates that are over-conﬁdent. 4. Though, the result was not described in such terms in that work.

Cockayne, Graham, Oates, Sullivan, Teymur

Validation of Bayesian Workﬂows: The aforementioned authors including Cook et al. (2006) focussed on the correctness of algorithms for Bayesian computation, but one can take a broader view in which a Bayesian workﬂow (e.g. including prior elicitation, selection of a likelihood, and so forth; see Gelman et al. (2020)), also form part of the learning procedure to be assessed. The earliest related work in this direction of which we are aware is Monahan and Boos (1992), who stated a deﬁnition similar to our strong calibration (albeit in terms of credible sets). These authors considered generalised Bayesian inference and provided the argument used in Example 3. A ks test for uniformity of Fµ(µ0,yi)(θi) was proposed in the case where Θ is one-dimensional. Harrison et al. (2015) proposed a notion of calibration that is similar in spirit to our Deﬁnition 8, motivated by the often challenging computational workﬂows encountered in applications to astronomy. First, the authors take a collection of candidate values θi for the parameter and generate associated datasets yi | θi iid Pθi. The values θi may be the same for each simulation generated or diﬀer between them, depending on the nature of the inference problem . Then, recasting into our notation, these authors proposed to test the null hypothesis that each set of assumed parameter values θi is drawn from the corresponding derived posterior µ(µ0, y) . This procedure coincides with our notion of strong calibration only if θi iid µ0. The authors considered Bayesian workﬂows ( our validation procedure [...] allows for the veriﬁcation of the implementation and any simplifying assumptions of the data model ) and proposed a multiple simultaneous version of [a novel, multidimensional] Kolmogorov Smirnov test for the calibrated null hypothesis. This multidimensional Kolmogorov Smirnov (mks) test provides an ingenious way to circumvent the selection of a test function f in Remark 10, being based on highest probability density regions instead of cdfs. However, the mks test does not have power against all alternatives to the calibrated null hypothesis, even in dimension d = 1, and the description of the test as a multi-dimensional ks test is misleading, as when d = 1 the test does not correspond to a standard ks test.

Summary: In summary, the content of Sections 2.1 to 2.3 and 2.3.1 departs from existing work on this topic in that:

1. where similar hypothesis tests have been performed in Monahan and Boos (1992); Cook et al. (2006); Harrison et al. (2015), they were used only to verify the correctness of algorithms and/or workﬂows for some form of Bayesian computation, while we proposed a notion of strong calibration that is ambivalent to any particular statistical framework;

2. Deﬁnition 8 is suﬃciently precise to allow for logical deduction, such as proving the strong calibration property holds for a non-traditional learning procedure such as that in Example 4.

The main drawback with Deﬁnition 8 appears to be practical, since testing for strong calibration in principle requires access to the cdf of f#µ(µ0, y) for at least one test function f FΘ. In some cases the cdf will be explicitly available or easily approximated, but in other cases it will not. Therefore, in the next section we propose a second, strictly weaker notion of calibration which can be tested without access to the cdf.

Testing Whether a Learning Procedure is Calibrated

2.4 Weakly Calibrated Learning Procedures

Testing whether a learning procedure is strongly calibrated may be challenging in practice. Furthermore, as discussed in Section 2.3, the requirement that both µ0 and the learning procedure are regular in the sense of Deﬁnitions 2 and 7 will often be too strong, given the diverse algorithms for uncertainty quantiﬁcation that have been proposed in literature. We therefore propose a second, weaker deﬁnition that requires neither additional structure to deﬁne a cdf nor regularity of the distributions involved:

Deﬁnition 13 (Weakly Calibrated). Let B P(Θ) denote a set of belief distributions and P a data-generating model. A learning procedure µ is said to be weakly calibrated to (B, P) if either of the following equivalent properties hold:

ZZ µ(µ0, y) d Pθ(y) dµ0(θ) = µ0.

(ii) θ 7 Z µ(µ0, y) d Pθ(y) is a µ0-invariant Markov kernel on Θ.

for all µ0 B. If the set B contains a single element, µ0, we say simply that µ is weakly calibrated to (µ0, P).

To give some intuition, the deﬁnition (i) above states that if one randomises the true parameter according to θ µ0, generates synthetic data according to y Pθ, and then samples ϑ µ(µ0, y) from the distributional output, this should be identical in distribution to sampling ϑ from µ0 directly. Similarly to Remark 12, one could consider quantifying departures from weak calibration in terms of a statistical divergence between the two measures appearing in (i), but here we focus on testing for equality and quantitative descriptions will not be pursued. Focussing on (ii), note that a suﬃcient condition is provided by the detailed balance condition (Eq. 20.5 in Meyn and Tweedie, 2009)

µ0(dθ) Z µ(µ0, y)(dϑ) d Pθ(y) = µ0(dϑ) Z µ(µ0, y)(dθ) d Pϑ(y), θ, ϑ Θ. (5)

On the other hand, the existence of non-reversible Markov kernels that are µ0 invariant (e.g. Bierkens, 2016) demonstrates that (5) is not a necessary condition for (ii) to hold. The main practical advantage of Deﬁnition 13 is that we may test whether a learning procedure is weakly calibrated without access to cdfs of any univariate summary f#µ(µ0, y), f FΘ:

Remark 14 (Testing whether a Learning Procedure is Weakly Calibrated). Let µ0 P(Θ) and let

yi | θi iid Pθi

ϑi | θi, yi iid µ(µ0, yi).

Then weak calibration of a learning procedure µ to (µ0, P) can be tested using any goodnessof-ﬁt test for the null hypothesis that ϑi iid µ0. Alternatively if µ0 and µ are each regular,

Cockayne, Graham, Oates, Sullivan, Teymur

one could instead test for weak calibration by picking one or more functions f FΘ and using any goodness-of-ﬁt test for the null hypothesis

Ff#µ0(f(ϑi)) iid U(0, 1).

This is of course equivalent to the procedure described in Remark 14 provided a suﬃciently large set of f Fr(Θ) are used, but we write it in this way to draw a comparison with Remark 10.

2.4.1 Examples of Weakly Calibrated Learning Procedures

A natural question is whether a learning procedure that is strongly calibrated to (B, P) is also weakly calibrated to (B, P), as the nomenclature suggests. This is indeed the case, as stated below and proven in Section A.2.

Lemma 15 (Strongly Calibrated = Weakly Calibrated). Let Θ = Rd for some d N. Suppose that µ is a regular learning procedure that is strongly calibrated to (B, P), where B Pr(Θ) and P is a data-generating model. Then the learning procedure µ is also weakly calibrated to (B, P).

By virtue of Lemma 15, the learning procedures that were shown to be strongly calibrated in Section 2.3.1 are also weakly calibrated. However, the converse is not true in general, and the following example provides a cautionary tale:

Example 5 (Weakly Calibrated = Strongly Calibrated). A learning procedure µ may produce quite unreasonable distributional output µ(µ0, y) and yet be weakly calibrated. As a concrete example, consider Θ = R, an initial belief distribution µ0 = N(0, 1), and a data-generating model Py(θ) distributed according to y = θ + ϵ, with independent noise ϵ N(0, 1). The Bayesian learning procedure produces µ(µ0, y) = N(y/2, 1/2) and is both weakly and strongly calibrated to (µ0, Py) (see the left hand panel in Figure 1). The mirror Bayes learning procedure, which ﬂips the sign of the datum y before the Bayesian learning procedure is applied, produces µ(µ0, y) = N( y/2, 1/2), which is not strongly calibrated to (µ0, Py) but is nevertheless weakly calibrated to (µ0, Py) (see the right hand panel in Figure 1).

Thus there is a trade-oﬀbetween strong and weak calibration, where the more straightforward approach to testing aﬀorded by weak calibration occurs at the expense of failing to rule out pathologically bad learning procedures, such as Example 5. An important class of learning procedures that are widely used and yet are not weakly calibrated are the generalised Bayesian learning procedures (Bissiri et al., 2016). These are typically not weakly calibrated to the data-generating model and the prior, since these learning procedures are motivated by the M-open setting (Bernardo and Smith, 1994, 6.1.2) where the data-generating model may be misspeciﬁed. A canonical example of a generalised Bayesian procedure is presented next:

Example 6 (Fractional Posteriors are not Weakly Calibrated). To avoid technical obfuscation, in this example we abuse notation and assume that µ0 and µ(µ0, y) can be identiﬁed with densities with respect to the reference measure λ on Θ, i.e. µ0(A) = R

A µ0(θ) dλ(θ)

Testing Whether a Learning Procedure is Calibrated

Strongly calibrated

Weakly calibrated

µ0 (θi, yi) µ(µ0, yi) 1 n Pn i=1 µ(µ0, yi)

Figure 1: Strong versus weak calibration. Centre: (θi, yi) pairs (blue, orange, green circles) were generated from the joint distribution described by a reference distribution µ0 and data generating model P (purple heatmap and contours). Left: Distributional output µ(µ0, yi) (blue, orange, and green dash-dotted lines) from a learning procedure that is strongly calibrated. Note how the true parameters θi (blue, orange, and green dotted lines) are plausible as samples from their associated distributions µ(µ0, yi). Right: Distributional output from a learning procedure that is not strongly calibrated but nevertheless weakly calibrated. Note that the average of the distributional outputs µ(µ0, yi) (black dashed line) is close to µ0 (solid black line), even though the individual θi, in some cases, lie far out in the tails of the associated distributions µ(µ0, yi), and thus are not plausible as samples from said distributions.

for each µ0-measurable set A (and analogously for µ(µ0, y)). Similarly, we assume that Pθ admits a density p( | θ) with respect to a suitable reference measure dy on Y .5

Here we consider fractional posteriors (Bhattacharya et al., 2019), a prototypical instance of a generalised Bayesian learning procedure. As with partial posteriors in Example 3, fractional posteriors have been proposed as a remedy for model misspeciﬁcation (e.g. in Safe Bayes, Gr unwald and van Ommen (2017)). The distributional output of a fractional posterior is deﬁned as µ(µ0, y)(θ) := p(y|θ)tµ0(θ)/pt(y), θ Θ, where t [0, 1] and we have deﬁned pt(y) := R p(y | ϑ)tµ0(ϑ) dϑ, assuming that pt(y) > 0. As an example, consider µ0 = N(0, 1), p(y | θ) = N(y; θ, σ2), σ > 0. Our aim is to verify condition (i) in Deﬁnition 13, which requires the distribution ZZ µ(µ0, y) d Pθ(y) dµ0(θ) = N ϑ; 0, t2(σ2 + 1) + σ2(t + σ2)

5. Note that this is not the same as assuming µ and µ0 are regular, since their pdfs are not required to be positive on Θ.

Cockayne, Graham, Oates, Sullivan, Teymur

Uncalibrated (Overconﬁdent)

Uncalibrated (Overly Conservative)

µ0 (θi, yi) µ(µ0, yi) 1 n Pn i=1 µ(µ0, yi)

Figure 2: Potential consequences of uncalibrated methods. Centre: as in Fig. 1. Left: Distributional output µ(µ0, yi) (blue, orange, and green dash-dotted lines) from a learning procedure that is overconﬁdent. The true parameters θi will tend to be further in the tails of µ(µ0, yi). Right: Distributional output from a learning procedure that is overly conservative. True parameters θi tend to be close to the mode of µ(µ0, yi). The width of the distribution µ(µ0, yi) suggests that, when the mode is used as an estimator, this estimator has higher uncertainty than the typical error. In both cases the average of the distributional outputs (black dashed line) diﬀers from µ0 (solid black line).

to be equal to N(ϑ; 0, 1), i.e.

t2(σ2 + 1) + σ2(t + σ2)

(t + σ2)2 = 1 = t2(σ2 + 1) + σ2(t + σ2) = (t + σ2)2 = σ2t(t 1) = 0.

Thus, fractional posteriors are weakly calibrated if and only if either t = 1, which reduces to standard Bayesian inference (Example 1), or t = 0, which is data-agnostic (Example 2).

Finally we present two examples of learning procedures that are neither strongly nor weakly calibrated, to demonstrate the potential consequences of methods not being calibrated.

Example 7 (Consequences of Uncalibrated Methods). We return to the setting of Example 5. Recall that we have an initial belief distribution µ0 = N(0, 1) and a data-generating model Py(θ) such that y = θ +ϵ with independent noise ϵ N(0, 1). The Bayesian learning procedure µ(µ0, y) = N(y/2, 1/2) is both weakly and strongly calibrated to (µ0, Py). Consider the setting of learning procedures that return distributional output µ(µ0, y) = N(y/2, η/2) for some η > 0, that is, the procedures have the same mean as the Bayesian learning procedure but a diﬀerent variance for η = 1. We illustrate the output in Fig. 2. When η = 0.5 < 1 (left panel), the learning procedures are overconﬁdent. The output µ(µ0, y) is narrower and more peaked than the correctly speciﬁed Bayesian procedure

Testing Whether a Learning Procedure is Calibrated

µ(µ0, y), with the consequence that the true parameter θi typically lies further in the tails of the distribution than for the correctly speciﬁed procedure. Thus, the misspeciﬁed procedure will often suggest a high degree of conﬁdence in the wrong value of the parameter θi. Conversely, when σ = 2 > 1 (right panel), the learning procedures are overly conservative. The procedure µ(µ0, y) produces a distributional output that is wider and ﬂatter than the correctly speciﬁed Bayesian procedure µ(µ0, y). Thus the true value of the parameter θi will typically be closer to the mean than the posterior variance would suggest, with the consequence that a user will often associate an accurate estimator of θ with a high degree of uncertainty. In both cases σ = 0.5 and σ = 2 note that the average of µ(µ0, yi) diﬀers from µ0.

2.4.2 Relation to Existing Concepts

Here we compare and contrast our notion of weak calibration with concepts appearing in earlier work and in related ﬁelds.

Forecast Assessment: Our notion of weak calibration is closely related to a concept developed in the literature on forecast assessment, for which the term marginal calibration is used (Gneiting et al., 2007). As previously mentioned in Section 2.3.2, an important distinction between forecast assessment and the present paper is the sense in which notions such as probabilistic calibration and marginal calibration are applied. This leads to major diﬀerences between forecast assessment and the present work. For example, probabilistic calibration does not imply marginal calibration in the context of forecast assessment,6 while our notion of strong calibration does imply weak calibration in the context of testing whether learning procedures are calibrated, as established in Lemma 15.

Validation of Algorithms for Bayesian Computation: The invariance property that underpins our notion of weak calibration has previously been noted in the Bayesian context. Talts et al. (2018) call this self-consistency of the data-averaged posterior . It appears to have been ﬁrst used in Geweke (2004), who proposed to use it to check the correctness of mcmc algorithms and their code. Therein, the author proposed to alternatively sample from y | θ and θ | y, the latter using mcmc. For a correctly implemented mcmc method, θ will be marginally distributed according to the prior µ0 after an initial burn-in period has passed. Geweke (2004) performed a collection of univariate hypothesis tests for this weak calibration null hypothesis, followed by a Bonferroni correction to adjust for multiple testing. Our Deﬁnition 13 is similar in spirit, but is precise enough to permit logical deduction, such as Lemma 15, and yet general enough to cover learning procedures which need not exist within a Bayesian context. Additionally, we do not assume the structure of mcmc that is required to render this Gibbs-like approach practical.

This completes our formal discussion of what it means for a learning procedure to be called calibrated . The next section presents several vignettes designed to illustrate our the general framework.

6. A simple example of a forecaster who is marginally calibrated but not probabilistically calibrated is provided by the unfocussed forecaster of Gneiting et al. (2007); see also Hamill (2001). These examples have no analogue in our context, due to the fact that there is no inherent randomness in the true parameter θ, while the quantity being predicted is inherently random in the setting of forecast assessment.

Cockayne, Graham, Oates, Sullivan, Teymur

3. Vignettes

In this section we exploit our framework to test whether or not several popular learning procedures are calibrated, with ﬁve separate vignettes presented. The ﬁrst two vignettes, Sections 3.1 and 3.2, consider learning procedures that are motivated as being approximations to Bayesian inference and are widely used: Gaussian approximations to non-Gaussian posteriors and approximate Bayesian computation, respectively. In challenging applications, the output produced using these approximations can fail to resemble the usual Bayesian posterior; we therefore view these approximations as learning procedures in their own right and we ask whether their distributional output is calibrated. Section 3.3 presents a topical application to recently developed probabilistic ordinary diﬀerential equation (ode) solvers. Section 3.4 concerns the challenge of performing a goodness-of-ﬁt test for strong calibration in multiple dimensions, where a suitable test function f must ﬁrst be identiﬁed. The ﬁnal vignette, Section 3.5 examines how our notions of calibration can be extended to the setting where the data-generating model is misspeciﬁed.

3.1 Gaussian Approximations

A common approach in statistics is to output a Gaussian distribution which approximates, in some sense, the distributional output of an idealised learning procedure. The targeted learning procedure will often be Bayesian inference, however Gaussian approximations can also be used within diﬀerent inferential paradigms. As an example of such an approach, Gaussian approximations are often the output of variational inference methods, wherein the learning procedure outputs the member of a family of distributions (in this case Gaussian) which minimises a divergence from the target distribution (Blei et al., 2017). A distinct but related approach is that of ﬁtting a Gaussian approximation based on only local information. The Laplace approximation, which outputs a Gaussian distribution centred at a maximum of the log density of the target distribution and with covariance equal to the inverse of the Hessian of the log density at this point, is a canonical example of such a method. As a ﬁrst simulation study we test the calibration of Laplace approximations to the Bayesian posterior in a model with a location parameter θ. We assign a prior µ0 = N(0, 1), and a Student s t data-generating model Pθ such that y consists of n independent draws from a T (θ, 1, ν) distribution. To be speciﬁc, y = (y(n))N n=1, with y(n) iid T (θ, 1, ν) for n {1, . . . , N}. The true posterior in this case is non-Gaussian and so our expectation is that a Laplace approximation will be neither strongly nor weakly calibrated. However, for ν or N (and ν > 2) the posterior will become increasingly close to Gaussian, in the former case due to the Student s t distribution becoming increasingly close to Gaussian as ν , and in the latter due to the asymptotic normality of the posterior as N by the Bernstein von Mises theorem for ν > 2. We therefore would expect it to be increasingly challenging for the tests in Remark 10 and Remark 14 to reject respectively strong and weak calibration as ν or N . In univariate cases such as this, we may employ the identity test function f(θ) = θ and a one-sample ks test to check for uniformity in the tests in Remarks 10 and 14. Laplace approximations were computed for 106 realisations from the hierarchical model θi iid µ0, yi | θi iid Pθi, for each of ν {1, 2, . . . 20} with N = 5 and for each of N {1, 2, . . . 20}

Testing Whether a Learning Procedure is Calibrated

2 4 6 8 10 12 14 16 18 20 Degrees of freedom ν

KS test statistic

Strong calibration Weak calibration

2 4 6 8 10 12 14 16 18 20 Degrees of freedom ν

KS test p-value

Strong calibration Weak calibration p = 0.05

2 4 6 8 10 12 14 16 18 20 Number of data points N

KS test statistic

Strong calibration Weak calibration

2 4 6 8 10 12 14 16 18 20 Number of data points N

KS test p-value

Strong calibration Weak calibration p = 0.05

Figure 3: Gaussian approximations: ks test statistics (left) and p-values (right) for strong and weak calibration of Laplace approximations in the t-distribution example for varying degrees of freedom ν and N = 5 (top) and varying number of data N and ν = 3 (bottom).

with ν = 3. The strong and weak calibration test results are summarised in Figure 3. As expected, we see that the power of both the strong and weak calibration tests decrease as ν and N increase, with the ks test statistics (deﬁned in (7)) showing decreasing departures from uniformity. While the strong calibration test rejects the null hypothesis at a 0.05 signiﬁcance criterion for all values of ν and N tested, the weak calibration test fails to reject at a 0.05 level for most of the ν range. However, for the results with varying N, we see that weak calibration test correctly rejects the null hypothesis at a 0.05 signiﬁcance level up to N = 20. A test of strong calibration is clearly preferable to a test of weak calibration in situations where it is possible to be performed. However, these results indicate that the weaker test in Remark 14 is still able to provide a useful check of calibration in some situations, with the beneﬁt of being simpler to compute and more widely applicable than the test in Remark 10.

3.2 Approximate Bayesian Computation

Performing Bayesian inference in settings for which the data-generating model Pθ does not have a tractable pdf is challenging, with abc methods (Beaumont et al., 2002) often used

Cockayne, Graham, Oates, Sullivan, Teymur

0.0 0.5 1.0 Fµ(µ0,yi)(θi)

Cumulative density

Tolerance ϵ = 1

ABC (KS = 0.11) Noisy ABC (KS = 0.011) Uniform

0.0 0.5 1.0 Fµ(µ0,yi)(θi)

Cumulative density

Tolerance ϵ = 2

ABC (KS = 0.12) Noisy ABC (KS = 0.0085) Uniform

0.0 0.5 1.0 Fµ(µ0,yi)(θi)

Cumulative density

Tolerance ϵ = 10

ABC (KS = 0.011) Noisy ABC (KS = 0.0071) Uniform

Figure 4: Approximate Bayesian computation: Empirical cdfs for both abc and noisy abc in the g-and-k quantile distribution example, for tolerances ϵ = 1 (left), ϵ = 2 (middle) and ϵ = 10 (right). The values of the ks test statistics are shown in the legend.

as an alternative in such situations. The key idea in abc is that, in contrast to the standard Bayesian procedure of conditioning on the observed dataset y = yobs, one instead conditions on the event that d(y, yobs) < ϵ, for some distance d: Y Y [0, ) and some tolerance ϵ > 0. Typically the distance is speciﬁed by embedding the data into a ﬁnite-dimensional normed vector space S via a summary statistic function s: Y S and specifying the distance as d(y, yobs) = s(y) s(yobs) . As a consequence of Example 3, the learning procedure that exactly conditions on s(y) = s(yobs), i.e. abc with tolerance ϵ = 0, is guaranteed to be strongly calibrated. Likewise in the limit of ϵ the abc posterior will be strongly calibrated, as the posterior will collapse to the prior (see Example 2). For ϵ (0, ) the abc posterior will in general however be neither strongly nor weakly calibrated. To resolve this lack of calibration of abc methods, Fearnhead and Prangle (2012) proposed the noisy abc algorithm, which is calibrated for any tolerance ϵ 0. Rather than conditioning on the event s(y) s(yobs) < ϵ, noisy abc replaces s(yobs) with noisy summary statistics sobs generated according to sobs = s(yobs)+ϵx, with x uniformly distributed on the unit ball in S. The distributional output of noisy abc is the partial posterior based on sobs, which takes into account the additional noise in the data-generating model, and is therefore strongly calibrated by an extension of the argument in Example 3. Here we consider the parameter inference task for a g-and-k distribution. The g-and-k distribution is deﬁned through the inverse of its cdf (quantile function) and it does not have a closed-form pdf (though the pdf can be evaluated numerically; Rayner and Mac Gillivray, 2002). Here we aim to infer the location parameter θ, which is assigned a prior µ0 = N(0, 1), given a dataset y RN, N = 20, generated according to the data-generating model

Pθ : y(n) = θ + b 1 + 0.8 1 exp( gun)

1 + exp( gun)

un(1 + u2 n)k,

Testing Whether a Learning Procedure is Calibrated

1 2 3 4 5 6 7 8 9 10 Tolerance ϵ

KS test statistic

ABC Noisy ABC

1 2 3 4 5 6 7 8 9 10 Tolerance ϵ

KS test statistic

ABC Noisy ABC

1 2 3 4 5 6 7 8 9 10 Tolerance ϵ

KS test p-value

ABC Noisy ABC p = 0.05

1 2 3 4 5 6 7 8 9 10 Tolerance ϵ

KS test p-value

ABC Noisy ABC p = 0.05

Figure 5: Approximate Bayesian computation: ks test statistics (top) and p-values (bottom) for strong (left) and weak (right) calibration of abc and noisy abc in the g-and-k quantile distribution example, for various tolerances ϵ > 0.

with un iid N(0, 1), n {1, 2, . . . N}, b = 1, g = 2 and k = 0.5. For the tests that follow we computed independent realisations from the hierarchical model θi iid µ0, yi | θi iid Pθi. In each case, data were summarised as a vector s: RN R5 consisting of the ﬁve quartiles of the dataset, and rejection sampling was used to generate M samples {θm i }M m=1 from the distributional output of both abc and noisy abc, for tolerances ϵ {1, 2, . . . 10}. Single samples (M = 1) can be directly used to test for weak calibration, as per Remark 14. However, the intractability of the distributional output for abc and noisy abc precludes a straightforward test for strong calibration. Instead, we consider a variant of the test for strong calibration in Remark 10, which in a similar spirit to Talts et al. (2018), wherein we test whether the rank statistics r({θm i }M m=1, θi) are iid samples from the discrete uniform distribution on {0, 1, . . . M}. For testing strong calibration, a total of 104 realisations of the hierarchical model were considered with M = 100, while for the less computationally demanding test for weak calibration a total of 106 realisations were considered with M = 1. Figure 4 presents empirical cdfs for both abc and noisy abc, on which our test for strong calibration is based. Figure 5 presents the ks test statistics and corresponding pvalues for both strong and weak calibration, for diﬀerent values of the tolerance ϵ > 0. In each case noisy abc is, as expected, seen to be better calibrated than abc. Both the strong

Cockayne, Graham, Oates, Sullivan, Teymur

and weak calibration tests correctly fail to reject the null hypothesis at a 0.05 signiﬁcance level for noisy abc, which is strongly (and weakly) calibrated, for all values of the tolerance ϵ. The strong calibration test fails to reject the null hypothesis that abc is strongly calibrated for the highest two tolerances ϵ 9. The weak calibration test on the other hand correctly rejects at a 0.05 level the null hypothesis that abc is weakly calibrated for all ϵ. The apparent greater power of the weak calibration test here likely arises from the much larger number of model realisations used 106 compared to 104 for the strong test for a given computational expenditure due to the need to generate only M = 1 abc sample per realisation rather than M = 100. A ﬁnal interesting point of note is that both weak and strong calibration show a dip in the ks test statistic at ϵ = 1, reﬂecting that as ϵ 0 classical abc tends towards a Bayesian procedure, which is guaranteed to be calibrated.

3.3 Calibration of Probabilistic ODE Solvers

A traditional (adaptive) numerical method for the approximate solution of an ode accepts, as its input, an error tolerance τ > 0 and returns, as its output, an approximation to the solution of the ode. In general it is not guaranteed that the resulting approximation has error less than τ, but empirical analysis over a range of typical odes can provide reassurance that the error will be below τ for many problems practically encountered. In contrast to the traditional approach, there has been a concerted research eﬀort in recent years to develop probabilistic numerical methods (pnms) for odes. A pnm returns a probability distribution over the solution space of the ode, representing epistemic uncertainty associated with the unknown true solution of the ode. The scale of this distributional output can be used as the basis for selecting a suitable time step size in order to drive the uncertainty below a user-speciﬁed tolerance τ, if desired. Compared to traditional numerical methods, which have beneﬁted from over a century of development, important questions regarding their behaviour of pnms remain unanswered, including whether such methods are calibrated. Most pnms exploit Gaussian process (gp) models for the solution of the ode, motivated by mathematical convenience rather than detailed knowledge of the ode to be solved. These models typically include hyperparameters for the gp, which are jointly estimated along with the solution of the ode. Given that pnm act on the basis of a default gp model, essentially independent of initial belief µ0 regarding the ode at hand, it is unclear whether hyperparameter estimation is suﬃcient to ensure pnm are calibrated. The principal application of pnms for odes is to inverse problems, where an ode s parameters are to be estimated based on a dataset. This usually requires the numerical solution of many odes, each corresponding to diﬀerent values of the parameters, to see which parameter values are compatible with the dataset. The motivation for pnms in this setting is that the solution of the odes can be viewed as an unknown latent quantity and integrated out, potentially using a fast-but-crude pnm in place of an adaptive ode solver and adjusting credible sets for ode parameters in a way commensurate with the accuracy of the pnm used. However, the success of this approach hinges on whether the underlying pnm is calibrated, as otherwise underor over-conﬁdent parameter inferences could be produced. To shed light on this question, we considered the probabilistic numerical solution of the

Testing Whether a Learning Procedure is Calibrated

8 Chkrebtii et al. (2016)

2.5 Teymur et al. (2018)

Schober et al. (2018)

6 Tronarp et al. (2019)

4 Teymur et al. (2021)

Figure 6: Calibration of probabilistic ODE solvers: Samples (blue) from the strong calibration test statistic Ff#µ(µ0,yi)(f(θi)) (c.f. Remark 10), with f(θ) = θ(1)(10) (top row), θ(2)(10) (middle row), and from the weak calibration test statistic ϑi µ(µ0, yi) (bottom row; c.f. Remark 14). The reference distribution, corresponding to the null hypothesis that the learning procedures are calibrated, is in each case shown in red. The p-values for the associated hypothesis tests (see main text for details) are shown in the top right-hand corner of each panel.

following Lotka Volterra ode

dt = αθ(1) + βθ(1)θ(2)

γθ(2) + δθ(1)θ(2)

, θ(0) = 10 10

, t [0, 10], (6)

with an initial belief distribution µ0 induced over the solution space of diﬀerentiable functions θ(t) on [0, 10] by sampling parameters (α, β, γ, δ) from a probability distribution π on [0, )4. For this experiment we took the distribution π to be

α, γ iid log Normal(0, 0.25), β, δ iid log Normal( 2, 0.1),

which produces a variety of periodic trajectories typically associated with this type of predator-prey model. The following pnms were considered: Chkrebtii et al. (2016), which employs a particle-based approach requiring parallel simulations to produce empirical credible sets; Teymur et al. (2018), which is based on stochastic perturbation of traditional numerical methods, continuing a line of work that originated in Conrad et al. (2017); Schober et al. (2019) and Tronarp et al. (2019), which are both based on Gaussian ﬁltering but with

Cockayne, Graham, Oates, Sullivan, Teymur

diﬀerent approaches to the (local) linearisation of (6); and Teymur et al. (2021), which is based on a probabilistic version of Richardson extrapolation. Each method has user-deﬁned settings that can in principle aﬀect the selection of its hyperparameters, and thus, its calibration in the senses used in this paper; for this experiment we considered one setting only for each pnm, with full details contained in Appendix B. In particular, default settings were used for some pnms, whilst the settings of other pnms were manually selected. Thus we do not claim to draw general conclusions about the speciﬁc pnms involved; our aim is only to show how diverse algorithms can be analysed using the notions of calibration we have introduced. Tests of strong and weak calibration were performed, in each case using the test functions fj(θ) = θ(j)(10), j {1, 2}, i.e. the value of the solution at the ﬁnal time point. Results are displayed in Figure 6. The top two rows show histograms of Ff#µ(µ0,yi)(θ(j) i (10)) for j {1, 2}, using 100 samples θi drawn from µ0. A Kolmogorov Smirnov test of uniformity was then used to test whether the pnms are strongly calibrated (c.f. Remark 10). The bottom panels show scatter plots of samples ϑi(10) where ϑi µ(µ0, yi), overlaid on contours of µ0 (empirically obtained). A kernel two-sample test (Gretton et al., 2012) was performed based on samples from the intractable distribution µ0 to assess whether the pnms are weakly calibrated (c.f. Remark 14). The results of these simulations show that strong calibration is not a property enjoyed by most pnm at present. The only instance where strong calibration was not emphatically rejected is Teymur et al. (2018), for inference of the ﬁrst component θ(1)(10). It is interesting to note that Teymur et al. (2018) performs an exhaustive grid search for gp hyperparameter estimation, which can require more computation compared to the other pnm considered, and this may explain its relative success in this calibration assessment. The remaining pnm perform poorly in diﬀerent ways, including being overconﬁdent (e.g. Schober et al., 2019) and under-conﬁdent (e.g. Tronarp et al., 2019). However, we reiterate that these conclusions will depend on additional user-speciﬁed settings, speciﬁc to how each pnm is implemented. On the other hand, weak calibration was never rejected, and indeed this was also the case over a much wider variety of algorithm settings (not presented). This suggests that weak calibration of pnms, in as far as this testing framework is concerned, is indeed a weak requirement.

3.4 Data-Driven Goodness-of-Fit Testing for Strong Calibration

For multivariate parameter inference tasks, where e.g. Θ = Rd, d > 1, it will not be possible in general to identify a single test function f FΘ that has power against all alternatives to the strong calibration null. Indeed, even a simultaneous test using all coordinate functions fi(θ) = θ(i), i = 1, . . . , d, does not have power against all alternatives, since a multivariate distribution is not uniquely determined by its univariate marginals. Nevertheless, the richness of the set FΘ is such that we expect some f FΘ to yield a test with the power to reject the null hypothesis, due to Lemma 5. A strategy to select a suitable test function f is therefore required. Following a generic approach to goodness-of-ﬁt testing, one way to proceed is to consider splitting the collection of simulated parameter-dataset pairs into two disjoint sets: S1 := {(θi, yi)}s i=1, S2 := {(θi, yi)}S i=s+1. The ﬁrst subset S1 can be used to identify a suitable test function f, after which a goodness-of-ﬁt test can be conducted using f and S2. The

Testing Whether a Learning Procedure is Calibrated

independence of S1 and S2 ensures that a test conducted in this way is valid. To select a suitable test function, one ﬁrst identiﬁes a suﬃciently small subset Fs FΘ of test functions and, for each f Fs, a univariate goodness-of-ﬁt test is performed using S1. The element of Fs that gives rise to the strongest evidence against the null hypothesis, based on S1, is selected. The main advantage of a data-splitting approach is that the selection of f is data-driven, as opposed to f being user-speciﬁed. The role of data to inform the selection of f is anticipated to be increasingly important in higher dimensional settings, d 1. To explore this, we consider now a setting that is, at least notionally, inﬁnite dimensional. Let θ : [0, 1] R be a continuous function-valued parameter, so that Θ = C(0, 1) is the set of continuous functions on [0, 1]. For µ0 we consider a hierarchical, non-stationary gp of the form θ(x) := σ(x)g(x), g GP(0, k) with k(x, x ) := exp( (x x )2/ℓ2), σ ν for some distribution ν to be speciﬁed, and for simplicity ℓ= 0.1 is ﬁxed. Consider the datagenerating model that returns y = (y(1), . . . , y(10)), where y(n) = θ(xn) and xn U(0, 1) are independently sampled. A popular, pragmatic workﬂow acknowledges the non-stationarity encoded in µ0 but, for computational convenience, ﬁts instead a stationary, non-hierarchical gp of the form θ(x) = σ0g(x), where the scalar σ0 is estimated using maximum likelihood. Estimating σ0 from data enables the scale of the distributional output to roughly adapt to the scale of the dataset, but this is insuﬃcient to ensure the learning procedure is strongly calibrated (Karvonen et al., 2020). Our interest here is in whether we can detect failure of strong calibration, and for this purpose we consider a simple form of ν that sets σ(x) = 1 + x with probability one. It can be expected that simpliﬁed gp regression produces a compromise value of σ0, which leads to under-conﬁdent inferences for θ(x) when x is close to 0 and over-conﬁdent inferences when x is close to 1. For the set of candidate test functions Fs, we consider the evaluation functions fx(θ) := θ(x), indexed by x [0, 1]. A number, S, of parameter-dataset pairs were generated, of which s = S

2 were assigned to S1 and used to identify a promising location x [0, 1] at which to perform a hypothesis test of strong calibration using the held-out S2. Since the marginals (fx)#µ(µ0, y) are Gaussian, it is natural to use a χ2 s test, as per (4). Thus we select x to minimise the p-value of a two-sided χ2 s test, based on fx and computed using S1, over x [0, 1]. The total number of simulated parameter-dataset pairs S was varied from 10 to 150 and, through repeated simulation, the p-values of a two-sided χ2 s test of strong calibration, based on the estimated x and S1, were computed. As a baseline, we also computed p-values for a user-speciﬁed test function centred at xb := 0.5. In Figure 7 (left) we plot log p-values as a function of x, for s = 10 (top) and s = 150 (bottom), for one typical realisation of S1. These results indicate that values of x close to 0 are likely to provide the most power for our hypothesis test. Here x is indicated as a vertical red line and xb indicated as a vertical blue line; the identiﬁcation of a suitable x is seen to be easier when the number, s, of simulations available in S1 is increased. Finally, in Figure 7 (right) we plot the p-values obtained when the x -based and xb-based tests are applied to S2. To avoid reporting an artefact of the random seed, average log p-values are reported, along with standard errors, based on 100 independent realisations of S1 and S2. It is seen that the data-driven goodness-of-ﬁt test (based on x ) is more powerful than the user-speciﬁed test (based on xb). This illustration makes clear that, for a data-splitting approach to work well, the size of the set Fs of candidate test functions should be carefully controlled, relative to the number

Cockayne, Graham, Oates, Sullivan, Teymur

log(p), computed on S1

0.0 0.2 0.4 0.6 0.8 1.0 x

log(p), computed on S1

2.5 3.0 3.5 4.0 4.5 5.0 log(s)

E[log(p)], computed on S2

Fixed Data-driven

Figure 7: Data-driven goodness-of-ﬁt testing for strong calibration: Here we consider an inﬁnite-dimensional parameter θ C(0, 1). On the left, we display typical pvalues obtained using a test based on the evaluation functional fx(θ) = θ(x) and a number s of independent simulations of the parameter and dataset (top: s = 10, bottom: s = 150). The right hand panel displays average p-values obtained when the procedure is applied to 100 independent realisations of S1 and S2, each of size s, using either the data-driven choice x = x or the ﬁxed choice x = xb of test.

s of samples in S1. For example, if we simply took Fs = FΘ, then for any α (0, 1) there would be inﬁnitely many elements of Fs for which the null hypothesis is rejected at level α by virtue only of the fact that S1 is a ﬁnite set. Consideration of multiple data splits can also be exploited to increase the power of such a test (Romano and Di Ciccio, 2019).

3.5 Robust Calibration

Our proposed notions of strong and weak calibration can be extended to the M-open setting (Bernardo and Smith, 1994, 6.1.2) where the data-generating model may be misspeciﬁed. This permits us to deﬁne notions of robust calibration , which are analogous (and orthogonal) to the notions of robust estimation that are already widely studied (Berger, 1994; Huber and Ronchetti, 2009). For example, suppose that a learning procedure µ is strongly calibrated to (µ0, P). Then, for any f FΘ, the distribution Uf,P of the random variable Ff#µ(µ0,y)(f(θ)), where θ µ0, y | θ Pθ, is by deﬁnition U(0, 1). Thus, when the data-generating model P is misspeciﬁed, we may quantify the loss of strong calibration in terms of a statistical divergence between Uf,Q and U(0, 1). Here we adopt a more practical perspective, using the framework of Section 2.3 to test the strong calibration null hypothesis in settings where the data-generating model is misspeciﬁed. For example, consider a Bayesian learning procedure µ for a location parameter

Testing Whether a Learning Procedure is Calibrated

0.0 0.2 0.4 0.6 0.8 Tolerance ϵ

KS test statistic

Bayes RBayes, t = 0.1 RBayes, t = 0.2 RBayes, t = 0.3

0.0 0.2 0.4 0.6 0.8 Tolerance ϵ

KS test p-value

Bayes RBayes, t = 0.1 RBayes, t = 0.2 RBayes, t = 0.3

Figure 8: Robust calibration: Results of ks tests for strong calibration, comparing standard Bayesian inference ( Bayes ) to the method of fractional posteriors (also called robust Bayes; RBayes ) with exponents t {0.1, 0.2, 0.3}, in a setting where the likelihood is misspeciﬁed. Note that in the right panel the RBayes lines for t = 0.1, t = 0.2 and t = 0.3 coincide.

θ, which is assigned a prior µ0 = N(0, 3), based on a likelihood Y | θ N(θ, 1). Our assessment will be performed using the data-generating model

( N(θ, 1) w.p. 1 ϵ N(5, 1) w.p. ϵ ,

where ϵ [0, 1] is a probability of obtaining a contaminated observation, so that for ϵ > 0 the likelihood is misspeciﬁed and the Bayesian learning procedure is not strongly calibrated to (µ0, P). Fractional posteriors with exponent t [0, 1], as deﬁned in Example 6, have been proposed as learning procedures that can oﬀer robustness to misspeciﬁcation of the likelihood e.g. in Gr unwald and van Ommen (2017). Our aim is to assess this claim within our testing framework. Results of performing a ks test of the strong calibration null hypothesis, using the identity test function f(θ) = θ, are displayed for a variety of values of t and ϵ in Figure 8. Clearly the only circumstance in which any of the learning procedures is strongly calibrated is when ϵ = 0 and the Bayesian procedure is used. Otherwise, according to the test statistic in the left panel, fractional posteriors are marginally better calibrated when ϵ > 0 than the Bayesian procedure, though regarding the p-values in the right panel one sees that the values of the statistic in these cases are still suﬃciently suﬃciently large to emphatically reject the strong calibration null hypothesis. Finally, we note that other senses of robust calibration could be considered, analogous to the various notions of robust estimation that have been studied (Berger, 1994; Huber and Ronchetti, 2009). For example, one could consider a setting where true parameters θ are drawn from a distribution other than µ0 and assess the consequences, in terms of calibration, for a learning procedure that uses µ0 as the initial belief distribution.

Cockayne, Graham, Oates, Sullivan, Teymur

4. Discussion

The desire that a parameter used to generate a dataset should appear plausible as a sample from the distributional output of a learning procedure, such as a Bayesian posterior, is foundational and, at least in an informal sense, widely understood and accepted. Despite this, a precise and widely applicable notion of what it means for a learning procedure to be calibrated appears not to have been put forward. Our aim in this paper was to propose such a deﬁnition, together with a framework for testing whether a learning procedure is calibrated. In particular, we proposed a property called strong calibration (Deﬁnition 8), which provides an explicit sense in which output from the learning procedure can be considered to be meaningful. A strictly weaker property, called weak calibration, was also proposed (Deﬁnition 13), which has the advantage of being more straightforward to test. Several vignettes were provided to illustrate the generality and usefulness of the framework. Our hope, in writing this manuscript, is to stimulate further critical discussion around calibration as a desideratum for a learning procedure, and to bring together some of the disparate strands of literature where related concepts and domain-speciﬁc deﬁnitions have been developed (cf. Section 2.3.2).

4.1 Further Work

A particularly promising avenue for further research would be to develop measures of miscalibration using the ideas proposed in this paper. Generally speaking, when using approximate methods such as Laplace approximation (cf. Section 3.1) or abc (cf. Section 3.2), or generalised Bayesian methods (cf. Section 3.5), a user has purposefully departed from the Bayesian framework due to challenges such as its lack of computational tractability or the possibility that the model is misspeciﬁed. In such settings a measure of miscalibration is likely to be of more use than a test for calibration, since exact calibration cannot be expected to hold. A measure of miscalibration might allow a user to select the most calibrated method from among multiple alternatives, or perhaps even incorporate calibration into a variational objective in a variational Bayesian framework (Knoblauch et al., 2021). In the context of Deﬁnition 8, such a measure could be constructed by selecting some test function f FΘ and computing a statistical divergence between Ff #µ(µ0,y)(f (θ)) and U(0, 1). The former quantity is unlikely to be available in closed-form but could be estimated using Monte Carlo techniques. Immediate challenges with this would concern selection of a suitable divergence and a suitable f . For the latter, one could perhaps instead consider selecting a subset Ftest Fθ over which a supremum can be taken tractably. However, we leave this task for future work.

Acknowledgements

JC was supported by Wave 1 of the UKRI Strategic Priorities Fund under the EPSRC Grant EP/T001569/1, particularly the Digital Twins for Complex Engineering Systems theme within that grant, and the Alan Turing Institute, UK. MMG and CJO were supported by the Lloyd s Register Foundation programme on data-centric engineering at the Alan Turing Institute, UK. TJS has been supported in part by the German Research Foundation (Deutsche Forschungsgemeinschaft) through project 415980428 and the Excellence Cluster

Testing Whether a Learning Procedure is Calibrated

MATH+ The Berlin Mathematics Research Centre (EXC-2046/1, project 390685689). The authors thank Dennis Prangle for feedback on an earlier version of the manuscript.

Appendix A. Proof of Theoretical Results

This appendix contains proofs for all novel results in the main text. For x Rd, we let ( , x] := ( , x1] ( , xd] and we write y x whenever y ( , x], i.e. when yi xi for i = 1, . . . , d.

A.1 Proof of Lemma 5

Our proof of Lemma 5 makes use of the Kolmogorov distance

d K(µ, ν) := sup x Rd d K(x; µ, ν), (7)

d K(x; µ, ν) :=

Z 1( ,x] dµ Z 1( ,x] dν , (8)

which is a metric on P(Rd) (Shorack, 2000, Theorem 2.4).

Proof [Proof of Lemma 5] Suppose that µ = ν, so that it suﬃces to exhibit an element f FΘ for which R f dµ = R f dν. From the metric property of d K, there must exist x Rd such that ε := d K(x ; µ, ν) > 0. Now, for c > 0, consider the function

f(c) x : Rd (0, 1), f(c) x (x) :=

1 1 + e2c(xi x i ) , (9)

which satisﬁes f(c) x FΘ. Since f(c) x converges pointwise to f( ) x := 1( ,x ] outside of a

null set and |f(c) x | 1, the dominated convergence theorem implies that f(c) x is a consistent approximation of f( ) x in the c limit in both L1(µ) and L1(ν). Therefore, there exists c > 0 such that f(c ) x f( ) x L1(µ) < ε/2 and f(c ) x f( ) x L1(ν) < ε/2. For this f(c ) x FΘ we have from the reverse triangle inequality that

Z f(c ) x dµ Z f(c ) x dν =

Z f(c ) x dµ Z f( ) x dµ + Z f( ) x dµ Z f( ) x dν

+ Z f( ) x dν Z f(c ) x dν

Z f(c ) x dµ Z f( ) x dµ + Z f( ) x dν Z f(c ) x dν | {z } ( )

Z f( ) x dµ Z f( ) x dν | {z } =ε

The triangle inequality implies that

|( )| f(c ) x f( ) x L1(µ) + f( ) x f(c ) x L1(ν) < ε/2 + ε/2 = ε,

Cockayne, Graham, Oates, Sullivan, Teymur

and so it follows that R f(c ) x dµ R f(c ) x dν = 0. Thus we have exhibited an element

f(c ) x FΘ for which R f(c ) x dµ = R f(c ) x dν. This completes the proof.

A.2 Proof of Lemma 15

First we derive a corollary of Lemma 5 that will be used to prove Lemma 15:

Corollary 16. Let Θ = Rd for some d N. Suppose that µ, ν Pr(Θ) and that the independent random variables θ µ, ϑ ν satisfy P(f(θ) f(ϑ)) = 1/2 for all f FΘ. Then µ = ν.

Proof If µ = ν then, as in the proof of Lemma 5, we can identify x Rd such that d K(x ; µ, ν) > 0. Since µ and ν are regular, the function x 7 d K(x; µ, ν) is continuous on Rd and there exists an open neighbourhood N(x ) of x such that d K(x; µ, ν) > 0 for all x N(x ). Suppose, to arrive at a contradiction, that P(f(θ) f(ϑ)) = 1/2 for all f FΘ. Then, for all x N(x ), we can construct functions f(c) x FΘ as per (9), for which it holds that

P(ϑ x) = P(f( ) x (θ) f( ) x (ϑ)) = lim c P(f(c) x (θ) f(c) x (ϑ)) = 1

But ν was assumed to be regular, meaning that ν has a positive Lebesgue pdf, so that P(ϑ x) = 1/2 cannot simultaneously hold for all x N(x ). Indeed, since N(x ) is open, there exists x N(x ) such that x i < xi for all i = 1, . . . , d. Then P(ϑ x) = P(ϑ x ) + ν(S), where S := ( , x] \ ( , x ] is a measurable set with ν(S) > 0. This contradiction completes the proof.

Proof [Proof of Lemma 15] Fix µ0 B. Let θ µ0, y|θ Pθ and ϑ|θ, y µ(µ0, y). First we argue that the distribution ν := RR µ(µ0, y) d Pθ(y) dµ0(θ) of the random variable ϑ is regular. Since µ is a regular learning procedure, µ(µ0, y) admits a pdf pµ(µ0,y) for each y Rd. Thus, ν admits the pdf

pν(x) := Z pµ(µ0,y)(x) d Q(y), Q := Z Pθ dµ(θ)

and our task is to establish that this pdf is positive on Rd. Fix x Rd. Now, since pµ(µ0,y)(x) > 0 for all y Rd, we have

n N Sn, Sn := y Rd pµ(µ0,y)(x) > 1

Since Q is a probability distribution on Rd, it follows that for some n N, Q(Sn) > 0. Therefore pν(x) = Z pµ(µ0,y)(x) d Q(y) > 1

n Q(Sn) > 0

and, since this argument holds for all x Rd, pν is a positive pdf on Rd and ν is regular.

Testing Whether a Learning Procedure is Calibrated

Next, since µ0 and the learning procedure µ are regular, and µ is strongly calibrated, for each f FΘ,

Ff#µ(µ0,y)(f(θ)) = P(f(ϑ) f(θ)|θ, y) U(0, 1) (10)

and taking expectations of both sides yields

P(f(ϑ) f(θ)) = 1

Since both µ0 and ν are regular, it follows from Corollary 16 and (11) that µ0 = ν, and so ϑ has the marginal distribution µ0. Thus we have shown that the learning procedure µ is weakly calibrated to the belief distribution µ0 and the data-generating model P.

Appendix B. Probabilistic Numerical Methods for ODEs

This appendix contains full details of how the pnms in Section 3.3 were implemented:

The code for Chkrebtii et al. (2016) was taken from git.io/J33l L and the step-size was set at h = 0.1. The following settings were used: nsolves = 100, N = 100, nevalpoints = 500, lambda = 0.08 and alpha = 1. These values were manually selected, over the default values recommended in the code, since they led to improved calibration of the output. Rigorous optimisation of these settings was not attempted.

The code for Teymur et al. (2018) was provided to us by the authors and is not yet publicly released. The method used is the 2-step (i.e. order 3) probabilistic Adams Moulton method with step-size h = 0.5 and overall scaling parameter α = 0.3. These values were manually selected with the intention of improving calibration of the output, but rigorous optimisation of these settings was not attempted. The stepwise perturbations are scaled using the global calibration procedure described in Conrad et al. (2017).

The code for both Schober et al. (2019) and Tronarp et al. (2019) derives from the comprehensive open-source Python package probnum. On the advice of the authors of this package we implemented the adaptive routine probnum/diffeq.probsolve ivp. In this case the default values of tolerances were used. The only hyperparameter it is required to set is algo order, which we set to 3. The setting method = EK0 corresponds to Schober et al. (2019), and method = EK1 corresponds to Tronarp et al. (2019).

The code for Teymur et al. (2021) was provided to us by the authors and expected to be made public on full publication of that paper. This method is based on multiﬁdelity simulation, so we take h {0.1, 0.2, 0.4} and solve the ode using a 2-step (i.e. order 2) Adams Bashforth method. All other hyperparameters are optimised automatically as part of the routine.

Cockayne, Graham, Oates, Sullivan, Teymur

Jeﬀrey L Anderson. A method for producing and evaluating probabilistic forecasts from ensemble model integrations. J. Climate, 9(7):1518 1530, 1996. doi: 10.1175/ 1520-0442(1996)009 1518:AMFPAE 2.0.CO;2.

Yaakov Bar-Shalom and K. Birmiwal. Consistency and robustness of PDAF for target tracking in cluttered environments. Automatica, 19(4):431 437, 1983. doi: 10.1016/ 0005-1098(83)90059-6.

Mark A. Beaumont, Wenyang Zhang, and David J. Balding. Approximate Bayesian computation in population genetics. Genetics, 162(4):2025 2035, 2002.

James O. Berger. An overview of robust Bayesian analysis: With comments and a rejoinder by the author. Test, 3(1):5 124, 1994. doi: 10.1007/BF02562676.

Jeremy Berkowitz. Testing density forecasts, with applications to risk management. J. Bus. Econom. Statist., 19(4):465 474, 2001. doi: 10.1198/07350010152596718.

Jose-M. Bernardo and Adrian F. M. Smith. Bayesian Theory. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, Ltd., Chichester, 1994. doi: 10.1002/ 9780470316870.

Anirban Bhattacharya, Debdeep Pati, and Yun Yang. Bayesian fractional posteriors. Ann. Statist., 47(1):39 66, 2019. doi: 10.1214/18-AOS1712.

Joris Bierkens. Non-reversible Metropolis Hastings. Stat. Comput., 26(6):1213 1228, 2016. doi: 10.1007/s11222-015-9598-x.

Pier Giovanni Bissiri, Chris C Holmes, and Stephen G Walker. A general framework for updating belief distributions. J. R. Stat. Soc. Ser. B. Stat. Methodol., 78(5):1103 1130, 2016. doi: 10.1111/rssb.12158.

David M. Blei, Alp Kucukelbir, and Jon D. Mc Auliﬀe. Variational inference: A review for statisticians. J. Amer. Statist. Assoc., 112(518):859 877, 2017. doi: 10.1080/01621459. 2017.1285773.

George Casella. An introduction to empirical Bayes data analysis. Amer. Statist., 39(2): 83 87, 1985. doi: 10.2307/2682801.

Oksana A. Chkrebtii, David A. Campbell, Ben Calderhead, and Mark A. Girolami. Bayesian solution uncertainty quantiﬁcation for diﬀerential equations. Bayesian Anal., 11(4):1239 1267, 2016. doi: 10.1214/16-BA1017.

Peter F. Christoﬀersen. Evaluating interval forecasts. Int. Econ. Rev., 39(4):841 862, 1998. doi: 10.2307/2527341.

Jon Cockayne, Chris J. Oates, Ilse C. F. Ipsen, and Tim Reid. Probabilistic iterative methods for linear systems. Journal of Machine Learning Research, 2021. To appear. ar Xiv:2012.12615.

Testing Whether a Learning Procedure is Calibrated

Patrick R. Conrad, Mark Girolami, Simo S arkk a, Andrew Stuart, and Konstantinos Zygalakis. Statistical analysis of diﬀerential equations: Introducing probability measures on numerical solutions. Stat. Comput., 27(4):1065 1082, 2017. doi: 10.1007/ s11222-016-9671-0.

Samantha R. Cook, Andrew Gelman, and Donald B. Rubin. Validation of software for Bayesian models using posterior quantiles. J. Comput. Graph. Statist., 15(3):675 692, 2006. doi: 10.1198/106186006X136976.

Dennis D. Cox. An analysis of Bayesian inference for nonparametric regression. Ann. Statist., 21(2):903 923, 1993. doi: 10.1214/aos/1176349157.

A. P. Dawid. The well-calibrated Bayesian. J. Amer. Statist. Assoc., 77(379):605 603, 1982. doi: 10.1080/01621459.1982.10477856.

A. P. Dawid. Statistical theory. The prequential approach. J. Roy. Statist. Soc. Ser. A, 147 (2):278 292, 1984. doi: 10.2307/2981683.

Persi Diaconis, Susan Holmes, and Mehrdad Shahshahani. Sampling from a manifold. In Advances in modern statistical theory and applications: a Festschrift in honor of Morris L. Eaton, volume 10 of Inst. Math. Stat. (IMS) Collect., pages 102 125. Inst. Math. Statist., Beachwood, OH, 2013.

Francis X. Diebold, Todd A. Gunther, and Anthony S. Tay. Evaluating density forecasts with applications to ﬁnancial risk management. Int. Econ. Rev., 39(4):863 883, 1997. doi: 10.2307/2527342.

Oliver E. Drummond, X. Rong Li, and Chen He. Comparison of various static multiplemodel estimation algorithms. In Signal and Data Processing of Small Targets 1998, volume 3373, pages 510 527. International Society for Optics and Photonics, 1998.

Paul Fearnhead and Dennis Prangle. Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation. J. R. Stat. Soc. Ser. B. Stat. Methodol., 74(3):419 474, 2012. doi: 10.1111/j.1467-9868.2011.01010.x.

David Freedman. On the Bernstein von Mises theorem with inﬁnite-dimensional parameters. Ann. Statist., 27(4):1119 1140, 1999. doi: 10.1214/aos/1017938917.

Andrew Gelman, Aki Vehtari, Daniel Simpson, Charles C. Margossian, Bob Carpenter, Yuling Yao, Lauren Kennedy, Jonah Gabry, Paul-Christian B urkner, and Martin Modr ak. Bayesian workﬂow, 2020. ar Xiv:2011.01808.

John Geweke. Getting it right: Joint distribution tests of posterior simulators. J. Amer. Statist. Assoc., 99(467):799 804, 2004. doi: 10.1198/016214504000001132.

Tilmann Gneiting and Matthias Katzfuss. Probabilistic forecasting. Annu. Rev. Stat. Appl., 1(1):125 151, 2014. doi: 10.1146/annurev-statistics-062713-085831.

Tilmann Gneiting and Roopesh Ranjan. Combining predictive distributions. Electron. J. Stat., 7:1747 1782, 2013. doi: 10.1214/13-EJS823.

Cockayne, Graham, Oates, Sullivan, Teymur

Tilmann Gneiting, Fadoua Balabdaoui, and Adrian E. Raftery. Probabilistic forecasts, calibration and sharpness. J. R. Stat. Soc. Ser. B Stat. Methodol., 69(2):243 268, 2007. doi: 10.1111/j.1467-9868.2007.00587.x.

Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Sch olkopf, and Alexander Smola. A kernel two-sample test. J. Mach. Learn. Res., 13(25):723 773, 2012. URL https://www.jmlr.org/papers/volume13/gretton12a/gretton12a.pdf.

Peter Gr unwald and Thijs van Ommen. Inconsistency of bayesian inference for misspeciﬁed linear models, and a proposal for repairing it. Bayesian Analysis, 12(4), December 2017. doi: 10.1214/17-ba1085. URL https://doi.org/10.1214/17-ba1085.

Thomas M. Hamill. Interpretation of rank histograms for verifying ensemble forecasts. Mon. Weather Rev., 129(3):550 560, 2001. doi: 10.1175/1520-0493(2001)129 0550:IORHFV 2. 0.CO;2.

Thomas M. Hamill and Stephen J. Colucci. Veriﬁcation of Eta-RSM short-range ensemble forecasts. Mon. Weather Rev., 125(6):1312 1327, 1997. doi: 10.1175/1520-0493(1997) 125 1312:VOERSR 2.0.CO;2.

Diana Harrison, David Sutton, Pedro Carvalho, and Michael Hobson. Validation of Bayesian posterior distributions using a multidimensional Kolmogorov Smirnov test. Mon. Not. R. Astron. Soc., 451(3):2610 2624, 2015. doi: 10.1093/mnras/stv1110.

Peter J. Huber and Elvezio M. Ronchetti. Robust Statistics. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., Hoboken, NJ, second edition, 2009. doi: 10. 1002/9780470434697.

Edwin T. Jaynes. On the rationale of maximum-entropy methods. Proc. IEEE, 70(9): 939 952, 1982. doi: 10.1109/PROC.1982.12425.

Toni Karvonen, George Wynne, Filip Tronarp, Chris Oates, and Simo S arkk a. Maximum likelihood estimation and uncertainty quantiﬁcation for Gaussian process approximation of deterministic functions. SIAM/ASA J. Uncertain. Quantif., 8(3):926 958, 2020. doi: 10.1137/20M1315968.

Marc C. Kennedy and Anthony O Hagan. Bayesian calibration of computer models. J. R. Stat. Soc. Ser. B Stat. Methodol., 63(3):425 464, 2001. doi: 10.1111/1467-9868.00294.

Jeremias Knoblauch, Jack Jewson, and Theodoros Damoulas. Generalized variational inference: Three arguments for deriving new posteriors. Journal of Machine Learning Research, 2021. To appear.

Kody Law, Andrew Stuart, and Konstantinos Zygalakis. Data Assimilation: A Mathematical Introduction, volume 62 of Texts in Applied Mathematics. Springer, Cham, 2015. doi: 10.1007/978-3-319-20325-6.

Jeong Eun Lee, GeoﬀK. Nicholls, and Robin J. Ryder. Calibration procedures for approximate Bayesian credible sets. Bayesian Anal., 14(4):1245 1269, 2019. doi: 10.1214/19-BA1175.

Testing Whether a Learning Procedure is Calibrated

John R. Lewis, Steven N. Mac Eachern, and Yoonkyung Lee. Bayesian restricted likelihood methods: Conditioning on insuﬃcient statistics in Bayesian regression (with discussion). Bayesian Analysis, 16(4), December 2021. doi: 10.1214/21-ba1257. URL https://doi. org/10.1214/21-ba1257.

X. Rong Li and Zhanlue Zhao. Measuring estimator s credibility: Noncredibility index. In Proceedings of the 9th International Conference on Information Fusion, pages 1 8. IEEE, 2006. doi: 10.1109/ICIF.2006.301770.

X. Rong Li, Zhanlue Zhao, and Vesselin P. Jilkov. Estimator s credibility and its measures. In Proceedings of the IFAC 15th World Congress, 2002.

X. Rong Li, Zhanlue Zhao, and Xiao-Bai Li. Evaluation of estimation algorithms: Credibility tests. IEEE T. Syst. Man Cy. A, 42(1):147 163, 2011. doi: 10.1109/TSMCA.2011. 2158095.

Sean Meyn and Richard L. Tweedie. Markov Chains and Stochastic Stability. Cambridge University Press, Cambridge, second edition, 2009. doi: 10.1017/CBO9780511626630.

John F. Monahan and Dennis D. Boos. Proper likelihoods for Bayesian analysis. Biometrika, 79(2):271 278, 1992. doi: 10.1093/biomet/79.2.271.

D. Prangle, M. G. B. Blum, G. Popovic, and S. A. Sisson. Diagnostic tools for approximate Bayesian computation using the coverage property. Aust. N. Z. J. Stat., 56(4):309 329, 2014. doi: 10.1111/anzs.12087.

Leah F. Price, Christopher C. Drovandi, Anthony Lee, and David J. Nott. Bayesian synthetic likelihood. J. Comput. Graph. Statist., 27(1):1 11, 2018. doi: 10.1080/10618600. 2017.1302882.

Jakub Pr uher, Toni Karvonen, Christopher James Oates, Ondˇrej Straka, and Simo S arkk a. Improved calibration of numerical integration error in sigma-point ﬁlters. IEEE Trans. Automat. Contr., 66(3):1286 1292, 2020. doi: 10.1109/TAC.2020.2991698.

G. D. Rayner and H. L. Mac Gillivray. Numerical maximum likelihood estimation for the g-and-k and generalized g-and-h distributions. Stat. Comput., 12(1):57 75, 2002. doi: 10.1023/A:1013120305780.

Joseph P. Romano and Cyrus Di Ciccio. Multiple data splitting for testing. Technical report, Department of Statistics, Stanford University, 2019. URL https://statistics. stanford.edu/sites/g/files/sbiybj6031/f/2019-03.pdf. Technical Report No. 2019-03.

Murray Rosenblatt. Remarks on a multivariate transformation. Ann. Math. Statistics, 23: 470 472, 1952. doi: 10.1214/aoms/1177729394.

Yousef Saad. Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, Philadelphia, PA, second edition, 2003. doi: 10.1137/1. 9780898718003.

Cockayne, Graham, Oates, Sullivan, Teymur

Michael Schober, Simo S arkk a, and Philipp Hennig. A probabilistic model for the numerical solution of initial value problems. Stat. Comput., 29(1):99 122, 2019. doi: 10.1007/ s11222-017-9798-7.

Galen R. Shorack. Probability for Statisticians. Springer Texts in Statistics. Springer-Verlag, New York, 2000.

Botond Szab o, A. W. van der Vaart, and J. H. van Zanten. Frequentist coverage of adaptive nonparametric Bayesian credible sets. Ann. Statist., 43(4):1391 1428, 2015. doi: 10.1214/ 14-AOS1270.

O. Talagrand, R. Vautard, and B. Strauss. Evaluation of probabilistic prediction systems. In Proceedings of the ECMWF Workshop on Predictability, pages 1 25. ECMWF, 1997. URL https://www.ecmwf.int/node/12555.

Sean Talts, Michael Betancourt, Daniel Simpson, Aki Vehtari, and Andrew Gelman. Validating Bayesian inference algorithms with simulation-based calibration, 2018. ar Xiv:1804.06788.

Onur Teymur, Han Cheng Lie, T. J. Sullivan, and Ben Calderhead. Implicit probabilistic integrators for ODEs. Neur IPS 31, pages 7255 7264, 2018.

Onur Teymur, Chris N. Foley, Philip G. Breen, Toni Karvonen, and Chris J. Oates. Blackbox probabilistic numerics, 2021. ar Xiv:2106.13718.

Filip Tronarp, Hans Kersting, Simo S arkk a, and Philipp Hennig. Probabilistic solutions to ordinary diﬀerential equations as nonlinear Bayesian ﬁltering: A new perspective. Stat. Comput., 29(6):1297 1315, 2019. doi: 10.1007/s11222-019-09900-1.

Yixin Wang and David M. Blei. Frequentist consistency of variational Bayes. J. Amer. Statist. Assoc., 114(527):1147 1161, 2019. doi: 10.1080/01621459.2018.1473776.

Daniel Wegmann, Christoph Leuenberger, and Laurent Excoﬃer. Eﬃcient approximate Bayesian computation coupled with Markov chain Monte Carlo without likelihood. Genetics, 182(4):1207 1218, 2009. doi: 10.1534/genetics.109.102509.

Hanwen Xing, GeoﬀNicholls, and Jeong Lee. Calibrated approximate Bayesian inference. In International Conference on Machine Learning, pages 6912 6920, 2019.