# calibrated_approximate_bayesian_inference__c02611aa.pdf

Calibrated Approximate Bayesian Inference

Hanwen Xing 1 Geoff K. Nicholls 1 Jeong Eun Lee 2

We give a computational framework for estimating the bias in coverage resulting from making approximations in Bayesian inference. Coverage is the probability credible sets cover prior parameter values. We show how to estimate the coverage an approximation scheme achieves when the ideal but intractable observation model and the prior can be simulated, but have been replaced, in the Monte Carlo, with approximations. Coverage estimation procedures given in Lee et al. (2018) work on simple problems, but do not scale well, as those authors note. For example, Lee et al. (2018) calibrate a completely collapsed MCMC algorithm for partition structure in a Dirichlet process model for random effects in a hierarchical model and a small data set, but they note it fails when the model is applied to clustering on a larger dataset. By exploiting the symmetry of the coverage error under permutation of low level group labels and smoothing with Bayesian Additive Regression Trees, we show that the original approximate inference had poor coverage for these data and should not be trusted.

1. Introduction

Bayesian credible sets with stated nominal coverage are a fundamental way to communicate statistical uncertainty. However, we usually report approximate credible sets with uncalibrated coverage as some approximation is inevitable for large data sets and complex models. Approximation comes in many forms. In MCMC samples are only asymptotically distributed according to the posterior. The precision parameter is the run length. Approximate Bayesian Computation (ABC, Pritchard et al. (1999)) typically has two kinds of precision parameters, a distance threshold and a

1Department of Statistics, University of Oxford, UK 2Department of Statistics, University of Auckland, New Zealand. Correspondence to: Hanwen Xing <hanwen.xing@stx.ox.ac.uk>.

Proceedings of the 36 th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Monte Carlo sample. There are also ﬁxed approximations with no precision parameters, in which a likelihood evaluation is replaced by an approximation which cannot be improved by varying a control parameter. Pseudo-likelihood (Besag, 1975) and variational inference (Jordan et al., 1999; Hoffman et al., 2013) often lead to a ﬁxed approximation.

A number of methods have been developed to check the approximation is acceptable. Recent new generic diagnostic tools given in Talts et al. (2018) and Yao et al. (2018) are related to earlier work in Prangle et al. (2014) and exploit an idea, developed in Geweke (2004); Cook et al. (2006) as a MCMC convergence diagnostic, and going back to Monahan & Boos (1992). In early related work, Menendez et al. (2014) gives procedures for correcting credible sets under conditions stronger than those required here and Rodrigues et al. (2018) recalibrates ABC samples.

We consider an approximate Bayesian credible set with given nominal level α. What coverage does the credible set actually achieve? Wherever approximate Bayesian inference reports a credible set, an associated coverage measure should be given. We do not build a new credible set with improved coverage, although this is easy in our AIS method, because we would then have to estimate the coverage of that corrected credible set.

Let π(φ) be the prior for φ Ω, let p(y|φ) be the observation model (the likelihood) for data y Y and let π(φ|y) π(φ)p(y|φ) be the posterior for φ given data y. Let π(θ) and p(y|θ) be the approximate prior and likelihood for parameter θ Ωwith approximate posterior π(θ|y) π(θ) p(y|θ). This paper is motivated by problems where we cannot in practice sample π(φ|y) using any known Monte Carlo method. We assume a tractable approximation π(θ|y) is available, and we assume it is possible to sample φ π( ) and y p( |φ) (just as in ABC).

The estimated credible set is computed for a posterior distribution π(θ|y) which approximates the exact posterior π(φ|y). The exact level α credible set Cy for the exact posterior π(φ|y) satisﬁes

Ω 1φ Cyπ(φ|y)dφ.

This set Cy has perfect Bayes coverage in the sense that, if φ π( ) is a draw from the prior, and y p( |φ) is a draw

Calibrated Approximate Bayesian Inference

from the observation model, then Pr(φ CY |Y = y) = α. The credible set covers the true parameter φ with probability α if nature drew φ from the prior, and the data y really was generated using the observation model we are using. This is the deﬁnition of Bayesian coverage, not an assumption.

In practice we compute a credible set Cy using the approximate posterior π(θ|y). This is a set Cy satisfying

Ω 1θ Cy π(θ|y)dθ.

This will not in general have the right coverage for the exact posterior. If φ π( ), and y p( |φ), and we let b(y) = Pr(φ CY |Y = y), then

Ω 1φ Cyπ(φ|y)dφ (1)

is the operational coverage CY achieves for φ a draw from the exact posterior. This is not equal α in general. The coverage bias b(y) α can vary markedly over data space. Cook et al. (2006) observe that coverage may be estimated, but the quantity they mention is equivalent to R

Y p(y)b(y)dy, an average over data space which may differ a great deal from b(y), as we see in the example in Section 4.

In practice we may not be able to compute an exact credible set, even after making the approximation leading to π(θ|y). In this case we would typically simulate θj π( |y) for j = 1, . . . , J, set θ = (θ1, . . . , θJ) and compute an estimate, ˆCy(θ), for Cy based on J samples. There is an additional Monte Carlo error in the coverage and so, with φ, y and θ distributed as prior, observation model and approximate posterior, we let

c(y) = Pr(φ ˆCY (θ)|Y = y),

denote the realised coverage allowing for Monte Carlo error. We give algorithms for estimation of b(yobs) and c(yobs). We discuss the function c(y) by default, as estimation of b(y) is a simpler special case.

The joint distribution of φ, y and θ in the generative model is given by

m(φ, y, θ) = π(φ)p(y|φ) π(θ|y). (2)

The conditional distribution of φ, θ given y is

m(φ, θ|y) = π(φ|y) π(θ|y),

writing π(θ|y) for the joint distribution of θ ΩJ (an abuse of notation). Now c(y) is an expectation over m,

Ω 1φ ˆ CY (θ)m(φ, θ|y)dφdθ.

The coverage is a posterior expectation in the exact distribution. Coverage estimation resembles the original problem

of estimating credible sets from the exact posterior, which we have said we cannot do! However we can sample π(φ) and the observation model p(y|φ), and it proves easier to estimate c(y) than some general expectation in the posterior.

If it were possible to simulate φ π( |y) then estimation of b(y) and c(y) would be straightforward. For i = 1, . . . , M, we simulate φ(i) π( |y) and θi,j π( |y) for j = 1, . . . , J. We then construct an α level approximate credible set ˆC(i) based on θi,1, . . . , θi,J, and deﬁne

ci = 1(φi ˆCi).

Now ˆc(y) = 1 M PM i=1 ci is an unbiased and consistent estimator for c(y). See Algorithm 1. If we replace ˆCi by Ci then the procedure estimates b(y). We assume that Algorithm 1 cannot be implemented. In the examples we give in Section 4 we implement Algorithm 1 to assist understanding. This will not in general be possible.

Algorithm 1 Estimation of operational coverage c(y)

Input: Observed data y; Exact posterior distribution π(φ|y); Approximate posterior distribution π(θ|y); Number of samples J from the approximate posterior; Number of samples M from the generative model. for i in 1, . . . , M do

Simulate φi π(φ|y) and θ(i) = {θi,1, . . . , θi,J} where θi,j π(θ|y) for j = 1, . . . , J

Compute the approximate credible set ˆCi based on θ(i). Set ci = 1(φi ˆCi) end for Return: Estimated coverage ˆc(y) = 1 M PM i=1 ci

The paper is structured as follows. In Section 2 we discuss previous work. We then outline two algorithms for the estimation problem in Section 3. In Section 4 we apply the algorithms to calibrate a pseudo-likelihood approximation. In Section 5, we calibrate the approximate posterior of a random partition in a hierarchical model with a Dirichlet Process prior on the distribution of random effects. We ﬁnish with a brief discussion.

2. Relation to Previous Work

Lee et al. (2018) introduce the idea of estimating coverage probabilities as a validation procedure, and give two proofof-concept estimators for c(y) when simulation from the exact posterior π(θ|y) is not possible. Our own algorithms are a qualitative improvement, as those earlier methods completely fail on the examples we give in this paper.

Lee et al. (2018) give an importance sampling procedure which targets an approximation to c(y). Let δ(x, y) be a generic distance function on the data space Y, ρ > 0 a

Calibrated Approximate Bayesian Inference

tolerance level and let (y) = {y : δ(y, y ) ρ} be a closed ball in Y centered at y. Let

d(y) = Pr(φ ˆCY (θ)|Y (y))

be an ABC-style approximation to c(y). In terms of the density m(φ, y , θ) in Equation 2,

Iφ ˆ C(θ) z(y, ρ) m(φ, y , θ)dφdθdy (3)

with z(y, ρ) = Pr(Y (y)) a normalising constant. Lee et al. (2018) estimate d(yobs) with ˆd = PM i=1 w(φi, yi, θi) using M samples (φi, yi, θi), i = 1, . . . , M from the importance sampling distribution

m(φ, y, θ) π(φ|yobs)p(y|φ) π(θ|y)1y (yobs),

and weights w(φ, y, θ) m(φ, y, θ)/ m(φ, y, θ). They use ˆd as an estimate of c(yobs). This approach has two drawbacks: ˆd d as M , however d = c in general so the method is asymptotically biased (in M) unless we additionally take ρ to zero. This would be impractical as we simulate no data y (yobs). Secondly, as the authors observe, this estimator can be unstable due to high weight variance. Our Annealed Importance Sampling (AIS) estimate is also asymptotically biased, but AIS is a much more powerful tool, and the bias can be made very small. Also AIS gives a much higher Effective Sample Size (ESS) at similar cost. We give examples in the Supplementary Material illustrating this point for a simple normal example. The issue is qualitative. When ESS estimates become small, they cannot be trusted (for example, a poorly estimated ESS may not increase when the sample size increases).

We also take advantage of a simple but important simpliﬁcation not exploited by Lee et al. (2018). We replace Iφ ˆ CY (θ) in Equation 3 with Iφ ˆ Cyobs(θ). As Y yobs, the distribu-

tion of φ changes in a complex way, but the limit of ˆCY (θ) as Y yobs is computed from the approximate posterior, so we simply substitute the limiting value. This avoids the need to simulate θ for each simulated y-value, a big timesaver in some settings. For comparison with the IS method in Lee et al. (2018), we take their Ising Model example and scale it up by a factor of 25. On this larger problem we ﬁnd AIS calibration far out-performs Importance Sampling, yielding an ESS some 10 times larger in a comparable run time. In the Supplementary Material we show the ESS for IS falls off to small values as the Ising image size increases.

Lee et al. (2018) suggest an alternative regression procedure for estimating c(y). If we simulate (φi, yi, θ(i)) m iid for i = 1, . . . , M then, at each y, we have (φi, θ(i)|yi) π(φi|yi) π(θ(i)|yi). It follows that if we compute ˆCi = ˆC(θ(i)) and a response ci = 1φi ˆ Ci then the pairs

ci, yi are measurements of a Bernoulli process in which ci Bernoulli(c(yi)), i = 1, . . . , M. Lee et al. (2018) suggest using a Generalised Additive Model (GAM) for logistic regression in order to estimate c(y). Those authors take as regression covariates the components of a p-dimensional summary statistic s(y) = (s1(y), . . . , sp(y)) with p dim(Y). This works when s(y) is sufﬁcient, as is the case in our Ising model example in Section 3. However the choice of summary statistics is not in general clear and may bias results. However this is the sort of problem, variable selection, in which random forest regression and Bayesian Additive Regression Trees are very effective. For comparison with previous work, we take an example where the methods of Lee et al. (2018) fail, and show that estimation via BART gives reproducible results.

We exploit a symmetry of the approximation that will hold more widely: the approximation error is invariant under permutation of labels of levels of categorical variables. Let y Rn be a data vector and let L = {ℓ1, . . . , ℓN} be the levels of a variable x Ln. Suppose our data are simply y, x. Let T(x) = {T1, . . . , TN} be a collection of subsets of {1, . . . , n} partitioning the indices of x into groups of observations in the same level, so that i Tj xi = ℓj for each i = 1, . . . , n and each j = 1, . . . , N. Let PR = {σ P : T(xσ) = T(x)} be the set of permutations corresponding to level-relabeling. If the approximation does not distinguish levels, we have c(y) = c(yσ) for each σ PR (permuting y s with x s ﬁxed gives the same data set). This can be extended to multiple categorical variables. In a complete balanced design every level of every covariate co-occurs with every level of every other covariate the same number of times. Levels are then exchangeable within each variable simultaneously. In this setting we can compute c(y) by computing it on one special quadrant Y0 of Y and then mapping identical copies of c(y) out over Y by permutation. We take Y0 = {y Y : y T1 y T2 . . . y TN } (ie ordered on the averages of y s associated with each level, with additional order constraints for each variable if there are multiple categorical variables). We simulate φ, y and θ, map y back into Y0 and regress over this smaller space where y-values are more dense and regression (using BART) is easier. We map the function ˆc(y) back to the quadrant containing yobs.

3. Estimating the Operational Coverage

3.1. A Weighted-Sample Estimate for Coverage

In this section we estimate c(yobs) using Annealed Importance Sampling (AIS) (Neal, 2001) to approximately sample the true posterior π(φ|yobs). This leverages our ability to draw samples from the approximate posterior π(φ|yobs) as a starting point for the AIS iteration.

Calibrated Approximate Bayesian Inference

Let {γj}NAIS j=1 and {βj}NAIS j=1 be increasing sequences with γ0 = 0, γN = 1 and β0 = 0. Deﬁne an initial distribution

p0(φ, y) = π(φ|yobs) p(y|φ),

and, for j = 1, . . . , NAIS, intermediate distributions

pj(φ, y) π(φ)γj π(φ|yobs)1 γjp(y|φ) exp ( βjδ(y, yobs))

π(φ) p(yobs|φ)1 γjp(y|φ) exp ( βjδ(y, yobs)).

If γNAIS = 1, then p NAIS(φ, y) converges to π(φ)p(yobs|φ) as βNAIS and so p NAIS(φ) π(φ|yobs), the true posterior. The approximate posterior π(φ|yobs) is a useful part of the initial distribution p0, as it supports φ values for which typical synthetic data y p(y|φ) are relatively close to the observed yobs from the start.

An update scheme for φ and y generates transitions from pj(φ, y) to pj+1(φ, y). For each j = 1, . . . , NAIS, let

Qj((φ, y), (φ , y )) = qj((φ, y) (φ , y ))αj((φ, y) (φ , y ))

be a transition kernel with proposal distribution qj({φj, yj} {φ , y }) = fj(φj, φ )p(y |φ ) based on a simple local proposal distribution fj(φ, φ ) for φ, and an acceptance probability

αj = 1 pj(φ , y )qj({φ , y } {φj, yj})

pj(φj, yj)qj({φj, yj} {φ , y })

= 1 π(φ ) p(yobs|φ )1 γjfj(φ , φj)

π(φj) p(yobs|φj)1 γjfj(φj, φ ) ( )

exp( βjδ(y , yobs))

exp( βjδ(yj, yobs))

which admits pj(φ, y) as an invariant distribution.

Let d NAIS = Ep NAIS (1(φ ˆCyobs)). Let {wk, φk}K k=1 be a set of weighted samples generated by the AIS algorithm described above. These are AIS-weighted samples from p NAIS(φ, y), so that

k=1 wi1(φk ˆCyobs)

is a consistent estimate for d NAIS and d NAIS c(yobs) as βNAIS .

Theorem 1 ˆc(yobs) is a consistent estimator of the true realized coverage c(yobs) as K and βNAIS .

Proof: Since ˆc(yobs) is a self-normalised importance sampling estimator for the quantity

Prp N (φ ˆCyobs) = Z 1(φ ˆCyobs)p N(φ)dφ,

where p N(φ) is the marginal distribution of p N(φ, y), we have that as K ,

ˆc(yobs) p Prp N (φ ˆCyobs). (4)

As βNAIS and γNAIS = 1 the target density p NAIS(φ) converges to the true posterior density π(φ|yobs). Hence by Scheff e s theorem, we must have

Prp NAIS (φ ˆCyobs) c(yobs), (5)

where c(y) = Pr(φ ˆCY (θ)|Y = yobs) is the true posterior probability. Combining (1) and (2), we conclude that ˆc(yobs) is a consistent estimator for c(yobs)

Algorithm 2 summarizes the procedure. In contrast to the importance sampling approach in Lee et al. (2018), we do not need to simulate θ(i) and compute the approximate

credible set for each synthetic data vector y(j) i in Algorithm 2. This may speed up computation a great deal.

Algorithm 2 AIS Estimation of operational coverage c(y)

Input: Observed data yobs; Summary statistics s : Y Rp; Number of samples J from the approximate posterior; Number of samples M from the generative model. Simulate θ = {θ1, . . . , θJ} where θ1, . . . , θJ π( |yobs); Compute the approximate credible set ˆCyobs(θ). for i in 1, . . . , M do

Sample (φ(1) i , y(1) i ) p0(φ, y).

Compute w(1) i p1(φ(1) i ,y(1) i )

p0(φ(1) i ,y(1) i ).

for j in 2, . . . , NAIS do

Sample φ i fj 1( |φ(j 1) i ), y i p(y|φ i). Set φ(j) i = φ i, y(j) i = y i with probability αj deﬁned in ( ) and set φ(j) i = φ(j 1) i , y(j) i = y(j 1) i otherwise.

Compute w(j) i pj(φ(j) i ,y(j) i )

pj 1(φ(j) i ,y(j) i ) end for Compute ci = 1(φ(NAIS) i ˆCyobs) Compute wi = QT t=1 w(t) i end for Compute Wi = wi/ PM i=1 wi Return: Estimated coverage ˆc(yobs) = PM i=1 Wici

3.2. A Regression Estimate for Coverage

In Lee et al. (2018), the authors also suggest estimating c(y) via regression. Let {φi, yi}M i=1 be samples from the generative model π(φ)p(y|φ), let ˆCyi be an approximate credible set for yi, and ci = 1φi ˆ Cyi. Conditional on yi,

ci Bernoulli(c(yi)), c(yi) = Pr(φi ˆCYi|Yi = yi).

Calibrated Approximate Bayesian Inference

Let s(yi) Rp be a vector of summary statistics of yi. Lee et al. (2018) use a regression model with response ci and covariates s(yi) to learn the map from y to c(y). This is logistic regression, though Lee et al. (2018) suggest a more ﬂexible GAM logistic regression. Raynal et al. (2018) and Marin et al. (2018) observe that, for ABC work, Random Forests allow us to handle a potentially large number of summary statistics (even if some or many of them are poorly informative) without preliminary selection. Inspired by their ideas, we applied a Probit Bayesian Additive Regression Tree (BART) model (Chipman et al., 2010) to estimate c(y) when low-dimensional sufﬁcient statistics are not available.

BART is a sum-of-trees model where each tree is regularized by a prior to be a weak learner. Let Y {0, 1} be a generic binary output and x Rp be a generic input. In the classiﬁcation setting we wish to infer an unknown function f such that Pr(Y = 1|x) = Φ(f(x)), with Φ( ) the standard normal CDF. The Probit BART model approximates the function f(x) with a sum of NT trees, that is

f(x) h(x) =

where each gi(x) is given by a separate regression tree. A regularizing prior is imposed on the trees to keep individual tree effects small and prevent overﬁtting. The posterior distribution over trees is sampled using a Bayesian backﬁtting procedure. Details can be found in Chipman et al. (2010) and Kapelner & Bleich (2016).

Let s(y) be a high dimensional vector of summary statistics for y. We ﬁt a probit BART model with pi = Φ(h(s(yi))) and ci Bernoulli( pi) using {ci, s(yi)}M i=1 as the training data. For sobs = s(yobs), we obtain πB(h(sobs)|{ci, s(yi)}M i=1), the posterior distribution of h(sobs) in the ﬁtted model, and estimate c(yobs) using

ˆc(yobs) = EπB(Φ(h(sobs)|c1, . . . , c N, s1, . . . , s N)),

the (sample) posterior mean of Φ(h(sobs)).

We chose BART for two reasons. First, the tree structure is capable of handling potentially high dimensional input s(y). This is crucial when low dimensional sufﬁcient statistics for y Y are unavailable. Also, since BART is Bayesian, we have a natural way to assess the uncertainty of our estimate. We ﬁt Probit BART models using the R package bart Machine (Kapelner & Bleich, 2016).

4. 2-D Ising Model

Figure 1 is a 200 by 200 binary image obtained by thresholding a grey-level image of ice ﬂoes from Banﬁeld & Raftery (1992). We illustrate our method on the problem of ﬁtting an Ising model with smoothing parameter φ and free boundary

Algorithm 3 Estimation of operational coverage c(y) via regression

Input: Observed data yobs; Summary statistics s : Y Rp; Number of samples J from the approximate posterior; Number of samples M from the generative model; A regression model M. for i in 1, . . . , M do

Simulate φi π(φ), yi p(y|φi) and θ(i) = {θi,1, . . . , θi,J} where θi,j π(θ|y) for j = 1, . . . , J

Compute the approximate credible set ˆCi based on θ(i). Set ci = 1(φi ˆCi) and compute the p-dimensional summary statistics s(yi). end for Fit the regression model M : c h(s(y)) to learn the relation between coverage c(i) and summary statistics s(yi) using {ci, s(yi)}M i=1 as training data. Return: ˆc(yobs), the ﬁtted value given s(yobs) based on the regression model M.

conditions to these data. The model with free boundary conditions has an intractable likelihood so we approximate it using a solveable model with toroidial boundary conditions. We then calibrate the approximate credible interval for φ. Lee et al. (2018) work on a 40 by 40 subset of the image. We apply the Importance Sampler of Lee et al. (2018) to the full problem in the supplement and ﬁnd it breaks down at around 100 by 100. However AIS works well as we show.

The Ising model is a Markov model on a binary lattice. Let G = (V, E) be a graph with edge set E and vertices V . For each v V , let yv {0, 1} be binary data at v and y = yvv V , y {0, 1}|V | the collection of all yv. Let < u, v > E be an edge in G between vertices u, v. Let

f(y, E) = X

<u,v> E 1(yu = yv)

be the number of pairs of vertices with disagreeing neighbours on G. In this example, G is a rectangular NI NI lattice with NI = 200 and free boundary conditions. We denote the graph by GF = (EF , V ). Interior vertices on G have degree 4, edge vertices have degree 3 and corner vertices have degree 2. We consider also the toroidal boundary condition. The lattice GT = (ET , V ) is wrapped onto a torus so all vertices in G have degree 4.

Let φ > 0 be a scalar parameter. The likelihood under free boundary conditions is

p F (y|θ) = ZF (θ) 1 exp( θf(y, EF ))

x {0,1}|V | exp( θf(x, EF ))

Calibrated Approximate Bayesian Inference

Figure 1. Ice ﬂoe image from Banﬁeld & Raftery (1992)

is a normalizing constant. Similarly, let ZP (θ) be the normalizing constant under toroidal boundary conditions. When NI is large, ZF (θ) is computationally intractable. However, ZP (θ) is given in closed form in Kaufman (1949).

Following Lee et al. (2018) we impose a uniform prior π(θ) 1(θ (0, 2)) on θ. The posterior is

π(θ|y) ZF (θ) 1 exp( θf(y, EF ))1(θ (0, 2)).

Although π(θ|y) is doubly intractable we can use an exchange algorithm (Murray et al., 2006) to get asymptotically exact inference. Alternatively we can approximate ZF (θ) by ZP (θ). Let

π(θ|y) ZP (θ) 1 exp( θf(y, EF ))1(θ (0, 2))

denote the approximate posterior. The univariate approximate posterior density π(θ|y) can be evaluated pointwise up to a normalising constant, and the corresponding CDF is readily evaluated. We compute the equal-tail, 95% approximate credible set Cy. We ﬁnd Cy = (0.899, 0.913). We used an exchange algorithm (Murray et al., 2006) and Algorithm 1 to estimate the coverage c(yobs) as a check. This Monte Carlo estimate, which we treat as the truth, is c(1)(yobs) = 0.518. Clearly we should be rejection this approximation scheme.

We run Algorithm 2 to estimate the operational coverage. We initialise the AIS sampler with M = 1000 samples {φi, yi}M i=1 with φi π(φ|yobs) and yi p F (yi|φi), the true likelihood. We set the number of AIS iterations NAIS = 60 with cooling schedule βj = 1.05j for j = 1, . . . , NAIS, γj = 0.02j for j = 1, . . . 50 and γj = 1 for j = 51, . . . , NAIS at the jth iteration. We use the K-S distance δ(yi, yobs) = |Gyi Gyobs| , where Gyi the CDF

Number of repitition

Number of iteration

Estimated coverage at observed data

0 10000 20000 30000 40000 Sufficient Statistic S(y)

Estimated coverage

Figure 2. Left: Algorithm 2; the estimated ˆc(yobs) based on the intermediate distribution pj at each iteration j of the AIS sampler. The grey ribbon represents the 2σ error bar for the true value. We see that ˆc(yobs) converges to c(1)(yobs) while the standard error of ˆc(yobs) increases due to decreasing effective sample size. Crosses correspond to ﬁnal results for 15 repeats of the algorithm (with arbitary x-values). Right: Algorithm 3; the estimated c(y) as a function of the natural sufﬁcient statistics s(y). Vertical dotted segment is the 2σ error bar of ˆc(yobs) based on Algorithm 2.

of the approximate posterior π(φ|yi), as the distance metric. Algorithm 2 gives ˆc(yobs) = 0.545 with standard error 0.037 and an effective sample size (ESS) equal 180. In Figure 2 we see the estimated operational coverage at iteration NAIS = 60 is close to the true value, indicating the effectiveness of AIS here. We repeated the whole experiment 15 times with excellent consistency within uncertainty. The importance sampler (Lee et al., 2018) gives a similar estimate ˆc IS(yobs) = 0.529 with the threshold ρ = 0.5 (varying ρ gave no improvement). However, the ESS is just 22 for a similar runtime. This ESS is likely to be an overestimate. See the Supplementary material for details.

In Algorithm 3, we ﬁrst simulate M = 1000 samples {φi, yi} from π(φ)p F (y|φ), the joint prior distribution, for i = 1, . . . , M. Simulation of synthetic data from p F (y|φ) is straightforward using MCMC. Figure 3 shows a BART estimate with covariate S(yi) = f(y, EF ) and response ci = 1φi ˆ Cyi. This gives ˆc(yobs) = 0.565 with 95% credible set (0.420, 0.702), in good agreement with the AIS estimate. This is an easy problem for the regression approaches as there is a scalar sufﬁcient statistic. It is harder for importance sampling, due to the sharply peaked target. In the supplement we show that BART reproduces Figure 2 (right) using the raw N 2 I = 40000 binary data.

5. Dirichlet Process Random Effect Model

Lee et al. (2018) show that their IS approach works for calibration of a completely collapsed MCMC algorithm for

Calibrated Approximate Bayesian Inference

partition structure in a Dirichlet process. However it is easy to ﬁnd problems on which the IS estimator performs poorly. In this section, we use Algorithm 3 to estimate the realised coverage c(yobs) for a dataset and approximation procedure on which the methods of Lee et al. (2018) fail (no sufﬁcient statistic, output ESS close to one, estimated ˆc(y) values close to zero or one). We show that credible sets based on the Laplace approximation are unreliable in this example.

Our dataset has the classical format of a complete design, with ﬁve categorical variables, including Treatment (N = 12 levels), and four block variables, B1 (with q = 3 levels) B2 (r = 2 levels), B3 (two levels) and B4 (seven levels) so that we have n = 1008 observations, y = (y1, . . . , yn). In our example we ﬁt a hierarchical model with known ﬁxed effects B1 B2. Let X be the n p design matrix for the ﬁxed effects (p = 6 here).

We take a Dirichlet process prior for the hierarchy of random effects in our hierarchical model. Our aim here is not to develop new models but to illustrate calibration. The model is similar in structure to the model Malsiner-Walli et al. (2018) ﬁt, differing mainly in the choice of partition model, a Chinese Restaurant Process (CRP) in our case. In related work Pauger et al. (2018) cluster variances in the marginal model. We suppose the scientist wants to cluster the treatments into groups with similar interaction effects. Each treatment has a vector of random effects, so the object here is to estimate a partition of the N = 12 random effect vectors for treatment. The output of the uncalibrated analysis is an approximate HPD credible set for the unknown partition of treatment effects. We calibrate a credible set here, not simply a credible interval, reﬂecting the ease of application of our methods in more general settings.

Let A = {1, . . . , N} give the distinct levels of Treatment and let A be the n 1 covariate vector with Ai A giving the level of Treatment in the i th observation. These are the levels we cluster. Let S = {S1, . . . , SK} be a partition of A and let S be a n 1 unobserved categorical covariate giving the grouping, so that Si = k means Ai Sk. The interaction between B1 and Treatment is a random effect so we have a vector of random effects ηA j Rq, ηA j N(0, ΣA) iid for j = 1, . . . , N for the different levels of A and another offset vector of random effects ηS k Rq, ηS k N(0, ΣS) i.i.d. for k = 1, . . . , K. Let Z be a n q matrix of indicators for the levels of B1. Denote by xi, zi the ith row of X and Z, let β = (β1, . . . , βp) be the vector of ﬁxed effects, and let ϵi N(0, σ2) i.i.d. for i = 1, . . . , n. The observation model is

yi = xiβ + ziηS Si + ziηA Ai + ϵi, i = 1, . . . , n

with likelihood p(y|ψ) for parameter ψ = (β, ηA, ηS, ΣA, ΣS, σ2), ψ ΩS. The parameter space ΩS of ψ depends on the partition S, as the dimension

of ηS is determined by S.

The partition S is an unknown parameter in the posterior with a Chinese Restaurant Process (CRP) prior

π(S) = αKΓ(α) QK k=1 Γ(|Sk|) Γ(α + n)

for S P, where α is a model parameter, |Sk| is the number of elements in the set Sk and S P is the space of partitions. We took α = 1. Our setup is equivalent to a Dirichlet process prior GS DP(α, H), ηS k GS with base distribution H(ηS k ) = N(ηS k ; 0, ΣS) for k = 1, . . . , K for the random effects ηS due to partition S. The joint prior π(ψ, S) is

π(ψ, S) = π(ηS|S, ΣS)π(β, ηA, ΣA, ΣS, σ2)π(S)

with π(ηS|S, ΣS) = QK k=1 N(ηS k ; 0, ΣS) and

π(β, ηA, ΣA, ΣS, σ2) = N(β; 0, σ2 βIp)

j=1 N(ηA j ; 0, ΣA)

IW(ΣA; νA, VA)IW(ΣS; νS, VS)IG(σ2; ασ, βσ).

Here σβ, νA, VA, νS, VS, ασ, βσ are prior hyperparameters. The joint posterior distribution is then

π(ψ, S|y) p(y|ψ)π(ψ, S). (6)

Estimation of S by sampling the joint posterior π(ψ, S|y) using MCMC is a variable dimension problem. It is convenient to work with the marginal posterior π(S|y) p(y|S)π(S), where

p(y|S) = Z p(y|ψ)π(ψ|S)dψ (7)

is the marginal likelihood. However, p(y|S) is computationally intractable. Suppose we approximate it with a Laplace approximation. How much harm does this do? Let b S be the Bayesian Information Criterion (BIC) of the model in Equation 7 if the partition is S so that

p BIC(y|S) = exp( b S/2)

approximates the marginal likelihood p(y|S). This corresponds to a choice of unit information priors on model parameters ψ ((Kass & Raftery, 1995; Raftery, 1999)) and can be seen as part of the approximation we are calibrating. Packages computing the BIC for complex models are available (we use the R-package lme4, see Bates et al. (2014)). The approximate posterior for S is

π(S|y) p BIC(y|S)π(S).

We sample π(S|y) and construct an approximate credible set for S (see Supplementary Material) using standard Metropolis-Hasting MCMC. Can we trust this credible set?

Calibrated Approximate Bayesian Inference

Table 1. Estimate of c(yobs) and the corresponding 95% credible interval based on two Probit BART models.

Model ˆc(yobs) 95% Credible Interval M1 0.262 (0.065,0.549) M2 0.308 (0.124,0.552) M21 0.285 (0.067,0.564) M22 0.322 (0.086,0.618) M23 0.347 (0.095,0.673) M24 0.270 (0.056,0.565)

We apply Algorithm 3 to the problem. For i = 1, . . . , M we sample partitions S(i) π(S), ψ(i) π( |S(i)), ψ ΩS and y(i) p( |ψ(i)), y(i) Rn. Low dimensional summary statistics are not available, so we try two sets of high dimensional summary statistics. Covariates B3 and B4 do not appear in the model, so e average data with Treatment, B1 and B2 ﬁxed. Denote by y(i) jkl the mean of observations {yi : Ai = j, B2i = k, B3i = l} and let

T(y(i)) = { y(i) jkl}, j = 1, . . . , N; k = 1, . . . , r; l = 1, . . . , q

denote these summary statistics, with N = 12, r = 2 and q = 3 and dimension Nrq = 72. Tree-based BART has no difﬁculty with summary statistics of this dimension.

As noted at the end of Section 2, level-labels are exchangeable so we can permute the data vectors y(i), i = 1, . . . , M and map them into a tighter subregion of Rn. Regression is easier on the more densely packed y(i)-values. Let σ PR be the set of relabeling permutations for which c(yσ) = c(y). In our setting with three categorical covariates Treatment, B1 and B2 with N = 12, q = 3 and r = 2 levels respectively, and a complete design, the number of legal permutations of the collapsed data T is |PR| = N!q!r!.

We now deﬁne a second coarser set of summary statistics. Consider the N = 12 treatment levels. For i = 1, . . . , M, let HN(y(i)) = { y(i) j }N j=1, where y(i) j is the sample mean

of {y(i) i : i 1 : N, Ai = j}. Take the permutation σ PR such that HN(σ(y(i))) = { y(i) (1), . . . , y(i) (N)} matches the

order statistics of { y(i) j }N j=1. Let Hr(y(i)) and Hq(y(i)) give the corresponding sorted averages for B2 and B1. Let

H(y(i)) = (HN(σ(y(i))), Hr(y(i)), Hq(y(i)))

denote this collection of the p = 17 order statistics. We take H(y) as a second set of summary statistics.

We simulate M = 810 pairs {c(i), y(i)}M i=1 of training data following Algorithm 3. We ﬁt two probit BART models M1 : c(i) T(y(i)) and M2 : c(i) H(y(i)) i.e. we ﬁt two models using Ti = T(y(i)) and Hi = H(y(i)) as summary statistics. The estimated values ˆc(yobs) at yobs are

given in Table 1. Estimates based on the different summary statistics M1 and M2 agree. In order to further test the robustness of our method, we partition the full synthetic dataset {c(i), y(i)}M i=1 into four equal-size subsets and ﬁt a Probit BART model using formula M2 to each subset. Let M21, M22, M23 and M24 label these models. Fitted values at yobs are in line with ﬁtted values determined on the full training set, with wider credible intervals as training sets are smaller. Our estimate of coverage is robust. The estimated coverage ˆc(yobs) is far lower than the nominal level α = 0.95, so the approximate marginal likelihood p BIC(y|S) is a poor approximation here, and the credible set should not be trusted.

6. Conclusion

In this paper we give a computational framework for estimating the bias in coverage due to approximations made in carrying out Bayesian inference. We provide estimators for the calibration problem deﬁned in Lee et al. (2018). We demonstrate their effectiveness by diagnosing poor approximate coverage in two examples. The quality of the approximate coverage may depend on the data, so an approximation may work well for some data sets and not others.

Our assumptions are similar to those of ABC (we can simulate the prior and observation model). A vanilla application of ABC would give a natural though in general inefﬁcient estimator for b(y) in Equation 1. We have help from the approximate posterior π and so our AIS method for estimating coverage can be seen as a hybrid of the two approximation schemes. Our BART based regression uses the same simulation stage as ABC, but the regression is used to estimate a probability function over data space, in a manner similar to the model selection procedure in Pudlo et al. (2016).

The two coverage estimators we suggest, AIS and BART, have complimentary strengths. AIS uses local information without variable selection. Regression with BART can more easily exploit global structure (such as the symmetry we found in c(y)) and does not require careful speciﬁcation of summary statistics or related distance measures. Finally we note that what we are offering is a consistency check: a good outcome (ie an estimated coverage close to α) is a necessary but not a sufﬁcient condition for us to trust the original estimated credible set.

Banﬁeld, J. D. and Raftery, A. E. Ice ﬂoe identiﬁcation in satellite images using mathematical morphology and clustering about principal curves. Journal of the American Statistical Association, 87(417):7 16, 1992.

Bates, D., Maechler, M., Bolker, B., Walker, S., et al. lme4:

Calibrated Approximate Bayesian Inference

Linear mixed-effects models using Eigen and S4. R package version, 1(7):1 23, 2014.

Besag, J. Statistical analysis of non-lattice data. The Statistician, pp. 179 195, 1975.

Chipman, H. A., George, E. I., Mc Culloch, R. E., et al. BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1):266 298, 2010.

Cook, S. R., Gelman, A., and Rubin, D. B. Validation of software for Bayesian models using posterior quantiles. Journal of Computational and Graphical Statistics, 15 (3):675 692, 2006.

Geweke, J. Getting it right: Joint distribution tests of posterior simulators. Journal of the American Statistical Association, 99(467):799 804, 2004.

Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303 1347, 2013.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. An introduction to variational methods for graphical models. Machine learning, 37(2):183 233, 1999.

Kapelner, A. and Bleich, J. bart Machine: Machine learning with Bayesian additive regression trees. Journal of Statistical Software, 70(i04), 2016.

Kass, R. E. and Raftery, A. E. Bayes factors. Journal of the American Statistical Association, 90(430):773 795, 1995.

Kaufman, B. Crystal statistics. II. Partition function evaluated by spinor analysis. Physical Review, 76(8):1232, 1949.

Lee, J. E., Nicholls, G. K., and Ryder, R. J. Calibration procedures for approximate Bayesian credible sets. ar Xiv preprint ar Xiv:1810.06433, 2018.

Malsiner-Walli, G., Pauger, D., and Wagner, H. Effect fusion using model-based clustering. Statistical Modelling, 18 (2):175 196, 2018.

Marin, J.-M., Pudlo, P., Estoup, A., and Robert, C. Likelihood-free model choice. Handbook of Approximate Bayesian Computation, pp. 153, 2018.

Menendez, P., Fan, Y., Garthwaite, P., and Sisson, S. Simultaneous adjustment of bias and coverage probabilities for conﬁdence intervals. Computational Statistics & Data Analysis, 70:35 44, 2014.

Monahan, J. F. and Boos, D. D. Proper likelihoods for Bayesian analysis. Biometrika, 79(2):271 278, 1992.

Murray, I., Ghahramani, Z., and Mac Kay, D. J. MCMC for doubly-intractable distributions. In Proceedings of the Twenty-Second Conference on Uncertainty in Artiﬁcial Intelligence, pp. 359 366. AUAI Press, 2006.

Neal, R. M. Annealed importance sampling. Statistics and Computing, 11(2):125 139, 2001.

Pauger, D., Wagner, H., et al. Bayesian effect fusion for categorical predictors. Bayesian Analysis, 2018.

Prangle, D., Blum, M. G., Popovic, G., and Sisson, S. Diagnostic tools for approximate Bayesian computation using the coverage property. Australian & New Zealand Journal of Statistics, 56(4):309 329, 2014.

Pritchard, J. K., T, S. M., Perez-Lezaun, A., and Feldman, M. W. Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Molecular Biology and Evolution, 16(12):1791 1798, 1999.

Pudlo, P., Marin, J.-M., Estoup, A., Cornuet, J.-M., Gautier, M., and Robert, C. P. Reliable ABC model choice via random forests. Biometrika, 32(6):859 866, 2016.

Raftery, A. E. Bayes factors and BIC: Comment on A critique of the Bayesian information criterion for model selection. Sociological Methods & Research, 27(3):411 427, 1999.

Raynal, L., Marin, J.-M., Pudlo, P., Ribatet, M., Robert, C., and Estoup, A. ABC random forests for Bayesian parameter inference. Bioinformatics, 2018.

Rodrigues, G., Prangle, D., and Sisson, S. Recalibration: A post-processing method for approximate Bayesian computation. Computational Statistics & Data Analysis, 126: 53 66, 2018.

Talts, S., Betancourt, M., Simpson, D., Vehtari, A., and Gelman, A. Validating Bayesian inference algorithms with simulation-based calibration. ar Xiv preprint ar Xiv:1804.06788, 2018.

Yao, Y., Vehtari, A., Simpson, D., and Gelman, A. Yes, but did it work?: Evaluating variational inference. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of PMLR, pp. 5581 5590, 2018.