Model evidence from nonequilibrium simulations

Michael Habeck
Statistical Inverse Problems in Biophysics, Max Planck Institute for Biophysical Chemistry & Institute for Mathematical Stochastics, University of Göttingen, 37077 Göttingen, Germany
email: mhabeck@gwdg.de

The marginal likelihood, or model evidence, is a key quantity in Bayesian parameter estimation and model comparison. For many probabilistic models, computation of the marginal likelihood is challenging, because it involves a sum or integral over an enormous parameter space. Markov chain Monte Carlo (MCMC) is a powerful approach to compute marginal likelihoods. Various MCMC algorithms and evidence estimators have been proposed in the literature. Here we discuss the use of nonequilibrium techniques for estimating the marginal likelihood. Nonequilibrium estimators build on recent developments in statistical physics and are known as annealed importance sampling (AIS) and reverse AIS in probabilistic machine learning. We introduce estimators for the model evidence that combine forward and backward simulations and show for various challenging models that these evidence estimators outperform forward and reverse AIS.

1 Introduction

The marginal likelihood or model evidence is a central quantity in Bayesian inference [1, 2], but notoriously difficult to compute. If the likelihood L(x) ≡ p(y|x, M) models data y and the prior π(x) ≡ p(x|M) expresses our knowledge about the parameters x of the model M, the posterior p(x|y, M) and the model evidence Z are given by

  p(x|y, M) = p(y|x, M) p(x|M) / p(y|M) = L(x) π(x) / Z,   Z ≡ p(y|M) = ∫ L(x) π(x) dx.   (1)

Parameter estimation proceeds by drawing samples from p(x|y, M), and different ways to model the data are ranked by their evidence. For example, two models M1 and M2 can be compared via their Bayes factor, which is proportional to the ratio of their marginal likelihoods p(y|M1)/p(y|M2) [3]. Often the posterior (and perhaps also the prior) is intractable in the sense that it is not possible to compute the normalizing constant, and therefore the evidence, analytically. This makes it difficult to compare different models via their posterior probability and model evidence. Markov chain Monte Carlo (MCMC) algorithms [4] require only unnormalized probability distributions and are among the most powerful and accurate methods to estimate the marginal likelihood, but they are computationally expensive. It is therefore important to develop efficient MCMC algorithms that can sample from the posterior and allow us to compute the marginal likelihood.

There is a close analogy between the marginal likelihood and the log-partition function or free energy from statistical physics [5]. Therefore, many concepts and algorithms originating in statistical physics have been applied to problems arising in probabilistic inference. Here we show that nonequilibrium fluctuation theorems (FTs) [6, 7, 8] can be used to estimate the marginal likelihood from forward and reverse simulations.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
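To make Eq. (1) concrete, the following minimal sketch (mine, not part of the paper's code; all names and parameter values are illustrative) checks the evidence of a conjugate one-dimensional Gaussian model, where Z is available in closed form, against numerical quadrature.

```python
import numpy as np
from scipy import integrate, stats

# Illustrative conjugate model (not from the paper): prior x ~ N(0, tau^2),
# likelihood y | x ~ N(x, sigma^2), a single observation y.
tau, sigma, y = 2.0, 1.0, 1.5

def prior(x):
    return stats.norm.pdf(x, 0.0, tau)

def likelihood(x):
    return stats.norm.pdf(y, x, sigma)

# Evidence Z = integral of L(x) pi(x) dx (Eq. 1), by numerical quadrature.
Z_quad, _ = integrate.quad(lambda x: likelihood(x) * prior(x), -np.inf, np.inf)

# Closed form: marginally y ~ N(0, tau^2 + sigma^2).
Z_exact = stats.norm.pdf(y, 0.0, np.sqrt(tau**2 + sigma**2))

print(Z_quad, Z_exact)  # both ≈ 0.1425
```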
2 Marginal likelihood estimation by annealed importance sampling

A common MCMC strategy to sample from the posterior and estimate the evidence is to simulate a sequence of distributions p_k that bridge between the prior and the posterior [9]. Samples can either be generated in sequential order, as in annealed importance sampling (AIS) [10], or in parallel, as in replica-exchange Monte Carlo or parallel tempering [11, 12]. Several methods have been proposed to estimate the marginal likelihood from MCMC samples, including thermodynamic integration (TI) [9], annealed importance sampling (AIS) [10], nested sampling (NS) [13] and the density of states (DOS) [14]. Most of these approaches (TI, DOS and NS) assume that we can draw exact samples from the intermediate distributions p_k, typically after a sufficiently large number of equilibration steps has been simulated. AIS, on the other hand, does not assume that the samples are equilibrated after each annealing step, which makes AIS very attractive for analyzing complex models for which equilibration is hard to achieve.

AIS employs a sequence of K + 1 probability distributions p_k and Markov transition operators T_k whose stationary distributions are p_k, i.e. ∫ T_k(x|x′) p_k(x′) dx′ = p_k(x). In a Bayesian setting, p_0 is the prior and p_K the posterior. Typically, p_k is intractable, meaning that we only know an unnormalized version f_k, but not the normalizer Z_k; that is, p_k(x) = f_k(x)/Z_k with Z_k = ∫ f_k(x) dx, and the evidence is Z = Z_K/Z_0. Often it is convenient to write f_k as an energy-based model, f_k(x) = exp{−E_k(x)}. In Bayesian inference, a popular choice is f_k(x) = [L(x)]^{β_k} π(x) with prior π(x) and likelihood L(x); β_k is an inverse temperature schedule that starts at β_0 = 0 (prior) and ends at β_K = 1 (posterior).

AIS samples paths x = [x_0, x_1, ..., x_{K−1}] according to the probability

  P_f[x] = T_{K−1}(x_{K−1}|x_{K−2}) ··· T_1(x_1|x_0) p_0(x_0)   (2)

where, following Crooks [8], calligraphic symbols and square brackets denote quantities that depend on the entire path. The subscript indicates that the path is generated by a forward simulation, which starts from p_0 and propagates the initial state through a sequence of new states by the successive action of the Markov kernels T_1, T_2, ..., T_{K−1}. The importance weight of a path is

  w[x] = ∏_{k=0}^{K−1} f_{k+1}(x_k)/f_k(x_k) = exp{ −Σ_{k=0}^{K−1} [E_{k+1}(x_k) − E_k(x_k)] }.   (3)

The average weight over many paths is a consistent and unbiased estimator of the model evidence Z = Z_K/Z_0, which follows from [15, 10] (see supplementary material for details):

  ⟨w⟩_f = ∫ w[x] P_f[x] D[x] = Z   (4)

where the average ⟨·⟩_f is an integral over all possible paths generated by the forward sequence of transition kernels (D[x] = dx_0 ··· dx_{K−1}). The average weight of M forward paths x^(i) is an estimate of the model evidence: Ẑ = (1/M) Σ_i w[x^(i)]. This estimator is at the core of AIS and its variants [10, 16]. To avoid overflow problems, it is numerically more stable to compute log weights. Logarithmic weights also arise naturally from a physical perspective, where −log w[x] is identified as the work required to generate path x.
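As a concrete illustration of Eqs. (2)–(4), the following sketch (mine, not the paper's reference implementation) runs forward AIS with simple random-walk Metropolis kernels on an illustrative one-dimensional model; the model, step size and schedule are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy model (not the paper's): standard normal prior,
# Gaussian likelihood centered at 2. Exact log Z = -0.5*log(2) - 1 ≈ -1.347.
def log_prior(x):
    return -0.5 * x**2 - 0.5 * np.log(2 * np.pi)

def log_lik(x):
    return -0.5 * (x - 2.0)**2

def ais_forward(betas, n_paths=1000, n_steps=10, step=0.5):
    """Forward AIS: returns the work W[x] = -log w[x] of each path (Eq. 3)."""
    x = rng.normal(size=n_paths)          # exact samples from p_0 = prior
    W = np.zeros(n_paths)
    for b_old, b_new in zip(betas[:-1], betas[1:]):
        # accumulate E_{k+1}(x_k) - E_k(x_k) = -(b_new - b_old) log L(x_k)
        W -= (b_new - b_old) * log_lik(x)
        # random-walk Metropolis moves leaving f_{k+1} = L^{b_new} pi invariant
        for _ in range(n_steps):
            x_new = x + step * rng.normal(size=n_paths)
            log_acc = (b_new * log_lik(x_new) + log_prior(x_new)
                       - b_new * log_lik(x) - log_prior(x))
            accept = np.log(rng.uniform(size=n_paths)) < log_acc
            x = np.where(accept, x_new, x)
    return W

W_f = ais_forward(np.linspace(0.0, 1.0, 101))
# AIS estimate (Eq. 4), averaged in log space to avoid overflow:
log_Z = np.logaddexp.reduce(-W_f) - np.log(len(W_f))
print(log_Z)
```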
3 Nonequilibrium fluctuation theorems

Nonequilibrium fluctuation theorems (FTs) quantify the degree of irreversibility of a stochastic process by relating the probability of generating a path by a forward simulation to the probability of generating the exact same path by a time-reversed simulation. If the Markov kernels T_k satisfy detailed balance, time reversal is achieved by applying the same sequence of kernels in reverse order. For Markov kernels not satisfying detailed balance, the definition is slightly more general [7, 10]. Here we assume that all kernels T_k satisfy detailed balance, which is valid for Markov chains based on the Metropolis algorithm and its variants [4]. Under these assumptions, the probability of generating the path x by a reverse simulation starting in x_{K−1} is

  P_r[x] = T_1(x_0|x_1) ··· T_{K−1}(x_{K−2}|x_{K−1}) p_K(x_{K−1}).   (5)

Averages over the reverse paths are indicated by ⟨·⟩_r. The detailed fluctuation theorem [6, 8] relates the probabilities of generating x in a forward and a reverse simulation (see supplementary material):

  P_f[x]/P_r[x] = (Z_K/Z_0) ∏_{k=0}^{K−1} f_k(x_k)/f_{k+1}(x_k) = Z/w[x] = exp{W[x] − ΔF}   (6)

where the physical analogs of the path weight and the marginal likelihood were introduced, namely the work W[x] = −log w[x] = Σ_k [E_{k+1}(x_k) − E_k(x_k)] and the free energy difference ΔF = −log Z = −log(Z_K/Z_0). Various demonstrations of relation (6) exist in the physics and machine learning literature [6, 7, 8, 10, 17].

Lower and upper bounds sandwiching the log evidence [17, 18, 16] follow directly from equation (6) and the non-negativity of the relative entropy:

  D_KL(P_f‖P_r) = ∫ P_f[x] log(P_f[x]/P_r[x]) D[x] = ⟨W⟩_f − ΔF ≥ 0.

From D_KL(P_r‖P_f) ≥ 0 we obtain an upper bound on log Z, such that overall we have

  ⟨log w⟩_f = −⟨W⟩_f ≤ log Z ≤ −⟨W⟩_r = ⟨log w⟩_r.   (7)

Grosse et al. use these bounds to assess the convergence of bidirectional Monte Carlo [18]. Thanks to the detailed fluctuation theorem (Eq. 6), we can also relate the marginal distributions of the work resulting from many forward and reverse simulations:

  p_f(W) = ∫ δ(W − W[x]) P_f[x] D[x] = p_r(W) exp{W − ΔF}   (8)

which is Crooks' fluctuation theorem (CFT) [7]. The CFT tells us that the work distributions p_f and p_r cross exactly at W = ΔF. Therefore, by plotting histograms of the work obtained in forward and backward simulations, we can read off an estimate of the negative log evidence. The Jarzynski equality (JE) [15] follows directly from the CFT due to the normalization of p_r:

  ∫ p_f(W) e^{−W} dW = ⟨e^{−W}⟩_f = e^{−ΔF}.   (9)

The JE restates the AIS estimator ⟨w⟩_f = Z (Eq. 4) in terms of the physical quantities. Remarkably, the JE holds for any stochastic dynamics bridging between the initial and target distribution. This suggests using fast annealing protocols to drag samples from the prior into the posterior. However, the JE involves an exponential average in which paths requiring the least work contribute most strongly. These paths correspond to work realizations that reside in the left tail of p_f. With faster annealing, the chance of generating a minimal-work path decreases exponentially, and such paths become rare events.
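Given arrays of forward and reverse work values, the bounds of Eq. (7) and the Jarzynski-type estimators follow in a few lines. The sketch below is mine, not the paper's package; it evaluates the exponential averages in log space, and the reverse-direction identity ⟨e^{+W}⟩_r = e^{+ΔF} that it uses follows directly from Eq. (8).

```python
import numpy as np

def work_estimators(W_f, W_r):
    """Evidence estimates from forward/reverse work samples.
    Returns the bounds of Eq. (7) and the Jarzynski/AIS estimates."""
    lower = -np.mean(W_f)                     # <log w>_f <= log Z
    upper = -np.mean(W_r)                     # log Z <= <log w>_r
    # forward JE (Eq. 9): log Z = log <exp(-W)>_f, evaluated with logsumexp
    log_Z_f = np.logaddexp.reduce(-W_f) - np.log(len(W_f))
    # reverse counterpart (reverse AIS): log Z = -log <exp(+W)>_r
    log_Z_r = -(np.logaddexp.reduce(W_r) - np.log(len(W_r)))
    return lower, upper, log_Z_f, log_Z_r
```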
A key feature of the CFT and the JE is that they do not require exact samples from the stationary distributions p_k, which are needed in applications of TI or DOS. For complex probabilistic models, the states generated by the kernels T_k will typically lag behind due to slow mixing, especially near phase transitions. The k-th state of the forward path follows the intermediate distribution

  q_k(x_k) = ∫ ∏_{l=1}^{k} T_l(x_l|x_{l−1}) p_0(x_0) dx_0 ··· dx_{k−1},   q_0(x) = p_0(x).   (10)

Unless the transition kernels T_k mix very rapidly, q_k ≠ p_k for k > 0. Consider the common case in Bayesian inference where E_k(x) = β_k E(x) and E(x) = −log L(x). Then, according to inequalities (7), we have the following lower bound on the marginal likelihood:

  ⟨log w⟩_f = −⟨W⟩_f = −Σ_{k=0}^{K−1} (β_{k+1} − β_k) ⟨E⟩_{q_k} ≤ log Z   (11)

and an analogous expression for the upper bound/reverse direction, in which the average energies ⟨E⟩_{q_k} along the forward path are replaced by the corresponding average energies along the backward path. The difference between the forward and reverse averages is called hysteresis in physics. The larger the hysteresis, the more strongly the marginal likelihood bounds disagree and the more uncertain our guess of the model evidence will be. The opposite limiting case is slow annealing and full equilibration, where the bound (Eq. 11) approaches thermodynamic integration (see supplementary material). We therefore expect a tradeoff between switching fast in order to save computation time and controlling the amount of hysteresis, which otherwise makes it difficult to extract accurate evidence estimates from the simulations.

Figure 1: Nonequilibrium analysis of a Gaussian toy model. (A) Work distributions p_f and p_r shown in blue and green. The correct free energy difference (minus log evidence) is indicated by a dashed line. (B) Comparison of the stationary distributions p_k and the marginal distributions q_k generated by the transition kernels. Shown is a 1σ band about the mean positions for p_k (blue) and q_k (green). (C) Lower and upper bounds of the log evidence (Eq. 7) and the logarithm of the exponential average over the forward work distribution, log⟨e^{−W}⟩_f, for increasingly slow annealing schedules.

4 Illustration for a tractable model

Let us illustrate the nonequilibrium results for a tractable model where the initial, the target and all intermediate distributions are Gaussians p_k(x) = N(x; μ_k, σ_k²) with means μ_k and standard deviations σ_k > 0. The transition kernels are also Gaussian: T_k(x|x′) = N(x; (1−τ_k)μ_k + τ_k x′, (1−τ_k²)σ_k²) with τ_k ∈ [0, 1] controlling the speed of convergence: for τ_k = 0 convergence is immediate, whereas for τ_k → 1 the chain generated by T_k converges infinitely slowly. Note that the kernels T_k satisfy detailed balance; therefore backward simulations apply the same kernels in reverse order. The energies and exact log partition functions are E_k(x) = (x − μ_k)²/(2σ_k²) and log Z_k = log(2πσ_k²)/2.

We bridge between an initial distribution with mean μ_0 = 20 and standard deviation σ_0 = 10 and a target with μ_K = 0, σ_K = 1 using K = 10 intermediate distributions, and compute the work distributions resulting from forward/backward simulations. Both distributions indeed cross at W = −log Z = log(σ_0/σ_K) = log 10, as predicted by the CFT (see Fig. 1A). Figure 1B illustrates the difference between the marginal distribution q_k of the samples after k annealing steps (Eq. 10) and the stationary distribution p_k. The marginal distributions q_k are also Gaussian, but their means and variances diverge from the means and variances of the stationary distributions p_k. This divergence results in hysteresis if the annealing process is forced to progress very rapidly without equilibrating the samples (quenching).

Figure 1C confirms the validity of the JE (Eq. 9) and of the lower and upper bounds (Eq. 7). For short annealing protocols, the bounds are very conservative, whereas the Jarzynski equality gives the correct evidence even for fast protocols (small K). In realistic applications, however, we cannot compute the work distribution p_f over the entire range of work values. In fast annealing simulations, it becomes increasingly difficult to explore the left tail of the work distribution, so that in practice the accuracy of the JE estimator deteriorates for too small K.
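The toy model is simple enough that the entire forward/reverse work simulation fits in a short sketch. The following code (mine; the value of the kernel parameter τ and the random seed are my choices, since the paper does not specify them) reproduces the setup and can be used to check that the two work histograms cross near log 10.

```python
import numpy as np

rng = np.random.default_rng(1)
K, M = 10, 1000
mu = np.linspace(20.0, 0.0, K + 1)      # means mu_0 = 20, ..., mu_K = 0
sigma = np.linspace(10.0, 1.0, K + 1)   # standard deviations 10, ..., 1
tau = 0.6                               # kernel stickiness (my choice)

def energy(k, x):
    return 0.5 * (x - mu[k])**2 / sigma[k]**2

def kernel(k, x):
    """One step of T_k, a Gaussian autoregressive kernel with stationary law p_k."""
    return ((1 - tau) * mu[k] + tau * x
            + np.sqrt(1 - tau**2) * sigma[k] * rng.normal(size=x.shape))

# forward: x_0 ~ p_0, propagate with T_1, ..., T_{K-1}
x, W_f = mu[0] + sigma[0] * rng.normal(size=M), np.zeros(M)
for k in range(K):
    W_f += energy(k + 1, x) - energy(k, x)
    if k + 1 < K:
        x = kernel(k + 1, x)

# reverse: x_{K-1} ~ p_K, propagate with T_{K-1}, ..., T_1 (Eq. 5)
x, W_r = mu[K] + sigma[K] * rng.normal(size=M), np.zeros(M)
for k in range(K - 1, -1, -1):
    W_r += energy(k + 1, x) - energy(k, x)
    if k > 0:
        x = kernel(k, x)

# the bounds of Eq. (7) bracket log Z = -log(10) ≈ -2.303
print(-W_f.mean(), -W_r.mean())
```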
5 Using the fluctuation theorem to estimate the evidence

To use the fluctuation theorem for evidence estimation, we run two sets of simulations. As in AIS, forward simulations start from a prior sample, which is successively propagated by the transition kernels T_k. For each forward path x^(i) the total work W_f^(i) is recorded. We also run reverse simulations starting from a posterior sample. For complex inference problems it is generally impossible to generate an exact sample from the posterior. However, in some cases the mode of the posterior is known, or powerful methods for locating the posterior maximum exist. We can then generate a posterior sample by applying the transition operator T_K many times, starting from the posterior mode. The reverse simulations could also be started from the final states generated by the forward simulations, drawn according to their importance weights w_f^(i) ∝ exp{−W_f^(i)}. Another possibility to generate a posterior sample is to start from the data, if we want to evaluate an intractable generative model such as a restricted Boltzmann machine. The posterior sample is then propagated by the reverse chain of transition operators. Again, we accumulate the total work generated by the reverse simulation, W_r^(i). The reverse simulation corresponds to reverse AIS proposed by Burda et al. [16].

5.1 Jarzynski and cumulant estimators

There are various options to estimate the evidence from forward and backward simulations. We can apply the Jarzynski equality to the W_f^(i) and W_r^(i), which corresponds to the estimators used in AIS [10] and reverse AIS [16]. According to Eq. (7) we can also compute an interval that likely contains the log evidence, but typically this interval will be quite large. Hummer [19] has developed estimators based on the cumulant expansion of p_f and p_r:

  log Z ≈ −⟨W⟩_f + var_f(W)/2,   log Z ≈ −⟨W⟩_r − var_r(W)/2   (12)

where var_f(W) and var_r(W) denote the sample variances of the work generated during the forward and reverse simulations. The cumulant estimators shift the lower/upper bound of the log evidence (Eq. 7) by half the sample variance of the work. The forward and reverse cumulant estimators can also be combined into a single estimate [19]:

  log Z ≈ −(⟨W⟩_f + ⟨W⟩_r)/2 + [var_f(W) − var_r(W)]/12.   (13)
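In code, the estimators of Eqs. (12) and (13) reduce to sample means and variances; this brief sketch (mine, following the sign conventions of the reconstruction above) makes that explicit.

```python
import numpy as np

def cumulant_estimators(W_f, W_r):
    """Second-order cumulant estimators of log Z (Eqs. 12 and 13)."""
    log_Z_f = -W_f.mean() + W_f.var(ddof=1) / 2.0
    log_Z_r = -W_r.mean() - W_r.var(ddof=1) / 2.0
    combined = (-(W_f.mean() + W_r.mean()) / 2.0
                + (W_f.var(ddof=1) - W_r.var(ddof=1)) / 12.0)
    return log_Z_f, log_Z_r, combined
```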
5.2 Bennett's acceptance ratio

Another powerful method is Bennett's acceptance ratio (BAR) [20, 21], which is based on the observation that, according to the CFT (Eq. 8),

  ∫ h(W; ΔF) p_f(W) e^{−W} dW = ∫ h(W; ΔF) p_r(W) e^{−ΔF} dW

for any function h. Therefore, any choice of h gives an implicit estimator for ΔF. Bennett showed [20, 9] that the minimum mean squared error is achieved for h ∝ (p_f + p_r)^{−1}, leading to the implicit equation

  Σ_i 1/(1 + Z exp{W_f^(i)}) = Σ_i 1/(1 + Z^{−1} exp{−W_r^(i)}).   (14)

By numerically solving equation (14) for Z, we obtain an estimator of the evidence based on Bennett's acceptance ratio. A straightforward way to solve the BAR equation is to iterate the multiplicative update Z^(t+1) = Z^(t) LHS(Z^(t))/RHS(Z^(t)), where LHS and RHS are the left- and right-hand sides of equation (14) and the superscript (t) indicates an iteration index. Algorithm 1 provides pseudocode to compute the BAR estimator (further details are given in the supplementary material).

Algorithm 1 Bennett's acceptance ratio (BAR)
Require: work values W_f^(i), W_r^(i) from M forward and reverse nonequilibrium simulations; tolerance δ (e.g. δ = 10^−4)
  Z ← (1/M) Σ_i exp{−W_f^(i)}   (Jarzynski estimator)
  repeat
    LHS ← Σ_i 1/(1 + Z exp{W_f^(i)}),   RHS ← Σ_i 1/(1 + Z^{−1} exp{−W_r^(i)})
    Z ← Z · LHS/RHS
  until |log(LHS/RHS)| < δ
  return Z
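A direct Python transcription of Algorithm 1 might look as follows; this is a sketch of mine, which works in log space and uses the logistic function for numerical stability rather than the raw ratios.

```python
import numpy as np
from scipy.special import expit   # expit(x) = 1/(1 + exp(-x))

def bar_log_evidence(W_f, W_r, tol=1e-4, max_iter=10_000):
    """Bennett's acceptance ratio (Algorithm 1), iterated in log space."""
    # initialize with the Jarzynski / forward AIS estimator
    log_Z = np.logaddexp.reduce(-W_f) - np.log(len(W_f))
    for _ in range(max_iter):
        lhs = expit(-(W_f + log_Z)).sum()   # sum_i 1/(1 + Z exp{W_f^(i)})
        rhs = expit(W_r + log_Z).sum()      # sum_i 1/(1 + exp{-W_r^(i)}/Z)
        log_Z += np.log(lhs) - np.log(rhs)  # multiplicative update for Eq. (14)
        if abs(np.log(lhs) - np.log(rhs)) < tol:
            break
    return log_Z
```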
5.3 Histogram estimator

Here we introduce yet another way of combining forward/backward simulations to estimate the model evidence. According to the CFT, we have

  W_f^(i) ~ p_f(W),   W_r^(i) ~ p_r(W) = p_f(W) e^{−W}/Z.

The idea is to combine all samples W_f^(i) and W_r^(i) to estimate p_f, from which we can then obtain the evidence by using the JE (Eq. 9). Thanks to the CFT, the samples from the reverse simulation contribute most strongly to the integral in the JE. Therefore, if we can use the reverse paths to estimate the forward work distribution, p_f will be quite accurate in the region that is most relevant for evaluating the JE. Estimating p_f from W_f^(i) and W_r^(i) is mathematically equivalent to estimating the density of states (DOS) (i.e. the marginal distribution of the log likelihood) from equilibrium simulations run at the two inverse temperatures β = 0 and β = 1. We can therefore directly apply histogram techniques [14, 22] developed for equilibrium simulations to estimate p_f from nonequilibrium simulations (details are given in the supplementary material). Histogram techniques result in a non-parametric estimate of the work distribution:

  p_f(W) ≈ Σ_j p_j δ(W − W_j)   (15)

where all sampled work values W_f^(i) and W_r^(i) are combined into a single set {W_j}, and the p_j are normalized weights associated with each W_j. Using the JE, we obtain

  Ẑ = Σ_j p_j e^{−W_j}   (16)

which is best evaluated in log space. The histogram iterations [14] used to determine the p_j and Z are very similar to the multiplicative updates that solve the BAR equation (Eq. 14). After running the histogram iterations, we obtain a non-parametric maximum likelihood estimate of p_f (Eq. 15). It is also possible to carry out a Bayesian analysis and derive a Gibbs sampler for p_f, which does not only provide a point estimate of log Z but also quantifies its uncertainty (see supplementary material for details).

Figure 2: Performance of evidence estimators on the Gaussian toy model. M = 100 forward and reverse simulations were run for schedules of increasing length K. This experiment was repeated 1000 times to probe the stability of the estimators. Shown is the difference between the log evidence estimate and its true value, −log 10. The average over all repetitions is shown as a red line; the light band indicates one standard deviation over all repetitions. (A) Cumulant estimator (Eq. 12) based on the forward simulation. (B) The combined cumulant estimator (Eq. 13). (C) Forward AIS estimator. (D) Reverse AIS. (E) BAR. (F) Histogram estimator.

We studied the performance of the evidence estimators on forward/backward simulations of the Gaussian toy model. The cumulant estimators (Figs. 2A, B) are systematically biased in the case of rapid annealing (small K). The combined cumulant estimator (Fig. 2B) is a significant improvement over the forward estimator, which does not take the reverse simulation data into account. The forward and reverse AIS estimators are shown in Figs. 2C and 2D. For this system, the evidence estimates derived from the reverse simulation are systematically more accurate than the AIS estimates based on the forward simulation, which is expected given that the work distribution from reverse simulations, p_r, is much more concentrated than the forward work distribution p_f (see Fig. 1A). The most accurate, least biased and most stable estimators are BAR (Fig. 2E) and the histogram estimator (Fig. 2F), which both combine forward and backward simulations into a single evidence estimate.

6 Experiments

We studied the performance of the nonequilibrium marginal likelihood estimators on various challenging probabilistic models, including Markov random fields and Gaussian mixture models. A Python package implementing the work simulations and evidence estimators can be downloaded from https://github.com/michaelhabeck/paths.

6.1 Ising model

Our first test system is a 32 × 32 Ising model for which the log evidence can be computed exactly: log Z = 1339.27 [23]. A single configuration consists of 1024 spins x_i = ±1. The energies of the intermediate distributions are E_k(x) = −β_k Σ_{⟨ij⟩} x_i x_j, where ⟨ij⟩ indicates nearest neighbors on a 2D square lattice. We generate M = 1000 forward and reverse paths using a linear inverse temperature schedule that interpolates between β_0 = 0 and β_K = 1 with K = 1000. Forward simulations start from random spin configurations. For the reverse simulations, we start in one of the two ground states with all spins either −1 or +1. The T_k are Metropolis kernels based on p_k: a new spin configuration is proposed by flipping a randomly selected spin and accepted or rejected according to the Metropolis rule. The single spin-flip transitions are repeated N times at constant β_k, i.e. N is the number of equilibration steps after a perturbation was induced by lowering the temperature. The larger N, the more time we allow the simulation to equilibrate, and the closer q_k will be to p_k.

Figure 3: Evidence estimation for the 32 × 32 Ising model. (A) Work distributions obtained for K = 1000 annealing and N = 1000 equilibration steps. (B) Average energies ⟨E⟩_f and ⟨E⟩_r at different annealing steps k in comparison to the average energy ⟨E⟩_β of the stationary distribution. Shown is a zoom into the inverse temperature range from 0.4 to 0.7; the average energies agree quite well outside this interval. (C) Evidence estimates for an increasing number of equilibration steps N. Light/dark blue: lower/upper bounds ⟨log w⟩_f / ⟨log w⟩_r; light/dark green: forward/reverse AIS estimators log⟨w⟩_f / log⟨w⟩_r; light red: BAR; dark red: histogram estimator. For N > 1000, BAR and the histogram estimator produce virtually identical evidence estimates.

Figure 3A shows the work distributions obtained with N = 1000 equilibration steps per temperature perturbation. Even though the forward and reverse work distributions overlap only weakly, the evidence estimates obtained with BAR and the histogram estimator are quite accurate: 1338.05 (BAR) and 1338.28 (histogram estimator), which differ by only approximately 1 nat from the true log evidence, corresponding to relative errors of 9 × 10^−4 (BAR) and 7 × 10^−4 (histogram estimator). Forward and reverse AIS provide less accurate estimates of the log evidence: 1333.66 (AIS) and 1342.05 (RAISE). The lower and upper bounds are very broad, ⟨log w⟩_f = 1290.5 and ⟨log w⟩_r = 1352.0, which results from hysteresis effects. Figure 3B zooms into the average energies obtained during the forward and reverse simulations and compares them with the average energy of a fully equilibrated simulation. The average energies differ most strongly at inverse temperatures close to the critical value β_crit ≈ 0.44, at which the Ising model undergoes a second-order phase transition. We also tested the performance of the estimators as a function of the number of equilibration steps N. As already observed for the Gaussian toy model, BAR and the histogram estimator outperform the Jarzynski estimators (AIS and RAISE) also in the case of the Ising model (see Fig. 3C).
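For reference, here is a compact sketch (mine, not the paper's package) of the forward annealing run just described: single-spin-flip Metropolis with a linear β schedule. Periodic boundaries and the reading of one "equilibration step" as one attempted flip are my assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
L = 32  # 32 x 32 lattice, periodic boundaries (my assumption)

def delta_E(x, i, j):
    """Energy change of E(x) = -sum_<ij> x_i x_j when flipping spin (i, j)."""
    nn = (x[(i - 1) % L, j] + x[(i + 1) % L, j]
          + x[i, (j - 1) % L] + x[i, (j + 1) % L])
    return 2.0 * x[i, j] * nn

def forward_work(K=1000, N=10):
    """One forward path; returns W = sum_k (beta_{k+1} - beta_k) E(x_k)."""
    betas = np.linspace(0.0, 1.0, K + 1)
    x = rng.choice([-1, 1], size=(L, L))       # random configuration at beta = 0
    # total energy, each nearest-neighbor bond counted once
    E = -np.sum(x * (np.roll(x, 1, axis=0) + np.roll(x, 1, axis=1)))
    W = 0.0
    for b_old, b_new in zip(betas[:-1], betas[1:]):
        W += (b_new - b_old) * E               # work from lowering the temperature
        for _ in range(N):                     # Metropolis moves at constant b_new
            i, j = rng.integers(L, size=2)
            dE = delta_E(x, i, j)
            if np.log(rng.uniform()) < -b_new * dE:
                x[i, j] = -x[i, j]
                E += dE
    return W

W_f = np.array([forward_work() for _ in range(10)])
```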
6.2 Ten-state Potts model

Next we performed simulations of the ten-state Potts model defined over a 32 × 32 lattice, where the spins of the Ising model are replaced by integer colors x_i ∈ {1, ..., 10} and the interaction energy is −2δ(x_i, x_j). This model is significantly more challenging than the Ising model, because it undergoes a first-order phase transition and has an astronomically larger number of states (10^1024 colorings rather than 2^1024 spin configurations). We performed forward/backward simulations using a linear inverse temperature schedule with β_0 = 0, β_K = 1 and a fixed computational budget K·N = 10^9.

Figure 4: Evidence estimation for the Potts model and the RBM. (A) Estimated log evidence of the Potts model for a fixed computational budget K·N = 10^9, where M = 100 and ten repetitions were computed. The reference value log Z = 1742 (obtained with parallel tempering) is shown as a dashed black line. (B) log Z distributions obtained with the Gibbs sampling version of the histogram estimator for K = 1000 and a varying number of equilibration steps. (C) Work distributions obtained for a marginal and a full RBM (light/dark blue: forward/reverse simulation of the marginal model; light/dark green: forward/reverse simulation of the full model).

Figure 4A shows that there seems to be no advantage in increasing the number of intermediate distributions at the cost of reducing the number of equilibration steps. Again, BAR and the histogram estimator perform very similarly. The Gibbs sampling version of the histogram estimator also provides the posterior of log Z (see Fig. 4B). For too few equilibration steps N, this distribution is rather broad or even slightly biased, but for large N the log Z posterior concentrates around the correct log evidence.

6.3 Restricted Boltzmann machine

The restricted Boltzmann machine (RBM) is a common building block of deep learning hierarchies. The RBM is an intractable MRF with bipartite interactions: E(v, h) = −(aᵀv + bᵀh + vᵀWh), where a, b are the visible and hidden biases and W are the couplings between the visible and hidden units v_i and h_j. Here we compare annealing of the full model, E_k(v, h) = β_k E(v, h), against annealing of the marginal model, E_k(h) = −β_k log Σ_v exp{−E(v, h)}. The full model can be simulated using a Gibbs sampler, which is straightforward since the conditional distributions are Bernoulli. To sample from the marginal model, we use a Metropolis kernel similar to the one used for the Ising model. To start the reverse simulations, we randomly pick an image from the training set and generate an initial hidden state by sampling from the conditional distribution p(h|v). We then run 100 steps of Gibbs sampling with T_K to obtain a posterior sample.

We ran tests on an RBM with 784 visible and 500 hidden units trained on the MNIST handwritten digits dataset [24] with contrastive divergence using 25 steps [25]. Since the true log evidence is not known, we use a reference value obtained with parallel tempering (PT): log Z ≈ 451.42. Figure 4C compares evidence estimates based on annealing simulations of the full model against the marginal model. Both annealing approaches provide very similar evidence estimates, 451.43 (full model) and 451.48 (marginal model), that are close to the PT result. However, simulation of the marginal model is three times faster than that of the full model. Therefore, it seems beneficial to evaluate and train RBMs based on sampling and annealing of the marginal model p(h) rather than the full model p(v, h).
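The marginal energy E(h) is tractable because the sum over visible units factorizes for the bipartite RBM energy. The following sketch (mine; it assumes binary 0/1 units, which is the standard convention for this energy but is not stated in the text) shows the marginal energy and a single-flip Metropolis step of the kind used above.

```python
import numpy as np

def marginal_energy(h, a, b, W):
    """E(h) = -log sum_v exp{-E(v, h)} for binary 0/1 units (my assumption);
    the sum over visibles factorizes thanks to the bipartite structure."""
    return -(h @ b + np.sum(np.logaddexp(0.0, a + W @ h)))

def metropolis_step(h, beta, a, b, W, rng):
    """Single hidden-unit flip targeting f_k(h) = exp{-beta E(h)}.
    For clarity the marginal energy is recomputed from scratch; a real
    implementation would update W @ h incrementally after each flip."""
    j = rng.integers(len(h))
    h_new = h.copy()
    h_new[j] = 1.0 - h_new[j]
    dE = marginal_energy(h_new, a, b, W) - marginal_energy(h, a, b, W)
    return h_new if np.log(rng.uniform()) < -beta * dE else h

# tiny demo with random parameters (illustrative shapes: 784 visible, 500 hidden)
rng = np.random.default_rng(3)
a, b = 0.01 * rng.normal(size=784), 0.01 * rng.normal(size=500)
W = 0.01 * rng.normal(size=(784, 500))
h = rng.integers(0, 2, size=500).astype(float)
h = metropolis_step(h, 0.5, a, b, W, rng)
```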
6.4 Gaussian mixture model

Finally, we consider a sort of data annealing strategy in which independent data points are added one by one, as in sequential Monte Carlo [10]: E_k(x) = Σ_l