Model evidence from nonequilibrium simulations

Michael Habeck
Statistical Inverse Problems in Biophysics, Max Planck Institute for Biophysical Chemistry & Institute for Mathematical Stochastics, University of Göttingen, 37077 Göttingen, Germany
email: mhabeck@gwdg.de

The marginal likelihood, or model evidence, is a key quantity in Bayesian parameter estimation and model comparison. For many probabilistic models, computation of the marginal likelihood is challenging, because it involves a sum or integral over an enormous parameter space. Markov chain Monte Carlo (MCMC) is a powerful approach to compute marginal likelihoods. Various MCMC algorithms and evidence estimators have been proposed in the literature. Here we discuss the use of nonequilibrium techniques for estimating the marginal likelihood. Nonequilibrium estimators build on recent developments in statistical physics and are known as annealed importance sampling (AIS) and reverse AIS in probabilistic machine learning. We introduce estimators for the model evidence that combine forward and backward simulations and show for various challenging models that these evidence estimators outperform forward and reverse AIS.

1 Introduction

The marginal likelihood or model evidence is a central quantity in Bayesian inference [1, 2], but notoriously difficult to compute. If the likelihood L(x) ≡ p(y|x, M) models data y and the prior π(x) ≡ p(x|M) expresses our knowledge about the parameters x of the model M, the posterior p(x|y, M) and the model evidence Z are given by

  p(x|y, M) = p(y|x, M) p(x|M) / p(y|M) = L(x) π(x) / Z,   Z ≡ p(y|M) = ∫ L(x) π(x) dx.   (1)

Parameter estimation proceeds by drawing samples from p(x|y, M), and different ways to model the data are ranked by their evidence. For example, two models M1 and M2 can be compared via their Bayes factor, which is proportional to the ratio of their marginal likelihoods p(y|M1)/p(y|M2) [3]. Often the posterior (and perhaps also the prior) is intractable in the sense that it is not possible to compute the normalizing constant, and therefore the evidence, analytically. This makes it difficult to compare different models via their posterior probability and model evidence. Markov chain Monte Carlo (MCMC) algorithms [4] require only unnormalized probability distributions and are among the most powerful and accurate methods to estimate the marginal likelihood, but they are computationally expensive. It is therefore important to develop efficient MCMC algorithms that can sample from the posterior and allow us to compute the marginal likelihood.

There is a close analogy between the marginal likelihood and the log-partition function or free energy from statistical physics [5]. Therefore, many concepts and algorithms originating in statistical physics have been applied to problems arising in probabilistic inference. Here we show that nonequilibrium fluctuation theorems (FTs) [6, 7, 8] can be used to estimate the marginal likelihood from forward and reverse simulations.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
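To make Eq. (1) concrete, the following minimal sketch (mine, not part of the paper's code; all names and parameter values are illustrative) checks the evidence of a conjugate one-dimensional Gaussian model, where Z is available in closed form, against numerical quadrature.

```python
import numpy as np
from scipy import integrate, stats

# Illustrative conjugate model (not from the paper): prior x ~ N(0, tau^2),
# likelihood y | x ~ N(x, sigma^2), a single observation y.
tau, sigma, y = 2.0, 1.0, 1.5

def prior(x):
    return stats.norm.pdf(x, 0.0, tau)

def likelihood(x):
    return stats.norm.pdf(y, x, sigma)

# Evidence Z = integral of L(x) pi(x) dx (Eq. 1), by numerical quadrature.
Z_quad, _ = integrate.quad(lambda x: likelihood(x) * prior(x), -np.inf, np.inf)

# Closed form: marginally y ~ N(0, tau^2 + sigma^2).
Z_exact = stats.norm.pdf(y, 0.0, np.sqrt(tau**2 + sigma**2))

print(Z_quad, Z_exact)  # both ≈ 0.1425
```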
2 Marginal likelihood estimation by annealed importance sampling

A common MCMC strategy to sample from the posterior and estimate the evidence is to simulate a sequence of distributions p_k that bridge between the prior and the posterior [9]. Samples can either be generated in sequential order, as in annealed importance sampling (AIS) [10], or in parallel, as in replica-exchange Monte Carlo or parallel tempering [11, 12]. Several methods have been proposed to estimate the marginal likelihood from MCMC samples, including thermodynamic integration (TI) [9], annealed importance sampling (AIS) [10], nested sampling (NS) [13] and the density of states (DOS) [14]. Most of these approaches (TI, DOS and NS) assume that we can draw exact samples from the intermediate distributions p_k, typically after a sufficiently large number of equilibration steps has been simulated. AIS, on the other hand, does not assume that the samples are equilibrated after each annealing step, which makes AIS very attractive for analyzing complex models for which equilibration is hard to achieve.

AIS employs a sequence of K + 1 probability distributions p_k and Markov transition operators T_k whose stationary distributions are p_k, i.e. ∫ T_k(x|x′) p_k(x′) dx′ = p_k(x). In a Bayesian setting, p_0 is the prior and p_K the posterior. Typically, p_k is intractable, meaning that we only know an unnormalized version f_k, but not the normalizer Z_k; that is, p_k(x) = f_k(x)/Z_k with Z_k = ∫ f_k(x) dx, and the evidence is Z = Z_K/Z_0. Often it is convenient to write f_k as an energy-based model, f_k(x) = exp{−E_k(x)}. In Bayesian inference, a popular choice is f_k(x) = [L(x)]^{β_k} π(x) with prior π(x) and likelihood L(x); β_k is an inverse temperature schedule that starts at β_0 = 0 (prior) and ends at β_K = 1 (posterior).

AIS samples paths x = [x_0, x_1, ..., x_{K−1}] according to the probability

  P_f[x] = T_{K−1}(x_{K−1}|x_{K−2}) ··· T_1(x_1|x_0) p_0(x_0)   (2)

where, following Crooks [8], calligraphic symbols and square brackets denote quantities that depend on the entire path. The subscript indicates that the path is generated by a forward simulation, which starts from p_0 and propagates the initial state through a sequence of new states by the successive action of the Markov kernels T_1, T_2, ..., T_{K−1}. The importance weight of a path is

  w[x] = ∏_{k=0}^{K−1} f_{k+1}(x_k)/f_k(x_k) = exp{ −Σ_{k=0}^{K−1} [E_{k+1}(x_k) − E_k(x_k)] }.   (3)

The average weight over many paths is a consistent and unbiased estimator of the model evidence Z = Z_K/Z_0, which follows from [15, 10] (see supplementary material for details):

  ⟨w⟩_f = ∫ w[x] P_f[x] D[x] = Z   (4)

where the average ⟨·⟩_f is an integral over all possible paths generated by the forward sequence of transition kernels (D[x] = dx_0 ··· dx_{K−1}). The average weight of M forward paths x^(i) is an estimate of the model evidence: Ẑ = (1/M) Σ_i w[x^(i)]. This estimator is at the core of AIS and its variants [10, 16]. To avoid overflow problems, it is numerically more stable to compute log weights. Logarithmic weights also arise naturally from a physical perspective, where −log w[x] is identified as the work required to generate path x.
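As a concrete illustration of Eqs. (2)–(4), the following sketch (mine, not the paper's reference implementation) runs forward AIS with simple random-walk Metropolis kernels on an illustrative one-dimensional model; the model, step size and schedule are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy model (not the paper's): standard normal prior,
# Gaussian likelihood centered at 2. Exact log Z = -0.5*log(2) - 1 ≈ -1.347.
def log_prior(x):
    return -0.5 * x**2 - 0.5 * np.log(2 * np.pi)

def log_lik(x):
    return -0.5 * (x - 2.0)**2

def ais_forward(betas, n_paths=1000, n_steps=10, step=0.5):
    """Forward AIS: returns the work W[x] = -log w[x] of each path (Eq. 3)."""
    x = rng.normal(size=n_paths)          # exact samples from p_0 = prior
    W = np.zeros(n_paths)
    for b_old, b_new in zip(betas[:-1], betas[1:]):
        # accumulate E_{k+1}(x_k) - E_k(x_k) = -(b_new - b_old) log L(x_k)
        W -= (b_new - b_old) * log_lik(x)
        # random-walk Metropolis moves leaving f_{k+1} = L^{b_new} pi invariant
        for _ in range(n_steps):
            x_new = x + step * rng.normal(size=n_paths)
            log_acc = (b_new * log_lik(x_new) + log_prior(x_new)
                       - b_new * log_lik(x) - log_prior(x))
            accept = np.log(rng.uniform(size=n_paths)) < log_acc
            x = np.where(accept, x_new, x)
    return W

W_f = ais_forward(np.linspace(0.0, 1.0, 101))
# AIS estimate (Eq. 4), averaged in log space to avoid overflow:
log_Z = np.logaddexp.reduce(-W_f) - np.log(len(W_f))
print(log_Z)
```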
3 Nonequilibrium fluctuation theorems

Nonequilibrium fluctuation theorems (FTs) quantify the degree of irreversibility of a stochastic process by relating the probability of generating a path by a forward simulation to the probability of generating the exact same path by a time-reversed simulation. If the Markov kernels T_k satisfy detailed balance, time reversal is achieved by applying the same sequence of kernels in reverse order. For Markov kernels not satisfying detailed balance, the definition is slightly more general [7, 10]. Here we assume that all kernels T_k satisfy detailed balance, which is valid for Markov chains based on the Metropolis algorithm and its variants [4]. Under these assumptions, the probability of generating the path x by a reverse simulation starting in x_{K−1} is

  P_r[x] = T_1(x_0|x_1) ··· T_{K−1}(x_{K−2}|x_{K−1}) p_K(x_{K−1}).   (5)

Averages over the reverse paths are indicated by ⟨·⟩_r. The detailed fluctuation theorem [6, 8] relates the probabilities of generating x in a forward and a reverse simulation (see supplementary material):

  P_f[x]/P_r[x] = (Z_K/Z_0) ∏_{k=0}^{K−1} f_k(x_k)/f_{k+1}(x_k) = Z/w[x] = exp{W[x] − ΔF}   (6)

where the physical analogs of the path weight and the marginal likelihood were introduced, namely the work W[x] = −log w[x] = Σ_k [E_{k+1}(x_k) − E_k(x_k)] and the free energy difference ΔF = −log Z = −log(Z_K/Z_0). Various demonstrations of relation (6) exist in the physics and machine learning literature [6, 7, 8, 10, 17].

Lower and upper bounds sandwiching the log evidence [17, 18, 16] follow directly from equation (6) and the non-negativity of the relative entropy:

  D_KL(P_f‖P_r) = ∫ P_f[x] log(P_f[x]/P_r[x]) D[x] = ⟨W⟩_f − ΔF ≥ 0.

From D_KL(P_r‖P_f) ≥ 0 we obtain an upper bound on log Z, such that overall we have

  ⟨log w⟩_f = −⟨W⟩_f ≤ log Z ≤ −⟨W⟩_r = ⟨log w⟩_r.   (7)

Grosse et al. use these bounds to assess the convergence of bidirectional Monte Carlo [18]. Thanks to the detailed fluctuation theorem (Eq. 6), we can also relate the marginal distributions of the work resulting from many forward and reverse simulations:

  p_f(W) = ∫ δ(W − W[x]) P_f[x] D[x] = p_r(W) exp{W − ΔF}   (8)

which is Crooks' fluctuation theorem (CFT) [7]. The CFT tells us that the work distributions p_f and p_r cross exactly at W = ΔF. Therefore, by plotting histograms of the work obtained in forward and backward simulations, we can read off an estimate of the negative log evidence. The Jarzynski equality (JE) [15] follows directly from the CFT due to the normalization of p_r:

  ∫ p_f(W) e^{−W} dW = ⟨e^{−W}⟩_f = e^{−ΔF}.   (9)

The JE restates the AIS estimator ⟨w⟩_f = Z (Eq. 4) in terms of the physical quantities. Remarkably, the JE holds for any stochastic dynamics bridging between the initial and target distribution. This suggests using fast annealing protocols to drag samples from the prior into the posterior. However, the JE involves an exponential average in which paths requiring the least work contribute most strongly. These paths correspond to work realizations that reside in the left tail of p_f. With faster annealing, the chance of generating a minimal-work path decreases exponentially, and such paths become rare events.
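Given arrays of forward and reverse work values, the bounds of Eq. (7) and the Jarzynski-type estimators follow in a few lines. The sketch below is mine, not the paper's package; it evaluates the exponential averages in log space, and the reverse-direction identity ⟨e^{+W}⟩_r = e^{+ΔF} that it uses follows directly from Eq. (8).

```python
import numpy as np

def work_estimators(W_f, W_r):
    """Evidence estimates from forward/reverse work samples.
    Returns the bounds of Eq. (7) and the Jarzynski/AIS estimates."""
    lower = -np.mean(W_f)                     # <log w>_f <= log Z
    upper = -np.mean(W_r)                     # log Z <= <log w>_r
    # forward JE (Eq. 9): log Z = log <exp(-W)>_f, evaluated with logsumexp
    log_Z_f = np.logaddexp.reduce(-W_f) - np.log(len(W_f))
    # reverse counterpart (reverse AIS): log Z = -log <exp(+W)>_r
    log_Z_r = -(np.logaddexp.reduce(W_r) - np.log(len(W_r)))
    return lower, upper, log_Z_f, log_Z_r
```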
A key feature of the CFT and the JE is that they do not require exact samples from the stationary distributions p_k, which are needed in applications of TI or DOS. For complex probabilistic models, the states generated by the kernels T_k will typically lag behind due to slow mixing, especially near phase transitions. The k-th state of the forward path follows the intermediate distribution

  q_k(x_k) = ∫ ∏_{l=1}^{k} T_l(x_l|x_{l−1}) p_0(x_0) dx_0 ··· dx_{k−1},   q_0(x) = p_0(x).   (10)

Unless the transition kernels T_k mix very rapidly, q_k ≠ p_k for k > 0. Consider the common case in Bayesian inference where E_k(x) = β_k E(x) and E(x) = −log L(x). Then, according to inequalities (7), we have the following lower bound on the marginal likelihood:

  ⟨log w⟩_f = −⟨W⟩_f = −Σ_{k=0}^{K−1} (β_{k+1} − β_k) ⟨E⟩_{q_k} ≤ log Z   (11)

and an analogous expression for the upper bound/reverse direction, in which the average energies ⟨E⟩_{q_k} along the forward path are replaced by the corresponding average energies along the backward path. The difference between the forward and reverse averages is called hysteresis in physics. The larger the hysteresis, the more strongly the marginal likelihood bounds disagree and the more uncertain our guess of the model evidence will be. The opposite limiting case is slow annealing and full equilibration, where the bound (Eq. 11) approaches thermodynamic integration (see supplementary material). We therefore expect a tradeoff between switching fast in order to save computation time and controlling the amount of hysteresis, which otherwise makes it difficult to extract accurate evidence estimates from the simulations.

Figure 1: Nonequilibrium analysis of a Gaussian toy model. (A) Work distributions p_f and p_r shown in blue and green. The correct free energy difference (minus log evidence) is indicated by a dashed line. (B) Comparison of the stationary distributions p_k and the marginal distributions q_k generated by the transition kernels. Shown is a 1σ band about the mean positions for p_k (blue) and q_k (green). (C) Lower and upper bounds of the log evidence (Eq. 7) and the logarithm of the exponential average over the forward work distribution, log⟨e^{−W}⟩_f, for increasingly slow annealing schedules.

4 Illustration for a tractable model

Let us illustrate the nonequilibrium results for a tractable model where the initial, the target and all intermediate distributions are Gaussians p_k(x) = N(x; μ_k, σ_k²) with means μ_k and standard deviations σ_k > 0. The transition kernels are also Gaussian: T_k(x|x′) = N(x; (1−τ_k)μ_k + τ_k x′, (1−τ_k²)σ_k²) with τ_k ∈ [0, 1] controlling the speed of convergence: for τ_k = 0 convergence is immediate, whereas for τ_k → 1 the chain generated by T_k converges infinitely slowly. Note that the kernels T_k satisfy detailed balance; therefore backward simulations apply the same kernels in reverse order. The energies and exact log partition functions are E_k(x) = (x − μ_k)²/(2σ_k²) and log Z_k = log(2πσ_k²)/2.

We bridge between an initial distribution with mean μ_0 = 20 and standard deviation σ_0 = 10 and a target with μ_K = 0, σ_K = 1 using K = 10 intermediate distributions, and compute the work distributions resulting from forward/backward simulations. Both distributions indeed cross at W = −log Z = log(σ_0/σ_K) = log 10, as predicted by the CFT (see Fig. 1A). Figure 1B illustrates the difference between the marginal distribution q_k of the samples after k annealing steps (Eq. 10) and the stationary distribution p_k. The marginal distributions q_k are also Gaussian, but their means and variances diverge from the means and variances of the stationary distributions p_k. This divergence results in hysteresis if the annealing process is forced to progress very rapidly without equilibrating the samples (quenching).

Figure 1C confirms the validity of the JE (Eq. 9) and of the lower and upper bounds (Eq. 7). For short annealing protocols, the bounds are very conservative, whereas the Jarzynski equality gives the correct evidence even for fast protocols (small K). In realistic applications, however, we cannot compute the work distribution p_f over the entire range of work values. In fast annealing simulations, it becomes increasingly difficult to explore the left tail of the work distribution, so that in practice the accuracy of the JE estimator deteriorates for too small K.
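The toy model is simple enough that the entire forward/reverse work simulation fits in a short sketch. The following code (mine; the value of the kernel parameter τ and the random seed are my choices, since the paper does not specify them) reproduces the setup and can be used to check that the two work histograms cross near log 10.

```python
import numpy as np

rng = np.random.default_rng(1)
K, M = 10, 1000
mu = np.linspace(20.0, 0.0, K + 1)      # means mu_0 = 20, ..., mu_K = 0
sigma = np.linspace(10.0, 1.0, K + 1)   # standard deviations 10, ..., 1
tau = 0.6                               # kernel stickiness (my choice)

def energy(k, x):
    return 0.5 * (x - mu[k])**2 / sigma[k]**2

def kernel(k, x):
    """One step of T_k, a Gaussian autoregressive kernel with stationary law p_k."""
    return ((1 - tau) * mu[k] + tau * x
            + np.sqrt(1 - tau**2) * sigma[k] * rng.normal(size=x.shape))

# forward: x_0 ~ p_0, propagate with T_1, ..., T_{K-1}
x, W_f = mu[0] + sigma[0] * rng.normal(size=M), np.zeros(M)
for k in range(K):
    W_f += energy(k + 1, x) - energy(k, x)
    if k + 1 < K:
        x = kernel(k + 1, x)

# reverse: x_{K-1} ~ p_K, propagate with T_{K-1}, ..., T_1 (Eq. 5)
x, W_r = mu[K] + sigma[K] * rng.normal(size=M), np.zeros(M)
for k in range(K - 1, -1, -1):
    W_r += energy(k + 1, x) - energy(k, x)
    if k > 0:
        x = kernel(k, x)

# the bounds of Eq. (7) bracket log Z = -log(10) ≈ -2.303
print(-W_f.mean(), -W_r.mean())
```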
5 Using the fluctuation theorem to estimate the evidence

To use the fluctuation theorem for evidence estimation, we run two sets of simulations. As in AIS, forward simulations start from a prior sample, which is successively propagated by the transition kernels T_k. For each forward path x^(i) the total work W_f^(i) is recorded. We also run reverse simulations starting from a posterior sample. For complex inference problems it is generally impossible to generate an exact sample from the posterior. However, in some cases the mode of the posterior is known, or powerful methods for locating the posterior maximum exist. We can then generate a posterior sample by applying the transition operator T_K many times, starting from the posterior mode. The reverse simulations could also be started from the final states generated by the forward simulations, drawn according to their importance weights w_f^(i) ∝ exp{−W_f^(i)}. Another possibility to generate a posterior sample is to start from the data, if we want to evaluate an intractable generative model such as a restricted Boltzmann machine. The posterior sample is then propagated by the reverse chain of transition operators. Again, we accumulate the total work generated by the reverse simulation, W_r^(i). The reverse simulation corresponds to reverse AIS proposed by Burda et al. [16].

5.1 Jarzynski and cumulant estimators

There are various options to estimate the evidence from forward and backward simulations. We can apply the Jarzynski equality to the W_f^(i) and W_r^(i), which corresponds to the estimators used in AIS [10] and reverse AIS [16]. According to Eq. (7) we can also compute an interval that likely contains the log evidence, but typically this interval will be quite large. Hummer [19] has developed estimators based on the cumulant expansion of p_f and p_r:

  log Z ≈ −⟨W⟩_f + var_f(W)/2,   log Z ≈ −⟨W⟩_r − var_r(W)/2   (12)

where var_f(W) and var_r(W) denote the sample variances of the work generated during the forward and reverse simulations. The cumulant estimators shift the lower/upper bound of the log evidence (Eq. 7) by half the sample variance of the work. The forward and reverse cumulant estimators can also be combined into a single estimate [19]:

  log Z ≈ −(⟨W⟩_f + ⟨W⟩_r)/2 + [var_f(W) − var_r(W)]/12.   (13)
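In code, the estimators of Eqs. (12) and (13) reduce to sample means and variances; this brief sketch (mine, following the sign conventions of the reconstruction above) makes that explicit.

```python
import numpy as np

def cumulant_estimators(W_f, W_r):
    """Second-order cumulant estimators of log Z (Eqs. 12 and 13)."""
    log_Z_f = -W_f.mean() + W_f.var(ddof=1) / 2.0
    log_Z_r = -W_r.mean() - W_r.var(ddof=1) / 2.0
    combined = (-(W_f.mean() + W_r.mean()) / 2.0
                + (W_f.var(ddof=1) - W_r.var(ddof=1)) / 12.0)
    return log_Z_f, log_Z_r, combined
```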
5.2 Bennett's acceptance ratio

Another powerful method is Bennett's acceptance ratio (BAR) [20, 21], which is based on the observation that, according to the CFT (Eq. 8),

  ∫ h(W; ΔF) p_f(W) e^{−W} dW = ∫ h(W; ΔF) p_r(W) e^{−ΔF} dW

for any function h. Therefore, any choice of h gives an implicit estimator for ΔF. Bennett showed [20, 9] that the minimum mean squared error is achieved for h ∝ (p_f + p_r)^{−1}, leading to the implicit equation

  Σ_i 1/(1 + Z exp{W_f^(i)}) = Σ_i 1/(1 + Z^{−1} exp{−W_r^(i)}).   (14)

By numerically solving equation (14) for Z, we obtain an estimator of the evidence based on Bennett's acceptance ratio. A straightforward way to solve the BAR equation is to iterate the multiplicative update Z^(t+1) = Z^(t) LHS(Z^(t))/RHS(Z^(t)), where LHS and RHS are the left- and right-hand sides of equation (14) and the superscript (t) indicates an iteration index. Algorithm 1 provides pseudocode to compute the BAR estimator (further details are given in the supplementary material).

Algorithm 1 Bennett's acceptance ratio (BAR)
Require: work values W_f^(i), W_r^(i) from M forward and reverse nonequilibrium simulations; tolerance δ (e.g. δ = 10^−4)
  Z ← (1/M) Σ_i exp{−W_f^(i)}   (Jarzynski estimator)
  repeat
    LHS ← Σ_i 1/(1 + Z exp{W_f^(i)}),   RHS ← Σ_i 1/(1 + Z^{−1} exp{−W_r^(i)})
    Z ← Z · LHS/RHS
  until |log(LHS/RHS)| < δ
  return Z
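A direct Python transcription of Algorithm 1 might look as follows; this is a sketch of mine, which works in log space and uses the logistic function for numerical stability rather than the raw ratios.

```python
import numpy as np
from scipy.special import expit   # expit(x) = 1/(1 + exp(-x))

def bar_log_evidence(W_f, W_r, tol=1e-4, max_iter=10_000):
    """Bennett's acceptance ratio (Algorithm 1), iterated in log space."""
    # initialize with the Jarzynski / forward AIS estimator
    log_Z = np.logaddexp.reduce(-W_f) - np.log(len(W_f))
    for _ in range(max_iter):
        lhs = expit(-(W_f + log_Z)).sum()   # sum_i 1/(1 + Z exp{W_f^(i)})
        rhs = expit(W_r + log_Z).sum()      # sum_i 1/(1 + exp{-W_r^(i)}/Z)
        log_Z += np.log(lhs) - np.log(rhs)  # multiplicative update for Eq. (14)
        if abs(np.log(lhs) - np.log(rhs)) < tol:
            break
    return log_Z
```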
5.3 Histogram estimator

Here we introduce yet another way of combining forward/backward simulations to estimate the model evidence. According to the CFT, we have

  W_f^(i) ~ p_f(W),   W_r^(i) ~ p_r(W) = p_f(W) e^{−W}/Z.

The idea is to combine all samples W_f^(i) and W_r^(i) to estimate p_f, from which we can then obtain the evidence by using the JE (Eq. 9). Thanks to the CFT, the samples from the reverse simulation contribute most strongly to the integral in the JE. Therefore, if we can use the reverse paths to estimate the forward work distribution, p_f will be quite accurate in the region that is most relevant for evaluating the JE. Estimating p_f from W_f^(i) and W_r^(i) is mathematically equivalent to estimating the density of states (DOS) (i.e. the marginal distribution of the log likelihood) from equilibrium simulations run at the two inverse temperatures β = 0 and β = 1. We can therefore directly apply histogram techniques [14, 22] developed for equilibrium simulations to estimate p_f from nonequilibrium simulations (details are given in the supplementary material). Histogram techniques result in a non-parametric estimate of the work distribution:

  p_f(W) ≈ Σ_j p_j δ(W − W_j)   (15)

where all sampled work values W_f^(i) and W_r^(i) are combined into a single set {W_j}, and the p_j are normalized weights associated with each W_j. Using the JE, we obtain

  Ẑ = Σ_j p_j e^{−W_j}   (16)

which is best evaluated in log space. The histogram iterations [14] used to determine the p_j and Z are very similar to the multiplicative updates that solve the BAR equation (Eq. 14). After running the histogram iterations, we obtain a non-parametric maximum likelihood estimate of p_f (Eq. 15). It is also possible to carry out a Bayesian analysis and derive a Gibbs sampler for p_f, which does not only provide a point estimate of log Z but also quantifies its uncertainty (see supplementary material for details).

Figure 2: Performance of evidence estimators on the Gaussian toy model. M = 100 forward and reverse simulations were run for schedules of increasing length K. This experiment was repeated 1000 times to probe the stability of the estimators. Shown is the difference between the log evidence estimate and its true value, −log 10. The average over all repetitions is shown as a red line; the light band indicates one standard deviation over all repetitions. (A) Cumulant estimator (Eq. 12) based on the forward simulation. (B) The combined cumulant estimator (Eq. 13). (C) Forward AIS estimator. (D) Reverse AIS. (E) BAR. (F) Histogram estimator.

We studied the performance of the evidence estimators on forward/backward simulations of the Gaussian toy model. The cumulant estimators (Figs. 2A, B) are systematically biased in the case of rapid annealing (small K). The combined cumulant estimator (Fig. 2B) is a significant improvement over the forward estimator, which does not take the reverse simulation data into account. The forward and reverse AIS estimators are shown in Figs. 2C and 2D. For this system, the evidence estimates derived from the reverse simulation are systematically more accurate than the AIS estimates based on the forward simulation, which is expected given that the work distribution from reverse simulations, p_r, is much more concentrated than the forward work distribution p_f (see Fig. 1A). The most accurate, least biased and most stable estimators are BAR (Fig. 2E) and the histogram estimator (Fig. 2F), which both combine forward and backward simulations into a single evidence estimate.

6 Experiments

We studied the performance of the nonequilibrium marginal likelihood estimators on various challenging probabilistic models, including Markov random fields and Gaussian mixture models. A Python package implementing the work simulations and evidence estimators can be downloaded from https://github.com/michaelhabeck/paths.

6.1 Ising model

Our first test system is a 32 × 32 Ising model for which the log evidence can be computed exactly: log Z = 1339.27 [23]. A single configuration consists of 1024 spins x_i = ±1. The energies of the intermediate distributions are E_k(x) = −β_k Σ_{⟨ij⟩} x_i x_j, where ⟨ij⟩ indicates nearest neighbors on a 2D square lattice. We generate M = 1000 forward and reverse paths using a linear inverse temperature schedule that interpolates between β_0 = 0 and β_K = 1 with K = 1000. Forward simulations start from random spin configurations. For the reverse simulations, we start in one of the two ground states with all spins either −1 or +1. The T_k are Metropolis kernels based on p_k: a new spin configuration is proposed by flipping a randomly selected spin and accepted or rejected according to the Metropolis rule. The single spin-flip transitions are repeated N times at constant β_k, i.e. N is the number of equilibration steps after a perturbation was induced by lowering the temperature. The larger N, the more time we allow the simulation to equilibrate, and the closer q_k will be to p_k.

Figure 3: Evidence estimation for the 32 × 32 Ising model. (A) Work distributions obtained for K = 1000 annealing and N = 1000 equilibration steps. (B) Average energies ⟨E⟩_f and ⟨E⟩_r at different annealing steps k in comparison to the average energy ⟨E⟩_β of the stationary distribution. Shown is a zoom into the inverse temperature range from 0.4 to 0.7; the average energies agree quite well outside this interval. (C) Evidence estimates for an increasing number of equilibration steps N. Light/dark blue: lower/upper bounds ⟨log w⟩_f / ⟨log w⟩_r; light/dark green: forward/reverse AIS estimators log⟨w⟩_f / log⟨w⟩_r; light red: BAR; dark red: histogram estimator. For N > 1000, BAR and the histogram estimator produce virtually identical evidence estimates.

Figure 3A shows the work distributions obtained with N = 1000 equilibration steps per temperature perturbation. Even though the forward and reverse work distributions overlap only weakly, the evidence estimates obtained with BAR and the histogram estimator are quite accurate: 1338.05 (BAR) and 1338.28 (histogram estimator), which differ by only approximately 1 nat from the true log evidence, corresponding to relative errors of 9 × 10^−4 (BAR) and 7 × 10^−4 (histogram estimator). Forward and reverse AIS provide less accurate estimates of the log evidence: 1333.66 (AIS) and 1342.05 (RAISE). The lower and upper bounds are very broad, ⟨log w⟩_f = 1290.5 and ⟨log w⟩_r = 1352.0, which results from hysteresis effects. Figure 3B zooms into the average energies obtained during the forward and reverse simulations and compares them with the average energy of a fully equilibrated simulation. The average energies differ most strongly at inverse temperatures close to the critical value β_crit ≈ 0.44, at which the Ising model undergoes a second-order phase transition. We also tested the performance of the estimators as a function of the number of equilibration steps N. As already observed for the Gaussian toy model, BAR and the histogram estimator outperform the Jarzynski estimators (AIS and RAISE) also in the case of the Ising model (see Fig. 3C).
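For reference, here is a compact sketch (mine, not the paper's package) of the forward annealing run just described: single-spin-flip Metropolis with a linear β schedule. Periodic boundaries and the reading of one "equilibration step" as one attempted flip are my assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
L = 32  # 32 x 32 lattice, periodic boundaries (my assumption)

def delta_E(x, i, j):
    """Energy change of E(x) = -sum_<ij> x_i x_j when flipping spin (i, j)."""
    nn = (x[(i - 1) % L, j] + x[(i + 1) % L, j]
          + x[i, (j - 1) % L] + x[i, (j + 1) % L])
    return 2.0 * x[i, j] * nn

def forward_work(K=1000, N=10):
    """One forward path; returns W = sum_k (beta_{k+1} - beta_k) E(x_k)."""
    betas = np.linspace(0.0, 1.0, K + 1)
    x = rng.choice([-1, 1], size=(L, L))       # random configuration at beta = 0
    # total energy, each nearest-neighbor bond counted once
    E = -np.sum(x * (np.roll(x, 1, axis=0) + np.roll(x, 1, axis=1)))
    W = 0.0
    for b_old, b_new in zip(betas[:-1], betas[1:]):
        W += (b_new - b_old) * E               # work from lowering the temperature
        for _ in range(N):                     # Metropolis moves at constant b_new
            i, j = rng.integers(L, size=2)
            dE = delta_E(x, i, j)
            if np.log(rng.uniform()) < -b_new * dE:
                x[i, j] = -x[i, j]
                E += dE
    return W

W_f = np.array([forward_work() for _ in range(10)])
```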
6.2 Ten-state Potts model

Next we performed simulations of the ten-state Potts model defined over a 32 × 32 lattice, where the spins of the Ising model are replaced by integer colors x_i ∈ {1, ..., 10} and the interaction energy is −2δ(x_i, x_j). This model is significantly more challenging than the Ising model, because it undergoes a first-order phase transition and has an astronomically larger number of states (10^1024 colorings rather than 2^1024 spin configurations). We performed forward/backward simulations using a linear inverse temperature schedule with β_0 = 0, β_K = 1 and a fixed computational budget K·N = 10^9.

Figure 4: Evidence estimation for the Potts model and the RBM. (A) Estimated log evidence of the Potts model for a fixed computational budget K·N = 10^9, where M = 100 and ten repetitions were computed. The reference value log Z = 1742 (obtained with parallel tempering) is shown as a dashed black line. (B) log Z distributions obtained with the Gibbs sampling version of the histogram estimator for K = 1000 and a varying number of equilibration steps. (C) Work distributions obtained for a marginal and a full RBM (light/dark blue: forward/reverse simulation of the marginal model; light/dark green: forward/reverse simulation of the full model).

Figure 4A shows that there seems to be no advantage in increasing the number of intermediate distributions at the cost of reducing the number of equilibration steps. Again, BAR and the histogram estimator perform very similarly. The Gibbs sampling version of the histogram estimator also provides the posterior of log Z (see Fig. 4B). For too few equilibration steps N, this distribution is rather broad or even slightly biased, but for large N the log Z posterior concentrates around the correct log evidence.

6.3 Restricted Boltzmann machine

The restricted Boltzmann machine (RBM) is a common building block of deep learning hierarchies. The RBM is an intractable MRF with bipartite interactions: E(v, h) = −(aᵀv + bᵀh + vᵀWh), where a, b are the visible and hidden biases and W are the couplings between the visible and hidden units v_i and h_j. Here we compare annealing of the full model, E_k(v, h) = β_k E(v, h), against annealing of the marginal model, E_k(h) = −β_k log Σ_v exp{−E(v, h)}. The full model can be simulated using a Gibbs sampler, which is straightforward since the conditional distributions are Bernoulli. To sample from the marginal model, we use a Metropolis kernel similar to the one used for the Ising model. To start the reverse simulations, we randomly pick an image from the training set and generate an initial hidden state by sampling from the conditional distribution p(h|v). We then run 100 steps of Gibbs sampling with T_K to obtain a posterior sample.

We ran tests on an RBM with 784 visible and 500 hidden units trained on the MNIST handwritten digits dataset [24] with contrastive divergence using 25 steps [25]. Since the true log evidence is not known, we use a reference value obtained with parallel tempering (PT): log Z ≈ 451.42. Figure 4C compares evidence estimates based on annealing simulations of the full model against the marginal model. Both annealing approaches provide very similar evidence estimates, 451.43 (full model) and 451.48 (marginal model), that are close to the PT result. However, simulation of the marginal model is three times faster than that of the full model. Therefore, it seems beneficial to evaluate and train RBMs based on sampling and annealing of the marginal model p(h) rather than the full model p(v, h).
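The marginal energy E(h) is tractable because the sum over visible units factorizes for the bipartite RBM energy. The following sketch (mine; it assumes binary 0/1 units, which is the standard convention for this energy but is not stated in the text) shows the marginal energy and a single-flip Metropolis step of the kind used above.

```python
import numpy as np

def marginal_energy(h, a, b, W):
    """E(h) = -log sum_v exp{-E(v, h)} for binary 0/1 units (my assumption);
    the sum over visibles factorizes thanks to the bipartite structure."""
    return -(h @ b + np.sum(np.logaddexp(0.0, a + W @ h)))

def metropolis_step(h, beta, a, b, W, rng):
    """Single hidden-unit flip targeting f_k(h) = exp{-beta E(h)}.
    For clarity the marginal energy is recomputed from scratch; a real
    implementation would update W @ h incrementally after each flip."""
    j = rng.integers(len(h))
    h_new = h.copy()
    h_new[j] = 1.0 - h_new[j]
    dE = marginal_energy(h_new, a, b, W) - marginal_energy(h, a, b, W)
    return h_new if np.log(rng.uniform()) < -beta * dE else h

# tiny demo with random parameters (illustrative shapes: 784 visible, 500 hidden)
rng = np.random.default_rng(3)
a, b = 0.01 * rng.normal(size=784), 0.01 * rng.normal(size=500)
W = 0.01 * rng.normal(size=(784, 500))
h = rng.integers(0, 2, size=500).astype(float)
h = metropolis_step(h, 0.5, a, b, W, rng)
```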
6.4 Gaussian mixture model

Finally, we consider a sort of data annealing strategy in which independent data points are added one by one, as in sequential Monte Carlo [10]: E_k(x) = Σ_l