# Partition Functions from Rao-Blackwellized Tempered Sampling

David E. Carlson 1,2 (DAVID.EDWIN.CARLSON@GMAIL.COM)
Patrick Stinson 2 (PATRICKSTINSON@GMAIL.COM)
Ari Pakman 1,2 (ARI@STAT.COLUMBIA.EDU)
Liam Paninski 1,2 (LIAM@STAT.COLUMBIA.EDU)
1 Department of Statistics, 2 Grossman Center for the Statistics of Mind, Columbia University, New York, NY, 10027

*These authors contributed equally to this work. The order of the names was randomized.

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

Abstract

Partition functions of probability distributions are important quantities for model evaluation and comparisons. We present a new method to compute partition functions of complex and multimodal distributions. Such distributions are often sampled using simulated tempering, which augments the target space with an auxiliary inverse temperature variable. Our method exploits the multinomial probability law of the inverse temperatures, and provides estimates of the partition function in terms of a simple quotient of Rao-Blackwellized marginal inverse temperature probability estimates, which are updated while sampling. We show that the method has interesting connections with several alternative popular methods, and offers some significant advantages. In particular, we empirically find that the new method provides more accurate estimates than Annealed Importance Sampling when calculating partition functions of large Restricted Boltzmann Machines (RBM); moreover, the method is sufficiently accurate to track training and validation log-likelihoods during learning of RBMs, at minimal computational cost.

1. Introduction

The computation of partition functions (or equivalently, normalizing constants) and marginal likelihoods is an important problem in machine learning, statistics and statistical physics, and is necessary in tasks such as evaluating the test likelihood of complex generative models, calculating Bayes factors, or computing differences in free energies. There exists a vast literature exploring methods to perform such computations, and the popularity and usefulness of different methods change across different communities and domain applications. Classic and recent reviews include (Gelman & Meng, 1998; Vyshemirsky & Girolami, 2008; Marin & Robert, 2009; Friel & Wyse, 2012). In this paper we are interested in the particularly challenging case of highly multimodal distributions, such as those common in machine learning applications (Salakhutdinov & Murray, 2008). Our major novel insight is that simulated tempering, a popular approach for sampling from such distributions, also provides an essentially cost-free way to estimate the partition function. Simulated tempering allows sampling of multimodal distributions by augmenting the target space with a random inverse temperature variable and introducing a series of tempered distributions. The idea is that the fast MCMC mixing at low inverse temperatures allows the Markov chain to land in different modes of the low-temperature distribution of interest (Marinari & Parisi, 1992; Geyer & Thompson, 1995). As it turns out, (ratios of) partition functions have a simple expression in terms of ratios of the parameters of the multinomial probability law of the inverse temperatures.
These parameters can be estimated efficiently by averaging the conditional probabilities of the inverse temperatures along the Markov chain. This simple method matches state-of-the-art performance with minimal computational and storage overhead. Since our estimator is based on Rao-Blackwellized marginal probability estimates of the inverse temperature variable, we denote it Rao-Blackwellized Tempered Sampling (RTS).

In Section 2 we review the simulated tempering technique and introduce the new RTS estimation method. In Section 3, we compare RTS to Annealed Importance Sampling (AIS) and Reverse Annealed Importance Sampling (RAISE) (Neal, 2001; Burda et al., 2015), two popular methods in the machine learning community. We also show that RTS has a close relationship with the Multistate Bennett Acceptance Ratio (MBAR) (Shirts & Chodera, 2008; Liu et al., 2015) and Thermodynamic Integration (TI) (Gelman & Meng, 1998), two methods popular in the chemical physics and statistics communities, respectively. In Section 4, we illustrate our method in a simple Gaussian example and in a Restricted Boltzmann Machine (RBM), where it is shown that RTS clearly dominates over the AIS/RAISE approach. We also show that RTS is sufficiently accurate to track training and validation log-likelihoods of RBMs during learning, at minimal computational cost. We conclude in Section 5.

2. Partition Functions from Tempered Samples

In this section, we start by reviewing the tempered sampling approach and then introduce our procedure to estimate partition functions. We note that our approach is useful not only as a stand-alone method for estimating partition functions, but is essentially free in any application using tempered sampling. In this sense it is similar to importance sampling approaches to computing partition functions (such as AIS).

2.1. Simulated Tempering

Consider an unnormalized, possibly multimodal distribution proportional to f(x), whose partition function we want to compute. Our method is based on simulated tempering, a well known approach to sampling multimodal distributions (Marinari & Parisi, 1992; Geyer & Thompson, 1995). Simulated tempering begins with a normalized and easy-to-sample distribution p1(x) and augments the target distribution with a set of discrete inverse temperatures {0 = β1 < β2 < ... < βK = 1} to create a series of intermediate distributions between f(x) and p1(x), given by

p(x|βk) = fk(x) / Zk ,   (1)

where

fk(x) = f(x)^{βk} p1(x)^{1−βk} ,   (2)

and

Zk = ∫ fk(x) dx .   (3)

ZK is the normalizing constant that we want to compute. Note that we assume Z1 = 1 and p(x|β1) = p1(x). However, our method does not depend on this assumption. When performing model comparison through likelihood ratios or Bayes factors, both distributions f(x) and p1(x) can be unnormalized, and one is interested in the ratio of their partition functions. For the sake of simplicity, we consider here only the interpolating family given in (2); other possibilities can be used for particular distributions, such as moment averaging (Grosse et al., 2013) or tempering by subsampling (van de Meent et al., 2014).

When β ∈ {βk}_{k=1}^K is treated as a random variable, one can introduce a prior distribution r(βk) = rk, and define the joint distribution

p(x, βk) = p(x|βk) rk   (4)
         = fk(x) rk / Zk .   (5)

Unfortunately, Zk is unknown. Instead, suppose we know approximate values Ẑk. Then we can define

q(x, βk) ∝ fk(x) rk / Ẑk ,   (6)

which approximates p(x, βk).
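To make this setup concrete, the following is a minimal sketch of the tempered family (2) and the unnormalized joint (6). The function and variable names, as well as the toy target and base distribution, are illustrative assumptions, not the authors' code.

```python
import numpy as np

def log_f_k(x, beta_k, log_f, log_p1):
    """Tempered unnormalized log-density (2): f_k(x) = f(x)^{beta_k} * p1(x)^{1 - beta_k}."""
    return beta_k * log_f(x) + (1.0 - beta_k) * log_p1(x)

def log_q_joint(x, k, betas, log_r, log_Zhat, log_f, log_p1):
    """Unnormalized log of the approximate joint (6): q(x, beta_k) proportional to f_k(x) r_k / Zhat_k."""
    return log_f_k(x, betas[k], log_f, log_p1) + log_r[k] - log_Zhat[k]

# Toy usage: an unnormalized 1-D Gaussian target and a standard normal base distribution.
log_f = lambda x: -0.5 * ((x - 3.0) / 0.5) ** 2               # unnormalized target f(x)
log_p1 = lambda x: -0.5 * x ** 2 - 0.5 * np.log(2.0 * np.pi)  # normalized base p1(x)
betas = np.linspace(0.0, 1.0, 10)
log_r = np.full(10, -np.log(10.0))      # uniform prior r_k = 1/K
log_Zhat = np.zeros(10)                 # initial guesses Zhat_k = 1
print(log_q_joint(0.7, 3, betas, log_r, log_Zhat, log_f, log_p1))
```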
We note that the distribution q depends explicitly on the parameters Ẑk. A Gibbs sampler is run on this distribution by alternating between samples from x|β and β|x. The latter is given by

q(βk|x) = [fk(x) rk / Ẑk] / [Σ_{k′=1}^K f_{k′}(x) r_{k′} / Ẑ_{k′}] .   (7)

Sampling as such enables the chain to traverse the inverse temperature ladder stochastically, escaping local modes under low β and collecting samples from the target distribution f(x) when β = 1 (Marinari & Parisi, 1992). When K is large, few samples will have β = 1. Instead, an improved strategy to estimate expectations of functions over the target distribution is to Rao-Blackwellize, or importance sample, based on (7) to use all sample information (Geyer & Thompson, 1995).

2.2. Estimating Partition Functions

Letting Ẑ1 = Z1 = 1, we first note that by integrating out x in (6) and normalizing, the marginal distribution over the βk's is

q(βk) = [rk Zk / Ẑk] / [Σ_{k′=1}^K r_{k′} Z_{k′} / Ẑ_{k′}] .   (8)

Note that if Ẑk is not close to Zk for all k, the marginal probability q(βk) will differ from the prior rk, possibly by orders of magnitude for some k's, and the βk's will not be efficiently sampled. One approach to compute approximate Ẑk values is the Wang-Landau algorithm (Wang & Landau, 2001; Atchade & Liu, 2010). We use an iterative strategy, discussed in Section 2.4.

Given samples {x^(i), βk(i)} generated from q(x, βk), the marginal probabilities above can simply be estimated by the normalized counts for each bin βk, (1/N) Σ_{i=1}^N δ_{k,k(i)}. But a lower-variance estimator can be obtained by the Rao-Blackwellized form (Robert & Casella, 2013)

ĉk = (1/N) Σ_{i=1}^N q(βk|x^(i)) .   (9)

Figure 1. Comparison of log Ẑk and log ĉk estimates, in some of the first eight iterations of the initialization procedure described in Section 2.4, with and without Rao-Blackwellization, with K = 100. The initial values were Ẑk = 1 for all k, and the prior was uniform, rk = 1/K. The model is an RBM with 784 visible and 10 hidden units, trained on the MNIST dataset. Each iteration consists of 50 Gibbs sweeps, on each of 100 parallel chains. Since in the non-Rao-Blackwellized case the updates are unstable and sometimes infinite, for demonstration purposes only we define ĉk ∝ 0.1 + Σ_{i=1}^N δ_{k,k(i)} and normalize. Note that in the Rao-Blackwellized case, the values of ĉk in the final iteration are very close to those of rk, signaling that the Ẑk's are good enough for a last, long MCMC run to obtain the final Ẑk estimates.

The estimates in (9) are unbiased estimators of (8), since

q(βk) = ∫ q(βk|x) q(x) dx .   (10)

Our main idea is that the exact partition function can be expressed by ratios of the marginal distribution in (8),

Zk = Ẑk (r1/rk) (q(βk)/q(β1)) ,   k = 2, . . . , K .   (11)

Plugging our estimates ĉk of q(βk) into (11) immediately gives us the consistent estimator

Ẑ^RTS_k = Ẑk (r1/rk) (ĉk/ĉ1) ,   k = 2, . . . , K .   (12)

The resulting procedure is outlined in Algorithm 1.

2.3. Rao-Blackwellized Likelihood Interpretation

We can alternatively derive (12) by optimizing a Rao-Blackwellized form of the marginal likelihood.
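Before turning to that derivation, a short sketch may help fix ideas: the conditional (7), the Rao-Blackwellized averages (9), and the quotient estimator (12) are the only ingredients RTS needs. The names below, and the use of Δ_x = log f(x) − log p1(x) as the per-sample summary, are illustrative choices rather than the authors' implementation.

```python
import numpy as np

def log_q_beta_given_x(delta_x, betas, log_r, log_Zhat):
    """Conditional (7) over the temperature index for one sample, written in terms of
    Delta_x = log f(x) - log p1(x); the common log p1(x) term cancels in the normalization."""
    logits = betas * delta_x + log_r - log_Zhat
    return logits - np.logaddexp.reduce(logits)

def rts_log_Z(delta_xs, betas, log_r, log_Zhat):
    """Rao-Blackwellized marginal estimates (9) followed by the RTS quotient estimator (12)."""
    c_hat = np.zeros(len(betas))
    for dx in delta_xs:                       # one term per tempered MCMC sample x^(i)
        c_hat += np.exp(log_q_beta_given_x(dx, betas, log_r, log_Zhat))
    c_hat /= len(delta_xs)                    # c_hat_k estimates q(beta_k)
    # (12): Zhat_k^RTS = Zhat_k * (r_1 / r_k) * (c_hat_k / c_hat_1), with Zhat_1 = Z_1 = 1
    return log_Zhat + log_r[0] - log_r + np.log(c_hat) - np.log(c_hat[0])

# Toy usage; in practice delta_xs would come from a tempered sampler, not random numbers.
betas = np.linspace(0.0, 1.0, 10)
log_r = np.full(10, -np.log(10.0))            # uniform prior r_k = 1/K
log_Zhat = np.zeros(10)                       # current guesses, Zhat_k = 1
delta_xs = np.random.default_rng(1).normal(2.0, 0.5, size=500)
print(rts_log_Z(delta_xs, betas, log_r, log_Zhat))
```

Only the running averages ĉk need to be stored, which is what makes the estimator essentially free inside an existing tempered sampler.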
From (8), the log-likelihood of the {βk(i)} samples is

log q({βk(i)}_{i=1}^N) = Σ_{i=1}^N log Z_{k(i)} − N log [Σ_{k=1}^K rk Zk / Ẑk] + const .   (13)

Because βk(i) was sampled from q(β|x^(i)), we can reduce variance by Rao-Blackwellizing the first sum in (13), resulting in

L_RB[Z] = Σ_{i=1}^N Σ_{k=2}^K log(Zk) q(βk|x^(i)) − N log [Σ_{k=1}^K rk Zk / Ẑk] + const
        = N Σ_{k=2}^K log(Zk) ĉk − N log [Σ_{k=1}^K rk Zk / Ẑk] + const .   (14)

The normalizing constants are estimated by maximizing (14) subject to a fixed Z1, which is known. Setting the derivatives of (14) w.r.t. the Zk's to zero gives the system of linear equations

ĉk / Zk = rk ĉ1 / (r1 Ẑk) ,   k = 2, . . . , K ,

whose solution is (12).

Algorithm 1 Rao-Blackwellized Tempered Sampling
  Input: {βk, rk}_{k=1,...,K}, N
  Initialize log Ẑk, k = 2, ..., K
  Initialize β ∈ {β1, ..., βK}
  Initialize ĉk = 0, k = 1, ..., K
  for i = 1 to N do
    Transition in x leaving q(x|β) invariant.
    Sample β|x ∼ q(β|x).
    Update ĉk ← ĉk + (1/N) q(βk|x).
  end for
  Update Ẑ^RTS_k ← Ẑk r1 ĉk / (rk ĉ1), k = 2, ..., K.

2.4. Initial Iterations

As mentioned above, the chain with initial Ẑk's may mix slowly and provide a poor estimator (i.e., βk's with small q(βk) are rarely sampled). Therefore, when the Ẑk's are far from the Zk's (or equivalently, the ĉk's are far from the rk's), the Ẑk estimates should be updated. Our estimator in (12) does not directly handle the case where Ẑk is sequentially updated. We note that the likelihood approach of (14) is easily adapted to this case and can be numerically optimized (see Appendix A for details). A simpler, less computationally intensive, and equally effective strategy is as follows: start with Ẑk = 1 for all k (or a better estimate, if known), and iterate between estimating ĉk with a few MCMC samples and updating Ẑk with the estimated Ẑ^RTS_k using (12). In our experiments using many parallel Markov chains, this procedure worked best when the updated Markov chains started from the last x's of the previous iteration and fresh, uniformly sampled βk's. Once the Ẑk estimates are close enough to the Zk's to facilitate mixing, a long MCMC chain can be run to provide samples for the estimator.

Because ĉk estimates q(βk), and q(βk) ≈ rk when Ẑk ≈ Zk, a simple stopping criterion for the initial iterations is to check the similarity between ĉk and rk. For example, if we use a uniform prior rk = 1/K, a practical rule is to iterate the few-samples chains until max_k |rk − ĉk| < 0.1/K. Figure 1 shows the values taken by Ẑk and ĉk in these initial iterations in a simple example. The figure also illustrates the importance of using the Rao-Blackwellized form (9) for ĉk, which dramatically reduces the noise relative to the count-based estimator (1/N) Σ_{i=1}^N δ_{k,k(i)} of q(βk).

2.5. Bias and Variance

In Appendix B, we show that the bias and variance of log Ẑk using Eqn. (12) can be approximated by

E[log Ẑ^RTS_k] − log Zk ≈ (1/2) (σ1²/ĉ1² − σk²/ĉk²)   (15)

and

Var[log Ẑ^RTS_k] ≈ σ1²/ĉ1² + σk²/ĉk² − 2 σ1k/(ĉk ĉ1) ,   (16)

where σ1² = Var[ĉ1], σk² = Var[ĉk], and σ1k = Cov[ĉ1, ĉk]. This shows that the bias of log Ẑk has no definite sign. This is in contrast to many popular methods, such as AIS, which underestimates log Zk (Neal, 2001), and RAISE, which overestimates log Zk (Burda et al., 2015).

3. Related Work

In this section, we briefly review some popular estimators and explore their relationship to the proposed RTS estimator (12). All the estimators below use a family of tempered distributions, as appropriate for multimodal distributions.
In some cases the temperatures are fixed parameters, while in others they are random variables. Note that RTS belongs to the latter group, and relies heavily on the random nature of the temperatures.

3.1. Wang-Landau

A well-known approach to obtain approximate values of the Zk's is the Wang-Landau algorithm (Wang & Landau, 2001; Atchade & Liu, 2010). The setting is similar to ours, but the algorithm constantly modifies the Ẑk's along the Markov chain as different βk's are sampled. The factors that change the Ẑk's asymptotically converge to 1. The resulting Ẑk estimates are usually good enough to allow mixing in the (x, β) space (Salakhutdinov, 2010), but are too noisy for purposes such as likelihood estimation (Tan, 2015).

3.2. AIS/RAISE

Annealed Importance Sampling (AIS) (Neal, 2001) is perhaps the most popular method in the machine learning literature to estimate log ZK. Here, one starts from a sample x1 from p1(x), and samples a point x2 using a transition function K2(x2|x1) that leaves f2(x) invariant. The process is repeated until one has sampled xK using a transition function that leaves f(x) invariant. The vector (x1, x2, ..., xK) is interpreted as a sample from an importance distribution on an extended space, while the original distribution p(xK) can be similarly augmented into an extended space. The resulting importance weight can be computed in terms of quotients of the fk's, and provides an unbiased estimator for ZK/Z1, whose variance decreases linearly with K. Note that the inverse temperatures in this approach are not random variables.

The variance of the AIS estimator can be reduced by averaging over several runs, but the resulting value of log(ẐK) has a negative bias due to Jensen's inequality. This in turn results in a positive bias when estimating data log-likelihoods. Recently, a related method, called Reverse Annealed Importance Sampling (RAISE), was proposed to estimate the data log-likelihood in models with latent variables, giving negatively biased estimates (Burda et al., 2015; Grosse et al., 2015). The method performs a similar sampling as AIS, but starts from a sample of the latent variables at βK = 1 and then proceeds to lower inverse temperatures. In certain cases, such as in the RBM examples we consider in Section 4.2, one can obtain from these estimates of the data log-likelihood an estimate of the partition function, which will have a positive bias. The combination of the expectations of the AIS and RAISE estimators thus sandwiches the exact value (Burda et al., 2015; Grosse et al., 2015).

3.3. BAR/MBAR

Bennett's acceptance ratio (BAR) (Bennett, 1976), also called bridge sampling (Meng & Wong, 1996), is based on the identity

Zk / Z1 = E_{p(x|β1)}[α(x) fk(x)] / E_{p(x|βk)}[α(x) f1(x)] ,   (17)

where α(x) is an arbitrary function such that 0 < ∫ f1(x) fk(x) α(x) dx < ∞, which can be chosen to minimize the asymptotic variance. BAR has been generalized to estimate partition functions when sampling from multiple distributions, a method termed the multistate BAR (MBAR) (Shirts & Chodera, 2008).

Assuming that there are nk i.i.d. samples for each inverse temperature βk (N samples {xi}_{i=1,...,N} in total), and Δ_x = log f(x) − log p1(x), the MBAR partition function estimates can be obtained by maximizing the log-likelihood function (Tan et al., 2012):

−(1/N) Σ_{i=1}^N log [Σ_{k=1}^K (nk/N) exp(−log Zk + βk Δ_{x_i})] − Σ_{r=1}^K (nr/N) log Zr .   (18)

This method was recently rediscovered and shown to compare favorably against AIS/RAISE in (Liu et al., 2015).
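As a concrete sketch of this approach, the snippet below maximizes the likelihood in (18), as reconstructed above and up to constants that do not affect the maximizer, with a generic quasi-Newton optimizer. It is an illustrative sketch under those assumptions, not a production MBAR implementation, and the names are made up.

```python
import numpy as np
from scipy.optimize import minimize

def mbar_log_likelihood(log_Z, delta_xs, betas, n_k):
    """Per-sample form of the MBAR/UWHAM log-likelihood (18); log_Z[0] is held at 0 (Z_1 = 1)."""
    N = len(delta_xs)
    w = n_k / N                                              # mixture weights n_k / N
    logits = np.outer(delta_xs, betas) - log_Z + np.log(w)   # shape (N, K)
    mix = np.logaddexp.reduce(logits, axis=1)                # log of the inner sum per sample
    return -mix.mean() - np.sum(w * log_Z)

def mbar_estimate(delta_xs, betas, n_k):
    """Maximize (18) over log Z_2, ..., log Z_K with Z_1 = 1 fixed."""
    K = len(betas)
    def neg_obj(free):
        return -mbar_log_likelihood(np.concatenate(([0.0], free)), delta_xs, betas, n_k)
    res = minimize(neg_obj, np.zeros(K - 1), method="L-BFGS-B")
    return np.concatenate(([0.0], res.x))                    # estimated log Z_k, k = 1, ..., K
```

Fixing log Z1 = 0 removes the invariance of (18) under a common shift of all log Zk, so the maximizer is well defined.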
MBAR has many different names in different literatures, e.g. the unbinned weighted histogram analysis method (UWHAM) (Tan et al., 2012) and reverse logistic regression (Geyer, 1994). Unlike RTS, MBAR does not use the form of q(β) when estimating the partition function. As a price associated with this increased generality, MBAR requires the storage of all collected samples, and the estimator is calculated by finding the maximum of (18). This likelihood function does not have an analytic solution, and Newton-Raphson was proposed to iteratively solve the problem, at a cost of O(NK² + K³) per iteration. While RTS is less general than MBAR, RTS has an analytic solution and only requires the storage of the ĉk statistics. We note that this objective function is very similar to the one discussed in Appendix A for combining different Ẑk's.

Recent work has proposed a stochastic learning algorithm based on MBAR/UWHAM (Tan et al., 2016), with updates based on the sufficient statistics ĉk given by

log Ẑ_k^(t+1) = log Ẑ_k^(t) + γt (ĉ_k^(t) − rk) / rk .   (19)

The step size is recommended to be set to γt = t^{−1}. Note the similarity with our estimator from (12) in log space, with log(ĉk/rk) as the update. We empirically found that when the Ẑk's are far away from the truth, our update (12) dominates over (19). Because the first-order Taylor series approximation to our estimator is the same as the term in (19), when ĉk ≈ rk the updates will essentially only differ by the step size γt.

We also note that there is a particularly interesting relationship between the cost function for MBAR and the cost function for RTS. Note that E_q[nk/N] is equal to q(βk) for tempered sampling. If the values of nk/N in (18) are replaced by their expectation, the maximizer of (18) is equal to the RTS estimator given in (12). We detail this equivalence in Appendix D. Hence, the similarity of MBAR and RTS will depend on how far the empirical counts vary from their expectation. In our experiments, this form of extra information empirically helps to improve estimator accuracy.

3.4. Thermodynamic Integration

Thermodynamic Integration (TI) (Gelman & Meng, 1998) is derived from basic calculus identities. Let us first assume that β is a continuous variable in [0, 1]. We again define Δ_x = log f(x) − log p1(x), and fβ(x) = f(x)^β p1(x)^{1−β}. We note that

d/dβ log Z(β) = (1/Z(β)) ∫ (d/dβ) fβ(x) dx = E_{x|β}[Δ_x] ,   (20)

which yields

log ZK − log Z1 = ∫_0^1 E_{x|β}[Δ_x] dβ = E_{p(x|β)p(β)}[Δ_x / p(β)] .

This equation holds for any p(β) that is positive over the range [0, 1], and provides an unbiased estimator for log Zk if unbiased samples from p(x|β) are available. This is in contrast to AIS, which is unbiased on Zk, and biased on log Zk. Given samples {x^(i), β^(i)}_{i=1,...,N}, the estimator for log ZK is

log ẐK = log Z1 + (1/N) Σ_{i=1}^N Δ_{x^(i)} / p(β^(i)) .

There are two distinct approaches for generating samples and performing this calculation in TI. First, β can be sampled from a prior p(β), and samples are generated from fβ(x) to estimate the gradient at the current point in β space. A second approach is to use samples generated from simulated tempering, which can facilitate mixing. However, the effective marginal distribution q(β) must be estimated in this case.

When β consists of a discrete set of inverse temperatures, the integral can be approximated by the trapezoidal or Simpson's rule. In essence, this uses the formulation in (20), together with standard numerical integration techniques. Recently, higher-order moments were used to improve this integration, which can help in some cases (Friel et al., 2014).
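A minimal sketch of this discrete-ladder computation, assuming per-temperature Monte Carlo estimates of E_{x|βk}[Δ_x] have already been collected, is given below; all names are illustrative.

```python
import numpy as np

def ti_log_Z(betas, mean_delta):
    """Thermodynamic integration over a discrete ladder: trapezoidal approximation of
    log Z_K - log Z_1 = integral_0^1 E_{x|beta}[Delta_x] d beta, using per-temperature
    Monte Carlo estimates mean_delta[k] of E_{x|beta_k}[Delta_x]."""
    widths = np.diff(betas)
    return np.sum(0.5 * (mean_delta[1:] + mean_delta[:-1]) * widths)

# Toy usage: a made-up ladder and made-up per-temperature gradient estimates.
betas = np.linspace(0.0, 1.0, 11)
mean_delta = 2.0 + 3.0 * betas        # in practice, sample averages of Delta_x at each beta_k
print(ti_log_Z(betas, mean_delta))    # trapezoid is exact here since the integrand is linear
```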
As noted by (Calderhead & Girolami, 2009), this discretization error can be expressed as a sum of KL-divergences between neighboring intermediate distributions. If the KL-divergences are known, an optimal discretization strategy can be used. However, this is unknown in general.

While the point of this paper is not to improve the TI approach, we note that the Rao-Blackwellization technique we propose also applies to TI when using tempered samples. The Monte Carlo approximation of the gradient (20) then becomes

d/dβ log Z(β)|_{β=βk} ≈ [Σ_{i=1}^N q(βk|xi) Δ_{x_i}] / [Σ_{j=1}^N q(βk|xj)] .   (21)

This reduces the noise on the gradient estimates, and improves performance when the number of bins is relatively high compared to the number of collected samples. We refer to this technique as TI-Rao-Blackwell (TI-RB).

TI-RB is further interesting in the context of RTS, because of a surprising relationship: in the continuous β limit, RTS and TI-RB are equivalent estimators. However, when using discrete inverse temperatures, RTS does not suffer from the discretization error that TI and TI-RB do. We show the derivation of this relationship in Appendix C, but give a quick description here. First, let the inverse temperature β take continuous values. Replacing the index k by β in (12), the RTS estimator can be written as

log Ẑ^RTS_K = ∫_0^1 (d/dβ) [log ĉβ − log rβ + log Ẑβ] dβ
            = ∫_0^1 [Σ_i q(β|xi) Δ_{x_i} / Σ_j q(β|xj)] dβ .   (22)

Note that the integrand of (22) is exactly identical to the TI-RB gradient estimate from the samples given in (21). After integration, the estimators will be identical. We stress that while RTS and TI-RB are equivalent in the continuous limit, in the discrete case RTS does not suffer from discretization error. RTS is also limited to the case where samples are generated by the joint tempered distribution q(x, β); however, because it does not suffer from discretization error, we empirically demonstrate that RTS is much less sensitive to the number of temperatures than TI (see Section 4.3).

Parallels between other methods and Thermodynamic Integration can be drawn as well. As noted in (Neal, 2005), the log importance weight for AIS can be written as

log w = Σ_{k=2}^K (βk − βk−1) Δ_{x_k} ,   (23)

and thus can be thought of as a Riemann sum approximation to the numerical integral under a particular sampling approach.

4. Examples

In this section, we study the ability of RTS to estimate partition functions in a Gaussian mixture model and in Restricted Boltzmann Machines, and compare to estimates from popular existing methods. We also study the dependence of several methods on the number K of inverse temperatures, and show that RTS can provide estimates of train- and validation-set likelihoods during RBM training at minimal cost. The MBAR estimates used for comparison in this section were calculated with the pymbar package (code available from https://github.com/choderalab/pymbar).

Figure 2. Comparison of log Z estimation performance on a toy Gaussian mixture model, using an RMSE from 10 repeats, as a function of the number of samples. TI Riemann approximates the discrete integral as a right Riemann sum, TI trap uses the trapezoidal method, TI trap corrected uses a variance correction technique developed in (Friel et al., 2014), and TI RB uses a Rao-Blackwellized version of TI discussed in Appendix C.
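For concreteness alongside the estimators compared in Figure 2, here is a minimal sketch of the Rao-Blackwellized TI gradient (21) and the AIS log weight (23). The array names and shapes are illustrative assumptions; the inputs would come from whatever tempered or annealed sampler is in use.

```python
import numpy as np

def ti_rb_gradients(delta_xs, q_beta_given_x):
    """Rao-Blackwellized TI gradient estimates (21): one estimate of d/dbeta log Z(beta)
    at each beta_k, from tempered samples.
    delta_xs       : Delta_x for each sample, shape (N,)
    q_beta_given_x : conditional probabilities q(beta_k | x_i), shape (N, K)"""
    num = q_beta_given_x.T @ delta_xs      # sum_i q(beta_k | x_i) Delta_{x_i}
    den = q_beta_given_x.sum(axis=0)       # sum_j q(beta_k | x_j)
    return num / den

def ais_log_weight(betas, deltas_along_path):
    """AIS log importance weight (23) for one annealing run: deltas_along_path[k-2] is
    Delta_x evaluated at the state x_k visited at inverse temperature beta_k, k = 2, ..., K."""
    return float(np.sum(np.diff(betas) * deltas_along_path))
```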
4.1. Gaussian Mixture Example and Comparisons

Figure 2 compares the performance of RTS to several methods, including MBAR, TI, and its variants, in a mixture of two 10-dimensional Gaussians (see Appendix E.1 for specific details). The sampling for all methods was performed using a novel adaptive Hamiltonian Monte Carlo method for tempered distributions of continuous variables, introduced in Appendix E. In this case the exact partition function can be numerically estimated to high precision. Note that all estimators give nearly identical performance; however, our method is the simplest to implement and use for tempered samples, with minimal memory and computation requirements.

4.2. Partition Functions of RBMs

The Restricted Boltzmann Machine (RBM) is a bipartite Markov Random Field model popular in the machine learning community (Smolensky, 1986). For the binary case, this is a generative model over visible observations v ∈ {0, 1}^M and latent features h ∈ {0, 1}^J defined by log f(v, h) = v^T c + v^T W h + h^T b, for parameters c ∈ R^M, b ∈ R^J, and W ∈ R^{M×J}. A fundamental performance measure of this model is the log-likelihood of a test set, which requires the estimation of the log partition function. Both AIS (Salakhutdinov & Murray, 2008) and RAISE (Burda et al., 2015) were proposed to address this issue.

We evaluate performance using the bias and the root mean squared error (RMSE) of the estimator. To estimate truth, we take the true mean to be the average of estimates from AIS and RTS with 10^6 samples from 100 parallel chains. We note the variance of these estimates was very low (≈ 0.006).

Figure 3. Mean and root mean squared error (RMSE) of competing estimators of log ZK evaluated on RBMs with 784 visible units trained on the MNIST dataset. The numbers of hidden units were 500 (Left and Middle Left) and 100 (Middle Right and Right). In both cases, the bias from RTS decreases more quickly than that of AIS and RAISE, and the RMSE of AIS does not approach that of RTS at 1000 Gibbs sweeps until over an order of magnitude later. Each method is run on 100 parallel Gibbs chains, but the Gibbs sweeps on the horizontal axis correspond to each individual chain.

Figure 3 shows a comparison of RTS versus AIS/RAISE on two RBMs trained on the binarized MNIST dataset (M = 784, N = 60000), with 500 and 100 hidden units. The former was taken from (Salakhutdinov & Murray, 2008) (code and parameters available from http://www.cs.toronto.edu/~rsalakhu/rbm_ais.html), while the latter was trained with the method of (Carlson et al., 2015b). In all cases we used for p1 a product of Bernoulli distributions over the v variables which matches the marginal statistics of the training dataset, following (Salakhutdinov & Murray, 2008). We ran each method (RTS, AIS, RAISE) with 100 parallel Gibbs chains. In RTS, the number of inverse temperatures was fixed at K = 100, and we performed 10 initial iterations of 50 Gibbs sweeps each, following Section 2.4. In AIS/RAISE, the number of inverse temperatures K was set to match in each case the total number of Gibbs sweeps in RTS, so the comparisons in Figure 3 correspond to matched computational costs. We note that the performance of RAISE is similar to the plots shown in (Burda et al., 2015) for these parameters.
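For these RBM experiments, the only model-specific quantity RTS needs per sample is Δ_v = log f(v) − log p1(v), since the hidden units of a binary RBM can be summed out in closed form. The sketch below illustrates this computation; the helper names and toy parameters are illustrative, not the trained models used here.

```python
import numpy as np

def rbm_log_f(v, W, b, c):
    """Unnormalized log-probability of visibles for a binary RBM, with h summed out:
    log f(v) = c^T v + sum_j log(1 + exp(b_j + v^T W[:, j]))."""
    return v @ c + np.sum(np.logaddexp(0.0, b + v @ W))

def bernoulli_log_p1(v, p_marg):
    """Log-probability under an independent Bernoulli base distribution p1 matched to the
    training-set marginal statistics, as in (Salakhutdinov & Murray, 2008)."""
    return np.sum(v * np.log(p_marg) + (1.0 - v) * np.log1p(-p_marg))

# Toy usage: random parameters stand in for a trained RBM.
rng = np.random.default_rng(0)
M, J = 784, 100
W = 0.01 * rng.standard_normal((M, J))
b, c = np.zeros(J), np.zeros(M)
p_marg = np.clip(rng.random(M), 0.05, 0.95)
v = (rng.random(M) < p_marg).astype(float)
delta_v = rbm_log_f(v, W, b, c) - bernoulli_log_p1(v, p_marg)   # Delta_v for one sample
print(delta_v)
```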
We also experimented with the case where p1 was the uniform prior; these results are included in Appendix F.

4.3. Number of Temperatures

An advantage of the Rao-Blackwellization of temperature information is that there is no need to pick a precise number of inverse temperatures, as long as K is big enough to allow for good mixing of the Markov chain. As shown in Figure 4, RTS's performance is not greatly affected by adding more temperatures once there are enough temperatures to give good mixing. Also note that as the number of temperatures increases, RTS and the Rao-Blackwellized version of TI (TI-RB) become increasingly similar. We show explicitly in Appendix C that they are equivalent in the limit of infinitely many temperatures. Running MBAR on a large number of temperatures is computationally prohibitive. An issue when estimates are non-Rao-Blackwellized is that the estimates eventually become unstable, as we do not have positive counts for each bin. This is addressed heuristically in the non-Rao-Blackwellized version of RTS (TS) by adding a constant of 0.1 to each bin. For TI, empty bins are imputed by linear interpolation.

Figure 4. RMSE as a function of the number of inverse temperatures K for various estimators. The model is the same RBM with 500 hidden units studied in Figure 3. Each point was obtained by averaging over 200 estimates (20 for MBAR due to computational costs) made from 10,000 bootstrapped samples from a long MCMC run of 3 million samples.

4.4. Tracking Partition Functions While Training

There are many approaches to training RBMs, including recent methods that do not require sampling (Sohl-Dickstein et al., 2010; Im et al., 2015; Gabrié et al., 2015). However, most learning algorithms are based on Monte Carlo integration with persistent Contrastive Divergence (Tieleman & Hinton, 2009). This includes proposals based on tempered sampling (Salakhutdinov, 2009; Desjardins et al., 2010). Because RTS requires a relatively low number of samples and the parameters are slowly changing, we are able to track the train- and validation-set likelihoods during RBM training at minimal additional cost. This allows us to avoid overfitting by early stopping of the training. We note that there are previous, more involved efforts to track RBM partition functions, which require additional computational and implementation effort (Desjardins et al., 2011).

This idea is illustrated in Figure 5, which shows estimates of the mean training and validation log-likelihoods of an RBM with 500 hidden units trained on the dna dataset (180 observed binary features; available from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html). We first pretrain the RBM with CD-1 to get initial values for the RBM parameters. We then run initial RTS iterations with K = 100, as in Section 2.4, in order to get starting log Ẑk estimates. For the main training effort we used the RMSspectral stochastic gradient method, with a step size of 10^-5 and parameter λ = 0.99 (see (Carlson et al., 2015b) for details). We considered a tempered space with K = 100 and sampled 25 Gibbs sweeps on 2000 parallel chains between gradient updates.
The latter is a large number compared to older learning approaches (Salakhutdinov & Murray, 2008), but is similar to that used in both (Carlson et al., 2015b) and (Grosse & Salakhutdinov, 2015), which provide state-of-the-art learning techniques. We used a prior on the inverse temperatures rk ∝ exp(2βk), which reduces variance on the gradient estimate by encouraging more of the samples to contribute to the gradient estimation. With the samples collected after each 25 Gibbs sweeps, we can estimate the ĉk's to compute the running partition function. To smooth the noise from such a small number of samples, we consider partial updates of ẐK given by

Ẑ_K^(t+1) = Ẑ_K^(t) [ (r1 ĉ_K^(t)) / (rK ĉ_1^(t)) ]^α ,

with α = 0.2, and t an index on the gradient update. Similar results were obtained with 0.05 < α < 0.5. This smoothing is also justified by the slowly changing nature of the parameters.

Figure 5 also shows the corresponding value from AIS with 100 parallel samples and 10,000 inverse temperatures. Such AIS runs have been shown to give accurate estimates of the partition function for RBMs with even more hidden units (Salakhutdinov & Murray, 2008), but involve a major computational cost that our method avoids: using the settings from (Salakhutdinov & Murray, 2008) adds a cost of 10^6 additional samples.

Figure 5. A demonstration of the ability to track, with minimal cost, the mean train and validation log-likelihood during the training of an RBM on the dna 180-dimensional binary dataset, with 500 latent features.

5. Discussion

In this paper, we have developed a new partition function estimation method that we call Rao-Blackwellized Tempered Sampling (RTS). Our experiments show RTS has equal or superior performance to existing methods popular in the machine learning and physical chemistry communities, while only requiring sufficient statistics collected during simulated tempering.

An important free parameter is the prior over inverse temperatures, rk, and its optimal selection is a natural question. We explored several parametrized proposals for rk, but in our experiments no alternative prior distribution consistently outperformed the uniform prior on estimator RMSE. (In Section 4.4, a non-uniform prior was used, but this was to reduce gradient estimate uncertainty at the expense of a less accurate log Z estimate.) We also explored a continuous β formulation, but the resulting estimates were less accurate. Additionally, we tried subtracting off estimates of the bias, but this did not improve the results. Finally, we tried incorporating a variety of control variates, such as those in (Dellaportas & Kontoyiannis, 2012), but did not find them to reduce the variance of our estimates in the examples we considered. Other control variate methods, such as those in (Oates et al., 2015), could potentially be combined with RTS in continuous distributions. We also briefly considered estimating p(βk) via the stationary distribution of a Markov process, which we discuss in Appendix G. This approach did not consistently yield performance improvements. Future improvements could be obtained through improving the temperature path as in (Grosse et al., 2013; van de Meent et al., 2014) or incorporating generalized ensembles (Frellsen et al., 2016).
Acknowledgements

We thank Ryan Adams for helpful conversations. Funding for this research was provided by DARPA N66001-15-C-4032 (SIMPLEX), a Google Faculty Research award, and ONR N00014-14-1-0243; in addition, this work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior / Interior Business Center (DoI/IBC) contract number D16PC00003. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/IBC, or the U.S. Government.

References

Atchade, Y. and Liu, J. The Wang-Landau algorithm in general state spaces: Applications and convergence analysis. Stat. Sinica, 2010.
Bennett, C. Efficient estimation of free energy differences from Monte Carlo data. J. Comp. Physics, 1976.
Beskos, A., Pillai, N., Roberts, G., Sanz-Serna, J.-M., and Stuart, A. Optimal tuning of the hybrid Monte Carlo algorithm. Bernoulli, 2013.
Burda, Y., Grosse, R., and Salakhutdinov, R. Accurate and conservative estimates of MRF log-likelihood using reverse annealing. AISTATS, 2015.
Calderhead, B. and Girolami, M. Estimating Bayes factors via thermodynamic integration and population MCMC. Computational Statistics & Data Analysis, 53(12):4028-4045, 2009.
Carlson, D., Cevher, V., and Carin, L. Stochastic spectral descent for Restricted Boltzmann Machines. AISTATS, 2015a.
Carlson, D., Hsieh, Y.-P., Collins, E., Carin, L., and Cevher, V. Stochastic spectral descent for discrete graphical models. IEEE J. Selected Topics Signal Processing, 2016.
Carlson, D. E., Collins, E., Hsieh, Y.-P., Carin, L., and Cevher, V. Preconditioned spectral descent for deep learning. In NIPS, 2015b.
Dellaportas, P. and Kontoyiannis, I. Control variates for estimation based on reversible Markov Chain Monte Carlo samplers. J. Royal Statistical Society: Series B (Statistical Methodology), 74(1):133-161, 2012.
Desjardins, G., Courville, A. C., Bengio, Y., Vincent, P., and Delalleau, O. Tempered Markov Chain Monte Carlo for training of Restricted Boltzmann Machines. In AISTATS, 2010.
Desjardins, G., Bengio, Y., and Courville, A. C. On tracking the partition function. In NIPS, 2011.
Frellsen, J., Winther, O., Ghahramani, Z., and Ferkinghoff-Borg, J. Bayesian generalised ensemble Markov chain Monte Carlo. In AISTATS, 2016.
Friel, N., Hurn, M., and Wyse, J. Improving power posterior estimation of statistical evidence. Statistics and Computing, 2014.
Friel, N. and Wyse, J. Estimating the evidence - a review. Statistica Neerlandica, 66(3):288-308, 2012.
Gabrié, M., Tramel, E. W., and Krzakala, F. Training Restricted Boltzmann Machines via the Thouless-Anderson-Palmer free energy. In NIPS, 2015.
Gelman, A. and Meng, X.-L. Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Stat. Science, 1998.
Geyer, C. J. Estimating normalizing constants and reweighting mixtures. UM Technical Report 568, 1994.
Geyer, C. J. and Thompson, E. A. Annealing Markov chain Monte Carlo with applications to ancestral inference. JASA, 90(431):909-920, 1995.
Grosse, R. and Salakhutdinov, R. Scaling up natural gradient by sparsely factorizing the inverse Fisher matrix. In ICML, 2015.
Grosse, R. B., Maddison, C. J., and Salakhutdinov, R. R. Annealing between distributions by averaging moments. In NIPS, 2013.
Grosse, R. B., Ghahramani, Z., and Adams, R. P. Sandwiching the marginal likelihood using bidirectional Monte Carlo. arXiv:1511.02543, 2015.
Im, D. J., Buchman, E., and Taylor, G. Understanding minimum probability flow for RBMs under various kinds of dynamics. ICLR Workshop Track, 2015.
Li, Y., Protopopescu, V., and Gorin, A. Accelerated simulated tempering. Physics Letters A, 2004.
Liu, Q., Peng, J., Ihler, A., and Fisher III, J. Estimating the partition function by discriminance sampling. UAI, 2015.
Marin, J.-M. and Robert, C. P. Importance sampling methods for Bayesian discrimination between embedded models. arXiv:0910.2325, 2009.
Marinari, E. and Parisi, G. Simulated tempering: a new Monte Carlo scheme. EPL (Europhysics Letters), 19(6):451, 1992.
Meng, X.-L. and Wong, W. H. Simulating ratios of normalizing constants via a simple identity: a theoretical exploration. Statistica Sinica, 6(4):831-860, 1996.
Neal, R. M. Estimating ratios of normalizing constants using linked importance sampling. arXiv preprint math/0511216, 2005.
Neal, R. Annealed importance sampling. Statistics and Computing, 2001.
Neal, R. Handbook of Markov Chain Monte Carlo. Chapman & Hall / CRC Press, 2011.
Oates, C. J., Papamarkou, T., and Girolami, M. The controlled thermodynamic integral for Bayesian model evidence evaluation. J. American Statistical Association, 2015.
Robert, C. and Casella, G. Monte Carlo statistical methods. Springer Science & Business Media, 2013.
Salakhutdinov, R. Learning deep Boltzmann machines using adaptive MCMC. In ICML, 2010.
Salakhutdinov, R. and Murray, I. On the quantitative analysis of Deep Belief Networks. In ICML, 2008.
Salakhutdinov, R. R. Learning in Markov random fields using tempered transitions. In NIPS, 2009.
Shirts, M. R. and Chodera, J. D. Statistically optimal analysis of samples from multiple equilibrium states. J. Chem. Physics, 129(12):124105, 2008.
Smolensky, P. Information processing in dynamical systems: Foundations of harmony theory. Technical report, DTIC Document, 1986.
Sohl-Dickstein, J., Battaglino, P., and DeWeese, M. R. Minimum probability flow learning. ICML, 2010.
Tan, Z. Optimally adjusted mixture sampling and locally weighted histogram analysis. Journal of Computational and Graphical Statistics, 2015.
Tan, Z., Gallicchio, E., Lapelosa, M., and Levy, R. M. Theory of binless multi-state free energy estimation with applications to protein-ligand binding. J. Chem. Physics, 136(14):144102, 2012.
Tan, Z., Xia, J., Zhang, B. W., and Levy, R. M. Locally weighted histogram analysis and stochastic solution for large-scale multi-state free energy estimation. J. Chem. Physics, 144(3):034107, 2016.
Tieleman, T. and Hinton, G. Using fast weights to improve persistent contrastive divergence. In ICML, 2009.
van de Meent, J.-W., Paige, B., and Wood, F. Tempering by subsampling. arXiv:1401.7145, 2014.
Vyshemirsky, V. and Girolami, M. A. Bayesian ranking of biochemical system models. Bioinformatics, 24(6):833-839, 2008.
Wang, F. and Landau, D. P. Efficient multiple-range random walk algorithm to calculate the density of states. Phys. Rev. Lett., 2001.