# Parallel tempering on optimized paths

Saifuddin Syed*¹, Vittorio Romaniello*¹, Trevor Campbell¹, Alexandre Bouchard-Côté¹

Abstract: Parallel tempering (PT) is a class of Markov chain Monte Carlo algorithms that constructs a path of distributions annealing between a tractable reference and an intractable target, and then interchanges states along the path to improve mixing in the target. The performance of PT depends on how quickly a sample from the reference distribution makes its way to the target, which in turn depends on the particular path of annealing distributions. However, past work on PT has used only simple paths constructed from convex combinations of the reference and target log-densities. This paper begins by demonstrating that this path performs poorly in the setting where the reference and target are nearly mutually singular. To address this issue, we expand the framework of PT to general families of paths, formulate the choice of path as an optimization problem that admits tractable gradient estimates, and propose a flexible new family of spline interpolation paths for use in practice. Theoretical and empirical results both demonstrate that our proposed methodology breaks previously-established upper performance limits for traditional paths.

1. Introduction

Markov Chain Monte Carlo (MCMC) methods are widely used to approximate intractable expectations with respect to un-normalized probability distributions over general state spaces. For hard problems, MCMC can suffer from poor mixing. For example, faced with well-separated modes, MCMC methods often get trapped exploring local regions of high probability. Parallel tempering (PT) is a widely applicable methodology (Geyer, 1991) to tackle poor mixing of MCMC algorithms.

*Equal contribution. ¹Department of Statistics, University of British Columbia, Vancouver, Canada. Correspondence to: Saifuddin Syed, Vittorio Romaniello.
Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Suppose we seek to approximate an expectation with respect to an intractable target density $\pi_1$. Denote by $\pi_0$ a reference density defined on the same space, assumed tractable in the sense that an efficient sampler is available. This work is motivated by the case where $\pi_0$ and $\pi_1$ are nearly mutually singular. A typical case is where the target is a Bayesian posterior distribution, the reference is the prior (for which i.i.d. sampling is typically possible), and the prior is misspecified. PT methods are based on a specific continuum of densities $\pi_t \propto \pi_0^{1-t}\pi_1^t$, $t \in [0, 1]$, bridging $\pi_0$ and $\pi_1$. This path of intermediate distributions is known as the power posterior path in the literature, but in our framework it will be more natural to think of this continuum as a linear path, as it linearly interpolates between log-densities. PT algorithms discretize the path at some $0 = t_0 < \cdots < t_N = 1$ to obtain a sequence of densities $\pi_{t_0}, \pi_{t_1}, \ldots, \pi_{t_N}$. See Figure 1 (top) for an example of a linear path for two nearly mutually singular Gaussian distributions. Given the path discretization, PT involves running $N + 1$ MCMC chains that together target the product distribution $\pi_{t_0}\pi_{t_1}\cdots\pi_{t_N}$. Based on the assumption that the chain targeting $\pi_0$ can be sampled efficiently, PT uses swap-based interactions between neighbouring chains to propagate the exploration done in $\pi_0$ into improved exploration in the chain of interest, $\pi_1$. By designing these swaps as Metropolis-Hastings moves, PT guarantees that the marginal distribution of the $N$th chain converges to $\pi_1$; in practice, the rate of convergence is often much faster than that of a single chain (Woodard et al., 2009).
PT algorithms are extensively used in hard sampling problems arising in statistics, physics, computational chemistry, phylogenetics, and machine learning (Desjardins et al., 2014; Ballnus et al., 2017; Kamberaj, 2020; Müller & Bouckaert, 2020). Notwithstanding empirical and theoretical successes, existing PT algorithms also have well-understood theoretical limitations. Earlier work focusing on the theoretical analysis of reversible variants of PT has shown that adding too many intermediate chains can actually deteriorate performance (Lingenheil et al., 2009; Atchadé et al., 2011). Recent work (Syed et al., 2019) has shown that a nonreversible variant of PT (Okabe et al., 2001) is guaranteed to dominate its classical reversible counterpart, and moreover that in the nonreversible regime adding more chains does not lead to performance collapse. However, even with these more efficient nonreversible PT algorithms, Syed et al. (2019) established that the improvement brought by higher parallelism asymptotes to a fundamental limit known as the global communication barrier. In this work, we show that by generalizing the class of paths interpolating between $\pi_0$ and $\pi_1$ from linear to nonlinear, the global communication barrier can be broken, leading to substantial performance improvements. Importantly, the nonlinear path used to demonstrate this is computed using a practical algorithm that can be applied in any situation where PT is applicable. An example of a path optimized using our algorithm is shown in Figure 1 (bottom). We also present a detailed theoretical analysis of parallel tempering algorithms based on nonlinear paths. Using this analysis, we prove that the performance gains obtained by going from linear to nonlinear path PT algorithms can be arbitrarily large. Our theoretical analysis also motivates a principled objective function used to optimize over a parametric family of paths.
Literature review: Beyond parallel tempering, several methods to approximate intractable integrals rely on a path of distributions from a reference to a target distribution, and there is a rich literature on the construction and optimization of nonlinear paths for annealed-importance-sampling-type algorithms (Gelman & Meng, 1998; Rischard et al., 2018; Grosse et al., 2013; Brekelmans et al., 2020). These algorithms are highly parallel; however, for challenging problems, even when combined with adaptive step size procedures (Zhou et al., 2016), they typically suffer from particle degeneracy (Syed et al., 2019, Sec. 7.4). Moreover, these methods use different path optimization criteria which are not well motivated in the context of parallel tempering. Some special cases of nonlinear paths have been used in the PT literature (Whitfield et al., 2002; Tawn et al., 2020). Whitfield et al. (2002) construct a nonlinear path inspired by the concept of Tsallis entropy, a generalization of Boltzmann-Gibbs entropy, but do not provide algorithms to optimize over this path family. The work of Tawn et al. (2020) also considers a specific example of a nonlinear path distinct from the ones explored in this paper. However, the construction of the nonlinear path in Tawn et al. (2020) requires knowledge of the location of the modes of $\pi_1$, and hence makes their algorithm less broadly applicable than standard PT.

2. Background

In this section, we provide a brief overview of parallel tempering (PT) (Geyer, 1991), as well as recent results on nonreversible communication (Okabe et al., 2001; Sakai & Hukushima, 2016; Syed et al., 2019). Define a reference unnormalized density function $\pi_0$ for which sampling is tractable, and an unnormalized target density function $\pi_1$ for which sampling is intractable; the goal is to obtain samples from $\pi_1$.¹ Define a path of distributions $\pi_t \propto \pi_0^{1-t}\pi_1^t$ for $t \in [0, 1]$ from the reference to the target.
Finally, define the annealing schedule $T_N$ to be a monotone sequence in $[0, 1]$:

$$T_N = (t_n)_{n=0}^N, \qquad 0 = t_0 \le t_1 \le \cdots \le t_N = 1, \qquad \|T_N\| = \max_{n \in \{0, \ldots, N-1\}} |t_{n+1} - t_n|.$$

The core idea of parallel tempering is to construct a Markov chain $(X_m^0, \ldots, X_m^N)$, $m = 1, 2, \ldots$, that (1) has invariant distribution $\pi_{t_0}\pi_{t_1}\cdots\pi_{t_N}$, so that we can treat the marginal chain $X_m^N$ as samples from the target $\pi_1$, and (2) swaps components of the state vector so that independent samples from component 0 (i.e., the reference $\pi_0$) traverse the annealing path and aid mixing in component $N$ (i.e., the target $\pi_1$). This is achieved by iteratively performing a local exploration move followed by a communication move, as shown in Algorithm 1.

Local exploration: Given $(X_{m-1}^0, \ldots, X_{m-1}^N)$, we obtain an intermediate state $(\tilde X_m^0, \ldots, \tilde X_m^N)$ by updating the $n$th component using any MCMC move targeting $\pi_{t_n}$, for $n = 0, \ldots, N$. This move can be performed in parallel across components since each is updated independently.

Communication: Given the intermediate state $(\tilde X_m^0, \ldots, \tilde X_m^N)$, we apply pairwise swaps of components $n$ and $n+1$ for $n \in S_m \subseteq \{0, \ldots, N-1\}$, the swapped index set. Formally, a swap is a move from $(x^0, \ldots, x^n, x^{n+1}, \ldots, x^N)$ to $(x^0, \ldots, x^{n+1}, x^n, \ldots, x^N)$, which is accepted with probability

$$\alpha_n = 1 \wedge \frac{\pi_{t_n}(x^{n+1})\,\pi_{t_{n+1}}(x^n)}{\pi_{t_n}(x^n)\,\pi_{t_{n+1}}(x^{n+1})}. \quad (1)$$

Since each swap only depends on components $n$ and $n+1$, one can perform all of the swaps in $S_m$ in parallel, as long as $n \in S_m$ implies $(n+1) \notin S_m$. The largest collections of such non-interfering swaps are $S_{\text{even}}$ and $S_{\text{odd}}$, the even and odd subsets of $\{0, \ldots, N-1\}$ respectively. In non-reversible PT (NRPT) (Okabe et al., 2001), the swap set at step $m$ is deterministic: $S_m = S_{\text{even}}$ if $m$ is even, and $S_m = S_{\text{odd}}$ if $m$ is odd.

Round trips: The performance of PT is sensitive to both the local exploration and communication moves.
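To make the communication move concrete, here is a minimal sketch in Python. This is not the authors' implementation: the one-dimensional linear Gaussian path, the function names, and all parameter values are illustrative only.

```python
import numpy as np

def log_pi(t, x, mu0=-2.0, mu1=2.0, sigma=0.2):
    """Unnormalized log-density of the linear path pi_t ∝ pi_0^{1-t} pi_1^t
    between two Gaussians (illustrative choice of reference and target)."""
    w0 = -0.5 * ((x - mu0) / sigma) ** 2
    w1 = -0.5 * ((x - mu1) / sigma) ** 2
    return (1 - t) * w0 + t * w1

def communication_sweep(x, ts, m, rng):
    """One non-reversible communication move: even-indexed swaps on even
    scans, odd-indexed swaps on odd scans. Swaps are applied in place."""
    parity = m % 2
    accepted = []
    for n in range(parity, len(ts) - 1, 2):
        # log of the swap acceptance ratio in Equation (1)
        log_alpha = (log_pi(ts[n], x[n + 1]) + log_pi(ts[n + 1], x[n])
                     - log_pi(ts[n], x[n]) - log_pi(ts[n + 1], x[n + 1]))
        if np.log(rng.uniform()) < min(0.0, log_alpha):
            x[n], x[n + 1] = x[n + 1], x[n]
            accepted.append(n)
    return x, accepted
```

Alternating deterministically between the even and odd swap sets, rather than drawing them at random, is what makes the induced index process non-reversible.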
The quantity commonly used to evaluate the performance of MCMC algorithms is the effective sample size (ESS); however, ESS measures the combined performance of local exploration and communication, and is not able to distinguish between the two. Since the major difference between PT and standard MCMC is the presence of a communication step, we require a way to measure communication performance in isolation, so that we can compare PT methods without dependence on the details of the local exploration move. The round trip rate is a performance measure from the PT literature (Katzgraber et al., 2006; Lingenheil et al., 2009) that is designed to assess communication efficiency alone. We say a round trip has occurred when a new sample from the reference $\pi_0$ travels to the target $\pi_1$ and then back to $\pi_0$; the round trip rate $\tau(T_N)$ is the frequency at which round trips occur. Based on simplifying assumptions on the local exploration moves, the round trip rate $\tau(T_N)$ may be expressed as (Syed et al., 2019, Section 3.5)

$$\tau(T_N) = \left(2 + 2\sum_{n=0}^{N-1}\frac{r(t_n, t_{n+1})}{1 - r(t_n, t_{n+1})}\right)^{-1}, \quad (2)$$

where $r(t, t')$ is the expected probability of rejection between chains $t, t' \in [0, 1]$,

$$r(t, t') = \mathbb{E}\left[1 - \left(1 \wedge \frac{\pi_t(X')\,\pi_{t'}(X)}{\pi_t(X)\,\pi_{t'}(X')}\right)\right],$$

and $X, X'$ have distributions $X \sim \pi_t$, $X' \sim \pi_{t'}$. Further, if $T_N$ is refined so that $\|T_N\| \to 0$ as $N \to \infty$, we find that the asymptotic (in $N$) round trip rate is

$$\bar\tau = \lim_{N\to\infty}\tau(T_N) = (2 + 2\Lambda)^{-1}, \quad (3)$$

where $\Lambda \ge 0$ is a constant associated with the pair $\pi_0, \pi_1$ called the global communication barrier (Syed et al., 2019). Note that $\Lambda$ depends on neither the number of chains $N$ nor the discretization schedule $T_N$.

¹We assume all distributions share a common state space $\mathcal{X}$ throughout, and will often suppress the arguments of (log-)density functions, i.e., $\pi_1$ instead of $\pi_1(x)$, for notational brevity.

3. General annealing paths

In the following, we use the terminology annealing path to describe a continuum of distributions interpolating between $\pi_0$ and $\pi_1$; this definition will be formalized shortly.
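For intuition, Equation (2) is straightforward to evaluate once the per-link rejection probabilities are known. A small sketch (the helper name is ours, and the formula holds only under the idealized local exploration assumption behind Equation (2)):

```python
def round_trip_rate(rejections):
    """Round trip rate tau(T_N) from per-link rejection probabilities
    r(t_n, t_{n+1}), as in Equation (2):
    tau = 1 / (2 + 2 * sum_n r_n / (1 - r_n))."""
    s = sum(r / (1.0 - r) for r in rejections)
    return 1.0 / (2.0 + 2.0 * s)
```

With all rejection probabilities equal to zero this returns the ideal rate 1/2, and as the schedule is refined it approaches the limit $(2 + 2\Lambda)^{-1}$ of Equation (3).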
The previous work reviewed in the last section assumes that the annealing path has the form $\pi_t \propto \pi_0^{1-t}\pi_1^t$, i.e., that the annealing path linearly interpolates between the log-densities. A natural question is whether using other paths could lead to an improved round trip rate. In this work we show that the answer to this question is positive. The following proposition demonstrates that the traditional path $\pi_t \propto \pi_0^{1-t}\pi_1^t$ suffers from an arbitrarily suboptimal global communication barrier even in simple examples with Gaussian reference and target distributions.

Algorithm 1 NRPT
Require: initial state $x_0$, path $\pi_t$, schedule $T_N$, number of iterations $M$
  $r_n \leftarrow 0$ for all $n \in \{0, \ldots, N-1\}$
  for $m = 1$ to $M$ do
    $\tilde x_m \leftarrow \text{LocalExploration}(x_{m-1})$
    $S_m \leftarrow S_{\text{even}}$ if $m$ is even, otherwise $S_m \leftarrow S_{\text{odd}}$
    for $n = 0$ to $N - 1$ do
      $\alpha_n \leftarrow 1 \wedge \dfrac{\pi_{t_n}(\tilde x_m^{n+1})\,\pi_{t_{n+1}}(\tilde x_m^{n})}{\pi_{t_n}(\tilde x_m^{n})\,\pi_{t_{n+1}}(\tilde x_m^{n+1})}$
      $r_n \leftarrow r_n + (1 - \alpha_n)$
      $U_n \sim \text{Unif}(0, 1)$
      if $n \in S_m$ and $U_n \le \alpha_n$ then
        $(\tilde x_m^{n}, \tilde x_m^{n+1}) \leftarrow (\tilde x_m^{n+1}, \tilde x_m^{n})$
      end if
    end for
    $x_m \leftarrow \tilde x_m$
  end for
  $r_n \leftarrow r_n / M$ for $n \in \{0, \ldots, N-1\}$
Return: $\{x_m\}_{m=1}^M$, $\{r_n\}_{n=0}^{N-1}$

Figure 1. Two annealing paths between $\pi_0 = N(-2, 0.2^2)$ (light blue) and $\pi_1 = N(2, 0.2^2)$ (dark blue): the traditional linear path (top) and an optimized nonlinear path (bottom). While the distributions in the linear path are nearly mutually singular, those in the optimized path overlap substantially, leading to faster round trips.

Proposition 1. Suppose the reference and target distributions are $\pi_0 = N(\mu_0, \sigma^2)$ and $\pi_1 = N(\mu_1, \sigma^2)$, and define $z = |\mu_1 - \mu_0|/\sigma$. Then as $z \to \infty$,
1. the path $\pi_t \propto \pi_0^{1-t}\pi_1^t$ has $\bar\tau = \Theta(1/z)$, and
2. there exists a path of Gaussian distributions with $\bar\tau = \Omega(1/\log z)$.

Therefore, upon decreasing the variance of the reference and target while holding their means fixed, the traditional linear annealing path obtains an exponentially smaller asymptotic round trip rate than the optimal path of Gaussian distributions. Figure 1 provides an intuitive explanation.
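The first claim of Proposition 1 can be checked numerically. For the linear path between $N(\mu_0, \sigma^2)$ and $N(\mu_1, \sigma^2)$ we have $\pi_t = N((1-t)\mu_0 + t\mu_1, \sigma^2)$ and $dW_t/dt = W_1 - W_0$, so the instantaneous rejection rate $\lambda(t) = \tfrac{1}{2}\mathbb{E}|dW_t/dt(X_t) - dW_t/dt(X'_t)|$ of Syed et al. (2019) can be estimated by plain Monte Carlo. A sketch; the function name and sample sizes are ours:

```python
import numpy as np

def lambda_linear_gaussian(t, mu0, mu1, sigma, n=100_000, seed=0):
    """Monte Carlo estimate of lambda(t) = 0.5 * E|dW_t/dt(X) - dW_t/dt(X')|
    with X, X' i.i.d. pi_t, for the linear path between N(mu0, sigma^2)
    and N(mu1, sigma^2). Here dW_t/dt(x) = W_1(x) - W_0(x)."""
    rng = np.random.default_rng(seed)
    mu_t = (1 - t) * mu0 + t * mu1            # pi_t is N(mu_t, sigma^2)
    x = rng.normal(mu_t, sigma, size=n)
    xp = rng.normal(mu_t, sigma, size=n)
    dW = lambda v: ((v - mu0) ** 2 - (v - mu1) ** 2) / (2 * sigma ** 2)
    return 0.5 * np.mean(np.abs(dW(x) - dW(xp)))
```

A short calculation gives the closed form $\lambda(t) = |\mu_1 - \mu_0|/(\sigma\sqrt{\pi})$ for every $t$, so the global communication barrier is $\Lambda = z/\sqrt{\pi}$: it grows linearly in $z$, consistent with the $\Theta(1/z)$ round trip rate in part 1 of the proposition.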
The standard path (top) corresponds to a set of Gaussian distributions with means interpolating between the reference and target. If one reduces the variance of the reference and target, the variance of the distributions along the path shrinks as well. For any fixed $N$, these distributions become nearly mutually singular, leading to arbitrarily low round trip rates. The solution to this issue (bottom) is to allow the distributions along the path to have increased variances, thereby maintaining mutual overlap and the ability to swap components with a reasonable probability. This motivates the need to design more general annealing paths. In the following, we introduce the precise general definition of an annealing path, an analysis of path communication efficiency in parallel tempering, and a rigorous formulation of and solution to the problem of tuning path parameters to maximize communication efficiency.

3.1. Assumptions

Let $\mathcal{P}(\mathcal{X})$ be the set of probability densities with full support on a state space $\mathcal{X}$. For any collection of densities $\pi_t \in \mathcal{P}(\mathcal{X})$ with index $t \in [0, 1]$, associate to each a log-density function $W_t$ such that

$$\pi_t(x) = \frac{1}{Z_t}\exp(W_t(x)), \quad x \in \mathcal{X}, \quad (4)$$

where $Z_t = \int_{\mathcal{X}}\exp(W_t(x))\,dx$ is the normalizing constant. Definition 1 provides the conditions necessary to form a path of distributions from a reference $\pi_0$ to a target $\pi_1$ that are well-behaved for use in parallel tempering.

Definition 1. An annealing path is a map $\pi_{(\cdot)} : [0, 1] \to \mathcal{P}(\mathcal{X})$, denoted $t \mapsto \pi_t$, such that for all $x \in \mathcal{X}$, $\pi_t(x)$ is continuous in $t$.

There are many ways to move beyond the standard linear path $\pi_t \propto \pi_0^{1-t}\pi_1^t$. For example, consider a nonlinear path $\pi_t \propto \pi_0^{\eta_0(t)}\pi_1^{\eta_1(t)}$, where $\eta_i : [0, 1] \to \mathbb{R}$ are continuous functions such that $\eta_0(0) = \eta_1(1) = 1$ and $\eta_0(1) = \eta_1(0) = 0$. As long as $\pi_t$ is a normalizable density for all $t \in [0, 1]$, this is a valid annealing path between $\pi_0$ and $\pi_1$. Further, note that the path parameter does not necessarily have to appear as an exponent: consider for example the mixture path $\pi_t = (1-t)\pi_0 + t\pi_1$.
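The two constructions just mentioned can be written down directly. A minimal sketch (function names are ours; both densities are unnormalized, and the mixture path is computed via a log-sum-exp for numerical stability):

```python
import numpy as np

def log_power_path(t, x, log_pi0, log_pi1):
    """Unnormalized log-density of the power path pi_t ∝ pi_0^{1-t} pi_1^t."""
    return (1 - t) * log_pi0(x) + t * log_pi1(x)

def log_mixture_path(t, x, log_pi0, log_pi1):
    """Log-density of the mixture path pi_t = (1-t) pi_0 + t pi_1,
    evaluated stably as a log-sum-exp of the two weighted components."""
    a = np.log1p(-t) + log_pi0(x) if t < 1 else -np.inf
    b = np.log(t) + log_pi1(x) if t > 0 else -np.inf
    m = max(a, b)
    return m + np.log(np.exp(a - m) + np.exp(b - m))
```

Both paths agree with $\pi_0$ at $t = 0$ and $\pi_1$ at $t = 1$; they differ in between, and in general lead to different communication barriers.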
Section 4 provides a more detailed example based on linear splines.

3.2. Communication efficiency analysis

Given a particular annealing path satisfying Definition 1, we require a method to characterize the round trip rate performance of parallel tempering based on that path. The results presented in this section form the basis of the objective function used to optimize over paths, as well as the foundation for the proof of Proposition 1. We start with some notation for the rejection rates involved when Algorithm 1 is used with nonlinear paths (Equation (4)). For $t, t' \in [0, 1]$ and $x \in \mathcal{X}$, define the rejection rate function $r : [0, 1]^2 \to [0, 1]$ by

$$r(t, t') = 1 - \mathbb{E}\left[\exp\left(\min\{0, A_{t,t'}(X, X')\}\right)\right], \qquad A_{t,t'}(x, x') = (W_{t'}(x) - W_t(x)) - (W_{t'}(x') - W_t(x')),$$

where $X \sim \pi_t$ and $X' \sim \pi_{t'}$ are independent. Assuming that all chains have reached stationarity, and that all chains undergo efficient local exploration (i.e., $W_t(X)$ and $W_t(\tilde X)$ are independent when $X \sim \pi_t$ and $\tilde X$ is generated by local exploration from $X$), the round trip rate for a particular schedule $T_N$ has the form given earlier in Equation (2). This statement follows from Syed et al. (2019, Cor. 1) without modification, because the proof of that result does not depend on the form of the acceptance ratio. Our next objective is to characterize the asymptotic communication efficiency of a nonlinear path in the regime where $N \to \infty$ and $\|T_N\| \to 0$, which establishes its fundamental ability to take advantage of parallel computation. In other words, we require a generalization of the asymptotic result in Equation (3). In this case, previous theory relies on the particular form of the acceptance ratio for linear paths (Syed et al., 2019); in the remainder of this section, we provide a generalization of the asymptotic result to nonlinear paths via piecewise approximation by a linear spline. For $t, t' \in [0, 1]$, define $\Lambda(t, t')$ to be the global communication barrier for the linear secant connecting $\pi_t$ and $\pi_{t'}$,

$$\Lambda(t, t') = \frac{1}{2}\int_0^1 \mathbb{E}\left[\left|A_{t,t'}(X_s, X'_s)\right|\right] ds, \quad (5)$$

where $X_s, X'_s$ are i.i.d. with density
$$\pi_s(x; t, t') = \frac{1}{Z_s(t, t')}\exp\left((1-s)\,W_t(x) + s\,W_{t'}(x)\right).$$

Lemma 1 shows that the global communication barrier $\Lambda(t, t')$ along the secant of the path connecting $t$ to $t'$ is a good approximation of the true rejection rate $r(t, t')$, with $O(|t - t'|^3)$ error as $t' \to t$.

Lemma 1. Suppose that for all $x \in \mathcal{X}$, $W_t(x)$ is piecewise continuously differentiable in $t$, and that there exists $V_1 : \mathcal{X} \to [0, \infty)$ such that

$$\forall x \in \mathcal{X}, \quad \sup_{t \in [0,1]}\left|\frac{dW_t}{dt}(x)\right| \le V_1(x), \quad (6)$$

and

$$\sup_{t \in [0,1]}\mathbb{E}_{\pi_t}[V_1^3] < \infty. \quad (7)$$

Then there exists a constant $C < \infty$ independent of $t, t'$ such that for all $t, t' \in [0, 1]$,

$$|r(t, t') - \Lambda(t, t')| \le C\,|t - t'|^3. \quad (8)$$

A direct consequence of Lemma 1 is that for any fixed schedule $T_N$,

$$\left|\sum_{n=0}^{N-1} r(t_n, t_{n+1}) - \Lambda(T_N)\right| \le C\,\|T_N\|^2, \quad (9)$$

where $\Lambda(T_N) = \sum_{n=0}^{N-1}\Lambda(t_n, t_{n+1})$. Intuitively, in the $\|T_N\| \to 0$ regime where rejection rates are low, $\frac{r(t_n, t_{n+1})}{1 - r(t_n, t_{n+1})} \approx r(t_n, t_{n+1})$, and we have that $\tau(T_N) \approx (2 + 2\Lambda(T_N))^{-1}$. Therefore, $\Lambda(T_N)$ characterizes the communication efficiency of the path in a way that naturally extends the global communication barrier from the linear path case. Theorem 2 provides the precise statement: the convergence is uniform in $T_N$ and depends only on $\|T_N\|$, and $\Lambda(T_N)$ itself converges to a constant $\Lambda$ in the asymptotic regime. We refer to $\Lambda$, defined below in Equation (13), as the global communication barrier for the general annealing path.

Theorem 2. Suppose that for all $x \in \mathcal{X}$, $W_t(x)$ is piecewise twice continuously differentiable in $t$, that there exists $V_1 : \mathcal{X} \to [0, \infty)$ satisfying (6) and (7), and that there exist $V_2 : \mathcal{X} \to [0, \infty)$ and $\epsilon > 0$ satisfying

$$\forall x \in \mathcal{X}, \quad \sup_{t \in [0,1]}\left|\frac{d^2W_t}{dt^2}(x)\right| \le V_2(x), \quad (10)$$

and

$$\sup_{t \in [0,1]}\mathbb{E}_{\pi_t}\left[(1 + V_1)\exp(\epsilon V_2)\right] < \infty. \quad (11)$$

Then

$$\lim_{\delta \to 0}\ \sup_{T_N : \|T_N\| \le \delta}\left|(2 + 2\Lambda(T_N))^{-1} - \tau(T_N)\right| = 0 \quad (12)$$

and

$$\lim_{\delta \to 0}\ \sup_{T_N : \|T_N\| \le \delta}\left|\Lambda(T_N) - \Lambda\right| = 0, \quad (13)$$

where $\Lambda = \int_0^1 \lambda(t)\,dt$ for an instantaneous rejection rate function $\lambda : [0, 1] \to [0, \infty)$ given by

$$\lambda(t) = \lim_{\Delta t \to 0}\frac{r(t + \Delta t, t)}{|\Delta t|} = \frac{1}{2}\,\mathbb{E}\left[\left|\frac{dW_t}{dt}(X_t) - \frac{dW_t}{dt}(X'_t)\right|\right], \qquad X_t, X'_t \overset{\text{i.i.d.}}{\sim} \pi_t.$$

The integrability condition is required to control the tail behaviour of distributions formed by linearized approximations to the path $\pi_t$.
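Since the rejection rate $r(t, t')$ of Section 3.2 is a simple expectation, it can be estimated by plain Monte Carlo given (approximately) stationary samples from $\pi_t$ and $\pi_{t'}$. A sketch with an illustrative helper name:

```python
import numpy as np

def rejection_rate_mc(Wt, Wtp, xs_t, xs_tp):
    """Monte Carlo estimate of r(t, t') = 1 - E[exp(min{0, A_{t,t'}(X, X')})]
    from independent samples xs_t ~ pi_t and xs_tp ~ pi_t'.
    Wt, Wtp are the (unnormalized) log-density functions W_t and W_t'."""
    A = (Wtp(xs_t) - Wt(xs_t)) - (Wtp(xs_tp) - Wt(xs_tp))
    return float(np.mean(1.0 - np.exp(np.minimum(0.0, A))))
```

Lemma 1 then says this quantity is within $O(|t - t'|^3)$ of the secant barrier $\Lambda(t, t')$, which is what justifies summing per-link estimates as in Equation (9).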
This condition is satisfied by a wide range of annealing paths, e.g., the linear spline paths proposed in this work, since in that case $V_2 = 0$.

3.3. Annealing path families and optimization

It is often the case that there is a set of candidate annealing paths under consideration for a particular target $\pi_1$. For example, if a path has tunable parameters $\phi \in \Phi$ that govern its shape, we can generate a collection of annealing paths that all target $\pi_1$ by varying the parameter $\phi$. We call such a collection an annealing path family.

Definition 3. An annealing path family for target $\pi_1$ is a collection of annealing paths $\{\pi_t^\phi\}_{\phi \in \Phi}$ such that for all parameters $\phi \in \Phi$, $\pi_1^\phi = \pi_1$.

There are many ways to construct useful annealing path families. For example, if one is provided a parametric family of variational distributions $\{q_\phi : \phi \in \Phi\}$ for some parameter space $\Phi$, one can construct the annealing path family of linear paths $\pi_t^\phi \propto q_\phi^{1-t}\pi_1^t$ from a variational reference $q_\phi$ to the target $\pi_1$. More generally, given $\eta_i(t)$ satisfying the constraints in Section 3.1, $\pi_t^\phi \propto q_\phi^{\eta_0(t)}\pi_1^{\eta_1(t)}$ defines a nonlinear annealing path family. Another example of an annealing path family used in the context of PT is the family of q-paths $\{\pi_t^q\}_{q \in [0,1]}$ (Whitfield et al., 2002). Given a fixed reference and target $\pi_0, \pi_1$, the path $\pi_t^q$ interpolates between the mixture path ($q = 0$) and the linear path ($q = 1$) (Brekelmans et al., 2020). In Section 4, we provide a new flexible class of nonlinear paths based on splines that is designed specifically to enhance the performance of parallel tempering.
Since every path in an annealing path family has the desired target distribution $\pi_1$, we are free to optimize the path over the tuning parameter space $\phi \in \Phi$ in addition to optimizing the schedule $T_N$.² Motivated by the analysis of Section 3.2, a natural objective function for this optimization is the non-asymptotic round trip rate:

$$\phi^*, T_N^* = \underset{\phi \in \Phi,\, T_N}{\arg\max}\ \tau^\phi(T_N) \quad (14)$$
$$= \underset{\phi \in \Phi,\, T_N}{\arg\min}\ \sum_{n=0}^{N-1}\frac{r^\phi(t_n, t_{n+1})}{1 - r^\phi(t_n, t_{n+1})}, \quad (15)$$

where now the round trip rate and rejection rates depend on both the schedule and the path parameter, the latter denoted by a superscript $\phi$. We solve this optimization using an approximate coordinate-descent procedure, iterating between an update of the schedule $T_N$ for a fixed path parameter $\phi \in \Phi$, followed by a gradient step in $\phi$ based on a surrogate objective function and a fixed schedule. This is summarized in Algorithm 2. We outline the details of the schedule and path tuning procedures in the following.

²We assume that the optimization over $\phi$ ends after a finite number of iterations to sidestep the potential pitfalls of adaptive MCMC methods (Andrieu & Moulines, 2006).

Algorithm 2 Path Opt NRPT
Require: state $x$, path family $\pi_t^\phi$, parameter $\phi$, number of chains $N$, number of PT iterations $M$, number of tuning steps $S$, learning rate $\gamma$
  $T_N \leftarrow (0, 1/N, 2/N, \ldots, 1)$
  for $s = 1$ to $S$ do
    $\{x_m\}_{m=1}^M, \{r_n\}_{n=0}^{N-1} \leftarrow \text{NRPT}(x, \pi_t^\phi, T_N, M)$
    $\lambda^\phi \leftarrow \text{CommunicationBarrier}(T_N, \{r_n\})$
    $T_N \leftarrow \text{UpdateSchedule}(\lambda^\phi, N)$
    $\phi \leftarrow \phi - \gamma\,\nabla_\phi \sum_{n=0}^{N-1}\text{SKL}(\pi_{t_n}^\phi, \pi_{t_{n+1}}^\phi)$
    $x \leftarrow x_M$
  end for
Return: $\phi$, $T_N$

Tuning the schedule: Fix the value of $\phi$, which fixes the path. We adapt a schedule tuning algorithm from past work to update the schedule $T_N = (t_n)_{n=0}^N$ (Syed et al., 2019, Section 5.1). Based on the same argument as in this previous work, we obtain that when $\|T_N\| \approx 0$, the non-asymptotic round trip rate is maximized when the rejection rates are all equal. The schedule that approximately achieves this satisfies

$$\forall n \in \{1, \ldots, N-1\}, \qquad \frac{1}{\Lambda^\phi}\int_0^{t_n}\lambda^\phi(s)\,ds = \frac{n}{N}. \quad (16)$$

Following Syed et al.
(2019), we use Monte Carlo estimates of the rejection rates $r^\phi(t_n, t_{n+1})$ to approximate the map $t \mapsto \int_0^t \lambda^\phi(s)\,ds$ via a monotone cubic spline, and then use bisection search to solve for each $t_n$ according to Equation (16).

Optimizing the path: Fix the schedule $T_N$; we now want to improve the path itself by modifying $\phi$. However, in challenging problems this is not as simple as taking a gradient step on the objective in Equation (14). In particular, in early iterations, when the path is near its often-poor initialization, the rejection rates satisfy $r^\phi(t_n, t_{n+1}) \approx 1$. As demonstrated empirically in Appendix F, gradient estimates in this regime exhibit a low signal-to-noise ratio that precludes their use for optimization. We propose a surrogate, the symmetric KL divergence, motivated as follows. Consider first the global communication barrier $\Lambda^\phi(T_N)$ for the linear spline approximation to the path; Theorem 2 guarantees that as long as $\|T_N\|$ is small enough, one can optimize $\Lambda^\phi(T_N)$ in place of the round trip rate $\tau^\phi(T_N)$. By Jensen's inequality,

$$\frac{1}{N^2}\Lambda^\phi(T_N)^2 \le \frac{1}{N}\sum_{n=0}^{N-1}\Lambda^\phi(t_n, t_{n+1})^2.$$

Next, we apply Jensen's inequality again to the definition of $\Lambda^\phi(t_n, t_{n+1})$ from (5), which shows that

$$\Lambda^\phi(t_n, t_{n+1})^2 \le \frac{1}{4}\int_0^1 \mathbb{E}\left[A_{t_n,t_{n+1}}(X_s, X'_s)^2\right] ds,$$

where $X_s, X'_s$, defined as in (5), are drawn from the linear path between $\pi_{t_n}^\phi$ and $\pi_{t_{n+1}}^\phi$. Finally, we note that the inner expectation is the path integral of the Fisher information metric along the linear path and evaluates to the symmetric KL divergence (Dabak & Johnson, 2002, Result 4):

$$\int_0^1 \mathbb{E}\left[A_{t_n,t_{n+1}}(X_s, X'_s)^2\right] ds = 2\,\mathrm{SKL}(\pi_{t_n}^\phi, \pi_{t_{n+1}}^\phi).$$

Therefore we have

$$\Lambda^\phi(T_N)^2 \le \frac{N}{2}\sum_{n=0}^{N-1}\mathrm{SKL}(\pi_{t_n}^\phi, \pi_{t_{n+1}}^\phi). \quad (17)$$

The slack in the inequality in Equation (17) could potentially depend on $\phi$, even in the large-$N$ regime. Therefore, during optimization, we recommend monitoring the value of the original objective function (Equation (14)) to ensure that optimizing the surrogate SKL objective indeed improves it, and hence the round trip rate performance of PT via Equation (2).
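The two tuning steps alternated in Algorithm 2 can be sketched as follows. This is a simplified interpretation rather than the authors' code: the schedule update uses piecewise-linear interpolation of the estimated cumulative barrier, where the paper fits a monotone cubic spline and runs bisection, and the SKL surrogate is shown for the univariate Gaussian case, where it has a closed form. Both function names are ours.

```python
import numpy as np

def update_schedule(ts, rejections, N):
    """Schedule update in the spirit of Equation (16): place t_1, ..., t_{N-1}
    so the estimated cumulative barrier is equidistributed across links.
    `rejections[n]` estimates r(t_n, t_{n+1}) on the current schedule `ts`."""
    cum = np.concatenate([[0.0], np.cumsum(rejections)])  # estimates t -> int_0^t lambda
    targets = cum[-1] * np.arange(N + 1) / N
    return np.interp(targets, cum, ts)                    # invert the monotone curve

def skl_gaussian(mu_a, var_a, mu_b, var_b):
    """Symmetric KL divergence between univariate Gaussians, the surrogate
    term summed over neighbouring chains in Equation (17)."""
    kl = lambda m1, v1, m2, v2: 0.5 * (v1 / v2 + (m2 - m1) ** 2 / v2
                                       - 1.0 + np.log(v2 / v1))
    return kl(mu_a, var_a, mu_b, var_b) + kl(mu_b, var_b, mu_a, var_a)
```

Because the per-link rejection estimates are the byproduct of the NRPT sweep itself (the $r_n$ returned by Algorithm 1), the schedule update costs essentially nothing beyond the sampling already performed.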
In the experiments we display the values of both objective functions.

4. Spline annealing path family

In this section, we develop a family of annealing paths (the spline annealing path family) that offers a practical and flexible improvement upon the traditional linear paths considered in past work. We first define a general family of annealing paths based on the exponential family, and then provide the specific details of the spline family with a discussion of its properties. Empirical results in Section 5 demonstrate that the spline annealing path family resolves the problematic Gaussian annealing example in Figure 1.

4.1. Exponential annealing path family

We begin with the practical desiderata for an annealing path family, given a fixed reference $\pi_0$ and target $\pi_1$.³ First, the traditional linear path $\pi_t \propto \pi_0^{1-t}\pi_1^t$ should be a member of the family, so that one can achieve at least the round trip rate provided by that path. Second, the family should be broadly applicable and not depend on particular details of either $\pi_0$ or $\pi_1$. Finally, using the Gaussian example from Figure 1 and Proposition 1 as insight, the family should enable the path to smoothly vary from $\pi_0$ to $\pi_1$ while inflating or deflating the variance as necessary. These desiderata motivate the design of the exponential annealing path family, in which each annealing path takes the form

$$\pi_t \propto \pi_0^{\eta_0(t)}\pi_1^{\eta_1(t)} = \exp\left(\eta(t)^T W(x)\right),$$

for some function $\eta(t) = (\eta_0(t), \eta_1(t))$ and reference/target log-densities $W(x) = (W_0(x), W_1(x))$. Intuitively, $\eta_0(t)$ and $\eta_1(t)$ represent the level of annealing of the reference and target, respectively, along the path. Proposition 2 shows that a broad collection of functions $\eta$ indeed yields a valid annealing path family, including the linear path.

³A natural extension of this discussion would include parametrized variational reference distribution families. For simplicity we restrict to a fixed reference.

Proposition 2.
Let $\Omega \subseteq \mathbb{R}^2$ be the set

$$\Omega = \left\{\xi \in \mathbb{R}^2 : \int \exp\left(\xi^T W(x)\right) dx < \infty\right\}.$$

Suppose $(0, 1) \in \Omega$ and $A$ is a set of piecewise twice continuously differentiable functions $\eta : [0, 1] \to \Omega$ such that $\eta(1) = (0, 1)$. Then $\left\{\pi_t(x) \propto \exp\left(\eta(t)^T W(x)\right) : \eta \in A\right\}$ is an annealing path family for target distribution $\pi_1$. If additionally $(1, 0) \in \Omega$, then the linear path $\eta(t) = (1-t, t)$ may be included in $A$. Finally, if for every $\eta \in A$ there exists $M > 0$ such that $\sup_t \max\{\|\eta'(t)\|_2, \|\eta''(t)\|_2\} \le M$ and Equation (11) holds with $V_1 = V_2 = M\|W\|_2$, then each path in the family satisfies the conditions of Theorem 2.

4.2. Spline annealing path family

Proposition 2 reduces the problem of designing a general family of paths of probability distributions to the much simpler task of designing paths in $\mathbb{R}^2$. We argue that a good choice can be constructed using linear spline paths connecting knots $\phi = (\phi_0, \ldots, \phi_K) \in (\mathbb{R}^2)^{K+1}$, i.e., for all $k \in \{1, \ldots, K\}$ and $t \in \left[\frac{k-1}{K}, \frac{k}{K}\right]$,

$$\eta_\phi(t) = (k - Kt)\,\phi_{k-1} + (Kt - k + 1)\,\phi_k.$$

Let $\Omega$ be defined as in Proposition 2. The $K$-knot spline annealing path family is defined as the set of $K$-knot linear spline paths such that $\phi_0 = (1, 0)$, $\phi_K = (0, 1)$, and $\phi_k \in \Omega$ for all $k$.

Validity: Since $\Omega$ is a convex set (per the proof of Proposition 2), we are guaranteed that $\eta_\phi([0, 1]) \subseteq \Omega$, and so this collection of annealing paths is a subset of the exponential annealing family and hence forms a valid annealing path family targeting $\pi_1$.

Convexity: Furthermore, convexity of $\Omega$ implies that tuning the knots $\phi \in \Omega^{K+1}$ involves optimization within a convex constraint set. In practice, we also enforce that the knots are monotone in each component, i.e., the first component monotonically decreases, $1 = \phi_{0,0} \ge \phi_{1,0} \ge \cdots \ge \phi_{K,0} = 0$, and the second increases, $0 = \phi_{0,1} \le \phi_{1,1} \le \cdots \le \phi_{K,1} = 1$, so that the path of distributions always moves from the reference to the target. Because monotonicity constraint sets are linear and hence convex, the overall monotonicity-constrained optimization problem has a convex domain.

Figure 2.
The spline path for $K = 1, 2, 3, 4, 5, 10$ knots for the family generated by $\pi_0 = N(-1, 0.5)$ and $\pi_1 = N(1, 0.5)$.

Flexibility: Assuming the family is nonempty, it trivially contains the linear path. Further, given a large enough number of knots $K$, the spline annealing family well-approximates bounded-derivative subsets $A_M$ of the exponential annealing family: for fixed $M > 0$, the worst-case approximation error $\sup_{\eta \in A_M}\inf_{\phi \in \Omega^{K+1}}\|\eta_\phi - \eta\|_\infty$ vanishes as $K \to \infty$. Figure 2 provides an illustration of the behaviour of optimized spline paths for a Gaussian reference and target. The path takes a convex curved shape; starting at the bottom right point of the figure (the reference), this path corresponds to increasing the variance of the reference, shifting the mean from reference to target, and finally decreasing the variance to match the target. With more knots, this process happens more smoothly. Appendix D provides an explicit derivation of the stochastic gradient estimates we use to optimize the knots of the spline annealing path family. It is also worth making two practical notes. First, to enforce the monotonicity constraint, we developed an alternative to the usual projection approach, as projecting onto the set of monotone splines can cause several knots to become superposed. Instead, we maintain monotonicity as follows: after each stochastic gradient step, we identify a monotone subsequence of knots containing the endpoints, remove the nonmonotone jumps, and linearly interpolate between the knots of the subsequence with an even spacing. Second, we take gradient steps in a log-transformed space so that knot components are always strictly positive.

5. Experiments

In this section, we study the empirical performance of non-reversible PT based on the spline annealing path family ($K \in \{2, 3, 4, 5, 10\}$) from Section 4, with knots and schedule optimized using the tuning method from Section 3. We compare this method to two PT methods

Figure 3.
Top: Cumulative round trips averaged over 10 runs for the spline path with $K = 2, 3, 4, 5, 10$ (solid blue), NRPT using a linear path (dashed green), and reversible PT with a linear path (dash-dot red). The slope of each line represents the round trip rate. We observe large gains going from linear to nonlinear paths ($K > 1$). For all values of $K > 1$, the optimized spline path substantially improves on the theoretical upper bound on the round trip rate possible using a linear path (dotted black). Bottom: Non-asymptotic communication barrier from Equation (15) (solid blue) and symmetric KL (dashed orange) as a function of iteration for one run of Path Opt NRPT + Spline ($K = 4$ knots).

based on standard linear paths: non-reversible PT with adaptive schedule ("NRPT+Linear") (Syed et al., 2019), and reversible PT ("Reversible+Linear") (Atchadé et al., 2011). Code for the experiments is available at https://github.com/vittrom/PT-pathoptim. We use the terminology "scan" to denote one iteration of the for-loop in Algorithm 2. The computational cost of a scan is comparable for all the methods, since the bottleneck is the local exploration step shared by all of them. These experiments demonstrate two primary conclusions: (1) tuned nonlinear paths provide a substantially higher round trip rate compared to linear paths for the examples considered; and (2) the symmetric KL sum objective (Equation (17)) is a good proxy for the round trip rate as a tuning objective. We run the following benchmark problems; see the supplement for details.

Gaussian: a synthetic setup in which the reference distribution is $\pi_0 = N(-1, 0.01^2)$ and the target is $\pi_1 = N(1, 0.01^2)$. For this example we used $N = 50$ parallel chains and fixed the computational budget to 45000 samples. For Algorithm 2, the computational budget was divided equally over 150 scans, meaning 300 samples were used for every gradient update.
Reversible+Linear performed 45000 local exploration steps with a communication step after every iteration, while for NRPT+Linear the computational budget was used to adapt the schedule. The gradient updates were performed using Adagrad (Duchi et al., 2011) with a learning rate of 0.2.

Beta-binomial model: a conjugate Bayesian model with prior π0(p) = Beta(180, 840) and likelihood L(x|p) = Binomial(x|n, p). We simulated data x1, . . . , x2000 ∼ Binomial(100, 0.7), resulting in a posterior π1(p) = Beta(140180, 60840). The prior and posterior are heavily concentrated at approximately 0.2 and 0.7 respectively. We used the same settings as for the Gaussian example.

Galaxy data: a Bayesian Gaussian mixture model applied to the galaxy dataset of Roeder (1990). We used six mixture components with mixture proportions w0, . . . , w5, mixture component densities N(µi, 1) for mean parameters µ0, . . . , µ5, and a categorical cluster-label variable for each data point. We placed a Dir(1_6) prior on the proportions, where 1_6 = (1, 1, 1, 1, 1, 1), and a N(150, 1) prior on each of the mean parameters. We did not marginalize the cluster indicators, creating a multi-modal posterior inference problem over 94 latent variables. In this experiment we used N = 35 chains and fixed the computational budget to 50000 samples, divided into 500 scans using 100 samples each. We optimized the path using Adagrad with a learning rate of 0.3. Exploring the full posterior distribution is challenging in this context due to a combination of misspecification of the prior and label switching. Label switching refers to the invariance of the likelihood under relabelling of the cluster labels. In Bayesian problems, label switching can increase the difficulty of the sampling problem, as it generates symmetric multi-modal posterior distributions.

High-dimensional Gaussian: a similar setup to the

Figure 4.
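The posterior parameters quoted for the beta-binomial model follow from the standard conjugate update Beta(a + Σx_i, b + mn − Σx_i). A quick arithmetic check, assuming the simulated counts total exactly m·n·p = 140000 (an idealization; an actual simulation's total would vary slightly):

```python
# Conjugate Beta-Binomial update: prior Beta(a, b) combined with m
# observations x_i ~ Binomial(n, p) yields posterior
# Beta(a + sum(x), b + m*n - sum(x)).
a, b = 180, 840            # prior Beta(180, 840)
m, n, p = 2000, 100, 0.7   # 2000 draws from Binomial(100, 0.7)
total = int(m * n * p)     # idealized total count: 140000
post_a = a + total
post_b = b + m * n - total
print(post_a, post_b)      # 140180 60840, matching the text
print(post_a / (post_a + post_b))  # posterior mean, roughly 0.697
```

The prior mean is 180/1020 ≈ 0.18 while the posterior mean is ≈ 0.70, which is why the two distributions are nearly mutually singular and the sampling problem is hard for linear paths.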
Round trip rate averaged over 5 runs for the spline path with K = 4 (solid blue), NRPT using a linear path (dashed green), and the theoretical upper bound on the round trip rate possible using a linear path (dotted black), as a function of the dimensionality of the target distribution.

one-dimensional Gaussian experiment where the number of dimensions ranges from d = 1 to d = 256. The reference distribution is π0 = N(−1_d, (0.1²)I_d) and the target is π1 = N(1_d, (0.1²)I_d), where the subscript d indicates the dimensionality of the problem. The number of chains N is set to increase with dimension at the rate N = 15√d. We fixed the number of spline knots K to 4 and set the computational budget to 50000 samples, divided into 500 scans with 100 samples per gradient update. The gradient updates were performed using Adagrad with a learning rate of 0.2. For all the experiments we performed one local exploration step before each communication step.

The results of these experiments are shown in Figures 3 and 4. Examining the top row of Figure 3, which shows the number of round trips as a function of the number of scans, one can see that PT using the spline annealing family outperforms PT using the linear annealing path across all numbers of knots tested. Moreover, the slope of these curves demonstrates that PT with the spline annealing family exceeds the theoretical upper bound on the round trip rate for the linear annealing path (from Equation (12)). The largest gain is obtained by going from K = 1 (linear) to K = 2. For all the examples, increasing the number of knots beyond K = 2 leads to marginal improvements. In the case of the Gaussian example, note that since the global communication barrier Λ for the linear path is much larger than N, algorithms based on linear paths incurred rejection rates of nearly one for most chains, resulting in no round trips.
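The growth in difficulty with dimension can be made concrete: for equal-covariance Gaussians N(µ0, σ²I_d) and N(µ1, σ²I_d), each KL direction equals ‖µ1 − µ0‖²/(2σ²), so the symmetric KL is ‖µ1 − µ0‖²/σ² and grows linearly in d for the means ±1_d used here. A small check (the helper function is written for this summary, not taken from the paper):

```python
import numpy as np

def skl_isotropic_gaussians(mu0, mu1, sigma2):
    """Symmetric KL between N(mu0, sigma2*I) and N(mu1, sigma2*I).
    With equal isotropic covariances, each KL term reduces to
    ||mu1 - mu0||^2 / (2 * sigma2), so the symmetric KL is
    ||mu1 - mu0||^2 / sigma2."""
    mu0 = np.asarray(mu0, dtype=float)
    mu1 = np.asarray(mu1, dtype=float)
    return float(np.sum((mu1 - mu0) ** 2) / sigma2)

for d in (1, 64, 256):
    # reference N(-1_d, 0.1^2 I_d) vs. target N(+1_d, 0.1^2 I_d):
    # the symmetric KL is 4d / 0.01, i.e. linear in d
    print(d, skl_isotropic_gaussians(-np.ones(d), np.ones(d), 0.01))
```

Since the global communication barrier roughly tracks the divergence along the path, this linear growth is consistent with scaling the number of chains as N = 15√d.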
The bottom row of Figure 3 shows the value of the surrogate SKL objective and the non-asymptotic communication barrier from Equation (15). In particular, these figures demonstrate that the SKL provides a surrogate objective that is a reasonable proxy for the non-asymptotic communication barrier, but does not exhibit as large an estimation variance in early iterations, when there are pairs of chains with rejection rates close to one. Figure 4 shows the round trip rate as a function of the dimensionality of the problem for Gaussian target distributions. As the dimensionality increases, the sampling problem becomes fundamentally more difficult, explaining the decay in performance of NRPT with both a linear path and an optimized path. Both sampling methods are provided with a fixed computational budget in all runs. Due to this fixed budget, NRPT is unable, for d ≥ 64, to approach the optimal schedule in the allotted time. This leads to an increasing gap between the round trip rates of NRPT with the linear path and the spline path for d ≥ 64.

6. Discussion

In this work, we identified the use of linear paths of distributions as a major bottleneck in the performance of parallel tempering algorithms. To address this limitation, we have provided a theory of parallel tempering on nonlinear paths, a methodology to tune parametrized paths, and finally a practical, flexible family of paths based on linear splines. Future work in this line of research includes extensions to estimate normalization constants, as well as the development of techniques and theory surrounding the use of variational reference distributions.

References

Andrieu, C. and Moulines, E. On the ergodicity properties of some adaptive MCMC algorithms. Annals of Applied Probability, 16(3):1462–1505, 2006.

Atchadé, Y. F., Roberts, G. O., and Rosenthal, J. S. Towards optimal scaling of Metropolis-coupled Markov chain Monte Carlo. Statistics and Computing, 21(4):555–568, 2011.
Ballnus, B., Hug, S., Hatz, K., Görlitz, L., Hasenauer, J., and Theis, F. J. Comprehensive benchmarking of Markov chain Monte Carlo methods for dynamical systems. BMC Systems Biology, 11(1):63, 2017.

Brekelmans, R., Masrani, V., Bui, T., Wood, F., Galstyan, A., Steeg, G. V., and Nielsen, F. Annealed importance sampling with q-paths. arXiv:2012.07823, 2020.

Costa, S. I., Santos, S. A., and Strapasson, J. E. Fisher information distance: A geometrical reading. Discrete Applied Mathematics, 197:59–69, 2015.

Dabak, A. G. and Johnson, D. H. Relations between Kullback-Leibler distance and Fisher information. Journal of The Iranian Statistical Society, 5:25–37, 2002.

Desjardins, G., Luo, H., Courville, A., and Bengio, Y. Deep tempering. arXiv:1410.0123, 2014.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.

Gelman, A. and Meng, X.-L. Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Statistical Science, pp. 163–185, 1998.

Geyer, C. J. Markov chain Monte Carlo maximum likelihood. Interface Proceedings, 1991.

Grosse, R. B., Maddison, C. J., and Salakhutdinov, R. R. Annealing between distributions by averaging moments. In Advances in Neural Information Processing Systems, 2013.

Kamberaj, H. Molecular Dynamics Simulations in Statistical Physics: Theory and Applications. Springer, 2020.

Katzgraber, H. G., Trebst, S., Huse, D. A., and Troyer, M. Feedback-optimized parallel tempering Monte Carlo. Journal of Statistical Mechanics: Theory and Experiment, 2006(03):P03018, 2006.

Lingenheil, M., Denschlag, R., Mathias, G., and Tavan, P. Efficiency of exchange schemes in replica exchange. Chemical Physics Letters, 478(1-3):80–84, 2009.

Müller, N. F. and Bouckaert, R. R. Adaptive parallel tempering for BEAST 2. bioRxiv, 2020. doi: 10.1101/603514.
Okabe, T., Kawata, M., Okamoto, Y., and Mikami, M. Replica-exchange Monte Carlo method for the isobaric-isothermal ensemble. Chemical Physics Letters, 335(5-6):435–439, 2001.

Predescu, C., Predescu, M., and Ciobanu, C. V. The incomplete beta function law for parallel tempering sampling of classical canonical systems. The Journal of Chemical Physics, 120(9):4119–4128, 2004.

Rainforth, T., Kosiorek, A. R., Le, T. A., Maddison, C. J., Igl, M., Wood, F., and Teh, Y. W. Tighter variational bounds are not necessarily better. In International Conference on Machine Learning, 2018.

Rischard, M., Jacob, P. E., and Pillai, N. Unbiased estimation of log normalizing constants with applications to Bayesian cross-validation. arXiv:1810.01382, 2018.

Roeder, K. Density estimation with confidence sets exemplified by superclusters and voids in the galaxies. Journal of the American Statistical Association, 85(411):617–624, 1990.

Sakai, Y. and Hukushima, K. Irreversible simulated tempering. Journal of the Physical Society of Japan, 85(10):104002, 2016.

Syed, S., Bouchard-Côté, A., Deligiannidis, G., and Doucet, A. Non-reversible parallel tempering: an embarrassingly parallel MCMC scheme. arXiv:1905.02939, 2019.

Tawn, N. G., Roberts, G. O., and Rosenthal, J. S. Weight-preserving simulated tempering. Statistics and Computing, 30(1):27–41, 2020.

Whitfield, T., Bu, L., and Straub, J. Generalized parallel sampling. Physica A: Statistical Mechanics and its Applications, 305(1-2):157–171, 2002.

Woodard, D. B., Schmidler, S. C., Huber, M., et al. Conditions for rapid mixing of parallel and simulated tempering on multimodal distributions. The Annals of Applied Probability, 19(2):617–640, 2009.

Zhou, Y., Johansen, A. M., and Aston, J. A. Toward automatic model comparison: An adaptive sequential Monte Carlo approach. Journal of Computational and Graphical Statistics, 25(3):701–726, 2016.