# replica_exchange_for_nonconvex_optimization__077fa645.pdf

Journal of Machine Learning Research 22 (2021) 1-59 Submitted 7/20; Revised 6/21; Published 6/21

Replica Exchange for Non-Convex Optimization

Jing Dong jing.dong@gsb.columbia.edu Graduate School of Business Columbia University 3022 Broadway New York, NY 10027, USA

Xin T. Tong mattxin@nus.edu.sg Department of Mathematics National University of Singapore Block S17, 10 Lower Kent Ridge Road Singapore 119077

Editor: Qiang Liu

Gradient descent (GD) is known to converge quickly for convex objective functions, but it can be trapped at local minima. On the other hand, Langevin dynamics (LD) can explore the state space and ﬁnd global minima, but in order to give accurate estimates, LD needs to run with a small discretization step size and weak stochastic force, which in general slow down its convergence. This paper shows that these two algorithms can collaborate through a simple exchange mechanism, in which they swap their current positions if LD yields a lower objective function. This idea can be seen as the singular limit of the replica-exchange technique from the sampling literature. We show that this new algorithm converges to the global minimum linearly with high probability, assuming the objective function is strongly convex in a neighborhood of the unique global minimum. By replacing gradients with stochastic gradients, and adding a proper threshold to the exchange mechanism, our algorithm can also be used in online settings. We also study non-swapping variants of the algorithm, which achieve similar performance. We further verify our theoretical results through some numerical experiments, and observe superior performance of the proposed algorithm over running GD or LD alone.

Keywords: Non-convex optimization, Langevin dynamics, gradient descent, stochastic gradient Langevin dynamics, stochastic gradient descent

1. Introduction

Division of labor is the secret of any eﬃcient enterprises. By collaborating with individuals with diﬀerent skillsets, we can focus on tasks within our own expertise and produce better outcomes than working independently. This paper asks whether the same principle can be applied when designing an algorithm. Given a general smooth non-convex objective function F, we consider the unconstrained optimization problem

min x Rd F(x). (1)

c 2021 Jing Dong and Xin Tong.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/20-697.html.

Dong and Tong

It is well-known that deterministic optimization algorithms, such as gradient descent (GD), can converge to a local minimum quickly (Nesterov, 2013). However, this local minimum may not be the global minimum, and GD will be trapped there afterwards. On the other hand, sampling-based algorithms, such as Langevin dynamics (LD) can escape local minima by their stochasticity, but the additional stochastic noise contaminates the optimization results and slows down the convergence when the iterate is near the global minimum. More generally, deterministic algorithms are mostly designed to ﬁnd local minima quickly, but they can be terrible at exploration. Sampling-based algorithms are better suited for exploring the state space, but they are ineﬃcient when pinpointing the local minima. This paper investigates how they can collaborate to get the best of both worlds . The collaboration mechanism we introduce stems from the idea of replica exchange in the sampling literature. Its implementation is very simple: we run a copy of GD, denoted by Xn; and a copy of discretized LD, denoted by Yn, in parallel. At each iteration, if F(Xn) > F(Yn), we swap their positions. At the ﬁnal iteration, we output XN. The proposed algorithm, denoted by GDx LD and formalized in Algorithm 1 below, enjoys the expertise of both GD and LD. In particular, we establish that if F is convex in a neighborhood of the unique global minimum, then, for any ϵ > 0, there exists N(ϵ) = O(ϵ 1), such that for N N(ϵ), |F(XN) F(x )| < ϵ with high probability, where x is the global minimum. If F is strongly convex in the same neighborhood, we can further obtain linear convergence; that is, N(ϵ) can be reduced to O(log 1

ϵ). As we will demonstrate with more details in Section 2.2, GDx LD can be seen as the singular limit of a popular sampling algorithm, known as replica exchange or parallel tempering. The exchange mechanism is in place to make the sampling algorithm a reversible MCMC. However, for the purpose of optimization, exchanging the locations of GD and LD is not the only option. In fact, both GD and LD can obtain the location of Yn if F(Xn) > F(Yn). This leads to a slightly diﬀerent version of the algorithm, which we will denote as n GDx LD where n stands for non-swapping . Since the LD part will not be swapped to the location of the GD, it is a bona-ﬁde Langevin diﬀusion algorithm. In terms of performance, we establish the same complexity estimates for both GDx LD and n GDx LD. Our numerical experiments also verify that GDx LD and n GDx LD have similar performance. To simply the notation, we write (n)GDx LD when we refer to both versions of the algorithm. Notably, the complexity bounds we establish here are the same as the complexity bounds for standard GD when F is globally convex (or strongly convex) (Nesterov, 2013), but we only need F to be convex ( or strongly convex) near x , which is signiﬁcantly weaker. It is not diﬃcult to see intuitively why (n)GDx LD works well in such non-convex settings. The LD explores the state space and visits the neighborhood of the global minimum. Since this neighborhood is of a constant size, it can be found by the LD in constant time. Moreover, this neighborhood gives lower values of F than anywhere else, so the GD will be swapped there, if it is not there already. Finally, GD can pinpoint the global minimum as it now starts in the right neighborhood. Figure 1 provides a visualization of the mechanism. For many modern data-driven applications, F is an empirical loss function, so its precise evaluation and its gradient evaluation can be computationally expensive. For these scenarios, we also consider an online modiﬁcation of our algorithm. The natural modiﬁcation of GD is stochastic gradient decent (SGD), and the modiﬁcation of LD is stochastic gradient

Replica Exchange for Non-Convex Optimization

Figure 1: (Left) GDx LD in action: when LD enters a neighborhood of the global minimum (yellow region), exchange happens and helps GD to escape the local minimum that traps it. After exchange, GD can converge to the global minimum, and LD keeps exploring. (Right) n GDx LD is similar to GDx LD; the only diﬀerence is that LD will stay at its original location, instead of swapping with GD.

Langevin dynamics (SGLD). This algorithm, denoted by SGDx SGLD and formalized in Algorithm 2 below, achieves a similar complexity bound as GDx LD if F is strongly convex in the neighborhood of the unique global minimum x . For the theory to apply, we also need the noise of the stochastic gradient to be sub-Gaussian and is of order O( ϵ). This assumption can often be met by using a mini-batch of size O(ϵ 1), which in principle should be factored in for complexity. In this case, the overall complexity of SGDx SGLD is O(ϵ 1 log 1

ϵ). Similar to the oﬄine scenario, the non-swapping variation, n SGDx SGLD, can be obtained by keeping the SGLD part not swapped to the location of the SGD.

1.1 Related work

Non-convex optimization problems arise in numerous advanced machine learning, statistical learning, and structural estimation settings (Geyer and Thompson, 1995; Bottou et al., 2018). How to design eﬃcient algorithms with convergence guarantees has been an active area of research due to their practical importance. In what follows, we discuss some existing results related to our work. As this is a fast growing and expanding ﬁeld, we focus mostly on algorithms related to GD, LD, or their stochastic gradient versions. One main approach to study nonlinear optimization is to ask when an algorithm can ﬁnd local minima eﬃciently. The motivation behind ﬁnding local minima, rather than global minima, is that in some machine learning problems, such as matrix completion and wide neural networks, local minima already have good statistical properties or prediction power (Ge et al., 2017; Park et al., 2017; Bhojanapalli et al., 2016; Du et al., 2018; Ge et al., 2018; Mei et al., 2017). Moreover, the capability to ﬁnd local minima or second-order stationary points (SOSP) is nontrivial, since GD can be trapped at saddle points. When full Hessian information is available, this can be done through algorithms such as cubic-regularization or trust region (Nesterov and Polyak, 2006; Curtis et al., 2014). If only gradient is available,

Dong and Tong

one general idea is to add stochastic noise so that the algorithms can escape saddle points (Jin et al., 2017; Allen-Zhu, 2018; Jin et al., 2018; Yu et al., 2018; Daneshmand et al., 2018; Jin et al., 2019). But the size of the noise and the step size need to be tuned based on the accuracy requirement. This, on the other hand, reduces the learning rate and the speed of escaping saddle points. For example, to ﬁnd an ϵ-accurate SOSP in the oﬄine setting, the perturbed gradient descent method requires O(ϵ 2) iterations (Jin et al., 2017), and in the online setting, it requires O(ϵ 4) iterations (Jin et al., 2019). These convergence rates are slower than the ones of our proposed algorithms.

For problems in which local minima are not good enough, we often need to use samplingbased algorithms to ﬁnd global minima. This often involves simulating an ergodic stochastic process for which the invariant measure is proportional to exp( 1

γ F(x)), where γ is referred to as the temperature , which often controls the strength of stochasticity. Because the process is ergodic, it can explore the state space. However, for the invariant measure to concentrate around the global minimum, γ needs to be small. Then for these sampling based-algorithms to ﬁnd accurate approximation of global minima, they need to use weaker stochastic noise or smaller step sizes, which in general slow down the convergence. For LD in the oﬄine setting and SGLD in the online setting, the complexity is ﬁrst studied in Raginsky et al. (2017), later improved by Xu et al. (2018), and generalized by Chen et al. (2019a) to settings with decreasing step sizes. In Xu et al. (2018), it is shown that LD can ﬁnd an ϵ-accurate global minimum in O(ϵ 1) iterations, and SGLD can do so in O(ϵ 5) iterations. These algorithms have higher complexity than the ones we proposed in this paper. Note that in Xu et al. (2018), it keeps the temperature at a constant order, in which case the step size needs to scale with ϵ. As we will discuss in more details in Section 2, the algorithm we propose keeps both the temperature and the step size at constant values.

We also comment that sampling-based algorithms may lead to better dependence on the dimension of the problem for some MCMC problems as discussed in Ma et al. (2019). However, if the goal is to ﬁnd global minima of a Lipschitz-smooth non-convex optimization problem, the complexity in general has exponential dependence on the dimension. This has to do with the spectral gap of the sampling process and has been extensively discussed in Ma et al. (2019); Raginsky et al. (2017); Xu et al. (2018). Due to this, our developments in this paper focus on settings where the dimension of the problem is ﬁxed, and we characterize how the complexity scales with the precision level ϵ.

Aside from optimization, LD and related Markov Chain Monte Carlo (MCMC) algorithms are also one of the main workhorses for Bayesian statistics. Our work is closely related to the growing literature on convergence rate analysis for LD. Asymptotic convergence of discretized LD with decreasing temperature (diﬀusion coeﬃcient) and step sizes is extensively studied in the literature (see, e.g., Gidas (1985); Pelletier (1998)). Nonasymptotic performance bounds for discrete-time Langevin updates in the case of convex objective functions are studied in Durmus and Moulines (2017); Dalalyan (2017); Cheng et al. (2018); Dalalyan and Karagulyan (2019). In MCMC, the goal is to sample from the stationary distribution of LD, and thus the performance is often measured by the Wasserstein distance or the Kullback-Leibler divergence between the ﬁnite-time distribution and the target invariant measure (Ma et al., 2015; Bubeck et al., 2018). In this paper, we use the performance metric P(|F(XN) F(x )| > ϵ), which is more suitable for the goal of optimizing a non-

Replica Exchange for Non-Convex Optimization

convex function. This also leads to a very diﬀerent framework of analysis compared to the existing literature.

1.2 Main contribution

The main message of this paper is that, by combining a sampling algorithm with an optimization algorithm, we can create a better tool for some non-convex optimization problems. From the practical point of view, the new algorithms have the following advantages:

When the exact gradient F is accessible, we propose the GDx LD algorithm and its non-swapping variant, n GDx LD. Their implementation does not require the step size h or temperature γ to change with the precision level ϵ. Such independence is important in practice, since tuning hyper-parameters is in general diﬃcult. As we will demonstrate in Section 3.1.2, our algorithm is quite robust to a wide range of hyper-parameters.

When the dimension is ﬁxed, for a given precision level ϵ, (n)GDx LD beats existing algorithms in computational cost. In particular, we show (n)GDx LD can reach approximate global minimum with a high probability in O(log 1

ϵ) iterations. Comparing to the iteration estimate of LD in Xu et al. (2018), which is O(ϵ 1 log 1

ϵ), our algorithm is much more eﬃcient, but we require the extra assumption that F has a unique global minimum x , and F is strongly convex near x .

When only stochastic approximation of F is available, we propose the SGDx SGLD algorithm and its non-swapping variant, n SGDx SGLD. Like (n)GDx LD, their implementation does not require the temperature or the step size to change with the precision level. (n)SGDx SGLD is also more eﬃcient when comparing with other online optimization methods using stochastic gradients. In particular, we show (n)SGDx SGLD can reach approximate global minimum with high probability with an O(ϵ 1 log 1

ϵ) complexity. This is better than the complexity estimate of VR-SGLD in Xu et al. (2018), which is O(ϵ 4(log 1

ϵ)4). The additional assumptions we require is that the function evaluation noise is sub-Gaussian, and F is strongly convex near the unique global minimum x .

In term of algorithm design, a novel aspect we introduce is the exchange mechanism. The idea comes from replica-exchange sampling, which has been designed to overcome some of the diﬃculties associated with rare transitions to escape local minima (Earl and Deem, 2005; Woodard et al., 2009). Dupuis et al. (2012) uses the large deviation theory to deﬁne a rate of convergence for the empirical measure of replica-exchange Langevin diﬀusion. It shows that the large deviation rate function is increasing with the exchange intensity, which leads to the development of inﬁnite swapping dynamics. We comment that inﬁnite swapping is not feasible in practice as the actual swapping rate depends on the discretization stepsize. The algorithm that we propose attempts a swap at every iteration, which essentially maximizes the practical swapping intensity. Chen et al. (2019b) extend the idea to solve non-convex optimization problems, but they only discuss the exchange between two LD processes. Our work further extends the idea by combining GD with LD and provides ﬁnite-time performance bounds that are tailored to the optimization setting.

Dong and Tong

In online settings, the function and gradient evaluations are contaminated with noises. Designing the exchange mechanism in this scenario is more challenging, since we need to avoid and mitigate the possible erroneous exchanges. We demonstrate such a design is feasible by choosing a proper mini-batch size as long as the function evaluation noise is sub Gaussian. The analysis of SGDx SGLD is hence much more challenging and nonstandard, when comparing with that of GDx LD. While it is a natural idea to let two algorithms specializing in diﬀerent tasks to collaborate, coming up with a good collaboration mechanism is not straightforward. For example, one may propose to let the LD explore the state-space ﬁrst, after it ﬁnds the region of the global minimum, one would turn-oﬀthe temperature, i.e., setting γ = 0, and run the GD. However, in practice, it is hard to come up with a systematic way to check whether the process is in the neighborhood of the global minimum or not. One may also propose to turn down the temperature of the LD gradually. This is the idea of simulated annealing (Kirkpatrick et al., 1983; Mangoubi and Vishnoi, 2018). The challenge there is to come up with a good cooling schedule . To ensure convergence to global minima theoretically, the temperature needs to be turned down very slowly, which could jeopardize the eﬃciency of the algorithm (Gelfand and S.K., 1991). For example, Granville et al. (1994) shows that for the simulated annealing to converge, the temperature at the n-th iteration needs to be of order 1/ log n. Readers who are familiar with optimization algorithms might naturally think of doing replica exchange between other deterministic algorithms and other sampling-based algorithms. Standard deterministic algorithms include GD with line search, Newton s method, and heavy-ball methods such as Nestrov acceleration. Sampling-based algorithms include random search, perturbed gradient descent, and particle swarm optimization. We investigate GDx LD instead of other exchange combinations, not because GDx LD is superior in convergence speed, but because of the following two reasons:

1. GDx LD can be seen as a natural singular limit of replica-exchange LD, which is a mathematical subject with both elegant theoretical properties and useful sampling implications. We will explain the connection between GDx LD and replica-exchange LD in Section 2.2.

2. GDx LD is very simple to implement. It can be easily adapted to the online setting. We will explain how to do so in Section 2.5.

On the other hand, it would be interesting to see whether exchange between other algorithms can provide faster rate of convergence in theory or in practice. We leave this as a future research direction, and think the analysis framework and proving techniques we developed here can be extended to more general settings. In particular, due to the division of labor, LD is only used to explore the state-space. Therefore, instead of establishing its convergence as in Raginsky et al. (2017); Xu et al. (2018), we only need to prove that it is suitably ergodic, such that there is a positive probability of visiting the neighborhood of the global minimum. GD is only used for exploitation and the complexity only depends on its behavior in the neighborhood of the global minimum. Lastly, one of the key element in the complexity of (n)GDx LD is how long it takes LD to ﬁnd the neighborhood of the global minimum. High dimensional and multi-modal

Replica Exchange for Non-Convex Optimization

structures can be diﬃcult to sample using just LD. In the literature, sampling techniques such as dimension reduction, Laplace approximation, Gibbs-type-partition can be applied to improve the sampling eﬃciency for high dimensional problems (Hairer et al., 2014; Cui et al., 2014; Morzfeld et al., 2019; Tong et al., 2020). As for multi-modality, it is well known that LD may have slow convergence when sampling from mixture of log-concave densities if each component density is close to singular (Menz and Schlichting, 2014). The replica exchange method, discussed in Section 2.2 below, considers running multiple LDs with diﬀerent temperatures and exchanging among them (Swendsen and Wang, 1986). It has been shown to perform well for multi-modal densities (Woodard et al., 2009; Dong and Tong, 2020). Other than replica exchange, simulated tempering, which considers dynamically changing the temperature of the LD can also handle such multimodality in general (Marinari and Parisi, 1992; Woodard et al., 2009; Lee et al., 2018). Replacing LD with these more advanced sampler may facilitate more eﬃcient exploration when facing more challenging energy landscapes.

2. Main results

In this section, we present the main algorithms: GDx LD and n GDx LD (Algorithm 1), SGDx SGLD and n SGDx SGLD (Algorithm 2). We also provide the corresponding complexity analysis. We start by developing (n)GDx LD in the oﬄine setting (Section 2.1) and discuss the connection between GDx LD and replica-exchange LD (Section 2.2). The rigorous complexity estimate is given by Theorem 2 in Section 2.3. We also study how the complexity depends on the exploration speed of LD in Section 2.4 (see Theorem 3). We then discuss how to adapt the algorithm to the online setting and develop (n)SGDx SGLD in Section 2.5. Section 2.6 provides the corresponding complexity analysis Theorems 4 and 5. To highlight the main idea and make the discussion concise, we defer the proof of Theorems 2 and 3 to Appendix A, and the proof of Theorems 4 and 5 to Appendix B.

2.1 Replica exchange in oﬄine setting

We begin with a simple oﬄine optimization scenario, where the objective function F and its gradient F are directly accessible. GD is one of the simplest deterministic iterative optimization algorithms. It involves iterating the formula

Xn+1 = Xn F(Xn)h. (2)

The hyper-parameter h is known as the step size, which can often be set as a small constant. If F is convex and x is a global minimum, the GD iterates can converge to a global minimum linearly fast; that is, F(Xn) F(x ) ϵ if n = O(1

ϵ). If F is strongly convex with convexity parameter m, then the convergence will be exponentially fast; that is, F(Xn) F(x ) ϵ if n = O((log 1

ϵ)/m). In the general non-convex optimization setting, F can still be strongly convex near a global minimum x . In this case, if GD starts near x , the iterates can converge to x

very fast. However, this is hard to implement in practice, since it requires a lot of prior knowledge of F. If we start GD at an arbitrary point, the iterates are likely to be trapped at a local minimum. To resolve this issue, one method is to add stochastic noise to GD and

Dong and Tong

generate iterates according to

Yn+1 = Yn F(Yn)h + p

2γh Zn, (3)

where Zn s are i.i.d. samples from N(0, Id). For a small enough step size or learning rate h, the iterates (3) can be viewed as a temporal discretization of the LD

d Yt = F(Yt)dt + p

2γd Bt, (4)

where Bt is a d-dimensional standard Brownian motion. The diﬀusion process (4) is often called the overdamped Langevin dynamic or unadjusted Langevin dynamic (Durmus and Moulines, 2017). Here for notational simplicity, we refer to algorithm (3) as LD as well. γ is referred to as the temperature of the LD. It is known that under certain regularity conditions, (4) has an invariant measure πγ(x) exp( 1

γ F(x)). Note that the stationary measure is concentrated around the global minimum for small γ. Therefore, it is reasonable to hypothesize that by iterating according to (3) enough times, the iterates will be close to x . Adding the stochastic noise 2γh Zn in (3) partially resolves the non-convexity issue, since the iterates can now escape local minima. On the other hand, in order for (3) to be a good approximation of (4), and consequently to be a good sampler of πγ, it is necessary to use a small step size h and hence a lot of iterations. In particular, Xu et al. (2018) shows in Corollary 3.3 that in order for E[F(Yn)] F(x ) = O(ϵ), h needs to scale linearly as ϵ, and the computational complexity is O(ϵ 1 log 1

ϵ). This is slower than GD if F is strongly convex. In summary, when optimizing a non-convex F, one has a dilemma in choosing the types of algorithms. Using stochastic algorithms like LD will eventually ﬁnd a global minimum, but they are ineﬃcient for accurate estimation. Deterministic algorithms like GD is more eﬃcient if initialized properly, but there is the danger that they can get trapped in local minima when initialized arbitrarily. To resolve this dilemma, idealistically, we can use a stochastic algorithm ﬁrst to explore the state space. Once we detect that the iterate is close to x , we switch to a deterministic algorithm. However, it is in general diﬃcult to write down a detection criterion a-priori. Our idea is to devise a simple on-the-ﬂy criterion. It involves running a copy of GD, Xn in (2), and a copy of LD, Yn in (3), simultaneously. Since a smaller F-value implies the iterate is closer to x in general, we apply GD in the next iteration for the one with a smaller F-value, and LD to the one with a larger F-value. In other words, we want to exchange the locations of Xn and Yn if F(Xn) > F(Yn) + t0, where t0 0 is a properly chosen threshold. Finally, to implement the non-swapping variation, that is n GDx LD, it suﬃces to let LD stay at its original location when the exchange takes place. A more detailed description of the algorithm is given in Algorithm 1.

2.2 GDx LD as a singular limit of replica-exchange Langevin diﬀusion

In this section, we review the idea of replica-exchange LD, and show its connection with GDx LD. Consider an LD

d Xt = F(Xt)dt +

Replica Exchange for Non-Convex Optimization

Algorithm 1: GDx LD and n GDx LD: oﬄine optimization

Input: Temperature γ, step size h, number of steps N, exchange threshold t0 [0, r0/8), and initial X0, Y0. for n = 0 to N 1 do

X n+1 = Xn F(Xn)h; Y n+1 = Yn F(Yn)h + 2γh Zn, where Zn N(0, Id).; if F(Y n+1) < F(X n+1) t0 then

(Xn+1, Yn+1) =

( (Y n+1, X n+1), if GDx LD is applied (Y n+1, Y n+1), if n GDx LD is applied ;

(Xn+1, Yn+1) = (X n+1, Y n+1). end end Output: XN as an optimizer for F.

where Wt is a Brownian motion independent of Bt. Under certain regularity conditions on F, {Xt}t 0 has a unique stationary measure πν satisfying

πν(x) exp 1

Note that the stationary measure is concentrated around the unique global minimum x , and the smaller the value of ν, the higher the concentration is. In particular, if ν 0, then πν is a Dirac measure at x . However, from the algorithmic point of view, the smaller the value of ν, the slower the stochastic process converges to its stationary distribution. In practice, we can only sample the LD for a ﬁnite amount of time, which gives rise to the tradeoﬀbetween the concentration around the global minimum and the convergence rate to stationarity. One idea to overcome this tradeoﬀis to run two Langevin diﬀusions with two diﬀerent temperatures, high and low, in parallel, and combining the two in an appropriate way so that we can enjoy the beneﬁt of both high and low temperatures. This idea is known as replica-exchange LD. Consider

d Xa t = F(Xa t )dt +

d Y a t = F(Y a t )dt + p

The way we would connect them is to allow exchange between Xt and Yt at random times. In particular, we swap the positions of Xt and Yt according to a state-dependent rate

s(x, y; a) := a exp 0 n ( 1

γ )(F(x) F(y) o .

We refer to a as the exchange intensity. The joint process (Xa t , Y a t ) has a unique stationary measure πν,γ(x, y) exp 1

Based on πν,γ, one would want to set ν small so that Xt can exploit local minima, and set γ large so that Yt can explore the state space to ﬁnd better minima. We exchange the

Dong and Tong

positions of the two with high probability if Yt ﬁnds a better local minimum neighborhood to exploit than the one where Xt is currently at. For optimization purposes, we can send ν to zero. In this case

d Xa t = F(Xa t )dt, s(x, y; a) = a1{F(x)<F(y)},

and π0,γ(x, y) δx (x) exp 1

We would also like to send a to inﬁnity to allow exchange as soon as Y a t ﬁnd a better region to explore. However, the processes (Xa, Y a) do not converge in natural senses when a , because the number of swap-caused discontinuities, which are of size O(1), will grow without bound in any time interval of positive length. Dupuis et al. (2012) uses a temperature swapping idea to accommodate this. Consider a temperature exchange process

d Xa t = F( Xa t )dt + q

2γ1{ Za t =1}d Wt

d Y a t = F( Y a t )dt + q

2γ1{ Za t =0}d Bt,

where { Za t }t 0 is a Markov jump process that switches from 0 to 1 at rate a1{F(x)>F(y)} and 1 to 0 at rate a1{F(x)<F(y)}. Now, sending a to inﬁnity, we have

d X t = F( X t )dt + q

2γ1{F( X t )>F( Y t )}d Wt

d Y t = F( Y t )dt + q

2γ1{F( X t )<F( Y t )}d Bt.

In actual implementations, we will not be able to exactly sample the continuous-time processes. We thus implement a discretized version of it as described in Algorithm 1.

2.3 Complexity bound for GDx LD and n GDx LD

We next present the performance bound for GDx LD and n GDx LD. We start by introducing some assumptions on F. First we need the function to be smooth in the following sense:

Assumption 1 F is Lipschitz continuous with Lipschitz constant L (0, ), i.e.,

F(x) F(y) L x y .

Second, we need some conditions under which the iterates or their function values will not diverge to inﬁnity. The following assumption ensures the gradient will push the iterates back once they get too large:

Assumption 2 The utility function is coercive. In particular, there exist constants λc, Mc (0, ), such that F(x) 2 λc F(x) Mc,

and F(x) when x .

Replica Exchange for Non-Convex Optimization

Note that another more commonly used deﬁnition of coerciveness (or dissipation) (Raginsky et al., 2017; Xu et al., 2018) is

F(x), x λ0 x 2 + M0. (5)

The condition (5) is stronger than Assumption 2. In general, (5) can be enforced by adding proper regularizations. These are explained by Lemma 1. Its proof can be found in Appendix A.

Lemma 1 Under Assumption 1,

1. For any λ > 0, (5) holds for Fλ(x) = F(x) + λ x 2.

2. Suppose (5) holds, then Assumption 2 holds.

The last condition we need is that F is twice diﬀerentiable and strongly convex near the global minimum x . This allows GD to ﬁnd x eﬃciently when it is close to x .

Assumption 3 x is the unique global minimum for F. There exists r0 > 0, such that the sub-level set B0 = {x : F(x) F(x ) + r0} is star-convex with x being the center, i.e., a line segment connecting x and any x B0 is in B0. There exists m, M, r > 0 such that the Hessian 2F exists for B0 {x : x x < r}, and for any vector v

m v 2 v T 2F(x)v M v 2, x : x x < r.

Note that when Assumption 1 holds, m M L.

Assumption 3 is a mild one in practice, since checking the Hessian matrix is positive deﬁnite is often the most direct way to determine whether a point is a local minimum. Figuratively speaking, Assumption 3 essentially requires that F has only one global minimum x at the bottom of a valley of F. It is important to emphasize that we do not assume knowing where this valley is, otherwise it suﬃces to run GD within this valley . Moreover, when the LD process enters a valley , there is no mechanism required to detect whether this valley contains the global minimum. In fact, designing such detection mechanism can be quite complicated, since it is similar to ﬁnding the optimal solution in a multi-arm bandit problem. Instead, for GDx LD, we only need to implement the simple exchange mechanism. Requiring B0 to be star-convex rules out the scenario where there are multiple local minima taking values only ϵ-apart from the global optimal value where ϵ can be arbitrarily small. In this scenario, GDx LD will have a hard time to identify the true global local minima. Indeed, GD iterates may stuck at one of the local minima, and the exchange will take place only when LD iterate is o(ϵ) away from x , which will happen rarely due to the noisy movement of LD. On the other hand, ﬁnding solutions whose functional values are only ϵ away from the global optimal is often considered good enough in practice. When 2F is not available, or F is only convex near x , we can also use the following more general version of Assumption 3. Admittedly the complexity estimate under it will be be worse than the one under Assumption 3.

Assumption 4 x is the unique global minimum for F. There exists r0 > 0, such that the sub-level set B0 = {x : F(x) F(x ) + r0} is star-convex with x being the center, i.e., a line segment connecting x and any x B0 is in B0. F is convex in B0. In addition, there exist 0 < rl < ru < , such that { x x rl} B0 { x x ru}.

Dong and Tong

With all assumptions stated, the complexity of GDx LD and n GDx LD is given in the following theorem.

Theorem 2 Consider the iterates following Algorithm 1 ((n)GDx LD). Under Assumptions 1, 2, 3, and ﬁxed γ = O(1), 0 < h 1 2L, 0 t0 r0/8, for any ϵ > 0 and δ > 0, there exists N(ϵ, δ) = O( log ϵ) + O( log δ) such that for any n N(ϵ, δ),

P(F(Xn) F(x ) ϵ) 1 δ. (6)

Alternatively, under Assumptions 1, 2, 4, and ﬁxed γ = O(1), 0 < h min{ 1

2L, r2 u r0 }, (6) holds with N(ϵ, δ) = O(ϵ 1) + O(log(1/δ)).

In particular, if we hold δ ﬁxed, to achieve an ϵ accuracy, the complexity is O(log(1/ϵ)) when F is strongly convex in B0 and O(ϵ 1) when F is convex in B0. These complexity estimates are of the same order as GD in the convex setting. However, F does not need to be convex globally for our results to hold. The fast convergence rate comes partly from the fact that GDx LD does not require the hyper-parameters h and γ to change with the error tolerance ϵ. This is quite unique when compared with other single-copy stochastic optimization algorithms. This feature is of great advantage in both practical implementations and theoretical analysis. Finally, as pointed out in the introduction, our analysis and subsequent results focus on a ﬁxed dimension setting. The big O terms in the deﬁnition of N(ϵ, δ) hide constants that are independent of ϵ and δ. Such constants can scale exponentially with d and 1 γh. In particular, the O(log(1/δ)) term is associated with how long it takes the LD to visit B0, which is further determined by how fast LD converges to stationarity. With a ﬁxed value of γ, the convergence speed in general scales inversely with the exponential of d (see the proof of Theorem 2 in Appendix A for more details). Curse of dimensionality is a common issue for sampling-based optimization algorithms. In (Raginsky et al., 2017; Xu et al., 2018), this is mentioned as a problem of the spectral gap. A more detailed discussion can also be found in Ma et al. (2019).

2.4 Exploration speed of LD

Intuitively, the eﬃciency of (n)GDx LD depends largely on how often LD visits the neighborhood of x , B0. A standard way to address this question is to study the convergence rate of LD to stationarity. Let πγ exp( F(x)/γ),

which denotes the target invariant distribution, i.e., it is the invariant distribution of the Langevin diﬀusion deﬁned in (4). We also write Uγ as the normalizing constant of πγ. In the literature, the convergence of a stochastic processes is usually obtained using either a small set argument or a functional inequality argument. Theorem 2 uses the small set argument. In particular, in Lemma 9, we show that each typical LD iterate can reach B0 with a probability lower bounded by α > 0. While such α is usually not diﬃcult to obtain, it is often pessimistically small and does not reveal how the convergence depends on F. In contrast, the functional inequality approach usually provides more details about how the convergence depends on F. In particular, using functional inequalities like log Sobolev

Replica Exchange for Non-Convex Optimization

inequality (LSI) or Poincar e inequality, one can show that the convergence speed of LD to πγ depends on the pre-constant of the functional inequality. This pre-constant usually only depends on the target density πγ, i.e., F, and can be seen as an intrinsic measure of how diﬃcult it is to sample πγ. In addition, the convergence speed derived from this approach is in general better. For example, Proposition 2.8 in Hairer et al. (2014) indicates that the convergence speed estimate from the small set argument is in general smaller than the one derived from Poincar e inequality. Therefore, it is of theoretical importance to show how the computational complexity estimates of (n)GDx LD depend on the pre-constant of the functional inequality. In this paper, we use LSI as it controls the convergence speed of LD in Kullback-Leibler (KL) divergence. Note that in general, KL divergence between two distributions has a linear dependence on the dimension, whereas the χ2 divergence, which is associated with Poincar e inequality, often scales exponentially in the dimension. Let µt denote the distribution of Yt, i.e., the Langevin diﬀusion (4) at time t. Let ˆµn denote the distribution of Yn, i.e., the discrete LD update (3) at step n. For two measures, µ and π, with µ π, the KL divergence of µ with respect to π is deﬁned as

KL(µ π) = Z log µ(x)

We impose the following assumption on πγ.

Assumption 5 πγ satisﬁes a log Sobolev inequality with pre-constant β > 0: Z πγ(x)f(x)2 log f(x)2 2

It is well known that with LSI, the Langevin diﬀusion (4) satisﬁes

KL(µt|πγ) exp( 2γβt)KL(µ0|π).

In addition, the LD (3) satisﬁes a similar convergence rate (Vempala and Wibisono, 2019):

KL(ˆµn|π) exp( γβnh)KL(µ0|π) + 8hd L2

Adapting these results, we can establish the following convergence results for (n)GDx LD.

Theorem 3 Consider the iterates following Algorithm 1 ((n)GDx LD). Suppose Assumptions 1, 2, 3, and 5 hold. In addition, for GDx LD, t0 > 0. Fix

where RV = 8(Mc/4 + 4γLd)/λc. For any ϵ > 0 and δ > 0, set

K = log(δ) log(1 πγ(B0)/2) and n0 = 1

βγ 1 h log 1

There exists

N(β, ϵ, δ) = O(Kn0) + O(log(1/ϵ)) = O(log(1/δ)/β) + O(log(1/ϵ))

such that for n N(β, ϵ, δ) P(|Xn x | ϵ) > 1 δ.

Dong and Tong

The big O terms in the deﬁnition of N(β, ϵ, δ) in Theorem 3 hide constants that have polynomial dependence on d, 1/πγ(B0), and log Uγ (see the proof of Theorem 3 in Appendix A.4 for more details). The complexity bound in Theorem 3 can be viewed as a reﬁnement of the bound established in Theorem 2, where the hidden constant scales exponentially with d and 1 γh. This is due to the fact that the LSI constant β characterizes how diﬃcult it is to sample πγ. To interpret this reﬁned complexity bound, we note that n0 represents the time it takes the LD to sample the approximate stationary distribution. If we check whether the LD is in B0 every n0 steps, K is the number of trials we need to run to ensure a larger than 1 δ probability of visiting B0. Then, the time it takes the LD to visit B0 with high probability is upper bounded by Kn0. Note that K scales with 1/πγ(B0). In order to have a smaller K, we want to have a larger πγ(B0). This can be achieved by using a smaller γ, which gives rise to a πγ that is more concentrated on B0. On the other hand, to achieve a smaller n0, we want to have a larger γ so the LD can escape local minima and converge to stationarity faster. Thus, there is a non-trivial tradeoﬀto be made here to optimize the constant term hidden in O(log(1/δ)/β) in the quantiﬁcation of N(β, ϵ, δ). However, as will be demonstrated in our numerical experiments in Section 3.1.2, (n)GDx LD achieves good and robust performance for a reasonable range of γ (not too small or too big). This is because in (n)GDx LD, γ does not need to depend on ϵ. We also note that the analysis of n GDx LD is conceptually simpler than GDx LD, since the LD part is a bona-ﬁde unadjusted Langevin algorithm, so the result of (Vempala and Wibisono, 2019) can be implemented rather straightforwardly. In contrast, the LD part in GDx LD can be swapped with the GD part. Thus, the analysis of GDx LD is more challenging. In particular, we need to impose the assumption that t0 > 0 in Algorithm 1. This technical assumption limits the number of swaps between GD and LD.

2.5 Online optimization with stochastic gradient

In a lot of data science applications, we deﬁne a loss function for a given parameter x and data sample s as f(x, s), and the loss function we wish to minimize is the average of f(x, s) over a distribution S. Let F(x) = ES[f(x, S)].

Since the distribution of S can be complicated or unknown, the precise evaluation of F and the gradient F may be computationally too expensive or practically infeasible. However, we often have access to a large number of samples of S in applications. So given an iterate Xn, we can draw two independent batches of independent and identically distributed (iid) samples, sn,1, . . . , sn,Θ and s n,1, . . . , s n,Θ, and use

ˆFn(Xn) = 1

i=1 f(Xn, sn,i), ˆFn(Xn) = 1

i=1 xf(Xn, s n,i)

to approximate F and F. Here we require {sn,i, 1 i Θ} and {s n,i, 1 i Θ} to be two independent batches, so that the corresponding approximation errors are uncorrelated. When we replace F with ˆFn in GD and LD, the resulting algorithms are called SGD and SGLD. They are useful when the data samples are accessible only online: to run the algorithms, we only need to get access to and operate on a batch of data. This is very important when computation or storage capacities are smaller than the size of the data.

Replica Exchange for Non-Convex Optimization

To implement the replica exchange idea in the online setting, it is natural to replace GD and LD with their online versions. In addition, we need to pay special attention to the exchange criterion. Since we only have access to ˆFn, not F, ˆFn(X n+1) > ˆFn(Yn+1) may not guarantee that F(X n+1) > F(Yn+1). Incorrect exchanges may lead to bad performance, and thus we need to be cautious to avoid that. One way to avoid/reduce incorrect exchanges is to introduce a threshold t0 > 0 when comparing ˆFn s. In particular, if t0 is chosen to be larger than the typical size of approximation errors of ˆFn, then, ˆFn(X n+1) > ˆFn(Yn+1) + t0 indicates that ˆF(X n+1) is very likely to be larger than ˆF(Yn+1). Lastly, since the approximation error of ˆFn(x) in theory can be very large when x is large, we avoid exchanges if the iterates are very large, i.e., when min{ Xn , Yn } > ˆ Mv for some large ˆ Mv (0, ). Putting these ideas together, the SGDx SGLD algorithm is given in Algorithm 2:

Algorithm 2: SGDx SGLD and n SGDx SGLD: online optimization

Input: Temperature γ, step size h, number of steps N, initial X0, Y0, estimation error parameter Θ (when using batch means, Θ is the batch size, it controls the accuracy of ˆFn and ˆFn), threshold t0 (0, r0/8], and exchange boundary ˆ Mv. for n = 0 to N 1 do

X n+1 = Xn h ˆFn(Xn); Y n+1 = Yn h ˆFn(Yn) + 2γh Zn, where Zn N(0, Id); if ˆFn(Y n+1) < ˆFn(X n+1) t0, X n+1 ˆ MV , and Y n+1 ˆ MV then

(Xn+1, Yn+1) =

( (Y n+1, X n+1), if SGDx SGLD is applied (Y n+1, Y n+1), if n SGDx SGLD is applied ;

(Xn+1, Yn+1) = (X n+1, Y n+1). end end Output: XN as an optimizer for F.

2.6 Complexity bound for SGDx SGLD

To implement SGDx SGLD, we require three new hyper-parameters, Θ, t0 and ˆ Mv. We discuss how they can be chosen next. First of all, the batch-size Θ controls the accuracy of the stochastic approximations of F and F. In particular, we deﬁne

ζn(x) := ˆFn(x) F(x) and ξn(x) = ˆFn(x) F(x),

where ζn(x) s and ξn(x) s are independent random noise with E[ζn(x)] = E[ξn(x)] = 0. By controlling the number of samples we generate at each iteration, we can control the accuracy of the estimation, as the variances of the estimation errors are of order 1/Θ. We will see in Theorem 4 and the discussions following it that 1/Θ should be of the same order as the error tolerance ϵ. For the simplicity of exposition, we introduce a new parameter

Dong and Tong

θ = O(ϵ) to describe the scale of ζn and ξn. In addition, we assume that the errors have sub-Gaussian tails:

Assumption 6 There exists a constant θ > 0, such that for any 0 < b < 1

E exp a T ξn(x) + b ξn(x) 2 1 (1 2bθ)d/2 exp a 2θ 2(1 2bθ)

and for x , y ˆ MV , we have for any z > 0,

P(ζn(x) ζn(y) z) exp z2

Note that Assumption 6 implies

En[ ξn(x) 2] dθ.

We also remark that Assumption 6 holds if ξn(x) N(0, θId) and ζn(x) ζn(y) N(0, 1

2θ). In practice, Assumption 6 can be veriﬁed using Hoeﬀding inequality or other concentration inequalities if the stochastic gradients are bounded. Second, the threshold t0 is related to the shape of the valley around x . To keep the exposition concise, we set t0 r0/8 where r0 is deﬁned in Assumption 4. Heuristically, it can be chosen as a generic small constant such as 10 2. Lastly, ˆ MV is introduced to facilitate theoretical veriﬁcation of Assumption 6. In other words, if Assumption 6 holds for ˆ MV = , then we can set ˆ MV = . More generally, under Assumptions 1 and 2, we set

4 + (8γLd + 4Lθd), ˆRV = 8λ 1 c ˆCV and ˆ MV = sup{x : F(x) ˆRV }.

In practice, one can set ˆ MV as a generic large number. We are now ready to present the complexity of (n)SGDx SGLD:

Theorem 4 Consider the iterates following Algorithm 2 ((n)SGDx SGLD). Under Assumptions 1, 2, 3, 6, and ﬁxed γ = O(1), 0 < h min{ 1

2L, r2 u r0 }, and 0 < t0 r0/8, for any ϵ > 0 and δ > 0, there exists N(ϵ, δ) = O( log(δ) log(ϵ)), such that for any ﬁxed N > N(ϵ, δ), there exists θ(N, ϵ, δ) = O min{N 1, ϵδ} , and for θ θ(N, ϵ, δ), we have

P(F(XN) F(x ) ϵ) 1 δ.

In particular, if we hold δ ﬁxed, then to achieve an ϵ accuracy, we need to set the number of iterations N = O( log ϵ) and the batch size Θ = O(θ 1) = O(ϵ 1). In this case, the total complexity (including data sample) is O(NΘ) = O(ϵ 1 log 1

ϵ). To see where the O(ϵ 1) batch size comes from, we can look at a simple example where F(x) = 1

2x2. As this function is strongly convex, we can focus on the SGD part. The iterates in this case takes the form

Xn+1 = Xn h Xn + hξn(Xn).

Replica Exchange for Non-Convex Optimization

For simplicity, we assume ξn(Xn) s are iid N(0, θId). Then, Xn is a linear auto-regress sequence with the invariant measure N(0, hθ

2 h Id). Now, for E[F(XN)] = hθd 2(2 h) = O(ϵ) when h is a constant, θ = O(ϵ). Similar to the oﬄine case, we can also characterize how the computational cost depends on the LSI pre-constant. Let

ˆB0 = n x : F(x) F(x ) + r0

Theorem 5 Consider the iterates following Algorithm 2 ((n)SGDx SGLD). Suppose Assumptions 1, 2, 3, 5, and 6 hold. Fix.

For any ϵ > 0 and δ > 0, let

log(1 πγ( ˆB0)/2) and n0 = 4

βγ h 1 log(1/h) + 1.

There exists

N(β, ϵ, δ) = O(Kn0) + O(log(1/ϵ)) = O(log(1/δ)/β + log(1/ϵ)),

such that for N > N(β, ϵ, δ), there exists θ(N, ϵ, δ) = O(min{N 1, ϵδ}), and for θ < θ(N, ϵ, δ), P(F(XN) F(x ) < ϵ) > 1 δ.

Similar to the oﬄine versions, the big O term in the deﬁnition of N(β, ϵ) in Theorem 4 hides constants that scale exponentially with d and 1 γh, while the big O terms in the deﬁnition of N(β, ϵ, δ) in Theorem 5 hide constants that have polynomial dependence on d (see the proofs of Theorems 4 and 5 in Appendices B.3 and B.4 for more details). Lastly, it is worth noting that the step size h and temperature γ in (n)SGDx SGLD is independent of the error tolerance ϵ. This is one of the reason why it can beat existing sampling-based algorithms on the convergence speed.

3. Numerical experiments

In this section, we provide some numerical experiments to illustrate the performance of (n)GDx LD and (n)SGDx SGLD. Our main focus is to demonstrate that by doing exchange between the two algorithms, (S)GD and (SG)LD, the performance of the combined algorithm can be substantially better than running isolated copies of the individual algorithms. We also demonstrate the robustness of our algorithm to diﬀerent choices of the temperature γ and step size h.

3.1 Two-dimensional Problems

We start by looking at two-dimensional examples, which are easier to visualize.

Dong and Tong

-1 0 1 2 3 4 5

5 4 3 2 1 0 -1

(a) Heat map of F

(b) 3-d plot of F

Figure 2: The objective function F in (7)

3.1.1 Offline setting

First, we consider how to ﬁnd the mode of a two-dimensional Gaussian-mixture density. The loss function is given by

det(2πΣi) exp 1

2(x mi)T Σ 1 i (x mi) + L(x). (7)

For simplicity, we choose M = 5 5 = 25, each mi is a point in the meshgrid {0, 1, 2, 3, 4}2, and Σi =diag(0.1) so the valleys are distinctive. The weights are generated randomly. As the Gaussian-mixture density and its gradient are almost zero for x far away from mi s, we add a quadratic regularization term

(x(i) + 1)21x(i) 1 + (x(i) 5)21x(i) 5 , (8)

where x(i) is the i-th element of x. Figure 2 shows the heat map and the 3-d plot of one possible realization of F. We can see that it is highly non-convex with 25 local minima. In this particular realization of F, the global minimum is at (3, 2) and F(3, 2) = 0.168. We implement GDx LD for the objective function plotted in Figure 2 with h = 0.1, γ = 1, X0 = (0, 0), and Y0 = (1, 1). We plot F(Xn) and Xn at diﬀerent iterations n in Figure 3. We do not plot Yn, the sample path of the LD, since it is used for exploration, not optimization. We observe that the convergence happens really fast despite the underlying non-convexity. In particular, we ﬁnd the global minimum with less than 300 iterations. We run 100 independent copies of GDx LD and are able to ﬁnd the global minimum within 1000 iterations in all cases. In Figure 4, we plot F(Xn) and Xn at diﬀerent iterations for a typical n GDx LD implementation. We again observe that the convergence happens really fast, i.e., with less than 400 iterations.

Replica Exchange for Non-Convex Optimization

Figure 3: Convergence of the iterates under GDx LD

Figure 4: Convergence of the iterates under n GDx LD

Dong and Tong

For comparison, we also implement GD and LD with the same F. For GD, the iteration takes the form Xn+1 = Xn h F(Xn)

with h = 0.1. In this case, Xn gets stuck at diﬀerent local minima depending on where we start, i.e., the value of X0. For example, Figure 5 plots the trajectories of Xn under GD with X0 = (0, 0), which is the same as the X0 we used in GDx LD. As for LD, Figure 6 plots Xn following Xn+1 = Xn h F(Xn) + p

We set h = 0.1 and test two diﬀerent values of γ: γ = 1, which is the γ used in GDx LD, and γ = 0.01. When γ = 1 (Figure 6 (a)), we do not see convergence to the global minimum at all. The process is doing random exploration in the state-space. When γ = 0.01 (Figure 6 (b)), we do observe convergence of Xn to the neighborhood of the global minimum. However, compared with GDx LD, the convergence is much slower under LD, since the exploration is slowed down by the small γ. In particular, we ﬁnd the approximate global minimum with around 1.2 105 iterations.

Figure 5: Convergence of the iterates under GD

3.1.2 Sensitivity analysis of the hyper-parameters

One attractive feature of GDx LD is that we do not require the temperature γ and the step size h to change with the precision level ϵ. In the theoretical analysis, we ﬁx them as constants. From a practical point of view, we want γ to be large enough so that it is easy for LD to escape the local minima. On the other hand, we do not want γ to be too large as we want it to focus on exploring the relevant region so that there is a good chance of visiting the neighborhood of the global minimum. As for h, we want it to be small enough so that the GD converges once it is in the right neighborhood of the global minimum. On the other hand, we do not want h to be too small, as the convergence rate of GD, when it is in the right neighborhood, increases with h. In this section, we conduct some sensitivity analysis for diﬀerent values of γ and h. We use the same objective function as the one plotted in Figure 2.

Replica Exchange for Non-Convex Optimization

(b) γ = 0.01

Figure 6: Convergence of the iterates under LD

In Figure 7, we keep h = 0.1 ﬁxed and vary the value of γ from 0.1 to 100. The left plot shows E[F(Xn)] estimated based on 100 independent replications of GDx LD. The right plot shows P( Xn x 10 3), which is again estimated based on 100 independent replications. We observe that as long as γ is not too small or too large, i.e., 0.5 γ 10, GDx LD achieves very fast convergence. For 0.5 γ 10, the convergence speed is slightly diﬀerent for diﬀerent values of γ, with γ = 2.5, 5 among the fastest.

Figure 7: Convergence of the iterates under GDx LD with diﬀerent values of γ. h = 0.1. Averages are estimated based on 100 independent replications.

In Figure 8, we keep γ = 1 ﬁxed and vary the value of h from 0.1 to 1.5. The left plot shows E[F(Xn)] and the right plot shows P( Xn x 10 3). We observe that as long as h is not too big or too small, i.e., 0.05 γ 1, GDx LD achieves very fast convergence. Taking a closer look at the sample path of GDx LD when h = 1.5, we note that Xn is swapped to the region around the global minimum fairly quickly but it keeps oscillating around the global minimum due to the large step size. For 0.05 h 1, the convergence

Dong and Tong

speed is slightly diﬀerent for diﬀerent values of h with h = 1 being the fastest and h = 0.05 being the slowest.

Figure 8: Convergence of the iterates under GDx LD with diﬀerent values of h. γ = 1. Averages are estimated based on 100 independent replications.

Above all, our numerical experiments suggest that while diﬀerent hyper-parameters may lead to diﬀerent performances of GDx LD, the diﬀerences are fairly small as long as γ and h are within a reasonable range. This suggests the robustness of GDx LD to the hyperparameters. In Figure 9, we conduct the same analysis for n GDx LD. In particular, we plot E[F(Xn)] at diﬀerent iterations for diﬀerent values of the temperature γ and step size h in n GDx LD. The performances are very similar as those in GDx LD. In all our subsequent numerical experiments, we implement both GDx LD and n GDx LD. Since their performances are very similar, we only show the results for GDx LD in the ﬁgures.

3.1.3 Online setting

In this section, we consider an online version of the test problem from Section 3.1.1. In particular, we consider the setting of kernel density estimation (KDE)

ˆp N(x) = 1

i=1 κσ(x, si).

κσ is known as a kernel function with tuning parameter σ. It measures the similarity between x and the sample data si s. There are many choices of kernel functions, and here we use the Gaussian kernel

κσ(x, si) = 1

(2πσ) d 2 exp( 1

2σ(x si)2).

Then, ˆp N(x) can be seen as a sample average version of

p(x) = Esκσ(x, s).

Replica Exchange for Non-Convex Optimization

(a) h = 0.1

Figure 9: Convergence of the iterates under n GDx LD with diﬀerent values of γ and h. Averages are estimated based on 100 independent replications.

Notably, p is the density function of X = S + σZ, where S follows the distribution of the sample data and Z N(0, 1). In the following example, we assume S follows a mixture of Gaussian distribution with density

det(2πΣi) exp 1

2(x mi)T Σ 1 i (x mi) . (9)

As in Section 3.1.1, we set M = 5 5 = 25, each mi is a point in the meshgrid {0, 1, 2, 3, 4}2, Σi =diag(0.1), and the weights are randomly generated. Our goal is to ﬁnd the mode of p. In this case, we write

F(x) = p(x) + L(x) = Esκσ(x, s) + L(x),

where L is the quadratic regularization function deﬁned in (8). Then, we can run SGDx SGLD with the mini-batch average approximations of F and F:

ˆFn(Xn) = 1

i=1 κσ(Xn, sn,i) + L(x), and ˆFn(Xn) = 1

i=1 xκσ(Xn, s n,i) + L(x),

where the data-speciﬁc gradient takes the form

xκσ(x, si) = 1

(2πσ) d 2 σ exp( 1

2σ|x si|2)(x si).

In Figure 10, we plot the heat map and 3-plot of one possible realization of ˆFn with σ = 0.12

and n = 104. Note that in this particular realization, the global minimum is achieved at (3, 2). In Figure 11, we plot Xn for diﬀerent values of n under SGDx SGLD with the objective function plotted in Figure 10. We set h = 0.1, γ = 1, Θ = 103, t0 = 0.05, ˆ Mv = 5,

Dong and Tong

-1 0 1 2 3 4 5

5 4 3 2 1 0 -1

(a) Heat map of ˆFn

(b) 3-d plot of ˆFn

Figure 10: The estimated objective function

Figure 11: Convergence of the iterates under SGDx SGLD

Replica Exchange for Non-Convex Optimization

Figure 12: Convergence of the iterates under SGD and SGLD

X0 = (0, 0), and Y0 = (1, 1). We observe that SGDx SGLD converges to the approximate global minimum very fast, within 103 iterations. For comparison, in Figure 12, we plot the sample path of Xn under SGD and SGLD with the objective function plotted in Figure 10. For SGD, the iteration takes the form

Xn+1 = Xn h ˆF(Xn).

For SGLD, the iteration takes the form

Xn+1 = Xn h ˆF(Xn) + p

We set h = 0.1, γ = 0.01, and Θ = 103. Note that γ = 0.01 is tuned to ensure convergence. We observe that SGD still gets stuck in local minima. For example, in Figure 12 (a), when X0 = (0, 0), Xn gets stuck at (0, 1). SGLD is able to attain the global minimum, but at a much slower rate than SGDx SGLD. In particular, SGLD takes more than 2 104 iterations to converge to the approximate global minimum in Figure 12 (b).

3.2 Non-Convex Optimization Problems with d 2

In this section, we demonstrate the performance of (n)GDx LD on some classical non-convex optimization test problems. In particular, we consider the Rastrigin function

F(x) = 10n +

x2 i 10 cos(2πxi)

restricted to x [ 5, 5]d. We also consider the Griewank function

Dong and Tong

(a) Rastrigin Function

(b) Griewank Function

Figure 13: 3-d plot of F when d = 2

restricted to x [ 5, 5]d. For both functions, the global minimum is at the origin, i.e., x = (0, . . . , 0), with F(x ) = 0. We select these two functions because they are classic simple test problems for non-convex, multimodal optimization algorithms. Existing tools for optimizing these functions often involve metaheurstics, which lack rigorous complexity analysis (G amperle et al., 2002; Esquivel and Coello, 2003; Cheng and Jin, 2014). Figure 13 provides an illustration of the two test functions when d = 2. We note that the Rastrigin function is especially challenging to optimize as it has many local minima, while the Griewank function restricted to [ 5, 5]d has a relatively smooth landscape with only a few local minima. For Rastrigin functions of diﬀerent dimensions, we plot F(Xn) under at diﬀerent iterations under GDx LD in Figure 15. We observe that as d increases, it takes longer to ﬁnd the global minimum. For example, when d = 2, it takes around 5 102 iterations to ﬁnd the global minim, while when d = 5, it takes around 7 104 iterations to ﬁnd the the global minimum. When d 10, we are not able to ﬁnd the global minimum within 105 iterations, but the function value reduces substantially. n GDx LD achieves very similar performances as GDx LD. To avoid repetition, we do not include the corresponding plots here. For Griewank functions of diﬀerent dimensions, we plot F(Xn) at diﬀerent iterations under GDx LD in Figure 15. We observe that for d as large as 50, the algorithm is able to ﬁnd the global minimum within 104 iterations.

4. Conclusion

GD is known to converge quickly for convex objective functions, but it is not designed for exploration and it can be trapped at local minima. LD is better at exploring the state space. But in order for the stationary distribution of the LD to concentrate around the global minimum, it needs to run with a weak stochastic force, which in general slows down its convergence. This paper considers a novel exchange mechanism to exploit the expertise of both GD and LD. The proposed algorithm, (n)GDx LD, can converge to the

Replica Exchange for Non-Convex Optimization

Figure 14: GDx LD for Rastrigin functions on [ 5, 5]d. For (a), (b), and (c), h = 0.005 and γ = 5. For (d), (e), and (f), h = 0.001 and γ = 5. The initial values are generated uniformly on [ 5, 5]d

Figure 15: GDx LD for Griewank functions on [ 5, 5]d. h = 0.1 and γ = 5. For (d), (e), and (f), h = 0.001 and γ = 5. The initial values are generated uniformly on [ 5, 5]d

Dong and Tong

global minimum linearly with high probability for non-convex objective functions, under the assumption that the objective function is strongly convex in a neighborhood of the unique global minimum. Our algorithms can be generalized to online settings. To do so, we replace the exact gradient and function evaluation with their corresponding batch-average versions, and introduce an appropriate threshold for exchange. Lastly, we demonstrate the strength of our algorithms through numerical experiments.

Acknowledgments

We thank anonymous reviewers for their valuable comments and suggestions. The research of Jing Dong is supported by National Science Foundation Grant DMS-1720433. The research of Xin T. Tong is supported by the Singapore Ministry of Education Academic Research Funds Tier 1 grant R-146-000-292-114.

Appendix A. Detailed complexity analysis of (n)GDx LD

In this section, we provide the detailed proof of Theorem 2. The proof uses a constructive stochastic control argument, under which we can drive the iterates into the desired neighborhood. We start by providing an overview of the construction, which can be of interests to the analysis of other sampling-based numerical methods. We ﬁrst note once Xn B0, by convexity, Xn will converge to x with properly chosen step size h (see details in Lemma 11 and 12). It thus remains to show that Xn will be in B0 with high probability for n large enough. This task involves two key steps.

Step 1. We construct a proper exponential-type Lyapunov function V with corresponding parameters C > 0 and 0 < ρ < 1 (Lemma 7). In particular, if Yn x > C and Xn x > C,

E[V (Y n+1)|Yn] ρV (Yn) and E[V (X n+1)|Xn] ρV (Xn)

Utilizing this Lyapunov function, we can show for Yn, the k-th, k 1, visit time to the set {x : x x C} has a ﬁnite moment generating function in a neighborhood of the origin (Lemma 8). This implies that Yn visits the set {x : x x C} quite often (i.e., the inter-visit time has a sub-exponential distribution).

Step 2. We then show that during each visit to {x : x x C}, there is positive probability that Yn will also visit B0 (Lemma 9). This essentially creates a sequence of geometric trials whenever Yn {x : x x C}. Note that once Yn B0, Xk B0 for any k n due to the exchange mechanism.

Remark 6 The positive probability of visiting B0 in Step 2 can decay exponentially with the dimension d. Therefore, the complexity estimates in Theorem 2 and likewise Theorem 4 can grow exponentially in dimension as well. This is not due to the techniques we are using, as the estimates in Raginsky et al. (2017) and Xu et al. (2018) depend on a quantity called spectral gap , which can scale exponentially with the dimension as well.

Replica Exchange for Non-Convex Optimization

To facilitate subsequent discussion, we introduce a few more notations. We will use the ﬁltration Fn = σ{Xk, Yk, Zk 1, k n},

to denote the information up to iteration n. We use Pn to denote the conditional probability, conditioned on Fn, and En to denote the corresponding conditional expectation. Note that these notations generalize to stopping times. To keep our derivation concise, we assume x = 0 and F(x ) = 0. This does not sacriﬁce any generality, since we can always shift the function and consider

Fc(x) = F(x x ) F(x ).

It is easy to check if F satisfy the assumptions introduced in Section 2, Fc also satisfy the assumptions with slightly diﬀerent constants that depend on x .

A.1 Recurrence of the small set

In this section, we provide details about Step 1 and 2 in the proof outline. We start from checking Lemma 1. Proof [Proof of Lemma 1] For Claim 1), note that

Fλ(x), x = 2λ x 2 F(x), x .

By Assumption 1,

| F(x), x | F(x) x by Cauchy-Schwarz inequality

( F(x) F(0) + F(0) ) x

(L x + F(0) ) x by Assumption 1

= L x 2 + F(0) x .

Then, applying Young s inequality, we have

Fλ(x), x 2λ x 2 + L x 2 + F(0) x

2λ x 2 + 2L x 2 + F(0) 2

which is of form (5). For Claim 2), note that

λ0 x 2 + M0 + F(x) x by (5)

λ0 x 2 + M0 + L x x x by Assumption 1

λ0 x x 2 + λ0 x 2 + (2λ0 + L) x x x + M0

2λ0 x x 2 + λ0 x 2 + 2λ0 + L

2λ0 x 2 + M0 by Young s inequality.

Dong and Tong

Setting M1 = λ0 x 2 + 2λ0+L

2λ0 x 2 + M0, then,

F(x), x x 1

2λ0 x x 2 + M1.

Using Young s inequality again, we have,

1 λ0 F(x) 2 + 1

4λ0 x x 2 F(x), x x 1

2λ0 x x 2 M1.

4λ2 0 x x 2 λ0M1.

Our ﬁrst result provides a proper construction of the Lyapunov function V . It also establishes that F(Xn) is monotonically decreasing.

Lemma 7 For (n)GDx LD, under Assumption 1, if Lh 1/2,

1. The value of F(Xn) keeps decreasing:

F(Xn+1) F(X n+1) F(Xn) 1

2 F(Xn) 2h F(Xn).

2. Assume also Assumption 2, for η (8γ) 1, V (x) := exp(ηF(x)) satisﬁes the following:

En[F(Y n+1)] F(Yn) 1

2 F(Yn) 2h + 2γLhd

2λch)F(Yn) + 2CV h.

En[V (Y n+1)] exp 1

4ηhλc F(Yn) + ηh CV

En[V (Xn+1)] En[V (X n+1)] exp 1

4ηhλc F(Xn) + ηh CV

where CV = Mc/4 + 4γLd.

Proof For claim 1), note that by Rolle s theorem, there exits xn on the line segment between Xn and X n+1, such that

F(X n+1) =F(Xn) F(xn)T F(Xn)h

=F(Xn) F(Xn)T F(Xn)h ( F(xn) F(Xn))T F(Xn)h

F(Xn) F(Xn) 2h + h F(xn) F(Xn) F(Xn)

by Cauchy-Schwarz inequality

F(Xn) F(Xn) 2h + Lh2 F(Xn) 2 by Assumption 1

2 F(Xn) 2h as Lh < 1

Replica Exchange for Non-Convex Optimization

Claim 1) then follows, as F(X n+1) F(Xn+1). Next, we turn to claim 2). We start by establishing a bound for F(Y n+1). Let Yn = Y n+1 Yn = F(Yn)h + 2γh Zn. Note that again by Rolle s theorem, there exits yn on the line segment between Yn and Y n+1, such that

F(Y n+1) =F(Yn) + F(yn)T Yn

=F(Yn) + F(Yn)T Yn ( F(yn) + F(Yn))T Yn

F(Yn) + F(Yn)T Yn + L Yn 2

by Cauchy-Schwarz inequality and Assumption 1

F(Yn) F(Yn)T F(Yn)h + p

2γh F(Yn)T Zn

+ L F(Yn)T F(Yn)h2 2 p

2γhh L F(Yn)T Zn + 2Lγh ZT n Zn

2 F(Yn) 2h + β p

2γh F(Yn)T Zn + 2γLh Zn 2 as Lh < 1

with β = 1 2Lh (0, 1). Taking conditional expectation and using Assumption 2, we have our ﬁrst estimate. Recall that V (y) = exp(ηF(y)). Then,

En[V (Y n+1)] V (Yn) exp ηh

2 F(Yn) 2 En h exp βη p

2γh F(Yn)T Zn + 2ηγLh Zn 2 i

=V (Yn) exp ηh

2 F(Yn) 2 (1 4ηγLh) d/2 exp β2η2γh 1 4ηγLh F(Yn) 2

as Zn N(0, Id) and 4ηγLh < 1/4

V (Yn) exp ηh

2 F(Yn) 2 exp ηh

4 F(Yn) 2 + 8ηγLhd

as ηγ < 1/8 and β < 1

V (Yn) exp ηh

4 F(Yn) 2 + 4ηγLhd

V (Yn) exp ηh

4 λc F(Yn) + ηh

4 Mc + 4ηγLhd by Assumption 2.

Similarly, from the derivation of claim 1), we have

F(X n+1) F(Xn) 1

2 F(Xn) 2h.

En[V (X n+1)] V (Xn) exp ηh

V (Xn) exp ηh

4 λc F(Xn) + ηh

Dong and Tong

In the following, we set RV = 8λ 1 c CV and deﬁne a sequence of stopping times:

τ0 = min{n 0 : F(Xn) RV },

and for k = 1, 2, . . . , τk = min{n > τk 1, F(Yn) RV }.

Utilizing the Lyapunov function V , our second result establishes bounds for the moment generating function of τk s, k 0.

Lemma 8 For (n)GDx LD, under Assumptions 1 and 2, if Lh 1/2 and η (8γ) 1, then for any K 0, the stopping time τK satisﬁes

E[exp(hηCV τK)] exp(2KhηCV + KηRV )(V (X0) + V (Y0)).

Proof Note that for n < τ0, F(Yn) F(Xn) > RV = 8λ 1 c CV . Then, by Lemma 7,

En[V (Xn+1) + V (Yn+1)] exp( hηCV )(V (Xn) + V (Yn)).

This implies (V (Xτ0 n) + V (Yτ0 n)) exp(hηCV (τ0 n))

is a supermartingale. As V (x) 1, we have, by sending n ,

E[exp(hηCV τ0)] V (X0) + V (Y0).

By Lemma 7, F(Xn+1) F(X n+1) < RV for n τ0. Therefore, for k 0, if τk+1 > τk + 1, F(Y n) > RV and there is no jump for Xn at step n for τk < n < τk+1. Given Fτk (starting from τk), for n τk+1 1, we have F(Yn) > RV , and by Lemma 7,

En[V (Yn+1)] exp( hηCV )V (Yn).

This implies V (Yτk+1 n) exp(hηCV (τk+1 n)) is a supermartingale. Then, because V (x) 1, by sending n , we have

Eτk+1[exp(hηCV (τk+1 τk 1))] V (Yτk+1).

Eτk[exp(hηCV (τk+1 τk 1))] Eτk[V (Yτk+1)]

4hηF(Yτk) + hηCV

V (Yτk) by Lemma 7

exp(hηCV + ηRV ).

Now because E[exp(hηCV τ0)] V (X0) + V (Y0),

Replica Exchange for Non-Convex Optimization

E[exp(hηCV (τK K))] = E

k=0 exp(hηCV (τk+1 τk 1))

k=0 exp(hηCV + ηRV )

(V (X0) + V (Y0)).

Then, E[exp(hηCV τK)] = exp(2hηKCV + KηRV )(V (X0) + V (Y0)).

Let D = max{ x h F(x) : F(x) RV }.

The last result of this subsection shows that if F(Yn) RV , there is a positive probability that Y n+1 r for any r > 0. In particular, this includes r = rl.

Lemma 9 For (n)GDx LD, if F(Yn) RV , for any r > 0, there exist an α(r, D) > 0, such that Pn( Y n+1 r) > α(r, D).

In particular, a lower bound for α(r, D) is given by

α(r, D) Sdrd

(4γhπ) d 2 exp 1

2γh(D2 + r2) > 0,

where Sd is the volume of a d-dimensional unit-ball.

Pn( Y n+1 r) = Pn( Yn h F(Yn) + p

Zn Qn r 2γh

where Qn := (Yn h F(Yn))/ 2γh. Note that as F(Yn) RV , Qn D/ 2γh. Thus,

Zn Qn r 2γh

(2π) d 2 exp 1

2 z + Qn 2 dz

(2π) d 2 exp 1

2γh(D2 + r2) dz

(4γhπ) d 2 exp 1

2γh(D2 + r2) .

Dong and Tong

A.2 Convergence to global minimum

In this subsection, we analyze the speed of convergence for {Xn+k : k 0} to x = 0 when Xn B0. Most of these results are classical. In particular, if we assume B0 = Rd, then these rate-of-convergence results can be found in Nesterov (2005). For self-completeness, we provide the detailed arguments here adapted to our settings. First of all, we show Assumption 3 leads to Assumption 4.

Lemma 10 Under Assumption 3, Assumption 4 holds with

ru = r, r0 = 1

2a, rl = r a

Moreover for any x, y B0,

F(y) F(x) F(x), y x + m

2 y x 2. (10)

Proof Without loss of generality, we assume x = 0 and F(0) = 0. Let a = minx: x r F(x). Then a > 0 by our assumption. We choose r0 = 1

2a, ru = r, then B0 = {x : F(x) F(x ) + r0} { x r}. Next by Taylor expansion, we know F is convex within B0 and {|x| r}. This also leads to B0 being convex since it is a sublevel set. Moreover for any x so that x r, for some x on the line between x and 0,

F(x) = F(0) + x T F(0) + 1

2x T 2F(x )x = 1

2x T 2F(x )x.

2M x 2. So if we let rl = p a

M , { x x rl} B0. Finally, using Taylor s expansion leads to (10).

Lemma 11 For (n)GDx LD, under Assumptions 1 and 4, and assuming h min{1/(2L), r2 u/r0}, if Xn B0,

F(Xn+k) 1 1/r0 + kh/r2u for all k 0.

Proof From Lemma 7, we have, if F(Xn) r0, i.e, Xn B0, F(Xn+1) r0. We ﬁrst note for any k 0,

F(Xn+k) F(Xn+k), Xn+k ru F(Xn+k) , (11)

where the ﬁrst inequality follows by convexity (Assumption 4) and the second inequality follows by H older s inequality. Next, by Lemma 7,

F(Xn+k) F(Xn+k 1) 1

2 F(Xn+k 1) 2h F(Xn+k 1) h

2r2u F(Xn+k 1)2,

Replica Exchange for Non-Convex Optimization

where the last inequality follows from (11). This implies

1 F(Xn+k) 1 F(Xn+k 1) h 2r2u F(Xn+k 1)2 = 1 F(Xn+k 1) + 1 2r2u/h F(Xn+k 1).

Because F(Xn+k) r0 < r2 u/h by assumption,

1 F(Xn+k) 1 F(Xn+k 1) + h

Then, by induction, we have

1 F(Xn+k) 1 F(Xn) + kh

Lemma 12 For (n)GDx LD, under Assumptions 1 and 3, if F(Xn) r0, Lh 1/2,

F(Xn+k) (1 mh)k F(Xn)

for all k 0.

Proof We ﬁrst note if F(x) is strongly convex in B0, for x B0,

F(x ) F(x) F(x), x x + m

2 ||x x ||2.

By rearranging the inequality, we have

F(x) F(x ) F(x), x x m

2 ||x x ||2 1 2m F(x) 2, (12)

where the last inequality follows from Young s inequality. Next, from Lemma 7, we have

F(Xn+1) F(Xn) 1

2 F(Xn) 2h (1 mh)F(Xn),

where the second inequality follows from (12). Note that by Lemma 7, F(Xn+k) r0 for k 0. Thus, by induction, we have

F(Xn+k) (1 mh)k F(Xn).

Remark 13 The proof of Lemma 12 deals with F(Xn) and F(Xn) directly. It is thus easily generalizable to the online setting as the noise is additive (see Lemma 23). In contrast, the proof for Lemma 11 requires investigating (F(Xn)) 1. Its generalization to the online setting can be much more complicated, as the stochastic noise can make the inverse singular.

Dong and Tong

A.3 Proof of Theorem 2

We are now ready to prove the main theorem. Note from Lemma 11 that if Xn B0, for any

F(Xn+k) ϵ. We set k(ϵ) = (1/ϵ 1/r0)r2 u/h = O(ϵ 1).

Next, we study how long it takes for Xn to reach the set B0. From Lemma 9, every time Yn {x : F(x) RV },

Pn( Xn+1 rl) Pn( Y n+1 rl) α(rl, D) > 0.

P(F(XτK+1) > r0) = E

k=1 Pτk(F(Xτk+1) > r0)

(1 α(rl, D))K.

K > log(δ/2) log(1 α(rl, D)),

P(F(XτK+1) > r0) < δ/2. We set

K(δ) = log(δ/2)/ log(1 α(rl, D)) = O( log δ).

Lastly, we establish a bound for τK. From Lemma 8, we have, by Markov inequality,

P(τK > T) E[exp(hηCV τK)]

exp(hηCV T) exp(2KhηCV + KηRV )(V (X0) + V (Y0))

exp(hηCV T) .

T > log(δ/2)

hηCV + 2K + KRV

h CV + log(V (X0) + V (Y0))

hηCV P(τK > T) < δ/2. We set

T(δ) = log(δ/2)

hηCV + 2K(δ) + K(δ)RV

h CV + log(V (X0) + V (Y0))

hηCV = O( log δ).

Above all, we have, for any N T(δ) + k(ϵ),

P(F(XN) > ϵ) = P(F(XN) > ϵ, τK > T(δ)) + P(F(XN) > ϵ, τK < T(δ))

P(τK > T(δ)) + P(F(XτK+1) > r0) δ.

When Assumption 3 holds, F is strongly convex in B0, from Lemma 12, if Xn B0, for any

k log(ϵ) log(r0)

log(1 mh) ,

F(Xn+k) ϵ. In this case, we can set

k(ϵ) = (log(ϵ) log(r0)) / log(1 mh) = O( log ϵ). (13)

Note that T scales with K and 1/α(rl, D). Lemma 9 shows that 1/α(rl, D) depends exponentially on d and 1 γh.

Replica Exchange for Non-Convex Optimization

A.4 Proof of Theorem 3

The proof of Theorem 3 follows similar lines of arguments as the proof of Theorem 2. In particular, it relies on a geometric trial argument. The key diﬀerence is that the success probability of the geometric trial is bounded using the pre-constant of LSI rather than the small set argument in Lemma 9. Let Yn+1 = Yn F( Yn)h + p

γ F( Yn)γh + p

This is to be diﬀerentiated with Yn in Algorithm 1, which can swap position with Xn. Note that Yn is also known as the unadjusted Langevin algorithm (ULA). We denote µn as the distribution of Yn. Recall that πγ(x) = 1 Uγ exp( 1

γ F(x)). We ﬁrst prove an auxiliary bound for V (x) = exp(ηF(x)) that will be useful in our subsequent analysis.

Lemma 14 Under Assumptions 1 and 2, for η (8γ) 1, h 1/(2L),

En[V ( Yn+m)] V ( Yn) + 4 exp(ηRV ),

where RV = 8CV /λc.

Proof From the second claim in Lemma 7, we have

En[V ( Yn+1)] exp 1

4ηhλc F( Yn) + ηh CV

exp(ηF( Yn)).

When F( Yn) < RV ,

En[V ( Yn+1)] exp (ηh CV ) exp(ηF( Yn))

exp(ηF( Yn)) + exp (ηh CV ) 1 + 1

exp(ηF( Yn)) + 2ηh CV exp(ηRV )

When F( Yn) RV , we have

En[V ( Yn+1)] exp ( ηh CV ) exp(ηF( Yn)) 1 1

exp(ηF( Yn))

In both cases,

En[V ( Yn+1)] 1 1

V ( Yn) + 2ηh CV exp(ηRV ),

which implies that En[V ( Yn+m)] V ( Yn) + 4 exp(ηRV ).

Dong and Tong

Lemma 15 Under Assumptions 1 and 5,

KL( µn πγ) exp( β(n 1)hγ) 1

γ F( Y0) + Cd

where Md := 2Lhd d

2(log(4πh) + 1) + log(Uγ).

Proof Given Y0,

Y1 N(m( Y0), 2γh Id), where m( Y0) = Y0 F( Y0)h.

Then, given Y0,

KL( µ1 πγ) = Z log µ1(y)

2 log(4πγh) y m( Y0) 2

4γh + log Uγ + F(y)

2 log(4πγh) d

2 + log Uγ + 1

γ E[F( Y1)| Y0]

2 log(4πγh) d

2 + log Uγ + 1

γ (F( Y0) + 2γLhd) by Lemma 7 claim 2.

2(log(4πγh) + 1) + log Uγ + 2Lhd,

we have KL( µ1 πγ) 1

γ F( Y0) + Md

Next, from Theorem 2 in (Vempala and Wibisono, 2019) with L = L/γ and ϵ = γh, we have

KL( µn πγ) exp( β(n 1)hγ)KL( µ1 πγ) + 8hd L2

exp( β(n 1)hγ) 1

γ F( Y0) + Md

Based on Lemma 15, let

βγ h 1 log(1/h) + 1.

Note that for n n0 and h small enough,

KL( µn πγ) h2 1

γ F( Y0) + Md

h F( Y0) + 9hd L2

Replica Exchange for Non-Convex Optimization

We next draw connection between the bound (14) and the hitting time of Yn to B0. For n GDx LD, let φ = min n {Yn B0}.

With a slight abuse of notation, for GDx LD, let

φ = min n {Xn = Y n or Yn B0}.

Lemma 16 For n GDx LD or GDx LD, under Assumptions 1, 2, and 5,

P(φ n0) πγ(B0)

Proof For n GDx LD, Yn = Yn. For GDx LD, before replica exchange, we also have Yn = Yn. Therefore, P(φ > n) P(Zn / B0), which further implies that

P(φ n) P(Zn B0).

By Pinsker s inequality, the KL divergence provides an upper bound for the total variation distance. Let dtv(µ, π) denote the total variation distance between µ and π. Then, for n n0,

dtv( µn, πγ)

1 2KL( µn πγ)

1 2h F( Y0) + 1

h F( Y0) + 3

where the second inequality follows from (14). Thus,

P( Yn0 B0) πγ(B0) dtv( µn, πγ) πγ(B0)

Lemma 17 For (n)GDx LD, ﬁx any K and T (n0 + 1)K,

P0(φ > T) exp(2(n0 + 1)KhηCV + (n0 + 1)KηRV hηCV T)(V (X0) + V (Y0))

Proof Recall the sequence of stopping times τj = inf{n > τj 1 : F(Yn) RV }. From Lemma 8, we have

E[exp(hηCV τk)] exp(2khηCV + kηRV )(V (X0) + V (Y0)).

We next deﬁne a new sequence of stopping times:

ψ0 = inf {n : F(Yn) RV } , ψ 0 = ψ0 + n0,

Dong and Tong

and for k = 1, . . . ,

ψk = inf n ψ k 1 + 1, F(Yn) RV , ψ k = ψk + n0 + 1.

Note that ψi always coincide with one of τj s, and as τj τj 1 1,

ψk τ(n0+1)k.

P0(φ T) P0 τ(n0+1)K T + P0 Yψ k / B0, k K

exp(2(n0 + 1)KhηCV + (n0 + 1)KηRV hηCV T)(V (X0) + V (Y0))

!K by Lemma 16.

We are now ready to prove Theorem 3. Set

hd L βγ < 1 πγ(B0)

For n GDx LD, based on Lemma 17, let

K(δ) = log(δ/2) log(1 πγ(B0)/2).

For K K(δ),

In addition, let

T(β, δ) = log(δ/2)

hηCV + 2(n0 + 1)K(δ) + (n0 + 1)K(δ)RV

h CV + log(V (X0) + V (Y0))

Note that T(β, δ) = O(K(δ)n0) = O(log(1/δ)/β). For T T(δ), P(τK > T) δ/2. Above all, P(φ > T(δ)) δ. Lastly, note that once Yn is in B0, Xn will be moved there if it is not in B0 already. After Xn is in B0, it takes at most k(ϵ) = (log(ϵ) log(r0)) / log(1 mh) iterates to achieve the desired accuracy. Therefore, we can set

N(β, ϵ, δ) = T(β, δ) + k(ϵ).

Replica Exchange for Non-Convex Optimization

For GDx LD, let J = Rv/t0 . We also deﬁne

φ0 = min{n 0 : F(Xn) RV }

and for k = 1, 2, . . . ,

φk = min{n 0 : Xφk 1+n = Y φk 1+n or Yφk 1+n B0}.

We next make a few observations. First, for n φ0, F(Xn) RV . Second, F(Yφk) RV for k 1. Lastly, the value of F(Xn) will decrease by t0 every time swapping takes place. Thus, after φ0, there can be at most J swapping events. The last observation implies that before time φJ, Yn must visit B0 at least once. Let

Ξ = min{n 0 : Yn B0 or F(Xn) ϵ}.

P0(Ξ > (J + 1)T)

k=0 Pφk 1(φk > T)

where φ 1 0

(J + 1) exp(2(n0 + 1)KhηCV + (n0 + 1)KηRV hηCV T)(V (X0) + V (Y0) + 2RV )

by Lemma 17 and the fact that V (Xφk) RV , V (Yφk) RV .

Let K(δ) = log(δ) log(4(J + 1))

log(1 πγ(B0)/2)

T(β, δ) = log(δ) log(4(J + 1))

hηCV + 2(n0 + 1)K(δ) + (n0 + 1)K(δ)RV

+ log(V (X0) + V (Y0) + 2RV )

For T > T(β, δ), P(Ξ > (J + 1)T) δ/2. In this case, we can set

N(β, ϵ, δ) = (J + 1)T(β, δ) + k(ϵ).

Appendix B. Detailed complexity analysis of (n)SGDx SGLD

In this section, we provide the proof of Theorem 4. The proof follows a similar construction as the proof of Theorem 2. However, the stochasticity of ˆF(x) and ˆF(x) substantially complicates the analysis.

Dong and Tong

To facilitate subsequent discussions, we start by introducing some additional notations. We denote ζX n = ˆFn(Xn) F(Xn) and ζY n = ˆFn(Yn) F(Yn).

Similarly, we denote

ξX n = ˆFn(Xn) F(Xn) and ξY n = ˆFn(Yn) F(Yn).

We also deﬁne Fn = σ{Xk, Yk, Zk 1, k n},

and Gn = Fn σ{X n+1, Y n+1}.

We use Pn to denote the conditional probability, conditioned on Gn, and En to denote the corresponding conditional expectation. Following the proof of Theorem 2, the proof is divided into two steps. We ﬁrst establish the positive recurrence of some small sets centered around the global minimum. We then establish convergence to the global minimum conditional on being in the properly deﬁned small set. Without loss of generality, we again assume x = 0 and F(x ) = 0.

B.1 Recurrence of the small set

Our ﬁrst result establishes some bounds for the decay rate of F(Xn).

Lemma 18 For (n)SGDx SGLD, under Assumptions 1 and 6, if Lh 1/2,

1. The value of F(Xn) keeps decreasing on average:

En[F(X n+1)] F(Xn) 1

2 F(Xn) 2h + d Lh2θ.

If θ t2 0 log(2Lh2t0),

En[F(Xn+1)] F(Xn) 1

2 F(Xn) 2h + 2d Lh2θ.

2. Assume also Assumption 2, for ˆη min{(16γ) 1, (8hθ) 1}, ˆV (x) := exp(ˆηF(x)) satisﬁes the following:

En[F(Yn+1)] (1 1

2λch)F(Yn) 1

2 F(Yn) 2h + 2CV h.

En[ ˆV (Y n+1)] exp 1

4 ˆηhλc F(Yn) + ˆηh ˆCV

En[ ˆV (X n+1)] exp 1

4 ˆηhλc F(Xn) + ˆηh ˆCV

where ˆCV = Mc/4 + (8γLd + 4Lhθd).

Replica Exchange for Non-Convex Optimization

3. If ˆη < min (8hθ) 1, 2t0/θ , for θ t2 0 log(exp(ˆηhθd) 1),

En[ ˆV (Xn+1)] ˆV (Xn) exp 1

Proof For Claim 1), by Rolle s theorem, there exits xn on the line segment between Xn and X n+1, such that

En[F(X n+1)] =F(Xn) En F(xn)T ( F(Xn) + ξX n ) h

=F(Xn) En F(Xn)T ( F(Xn) + ξX n ) h

+ En ( F(xn) F(Xn))T ( F(Xn) + ξX n ) h

F(Xn) F(Xn) 2h

+ h En F(xn) F(Xn) 2 1/2 En F(Xn) + ξX n 2 1/2

by H older inequality

F(Xn) F(Xn) 2h + Lh2En F(Xn) + ξX n 2 by Assumption 1

2 F(Xn) 2h + d Lh2θ

2 and En[ ξX n 2] dθ by Assumption 6.

For Xn+1, we ﬁrst note that when X n+1 > ˆ MV or Y n+1 > ˆ MV , F(Xn+1) = F(X n+1). When X n+1 ˆ MV and Y n+1 ˆ MV , we note that if F(Y n+1) F(X n+1), then F(Xn+1) F(X n+1). If F(Y n+1) > F(X n+1), we may accidentally move Xn+1 to Y n+1 due to the estimation errors. In particular,

En[F(Xn+1)] =F(X n+1) Pn(Fn(X n+1) Fn(Y n+1) + t0)

+ F(Y n+1) Pn(Fn(X n+1) > Fn(Y n+1) + t0).

This implies

En[F(Xn+1)] F(X n+1)

=(F(Y n+1) F(X n+1)) Pn( ˆFn(X n+1) > ˆFn(Y n+1) + t0)

=(F(Y n+1) F(X n+1)) Pn(ζX n+1 ζY n+1 > F(Y n+1) F(X n+1) + t0)

(F(Y n+1) F(X n+1)) exp( (F(Y n+1) F(X n+1) + t0)2/θ) by Assumption 6

2t0 exp t2 0 θ

as F(Y n+1) > F(X n+1) and e x 1/x for x > 0

Lh2θ as θ t2 0 log(2Lh2t0).

Thus, En[F(Xn+1)] F(Xn) 1

2 F(Xn) 2h + 2d Lh2θ. For Claim 2), we start by establishing a bound for F(Y n+1). Let

2γZn and Yn = Y n+1 Yn = F(Yn)h +

Dong and Tong

By Rolle s theorem, there exits yn on the line segment between Yn and Y n+1, such that

F(Y n+1) =F(Yn) + F(yn)T Yn

=F(Yn) + F(Yn)T Yn + ( F(yn) F(Yn))T Yn

F(Yn) + F(Yn)T Yn + L Yn 2

by Cauchy-Schwarz inequality and Assumption 1

F(Yn) F(Yn)T F(Yn)h +

h F(Yn)T Wn

+ L F(Yn)T F(Yn)h2 2

hh L F(Yn)T Wn + Lh W T n Wn

2 F(Yn) 2h + β

h F(Yn)T Wn + Lh Wn 2 as Lh < 1

where β = 1 2h L (0, 1). Taking conditional expectation yields the ﬁrst estimate. Next, note that for any 0 < b < min{1/(8γ), 1/(4hθ)}, we have

E exp(a T Wn + b Wn 2)

ha T ξY n + p

2γa T Zn + 4γb Zn 2 + 2bh ξY n 2 i

by Young s inequality Wn 2 4γ Zn 2 + 2h ξY n 2

(1 8γb) d/2 exp γ a 2

(1 4bhθ) d/2 exp h a 2θ 2(1 4bhθ)

by Assumption 6 and the fact that Zn N(0, I).

En[ ˆV (Y n+1)]

ˆV (Yn) exp ˆηh

2 F(Yn) 2 En h exp ˆηβ

h F(Yn)T Wn + ˆηLh Wn 2 i by (15)

= ˆV (Yn) exp ˆηh

2 F(Yn) 2 (1 8γˆηLh) d/2 exp γˆη2β2h F(Yn) 2

(1 4ˆηLh2θ) d/2 exp hθˆη2β2h F(Yn) 2

2(1 4ˆηLh2θ)

by (16) as γˆη < 1/16, ˆηhθ < 1/8 and Lh < 1/2

ˆV (Yn) exp ˆηh

2 F(Yn) 2 exp ˆηh

4 F(Yn) 2 + 16ˆηγLhd

2 + 8ˆηLh2θd

as 8γˆηLh < 1/4, 4ˆηLh2θ < 1/4 and β < 1

ˆV (Yn) exp ˆηh

4 F(Yn) 2 + ˆηh(8γLd + 4Lhθd)

ˆV (Yn) exp ˆηh

4 λc F(Yn) + ˆηh

4 Mc + ˆηh(8γLd + 4Lhθd) by Assumption 2.

The upper bound for En[ ˆV (X n+1)] can be obtained in a similar way. Lastly, for Claim 3), we ﬁrst note following the same argument as (15), we have

F(X n+1) F(Xn) 1

2 F(Xn) 2h βh F(Xn)T ξX n + Lh2 ξX n 2.

Replica Exchange for Non-Convex Optimization

En[ ˆV (X n+1)] ˆV (Xn) exp 1

2 F(Xn) 2ˆηh

(1 2ˆηLh2θ) d/2 exp ˆη2β2h2θ F(Xn) 2

2(1 2ˆηLh2θ)

ˆV (Xn) exp 4ˆηLh2θ d

2 as ˆηhθ < 1/8 and h L < 1/2

ˆV (Xn) exp(ˆηhθd) as h L < 1/2.

Next, note that when X n+1 > ˆ MV or Y n+1 > ˆ MV , ˆV (Xn+1) ˆV (X n+1). When max{ X n+1 , Y n+1 } ˆ MV , if F(Y n+1) F(X n+1), ˆV (Xn+1) ˆV (X n+1). If F(Y n+1) > F(X n+1), we have

En[ ˆV (Xn+1)]

= ˆV (X n+1)P ˆFn(X n+1) ˆFn(Y n+1) + t0 + ˆV (Y n+1)P ˆFn(X n+1) > ˆFn(Y n+1) + t0

ˆV (X n+1) 1 + exp(ˆη(F(Y n+1) F(X n+1))) exp 1

θ F(Y n+1) F(X n+1) + t0 2

by Assumption 6

ˆV (X n+1) 1 + exp t2 0 θ as ˆη < 2t0/θ

ˆV (X n+1) exp(ˆηhθd) as 1 + exp( t2 0/θ) < exp(ˆηhθd).

Thus, En[ ˆV (Xn+1)] En[ ˆV (X n+1) exp(ˆηhθd)] ˆV (Xn) exp(2ˆηhθd).

Recall that ˆRV = 8λ 1 c ˆCV . We deﬁne a sequence of stopping times:

ˆτ0 = min n n 0 : F(Yn) ˆRV o ,

and for k = 1, 2, . . . ,

ˆτk = min n n > ˆτk 1 : F(Yn) ˆRV o .

Utilizing the Lyapunov function ˆV , our second result establishes bounds for the moment generating function of the stopping times.

Lemma 19 For (n)SGDx SGLD, under Assumptions 1, 2 and 6, if Lh 1/2 and ˆη min{(16γ) 1, (8hθ) 1}, for any K 0, the stopping time ˆτK satisﬁes

E[exp(hˆη ˆCV ˆτK)] exp(2Khˆη ˆCV + Kˆη ˆRV ) ˆV (Y0).

Dong and Tong

The proof of Lemma 19 follows exactly the same lines of arguments as the proof of Lemma 8. We thus omit it here. Let ˆD = max{ x h F(x) : F(x) ˆRV }.

Following the similar lines of arguments as Lemma 9, we have the following result.

Lemma 20 For (n)SGDx SGLD, under Assumption 6, if F(Yn) ˆRV , then, for any r > 0, there exists ˆα(r, ˆD) > 0, such that

Pn( Y n+1 r) > ˆα(r, ˆD).

In particular,

ˆα(r, ˆD) Sdrd

2τh( ˆD2 + r2) > 0.

Pn( Y n+1 r) = Pn( Yn h F(Yn) hξY n + p

Pn h ξY n r

where Qn = (Yn h F(Yn))/

2τh. Note that as F(Yn) ˆRV , Qn ˆD/

(2π) d 2 exp 1

2 z + Qn 2 dz

(2π) d 2 exp 1

2τh( ˆD2 + r2) dz

(8τhπ) d 2 exp 1

2τh( ˆD2 + r2) .

Lastly, by Markov inequality,

Pn h ξY n r

1 E[4h2 ξY n 2] r2 1 4h2θd

Our next result shows that if F(Y n+1) 1 4r0, there is a positive probability that F(Xn+1) 1

Lemma 21 For (n)SGDx SGLD, under Assumption 6, if t0 1

8r0, θ r2 0 64 log 2, F(Y n+1)

1 4r0, and X n+1 ˆ MV , then

Replica Exchange for Non-Convex Optimization

Proof Note that if F(X n+1) 1 2r0, F(Xn+1) is guaranteed to be less than 1 2r0. If F(X n+1) > 1

2r0, the probability of exchange is

Pn ˆFn(X n+1) ˆFn(Y n+1) + t0

= Pn F(X n+1) F(Y n+1) + ξX n ξY n t0

Pn ξX n ξY n 1

8r0 as F(X n+1) F(Y n+1) > 1

4r0 and t0 1

1 exp r2 0 64θ

under Assumption 6

2 as θ r2 0 64 log 2.

If exchange takes place, Xn+1 = Y n+1 and F(Xn+1) 1

B.2 Convergence to global minimum

In this subsection, we analyze the speed of convergence {Xn+k : k 0} to x when F(Xn) 1

2r0. Let κn = inf{k > 0 : F(Xn+k) > r0}.

Lemma 22 For (n)SGDx SGLD, under Assumption 1 and 6, and assuming Lh 1/2 and ˆη < (8hθ) 1, if F(Xn) 1

2r0, then for any ﬁxed k > 0,

Pn(κn > k) 1 exp r0

Proof From Lemma 18, the following is a supermartingale

ˆV (Xn+k) exp 1

In particular,

ˆV (Xn+k+1)] exp 1

4d(k + 1) 1κn k+1

ˆV (Xn+k) exp 1

En[ ˆV (Xn+(κn k)) exp( 1

4d(κn k))] ˆV (Xn) exp 1

We also note

ˆV (Xn+(κn k))] exp 1

4d(κn k) En

exp(ˆηr0) exp 1

Dong and Tong

Pn(κn k) exp 1

since 8ˆηhθ < 1.

Lemma 23 For (n)SGDx SGLD, under Assumptions 1 and 6, and assuming F is strongly convex in B0 and h min{1/(2L), 1/m}, if F(Xn) 1

En[F(Xn+k)1κn>k] (1 mh)k F(Xn) + dθ

for all k 0.

Proof We ﬁrst note from Lemma 18, if F(Xn) r0,

En+j[F(Xn+j+1)] F(Xn+j) 1

2 F(Xn+j) 2h + 2d Lh2θ

(1 mh)F(Xn+j) + 2d Lh2θ,

where the second inequality follows from (12) as F(x) is strongly convex in B0. Next, we note

En[F(Xn+(κn k))] En

(1 mh)(κn k) 1F(Xn) +

j=1 (1 mh)(κn k) jd Lh2θ

F(Xn) + 2d Lhθ

m F(Xn) + dθ

m as Lh < 1/2.

Because En[F(Xn+(κn k))] > En[F(Xn+k)1κn>k],

En[F(Xn+k)1κn>k] (1 mh)k F(Xn) + dθ

B.3 Proof of Theorem 4

For any ﬁxed accuracy ϵ > 0 and δ, we set

K(δ) = log(δ/3)

log(1 ˆα(r0/4, ˆD)) = O( log(δ)),

k(ϵ, δ) = log(2δϵ/(9r0))

log(1 mh) = O( log(δ) log(ϵ)),

N(ϵ, δ) = k(ϵ, δ)+ 2K(δ)hˆηv CV + K(δ)ˆη ˆRV + log ˆV (Y0) log(δ/3)

hˆη ˆCV = O( log(δ) log(ϵ)).

Replica Exchange for Non-Convex Optimization

For any ﬁxed N > N(ϵ, δ), we set

9d , r0 16h(d N/4 log(δ/9)), t2 0 log(2Lh2t0), t2 0 log(exp(d/8) 1), r2 0 64 log 2

=O(min{N 1, ϵδ}).

Now for ﬁxed N > N(ϵ, δ) and θ θ(N, ϵ, δ), we ﬁrst note if F(Xn) 1

2r0 for n N k(ϵ, δ),

Pn(F(XN) > ϵ) =Pn(F(XN) > ϵ, κn > N n) + Pn(F(Xn+k) > ϵ, κn N n)

Pn(F(XN)1κn>N n > ϵ) + Pn(κn N n)

(1 mh)N n F(Xn) + dθ

by Markov inequality, Lemma 23, and Lemma 22

(1 mh)k(ϵ,δ) r0

3δ by our choice of θ(N, ϵ, δ) and k(ϵ, δ).

Next, we study how long it takes Xn to visit the set {x : F(x) r0/2}. In particular, we denote T = inf{n : F(Xn) r0/2}. From Lemma 20 and 21, every time Yn {x : F(x) ˆRV },

2 ˆα(r0/4, ˆD) > 0.

P(F(Xˆτk+1) for k = 1, . . . , K(δ)) = E

F(Xˆτk+1) > 1

2 ˆα(r0/4, ˆD) K(δ) < δ

From Lemma 19, by Markov inequality,

P(ˆτK(δ) > N k(ϵ, δ))

E[exp(hˆη ˆCV ˆτK(δ))]

exp(hˆη ˆCV (N k(ϵ, δ)))

exp(2K(δ)hˆηv CV + K(δ)ˆη ˆRV )V (Y0)

exp(hˆη ˆCV (N k(ϵ, δ)))

3 since N k(ϵ, δ) > 2K(δ)hˆηv CV + K(δ)ˆη ˆRV + log V (Y0) log δ + log 3

Dong and Tong

P(T N k(ϵ, δ)) P ˆτK(δ) N k(ϵ, δ) and F( ˆXˆτk+1) for some k = 1, . . . , K(δ)

P(F(XN) ϵ) P(F(XN) ϵ, T N k(ϵ, δ))

E[PT (F(XN) ϵ)|T N k(ϵ, δ)]P(T N k(ϵ, δ))

B.4 Proof of Theorem 5

Let ˆYn+1 = ˆYn F( ˆYn)h ξY n h + p

This is to be diﬀerentiated with Yn in Algorithm 2, which can swap position with Xn. We denote ˆµn as the distribution of ˆYn.

Lemma 24 Under Assumptions 1, 5, and 6, for h < β 4

KL(ˆµn|πγ) exp( 1

2βγh(n 1)) 1

γ F( ˆY0) + ˆ Md

+ 6h L2d + θd/γ

where ˆ Md = 2Lhd d

2(log(4πh) + 1) + log(Uγ).

Proof For a given realization of ξY 0 and ˆY0,

ˆY1 N(m( ˆY0), 2γh Id), where m( ˆY0) = ˆY0 F( ˆY0)h ξY 0 h.

Then, given Y0,

KL(ˆµ1 πγ) = Z log ˆµ1(y)

2 log(4πγh) y m( ˆY0) 2

4γh + log Uγ + F(y)

2(log(4πγh) + 1) + log Uγ + 1

γ E[F( ˆY1)| ˆY0]

2(log(4πγh) + 1) + log Uγ + 1

γ (F( ˆY0) + 2Lh ξY 0 2 + 2γLhd)

γ F( ˆY0) + ˆ Md + 2Lh

where the last inequality follows from (15).

Replica Exchange for Non-Convex Optimization

Next, following the proof of Lemma 3 in (Vempala and Wibisono, 2019), for a given ξY 1 , with slight abuse of notation, consider the modiﬁed diﬀusion process:

d Yt = F( ˆY1)dt ξY 1 dt + p

2γd Bt, Y1 = ˆYt

for t 1. Note that ˆY2 follows the same law as Y1+h. Let µt denote the distribution of Yt. Then

d dt KL(µt πγ)

= γ Z log µt(x)

2 µt(x)dx + E F(Yt) ξY 1 F( ˆY1), log µt(Yt)

Z log µt(x)

2 µt(x)dx + 1

γ E F(Yt) ξY 1 F( ˆY1) 2

Z log µt(x)

2 µt(x)dx + 2

γ L2E[ Yt ˆY1 2] + 2

Because Yt d= ˆY1 ( F( ˆY1)+ξY 1 )(t 1)+ p

2γ(t 1)Z1, where Z1 is a standard d-dimensional Gaussian random vector,

E[ Yt ˆY1 2] = 2(t 1)2E[ F( ˆY1) 2] + 2(t 1)2 ξY 1 2 + 2γ(t 1)d

β KL(ˆµ1 πγ) + 4(t 1)2d L + 2(t 1)2 ξY 1 2 + 2γ(t 1)d.

In addition, because,

Z log µt(x)

2 µt(x)dx 2βKL(µt(x) πγ),

d dt KL(µt πγ)

2 KL(µt πγ)

β KL(ˆµ1 πγ) + 4(t 1)2d L + 2(t 1)2 ξY 1 2 + 2γ(t 1)d + 2

2 KL(µt πγ) + 16(t 1)2L4

βγ KL(µ1 πγ) + 1

γ (4(t 1)2L2 + 2) ξY 1 2

+ 8(t 1)2d L3

γ + 4(t 1)d L2,

which further implies that

KL(µ1+h πγ) e 3

2 βγh KL(µ1 πγ)+16h3L4

βγ KL(µ1 πγ)+4h3L2 + 2h

γ ξY 1 2+8h3d L3

Dong and Tong

For h < βγ 4

2L2 , we have

KL(µ1+h πγ) (1 βhγ) KL(µ1 πγ) + 3h

γ ξY 1 2 + 6h2L2d.

The above analysis implies that given ξY k s,

KL(ˆµn πγ) (1 βhγ) KL(ˆµn 1 πγ) + 3h

γ ξY 1 2 + 6h2L2d

2βh(n 1)γ)KL(ˆµ1 πγ) + 6h L2d

k=1 (1 hβγ)n k ξY k 2.

Plug in the bound for KL(ˆµ1 πγ) in (17) and taking the expectation with respect to ξY k s, we have

KL(ˆµn πγ) exp( 1

2βh(n 1)γ) 1

γ F( ˆY0) + ˆ Md

k=0 (1 hβγ)n kθd

2βh(n 1)γ) 1

γ F( ˆY0) + ˆ Md

+ 6h L2d + θd/γ

Based on Lemma 24, let

βγ h 1 log(1/h) + 1.

For n n0 and h small enough,

KL(ˆµn πγ) h2 1

γ F( ˆY0) + ˆ Md

+ 6h L2d + θd/γ

h F( ˆY0) + 8h L2d + θd/γ

Recall that ˆB0 = {x : F(x) r0/4}.

We next draw connection between the bound (18) and the hitting time of Yn to ˆB0. For n SGDx SGLD, let ˆφ = minn{Yn ˆB0}. With a slight abuse of notation, for SGDx SGLD, let ˆφ = minn{Xn = Y n or Yn ˆB0}.

Lemma 25 For (n)SGDx SGLD, under Assumptions 1, 2, 5, and 6,

P(ˆφ n0) πγ( ˆB0)

h F( ˆY0) 2

Proof For both n SGDx SGLD and SGDx SGLD,

P(ˆφ n) P( ˆYn ˆB0).

Replica Exchange for Non-Convex Optimization

By Pinsker s inequality,

dtv(ˆµn, πγ)

1 2KL(ˆµn πγ)

h F( ˆY0) + 2

where the last inequality follows from (18). Then

P( ˆYn0 ˆB0) πγ( ˆB0)

h F( ˆY0) 2

Lemma 26 For (n)GDx LD, ﬁx any K and T (n0 + 1)K,

P0(ˆφ > T) exp(2(n0 + 1)Khˆη ˆCV + (n0 + 1)Kˆη ˆRV hˆη ˆCV T) ˆV (Y0)

1 πγ( ˆB0) +

Proof Recall the sequence of stoping times ˆτj = inf{n > τj 1 : F(Yn) ˆRV }. Applying Lemma 19, we have

E[exp(hˆη ˆCV ˆτk)] exp(2khˆη ˆCV + kˆη ˆRV ) ˆV (Y0).

We next deﬁne a new sequence of stopping times:

ˆψ0 = inf n n : F(Yn) ˆRV o , ˆψ 0 = ˆψ0 + ˆn0,

and for k = 1, . . . ,

ˆψk = inf n n ˆψ k 1 + 1, F(Yn) ˆRV o , ˆψ k = ˆψk + ˆn0 + 1.

Note that ˆψi always coincide with one of ˆτj s, and as ˆτj ˆτj 1 1,

ˆψk ˆτ(n0+1)k.

P0(ˆφ T) P0 ˆτ(n0+1)K T + P0 Yψ k / B0, k K

exp(2(n0 + 1)Khˆη ˆCV + (n0 + 1)Kˆη ˆRV hˆη ˆCV T) ˆV (Y0)

1 πγ( ˆB0) +

!K by Lemma 25.

Dong and Tong

We are now ready to prove Theorem 5. For a ﬁxed N, let AN denote the event ξX n ξY n > 1

2t0 for some n N

n=1 P ζX n ζY n > 1

N exp t2 0 θ

!2 and θ < πγ( ˆB0)2βγ2

1 πγ( ˆB0) +

θd βγ 1 πγ( ˆB0)/2

Following the proof of Theorem 4, we also set

k(ϵ, δ) = log(2δϵ/(9r0))

θ(N, δ) = min

( 1 t2 0 log δ

9d , r0 16h(d N/4 log(δ/9)), πγ( ˆB0)2βγ2

In what follows we assume θ θ(N, δ). For n SGDx SGLD, deﬁne J = 1,

ˆφ0 = min{n 0 : F(Yn) ˆB0},

and for k 1, ˆφk = min{n > 0 : F(Yφk+n) ˆB0}.

For SGDx SGLD, with a slight abuse of notation, deﬁne J = ˆRv/t0,

ˆφ0 = min{n 0 : F(Xn) ˆRV },

and for k 1, ˆφk = min{n : Xφk 1+n = Y φk 1+n or Yφk 1+n ˆB0}.

We next note that

P(F(XN) > ϵ) P(AN) + P(F(XN) > ϵ, φJ < N, Ac N)

+ P(φJ > N, Ac N) (19)

We shall look at the three terms on the left hand side of (19) one by one. First, with our choice of θ, P(AN) δ/3.

Replica Exchange for Non-Convex Optimization

Second, suppose there exist n < N k(ϵ, δ) such that F(Xn) < r0/2. Let κn = inf{k > 0 : F(Xn) > r0}. Then,

Pn(F(XN) > ϵ) =Pn(F(XN) > ϵ, κn > N n) + Pn(F(XN) > ϵ, κn N n)

Pn(F(XN)1κn>N n > ϵ) + Pn(κn N n)

(1 mh)N n F(Xn) + dθ

by Markov inequality, Lemma 23, and Lemma 22

(1 mh)k(ϵ,δ) r0

3δ by our choice of θ and k(ϵ, δ).

In addition, from Lemma 21,

2r0 Y n ˆB0

Thus every time Y n ˆB0, there is a chance larger than or equal to 1

2 for Xn to visit {x : F(x) r0/2}. Lastly, we study how long it takes Yn to visit ˆB0. Conditional on Ac N and φJ < N, there exists at least one φk, k J, such that Yφk ˆB0. In addition, at all φk, 1 k J, F(Yφk) ˆRV + t0. Let φ 1 0. Then, for N > (J + 1)T

P0(φJ (J + 1)T|Ac N)

k=0 Pφk 1 (φk+1 > T|Ac N)

(J + 1) exp 2(n0 + 1)Khˆη ˆCV + (n0 + 1)Kˆη ˆRV hˆη ˆCV T ( ˆV (Y0) + ˆRV + t0)

1 πγ( ˆB0) +

!K by Lemma 26

(J + 1) exp 2(n0 + 1)Khˆη ˆCV + (n0 + 1)Kˆη ˆRV hˆη ˆCV T ( ˆV (Y0) + ˆRV + t0)

+ (J + 1) 1 πγ( ˆB0)/2 K by our choice of h and θ.

K(δ) = log(δ) log(6(J + 1))

log(1 πγ( ˆB0)/2)

T(β, δ) =2(n0 + 1)K(δ)hˆη ˆCV + (n0 + 1)K(δ)ˆη ˆRV

+ log ˆV (Y0) + ˆRV + t0 log(δ) + log(6(J + 1))

Dong and Tong

We have for K > K(δ) and T > T(δ, β), P(φJ T, Ac N) < δ/3. Above all, if we set

N(β, ϵ, δ) = T(β, δ) + k(δ, ϵ) = O(log(1/δ)/β) + O(1/ϵ),

we have for N N(β, ϵ, δ), P(F(XN) > ϵ) δ.

Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. In Advances in Neural Information Processing Systems, 2018.

Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Global optimality of local search for low rank matrix recovery. In Advances in Neural Information Processing Systems, 2016.

L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223 311, 2018.

S. Bubeck, R. Eldan, and J. Lehec. Sampling from a log-concave distribution with projected langevin monte carlo. Discrete and Computational Geometry, 59(4):757 783, 2018.

X. Chen, S. Du, and X. T. Tong. On stationary-point hitting time and ergodicity of stochastic gradient Langevin dynamics. ar Xiv:1904.13016, 2019a.

Y. Chen, J. Chen, J. Dong, J. Peng, and Z. Wang. Accelerating nonconvex learning via replica exchange Langevin diﬀusion. In International Conference on Learning Representations, 2019b.

Ran Cheng and Yaochu Jin. A competitive swarm optimizer for large scale optimization. IEEE transactions on cybernetics, 45(2):191 204, 2014.

X. Cheng, N.S. Chatterji, P.L. Bartlett, and M.I. Jordan. Underdamped langevin MCMC: A non-asymptotic analysis. In Machine Learning Research, volume 75, pages 1 24, 2018.

Tiangang Cui, James Martin, Youssef M Marzouk, Antti Solonen, and Alessio Spantini. Likelihood-informed dimension reduction for nonlinear inverse problems. Inverse Problems, 30(11):114015, 2014.

Frank E Curtis, Daniel P Robinson, and Mohammadreza Samadi. A trust region algorithm with a worst-case iteration complexity of O(ϵ 3/2) for nonconvex optimization. Mathematical Programming, pages 1 32, 2014.

A.S. Dalalyan. Theoretical guarantees for approximate sampling from a smooth and logconcave density. J. R. Stat. Soc. B, 79:651 676, 2017.

A.S. Dalalyan and A. Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes and their Applications, 129:5278 5311, 2019.

Replica Exchange for Non-Convex Optimization

Hadi Daneshmand, Jonas Kohler, Aurelien Lucchi, and Thomas Hofmann. Escaping saddles with stochastic gradients. In Proceedings of the International Conference on Machine Learning, 2018.

Jing Dong and Xin T Tong. Spectral gap of replica exchange langevin diﬀusion on mixture distributions. ar Xiv preprint ar Xiv:2006.16193, 2020.

Simon Du, Jason Lee, Yuandong Tian, Aarti Singh, and Barnabas Poczos. Gradient descent learns one-hidden-layer CNN: Don t be afraid of spurious local minima. In Proceedings of the International Conference on Machine Learning, 2018.

P. Dupuis, Y. Liu, N. Plattner, and J.D. Doll. On the inﬁnite swapping limit for parallel tempering. Multiscale Model Simulation, 10(3):986 1022, 2012.

A. Durmus and E. Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. Annals of Applied Probability, 27(3):1551 1587, 2017.

D. J. Earl and M. W. Deem. Parallel tempering: Theory, applications, and new perspectives. Physical Chemistry Chemical Physics, 7:3910 3916, 2005.

Susana C Esquivel and CA Coello Coello. On the use of particle swarm optimization with multimodal functions. In The 2003 Congress on Evolutionary Computation, 2003. CEC 03., volume 2, pages 1130 1136. IEEE, 2003.

Roger G amperle, Sibylle D M uller, and Petros Koumoutsakos. A parameter study for diﬀerential evolution. Advances in intelligent systems, fuzzy systems, evolutionary computation, 10(10):293 298, 2002.

Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A uniﬁed geometric analysis. In Proceedings of the International Conference on Machine Learning, 2017.

Rong Ge, Jason D Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. In Proceedings of the International Conference on Learning Representations, 2018.

S.B. Gelfand and S.K. Recursive stochastic algorithms for global optimization in Rd. SIAM Journal on Control and Optimization, 29(5):999 1018, 1991.

Charles J. Geyer and Elizabeth A. Thompson. Annealing Markov Chain Monte Carlo with applications to ancestral inference. Journal of the American Statistical Association, 90 (431):909 920, 1995.

B. Gidas. Nonstationary markov chains and convergence of the annealing algorithm. Journal of Statistical Physics, 39(1/2), 1985.

Vincent Granville, Mirko Kriv anek, and J-P Rasson. Simulated annealing: A proof of convergence. IEEE transactions on pattern analysis and machine intelligence, 16(6): 652 656, 1994.

Dong and Tong

M. Hairer, A.M. Stuart, and S.J. Vollmer. Spectral gaps for a Metropolis Hastings algorithm in inﬁnite dimensions. Ann. Appl. Probab., 24(6):2455 2490, 2014.

Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points eﬃciently. In Proceedings of the International Conference on Machine Learning, 2017.

Chi Jin, Praneeth Netrapalli, and Michael I Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. In Proceedings of the Conference on Learning Theory, 2018.

Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M. Kakade, and Michael I. Jordan. On nonconvex optimization for machine learning: Gradients, stochasticity, and saddle points. 2019.

Scott Kirkpatrick, C Daniel Gelatt, and Mario P Vecchi. Optimization by simulated annealing. science, 220(4598):671 680, 1983.

Holden Lee, Andrej Risteski, and Rong Ge. Beyond log-concavity: Provable guarantees for sampling multi-modal distributions using simulated tempering langevin monte carlo. In Neur IPS, 2018.

Y. Ma, Y. Chen, C. Jin, N. Flammarion, and M. I. Jordan. Sampling can be faster than optimization. Proceedings of the National Academy of Sciences, 116(42):20881 20885, 2019.

Yi-An Ma, Tianqi Chen, and Emily B. Fox. A complete recipe for stochastic gradient MCMC. In International Conference on Neural Information Processing Systems, volume 2, pages 2917 2925, 2015.

Oren Mangoubi and Nisheeth K Vishnoi. Convex optimization with unbounded nonconvex oracles using simulated annealing. In Conference On Learning Theory, pages 1086 1124. PMLR, 2018.

Enzo Marinari and Giorgio Parisi. Simulated tempering: a new Monte Carlo scheme. EPL (Europhysics Letters), 19(6):451, 1992.

Song Mei, Theodor Misiakiewicz, Andrea Montanari, and Roberto I Oliveira. Solving SDPs for synchronization and maxcut problems via the Grothendieck inequality. In Proceedings of the Conference on Learning Theory, 2017.

Georg Menz and Andr e Schlichting. Poincar e and logarithmic sobolev inequalities by decomposition of the energy landscape. The Annals of Probability, 42(5):1809 1884, 2014.

Matthias Morzfeld, Xin T Tong, and Youssef M Marzouk. Localization for mcmc: sampling high-dimensional posterior distributions with local structure. Journal of Computational Physics, 380:1 28, 2019.

Y. Nesterov. Smooth minimization of non-smooth functions. Math. Program., Ser. A, 103: 127 152, 2005.

Replica Exchange for Non-Convex Optimization

Y. Nesterov. Introductory lectures on convex optimization: A basic course., volume 87. Springer Science+Business Media, 2013.

Yurii Nesterov and Boris T Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177 205, 2006.

Dohyung Park, Anastasios Kyrillidis, Constantine Carmanis, and Sujay Sanghavi. Nonsquare matrix sensing without spurious local minima via the Burer-Monteiro approach. In Artiﬁcial Intelligence and Statistics, 2017.

M. Pelletier. Weak convergence rates for stochastic approximation with application to multiple targets and simulated annealing. Annals of Applied Probability, pages 10 44, 1998.

Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: A nonasymptotic analysis. In Proceedings of the Conference on Learning Theory, 2017.

Robert H Swendsen and Jian-Sheng Wang. Replica Monte Carlo simulation of spin-glasses. Physical review letters, 57(21):2607, 1986.

Xin T. Tong, Mathias Morzfeld, and Youssef M. Marzouk. Mala-within-gibbs samplers for high-dimensional distributions with sparse conditional structure. SIAM Journal on Scientiﬁc Computing, 42(3):A1765 A1788, 2020.

Santosh S Vempala and Andre Wibisono. Rapid convergence of the unadjusted langevin algorithm: Isoperimetry suﬃces. ar Xiv preprint ar Xiv:1903.08568, 2019.

D. B. Woodard, S. C. Schmidler, and M. Huber. Conditions for rapid mixing of parallel and simulated termpering on multimodal distribution. Annals of Applied Probability, 19 (2):617 640, 2009.

Pan Xu, Jinghui Chen, Difan Zou, and Quanquan Gu. Global convergence of Langevin dynamics based algorithms for nonconvex optimization. In Advances in Neural Information Processing Systems, 2018.

Yaodong Yu, Pan Xu, and Quanquan Gu. Third-order smoothness helps: Faster stochastic optimization algorithms for ﬁnding local minima. In Advances in Neural Information Processing Systems, 2018.