Global Non-convex Optimization with Discretized Diffusions

Murat A. Erdogdu 1,2  erdogdu@cs.toronto.edu
Lester Mackey 3  lmackey@microsoft.com
Ohad Shamir 4  ohad.shamir@weizmann.ac.il
1 University of Toronto  2 Vector Institute  3 Microsoft Research  4 Weizmann Institute of Science

Abstract

An Euler discretization of the Langevin diffusion is known to converge to the global minimizers of certain convex and non-convex optimization problems. We show that this property holds for any suitably smooth diffusion and that different diffusions are suitable for optimizing different classes of convex and non-convex functions. This allows us to design diffusions suitable for globally optimizing convex and non-convex functions not covered by the existing Langevin theory. Our non-asymptotic analysis delivers computable optimization and integration error bounds based on easily accessed properties of the objective and chosen diffusion. Central to our approach are new explicit Stein factor bounds on the solutions of Poisson equations. We complement these results with improved optimization guarantees for targets other than the standard Gibbs measure.

1 Introduction

Consider the unconstrained and possibly non-convex optimization problem

    min_{x ∈ ℝ^d} f(x).

Recent studies have shown that the Langevin algorithm, in which an appropriately scaled isotropic Gaussian vector is added to a gradient descent update, globally optimizes f whenever the objective is dissipative (⟨∇f(x), x⟩ ≥ α‖x‖₂² − β for some α, β > 0) with a Lipschitz gradient [14, 25, 29]. Remarkably, these globally optimized objectives need not be convex and can even be multimodal. The intuition behind the success of the Langevin algorithm is that the stochastic optimization method approximately tracks the continuous-time Langevin diffusion, which admits the Gibbs measure, a distribution with density pγ(x) ∝ exp(−γf(x)), as its invariant distribution.
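As a minimal, hypothetical sketch (toy objective and tuning choices, not this paper's experiments), the Langevin update just described, a gradient descent step plus appropriately scaled isotropic Gaussian noise, can be written as:

```python
import numpy as np

def langevin_step(x, grad_f, eta, gamma, rng):
    """One Langevin update: gradient descent plus sqrt(2*eta/gamma)-scaled noise."""
    noise = rng.standard_normal(x.shape)
    return x - eta * grad_f(x) + np.sqrt(2.0 * eta / gamma) * noise

# Illustration: a non-convex double well f(x) = (x^2 - 1)^2 with global minima at +/-1.
rng = np.random.default_rng(0)
grad_f = lambda x: 4.0 * x * (x**2 - 1.0)
x = np.array([3.0])
for _ in range(2000):
    x = langevin_step(x, grad_f, eta=1e-3, gamma=20.0, rng=rng)
# For large gamma, the iterates concentrate near a global minimizer x = +/-1.
```

With a large inverse temperature γ, the stationary noise around a minimizer is small, so the final iterate lands near ±1 despite the non-convexity.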
Here, γ > 0 is an inverse temperature parameter, and when γ is large, the Gibbs measure concentrates around its modes. As a result, for large values of γ, a rapidly mixing Langevin algorithm will be close to a global minimum of f. In this case, rapid mixing is ensured by the Lipschitz gradient and dissipativity. Due to its simplicity, efficiency, and well-understood theoretical properties, the Langevin algorithm and its derivatives have found numerous applications in machine learning [see, e.g., 28, 6].

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

In this paper, we prove an analogous global optimization property for the Euler discretization of any smooth and dissipative diffusion and show that different diffusions are suitable for solving different classes of convex and non-convex problems. Our non-asymptotic analysis, based on a multidimensional version of Stein's method, establishes explicit bounds on both integration and optimization error. Our contributions can be summarized as follows:

- For any function f, we provide explicit O(1/(Mη) + η) bounds on the numerical integration error of discretized dissipative diffusions, where M is the number of steps and η the step size. Our bounds depend only on simple properties of the diffusion's coefficients and Stein factors, i.e., bounds on the derivatives of the associated Poisson equation solution.
- For pseudo-Lipschitz f, we derive explicit first through fourth-order Stein factor bounds for every fast-coupling diffusion with smooth coefficients. Since our bounds depend on Wasserstein coupling rates, we provide user-friendly, broadly applicable tools for computing these rates. The resulting computable integration error bounds recover the known Markov chain Monte Carlo convergence rates of the Langevin algorithm in both convex and non-convex settings but apply more broadly.
- We introduce new explicit bounds on the expected suboptimality of sampling from a diffusion.
Together with our integration error bounds, these yield computable and convergent bounds on global optimization error. We demonstrate that improved optimization guarantees can be obtained by targeting distributions other than the standard Gibbs measure.

We show that different diffusions are appropriate for different objectives f and detail concrete examples of global non-convex optimization enabled by our framework but not covered by the existing Langevin theory. For example, while the Langevin diffusion is particularly appropriate for dissipative, and hence quadratic-growth, f [25, 29], we show alternative diffusions are appropriate for heavy-tailed f with subquadratic or sublinear growth.

We emphasize that, while past work has assumed the existence of finite Stein factors [4, 29], focused on deriving convergence rates with inexplicit constants [23, 26, 29], or concentrated solely on the Langevin diffusion [6, 9, 29, 25], the goals of this work are to provide the reader with tools to (a) check the appropriateness of a given diffusion for optimizing a given objective and (b) compute explicit optimization and integration error bounds based on easily accessed properties of the objective and chosen diffusion.

The rest of the paper is organized as follows. Section 1.1 surveys related work. Section 2 provides an introduction to diffusions and their use in optimization and reviews our notation. Section 3 provides explicit bounds on integration error in terms of Stein factors and on Stein factors in terms of simple properties of f and the diffusion. In Section 4, we provide explicit bounds on optimization error by targeting Gibbs and non-Gibbs invariant measures and discuss how to obtain better optimization error using non-Gibbs invariant measures. We give concrete examples of applying these tools to non-convex optimization problems in Section 5 and conclude in Section 6.
1.1 Related work

The Euler discretization of the Langevin diffusion is commonly termed the Langevin algorithm and has been studied extensively in the context of sampling from a log-concave distribution. Non-asymptotic integration error bounds for the Langevin algorithm are studied in [8, 7, 9, 10]. A representative bound follows from combining the ergodicity of the diffusion with a discretization error analysis and yields error ε in O(ε⁻² poly(log(ε⁻¹))) steps for the strongly log-concave case and O(ε⁻⁴ poly(log(ε⁻¹))) steps for the general log-concave case [7, 9].

Our work is motivated by a line of research that uses the Langevin algorithm to globally optimize non-convex functions. Gelfand and Mitter [14] established the global convergence of an appropriate variant of the algorithm, and Raginsky et al. [25] subsequently used optimal transport theory to prove optimization and integration error bounds. For example, [25] provides an integration error bound of ε after O(λ∗⁻¹ ε⁻⁴ poly(log(ε⁻¹))) steps under the quadratic-growth assumptions of dissipativity and a Lipschitz gradient; the estimate involves the inverse spectral gap parameter λ∗⁻¹, a quantity that is often unknown and sometimes exponential in both inverse temperature and dimension. Gao et al. [13] obtained similar guarantees for stochastic Hamiltonian Monte Carlo algorithms for empirical and population risk minimization under a dissipativity assumption with rate estimates. In this work, we accommodate heavy-tailed objectives that grow subquadratically and trade the often unknown and hence inexplicit spectral gap parameter of [25] for the more user-friendly distant dissipativity condition (Prop. 3.4), which provides a straightforward and explicit certification of fast coupling and hence of the fast mixing of a diffusion.
For distantly dissipative diffusions, the size of our error bounds is driven primarily by a computable distance parameter; in the Langevin setting, an analogous quantity is studied in place of the spectral gap in the contemporaneous work of [5]. Cheng et al. [5] provide integration error bounds for sampling with the overdamped Langevin algorithm under a distant strong convexity assumption (a special case of distant dissipativity). The authors build on the results of [9, 11] and establish error ε in O(ε⁻²) steps. We consider general distantly dissipative diffusions and establish an integration error of ε in O(ε⁻²) steps under mild assumptions on the objective function f and smoothness of the diffusion.

Vollmer et al. [26] used the solution of the Poisson equation in their analysis of stochastic Langevin gradient descent, invoking the bounds of Pardoux and Veretennikov [24, Thms. 1 and 2] to obtain Stein factors. However, Thms. 1 and 2 of [24] yield only inexplicit constants and require bounded diffusion coefficients, a strong assumption violated by the examples treated in Section 5. Chen et al. [4] considered a broader range of diffusions but assumed, without verification, that Stein factors and Markov chain moments were universally bounded by constants independent of all problem parameters. One of our principal contributions is a careful enumeration of the dependencies of these Stein factors and Markov chain moments on the objective f and the candidate diffusion. Our convergence analysis builds on the arguments of [23, 15], and our Stein factor bounds rely on distant and uniform dissipativity conditions for L1-Wasserstein rate decay [11, 15] and the smoothing effect of the Markov semigroup [3, 15]. Our Stein factor results significantly generalize the existing bounds of [15] by accommodating pseudo-Lipschitz objectives f and quadratic growth in the covariance coefficient and by deriving the first four Stein factors explicitly.
2 Optimization with Discretized Diffusions: Preliminaries

Consider a target objective function f : ℝ^d → ℝ. Our goal is to carry out unconstrained minimization of f with the aid of a candidate diffusion defined by the stochastic differential equation (SDE)

    dZ^z_t = b(Z^z_t) dt + σ(Z^z_t) dB_t  with  Z^z_0 = z.   (2.1)

Here, (B_t)_{t≥0} is an l-dimensional Wiener process, and b : ℝ^d → ℝ^d and σ : ℝ^d → ℝ^{d×l} represent the drift and the diffusion coefficients, respectively. The diffusion Z^z_t starts at a point z ∈ ℝ^d and, under the conditions of Section 3, admits a limiting invariant distribution P with (Lebesgue) density p. To encourage sampling near the minima of f, we would like to choose p so that the maximizers of p correspond to minimizers of f. Fortunately, under mild conditions, one can construct a diffusion with target invariant distribution P (see, e.g., [20, 15, Thm. 2]) by selecting the drift coefficient

    b(x) = (1/(2p(x))) ⟨∇, p(x)(a(x) + c(x))⟩,   (2.2)

where a(x) ≜ σ(x)σ(x)⊤ is the covariance coefficient, c(x) = −c(x)⊤ ∈ ℝ^{d×d} is the skew-symmetric stream coefficient, and ⟨∇, m(x)⟩ ≜ Σ_{j,k} ∂_{x_k} m_{jk}(x) e_j denotes the divergence operator with {e_j}_j the standard basis of ℝ^d.

As an illustration, consider the (overdamped) Langevin diffusion for the Gibbs measure with inverse temperature γ > 0 and density

    pγ(x) ∝ exp(−γf(x))   (2.3)

associated with our objective f. Inserting σ(x) = √(2/γ) I and c(x) = 0 into the formula (2.2), we obtain

    b_j(x) = (1/(2pγ(x))) ⟨∇, pγ(x)(a(x) + c(x))⟩_j = (1/(γ pγ(x))) ∂_{x_j} pγ(x) = −∂_j f(x),

which reduces to b = −∇f. We emphasize that the choice of the Gibbs measure is arbitrary, and we will consider other measures that yield superior guarantees for certain minimization problems. In practice, the diffusion (2.1) cannot be simulated in continuous time and is instead approximated by a discrete-time numerical integrator. We will show that a particular discretization, the Euler method, can be used as a global optimization algorithm for various families of convex and non-convex f.
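To make the drift construction (2.2) concrete, the following sketch (an illustrative one-dimensional example with an arbitrary smooth objective, not taken from the paper) numerically checks that inserting a constant covariance a = 2/γ and c = 0 into (2.2) recovers the Langevin drift b = −f′:

```python
import numpy as np

# One-dimensional numerical check that the drift construction (2.2) with
# a(x) = 2/gamma and c = 0 recovers the Langevin drift b = -f'.
gamma = 2.0
f = lambda x: np.log(1.0 + x**2)        # illustrative smooth objective
p = lambda x: np.exp(-gamma * f(x))     # unnormalized Gibbs density (2.3)
a = 2.0 / gamma                         # constant covariance coefficient

def drift(x, h=1e-5):
    # b(x) = (1 / (2 p(x))) * d/dx [ p(x) * a ], via a central finite difference
    return a * (p(x + h) - p(x - h)) / (2.0 * h) / (2.0 * p(x))

def neg_grad_f(x, h=1e-5):
    return -(f(x + h) - f(x - h)) / (2.0 * h)

for x0 in (-2.0, 0.5, 3.0):
    assert abs(drift(x0) - neg_grad_f(x0)) < 1e-4
```

Because a is constant here, (2.2) collapses to b = a p′/(2p) = p′/(γp) = −f′, which is exactly what the finite-difference comparison confirms.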
The Euler method is the most commonly used discretization technique due to its explicit form and simplicity; however, our analysis can be generalized to other numerical integrators as well. For m = 0, 1, ..., the Euler discretization of the SDE (2.1) corresponds to the Markov chain update

    X_{m+1} = X_m + η b(X_m) + √η σ(X_m) W_m,

where η is the step size and W_m ∼ N_d(0, I) is an isotropic Gaussian vector independent of X_m. This update rule defines a Markov chain which typically has an invariant measure different from the invariant measure of the continuous-time diffusion. However, when the step size is sufficiently small, the difference between the two invariant measures becomes small and can be quantitatively characterized [see, e.g., 22]. Our optimization algorithm is simply to evaluate the function f at each Markov chain iterate X_m and report the point with the smallest function value. Denoting by p(f) the expectation of f under the density p, i.e., p(f) = E_{Z∼p}[f(Z)], we decompose the optimization error after M steps of our Markov chain into two components,

    min_{m=1,..,M} E[f(X_m)] − min_x f(x) ≤ [ (1/M) Σ_{m=1}^M E[f(X_m)] − p(f) ] + [ p(f) − min_x f(x) ],   (2.4)

where the first bracketed term is the integration error and the second is the expected suboptimality, and we bound each term on the right-hand side separately. The integration error, which captures both the short-term non-stationarity of the chain and the long-term bias due to discretization, is the subject of Section 3; we develop explicit bounds using techniques that build upon [23, 15]. The expected suboptimality quantifies how well exact samples from p minimize f on average. In Section 4, we extend the Gibbs measure Langevin diffusion bound of Raginsky et al. [25] to more general invariant measures and associated diffusions and demonstrate the benefits of targeting non-Gibbs measures. Notation.
We say a function g is pseudo-Lipschitz continuous of order n if it satisfies

    |g(x) − g(y)| ≤ µ_{1,n}(g)(1 + ‖x‖₂ⁿ + ‖y‖₂ⁿ)‖x − y‖₂  for all x, y ∈ ℝ^d,   (2.5)

where ‖·‖₂ denotes the Euclidean norm and µ_{1,n}(g) is the smallest constant satisfying (2.5). This assumption, which relaxes the more stringent Lipschitz assumption, allows g to exhibit polynomial growth of order n. For example, g(x) = x² is not Lipschitz but satisfies (2.5) with µ_{1,1}(g) ≤ 1. In all of our examples of interest, n ≤ 1. For operator and Frobenius norms ‖·‖_op and ‖·‖_F, we use

    φ₁(g) = sup_{x,y∈ℝ^d, x≠y} ‖g(x) − g(y)‖_F / ‖x − y‖₂,  µ₀(g) = sup_{x∈ℝ^d} ‖g(x)‖_op, and
    µ_i(g) = sup_{x,y∈ℝ^d, x≠y} ‖∇^{i−1}g(x) − ∇^{i−1}g(y)‖_op / ‖x − y‖₂

for the i-th order Lipschitz coefficients of a sufficiently differentiable function g. We denote the degree-n polynomial growth coefficient of the i-th derivative of g by ξ_{i,n}(g) ≜ sup_{x∈ℝ^d} ‖∇^i g(x)‖_op / (1 + ‖x‖₂ⁿ).

3 Explicit Bounds on Integration Error

We develop our explicit bounds on integration error in three steps. In Theorem 3.1, we bound integration error in terms of the polynomial growth and dissipativity of diffusion coefficients (Conditions 1 and 2) and Stein factors, i.e., bounds on the derivatives of solutions to the diffusion's Poisson equation (Condition 3). Condition 3 is a common assumption in the literature but is typically not verified. To address this shortcoming, Theorem 3.2 shows that any smooth, fast-coupling diffusion admits finite Stein factors expressed in terms of diffusion coupling rates (Condition 4). Finally, in Section 3.1, we provide user-friendly tools for explicitly bounding those diffusion coupling rates. We begin with our conditions.

Condition 1 (Polynomial growth of coefficients). For some r ∈ {1, 2} and all x ∈ ℝ^d, the drift and the diffusion coefficients of the diffusion (2.1) satisfy the growth conditions

    ‖b(x)‖₂ ≤ (λ_b/4)(1 + ‖x‖₂),  ‖σ(x)‖_F ≤ (λ_σ/4)(1 + ‖x‖₂),  and  ‖σσ⊤(x)‖_op ≤ (λ_a/4)(1 + ‖x‖₂^r).

The existence and uniqueness of the solution to the diffusion SDE (2.1) is guaranteed under Condition 1 [19, Thm 3.5].
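Reading (2.5) as |g(x) − g(y)| ≤ µ_{1,n}(g)(1 + ‖x‖₂ⁿ + ‖y‖₂ⁿ)‖x − y‖₂, the g(x) = x² example admits a quick numerical sanity check (a sketch over randomly drawn scalar pairs, not part of the paper):

```python
import numpy as np

# Check |x^2 - y^2| <= (1 + |x| + |y|) |x - y| on random pairs, i.e., that
# g(x) = x^2 is pseudo-Lipschitz of order n = 1 with constant at most 1.
rng = np.random.default_rng(1)
xs = rng.normal(scale=5.0, size=1000)
ys = rng.normal(scale=5.0, size=1000)
lhs = np.abs(xs**2 - ys**2)
rhs = (1.0 + np.abs(xs) + np.abs(ys)) * np.abs(xs - ys)
assert np.all(lhs <= rhs + 1e-9)
```

The inequality is in fact exact algebra, |x² − y²| = |x + y||x − y| ≤ (|x| + |y|)|x − y|, so the randomized check always passes.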
The cases r = 1 and r = 2 correspond to linear and quadratic growth of ‖σσ⊤(x)‖_op, and we will explore examples of both r settings in Section 5. As we will see in each result to follow, the quadratic growth case is far more delicate.

Condition 2 (Dissipativity). For α, β > 0, the diffusion (2.1) satisfies the dissipativity condition

    A‖x‖₂² ≤ −α‖x‖₂² + β  for  Ag(x) ≜ ⟨b(x), ∇g(x)⟩ + ½⟨σ(x)σ(x)⊤, ∇²g(x)⟩.   (3.1)

A is the generator of the diffusion with coefficients b and σ, and A‖x‖₂² = 2⟨b(x), x⟩ + ‖σ(x)‖_F². Dissipativity is a standard assumption that ensures that the diffusion does not diverge but rather travels inward when far from the origin [22]. Notably, a linear growth bound on ‖σ(x)‖_F and a quadratic growth bound on ‖σσ⊤(x)‖_op follow directly from the linear growth of ‖b(x)‖₂ and Condition 2. However, in many examples, tighter growth constants can be obtained by inspection.

Our final condition concerns the solution of the Poisson equation (also known as the Stein equation in the Stein's method literature) associated with our candidate diffusion.

Condition 3 (Finite Stein factors). The function u_f solves the Poisson equation with generator (3.1),

    f − p(f) = A u_f,   (3.2)

is pseudo-Lipschitz of order n with constant ζ₁, and has i-th order derivatives with degree-n polynomial growth for i = 2, 3, 4, i.e., ‖∇^i u_f(x)‖_op ≤ ζ_i(1 + ‖x‖₂ⁿ) for i ∈ {2, 3, 4} and all x ∈ ℝ^d. In other words, µ_{1,n}(u_f) = ζ₁ and ξ_{i,n}(u_f) = ζ_i for i = 2, 3, 4 with max_i ζ_i < ∞.

The coefficients ζ_i govern the regularity of the Poisson equation solution u_f and are termed Stein factors in the Stein's method literature. Although variants of Condition 3 have been assumed in previous work [4, 26], we emphasize that this assumption is not easily verified, and frequently only empirical evidence is provided as justification for the assumption [4].
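By contrast, Condition 2 can often be checked mechanically. As a small illustration (a hypothetical non-convex objective, not from the paper), the dissipativity inequality can be verified numerically for a Langevin diffusion using the identity A‖x‖₂² = 2⟨b(x), x⟩ + ‖σ(x)‖_F²:

```python
import numpy as np

# Numerical check of Condition 2 for the Langevin diffusion of the non-convex
# objective f(x) = ||x||^2 / 2 + cos(||x||) in d = 3 with gamma = 1: here
# b = -grad f, sigma = sqrt(2/gamma) I, and we verify
#     A||x||^2 = 2 <b(x), x> + ||sigma||_F^2 <= -||x||^2 + beta.
d, gamma = 3, 1.0
beta = 1.0 + 2.0 * d / gamma

def generator_sq_norm(x):
    r = np.linalg.norm(x)
    grad_f = x - (np.sin(r) / r) * x if r > 0 else np.zeros_like(x)
    sigma_frob_sq = 2.0 * d / gamma          # || sqrt(2/gamma) I ||_F^2
    return 2.0 * np.dot(-grad_f, x) + sigma_frob_sq

rng = np.random.default_rng(4)
for _ in range(1000):
    x = rng.normal(scale=10.0, size=d)
    assert generator_sq_norm(x) <= -np.dot(x, x) + beta + 1e-9
```

Here the check succeeds because 2r sin r ≤ r² + 1 for all r ≥ 0, so α = 1 and β = 1 + 2d/γ witness the condition for this toy f.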
We will ultimately derive explicit expressions for the Stein factors ζ_i for a wide variety of diffusions and functions f, but first we will use the Stein factors to bound the integration error of our discretized diffusion.

Theorem 3.1 (Integration error of discretized diffusions). Let Conditions 1 to 3 hold for some r ∈ {1, 2}. For any even integer¹ n_e ≥ n + 4 and a step size satisfying η < 1/(2(n_e − 1)!!(1 + λ_b/2 + λ_σ/2)^{n_e}),

    |(1/M) Σ_{m=1}^M E[f(X_m)] − p(f)| ≤ ( c₁/(Mη) + c₂ η + c₃ η^{1+|1−n/2|} ) ( ν_r(n_e) + E[‖X₀‖₂^{n_e}] ),

where c₁ = 6ζ₁, where c₂ and c₃ are explicit polynomials in the Stein factors ζ_{1:4}, the growth parameters λ_b, λ_σ, λ_a, and 3ⁿ, given in Appendix A, and where ν_r(n) = (2 + 2β/α̃_r)^{n/2} with α̃₁ = α and α̃₂ = [α − n_e λ_a/4]₊.

This integration error bound, proved in Appendix A, is O(1/(Mη) + η) since the higher-order term c₃η^{1+|1−n/2|} can be combined with the dominant term c₂η, yielding (c₂ + c₃)η as η < 1. We observe that one needs O(ε⁻²) steps to reach a tolerance of ε. Theorem 3.1 seemingly makes no assumptions on the objective function f, but in fact the dependence on f is present in the growth parameters, the Stein factors, and the polynomial degree of the Poisson equation solution. For example, we will show in Theorem 3.2 that the polynomial degree is upper bounded by that of the objective function f. To characterize the function classes covered by Theorem 3.1, we next turn to dissecting the Stein factors.

While verifying Conditions 1 and 2 for a given diffusion is often straightforward, it is not immediately clear how one might verify Condition 3. As our second principal contribution, we derive explicit values for the Stein factors ζ_i for any smooth and dissipative diffusion exhibiting fast L1-Wasserstein decay:

Condition 4 (Wasserstein rate). The diffusion Z^x_t has L^p-Wasserstein rate ρ_p : ℝ≥0 → ℝ if

    inf_couplings E[‖Z^x_t − Z^y_t‖₂^p]^{1/p} ≤ ρ_p(t)‖x − y‖₂  for all x, y ∈ ℝ^d and t ≥ 0,

where the infimum is taken over all couplings between Z^x_t and Z^y_t. We further define the relative rates ρ̄₁(t) = log(ρ₂(t)/ρ₁(t)) and ρ̄₂(t) = log(ρ₁(t)/[ρ₁(0)ρ₂(t)])/log(ρ₁(t)/ρ₁(0)).

Theorem 3.2 (Finite Stein factors from Wasserstein decay).
Assume that Conditions 1, 2 and 4 hold and that f is pseudo-Lipschitz continuous of order n with, for i = 2, 3, 4, at most degree-n polynomial growth of its i-th order derivatives. Then, Condition 3 is satisfied with Stein factors

    ζ_i = δ_i ∫₀^∞ ρ₁(t) ω_r(t) dt  for i = 1, 2, 3, 4,

where

    ω_r(t) = 1 + 4ρ₁(t)^{1−1/r} ρ₁(0)^{1/2} (1 + (2n/α̃_r) {[1 ∨ ρ̄_r(t)]² λ_a n + 3rβ})ⁿ

with α̃₁ = α and α̃₂ = inf_{t≥0} [α − nλ_a(1 ∨ ρ̄₂(t))]₊, and where either

    δ₁ = µ_{1,n}(f)  and  δ_i = µ_{1,n}(f) ξ_{2:i,n}(f) Λ_{1:i}(b) Λ_{1:i}(σ) ν_r(6n)  for i = 2, 3, 4,  or
    δ₁ = µ_{1,n}(f)  and  δ_i = µ_{1,n}(f) Λ_{1:i}(b) Λ_{1:i}(σ) Λ_{0:i−2}(σ⁻¹) ρ₁(0) ω_r(1) ν_r(6n)^{i−1}  for i = 2, 3, 4,

depending on the smoothness setting detailed in the proof. Here ν_r(n) is as in Theorem 3.1, ξ_{a:b,n}(f) = max_{i=a,..,b} ξ_{i,n}(f), and Λ_{a:b}(g) is a constant, given explicitly in the proof, depending only on the order-a through order-b derivatives of g.

¹In a typical example where f is bounded by a quadratic polynomial, we have n = 1 and n_e = 6. We also remind the reader that the double factorial (n_e − 1)!! = 1·3·5⋯(n_e − 1) is of order √(n_e!).

The proof of Theorem 3.2 is given in Appendix B and relies on the explicit transition semigroup derivative bounds of [12]. We emphasize that, to provide finite Stein factors, Theorem 3.2 only requires L1-Wasserstein decay and allows the L2-Wasserstein rate to grow. An integrable Wasserstein rate is an indication that a diffusion mixes quickly to its stationary distribution. Hence, Theorem 3.2 suggests that, for a given f, one should select a diffusion that mixes quickly to a stationary measure that, like the Gibbs measure (2.3), has modes at the minimizers of f. We explore user-friendly conditions implying fast Wasserstein decay in Section 3.1 and detailed examples deploying these tools in Section 5. Crucially for the heavy-tailed examples given in Section 5, Theorem 3.2 allows for an unbounded diffusion coefficient σ, unlike the classic results of [24].

3.1 Sufficient conditions for Wasserstein decay

A simple condition that leads to exponential L1 and L2-Wasserstein decay is uniform dissipativity (3.3). The next result from [27] (see also [2, Sec. 1], [15, Thm.
10]) makes the relationship precise.

Proposition 3.3 (Wasserstein decay from uniform dissipativity [27, Thm. 2.5]). A diffusion with drift and diffusion coefficients b and σ has Wasserstein rate ρ_p(t) = e^{−kt/2} if, for all x, y ∈ ℝ^d,

    2⟨b(x) − b(y), x − y⟩ + ‖σ(x) − σ(y)‖_F² + (p − 2)‖(σ(x) − σ(y))⊤(x − y)‖₂²/‖x − y‖₂² ≤ −k‖x − y‖₂².   (3.3)

In the Gibbs measure Langevin case, where b = −∇f and σ = √(2/γ) I, uniform dissipativity is equivalent to the strong convexity of f. As we will see in Section 5, the extra degree of freedom in the diffusion coefficient σ will allow us to treat non-convex and non-strongly convex functions f. A more general condition leading to exponential L1-Wasserstein decay is the distant dissipativity condition (3.4). The following result of [15] builds upon the pioneering analyses of Eberle [11, Cor. 2] and Wang [27, Thm. 2.6] to provide explicit Wasserstein decay.

Proposition 3.4 (Wasserstein decay from distant dissipativity [15, Cor. 4.2]). A diffusion with drift and diffusion coefficients b and σ satisfying, for σ̃(x) ≜ (σ(x)σ(x)⊤ − s²I)^{1/2},

    (2/s²)( ⟨b(x) − b(y), x − y⟩ + ½‖σ̃(x) − σ̃(y)‖_F² − ‖(σ̃(x) − σ̃(y))⊤(x − y)‖₂²/‖x − y‖₂² )
        ≤ { −K‖x − y‖₂²  if ‖x − y‖₂ > R;  L‖x − y‖₂²  if ‖x − y‖₂ ≤ R }   (3.4)

for R, L ≥ 0, K > 0, and s ∈ (0, 1/µ₀(σ⁻¹)) has Wasserstein rate ρ₁(t) = 2e^{LR²/8}e^{−kt/2} for

    k⁻¹ = { 8K⁻¹R + 4K⁻¹  if LR² ≤ 8;
            8√2 R⁻¹L^{−1/2}(L⁻¹ + K⁻¹) exp(LR²/8) + 32R⁻²K⁻²  if LR² > 8. }

Conveniently, both uniform and distant dissipativity imply our dissipativity condition, Condition 2. The Prop. 3.4 rates feature the distance-dependent parameter e^{LR²/8}. In the pre-conditioned Langevin Gibbs setting (b = −½a∇f and σ constant) when f is the negative log likelihood of a multimodal Gaussian mixture, R in (3.4) represents the maximum distance between modes [15]. When R is relatively small, the convergence of the diffusion towards its stationary distribution is rapid, and the non-uniformity parameter is small; when R is relatively large, the parameter grows exponentially in R², as would be expected due to infrequent diffusion transitions between modes.
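As a toy simulation of Prop. 3.3 (not from the paper): for the Langevin diffusion of f(x) = ‖x‖₂²/2, the drift is b(x) = −x with constant σ, so (3.3) holds with k = 2 for every p, predicting the Wasserstein rate ρ_p(t) = e^{−t}. Sharing the noise between two Euler chains, a synchronous coupling, contracts their distance at essentially this rate:

```python
import numpy as np

# Synchronous-coupling simulation: two Euler chains of dZ = -Z dt + sqrt(2) dB
# driven by the SAME noise. Uniform dissipativity (3.3) with k = 2 predicts the
# pairwise distance shrinks like e^{-kt/2} = e^{-t}.
rng = np.random.default_rng(2)
eta, T = 1e-3, 2.0
x, y = np.array([4.0, -3.0]), np.array([-1.0, 2.0])
d0 = np.linalg.norm(x - y)
for _ in range(int(T / eta)):
    w = rng.standard_normal(2)                    # shared noise
    x = x - eta * x + np.sqrt(2.0 * eta) * w
    y = y - eta * y + np.sqrt(2.0 * eta) * w
ratio = np.linalg.norm(x - y) / d0
assert abs(ratio - np.exp(-T)) < 0.01             # close to e^{-T}
```

With shared noise the stochastic terms cancel in the difference x − y, so the contraction (1 − η)^{T/η} ≈ e^{−T} is visible even in a single run.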
Our next result, proved in Appendix D, provides a user-friendly set of sufficient conditions for verifying distant dissipativity and hence exponential Wasserstein decay in practice.

Proposition 3.5 (User-friendly Wasserstein decay). Fix any diffusion and skew-symmetric stream coefficients σ and c satisfying

    L′ ≜ φ₁(σ̃)² + sup_x λ_max(∇⟨∇, m(x)⟩) < ∞

for m(x) ≜ σ(x)σ(x)⊤ + c(x), σ̃(x) ≜ (σ(x)σ(x)⊤ − s₀²I)^{1/2}, and s₀ ∈ (0, 1/µ₀(σ⁻¹)). If

    ⟨m(x)∇f(x) − m(y)∇f(y), x − y⟩ ≥ { K_m‖x − y‖₂²  if ‖x − y‖₂ > R_m;  −L_m‖x − y‖₂²  if ‖x − y‖₂ ≤ R_m }   (3.5)

holds for R_m, L_m ≥ 0 and K_m > 0, then, for any inverse temperature γ > L′/K_m, the diffusion with drift and diffusion coefficients bγ = −½m∇f + (1/(2γ))⟨∇, m⟩ and σγ = (1/√γ)σ has stationary density pγ(x) ∝ e^{−γf(x)} and satisfies (3.4) with s = s₀/√γ, K = γK_m − L′, L = γL_m + L′, and R = R_m.

4 Explicit Bounds on Optimization Error

To convert our integration error bounds into bounds on optimization error, we now turn our attention to bounding the expected suboptimality term of (2.4). To characterize the expected suboptimality of sampling from a measure with modes matching the minima of f, we generalize a result due to Raginsky et al. [25]. The original result [25, Prop. 3.4] was designed to analyze the Gibbs measure (2.3) and demanded that log pγ be smooth, in the sense that µ₂(log pγ) < ∞. Our next proposition, proved in Appendix C, is designed for more general measures p and importantly relaxes the smoothness requirements on log p.

Proposition 4.1 (Expected suboptimality: Sampling yields near-optima). Suppose p is the stationary density of an (α, β)-dissipative diffusion (Condition 2) with global maximizer x∗. If p takes the generalized Gibbs form pγ,ℓ(x) ∝ exp(−γ(f(x) − f(x∗))^ℓ) for γ > 0, ℓ ∈ (0, 1], and ∇f(x∗) = 0, we have

    pγ,ℓ(f) − f(x∗) ≤ ( (d/(2γℓ)) (log(2γℓ/d) + log(eβµ₂(f)/2)) )^{1/ℓ}.   (4.1)

More generally, if log p(x∗) − log p(x) ≤ C‖x − x∗‖₂^{2ℓ} for some C > 0 and ℓ ∈ (0, 1] and all x, then log p(x∗) − p(log p) admits an analogous bound of the form (d/(2ℓ)) log(2C·), given explicitly in Appendix C. When ℓ = 1, pγ,ℓ is the Gibbs measure, and the bound (4.1) exactly recovers [25, Prop. 3.4].
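For intuition about the temperature dependence (an illustrative Monte Carlo check, not part of the paper's analysis): when f is the quadratic form f(x) = ⟨x − b, A(x − b)⟩ and ℓ = 1, the Gibbs measure pγ ∝ exp(−γf) is Gaussian with covariance (2γA)⁻¹, so the expected suboptimality equals d/(2γ) exactly and shrinks as γ grows:

```python
import numpy as np

# Monte Carlo check that sampling from the Gibbs measure of a quadratic
# f(x) = <x - b, A (x - b)> yields expected suboptimality d / (2 gamma).
rng = np.random.default_rng(3)
d, gamma = 4, 50.0
A = np.diag([1.0, 2.0, 3.0, 4.0])
b = np.ones(d)
cov = np.linalg.inv(2.0 * gamma * A)        # Gibbs covariance (2 gamma A)^{-1}
xs = rng.multivariate_normal(b, cov, size=100_000)
fvals = np.einsum('ni,ij,nj->n', xs - b, A, xs - b)
gap = fvals.mean()                          # estimate of E[f] - f(x*)
assert abs(gap - d / (2.0 * gamma)) < 0.005
```

The exact value follows from E[(x − b)⊤A(x − b)] = tr(A(2γA)⁻¹) = d/(2γ); for d = 4 and γ = 50 the gap is 0.04.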
The generalized Gibbs measures with ℓ < 1 allow for improved dependence on the inverse temperature when γ ≥ d/(2ℓ). Note however that, for ℓ < 1, the distributions pγ,ℓ also require knowledge of the optimal value f(x∗). In certain practical settings, such as neural network optimization, it is common to have f(x∗) = 0. When f(x∗) is unknown, a similar analysis can be carried out by replacing f(x∗) with an estimate, and the bound (4.1) still holds up to a controllable error factor. By combining Prop. 4.1 with Theorem 3.1, we obtain a complete bound controlling the global optimization error of the best Markov chain iterate.

Corollary 4.2 (Optimization error of discretized diffusions). Instantiate the assumptions and notation of Theorem 3.1 and Prop. 4.1. If the diffusion has the generalized Gibbs stationary density pγ,ℓ(x) ∝ exp(−γ(f(x) − f(x∗))^ℓ), then

    min_{m=1,..,M} E[f(X_m)] − f(x∗) ≤ ( c₁/(Mη) + (c₂ + c₃)η ) ( ν_r(n_e) + E[‖X₀‖₂^{n_e}] ) + ( (d/(2γℓ)) (log(2γℓ/d) + log(eβµ₂(f)/2)) )^{1/ℓ}.   (4.3)

Finally, we demonstrate that, for quadratic functions, the generalized Gibbs expected suboptimality bound (4.1) can be further refined to remove the log(γ/d)^{1/ℓ} dependence.

Proposition 4.3 (Expected suboptimality: Quadratic f). Let f(x) = ⟨x − b, A(x − b)⟩ for a positive semidefinite A ∈ ℝ^{d×d} and b ∈ ℝ^d. Then for pγ,ℓ(x) ∝ exp(−γ(f(x) − f(x∗))^ℓ) with ℓ > 0, and for each positive integer k, we have

    pγ,1/k(f) − f(x∗) ≤ Γ(kd/2 + k) / (Γ(kd/2) γ^k).   (4.4)

The bound (4.4) applies to any f with level set (i.e., {x : f(x) = ε}) volume proportional to ε^{(d−1)/2}.

5 Applications to Non-convex Optimization

We next provide detailed examples of verifying that a given diffusion is appropriate for optimizing a given objective, using either uniform dissipativity (Prop. 3.3) or our user-friendly distant dissipativity conditions (Prop. 3.5). When the Gibbs measure Langevin diffusion is used, our results yield global optimization when f is strongly convex (condition (3.3) with b = −∇f and σ = √(2/γ) I) or has strongly convex tails (condition (3.5) with m ≡ I).
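Before the detailed examples, here is a minimal end-to-end sketch of the resulting algorithm in the simplest covered case, a strongly convex f optimized with the Euler-discretized Langevin diffusion (toy parameter choices, not the paper's experiments):

```python
import numpy as np

def euler_diffusion_minimize(f, b, sigma, x0, eta, n_steps, seed=0):
    """Run the Euler chain X_{m+1} = X_m + eta b(X_m) + sqrt(eta) sigma(X_m) W_m
    and report the iterate with the smallest objective value."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    best_x, best_f = x.copy(), f(x)
    for _ in range(n_steps):
        w = rng.standard_normal(x.shape)
        x = x + eta * b(x) + np.sqrt(eta) * sigma(x) @ w
        if f(x) < best_f:
            best_x, best_f = x.copy(), f(x)
    return best_x, best_f

# Toy usage: Langevin coefficients for the strongly convex f(x) = ||x||^2 / 2,
# i.e., b = -grad f and sigma = sqrt(2/gamma) I.
gamma = 10.0
f = lambda x: 0.5 * float(x @ x)
b = lambda x: -x
sigma = lambda x: np.sqrt(2.0 / gamma) * np.eye(x.size)
best_x, best_f = euler_diffusion_minimize(f, b, sigma, [5.0, 5.0], eta=0.05, n_steps=2000)
```

Reporting the best iterate rather than the last one is exactly the optimization scheme analyzed via the decomposition (2.4); here the chain quickly reaches function values near the stationary level d/(2γ).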
To highlight the value of non-constant diffusion coefficients, we will focus on heavy-tailed examples that are not covered by the Langevin theory.

[Figure 1 appears here.]

Figure 1: The left plot shows the landscape of the non-convex, sublinear-growth function f(x) = c log(1 + ‖x‖₂²/2). The middle and right plots compare the optimization error of gradient descent, the Langevin algorithm, and the discretized diffusion designed in Section 5.1.

5.1 A simple example with sublinear growth

We begin with a pedagogical example of selecting an appropriate diffusion and verifying our global optimization conditions. Fix c > (d + 3)/2 and consider f(x) = c log(1 + ‖x‖₂²/2), a simple non-convex objective which exhibits sublinear growth in ‖x‖₂ and hence does not satisfy dissipativity (Condition 2) when paired with the Gibbs measure Langevin diffusion (b = −∇f, σ = √(2/γ) I). To target the Gibbs measure (2.3) with inverse temperature γ ≥ 1, we choose the diffusion with coefficients bγ(x) = −½a(x)∇f(x) + (1/(2γ))⟨∇, a(x)⟩ and σγ(x) = (1/√γ)σ(x) for σ(x) ≜ (1 + ‖x‖₂²/2)^{1/2} I and a(x) = σ(x)σ(x)⊤. This choice satisfies Condition 1 with λ_b = O(1), λ_σ = O(1/√γ), and λ_a = O(1/γ) with respect to γ and Condition 2 with α = c − (d + 3)/(2γ) and β = d/γ. In fact, this diffusion satisfies uniform dissipativity,

    2⟨bγ(x) − bγ(y), x − y⟩ + ‖σγ(x) − σγ(y)‖_F² ≤ −k‖x − y‖₂²  for an explicit k > 0,

yielding L1 and L2-Wasserstein rates ρ₁(t) = ρ₂(t) = e^{−kt/2} by Prop. 3.3 and the relative rate ρ̄₂(t) = 0. Hence, the i-th Stein factor in Theorem 3.2 is finite for i = 1, 2, 3, 4.
This implies that the coefficients c_i in Corollary 4.2 are controlled by explicit functions of γ, and the final optimization error bound (4.3) can be made of order ε by choosing the inverse temperature γ, the step size η, and the number of iterations M appropriately as functions of ε.

Figure 1 illustrates the value of this designed diffusion over gradient descent and the standard Langevin algorithm. Here, d = 2, c = 10, the inverse temperature γ = 1, the step size η = 0.1, and each algorithm is run from the initial point (90, 110). We observe that the Langevin algorithm diverges, and gradient descent requires thousands of iterations to converge, while the designed diffusion converges to the region of interest after 15 iterations.

5.2 Non-convex learning with linear growth

Next consider the canonical learning problem of regularized loss minimization with f(x) = L(x) + R(x) for L(x) ≜ (1/L) Σ_{l=1}^L ℓ_l(⟨x, v_l⟩), ℓ_l a datapoint-specific loss function, v_l ∈ ℝ^d the l-th datapoint covariate vector, and R(x) = ψ(½‖x‖₂²) a regularizer with ψ concave and satisfying, for g_s(z) ≜ ψ′(0)/ψ′(½z) − s and some δ₁, δ₂, δ₃ > 0, derivative bounds of the form g₁(z) ≤ δ₁z, 2g′₁(z) ≤ √(δ₁δ₂), and 2zg″₁(z) ≤ δ₃ for all z ≥ 0 and s ∈ [0, 1].

Our aim is to select diffusion and stream coefficients that satisfy the Wasserstein decay preconditions of Prop. 3.5. To achieve this, we set c ≡ 0 and choose σ with µ₀(σ⁻¹) < ∞ so that the regularization component of the drift is one-sided Lipschitz, i.e.,

    ⟨a(x)∇R(x) − a(y)∇R(y), x − y⟩ ≥ K_a‖x − y‖₂²  for some K_a > 0.   (5.1)

We then show that L′ from Prop. 3.5 is bounded and that, for suitable loss choices, a(x)∇L(x) is bounded and Lipschitz so that (3.5) holds with K_m = K_a/2 and L_m, R_m sufficiently large. Fix any x, let r = ‖x‖₂, and define

    σ^(s)(x) = √(1 − s)(I − xx⊤/r²) + (xx⊤/r²)√(g_s(r²))  for all s ∈ [0, 1].

We choose σ = σ^(0) so that a(x)∇R(x) = ψ′(0)x and (5.1) holds with K_a = ψ′(0). Our constraints on ψ ensure that a(x) = I + (xx⊤/r²)g₁(r²) is positive definite, that µ₀(σ⁻¹) ≤ 1, and that σ and a have at most linear and quadratic growth respectively, in satisfaction of Condition 1.
Moreover,

    ∇⟨∇, a(x)⟩ = I((d − 1)g₁(r²)/r² + 2g′₁(r²)) + 2(xx⊤/r²)((d − 1)(g′₁(r²) − g₁(r²)/r²) + 2r²g″₁(r²)),

and

    λ_max(∇⟨∇, a(x)⟩) = max( (d − 1)g₁(r²)/r² + 2g′₁(r²),  2dg′₁(r²) − (d − 1)g₁(r²)/r² + 4r²g″₁(r²) ),

so that λ_max(∇⟨∇, a(x)⟩) ≤ max((d − 1)δ₁ + √(δ₁δ₂), d√(δ₁δ₂) + 2δ₃). For any s₀ ∈ (0, 1), the directional derivative ∇σ^(s₀)(x)[v] admits a closed form for each v ∈ ℝ^d, and our constraints on g_{s₀} yield φ₁(σ̃) ≤ d√δ₁ + √δ₂ for σ̃ = σ^(s₀). Finally, to satisfy (3.5), it suffices to verify that a(x)∇L(x) is bounded and Lipschitz.

For example, in the case of a ridge regularizer, R(x) = (λ/2)‖x‖₂² for λ > 0, the coefficient a(x) = I, and it suffices to check that L is Lipschitz with a Lipschitz gradient. This strongly convex regularizer satisfies our assumptions, but strong convexity is by no means necessary. Consider instead the pseudo-Huber function, R(x) = λ(√(1 + ‖x‖₂²) − 1), popularized in computer vision [17]. This convex but non-strongly convex regularizer satisfies all of our criteria and yields a diffusion with a(x) = I + (xx⊤/‖x‖₂²)(√(1 + ‖x‖₂²) − 1). Moreover, since ∇L(x) = (1/L) Σ_l ℓ′_l(⟨x, v_l⟩)v_l and ∇²L(x) = (1/L) Σ_l ℓ″_l(⟨x, v_l⟩)v_lv_l⊤, a(x)∇L(x) is bounded and Lipschitz whenever |ℓ′_l(r)| ≤ δ₄/(1 + |r|) and |ℓ″_l(r)| ≤ δ₅/(1 + |r|) for some δ₄, δ₅ > 0. Hence, Prop. 3.5 guarantees exponential Wasserstein decay for a variety of non-convex L based on datapoint outcomes y_l, including the sigmoid (ℓ_l(r) = tanh((r − y_l)²) for y_l ∈ ℝ or ℓ_l(r) = 1 − tanh(y_l r) for y_l ∈ {±1}) [1], the Student's t negative log likelihood (ℓ_l(r) = log(1 + (r − y_l)²)), and the Blake-Zisserman (ℓ_l(r) = −log(e^{−(r − y_l)²} + ε) for ε > 0) [17]. The reader can verify that all of these examples also satisfy the remaining global optimization pre-conditions of Corollary 4.2 and Theorem 3.2. In contrast, these linear-growth examples do not satisfy dissipativity (Condition 2) when paired with the Gibbs measure Langevin diffusion.

6 Conclusion

In this paper, we showed that the Euler discretization of any smooth and dissipative diffusion can be used for global non-convex optimization.
We established non-asymptotic bounds on global optimization error and integration error, with convergence governed by Stein factors obtained from the solution of the Poisson equation. We further provided explicit bounds on these Stein factors for large classes of convex and non-convex objective functions, based on computable properties of the objective and the diffusion. Using this flexibility, we designed suitable diffusions for optimizing non-convex functions not covered by the existing Langevin theory. We also demonstrated that targeting distributions other than the Gibbs measure can give rise to improved optimization guarantees.

References

[1] P. L. Bartlett, M. I. Jordan, and J. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101:138–156, 2006.
[2] P. Cattiaux and A. Guillin. Semi log-concave Markov diffusions. In Séminaire de Probabilités XLVI, volume 2123 of Lecture Notes in Math., pages 231–292. Springer, Cham, 2014. doi: 10.1007/978-3-319-11970-0_9.
[3] S. Cerrai. Second Order PDE's in Finite and Infinite Dimension: A Probabilistic Approach, volume 1762. Springer Science & Business Media, 2001.
[4] C. Chen, N. Ding, and L. Carin. On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In Advances in Neural Information Processing Systems, pages 2278–2286, 2015.
[5] X. Cheng, N. S. Chatterji, Y. Abbasi-Yadkori, P. L. Bartlett, and M. I. Jordan. Sharp convergence rates for Langevin dynamics in the nonconvex setting. arXiv preprint arXiv:1805.01648, 2018.
[6] A. S. Dalalyan. Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. arXiv preprint arXiv:1704.04752, 2017.
[7] A. S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017.
[8] A. S. Dalalyan and A. Tsybakov.
Sparse regression learning by aggregation and Langevin Monte Carlo. Journal of Computer and System Sciences, 78:1423–1443, 2012.
[9] A. Durmus and E. Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. The Annals of Applied Probability, 27(3):1551–1587, 2017.
[10] R. Dwivedi, Y. Chen, M. J. Wainwright, and B. Yu. Log-concave sampling: Metropolis-Hastings algorithms are fast! arXiv preprint arXiv:1801.02309, 2018.
[11] A. Eberle. Reflection couplings and contraction rates for diffusions. Probability Theory and Related Fields, 166(3-4):851–886, 2016.
[12] M. A. Erdogdu, L. Mackey, and O. Shamir. Multivariate Stein factors from Wasserstein decay. In preparation, 2019.
[13] X. Gao, M. Gürbüzbalaban, and L. Zhu. Global convergence of stochastic gradient Hamiltonian Monte Carlo for non-convex stochastic optimization: Non-asymptotic performance bounds and momentum-based acceleration. arXiv preprint arXiv:1809.04618, 2018.
[14] S. B. Gelfand and S. K. Mitter. Recursive stochastic algorithms for global optimization in R^d. SIAM Journal on Control and Optimization, 29(5):999–1018, 1991.
[15] J. Gorham, A. B. Duncan, S. J. Vollmer, and L. Mackey. Measuring sample quality with diffusions. arXiv preprint arXiv:1611.06972, 2016.
[16] I. S. Gradshteyn and I. M. Ryzhik. Table of Integrals, Series, and Products. Academic Press, 2014.
[17] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, second edition, 2004.
[18] G. J. O. Jameson. Inequalities for gamma function ratios. The American Mathematical Monthly, 120(10):936–940, 2013. URL http://www.jstor.org/stable/10.4169/amer.math.monthly.120.10.936.
[19] R. Khasminskii. Stochastic Stability of Differential Equations, volume 66. Springer Science & Business Media, 2011.
[20] Y.-A. Ma, T. Chen, and E. Fox. A complete recipe for stochastic gradient MCMC.
In Advances in Neural Information Processing Systems, pages 2917–2925, 2015.
[21] A. Mathai and S. Provost. Quadratic Forms in Random Variables: Theory and Applications. 1992.
[22] J. C. Mattingly, A. M. Stuart, and D. J. Higham. Ergodicity for SDEs and approximations: locally Lipschitz vector fields and degenerate noise. Stochastic Processes and their Applications, 101(2):185–232, 2002.
[23] J. C. Mattingly, A. M. Stuart, and M. V. Tretyakov. Convergence of numerical time-averaging and stationary measures via Poisson equations. SIAM Journal on Numerical Analysis, 48(2):552–577, 2010.
[24] E. Pardoux and A. Veretennikov. On the Poisson equation and diffusion approximation. I. Annals of Probability, pages 1061–1085, 2001.
[25] M. Raginsky, A. Rakhlin, and M. Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. arXiv preprint arXiv:1702.03849, 2017.
[26] S. J. Vollmer, K. C. Zygalakis, and Y. W. Teh. Exploration of the (non-)asymptotic bias and variance of stochastic gradient Langevin dynamics. Journal of Machine Learning Research, 17:1–48, 2016.
[27] F. Wang. Exponential contraction in Wasserstein distances for diffusion semigroups with negative curvature. arXiv preprint arXiv:1608.04471, 2016.
[28] M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.
[29] P. Xu, J. Chen, D. Zou, and Q. Gu. Global convergence of Langevin dynamics based algorithms for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 3126–3137, 2018.