Published in Transactions on Machine Learning Research (09/2025)

Global Optimization Algorithm through High-Resolution Sampling

Daniel Cortild (d.cortild@rug.nl), University of Groningen, Netherlands, and Laboratoire de Finance des Marchés de l'Energie, Dauphine, CREST, EDF R&D, France
Claire Delplancke (claire.delplancke@edf.fr), EDF R&D, Palaiseau, France
Nadia Oudjane (nadia.oudjane@edf.fr), Laboratoire de Finance des Marchés de l'Energie, Dauphine, CREST, EDF R&D, France
Juan Peypouquet (j.g.peypouquet@rug.nl), University of Groningen, Netherlands

Reviewed on OpenReview: https://openreview.net/forum?id=r3VEA1AWY5

Abstract

We present an optimization algorithm that can identify a global minimum of a potentially nonconvex smooth function with high probability, assuming the Gibbs measure of the potential satisfies a logarithmic Sobolev inequality. Our contribution is twofold. On the one hand, we propose a global optimization method built on an oracle sampling algorithm that produces arbitrarily accurate samples from a given Gibbs measure. On the other hand, we propose a new sampling algorithm, drawing inspiration from both overdamped and underdamped Langevin dynamics, as well as from the high-resolution differential equation known for its acceleration in deterministic settings. While the focus of the paper is primarily theoretical, we demonstrate the effectiveness of our algorithms on the Rastrigin function, where they outperform recent approaches.

1 Introduction

Smooth nonconvex optimization remains a challenge with broad applications in machine learning and statistical inference. Despite significant advances, convex optimization techniques often lead to suboptimal or computationally infeasible solutions in many inherently nonconvex real-world problems.
In this paper, we focus on the unconstrained global minimization problem: given a smooth nonconvex potential U : ℝᵈ → ℝ, we search for a point x⋆ such that x⋆ ∈ argmin_{x∈ℝᵈ} U(x), assuming such a point exists. A recent trend in optimization consists of studying continuous-time versions of algorithms to obtain better estimates for their discrete-time counterparts. Notable progress has been made in accelerating the convergence of first-order optimization methods by analyzing second-order dynamical systems in their continuous-time formulation. Specifically, we are interested in the high-resolution differential equation, given by

ẍ(t) + α ẋ(t) + β ∇²U(x(t)) ẋ(t) + γ ∇U(x(t)) = 0, (1)

where α, β, γ > 0 could in principle depend on time, but are assumed constant, as indicated by the notation. The equation was originally introduced in Alvarez et al. (2002) and has since been further explored in Attouch et al. (2016), to mitigate oscillations. The algorithmic consequences of this, including the connection with Nesterov's method, were investigated independently in Attouch et al. (2022) and Shi et al. (2022). The theoretical convergence rates obtained for the resulting algorithmic framework were comparable to those established for Nesterov's method in the convex case. However, in the strongly convex case, the theoretical guarantees were more conservative. Nevertheless, this motivated intense research within the convex optimization community, and the high-resolution differential equation seems to be displacing the classical differential equation studied in Su et al. (2016) and Attouch et al. (2018), which corresponds to the overdamped Langevin system, as the preferred continuous-time model for Nesterov's acceleration.
Amongst other reasons, the high-resolution model extends better to the nonsmooth setting, and it captures the linear convergence rates of FISTA and the Optimized Gradient Method in the strongly convex case, while the overdamped one does not. These characteristics, coupled with the more stable trajectories it enables and the potential for integrating other techniques, such as restart schemes (Su et al., 2016), underscore the high-resolution differential equation's significance as a promising area of interest in convex optimization. To the best of our knowledge, only convergence to critical points has been shown for Equation (1) in nonconvex landscapes (Alvarez et al., 2002). By setting y(t) = ẋ(t) + β ∇U(x(t)) and renaming the parameter γ, we can rewrite Equation (1) as the first-order system

ẋ(t) = −β ∇U(x(t)) + y(t),
ẏ(t) = −γ ∇U(x(t)) − α y(t). (2)

This reformulation has the benefit of not requiring the Hessian of U, making it more user-friendly, whilst still preserving the favourable convergence results. However, deterministic models like System (2) may struggle when the potential U is nonconvex, as they can become trapped in local minima and fail to identify the global minimizer. To overcome this limitation, it has been proposed to add stochasticity to the dynamics to enable them to escape local minima. This stochasticity can come in the form of random perturbations, encouraging the dynamics to navigate through more complex, potentially nonconvex, landscapes. This perspective naturally leads us to consider stochastic differential equations, specifically the Langevin dynamics (Langevin, 1908), which combine the advantages of gradient flows with stochastic elements. Rather than the iterates of these dynamics, we study their law, and hope for it to concentrate around the global minimizers of the potential. This pushes us towards sampling problems, where we aim to produce samples from a given distribution µ ∝ exp(−U).
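To make System (2) concrete, the following sketch discretizes it with an explicit Euler scheme; the function name, step-size, and the convex quadratic test potential are ours, chosen purely for illustration.

```python
import numpy as np

def high_res_flow(grad_U, x0, alpha=1.0, beta=1.0, gamma=1.0, h=0.01, steps=5000):
    """Explicit Euler discretization of System (2):
    x' = -beta * grad U(x) + y,   y' = -gamma * grad U(x) - alpha * y."""
    x = np.asarray(x0, dtype=float)
    y = np.zeros_like(x)
    for _ in range(steps):
        g = grad_U(x)
        x, y = x + h * (-beta * g + y), y + h * (-gamma * g - alpha * y)
    return x

# On the convex quadratic U(x) = ||x||^2 / 2 (so grad U(x) = x),
# the iterates converge to the unique minimizer at the origin.
x_final = high_res_flow(lambda x: x, x0=[2.0, -1.5])
```

On a nonconvex potential the same scheme may of course stall at a local minimum, which is precisely the limitation motivating the stochastic perturbations discussed next.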
Many problems in statistics require sampling from probability distributions. Sampling through the Langevin dynamics is a well-studied approach when the target distribution is strongly log-concave (or equivalently, when the potential is strongly convex) (see Durmus and Moulines (2016); Dalalyan and Karagulyan (2019); Cheng and Bartlett (2018); Dalalyan and Riou-Durand (2020) and the references therein), and has recently also been studied in the non log-concave setting, when it, for instance, verifies a log-Sobolev inequality (Vempala and Wibisono, 2019; Ma et al., 2021), a Poincaré inequality (Chewi et al., 2022), a weak Poincaré inequality (Mousavi-Hosseini et al., 2023), or even in the fully nonconvex setting (Balasubramanian et al., 2022). The simplest variant of the Langevin dynamics is the overdamped Langevin dynamics, governed by

dX_t = −γ ∇U(X_t) dt + √(2γ) dB_t,

where (B_t) is a standard Brownian motion, γ > 0 is a free parameter and U is the negative logarithm of the distribution we wish to sample from. Under weak assumptions, the invariant distribution of the dynamics is exactly µ ∝ exp(−U). Convergence of the Euler discretization of the overdamped Langevin dynamics in Wasserstein-2 distance under strong convexity of U was shown in Durmus and Moulines (2016), and convergence in Kullback-Leibler divergence under a log-Sobolev assumption on µ in Vempala and Wibisono (2019). Accelerated rates may be obtained under different discretizations; see, for instance, Dalalyan and Karagulyan (2019) for strongly log-concave targets. A different approach to faster convergence involves the underdamped Langevin dynamics, governed by the stochastic differential equation

dX_t = V_t dt,
dV_t = (−γ ∇U(X_t) − V_t) dt + √(2γ) dB_t,

where V_t represents the velocity. Under weak assumptions, the invariant measure is µ(x, v) ∝ exp(−U(x) − ‖v‖²/(2γ)).
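For reference, the Euler-Maruyama discretization of the overdamped dynamics above is a one-line update; the sketch below (function names and the Gaussian test target are ours) returns the final iterate as an approximate sample from µ ∝ exp(−U).

```python
import numpy as np

def overdamped_langevin(grad_U, x0, gamma=1.0, h=0.01, steps=2000, rng=None):
    """Euler-Maruyama scheme for dX_t = -gamma grad U(X_t) dt + sqrt(2 gamma) dB_t."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = (x - h * gamma * grad_U(x)
             + np.sqrt(2.0 * gamma * h) * rng.standard_normal(x.shape))
    return x

# Sanity check on U(x) = ||x||^2 / 2, whose Gibbs measure is the standard Gaussian.
draws = np.array([overdamped_langevin(lambda x: x, np.zeros(2),
                                      rng=np.random.default_rng(s)) for s in range(200)])
```

Different drawn chains use independent seeds, so `draws` collects 200 approximately independent samples whose empirical moments should be close to those of the standard Gaussian.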
Accelerated convergence with respect to the overdamped dynamics was shown in Wasserstein-2 distance in the strongly log-concave case by Cheng and Bartlett (2018), and in Kullback-Leibler divergence under a log-Sobolev inequality by Ma et al. (2021). We again note that different discretizations are possible; we cite for instance Dalalyan and Riou-Durand (2020) and Schuh and Whalley (2024). Sampling algorithms for optimization were first introduced for strongly convex potentials in Dalalyan (2017), and further studied in the nonconvex setting in Raginsky et al. (2017); Xu et al. (2018); Zhang et al. (2017); Tzen et al. (2018). All of these focused on discretizations of the overdamped Langevin dynamics. The underdamped Langevin dynamics have also been studied; we refer to Gao et al. (2020) and Borysenko and Byshkin (2021) for more details. Discretizations of other underlying processes were investigated in Chen et al. (2018) and Zhang (2024). Langevin dynamics have also been studied for global optimization outside the context of sampling. We refer interested readers to Chen et al. (2024) and the references therein.

1.1 Contribution

The contributions of the paper are the following:

1. We design a global optimization algorithm capable of minimizing nonconvex functions and obtaining arbitrarily accurate solutions with high probability. This optimization algorithm relies on an oracle sampling algorithm.

2. We propose a new variant of the classical Langevin dynamics, both in continuous and discrete time. The dynamics are inspired by the first-order high-resolution System (2), which is known to exhibit accelerated convergence rates in the deterministic convex setting. We anticipate that this study will encourage further exploration of this system, with the potential to deliver a comparable breakthrough in the field of nonconvex sampling as it has achieved in convex optimization. This sampling algorithm then serves as the oracle algorithm in our global optimization algorithm.
1.2 Structure

Section 2 introduces notation and preliminaries. Section 3 is dedicated to the design of our global optimization algorithm. In Section 4 we study novel continuous- and discrete-time dynamics, and formulate the sampling results. Finally, in Section 5, we illustrate our results numerically on the Rastrigin function, where we improve on current methods. Technical results and proofs are provided in the appendix.

2 Notation and Assumptions

The following standing assumptions on the potential U : ℝᵈ → ℝ will be valid throughout the paper:

- U is twice differentiable, L-smooth, and has a Lipschitz continuous and bounded Hessian.
- There exists a₀ > 0 such that exp(−a₀U) is integrable.
- U has a nonzero finite number of global minimizers, and admits no global minimizers at infinity. We denote its minimal value by U⋆.

Let ‖·‖ denote the Euclidean norm on ℝᵈ and P(ℝᵈ) denote the space of probability measures on ℝᵈ. With abuse of notation, we shall denote interchangeably by µ ∈ P(ℝᵈ) the probability distribution and its density function with respect to the Lebesgue measure, in the case where it exists. For a given potential U : ℝᵈ → ℝ and reals a ≥ a₀ and b > 0, we define µ_a ∈ P(ℝᵈ) and µ_{a,b} ∈ P(ℝ²ᵈ) to be the probability distributions whose densities satisfy

µ_a(x) ∝ exp(−a U(x)) and µ_{a,b}(x, y) ∝ exp(−a U(x) − b ‖y‖²/2),

which are well-defined by assumption. Given a distribution µ ∈ P(ℝᵈ), we denote by

E_{X∼µ}[f(X)] = ∫_{ℝᵈ} f(x) µ(dx)

the expected value of f(X) where X ∼ µ. Whenever the random variable or its distribution is clear from context, we shall abbreviate this to E_µ[f(X)] or E[f(X)]. Let µ, ν ∈ P(ℝᵈ) both have a density with respect to the Lebesgue measure and have full support. We define the total variation distance between µ and ν as

‖µ − ν‖_TV = sup { |E_µ[f] − E_ν[f]| : ‖f‖_∞ ≤ 1 }.
We define the Kullback-Leibler divergence of µ with respect to ν as

KL(µ ‖ ν) = E_{X∼µ}[ln(µ(X)/ν(X))].

Moreover, we define their relative Fisher information as

Fi(µ ‖ ν) = E_{X∼µ}[‖∇ ln(µ(X)/ν(X))‖²].

Finally, we define the Wasserstein-2 distance between µ and ν as

W₂(µ, ν) = inf_{ζ∈Γ(µ,ν)} E_{(X,Y)∼ζ}[‖X − Y‖²]^{1/2},

where Γ(µ, ν) is the set of couplings of µ and ν, namely the set of distributions ζ ∈ P(ℝ²ᵈ) such that ζ(A × ℝᵈ) = µ(A) and ζ(ℝᵈ × A) = ν(A) for all A ∈ B(ℝᵈ). The infimum is always attained, and we call the minimizers optimal couplings (Villani, 2009). A standard assumption in the literature when dealing with nonconvex sampling is a logarithmic Sobolev inequality (Vempala and Wibisono, 2019; Ma et al., 2019; 2021), which may be viewed as a Polyak-Łojasiewicz inequality on the space of probability measures (Liu et al., 2023; Chewi and Stromme, 2024). To the best of our knowledge, this assumption is not standard in the field of global optimization. A log-Sobolev inequality on µ with coefficient ρ states that, for all ν ∈ P(ℝᵈ),

KL(ν ‖ µ) ≤ (1/(2ρ)) Fi(ν ‖ µ). (3)

Remark 2.1. The notion of a logarithmic Sobolev inequality as presented in Inequality (3) was originally introduced in Feissner (1972) for Gaussian measures, and extended in Gross (1975) to general measures. All strongly log-concave probability distributions satisfy the inequality (Bakry and Émery, 1985), which is stable under bounded perturbations (Holley and Stroock, 1987), tensorization, convolution and mixture (Ledoux, 2006; Bakry et al., 2014), and Lipschitz perturbations (Brigati and Pedrotti, 2024), although this might come at the expense of the constant. In the case of mixtures of Gaussians of equal variance, the log-Sobolev constant has an exponential dependency in the problem dimension (Menz and Schlichting, 2014; Schlichting, 2019). Moreover, measures with potentials which are strongly convex outside a ball satisfy a log-Sobolev inequality (Ma et al., 2019), although the constant may again have an exponential dependency in the dimension.

Assumption 2.2.
The Gibbs measures µ_a of U : ℝᵈ → ℝ satisfy a logarithmic Sobolev inequality with coefficients ρ_a > 0, for all a ≥ a₀.

Remark 2.3. Under Assumption 2.2, the measures µ_{a,b} also satisfy a log-Sobolev inequality with constant ρ_{a,b} = min(ρ_a, ρ_b) by tensorization, where we know that ρ_b = 1/b since the marginal in y of µ_{a,b} is strongly log-concave with parameter b. Verifying the logarithmic Sobolev inequality for each a ≥ a₀ may be challenging in general; however, it is readily verified under some structural assumptions on the potential. For instance, it is satisfied if U = V + F, where V is strongly convex and F is bounded, as strongly convex potentials induce a Gibbs measure satisfying the log-Sobolev property, which is stable under bounded perturbations. To the best of our knowledge, the tightest known lower bound for ρ_a is given by exp(−a Osc(F)) (aµ)⁻¹, where µ is the strong convexity constant of V. As Osc(F) typically scales with the dimension d, this highlights the possible exponential dependency in the dimension and in the parameter a of ρ_a, and hence also of ρ_{a,b}.

We now recall Pinsker's Inequality (Pinsker, 1964), which relates the KL divergence and the total variation distance.

Theorem 2.4 (Pinsker's Inequality). For any two µ, ν ∈ P(ℝᵈ), it holds that

‖µ − ν‖_TV ≤ √(2 KL(µ ‖ ν)).

We finish this section by recalling McDiarmid's Inequality (McDiarmid, 1989) in the single-variable case.

Lemma 2.5. Let f : ℝᵈ → ℝ satisfy Osc(f) := sup(f) − inf(f) < +∞. For any random variable X, it holds that

P(E[f(X)] − f(X) ≥ ε) ≤ exp(−2ε²/Osc(f)²).

3 Optimization

The idea behind the global optimization algorithm is to sample from a distribution that produces samples close to global minimizers. Specifically, we define µ⋆ ∈ P(ℝᵈ) to be a mixture of Dirac measures concentrated around the global minimizers of U, with weights as defined in (Hasenpflug et al., 2024, Equation 18). As such, by Hasenpflug et al.
(2024), under the technical Assumption B.1, there exists a constant C > 0, depending on U only, satisfying

W₂(µ_a, µ⋆) ≤ C a^{−1/4}. (4)

As such, an approximate sample from µ_a will yield samples close to global minimizers. Concretely, we suppose we have access to a distribution µ̃ satisfying

KL(µ̃ ‖ µ_a) ≤ ε²/18, (5)

for some ε > 0, and that we can draw samples from µ̃. We are now ready to introduce our global optimization algorithm, dependent on a yet unspecified oracle sub-algorithm, corresponding to a sample from µ̃.

Algorithm 1 Global Optimization Algorithm
Require: Oracle algorithm.
1: Generate N i.i.d. samples X̃^(i), i = 1, . . . , N, according to the oracle algorithm.
2: Set X̃ = X̃^(I) for I = argmin_{i=1,...,N} U(X̃^(i)).

With these results at hand, we may prove convergence in probability of Algorithm 1.

Theorem 3.1. Let U : ℝᵈ → ℝ satisfy Assumptions 2.2 and B.1. Fix ε ∈ (0, 1/2), δ ∈ (0, 1), and suppose

a ≥ max(a₀, 9C⁴L²/ε²) and N ≥ 18 ln(1/δ)/ε². (6)

Suppose we have access to an oracle algorithm that can produce a sample from µ̃, where µ̃ satisfies (5). Then, if X̃ is simulated according to Algorithm 1 with the same oracle algorithm, it holds that P(U(X̃) − U⋆ ≥ ε) ≤ δ.

Remark 3.2. A natural choice for the oracle algorithm producing samples X̃^(i) satisfying (5) is an iterative sampling algorithm, producing the sample in K iterations. The total complexity to produce the sample X̃ is then K·N, with the possibility of parallelization when N > 1. For a fixed computational budget, there is a trade-off between the number of iterations K and the number N of samples, as indicated by Equation (8) below, where the number of iterations K controls how small the probability P(U(X̃^(1)) ≥ ε) is. Therefore, the number of iterations K should be large enough if one wants to ensure that increasing the number of samples N leads to improved accuracy.

Proof. Without loss of generality, set U⋆ = 0. Let X̃^(i), i = 1, . . . , N, be N i.i.d.
copies generated according to the oracle algorithm, such that X̃ = X̃^(I) where I = argmin_{i=1,...,N} U(X̃^(i)). Note that, for any i = 1, . . . , N,

1 − e^{−U(X̃^(i))} = 1 − e^{−E[U(X_a)]} (7a)
  + e^{−E[U(X_a)]} − E[e^{−U(X_a)}] (7b)
  + E[e^{−U(X_a)}] − E[e^{−U(X̃^(i))}] (7c)
  + E[e^{−U(X̃^(i))}] − e^{−U(X̃^(i))}. (7d)

Term (7a) is bounded by L-smoothness as

E[U(X_a)] = E[U(X_a) − U(X⋆)] ≤ E[⟨∇U(X⋆), X_a − X⋆⟩] + (L/2) W₂²(µ_a, µ⋆) ≤ LC²/(2√a),

where X⋆ ∼ µ⋆ is such that ∇U(X⋆) = 0 almost surely, and (X_a, X⋆) ∼ ζ for ζ an optimal coupling between µ_a and µ⋆. As such, using that 1 − e^{−x} ≤ x and the condition on a in (6), we obtain that (7a) is bounded by ε/6. Moreover, as x ↦ exp(−x) is convex, (7b) is nonpositive by Jensen's inequality. Since x ↦ exp(−U(x)) takes values in [0, 1], (7c) is bounded as

|E[e^{−U(X_a)}] − E[e^{−U(X̃^(i))}]| ≤ (1/2) ‖µ̃ − µ_a‖_TV ≤ √(KL(µ̃ ‖ µ_a)/2) ≤ ε/6,

where the second step uses Pinsker's Inequality 2.4 and the last step uses (5). Finally, by considering all the bounds on (7), we have

P(1 − e^{−U(X̃^(i))} ≥ ε/2) ≤ P(E[e^{−U(X̃^(i))}] − e^{−U(X̃^(i))} ≥ ε/6) ≤ exp(−ε²/18) ≤ δ^{1/N},

where the second inequality follows by Lemma 2.5, as x ↦ e^{−U(x)} takes values in [0, 1], and the last one from the condition on N in (6). For x ∈ [0, 1/2], it holds that 1 − e^{−x} ≥ x/2, and hence

P(U(X̃^(i)) ≥ ε) ≤ P(1 − e^{−U(X̃^(i))} ≥ ε/2) ≤ δ^{1/N}.

As X̃ = X̃^(I) where I = argmin_{i=1,...,N} U(X̃^(i)), and (X̃^(i))_{i=1,...,N} are i.i.d., it holds that

P(U(X̃) ≥ ε) = P(U(X̃^(1)) ≥ ε)^N ≤ δ. (8)

Remark 3.3. In Theorem 3.1, the bound ε < 1/2 is an artifact of the proof. By scaling U one can obtain similar results, up to a constant factor, for any value of ε > 0.

Algorithm 1 depends on a yet unspecified oracle algorithm. A new such oracle algorithm is discussed in Section 4. However, the modular form of Algorithm 1 provides a simple framework to employ other sampling algorithms, which potentially may lead to faster provable convergence, or be applicable in different contexts.
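To emphasize this modularity, Algorithm 1 can be sketched with the oracle as a black-box callable. In the smoke test below the hypothetical oracle is exact, since for U(x) = ‖x‖²/2 and a = 1 sampling from µ_a reduces to drawing standard Gaussians; all names here are ours.

```python
import numpy as np

def global_optimize(U, oracle_sampler, N):
    """Algorithm 1: draw N i.i.d. approximate samples from mu_a via the oracle
    and keep the one with the smallest objective value."""
    samples = [oracle_sampler() for _ in range(N)]
    return min(samples, key=U)

# Smoke test with an exact (hypothetical) oracle for U(x) = ||x||^2 / 2, a = 1.
rng = np.random.default_rng(0)
x_hat = global_optimize(lambda x: 0.5 * float(np.dot(x, x)),
                        lambda: rng.standard_normal(5), N=200)
```

Any sampler satisfying (5), whether overdamped, underdamped, or the high-resolution variant of Section 4, can be plugged in as `oracle_sampler` without changing the surrounding logic.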
4 High-Resolution Langevin

4.1 Continuous-Time Study

As previously highlighted, the high-resolution differential equation introduced in System (2) has played a pivotal role in advancing the study of accelerated convex optimization algorithms. Aiming to build an algorithm with an accelerated convergence rate for nonconvex sampling, we introduce the High-Resolution Langevin Dynamics, inspired by these dynamics. Let (Ω, F, P) be a filtered probability space and consider the following stochastic differential equation:

dX_t = (−β ∇U(X_t) + Y_t) dt + √(2σ_x²) dB_t^x,
dY_t = (−γ ∇U(X_t) − α Y_t) dt + √(2σ_y²) dB_t^y, (9)

where (B^x, B^y) is a standard 2d-dimensional Brownian motion, (X₀, Y₀) ∼ µ₀ for some initial distribution µ₀, and α, β, γ, σ_x², σ_y² > 0. As the drift coefficient is Lipschitz continuous, System (9) has a unique solution (Friedman, 1975). We denote the joint law of (X_t, Y_t) by µ_t, which has a twice continuously differentiable density with respect to the Lebesgue measure, as the drift coefficient has a Lipschitz continuous and bounded gradient (Menozzi et al., 2021). A slight variation of System (9) has been studied previously in Li et al. (2022), with similar motivations. The authors obtain convergence rates in the W₂ metric, in the strongly log-concave setting. As our study addresses the nonconvex setting, a comparative analysis falls outside the scope of this work.

Remark 4.1. System (9) is equivalent to

dZ_t = −[ β/a  −1/b ; γ/a  α/b ] ∇H(Z_t) dt + √2 [ σ_x I_d  0 ; 0  σ_y I_d ] dB_t,

where Z_t = (X_t, Y_t) and H(x, y) = a U(x) + b ‖y‖²/2. As such, System (9) may be viewed as a preconditioned Langevin dynamics in a larger space on the function H. Moreover, we see the difference between System (9) and the underdamped Langevin dynamics, through the presence of additional noise. Finally, as σ_x², σ_y² → 0, we recover the deterministic System (2), which exhibits accelerated convergence.
We now study the convergence in KL divergence of System (9), similarly to what was done in Ma et al. (2021). A byproduct of the proof (see Appendix C.1) is the uniqueness of the invariant measure.

Theorem 4.2. Let U : ℝᵈ → ℝ, and let a ≥ a₀ and b, α, β, γ, σ_x, σ_y > 0 satisfy

a = β/σ_x², b = α/σ_y², and γ = a/b. (10)

1. System (9) admits a weak solution (X_t, Y_t) which has invariant law µ_{a,b}.
2. If µ_{a,b} satisfies a log-Sobolev inequality with ρ > 0, then

KL(µ_t ‖ µ_{a,b}) ≤ KL(µ₀ ‖ µ_{a,b}) e^{−2ρ min(σ_x², σ_y²) t}.

In particular, provided σ_x², σ_y² > 0, we obtain KL(µ_t ‖ µ_{a,b}) → 0 at an exponential rate as t → ∞, and the invariant law µ_{a,b} is unique.

4.2 Discrete-Time Study

Consider the following discretization of System (9):

dX̃_t = (−β ∇U(X̃_{kh}) + Ỹ_t) dt + √(2σ_x²) dB_t^x,
dỸ_t = (−γ ∇U(X̃_{kh}) − α Ỹ_t) dt + √(2σ_y²) dB_t^y, (11)

for t ∈ [kh, (k + 1)h], where h > 0 is the step-size and µ̃₀ is the initial distribution, such that (X̃₀, Ỹ₀) ∼ µ̃₀. We define µ̃_t = L((X̃_t, Ỹ_t)). Conditionally on (X̃_{kh}, Ỹ_{kh}), System (11) describes an Ornstein-Uhlenbeck process for t ∈ [kh, (k + 1)h]. We may thus simulate µ̃_{(k+1)h} by sampling an appropriate Gaussian random variable. The explicit computations leading to Algorithm 2 are presented in Appendix C.2.

Algorithm 2 High-Resolution Sampling Algorithm
Require: An initial distribution µ̃₀ ∈ P(ℝ²ᵈ).
1: Simulate (X̃₀, Ỹ₀) ∼ µ̃₀.
2: for k = 0, . . . , K − 1 do
3: Generate (X̃_{(k+1)h}, Ỹ_{(k+1)h}) ∼ N(m, Σ) conditionally on (X̃_{kh}, Ỹ_{kh}), where
   m_X = X̃_{kh} − (βh + (γ/α)(h − (1 − e^{−αh})/α)) ∇U(X̃_{kh}) + ((1 − e^{−αh})/α) Ỹ_{kh},
   m_Y = e^{−αh} Ỹ_{kh} − (γ/α)(1 − e^{−αh}) ∇U(X̃_{kh}),
   Σ_XX = (σ_y²/α³)(2αh − e^{−2αh} + 4e^{−αh} − 3) I_d + 2σ_x² h I_d,
   Σ_YY = (σ_y²/α)(1 − e^{−2αh}) I_d,
   Σ_XY = Σ_YX = (σ_y²/α²)(1 − e^{−αh})² I_d.
4: end for
5: return (X̃_{Kh}, Ỹ_{Kh}).

Our result, as well as our analysis, is comparable to Vempala and Wibisono (2019), who studied the overdamped Langevin dynamics. The more precise result and its derivation are given in Appendix C.3.

Theorem 4.3. Let ε > 0, a ≥ a₀, b > 0 and assume (10) holds.
If a log-Sobolev inequality on µ_{a,b} with parameter ρ > 0 holds, and h ≤ O(ρ), then there exist A = O(ρ), B = O(a²d/ρ) and B̂ = O(a²d), such that

KL(µ̃_h ‖ µ_{a,b}) ≤ e^{−Ah} KL(µ̃₀ ‖ µ_{a,b}) + B̂h²,

and, for all K ≥ 1,

KL(µ̃_{Kh} ‖ µ_{a,b}) ≤ e^{−AKh} KL(µ̃₀ ‖ µ_{a,b}) + Bh. (12)

As an immediate corollary we obtain sufficient conditions to obtain an ε-accurate sample.

Corollary 4.4. Let ε > 0, a ≥ a₀, b > 0 and assume (10) holds. If a log-Sobolev inequality on µ_{a,b} with parameter ρ > 0 holds, then KL(µ̃_{Kh} ‖ µ_{a,b}) ≤ ε for

h = O(ρε/(a²d)) and K = O((a²d)/(ρ²ε) ln(KL(µ̃₀ ‖ µ_{a,b})/ε)).

Proof. Bound each term in Equation (12) by ε/2.

We are now ready to complement Theorem 3.1, by replacing the oracle algorithm by Algorithm 2.

Corollary 4.5. Let U : ℝᵈ → ℝ satisfy Assumptions 2.2 and B.1, and fix ε ∈ (0, 1/2) and δ ∈ (0, 1). Suppose (6) holds, and that X̃ is simulated according to Algorithm 1 with K iterations of Algorithm 2 being used for the oracle sub-procedure. It holds that P(U(X̃) − U⋆ ≥ ε) ≤ δ, for h = O(ρ_{a,b} ε²/(a²d)) and K large enough, as prescribed by Corollary 4.4 with accuracy ε²/18.

Remark 4.6. As in Remark 3.3, the results in Corollary 4.5 immediately extend to any ε > 0.

Remark 4.7. It may be observed that the rates presented in Corollary 4.5 do not improve upon the best-known rates for the overdamped Langevin dynamics (Vempala and Wibisono, 2019) or the underdamped Langevin dynamics (Ma et al., 2021). However, this does not imply the absence of acceleration; rather, it reflects that our current analytical framework is not sufficiently tight to capture it. This is comparable to the initial studies on high-resolution differential equations, which similarly did not demonstrate provable acceleration, yet laid the groundwork for the remarkable advancements in acceleration techniques that we benefit from today.

5 Numerical Results

All experiments have been performed in Python 3.8. The code is available on the authors' GitHub page.¹ Our theoretical study does not answer the question of the optimality of the parameter choice, which we leave for future work.
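In practice, each iteration of Algorithm 2 costs one gradient evaluation plus a correlated Gaussian draw. The sketch below implements the update with the mean and covariance entries of Algorithm 2; the function name, argument order, and the per-coordinate 2×2 Cholesky factorization are our own choices, not the authors' released code.

```python
import numpy as np

def hrla_step(x, y, grad_U, h, alpha, beta, gamma, sx2, sy2, rng):
    """One exact-in-law step of the piecewise Ornstein-Uhlenbeck process (11),
    conditionally on the current pair (x, y)."""
    g = grad_U(x)
    e1, e2 = np.exp(-alpha * h), np.exp(-2.0 * alpha * h)
    # Conditional means m_X and m_Y of Algorithm 2.
    m_x = (x - (beta * h + (gamma / alpha) * (h - (1.0 - e1) / alpha)) * g
           + ((1.0 - e1) / alpha) * y)
    m_y = e1 * y - (gamma / alpha) * (1.0 - e1) * g
    # Per-coordinate covariance of (X, Y) after time h.
    s_xx = sy2 / alpha**3 * (2.0 * alpha * h - e2 + 4.0 * e1 - 3.0) + 2.0 * sx2 * h
    s_yy = sy2 / alpha * (1.0 - e2)
    s_xy = sy2 / alpha**2 * (1.0 - e1) ** 2
    L = np.linalg.cholesky(np.array([[s_xx, s_xy], [s_xy, s_yy]]))
    noise = L @ rng.standard_normal((2, x.size))
    return m_x + noise[0], m_y + noise[1]
```

With parameters satisfying (10) (for instance α = β = 1, σ_x² = 1/a, σ_y² = α/b, γ = a/b), iterating this step targets µ_{a,b}, whose x-marginal is proportional to exp(−aU).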
Given a > 0, we fix α = 1, β = 1, b = 10, γ = a/10, σ_x² = 1/a and σ_y² = 0.1, which satisfy the conditions in (10). The remaining parameters, namely the number of samples N, the number of iterations K and the step-size h, will vary with the experiments. The number of runs over which we compute empirical probabilities is denoted by M. We illustrate the convergence of our algorithm on the Rastrigin function, a classical example of a highly multimodal function with regularly distributed local minima. Let U : ℝᵈ → ℝ be given by

U(x) = d + ‖x‖² − Σ_{i=1}^d cos(2πx_i),

which is minimized at x⋆ = (0, . . . , 0) ∈ ℝᵈ, with objective value U⋆ = U(x⋆) = 0. The Rastrigin function for d = 1 and d = 2 is plotted in Figure 6 of Appendix A for illustrative purposes. The Gibbs measure of the Rastrigin function satisfies a log-Sobolev inequality by Remark 2.3, and it is easy to see it also satisfies Assumption B.1. We select d = 10 for all the experiments, unless otherwise specified. In Figure 1, we show the empirical probabilities, computed over M = 100 runs, that U(X̃_k) − U⋆ ≤ ε, for various thresholds ε. In each run, a step-size h = 0.01, a sample number N = 10 and a maximal number of iterations K = 14000 have been chosen. The initial distribution is set to µ₀ = N(3·1_d, 10·I_d). We observe that for smaller values of a, µ_a is not representative enough of µ⋆ to guarantee the wanted threshold, even after numerous iterations. For larger a, the probability converges, with a rate that decreases as a increases. This is expected, as µ_a approaches µ⋆ as a increases, but the number of iterations needed to reach a good estimate of µ_a also increases with a. These observations qualitatively confirm Corollary 4.5, as well as the dependency in a of ρ_{a,b} outlined in Remark 2.3. Table 1 shows similar results in tabular form, for various values of the step-size h. We report the average, median and standard deviation over M = 100 runs after K = 14000 iterations. The small standard deviation showcases the robustness of our method.
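For completeness, the Rastrigin variant used here (with unit amplitude on the cosine term, matching the formula above) and its gradient can be written as follows; the function names are ours.

```python
import numpy as np

def rastrigin(x):
    """U(x) = d + ||x||^2 - sum_i cos(2 pi x_i); global minimum U(0) = 0."""
    x = np.asarray(x, dtype=float)
    return x.size + float(np.dot(x, x)) - float(np.sum(np.cos(2.0 * np.pi * x)))

def rastrigin_grad(x):
    """Gradient: 2 x_i + 2 pi sin(2 pi x_i) in each coordinate."""
    x = np.asarray(x, dtype=float)
    return 2.0 * x + 2.0 * np.pi * np.sin(2.0 * np.pi * x)
```

Note that the gradient vanishes on a regular grid of points, one per unit cell, which is exactly the lattice of local minima and saddles that traps deterministic descent methods.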
These results motivate our selection of step-size for the subsequent experiments. As previously noted, Algorithm 1 achieves convergence for any sampling algorithm that satisfies the conditions outlined in Theorem 3.1. In particular, discretizations of both overdamped and underdamped Langevin dynamics are anticipated to be viable candidates. In Figure 2, we compare the high-resolution Langevin algorithm (HRLA) proposed in Algorithm 2 with the overdamped Langevin algorithm (OLA) from Vempala and Wibisono (2019) and the underdamped Langevin algorithm (ULA) from Ma et al. (2021), using comparable parameter settings. The values of a are selected empirically to optimize the convergence rate for a given value of ε, with convergence assessed by the speed at which the empirical probability reaches 1, which reflects the accuracy of the invariant measure and does not depend on the sampling algorithm. Hence the chosen pairs (a, ε) are common to all three algorithms.

¹ https://github.com/DanielCortild/GlobalOptimization

Figure 1: Empirical Probabilities of Hitting Tolerance ε for 14000 Iterations.

Table 1: Average, Median and Standard Deviation of Objective Function Value for Various Step-Sizes.

h      Stat   a = 1   a = 2   a = 3   a = 4
0.001  Avg    2.974   0.920   1.363   4.216
0.001  Med    3.034   0.895   1.427   4.291
0.001  SD     0.595   0.334   0.617   0.906
0.01   Avg    3.422   0.949   0.425   0.318
0.01   Med    3.488   0.960   0.422   0.329
0.01   SD     0.599   0.257   0.101   0.076
0.1    Avg    7.840   6.970   6.780   6.570
0.1    Med    7.977   7.168   6.894   6.712
0.1    SD     1.031   0.959   0.842   0.766

Figure 2: Comparison of Different Sampling Methods.

Extending the results of Figure 1 to a larger number of iterations (K = 100000) with larger values of a allows us to reach better accuracies. This is shown in Figure 3, in which we observe the same trends as in Figure 1.
Figure 3: Empirical Probabilities of Hitting Tolerance ε for 100000 Iterations.

The impact of the problem's dimensionality can also be analyzed. In Figure 4, we compare the cases of d = 10 with ε = 2 and d = 20 with ε = 4. The observed trends are remarkably similar, which aligns with theoretical expectations: doubling the dimension proportionally scales the expected initial objective value, and consequently, doubling the tolerance maintains the relative error at a consistent level.

Figure 4: Empirical Probabilities of Hitting Tolerance ε for d = 10 and d = 20.

In all preceding figures, we observe a faster convergence to the target µ_a for smaller a; however, the target µ_a is then further from the true target µ⋆, such that the probability converges to a value far from 1. In order to leverage this, one can update the value of a over the iterations, in the same spirit as simulated annealing (Gidas, 1985). For given values a̲ and ā, we let a_k be the value of a at iteration k, where we assume (a_k) evolves linearly between a̲ and ā in k. Specifically, we set

a_k = ((K − k) a̲ + k ā)/K, (13)

where K is the total number of iterations. We select a̲ = 0.1, and vary the final value ā. Figure 5 plots the empirical probabilities for various values of ā, now for smaller tolerances. We observe a faster convergence to a higher accuracy, as expected. For large values of ā, the increase from a̲ to ā is abrupt, causing the algorithm to become trapped in local minima. This behavior may originate from the suboptimality of the cooling scheme described in Equation (13). Further acceleration may be obtained either by optimizing the

Table 2: Average Objective Function Value for Fixed Effort N·K = 140000, with a_k given by Equation (13).
N      a = 4   a = 12  a = 20  a = 40
1      0.140   0.062   0.078   0.141
10     0.143   0.048   0.054   0.176
100    0.197   0.229   1.311   4.432
1000   3.653   8.781   11.167  12.558
10000  16.206  16.014  16.023  16.160

Table 3: Comparison to Results in Guilmeau et al. (2021), with a_k given by Equation (13).

K    Stat  SA     FSA    SMC-SA  CSA    a = 1  a = 2  a = 3  a = 4  a = 5  a = 6
50   Avg   3.29   3.36   3.26    3.23   15.76  15.30  14.04  13.61  13.40  13.40
50   SD    0.425  0.453  0.521   0.484  2.539  2.262  2.563  2.068  2.306  2.065
500  Avg   2.52   2.64   2.62    2.47   2.56   0.74   0.38   0.32   0.31   0.61
500  SD    0.320  0.304  0.413   0.502  0.549  0.244  0.101  0.095  0.223  0.433

cooling scheme, or by employing alternatives to simulated annealing. Although we do not delve further into this, we refer the interested reader to Marinari and Parisi (1992).

Figure 5: Empirical Probabilities of Hitting Tolerance ε with a_k given by Equation (13), for 14000 Iterations.

For a fixed computational effort N·K, one can choose to put a heavier emphasis on the number of samples N or the number of iterations K. Table 2 shows a comparison between multiple pairs (N, K) with constant product, reporting the average running best value over all iterates. We observe that it is beneficial to select N small, although N = 1 is not always optimal. Moreover, N = 1 does not allow for parallelization, contrary to N > 1. This aligns with the observations in Remark 3.2. In Table 3 we compare our results to well-studied variants of simulated annealing. Our algorithm is compared to Simulated Annealing (SA) as presented in Haario and Saksman (1991), Fast Simulated Annealing (FSA) as presented in Rubenthaler et al. (2009), Sequential Monte Carlo Simulated Annealing (SMC-SA) as presented in Zhou and Chen (2013), and Curious Simulated Annealing (CSA) as presented in Guilmeau et al. (2021). The numerical values for the four aforementioned methods are extracted from Guilmeau et al. (2021).
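Both tables anneal a according to Equation (13); as a one-line reference implementation (the function name and default endpoint values are ours, chosen for illustration):

```python
def cooling_schedule(k, K, a_start=0.1, a_end=4.0):
    """Linear interpolation a_k = ((K - k) * a_start + k * a_end) / K,
    moving a from a_start at iteration 0 to a_end at iteration K."""
    return ((K - k) * a_start + k * a_end) / K
```

A smoother schedule (e.g. piecewise or logarithmic, as is classical in simulated annealing) could be swapped in here without changing the rest of the sampling loop.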
For a fair comparison we use the same parameters, namely we perform M = 50 runs with N = 250 samples and K = 500 iterations, and an initial distribution µ0 = δ{x0}, where x0 = (1, . . . , 1)ᵀ ∈ R^10. Table 3 reports the average running best function value and its corresponding standard deviation over the runs. Although at 50 iterations the accuracy of our algorithm is worse than that of the other presented algorithms, the comparison reverses at 500 iterations, where our method outperforms the state-of-the-art methods by an order of magnitude. The state-of-the-art simulated annealing methods show very little improvement between 50 and 500 iterations compared to our algorithm. This reflects the well-known fact that simulated annealing methods tend to get stuck in local minimizers if the cooling scheme is not tailored to the problem, an issue which our method does not seem to encounter.

6 Conclusions

The main focus of the paper is a global optimization algorithm, which produces arbitrarily accurate solutions with high probability. The main assumptions on the potential to be minimized were regularity conditions and a logarithmic Sobolev inequality on the associated Gibbs measure. This global optimization algorithm relies on an oracle sub-algorithm, which produces samples from a given Gibbs distribution. For this oracle, we introduced a new variant of Langevin dynamics, given in System (9). These dynamics are inspired by the deterministic high-resolution differential equation presented in System (2), and are designed to tackle global optimization in nonconvex settings. Our continuous-time and discrete-time dynamics complement the classical overdamped and underdamped Langevin methods by adding another noise term.
We established convergence properties of the continuous-time dynamics, showing that their invariant measure concentrates around the global minimizers of the potential U, and developed a practical sampling algorithm through discretization, which provably converges. Our theoretical analysis and numerical experiments on the Rastrigin function demonstrate that the proposed method effectively navigates nonconvex landscapes and can obtain accurate solutions with high probability.

6.1 Open Questions

Several open questions remain for future work:

1. The parameters of the algorithm were presented in a general form, and the optimal selection of these parameters is unclear. This is also the case for the ideal combination of (N, K), as suggested by Table 2.

2. The constant C in Inequality (4) is not explicitly defined in Hasenpflug et al. (2024). An explicit value would enable an explicit bound in Theorem 3.1.

3. The technical Assumption B.1 might be restrictive in certain contexts. Under more general smoothness assumptions, Algorithm 1 could make use of algorithms for sampling from non-smooth potentials, such as the proximal algorithm (Lee et al., 2021; Chen et al., 2022; Liang and Chen, 2023). Hence the interest in establishing a bound similar to Inequality (5) under less restrictive assumptions.

4. The cooling scheme outlined in Equation (13) is suboptimal, as evidenced by Figure 5. Further research is needed to provide a theoretical analysis of the simulated annealing variant of our algorithm. Additionally, an adaptive scheme could be explored.

6.2 Acknowledgements

The authors thank the anonymous reviewers for their careful reading and insightful feedback on our manuscript. They moreover thank Guillaume Garrigos, Huiyuan Guo, Lucas Ketels, Filip Voronine and Zepeng Wang for numerous meetings and their interesting and helpful insights and questions, and Mareike Hasenpflug for her help regarding her paper (Hasenpflug et al., 2024).
This research benefited from the support of the FMJH Program Gaspard Monge for optimization and operations research and their interactions with data science. We thank the Center for Information Technology of the University of Groningen for their support and for providing access to the Hábrók high performance computing cluster.

References

Alvarez, F., Attouch, H., Bolte, J., and Redont, P. (2002). A second-order gradient-like dissipative dynamical system with Hessian-driven damping: Application to optimization and mechanics. Journal de Mathématiques Pures et Appliquées, 81(8):747–779.

Ambrosio, L., Gigli, N., and Savaré, G. (2005). Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media.

Attouch, H., Chbani, Z., Fadili, J., and Riahi, H. (2022). First-order optimization algorithms via inertial systems with Hessian driven damping. Mathematical Programming, pages 1–43.

Attouch, H., Chbani, Z., Peypouquet, J., and Redont, P. (2018). Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity. Mathematical Programming, 168:123–175.

Attouch, H., Peypouquet, J., and Redont, P. (2016). Fast convex optimization via inertial dynamics with Hessian driven damping. Journal of Differential Equations, 261(10):5734–5783.

Bakry, D., Gentil, I., and Ledoux, M. (2014). Analysis and Geometry of Markov Diffusion Operators, volume 348 of Grundlehren der mathematischen Wissenschaften. Springer International Publishing, Cham.

Bakry, D. and Émery, M. (1985). Diffusions hypercontractives. Séminaire de probabilités de Strasbourg, 19:177–206.

Balasubramanian, K., Chewi, S., Erdogdu, M. A., Salim, A., and Zhang, S. (2022). Towards a theory of non-log-concave sampling: first-order stationarity guarantees for Langevin Monte Carlo.
In Conference on Learning Theory, pages 2896–2923. PMLR.

Borysenko, O. and Byshkin, M. (2021). CoolMomentum: A method for stochastic optimization by Langevin dynamics with simulated annealing. Scientific Reports, 11(1):10705.

Brigati, G. and Pedrotti, F. (2024). Heat flow, log-concavity, and Lipschitz transport maps. arXiv preprint arXiv:2404.15205.

Chen, A. Y., Sekhari, A., and Sridharan, K. (2024). Langevin Dynamics: A Unified Perspective on Optimization via Lyapunov Potentials. In OPT 2024: 16th Annual Workshop on Optimization for Machine Learning.

Chen, Y., Chen, J., Dong, J., Peng, J., and Wang, Z. (2018). Accelerating Nonconvex Learning via Replica Exchange Langevin Diffusion. In International Conference on Learning Representations.

Chen, Y., Chewi, S., Salim, A., and Wibisono, A. (2022). Improved analysis for a proximal algorithm for sampling. In Proceedings of Thirty Fifth Conference on Learning Theory, pages 2984–3014. PMLR.

Cheng, X. and Bartlett, P. (2018). Convergence of Langevin MCMC in KL-divergence. In Algorithmic Learning Theory, pages 186–211. PMLR.

Chewi, S., Erdogdu, M. A., Li, M. B., Shen, R., and Zhang, M. (2022). Analysis of Langevin Monte Carlo from Poincaré to Log-Sobolev. In Proceedings of Thirty Fifth Conference on Learning Theory, volume 178, pages 1–2. PMLR.

Chewi, S. and Stromme, A. J. (2024). The ballistic limit of the log-Sobolev constant equals the Polyak–Łojasiewicz constant. arXiv preprint arXiv:2411.11415.

Dalalyan, A. (2017). Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. In Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 678–689. PMLR.

Dalalyan, A. S. and Karagulyan, A. (2019). User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes and their Applications, 129(12):5278–5311.

Dalalyan, A. S. and Riou-Durand, L. (2020).
On sampling from a log-concave density using kinetic Langevin diffusions. arXiv preprint arXiv:1807.09382.

Durmus, A. and Moulines, E. (2016). Sampling from a strongly log-concave distribution with the Unadjusted Langevin Algorithm. arXiv preprint arXiv:1605.01559.

Feissner, G. F. (1972). A Gaussian measure analogue to Sobolev's inequality. Cornell University.

Friedman, A. (1975). Stochastic Differential Equations and Applications. In Friedman, A., editor, Stochastic Differential Equations, pages 75–148. Springer.

Gao, X., Gurbuzbalaban, M., and Zhu, L. (2020). Breaking Reversibility Accelerates Langevin Dynamics for Non-Convex Optimization. In Advances in Neural Information Processing Systems, volume 33, pages 17850–17862. Curran Associates, Inc.

Gidas, B. (1985). Global optimization via the Langevin equation. In 24th IEEE Conference on Decision and Control, pages 774–778.

Gross, L. (1975). Logarithmic Sobolev Inequalities. American Journal of Mathematics, 97(4):1061–1083.

Guilmeau, T., Chouzenoux, E., and Elvira, V. (2021). Simulated Annealing: a Review and a New Scheme. In IEEE Statistical Signal Processing Workshop, pages 101–105.

Haario, H. and Saksman, E. (1991). Simulated Annealing Process in General State Space. Advances in Applied Probability, 23(4):866–893.

Hasenpflug, M., Rudolf, D., and Sprungk, B. (2024). Wasserstein convergence rates of increasingly concentrating probability measures. The Annals of Applied Probability, 34(3):3320–3347.

Holley, R. and Stroock, D. W. (1987). Logarithmic Sobolev inequalities and stochastic Ising models. Journal of Statistical Physics, 46(5):1159–1194.

Langevin, P. (1908). On the theory of Brownian motion. Comptes Rendus de l'Académie des Sciences (Paris), 146:530.

Ledoux, M. (2006). Concentration of measure and logarithmic Sobolev inequalities. In Séminaire de probabilités XXXIII, pages 120–216. Springer.

Lee, Y. T., Shen, R., and Tian, K. (2021).
Structured Logconcave Sampling with a Restricted Gaussian Oracle. In Proceedings of Thirty Fourth Conference on Learning Theory, pages 2993–3050. PMLR.

Li, R., Zha, H., and Tao, M. (2022). Hessian-Free High-Resolution Nesterov Acceleration For Sampling. In Proceedings of the 39th International Conference on Machine Learning, pages 13125–13162. PMLR.

Liang, J. and Chen, Y. (2023). A Proximal Algorithm for Sampling. Transactions on Machine Learning Research.

Liu, L., Majka, M. B., and Szpruch, L. (2023). Polyak–Łojasiewicz inequality on the space of measures and convergence of mean-field birth-death processes. Applied Mathematics & Optimization, 87(3):48.

Ma, Y.-A., Chatterji, N. S., Cheng, X., Flammarion, N., Bartlett, P. L., and Jordan, M. I. (2021). Is there an analog of Nesterov acceleration for gradient-based MCMC? Bernoulli, 27(3):1942–1992.

Ma, Y.-A., Chen, Y., Jin, C., Flammarion, N., and Jordan, M. I. (2019). Sampling can be faster than optimization. Proceedings of the National Academy of Sciences, 116(42):20881–20885.

Marinari, E. and Parisi, G. (1992). Simulated Tempering: A New Monte Carlo Scheme. Europhysics Letters, 19(6):451.

McDiarmid, C. (1989). On the method of bounded differences. Surveys in Combinatorics, 141(1):148–188.

Menozzi, S., Pesce, A., and Zhang, X. (2021). Density and gradient estimates for non-degenerate Brownian SDEs with unbounded measurable drift. Journal of Differential Equations, 272:330–369.

Menz, G. and Schlichting, A. (2014). Poincaré and logarithmic Sobolev inequalities by decomposition of the energy landscape. The Annals of Probability, 42(5):1809–1884.

Mousavi-Hosseini, A., Farghly, T. K., He, Y., Balasubramanian, K., and Erdogdu, M. A. (2023). Towards a Complete Analysis of Langevin Monte Carlo: Beyond Poincaré Inequality. In The Thirty Sixth Annual Conference on Learning Theory, pages 1–35. PMLR.

Pinsker, M. S. (1964). Information and information stability of random variables and processes. Holden-Day.
Raginsky, M., Rakhlin, A., and Telgarsky, M. (2017). Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis. In Proceedings of the 2017 Conference on Learning Theory, volume 65, pages 1674–1703. PMLR.

Rubenthaler, S., Rydén, T., and Wiktorsson, M. (2009). Fast simulated annealing in R^d with an application to maximum likelihood estimation in state-space models. Stochastic Processes and their Applications, 119(6):1912–1931.

Schlichting, A. (2019). Poincaré and log-Sobolev inequalities for mixtures. Entropy, 21(1):89.

Schuh, K. and Whalley, P. A. (2024). Convergence of kinetic Langevin samplers for non-convex potentials. arXiv preprint arXiv:2405.09992.

Shi, B., Du, S. S., Jordan, M. I., and Su, W. J. (2022). Understanding the acceleration phenomenon via high-resolution differential equations. Mathematical Programming, 195(1):79–148.

Su, W., Boyd, S., and Candes, E. J. (2016). A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. Journal of Machine Learning Research, 17(153):1–43.

Talagrand, M. (1996). Transportation cost for Gaussian and other product measures. Geometric & Functional Analysis, 6(3):587–600.

Tzen, B., Liang, T., and Raginsky, M. (2018). Local Optimality and Generalization Guarantees for the Langevin Algorithm via Empirical Metastability. In Proceedings of the 31st Conference On Learning Theory, volume 75, pages 857–875. PMLR.

Vempala, S. and Wibisono, A. (2019). Rapid Convergence of the Unadjusted Langevin Algorithm: Isoperimetry Suffices. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Villani, C. (2009). Optimal Transport: Old and New, volume 338. Springer.

Xu, P., Chen, J., Zou, D., and Gu, Q. (2018). Global Convergence of Langevin Dynamics Based Algorithms for Nonconvex Optimization. In Advances in Neural Information Processing Systems, volume 31.
Curran Associates, Inc.

Zhang, X. (2024). New algorithms for sampling and diffusion models. arXiv preprint arXiv:2406.09665.

Zhang, Y., Liang, P., and Charikar, M. (2017). A Hitting Time Analysis of Stochastic Gradient Langevin Dynamics. In Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 1980–2022. PMLR.

Zhou, E. and Chen, X. (2013). Sequential Monte Carlo simulated annealing. Journal of Global Optimization, 55:101–124.

A Deferred Figures

Figure 6 depicts the Rastrigin function in dimensions d = 1 and d = 2, illustrating its nonconvex nature.

Figure 6: Rastrigin Function for d = 1 and d = 2.

B Technical Assumptions

Equation (4) holds under the following technical assumptions:

Assumption B.1. We assume:

1. There exists a global minimizer x̂ of U satisfying ∫_{R^d} ‖x − x̂‖² exp(−U(x)) dx < +∞.

2. U is 6-times continuously differentiable on a set F ⊆ R^d such that all global minimizers are contained in the interior of F.

3. The Hessian of U is positive definite at each global minimizer.

C Deferred Proofs

For a probability measure µ ∈ P(R^d), we use the notation µ(f) := E[f(Z)], where Z ∼ µ and f is an integrable test function defined on R^d. Given a process Zt ∼ µt, the notation µ0|t(f) denotes the function defined on R^d such that µ0|t(f)(z) := E[f(Z0) | Zt = z] for any z ∈ R^d. We start by introducing a lemma that will be used throughout the proofs.

Lemma C.1. Let µt : R≥0 → P(R^d) be a curve of probability measures, and suppose that vt : R≥0 × R^d → R^d is a family of vector fields satisfying the continuity equation ∂t µt + ∇·(µt vt) = 0. Then it holds that, for all µ⋆ ∈ P(R^d),

d/dt KL(µt ‖ µ⋆) = E_{µt}[ ⟨∇ log(µt/µ⋆), vt⟩ ].

Proof. See (Ambrosio et al., 2005, Equation 10.1.16).

C.1 Proof of Theorem 4.2

1. Observe that

∇x µa,b = −a ∇U(x) µa,b,  ∇y µa,b = −b y µa,b,
∇xx µa,b = ( −a ∇²U(x) + a² ∇U(x)∇U(x)ᵀ ) µa,b,  ∇yy µa,b = ( −b I_d + b² y yᵀ ) µa,b.
As such, under the given parameters, one easily checks that

0 = (−aβ + σx²a²) ‖∇U(x)‖² + (a − γb) ⟨∇U(x), y⟩ + (−αb + σy²b²) ‖y‖² + (β − σx²a) tr(∇²U(x)) + (α − σy²b) d
  = β tr(∇²U(x)) + a ⟨−β∇U(x) + y, ∇U(x)⟩ + α d + b ⟨−γ∇U(x) − αy, y⟩ + σx² ( −a tr(∇²U(x)) + a² ‖∇U(x)‖² ) + σy² ( −bd + b² ‖y‖² )
  = (1/µa,b) [ −∇x·( (−β∇U(x) + y) µa,b(x, y) ) − ∇y·( (−γ∇U(x) − αy) µa,b(x, y) ) + σx² tr(∇xx µa,b) + σy² tr(∇yy µa,b) ].

It thus holds that µa,b satisfies the stationary Fokker-Planck equation, and must hence be an invariant distribution.

2. The Fokker-Planck equation associated with System (9) reads

∂t µt = −∇x·( (−β∇U(x) + y) µt(x, y) ) − ∇y·( (−γ∇U(x) − αy) µt(x, y) ) + σx² ∆x µt + σy² ∆y µt = ∇·( µt M ∇ log(µt/µa,b) ),

where M denotes the block matrix ( σx² I_d  −(1/b) I_d ; (1/b) I_d  σy² I_d ). As such, by Lemma C.1, we know that

d/dt KL(µt ‖ µa,b) = −E_{µt} ⟨ ∇ log(µt/µa,b), M ∇ log(µt/µa,b) ⟩ = −E_{µt} ⟨ ∇ log(µt/µa,b), ( σx² I_d  0 ; 0  σy² I_d ) ∇ log(µt/µa,b) ⟩ ≤ −min(σx², σy²) E_{µt} ‖∇ log(µt/µa,b)‖²,

where the antisymmetric part of M drops out. In particular, if we assume a log-Sobolev inequality with coefficient ρ, we get

d/dt KL(µt ‖ µa,b) ≤ −2ρ min(σx², σy²) KL(µt ‖ µa,b),

which, by Grönwall's inequality, implies

KL(µt ‖ µa,b) ≤ exp( −2ρ min(σx², σy²) t ) KL(µ0 ‖ µa,b),

which means we get convergence as long as σx², σy² > 0.

C.2 Explicit Computations for Algorithm 2

We first note that we can rewrite System (11) in integral form as, for t ∈ [kh, (k+1)h),

X̃t = X̃kh − β(t − kh) ∇U(X̃kh) + ∫_{kh}^{t} Ỹs ds + √(2σx²) ∫_{kh}^{t} dBx_s,
Ỹt = e^{−α(t−kh)} Ỹkh − (γ/α) ( 1 − e^{−α(t−kh)} ) ∇U(X̃kh) + √(2σy²) ∫_{kh}^{t} e^{−α(t−s)} dBy_s. (15)

Conditionally on the initial condition (X̃kh, Ỹkh), the process (X̃t, Ỹt), kh ≤ t ≤ (k+1)h, is an Ornstein-Uhlenbeck process on [kh, (k+1)h]. From now on, in this subsection, we always work implicitly conditionally on (X̃kh, Ỹkh). In particular, L((X̃t, Ỹt)) is Gaussian, and it remains to compute the associated expectation and covariance matrix to fully characterize it.
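To make the product structure of the invariant measure concrete, here is a minimal numerical sketch of the unnormalized density µa,b(x, y) ∝ exp(−aU(x)) exp(−b‖y‖²/2) that the computation above verifies; the function name and the callable argument U are ours, not part of the paper:

```python
import math

def gibbs_density_unnormalized(x, y, a, b, U):
    # Unnormalized invariant density mu_{a,b}(x, y) = exp(-a U(x)) * exp(-b |y|^2 / 2).
    # Larger a concentrates the x-marginal around the global minimizers of U.
    return math.exp(-a * U(x)) * math.exp(-b * sum(yi * yi for yi in y) / 2.0)
```

Increasing a sharpens the x-marginal around the minimizers of U, which is exactly the mechanism exploited by the cooling scheme of Equation (13).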
We thus compute

E[Ỹt] = e^{−α(t−kh)} Ỹkh − (γ/α) ( 1 − e^{−α(t−kh)} ) ∇U(X̃kh),

from which we get that

E[X̃t] = X̃kh − β(t − kh) ∇U(X̃kh) + (1/α) ( 1 − e^{−α(t−kh)} ) Ỹkh − (γ/α) [ (t − kh) − (1/α) ( 1 − e^{−α(t−kh)} ) ] ∇U(X̃kh).

Now note that the Brownian-motion term for Ỹt is √(2σy²) ∫_{kh}^{t} e^{−α(t−s)} dBy_s, whereas the Brownian-motion term for X̃t is

√(2σx²) ∫_{kh}^{t} dBx_s + √(2σy²) ∫_{kh}^{t} ∫_{kh}^{r} e^{−α(r−s)} dBy_s dr = √(2σx²) ∫_{kh}^{t} dBx_s + ( √(2σy²) / α ) ∫_{kh}^{t} ( 1 − e^{−α(t−s)} ) dBy_s,

by Fubini's theorem. We then compute

Cov(Ỹt, Ỹt) = E[ (Ỹt − E[Ỹt]) (Ỹt − E[Ỹt])ᵀ ] = 2σy² ∫_{kh}^{t} e^{−2α(t−s)} ds I_d = (σy²/α) ( 1 − e^{−2α(t−kh)} ) I_d,

Cov(X̃t, Ỹt) = E[ (X̃t − E[X̃t]) (Ỹt − E[Ỹt])ᵀ ] = (2σy²/α) ∫_{kh}^{t} ( 1 − e^{−α(t−s)} ) e^{−α(t−s)} ds I_d = (σy²/α²) ( 1 − e^{−α(t−kh)} )² I_d.

And, finally,

Cov(X̃t, X̃t) = E[ (X̃t − E[X̃t]) (X̃t − E[X̃t])ᵀ ] = 2σx² ∫_{kh}^{t} ds I_d + (2σy²/α²) ∫_{kh}^{t} ( 1 − e^{−α(t−s)} )² ds I_d = ( 2σx² (t − kh) + (σy²/α³) [ 2α(t − kh) + 1 − e^{−2α(t−kh)} − 4 ( 1 − e^{−α(t−kh)} ) ] ) I_d.

Selecting t = (k+1)h yields the wanted result.

C.3 Proof of Theorem 4.3

For completeness, we introduce the following technical lemma:

Lemma C.2. Consider the R^d-valued random process (Zt) defined through dZt = b(Z0) dt + σ dWt, with initial condition Z0 ∼ ν0 for some ν0 ∈ P(R^d). Then νt = L(Zt) is a weak solution of ∂t νt = L⋆_t νt, where

L⋆_t η := −∑_i ∂_i ( ν0|t(b_i) η ) + (1/2) ∑_{i,j} ∂_{i,j} ( (σσᵀ)_{i,j} η ).

Proof. Let us consider a smooth real-valued function f. Then Itô's lemma together with the tower property of conditional expectation yields

E[f(Zt)] = E[f(Z0)] + ∫_0^t E[ ∑_i E[b_i(Z0) | Zs] ∂_i f(Zs) + (1/2) ∑_{i,j} (σσᵀ)_{i,j} ∂_{i,j} f(Zs) ] ds,

which reads

(νt − ν0)(f) = ∫_0^t νs( ∑_i ν0|s(b_i) ∂_i f + (1/2) ∑_{i,j} (σσᵀ)_{i,j} ∂_{i,j} f ) ds.

Let Lt be the differential operator such that

Lt f := ∑_i ν0|t(b_i) ∂_i f + (1/2) ∑_{i,j} (σσᵀ)_{i,j} ∂_{i,j} f.

Then

(νt − ν0)(f) = ∫_0^t νs(Ls f) ds = ∫_0^t L⋆_s νs(f) ds,

and differentiating yields the wanted result.
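As a concrete illustration of the conditional law computed in Section C.2, the velocity component can be sampled exactly over one step of length h. This is a sketch under our own naming, using the per-coordinate mean e^{−αh} y − (γ/α)(1 − e^{−αh}) ∇U(x) and variance (σy²/α)(1 − e^{−2αh}) derived above; it covers only the Y-update, not the correlated X-update:

```python
import math
import random

def velocity_step(y, grad_U_x, alpha, gamma, sigma_y2, h):
    # One exact Ornstein-Uhlenbeck update of Y over a step of length h,
    # conditionally on (X_kh, Y_kh): a Gaussian with the mean and
    # per-coordinate variance computed in Section C.2.
    decay = math.exp(-alpha * h)
    std = math.sqrt(sigma_y2 / alpha * (1.0 - math.exp(-2.0 * alpha * h)))
    return [decay * yi - (gamma / alpha) * (1.0 - decay) * gi + std * random.gauss(0.0, 1.0)
            for yi, gi in zip(y, grad_U_x)]
```

Setting sigma_y2 = 0 recovers the deterministic part of the update; sampling the full pair (X, Y) additionally requires the cross-covariance Cov(X̃t, Ỹt) computed above.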
Given the parameters α, β, γ, σx², σy², a, b, ρ, L, we define

θ := ρ min(σx², σy²),  τ := a²L² (σx⁴ + b⁻²) / ( 2 min(σx², σy²) ),  A := 12 + 4β²L² + 4γ²L²,  B := 2σx² + 12/b + 6σy² + 4(β² + γ²)L/a,  B̂ := 2τBd. (16)

We now introduce the more precise statement of Theorem 4.3.

Theorem C.3. Assume µa,b satisfies a log-Sobolev inequality with constant ρ. Assume (X̃t, Ỹt) follows System (11), with initial distribution (X̃0, Ỹ0) ∼ µ̃0. Moreover, assume that the step size h satisfies Condition (17). Then it holds that

KL(µ̃h ‖ µa,b) ≤ e^{−θh/2} KL(µ̃0 ‖ µa,b) + B̂h², (18)

and for all K ≥ 1,

KL(µ̃Kh ‖ µa,b) ≤ exp(−θKh/2) KL(µ̃0 ‖ µa,b) + 3B̂h/θ. (19)

Proof. Choose h according to Condition (17), and fix some t ≤ h, such that System (11) reduces to

X̃t = X̃0 − βt ∇U(X̃0) + ∫_0^t Ỹs ds + √(2σx²) ∫_0^t dBx_s,
Ỹt = e^{−αt} Ỹ0 − (γ/α) ( 1 − e^{−αt} ) ∇U(X̃0) + √(2σy²) ∫_0^t e^{−α(t−s)} dBy_s. (20)

Denote by µ̃t the joint law of (X̃t, Ỹt). Note that in the above process, (X̃0, Ỹ0) is itself a random variable, with joint law µ̃0. Denote by µ̃0,t the joint law of ((X̃0, Ỹ0), (X̃t, Ỹt)), and by µ̃0|t the conditional law L((X̃0, Ỹ0) | (X̃t, Ỹt)). Applying Lemma C.2 to the process Z = (X̃, Ỹ) implies that µ̃t satisfies

∂t µ̃t = ∇·( µ̃t M [ ∇ log(µ̃t/µa,b) + a ( µ̃0|t(∇U)(x, y) − ∇U(x), 0 ) ] ),

where M denotes the block matrix ( σx² I_d  −(1/b) I_d ; (1/b) I_d  σy² I_d ). As such, the vector field given by

vt = −M [ ∇ log(µ̃t/µa,b) + a ( µ̃0|t(∇U)(x, y) − ∇U(x), 0 ) ]

is tangent to (µ̃t), and hence, by Lemma C.1, it holds that

d/dt KL(µ̃t ‖ µa,b) = −E_{µ̃t} ⟨ ∇ log(µ̃t/µa,b), M ∇ log(µ̃t/µa,b) ⟩ − E_{µ̃0,t} ⟨ ∇ log(µ̃t/µa,b), M ( a[∇U(X̃0) − ∇U(X̃t)], 0 ) ⟩
 ≤ −( min(σx², σy²) / 2 ) Fi(µ̃t ‖ µa,b) + ( 1 / (2 min(σx², σy²)) ) E_{µ̃0,t} ‖ M ( a[∇U(X̃0) − ∇U(X̃t)], 0 ) ‖², (22)

where the final inequality follows by Young's inequality with coefficient min(σx², σy²).
Now note that

E_{µ̃0,t} ‖ M ( a[∇U(X̃0) − ∇U(X̃t)], 0 ) ‖² = a² (σx⁴ + b⁻²) E_{µ̃0,t}[ ‖∇U(X̃0) − ∇U(X̃t)‖² ] ≤ a²L² (σx⁴ + b⁻²) E_{µ̃0,t}[ ‖X̃0 − X̃t‖² ].

As such, Equation (22) reads

d/dt KL(µ̃t ‖ µa,b) ≤ −( min(σx², σy²) / 2 ) Fi(µ̃t ‖ µa,b) + τ E_{µ̃0,t}[ ‖X̃0 − X̃t‖² ]. (23)

Now notice that, by the integral Equations (20) and Jensen's inequality,

E_{µ̃0,t}[ ‖X̃0 − X̃t‖² ] = E_{µ̃0,t}[ ‖ −βt ∇U(X̃0) + ∫_0^t Ỹs ds + √(2σx²) ∫_0^t dBx_s ‖² ] ≤ 2β²t² E_{µ̃0,t}[ ‖∇U(X̃0)‖² ] + 2t ∫_0^t E_{µ̃0,t}[ ‖Ỹs‖² ] ds + 2σx² d t. (24)

Now observe that, for ζ an optimal coupling between µ̃0 and µa,b and for (X^{a,b}, Y^{a,b}) ∼ µa,b,

E_{µ̃0}[ ‖∇U(X̃0)‖² ] ≤ 2 E_ζ[ ‖∇U(X̃0) − ∇U(X^{a,b})‖² ] + 2 E_{µa,b}[ ‖∇U(X^{a,b})‖² ] ≤ 2L² E_ζ[ ‖X̃0 − X^{a,b}‖² ] + 2 E_{µa,b}[ ‖∇U(X^{a,b})‖² ] = 2L² W0 + 2 E_{µa,b}[ ‖∇U(X^{a,b})‖² ], (25)

where W0 = W₂²(µ̃0, µa,b). Moreover, by denoting Za,b the normalization constant of µa,b,

E_{µa,b}[ ‖∇U(X^{a,b})‖² ] = (1/Za,b) ∫_{R^d} e^{−b‖y‖²/2} ∫_{R^d} ⟨∇U(x), ∇U(x)⟩ e^{−aU(x)} dx dy = ( 1/(a Za,b) ) ∫_{R^d} e^{−b‖y‖²/2} ∫_{R^d} ∆U(x) e^{−aU(x)} dx dy = (1/a) E_{µa,b}[ ∆U(X^{a,b}) ] ≤ dL/a, (26)

where the middle equality follows from integration by parts, and the final inequality follows from L-smoothness. Plugging this into (25) then yields

E_{µ̃0}[ ‖∇U(X̃0)‖² ] ≤ 2L² W0 + 2dL/a, (27)

which, replacing in Equation (24), gives

E_{µ̃0,t}[ ‖X̃0 − X̃t‖² ] ≤ 2β²t² ( 2L² W0 + 2dL/a ) + 2t ∫_0^t E_{µ̃0,t}[ ‖Ỹs‖² ] ds + 2σx² d t. (28)

By Equations (20),

E[ ‖Ỹs‖² ] ≤ 3 E_{µ̃0}[ ‖Ỹ0‖² ] + 3γ²s² E_{µ̃0}[ ‖∇U(X̃0)‖² ] + 6σy² E[ ‖ ∫_0^s e^{−α(s−r)} dBy_r ‖² ] ≤ 6W0 + 6 E_{µa,b}[ ‖Y^{a,b}‖² ] + 3γ²s² ( 2L² W0 + 2dL/a ) + 6σy² d s, (29)

where we used (γ/α)² ( 1 − e^{−αs} )² ≤ γ²s² and E[ ‖ ∫_0^s e^{−α(s−r)} dBy_r ‖² ] ≤ ds.
Now we realize that analogous computations to (26) give that E_{µa,b}[ ‖Y^{a,b}‖² ] = d/b, which, plugged into (29), yields, after rearranging the terms,

E[ ‖Ỹs‖² ] ≤ 6W0 ( 1 + γ²L²s² ) + 6d/b + 6ds ( σy² + γ²Ls/a ). (30)

Returning to Equation (28), we obtain

E_{µ̃0,t}[ ‖X̃0 − X̃t‖² ] ≤ 2β²t² ( 2L² W0 + 2dL/a ) + 2t ∫_0^t [ 6W0 ( 1 + γ²L²s² ) + 6d/b + 6ds ( σy² + γ²Ls/a ) ] ds + 2σx² d t
 = W0 ( 12t² + 4β²L²t² + 4γ²L²t⁴ ) + d ( 2σx²t + 12t²/b + 6σy²t³ + 4β²Lt²/a + 4γ²Lt⁴/a ).

Using that t ≤ h < 1 and recalling the notation defined in Equation (16), we obtain

E_{µ̃0,t}[ ‖X̃0 − X̃t‖² ] ≤ W0 A t² + d B t.

Combining the above with (23), we obtain

d/dt KL(µ̃t ‖ µa,b) ≤ −( min(σx², σy²) / 2 ) Fi(µ̃t ‖ µa,b) + τA W0 t² + τB d t.

Applying the log-Sobolev inequality, we obtain

d/dt KL(µ̃t ‖ µa,b) ≤ −θ KL(µ̃t ‖ µa,b) + τA W0 t² + τB d t.

Rearranging the terms yields

d/dt [ e^{θt} KL(µ̃t ‖ µa,b) ] ≤ e^{θt} ( τA W0 t² + τB d t ).

As t ≤ h,

d/dt [ e^{θt} KL(µ̃t ‖ µa,b) ] ≤ e^{θt} τA W0 h² + e^{θt} τB d h,

which we integrate from 0 to h in t, yielding

e^{θh} KL(µ̃h ‖ µa,b) − KL(µ̃0 ‖ µa,b) ≤ ( (e^{θh} − 1)/θ ) ( τA W0 h² + τB d h ) ≤ 2τA W0 h³ + 2τB d h²,

where we used e^c ≤ 1 + 2c for 0 < c < 1, with c = θh. Using Talagrand's inequality (Talagrand, 1996), which gives W0 ≤ (2/ρ) KL(µ̃0 ‖ µa,b), and denoting KL0 = KL(µ̃0 ‖ µa,b), we get

e^{θh} KL(µ̃h ‖ µa,b) − KL0 ≤ (4/ρ) τA KL0 h³ + 2τB d h².

This implies that

KL(µ̃h ‖ µa,b) ≤ e^{−θh} KL0 ( 1 + (4/ρ) τA h³ ) + 2 e^{−θh} τB d h²,

which implies (18), using (17) and the fact that ln(x) ≤ x − 1 for x > 0. Iteratively applying the result of (18), we obtain, where KLk := KL(µ̃kh ‖ µa,b) for all k,

KLK ≤ e^{−θh/2} KL_{K−1} + B̂h² ≤ e^{−2θh/2} KL_{K−2} + ( e^{−θh/2} + 1 ) B̂h² ≤ · · · ≤ e^{−Kθh/2} KL0 + ( ∑_{k=0}^{K−1} e^{−kθh/2} ) B̂h²,

which, by bounding the finite sum by an infinite sum and using e^{−c} ≤ 1 − 2c/3 for 0 < c < 1, yields (19).
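The final iteration argument can be checked numerically: iterating the one-step recursion of Equation (18) never exceeds the closed-form bound of Equation (19). A minimal sketch under our own naming, using the geometric-sum bound ∑_k e^{−kθh/2} ≤ 3/(θh) from the last step of the proof:

```python
import math

def iterate_one_step_bound(kl0, theta, h, Bhat, K):
    # Iterate KL_{k+1} <= exp(-theta*h/2) * KL_k + Bhat * h^2, as in Equation (18).
    kl, r = kl0, math.exp(-theta * h / 2.0)
    for _ in range(K):
        kl = r * kl + Bhat * h * h
    return kl

def closed_form_bound(kl0, theta, h, Bhat, K):
    # Closed-form bound of Equation (19): exp(-K*theta*h/2) * KL_0 + 3*Bhat*h/theta.
    return math.exp(-K * theta * h / 2.0) * kl0 + 3.0 * Bhat * h / theta
```

For any 0 < θh < 1, the iterated value stays below the closed form, illustrating why bounding the finite geometric sum by its infinite counterpart loses only a constant factor.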