# fast_equilibrium_of_sgd_in_generic_situations__2ac3566b.pdf

Published as a conference paper at ICLR 2024

FAST EQUILIBRIUM OF SGD IN GENERIC SITUATIONS

Zhiyuan Li* (Toyota Technological Institute at Chicago, zhiyuanli@ttic.edu), Yi Wang* (Johns Hopkins University, ywang261@jhu.edu), Zhiren Wang (Pennsylvania State University, zhirenw@psu.edu)

*The authors are listed alphabetically.

ABSTRACT

Normalization layers are ubiquitous in deep learning, greatly accelerating optimization. However, they also introduce many unexpected phenomena during training, for example the Fast Equilibrium conjecture proposed by Li et al. (2020), which states that a scale-invariant normalized network, when trained by SGD with learning rate $\eta$ and weight decay $\lambda$, mixes to an equilibrium in $O(\frac{1}{\eta\lambda})$ steps, as opposed to the classical $e^{O((\eta\lambda)^{-1})}$ mixing time. Recent works by Wang & Wang (2022); Li et al. (2022c) proved this conjecture under different sets of assumptions. This paper aims to answer the fast equilibrium conjecture in full generality by removing the non-generic assumptions of Wang & Wang (2022); Li et al. (2022c) that the minima are isolated, that the region near the minima forms a unique basin, and that the set of minima is an analytic set. Our main technical contribution is to show that, with probability close to 1, trajectories do not escape the attracting basin containing their initial position within exponential time.

1 INTRODUCTION

Normalization layers are ubiquitous and play a fundamental role in modern deep learning, e.g., Batch Normalization (Ioffe & Szegedy, 2015), Group Normalization (Wu & He, 2018), Layer Normalization (Ba et al., 2016), and Weight Normalization (Salimans & Kingma, 2016). Normalization layers not only greatly facilitate optimization and improve trainability, they also bring intriguing new optimization behaviors to neural networks. For example, Li & Arora (2020) showed that normalized networks can be trained by SGD with exponentially increasing learning rates, because training with exponentially increasing learning rates turns out to be equivalent to training with a constant learning rate but with weight decay turned on, as in (1). Here $x_k \in \mathbb{R}^d$ is the parameter of a neural network after the $k$-th step and is updated by

$$x_{k+1} = (1-\lambda)x_k - \eta\,\nabla L_{B_k}(x_k), \qquad (1)$$

where $\lambda$ and $\eta$ are respectively the weight decay parameter and the learning rate, and $L_{B_k}$ is the loss function evaluated on a randomly chosen mini-batch $B_k$. The result of Li & Arora (2020) holds not only for normalized networks but more broadly for all scale-invariant training losses, a popular abstraction of normalized networks in optimization analysis. Mathematically, scale invariance refers to the following property of the loss: $L_B(cx) = L_B(x)$ for all $c > 0$, $x \in \mathbb{R}^d$, and every batch $B$.

Later, Li et al. (2020); Wan et al. (2021) discovered that it is the intrinsic learning rate $\eta\lambda$ that controls the long-term convergence behavior of SGD on a scale-invariant loss with weight decay, (1). The approach that Li et al. (2020) takes to study (1) is to approximate it by the stochastic differential equation (SDE) model of Li et al. (2017; 2019), which is quite common in the literature:

$$dX_t = -\big(\eta\nabla\mathcal{L}(X_t) + \eta\lambda X_t\big)\,dt + \eta\,\sigma(X_t)\,dB^K_t. \qquad (2)$$

Here $\mathcal{L}$ is the average $\frac{1}{K}\sum_{k=1}^{K} L_{B_k}$ over all random batches, $\sigma = \frac{1}{\sqrt{K}}\big(\nabla L_{B_k} - \nabla\mathcal{L}\big)_{k=1}^{K}$ is a $d\times K$ matrix, and $B^K_t$ is the $K$-dimensional Wiener process. Scale invariance of $L_B$ implies that $\mathcal{L}$ is scale-invariant and that $\sigma$ is $(-1)$-homogeneous, i.e.

$$\mathcal{L}(cx) = \mathcal{L}(x), \quad \sigma(cx) = c^{-1}\sigma(x), \quad \forall\, c > 0,\ x \in \mathbb{R}^d. \qquad (3)$$
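To make the setup concrete (a minimal numerical sketch on a toy scale-invariant loss; the quadratic batch losses and the hyperparameters below are assumptions for illustration only, not the networks studied later in the paper), the update (1) and the invariance property $L_B(cx) = L_B(x)$ can be checked as follows.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 10, 8
A = [rng.standard_normal((d, d)) for _ in range(K)]   # hypothetical per-batch data

def loss_batch(x, k):
    """Toy scale-invariant batch loss: depends on x only through u = x/|x|."""
    u = x / np.linalg.norm(x)
    return 0.5 * u @ A[k] @ u

def grad_batch(x, k):
    """Exact gradient of the batch loss; it is (-1)-homogeneous in x, as in (3)."""
    n = np.linalg.norm(x)
    u = x / n
    gu = 0.5 * (A[k] + A[k].T) @ u               # gradient with respect to u
    return (gu - (u @ gu) * u) / n               # chain rule through u = x/|x|

eta, lam = 0.1, 0.01                             # assumed learning rate and weight decay
x = rng.standard_normal(d)
assert abs(loss_batch(3.0 * x, 0) - loss_batch(x, 0)) < 1e-8   # L_B(cx) = L_B(x)

# SGD with weight decay, as in (1): x <- (1 - lam) x - eta * grad L_{B_k}(x)
for step in range(1000):
    k = int(rng.integers(K))
    x = (1.0 - lam) * x - eta * grad_batch(x, k)
print("final |x| =", np.linalg.norm(x))
```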
Li et al. (2020) further proposed the following Fast Equilibrium Conjecture for the SDE approximation of SGD.

Conjecture 1.1 (Fast Equilibrium Conjecture, Li et al. (2020)). Let $F(X, \mathrm{input})$ denote the output of a neural network NN with parameters $X$, and let $X_t$ denote the value of the SDE (2) at time $t$, starting from initial parameter $X_0$. Suppose NN has normalization steps so that $F(X, \mathrm{input})$ is scale-invariant in $X$, i.e. $F(X, \mathrm{input}) = F(cX, \mathrm{input})$ for all $c > 0$. Then for all input values $\mathrm{input}$, the probability distribution of $F(X_t, \mathrm{input})$ stabilizes to an equilibrium in $O(\frac{1}{\eta\lambda})$ steps of SGD updates.

Experiments where the empirically observed rates of convergence are polynomial were contained in the original paper Li et al. (2020), where the Fast Equilibrium Conjecture was first asked. The rate $O(\frac{1}{\eta\lambda})$ is considered to be fast because, according to Langevin dynamics, the time it takes to converge to the Gibbs equilibrium is of exponential order $e^{O((\eta\lambda)^{-1/2})}$. This can be seen by following an analysis similar to those in (Bovier et al., 2004; Shi et al., 2020). The works of Bovier et al. (2004) and Shi et al. (2020) dealt with models without normalization, and the convergence times there are of order $e^{O((\eta\lambda)^{-1})}$. Using a similar method, when normalization is used the convergence time can be shown to be of order $e^{O((\eta\lambda)^{-1/2})}$. This is because Li et al. (2020) proved that the intrinsic learning rate $\eta\lambda$ is replaced by an effective learning rate ($\gamma_t^{-1/2}$ in Li et al. (2020)) for the renormalized parameter vector, which is of order $O((\eta\lambda)^{1/2})$.

The recent paper (Li et al., 2022c), using a mathematical framework from (Li et al., 2022b), established the fast equilibrium conjecture for $\eta\lambda \to 0$ under a mixed set of generic and non-generic assumptions. See (Damian et al., 2021), (Gu et al., 2022) for more work on analyzing the dynamics of SGD near the manifold of minimizers. The goal of the current paper is to remove the non-generic assumptions and thus provide a general proof in the aforementioned range of parameters.

1.1 NOTATIONS AND ASSUMPTIONS

To introduce assumptions from previous authors as well as our results, we first need to set up a few notations. Let $\Gamma \subseteq \mathbb{R}^d\setminus\{0\}$ be the set of local minima of $\mathcal{L}$. Notice that by (3), $\Gamma$ is a cone, i.e. $x \in \Gamma$ if and only if $cx \in \Gamma$ for all $c > 0$. For all $r > 0$, write $\Gamma_r = \{x \in \Gamma : |x| = r\}$. In particular $\Gamma_1$ is a subset of the unit sphere $S^{d-1} = \{|x| = 1\}$. In general, $\Gamma$ may have multiple connected components. Decompose $\Gamma = \bigsqcup_i \Gamma^i$, where each $\Gamma^i$ is a connected cone. We then write $\Gamma^i_r = \Gamma^i \cap \{|x| = r\}$. Then the $\Gamma^i_1$ are the connected components of $\Gamma_1$. In particular, there are only finitely many $\Gamma^i$'s and we index them by $i = 1, \dots, m$.

In addition to the scaling properties (3) guaranteed by the use of normalization, (Li et al., 2022c) made certain assumptions, which we will need in the following.

Assumption 1.2. The functions $\mathcal{L}$ and $\sigma$ satisfy:

(i) (Scale invariance) The scaling rules in (3) hold.

(ii) (Regular critical locus) Each loss function $L_{B_k}$ is $C^4$ on $\mathbb{R}^d\setminus\{0\}$, and the critical points of $\mathcal{L}$ form a $C^2$ submanifold $\Omega$ (different connected components of $\Omega$ are not required to have the same dimension). For all $x \in \Omega$, $\nabla^2\mathcal{L}(x)$ is of rank $d - \dim T_x\Omega$.

(iii) (Controllability) For all $x \in \Gamma_1$, $\operatorname{span}\{\partial\Phi(x)\sigma_k(x)\}_{k=1}^{K} = T_x\Gamma_1$.

Here and below, $\Phi(x) = \lim_{t\to\infty} X_t$, with $X_t$ being the solution to the deterministic gradient flow $\dot X_t = -\nabla\mathcal{L}(X_t)$ with initial value $x$. By (Arora et al., 2022, Lemma B.15), under Assumption 1.2.(i) & (ii), $\Phi$ is well defined and $C^2$-differentiable on a neighborhood of $\Gamma$ as long as $\mathcal{L}$ is $C^4$-differentiable.
We also note that in general the noise structure does affect the convergence rate. But as long as Assumption 1.2.(iii) is satisfied, the noise structure will not affect the asymptotic order of the convergence rate.

On the other hand, we will not need the following assumptions.

Assumption 1.3. $\Gamma$ satisfies:

(i) (Unique basin) $\Gamma_1$ is compact and connected;

(ii) (Analyticity) $\Gamma$ is a real analytic manifold and $\operatorname{Tr}\Sigma$ is a real analytic function on $\mathbb{R}^d\setminus\{0\}$, where $\Sigma = \sigma\sigma^\top$.

Restricting to an attracting basin $U$ of $\Gamma$ and assuming both Assumptions 1.2 and 1.3, Li et al. (2022c) proved Conjecture 1.1 as $\lambda\eta \to 0$ in the natural range $\eta \le O(\lambda) \le O(1)$, provided the parameter $X_0$ is initialized within $U$. Note that since $U$ is an attracting basin, $\Omega$ and $\Gamma$ coincide in $U$, and thus Assumption 1.2 is equivalent to (Li et al., 2022c, Assumption 2.1) for the purpose of that paper. Note that $\Gamma$ is always a submanifold of $\Omega$.

Remark 1.4. All three assumptions in Assumption 1.2 are very natural, for the following reasons: As remarked earlier, the scale-invariance (3) is a consequence of the use of normalization steps inside neural networks. It is a widely used assumption, at least in the case of local minimizers, that the locus is a manifold for overparametrized neural networks, for example in (Fehrman et al., 2020; Arora et al., 2022; Li et al., 2022b). For the locus of global minimizers, this assumption was proved by Cooper (2021). As remarked in (Li et al., 2022b; Cooper, 2021), a main reason for the local minimizers to form manifolds is the overparametrization of modern neural networks. Şimşek et al. (2021) further identify the reason as symmetries arising from overparametrization. In fact, they studied loci of critical points that are not necessarily minima and proved that symmetry-induced critical points form a manifold that satisfies Assumption 1.2.(ii). The philosophy behind Assumption 1.2.(iii) is that the generation of random batches in training is independent of the aforementioned symmetries in the setup of the neural network, and thus the noise generically should not live in subspaces that are invariant under such symmetries. In particular, the same symmetries are generically capable of moving the noise from the random batches so that it spans a tangent space of the same dimension as that of the local manifold of critical points.

Remark 1.5. On the other hand, both conditions in Assumption 1.3 are non-generic: Evidence from (Draxler et al., 2018) suggests that all local minima appearing in realistic training generically come from a connected, relatively flat region of small variation in height, so empirically Assumption 1.3.(i) could be a reasonable approximate assumption. However, the experiments in (Draxler et al., 2018, Fig. 5) show at the same time that in many settings this region is not completely flat and contains non-trivial saddle points. In particular, there could be multiple disconnected basins. In light of this, it is more reasonable to work in the absence of Assumption 1.3.(i). The analyticity of $\Gamma$ and $\operatorname{Tr}\Sigma$ depends on that of the activation functions chosen in the neural network. While many popular activation functions are real analytic, one may always choose to use functions that are differentiable but not analytic, in which case Assumption 1.3.(ii) is in general not guaranteed.
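For instance (an illustrative construction added here for concreteness, not taken from the paper), the standard bump-type nonlinearity below is $C^\infty$ on all of $\mathbb{R}$ but not real analytic at $0$; a network using such an activation is smooth, yet its loss surface and $\operatorname{Tr}\Sigma$ have no reason to be analytic.

```python
import numpy as np

def smooth_nonanalytic(x):
    """phi(x) = exp(-1/x) for x > 0, and 0 for x <= 0.

    phi is infinitely differentiable everywhere, but all of its derivatives
    vanish at 0 while phi is not identically zero near 0, so its Taylor series
    at 0 does not represent it: phi is C-infinity yet not real analytic.
    """
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    pos = x > 0
    out[pos] = np.exp(-1.0 / x[pos])
    return out

# It can be used like any other elementwise activation:
z = np.linspace(-2.0, 2.0, 9)
print(smooth_nonanalytic(z))
```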
In this paper, we will give a general proof of Conjecture 1.1 in the same natural range as in (Li et al., 2022c), assuming only the generic conditions from Assumption 1.2. In particular, we will introduce two arguments that respectively remove the two hypotheses (Unique basin and Analyticity) in Assumption 1.3.

Here we would like to comment on why the conditions in Assumption 1.3 are restrictive. Assuming analyticity is restrictive because the regularity of the loss function is decided by that of the activation function. Even though popular activation functions such as the sigmoid are analytic, a priori one could use smooth but non-analytic functions. The one-basin assumption is restrictive because we see no empirical evidence or proof that $\mathcal{L}$ has only one basin. In fact, the experiments at the end of the paper suggest that there are multiple basins.

We would also like to remark on why the three assumptions in Assumption 1.2 are essential. (i) is essential because without this assumption the SDE would not be equivalent to an SDE on the sphere $S^{d-1}$, which is crucial to our analysis. Without this assumption, a similar analysis could probably be formulated on $\mathbb{R}^d$ instead of in $S^{d-1}$ coordinates, but there would be new technical obstacles to overcome. Since the original fast equilibrium conjecture was asked for normalized neural nets, we restrict our study to the current setting. (ii) is important because otherwise a trajectory may stay near a degenerate critical point (for example a saddle point) for a very long period of time and would not be able to converge within polynomial time. Finally, the reason why we need (iii) is that if the span is not the whole tangent space but only a subspace of it, then the diffusion will be restrained to this subspace, which a priori may be very irregular (possibly fractal), and existing mathematical theory is not sufficient to guarantee a unique equilibrium in the limit.

1.2 MULTIPLE EQUILIBRIA WHEN BASIN IS NOT UNIQUE

It is worth explaining in more detail what happens when the basin is not unique, i.e. $\Gamma$ has multiple connected components and Assumption 1.3.(i) fails. In this situation, our analysis generalizes the work of Wang & Wang (2022) and reveals a three-stage equilibrium phenomenon. The most important property of this phenomenon is the mismatch between practical training and theoretical bounds: the equilibrium distribution of network parameters observed in the time window of a realistic budget is both local in space and temporary in time. It is concentrated near the bottom of the attracting basin containing the initial parameter, and differs from the eventual global Gibbs equilibrium that the distribution of parameters will converge to in exponentially long time. This phenomenon explains the gap between the empirically based Conjecture 1.1 and the previous theoretical estimates from e.g. (Bovier et al., 2004; Shi et al., 2020). See (Frankle et al., 2020), (Gupta et al., 2019) for more work on iterates staying in the same basin for a significant amount of time when started from the same initialization.

The major shortcoming of (Wang & Wang, 2022) is that, while not relying on the uniqueness of the basin, the arguments therein are subject to other non-generic assumptions, namely: (1) all basins are isolated points; (2) the noise $\sigma$ is a standard isotropic Gaussian noise. Our methods allow us to remove these assumptions simultaneously, together with Assumption 1.3.
This is made possible by avoiding the semiclassical analysis of spectra of differential operators, which was developed by Simon (1983) and used in an essential way by previous authors in (Bovier et al., 2004; Shi et al., 2020; Wang & Wang, 2022). Instead, our method is purely probabilistic and shows that the exit time from a given basin is exponentially long. This method is based on an adaptation of the large deviation principle of Dembo & Zeitouni (2010).

2 STATEMENT OF MAIN RESULTS

2.1 PRELIMINARIES ON SDE MODEL

A polar coordinate system was adopted in (Li et al., 2020) to study the SDE model (2). For this purpose, denote by $\bar X_t = \frac{X_t}{|X_t|}$ the unit renormalization of $X_t$, and let $\gamma_t = |X_t|^4\eta^{-2}$. By (Li et al., 2020, Theorem 5.1), (2) is equivalent to

$$d\bar X_t = -\gamma_t^{-1/2}\big(\nabla\mathcal{L}(\bar X_t)\,dt + \sigma(\bar X_t)\,dB^K_t\big) - \tfrac{1}{2}\gamma_t^{-1}\operatorname{Tr}\Sigma(\bar X_t)\,\bar X_t\,dt; \qquad (4)$$

$$\frac{d\gamma_t}{dt} = -4\eta\lambda\gamma_t + 2\operatorname{Tr}\Sigma(\bar X_t). \qquad (5)$$

Recall that $\Sigma = \sigma\sigma^\top$ is a $d\times d$ positive semidefinite symmetric matrix. One may view the motion of $\bar X_t$ as an intrinsic one inside the unit sphere $S^{d-1}$, instead of one inside $\mathbb{R}^d$. From this perspective, (Wang & Wang, 2022, Theorem 3.1) shows that (4) can be rewritten as an intrinsic SDE on $S^{d-1}$,

$$d\bar X_t = -\gamma_t^{-1/2}\big(\nabla_{S^{d-1}}\mathcal{L}(\bar X_t)\,dt + \tilde\sigma(\bar X_t)\,dB^K_t\big), \qquad (6)$$

where $\tilde\sigma(\cdot)$ is a tensor field along $S^{d-1}$ whose value is given by the restriction of $\sigma(\cdot)$, and $\nabla_{S^{d-1}}$ is the gradient operator on $S^{d-1}$. (See the remark after (Wang & Wang, 2022, Theorem 3.1) for the meaning of being intrinsic. In particular, $\nabla_{S^{d-1}}\mathcal{L}(\bar X_t)\,dt$ and $\tilde\sigma(\bar X_t)\,dB^K_t$ are viewed as vector fields along $S^{d-1}$.)

Remark 2.1. Instead of the term $\sigma(\bar X_t)\,dB^K_t$ in (4) and (6), the papers (Li et al., 2020; Wang & Wang, 2022) actually used the restriction of $\Sigma(\bar X_t)^{1/2}\,dB^d_t$ to $S^{d-1}$. However, these two expressions are equivalent as Wiener processes because $\Sigma = \sigma\sigma^\top$.

Since (6) is a perturbation by Brownian noise of the gradient flow $d\bar X_t = -\gamma_t^{-1/2}\nabla_{S^{d-1}}\mathcal{L}(\bar X_t)\,dt$ with varying learning rate $\gamma_t^{-1/2}$, it makes sense to first understand the constant speed gradient flow

$$\dot X_t = -\nabla_{S^{d-1}}\mathcal{L}(X_t), \qquad (7)$$

which is an ODE on the compact manifold $S^{d-1}$. Following earlier notation, the set of local minima of $\mathcal{L}$ on $S^{d-1}$ is $\Gamma_1$, with connected components $\Gamma^i_1$. Write $U^i_1 \subseteq S^{d-1}$ for the attracting basin of $\Gamma^i_1$, i.e. the set of $X_0 \in S^{d-1}$ such that the solution $X_t$ to (7) with initial value $X_0$ satisfies $\lim_{t\to\infty} X_t \in \Gamma^i_1$.

Lemma 2.2. Under Assumption 1.2.(ii), the $U^i_1$'s for $i = 1, \dots, m$ are disjoint open subsets of $S^{d-1}$. Moreover, their union $\bigsqcup_{i=1}^m U^i_1$ has full volume in $S^{d-1}$, and $S^{d-1}\setminus\bigsqcup_{i=1}^m U^i_1$ is a proper submanifold of $S^{d-1}$.

The proof of Lemma 2.2 is standard and we leave it to the reader. The key observation is that the complement $S^{d-1}\setminus\bigsqcup_{i=1}^m U^i_1$ is the union of the attracting basins of the connected components of critical points that are not local minima. By Assumption 1.2.(ii), those critical points are saddle-like and their attracting basins are proper submanifolds.

Note that $\mathcal{L}$ is constant on $\Gamma^i_1$ and $\Gamma^i$, and $\mathcal{L}(\Gamma^i) = \mathcal{L}(\Gamma^i_1)$ is the minimum of $\mathcal{L}$ inside $U^i_1$. For $q > 0$, write $U^{i,q}_1 := \{x \in U^i_1 : \mathcal{L}(x) \le \mathcal{L}(\Gamma^i_1) + q\}$. Define the cones

$$U^i = \Big\{x \in \mathbb{R}^d\setminus\{0\} : \tfrac{x}{|x|} \in U^i_1\Big\}, \qquad U^{i,q} = \Big\{x \in \mathbb{R}^d\setminus\{0\} : \tfrac{x}{|x|} \in U^{i,q}_1\Big\}.$$

Then $U^i$ is the attracting basin of $\Gamma^i$ under the gradient flow

$$\dot X_t = -\nabla\mathcal{L}(X_t). \qquad (8)$$

By (Arora et al., 2022, Lemma B.15), under Assumption 1.2.(i), $U^i$ is open and the function $\Phi(x) = \lim_{t\to\infty} X_t$ is $C^2$-differentiable on $U^i$.
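As an illustration of the polar description above (a minimal simulation sketch on a toy scale-invariant loss; the batch losses, the batch count, and the hyperparameters below are assumptions for the example, not the settings used later in the paper), one can discretize (2) by an Euler-Maruyama scheme and track $\bar X_t$ and $\gamma_t$; by (5), $\gamma_t$ relaxes on the $O((\eta\lambda)^{-1})$ time scale.

```python
import numpy as np

rng = np.random.default_rng(1)
d, K = 10, 8
A = [rng.standard_normal((d, d)) for _ in range(K)]   # hypothetical per-batch data

def grads_and_sigma(x):
    """Mean gradient and noise matrix sigma for K toy scale-invariant batch losses.

    Each batch loss is L_k(x) = 0.5 * (x/|x|)^T A_k (x/|x|), which is 0-homogeneous,
    so its gradient is (-1)-homogeneous as required by (3).
    """
    n = np.linalg.norm(x)
    u = x / n
    cols = []
    for Ak in A:
        gu = 0.5 * (Ak + Ak.T) @ u              # gradient with respect to u
        cols.append((gu - (u @ gu) * u) / n)    # chain rule through u = x/|x|
    G = np.stack(cols, axis=1)                  # d x K matrix of batch gradients
    g_mean = G.mean(axis=1)
    sigma = (G - g_mean[:, None]) / np.sqrt(K)  # so that Sigma = sigma sigma^T
    return g_mean, sigma

eta, lam, dt = 0.05, 0.02, 1.0                  # assumed hyperparameters; dt = one step
x = 5.0 * rng.standard_normal(d)
for step in range(20001):
    g, sigma = grads_and_sigma(x)
    dB = np.sqrt(dt) * rng.standard_normal(K)
    x = x - (eta * g + eta * lam * x) * dt + eta * sigma @ dB   # Euler-Maruyama for (2)
    if step % 5000 == 0:
        gamma = np.linalg.norm(x) ** 4 / eta ** 2               # gamma_t = |X_t|^4 / eta^2
        print(f"step {step:6d}   |X_t| = {np.linalg.norm(x):.3f}   gamma_t = {gamma:.1f}")
```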
2.2 MAIN RESULT

Definition 2.3. We define the Lipschitz distance between two probability measures $\mu, \nu$ on a metric space $X$ as
$$\operatorname{dist}_{\operatorname{Lip}}(\mu,\nu) = \sup_{\|\varphi\|_{\operatorname{Lip}} \le 1}\left|\int \varphi\,d\mu - \int \varphi\,d\nu\right|,$$
where the Lipschitz norm of a function $\varphi$ is given by $\|\varphi\|_{\operatorname{Lip}} := \max\Big(\|\varphi\|_{C^0},\ \sup_{x\neq y}\frac{|\varphi(x)-\varphi(y)|}{\operatorname{dist}(x,y)}\Big)$.

We are now able to state our main theorem, which strengthens both (Li et al., 2022c, Theorem 5.5) and (Wang & Wang, 2022, Theorem 4.6).

Theorem 2.4. Under Assumption 1.2, for all $\epsilon > 0$ and every compact interval $[\rho_-, \rho_+] \subset (0,\infty)$, there exist a constant $c > 0$ and a set $\Lambda \subseteq \bigsqcup_{i=1}^m U^i_1 \subseteq S^{d-1}$ of volume $\operatorname{vol}_{S^{d-1}}(\Lambda) > 1-\epsilon$, such that for all $\eta \le O(\lambda) \le O(1)$ the following holds: For every initial parameter $x_0 \in \mathbb{R}^d$ with $|x_0| \in [\rho_-, \rho_+]$ and $\frac{x_0}{|x_0|} \in \Lambda$, every growth rate $K$ such that $K \to \infty$ as $\eta\lambda \to 0$, and all time values $t \in \big[\frac{K\log(1+\lambda/\eta)}{\eta\lambda},\ e^{c(\eta\lambda)^{-1}}\big]$, the random trajectory of (2) with initial value $x_0$ satisfies
$$\operatorname{dist}_{\operatorname{Lip}}\big(\mathbb{P}_{X_0=x_0}(X_t),\ \nu^i\big) < \epsilon,$$
where $\nu^i$ is a probability measure supported on the attractor $\Gamma^i$ of the unique attracting basin $U^i$ containing $x_0$, and $\nu^i$ depends only on $\mathcal{L}$, $\sigma$ and $i$.

Here $\operatorname{vol}_{S^{d-1}}$ is the renormalized volume on the sphere $S^{d-1}$, so that the total mass of $S^{d-1}$ is 1.

3 REMOVAL OF ANALYTICITY ASSUMPTION

In this part, we will prove that Assumption 1.3.(ii) on analyticity (which is (Li et al., 2022c, Assumption 5.3)) is unnecessary for (Li et al., 2022c, Theorem 5.4), and thus (Li et al., 2022c, Theorem 1.2) holds without such an assumption as well.

Proposition 3.1. Under Assumption 1.2 and Assumption 1.3.(i), the conclusion of (Li et al., 2022c, Theorem 1.2) holds.

For this purpose, we temporarily adopt the setting of (Li et al., 2022c). In other words, Assumption 1.3.(i) is assumed, $\Gamma_1$ is a compact connected submanifold of $S^{d-1}$, and $\Gamma = \{rx : r > 0,\ x \in \Gamma_1\}$. By (Li et al., 2022c, Theorem 4.1) it suffices to prove the (qualitative) mixing of the SDE (Li et al., 2022c, Equation (13)) on $\Gamma$ towards a unique invariant probability measure. Using the notations from (Li et al., 2022c, Chapter E.3), this SDE writes

$$dY_t = f_0(Y_t)\,dt + \sum_{k=1}^{K} f_k(Y_t)\,dB_{k,t}, \qquad (9)$$

where the $f_k$ are certain vector fields along the radial cone $\Gamma \subseteq \mathbb{R}^d\setminus\{0\}$ and $B_{k,t}$, $k = 1, \dots, K$, are i.i.d. Wiener processes. Indeed, $f_1, \dots, f_K$ are orthogonal to the radial direction, with $f_k(x) = \partial\Phi(x)\sigma_k(x)$ for $1 \le k \le K$, and they span $T_x\Gamma \cap x^{\perp} = T_x\Gamma_{|x|}$ for all $x \in \Gamma$, where $\Gamma_r = \{x \in \Gamma : |x| = r\}$; and $f_0$ has the form
$$f_0(x) = -\tfrac{1}{2}\,x + \tfrac{1}{2}\,\partial^2\Phi(x)[\Sigma(x)] - \tfrac{1}{2}\sum_{k=1}^{K}\partial f_k(x)\,f_k(x).$$

In the proofs from (Li et al., 2022c), analyticity is only used in Chapter F.4, when $\operatorname{Tr}\Sigma$ is not constant on $\Gamma_1$. In this case, it was proved there (without using analyticity) that

$$\Gamma^\diamond := \{x \in \Gamma : r_- \le |x| \le r_+\} \quad\text{with}\quad r_- = \min_{\Gamma_1}(\operatorname{Tr}\Sigma)^{1/4},\quad r_+ = \max_{\Gamma_1}(\operatorname{Tr}\Sigma)^{1/4} \qquad (10)$$

is the unique invariant control set for the control problem corresponding to (9). Instead of using Kliemann's condition (Kliemann, 1987), which requires the Lie algebra $\ell$ of vector fields generated by $f_0, \dots, f_N$ to be of maximal dimension at all points of $\Gamma^\diamond$ and is the reason for the need of analyticity in (Li et al., 2022c), we will use Arnold-Kliemann's condition (Arnold & Kliemann, 1987), which only requires the Lie algebra $\ell$ to be of maximal dimension at one point of $\Gamma^\diamond$. This condition holds because the projection of $f_0$ to the radial direction is not constantly $0$ if $\operatorname{Tr}\Sigma$ is not constant. (Otherwise $\Gamma^\diamond$ would not be the unique invariant set.) Under this condition, (Arnold & Kliemann, 1987, Theorem 5.1) proved that there is a unique invariant probability measure $\nu$ supported on $\Gamma^\diamond$. The measure $\nu$ then has to be ergodic.
Moreover, (Arnold & Kliemann, 1987, Theorem 5.2) showed that $\nu$ is absolutely continuous with respect to the Riemannian volume on the manifold $\Gamma$. In particular, $\nu(\partial\Gamma^\diamond) = 0$ and $\nu\big((\Gamma^\diamond)^\circ\big) = 1$.

The main issue is to prove the convergence of the distribution $S_{t,x}$ towards $\nu$ as $t\to\infty$, where $S_{t,x}$ denotes the distribution at time $t$ of the trajectories of solutions to (9) starting from $x \in \Gamma^\diamond$. A priori, such convergence is only known to hold for $\frac{1}{T}\int_0^T S_{t,x}\,dt$ by the ergodic theorem. Applying (Duflo & Revuz, 1969, Theorem II.4) to $(\Gamma^\diamond, \nu)$, it suffices to check two conditions to guarantee $S_{t,x} \to \nu$ (in total variation distance):

(Harris's recurrence condition) For all sets $A$ with $\nu(A) > 0$ and all $x \in \Gamma^\diamond$,
$$\mathbb{P}\Big(\int_0^\infty \mathbf{1}_A(X_t)\,dt = \infty\Big) = 1; \qquad (11)$$

(Regularity condition) The absolutely continuous component of $S_{t,x}$ with respect to $\nu$, denoted $S^0_{t,x}$, satisfies $\lim_{t\to\infty} S^0_{t,x}(\Gamma^\diamond) = 1$ for all $x \in \Gamma^\diamond$.

Let us first verify Harris's recurrence condition. Define $D \subseteq \Gamma^\diamond$ by $D = \{x \in \Gamma^\diamond : f_0(x) \notin T_x\Gamma_{|x|}\}$. Then $D$ is open in $\Gamma^\diamond$. For all $x \in \Gamma^\diamond$ and $A \subseteq \Gamma^\diamond$, we define the random variable $\tau_{x,A} \ge 0$ to be the first entering time into $A$ of a trajectory of (9) starting at $x$.

Lemma 3.2. $\mathbb{P}(\tau_{x,D} < \infty) = 1$ for all $x \in \Gamma^\diamond$.

Proof of (Harris's recurrence condition). By (Arnold & Kliemann, 1987, Theorem 6.1), (11) holds for all $x \in D \cap \operatorname{int}\Gamma^\diamond = D$. By Lemma 3.2, it then holds for all $x \in \Gamma^\diamond$. This verifies Harris's recurrence condition.

We now verify the Regularity condition. In addition to the Lie algebra $\ell$ and the set $D$, define a Lie algebra $\ell_0 \subseteq T\Gamma^\diamond$ and an open set $D_0$ by
$$\ell_0 = \operatorname{span}(f_1, \dots, f_N) + [\ell, \ell], \qquad D_0 = \{x \in \Gamma^\diamond : \dim \ell_0(x) = \dim\Gamma^\diamond\}.$$
It is easy to see that $\ell_0$ is indeed a Lie algebra and $D_0$ is a relatively open subset of $\Gamma^\diamond$.

Lemma 3.3. The set $D_0$ has non-empty interior.

We postpone the proofs of Lemmas 3.2 and 3.3 to Appendix A.

Lemma 3.4. For all $x \in D_0$ and $t > 0$, $S^0_{t,x}(\Gamma^\diamond) > 0$.

Proof. By (Ichihara & Kunita, 1974, Lemma 2.1), at every $x \in \operatorname{int} D_0$, the second order differential operator on the right hand side of (9) is elliptic (non-degenerate) at $x$. The lemma follows.

Proof of (Regularity condition). By Harris's recurrence condition and Lemma 3.3, for all $x \in \Gamma^\diamond$, $\mathbb{P}(\tau_{x,D_0} < \infty) > 0$, and thus there exists $t_0(x) > 0$ such that $\mathbb{P}(\tau_{x,D_0} < t_0(x)) > 0$. In other words, on a subset $\Omega_{x,D_0} \subseteq \Omega$ of stochastic events $\omega$ with $\mathbb{P}(\Omega_{x,D_0}) > 0$, there exists $\tau^*_{x,D_0} = \tau^*_{x,D_0}(\omega) \in (0, t_0(x))$ such that the solution starting at $Y(0) = x$ satisfies $Y(\tau^*_{x,D_0}) \in D_0$. By Lemma 3.4, for all $t > t_0(x)$,
$$S^0_{t,x}(\Gamma^\diamond) \ \ge\ \int_{\Omega_{x,D_0}} S^0_{t-\tau^*_{x,D_0},\,Y(\tau^*_{x,D_0})}(\Gamma^\diamond) \ >\ 0.$$
This shows that the statement "for $\nu$-a.e. $x$, $S^0_{t,x}(\Gamma^\diamond) = 0$ for all $t > 0$" is false. By (Duflo & Revuz, 1969, Proposition, p. 235), this guarantees the Regularity condition.

We have by now completed the proof of the mixing property $S_{t,x} \to \nu$ under the generic Assumption 1.2 and Assumption 1.3.(i).

4 REMOVAL OF UNIQUE BASIN ASSUMPTION

We now stop assuming Assumption 1.3.(i) and decompose $\Gamma = \bigsqcup_i \Gamma^i$, where each $\Gamma^i$ is a connected cone. Write $\Gamma_\# = \Gamma \cap \{|x| \in [R_-, R_+]\}$, where we fix $R_- < \big(\eta\min_{|x|=1}\operatorname{Tr}\Sigma(x)\big)^{1/4}$ and $R_+ > \big(\eta\max_{|x|=1}\operatorname{Tr}\Sigma(x)\big)^{1/4}$ near these bounds. Note that $\Gamma \cap \big\{|x| \in \big[\big(\eta\min_{|x|=1}\operatorname{Tr}\Sigma(x)\big)^{1/4},\ \big(\eta\max_{|x|=1}\operatorname{Tr}\Sigma(x)\big)^{1/4}\big]\big\}$ is an invariant control set of (2). Moreover, it was proved in (Li et al., 2020) that for a given initial radius $|x_0|$, the radius $|X_t|$ of (2) starting at $x_0$ will almost surely be inside $[R_-, R_+]$ for $t \ge O((\eta\lambda)^{-1})$. Write $\Gamma^i_r = \Gamma^i \cap \{|x| = r\}$ and $\Gamma^i_\# = \Gamma^i \cap \Gamma_\#$. Then the $\Gamma^i_1$ are the connected components of the manifold $\Gamma_1$. In particular, there are only finitely many $\Gamma^i$'s. Write $d_i = \dim\Gamma^i$.
For $p > 0$, write $U^i_{1,p}$ for the $p$-neighborhood of $\Gamma^i_1$ in $S^{d-1}$, $U^i_{r,p} = r\,U^i_{1,p}$ and $U^i_{\#,p} = \bigcup_{r\in[R_-,R_+]} U^i_{r,p}$. Fix from now on a sufficiently small parameter $p_0$ such that the $U^i_{1,p_0}$ are disjoint for distinct $i$'s, and $\Gamma^i_1$ is the set of all critical points of $\mathcal{L}$ inside $U^i_{1,10p_0}$. This is possible because of Assumption 1.2.(ii).

Proposition 4.1. For all sufficiently small $p_1$ (to be determined later), given $K > 1$ and $\epsilon > 0$, there exists a subset $\Lambda_{K,\epsilon} \subseteq S^{d-1}$ with $m_{S^{d-1}}(\Lambda_{K,\epsilon}) > 1-\epsilon$ such that for all $x_0$ with $|x_0| \in [\frac{1}{K}, K]$ and $\frac{x_0}{|x_0|} \in \Lambda_{K,\epsilon}$, with probability $> 1-\epsilon$ the trajectory of (2) starting at $x_0$ enters $\bigcup_i U^i_{\#,p_1}$ at some time $t \in [0, C_{\mathrm{des}}(\eta\lambda)^{-1}]$, for some constant $C_{\mathrm{des}}$.

The proof of the proposition, which we omit, is a simple combination of (Li et al., 2020, Equation (7)) and the proof of (Wang & Wang, 2022, Theorem 4.5). The proposition allows us to assume that our initial point is inside one of the finitely many basins $U^i_{\#,p_1}$.

To prove the main result, it now suffices to make two observations: First, with very high probability, the trajectory will not escape from the basin in exponential time. Second, as long as the trajectory remains in the basin, its distribution always mixes towards a unique probability measure $\nu^i$ supported at the bottom $\Gamma^i$ of the basin.

The first property is stated as Proposition 4.2 below.

Proposition 4.2. There exist $C > 0$ and sufficiently small $p_0 > p_1 > 0$ such that for all $i$ and $x_0 \in U^i_{\#,p_1}$, in the regime $\eta \le O(\lambda) \le O(1)$, the solutions $X_t$ to (2) with initial position $x_0$ satisfy
$$\lim_{\lambda\eta\to 0}\ \mathbb{P}_{X_0=x_0}\Big(X_t \text{ remains in } U^i_{\#,p_0} \text{ for all } t \in \big[0,\ e^{C(\eta\lambda)^{-1}}\big]\Big) = 1.$$
The convergence is uniform with respect to the initial data $x_0$.

It is similar in nature to (Wang & Wang, 2022, Lemma E.6) but requires a more sophisticated proof. This is the main theoretical component of this paper. The argument will be based on the large deviation principle of Dembo-Zeitouni in Dembo & Zeitouni (2010, Chapter 5), which was an adaptation of (Freidlin & Wentzell, 2012, Chapter 6). The reason why Freidlin-Wentzell's original theory cannot be applied here, as it was in (Wang & Wang, 2022), is that the diffusion in the SDE system (44), (45) is degenerate. Dembo-Zeitouni's work allows degenerate diffusions. However, further modifications to (Dembo & Zeitouni, 2010) are needed in our case, as the first order drift in (45) depends on $\gamma_t = |X_t|^4\eta^{-2}$. We will treat $\gamma_t$ as a control variable. The full proof will be in Appendix B.

The second property follows from the main results (Theorem 5.1 & Theorem 6.7) in (Li et al., 2022c) and is restated as Proposition 4.3 here with an additional emphasis on uniformity. A more detailed discussion can be found in Appendix C. Recall that (Li et al., 2022c) also assumes Assumption 1.3.(ii), but that can be dropped by the discussion in Section 3 above.

Proposition 4.3. Under Assumption 1.2, for all $K > 0$ there are sufficiently small $p_0 > p_1 > 0$ such that in the regime $\eta \le O(\lambda) \le O(1)$ and $\eta\lambda \to 0$ the following holds: For each index $i$ and all initial parameters $x_0 \in U^i_{\#,p_1}$, the distribution of the trajectories $\{X_t\}_{0 \le t \le \frac{K\log(1+\lambda/\eta)}{\eta\lambda}}$ that do not leave $U^i_{\#,p_1}$ converges to the distribution of the trajectories $\{\hat X_t\}$ of a fixed SDE model (the Katzenberger model) supported on $\Gamma^i$ with initial position $\Phi(x_0)$. The convergence is uniform in $x_0$. Moreover, as $K \to \infty$ the trajectories $\{\hat X_t\}$ mix uniformly (with respect to different $x_0$'s) towards a fixed equilibrium measure $\nu^i$.

Our main theorem, Theorem 2.4, then follows from combining Propositions 4.1, 4.2 and 4.3.
Proof of Theorem 2.4. By Proposition 4.1, after ignoring $O(\frac{1}{\eta\lambda})$ time at the beginning, as well as an $o(1)$ portion of stochastic events, one may assume $x_0 \in U^i_{\#,p}$ for some $i$. For $t$ within the range in the statement of Theorem 2.4, again after ignoring an $o(1)$ portion of events, one may assume that all trajectories under consideration stay within $U^i_{\#,p}$ up to time $t$. Because of the lower bound on $t$, one may consider a window of length $\frac{K\log(1+\lambda/\eta)}{\eta\lambda}$ that ends at $t$, where $K \to \infty$. The distribution of trajectories along this window is the average of distributions over different initial positions. By Proposition 4.3, all such components uniformly mix towards $\nu^i$. The proof is completed.

5 EXPERIMENTS

Mixing on the local manifold: The key technical observation of this paper (Proposition 4.2) is that the distribution of trajectories with a given initial position is trapped locally in the attracting basin containing that initial position during any practical observation window. Using the method from (Wang & Wang, 2022, Fig. 13), this observation is supported by the experiment below: using a reduced MNIST dataset with only 1280 samples and a small CNN with 1786 parameters (so that the model is still overparametrized), we ran 15 independent instances of SGD, at $\lambda = \eta = \frac{1}{32}$, for each of two randomly chosen initial parametrizations. Each instance lasts 0.8 million steps of SGD. A similar experiment was run for a reduced CIFAR10 dataset with 1280 samples, a CNN model with 2658 parameters, $\eta = \frac{1}{1024}$, $\lambda = \frac{1}{32}$, and 1.28 million SGD steps.

In order to show that the distribution arising from each initial position does stabilize towards an equilibrium and that the two equilibria are different, we track the variance within each group and compare it with the average squared distance over all pairs of points from different groups. Namely, denoting by $\{X^k_{i,t}\}$ the $i$-th trajectory starting at initial point $x^k$, where $k = 1, 2$, we compute the following quantities:
$$V_{11}(t) = \mathbb{E}_{i\neq j}\big|X^1_{i,t} - X^1_{j,t}\big|^2, \quad V_{22}(t) = \mathbb{E}_{i\neq j}\big|X^2_{i,t} - X^2_{j,t}\big|^2, \quad V_{12}(t) = \mathbb{E}_{i,j}\big|X^1_{i,t} - X^2_{j,t}\big|^2.$$

Figures 1 and 2 show that $V_{11}$ and $V_{22}$ stabilize near similar but different values, while $V_{12}$ stabilizes at a much larger value. This suggests that the distributions of trajectories with starting points $x^1$ and $x^2$ mix towards equilibria whose supports have similar scales, but that these two equilibria are far apart from each other.

Figure 1: small MNIST: Variance comparison between distributions with different initial positions.
Figure 2: small CIFAR: Variance comparison between distributions with different initial positions.
Figure 3: small MNIST: Comparison between training losses before and after SWAP.
Figure 4: small CIFAR: Comparison between training losses before and after SWAP.

Prediction on the failure of stochastic weight averaging in parallel (SWAP): Our theory predicts that if the local manifold of minimizers $\Gamma^i$ has non-trivial geometry, that is, if the average of parameters on the manifold may fall off the manifold, then averaging might fail to decrease the loss, or may even increase it, once SGD has mixed on the local manifold. We apply stochastic weight averaging in parallel (SWAP) to the neural network parameters at each step over the 15 independent instances with the same initial position and compute the loss function at the averaged parameter. SWAP, a variant of stochastic weight averaging (SWA) from Izmailov et al. (2018), was proposed by Gupta et al. (2020).
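For concreteness (a minimal sketch of the two diagnostics just described, written against hypothetical trajectory arrays rather than the actual experiment code), the quantities $V_{11}$, $V_{22}$, $V_{12}$ and the SWAP average can be computed as follows.

```python
import numpy as np

def pairwise_stats(X1, X2):
    """Variance-style diagnostics for two groups of SGD trajectories.

    X1, X2: arrays of shape (n_runs, d) holding the parameters of the
    independent runs of each group at a fixed training step t
    (hypothetical layout; the paper uses 15 runs per initial point).
    Returns V11, V22 (mean squared distance over distinct pairs within a
    group) and V12 (mean squared distance over pairs across groups).
    """
    def within(X):
        n = len(X)
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # n x n squared distances
        return d2[~np.eye(n, dtype=bool)].mean()              # average over i != j
    def across(X, Y):
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return d2.mean()
    return within(X1), within(X2), across(X1, X2)

def swap_average(X):
    """SWAP-style average: mean of the parameters of the parallel runs in one group."""
    return X.mean(axis=0)

# Example with synthetic data standing in for two groups of 15 runs each.
rng = np.random.default_rng(2)
X1 = rng.standard_normal((15, 1786)) + 1.0   # group started at x^1 (synthetic)
X2 = rng.standard_normal((15, 1786)) - 1.0   # group started at x^2 (synthetic)
V11, V22, V12 = pairwise_stats(X1, X2)
print(V11, V22, V12)                          # V12 is markedly larger than V11, V22
x_swap = swap_average(X1)                     # parameter at which the loss is re-evaluated
```

In this synthetic example, $V_{12}$ exceeds $V_{11}$ and $V_{22}$ simply because the two groups are centered at different points; in the experiments, the analogous persistent gap is what indicates two distinct local equilibria.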
Figures 3, 4 show that although the loss of the SWAP parameter improves the average loss over the independent instances at the beginning, the improvement quickly breaks after a couple thousands of training steps. This phenomenon verifies our theoretical prediction and also suggests that the support of the equilibrium is not a convex set but rather a manifold of curved shape. 6 CONCLUSION We give rigorous proof of fast equilibrium conjecture in generic situations, removing previous assumptions that there is only one basin and the set of minima is analytic. The main technical contribution is that we justify most of the trajectories of SDE would not escape from one basin within exponential time. Instead of using spectral analysis, we adopt the large deviation principle type of argument. Possible interesting direction may include understanding the dependence of mixing time on dimension, architecture and noise structure. Published as a conference paper at ICLR 2024 7 ACKNOWLEDGEMENT Y.W. and Z.W. acknowledge respectively supports from NSF. Ludwig Arnold and Wolfgang Kliemann. On unique ergodicity for degenerate diffusions. Stochastics, 21(1):41 61, 1987. Sanjeev Arora, Zhiyuan Li, and Abhishek Panigrahi. Understanding gradient descent on the edge of stability in deep learning. In Proceedings of the 39th International Conference on Machine Learning, pp. 948 1024, 2022. Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. ar Xiv preprint ar Xiv:1607.06450, 2016. Anton Bovier, Michael Eckhoff, V eronique Gayrard, and Markus Klein. Metastability in reversible diffusion processes i: Sharp asymptotics for capacities and exit times. Journal of the European Mathematical Society, 006(4):399 424, 2004. Yaim Cooper. Global minima of overparameterized neural networks. SIAM Journal on Mathematics of Data Science, 3(2):676 691, 2021. Berfin S ims ek, Franc ois Ged, Arthur Jacot, Francesco Spadaro, Clement Hongler, Wulfram Gerstner, and Johanni Brea. Geometry of the loss landscape in overparameterized neural networks: Symmetries and invariances. In Proceedings of the 38th International Conference on Machine Learning, pp. 9722 9732, 2021. Alex Damian, Tengyu Ma, and Jason Lee. Label noise sgd provably prefers flat global minimizers. In Advances in Neural Information Processing Systems, pp. 27449 27461, September 2021. Amir Dembo and Ofer Zeitouni. Large deviations techniques and applications, volume 38 of Stochastic Modelling and Applied Probability. Springer-Verlag, Berlin, 2010. ISBN 978-3-64203310-0. doi: 10.1007/978-3-642-03311-7. Corrected reprint of the second (1998) edition. Felix Draxler, Kambis Veschgini, Manfred Salmhofer, and Fred A. Hamprecht. Essentially no barriers in neural network energy landscape. In Proceedings of the 35th International Conference on Machine Learning,, volume 80, pp. 1308 1317, 2018. M. Duflo and D. Revuz. Propri et es asymptotiques des probabilit es de transition des processus de Markov r ecurrents. Ann. Inst. H. Poincar e Sect. B (N.S.), 5:233 244, 1969. K.J. Falconer. Differentiation of the limit mapping in a dynamical system. Journal of the London Mathematical Society, s2-27(2):356 372, 1983. Benjamin Fehrman, Benjamin Gess, and Arnulf Jentzen. Convergence rates for the stochastic gradient descent method for non-convex objective functions. Journal of Machine Learning Research, 21(136):1 48, 2020. Jonathan Frankle, Gintare Karolina Dziugaite, Daniel Roy, and Michael Carbin. Linear mode connectivity and the lottery ticket hypothesis. 
In In Proceedings of the 37th International Conference on Machine Learning, pp. 3259 3269, November 2020. Mark I. Freidlin and Alexander D. Wentzell. Random perturbations of dynamical systems, volume 260 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer, Heidelberg, third edition, 2012. ISBN 978-3-642-25846-6. doi: 10.1007/978-3-642-25847-3. Translated from the 1979 Russian original by Joseph Sz ucs. Xinran Gu, Kaifeng Lyu, Longbo Huang, and Sanjeev Arora. Why (and when) does local sgd generalize better than sgd? In The International Conference on Learning Representations, September 2022. Published as a conference paper at ICLR 2024 Vipul Gupta, Santiago Akle Serrano, and De Coste De Coste. Stochastic weight averaging in parallel: Large-batch training that generalizes well. In The International Conference on Learning Representations, September 2019. Vipul Gupta, Santiago Akle Serrano, and Dennis De Coste. Stochastic weight averaging in parallel: Large-batch training that generalizes well. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. Open Review.net, 2020. URL https://openreview.net/forum?id=ryg FWAEFw S. Kanji Ichihara and Hiroshi Kunita. A classification of the second order degenerate elliptic operators and its probabilistic characterization. Z. Wahrscheinlichkeitstheorie und Verw. Gebiete, 30:235 254, 1974. doi: 10.1007/BF00533476. Sergey Ioffe and Christian Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning, pp. 448 456, 2015. Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. In Ricardo Silva, Amir Globerson, and Amir Globerson (eds.), 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018, 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018, pp. 876 885. Association For Uncertainty in Artificial Intelligence (AUAI), January 2018. 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018 ; Conference date: 06-08-2018 Through 10-08-2018. Gary Shon Katzenberger. Solutions of a stochastic differential equation forced onto a manifold by a large drift. The Annals of Probability, pp. 1587 1628, 1991. Wolfgang Kliemann. Recurrence and invariant measures for degenerate diffusions. Ann. Probab., 15(2):690 707, 1987. Qianxiao Li, Cheng Tai, and Weinan E. Stochastic modified equations and adaptive stochastic gradient algorithms. In Proceedings of the 34th International Conference on Machine Learning, pp. 2101 2110, 2017. Qianxiao Li, Cheng Tai, and Weinan E. Stochastic modified equations and dynamics of stochastic gradient algorithms i: Mathematical foundations. Journal of Machine Learning Research, 20(40): 1 47, 2019. Zhiyuan Li and Sanjeev Arora. An exponential learning rate schedule for deep learning. In International Conference on Learning Representations, 2020. Zhiyuan Li, Kaifeng Lyu, and S. Arora. Reconciling modern deep learning with traditional optimization analyses: The intrinsic learning rate. Advances in Neural Information Processing Systems 33 pre-proceedings (Neur IPS 2020), 2020. Zhiyuan Li, Tianhao Wang, and Sanjeev Arora. What happens after SGD reaches zero loss? a mathematical framework. 
In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. Open Review.net, 2022a. Zhiyuan Li, Tianhao Wang, and Sanjeev Arora. What happens after SGD reaches zero loss? a mathematical framework. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. Open Review.net, 2022b. Zhiyuan Li, Tianhao Wang, and Dingli Yu. Fast mixing of stochastic gradient descent with normalization and weight decay. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022c. Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901 909, 2016. Bin Shi, Weijie J Su, and Michael I Jordan. On learning rates and schr odinger operators. ar Xiv preprint ar Xiv:2004.06977, 2020. Published as a conference paper at ICLR 2024 Barry Simon. Semiclassical analysis of low lying eigenvalues. I. Nondegenerate minima: asymptotic expansions. Ann. Inst. H. Poincar e Sect. A (N.S.), 38(3):295 308, 1983. ISSN 0246-0211. Ruosi Wan, Zhanxing Zhu, Xiangyu Zhang, and Jian Sun. Spherical motion dynamics: Learning dynamics of neural network with normalization, weight decay, and sgd. In Advances in Neural Information Processing Systems, volume 34, 2021. Yi Wang and Zhiren Wang. Three-stage evolution and fast equilibrium for SGD with non-degenerate critical points. In Proceedings of the 39th International Conference on Machine Learning, pp. 23092 23113, 2022. Yuxin Wu and Kaiming He. Group normalization. In European Conference on Computer Vision, 2018. A PROOFS FOR SECTION 3 Proof. of Lemma 3.2. Instead of (9), define d Y t = f 0 (Y t )dt + N i=1fi(Y t ) d Bi,t, (12) where f 0 (x) is the projection of f0(x) to TxΓ|x|. For all 0 t < τx,D, f0 = f 0 at Yt, thus the solutions Y and Y to (9) and (12) starting at y coincide up to τx,D. Moreover, the trajectories of (12) stay in Γ|x|. So τx,D is also the first entering time into D Γ|x| for (12). We claim that D Γ|x| is a non-empty subset of Γ|x|. Indeed, if this is not true, then the trajectories Y and Y coincide forever and stay inside Γ|x|, making Γ|x| an invariant control set which contradicts to the uniqueness of Γ . Moreover, D Γ|x| is relatively open in the submanifold Γ|x|. By Assumption 1.2.(iii), (12) has non-degenerate diffusion on the submanifold Γ|x| = |x|Γ1. Thus Y enters D Γ|x| almost surely. This proves the lemma. Proof. of Lemma 3.4. The proof is similar to that of (Li et al., 2022c, Lemma F.18). Recall that Tr Σ is ( 2)-homogeneous and assumed to be non-constant on Γ1. Therefore, there is an open set V1 Γ1 and a vector field f taking values in span(f1, , f N) such that Tr Σ, f = 0 on V1. By homogeneity, this is also true on r V1 Γr for all r > 0. As Tr Σ(x) = 1 2 x, f0(x) + x 2 and x, f = 0, this implies x, f0(x) , f = 0 on r V1. By (Li et al., 2022c, Lemma F.16), x, [f0, f ] = 0 on r V1. Note that [f0, f ] l0, and span(f1, , f N) = TΓr on r V1 by controllability. It follows that {r V1 : r (r , r+)} D0, which is sufficient to conclude. B EXITING TIME WITH DEGENERATE DIFFUSION AND EXTRA CONTROL VARIABLE In this appendix, we establish a probabilistic lower bound to the exiting time of a stochastic process from a basin (Theorem B.21), based on a one-sided large deviation principle (Theorem B.12). 
Our proofs adapt those from the work of Dembo-Zeitouni (Dembo & Zeitouni, 2010, Chapter 5) to a more general setting. The main differences in the setting are: 1. The stochastic process in the basin is now governed not only by the current location and an Brownian motion, but also by an extra control variable as stated in equation (17); 2. The local minima set in the basin is no longer assumed to be a unique isolated fixed point. The diffusion in the stochastic process is allowed to be degenerate, which was the main novelty in the Dembo-Zeitouni theory compared to the earlier work of Freidlin-Wentzell (Freidlin & Wentzell, 2012, Chapter 6). This appendix will solely consist of mathematical analysis. With the exception of B.4, all notations are chosen independently from those used in other parts of the current paper. Published as a conference paper at ICLR 2024 B.1 PROPERTIES OF UPPER LARGE DEVIATION PRINCIPLE In this section we define the notion of upper large deviation principle (upper LDP). This is the upper bound part of the large deviation principle defined in (Dembo & Zeitouni, 2010, Chapter 1.2). The principles proved in (Dembo & Zeitouni, 2010, Chapter 4.1 & 4.2), which allows to pass the LDP property between random processes, are still valid for upper LDP because the upper and lower bounds are treated separately in their proofs. The purpose of this section is to briefly list which facts are relevant and justify the survival of their proofs in (Dembo & Zeitouni, 2010) with upper LDP. Definition B.1. A rate function I is a lower semicontinuous mapping I : X [0, ] on a metric space X, i.e. I 1([0, a]) is closed for all finite a. A good rate function is a rate function I such that I 1([0, a]) is compact for all finite a. The effective domain DI if I is I 1([0, )). Definition B.2. A family of Borel probability measures {µϵ} on (X, B) satisfies upper large deviation principle (upper LDP) with rate function I if for all measurable subsets A of X, lim sup ϵ 0 ϵ log µϵ(A) inf x A I(x). (13) A family of Borel probability measures µϵ on (X, B) satisfies weak upper large deviation principle (weak upper LDP) with rate function I if (13) holds for all compact subsets A. For background, recall that {µϵ} is said to satisfy the large deviation principle with rate function I if in addition to (13) it also satisfies the lower bound lim inf ϵ 0 ϵ log µϵ(A) inf x A I(x). (14) Theorem B.3. Suppose A is a base for the topology of X. Then a family of probability measures {µϵ} on X satisfies weak upper LDP with rate function I(x) := sup A A,x A lim sup ϵ 0 ϵ log µϵ(A) . Theorem B.4. (Contraction Principle) If F : X Y is a continuous map between Hausdorff spaces and I : X [0, ] is a good rate function. Then (a) The function I (y) := inf F 1({y}) I is a good rate function on Y; (b) If a family of probability measures {µϵ} on X satisfies upper LDP with rate function I, then the pushforward measures {µϵ F 1} satisfies upper LDP with rate function I on Y. Remark about the proofs. Theorem B.3 and Theorem B.4 are the upper bound directions of (Dembo & Zeitouni, 2010, Theorem 4.1.11 & 4.2.1). Their proofs are identical to those therein. For Theorem B.3, note that this direction only uses the equality (4.1.14), but not (4.1.12) and (4.1.13), in (Dembo & Zeitouni, 2010). Definition B.5. 
For families {νϵ,m} and {νϵ} of probability measures on a metric space Y, where m N and ϵ > 0, we say {νϵ,m} are exponentially good approximations of {νϵ} if there exist probability spaces (Ω, Bϵ, Pϵ,m) and two families of random variables yϵ,m, yϵ with joint distribution Pϵ and marginal distributions νϵ,m, νϵ such that for all δ > 0, the event dist(yϵ,m, yϵ) > δ is Bϵ-measurable and lim m lim sup ϵ 0 ϵ log Pϵ dist(yϵ,m, yϵ) > δ = . (15) If in addition νϵ,m = eνϵ is independent of m, we say {eνϵ} and {νϵ} are exponentially equivalent. Theorem B.6. For families {νϵ,m} and {νϵ} of probability measures on a metric space Y, where m N and ϵ > 0, and {νϵ,m} are exponentially good approximations of {νϵ}. If for each m, {νϵ,m} satisfies upper LDP with rate function Im, then (a) {eνϵ} satisfies weak upper LDP with rate function I(y) := sup δ>0 lim sup m inf z Bδ(y) Im(z). Published as a conference paper at ICLR 2024 (b) If in addition I is a good rate function and for every closed subset A Y, inf Y A I(Y ) lim sup m inf Y A Im(Y ). Remark about the proof. The proof is the same as that of (Dembo & Zeitouni, 2010, Theorem 4.2.16). For part (a), by Theorem 3.3 applied to the topological base consisting of all metric balls Bδ(y) in Y, {νϵ,m} satisfies weak upper LDP with rate function I (y) := sup δ>0 lim sup ϵ 0 ϵ log νϵ(Bδ(y)) . So it suffices to prove I(y) I (y). This was done in the proof of (Dembo & Zeitouni, 2010, Theorem 4.2.16, part (a)) via the inequality lim sup ϵ 0 ϵ log νϵ(Bδ(y)) lim inf m inf z B2δ(y) Im(z) = lim sup m inf z B2δ(y) Im(z). I (y) = sup δ>0 lim sup ϵ 0 ϵ log νϵ(Bδ(y)) sup δ>0 lim sup m inf z B2δ(y) Im(z) = I(y). The proof of part (b) is verbatim as in (Dembo & Zeitouni, 2010). Theorem B.7. Suppose a family of probability measures {µϵ} satisfies upper LDP with a good rate function I on a Hausdorff topological space X. And suppose a sequence of continuous maps {F m} from X to another Hausdorff topological space Y approximate a measurable maps F in the sense that for all a < , lim sup m sup {x:I(x) a} dist(F m(x), F(x)) = 0. Finally, assume the families {µϵ (F m) 1} are exponentially good approximations of another family of probability distributions {νϵ} on Y. Then {νϵ} satisfies upper LDP with good rate functions I (y) := inf F 1({y}) I. Remark about the proof. The theorem is the upper bound part of of (Dembo & Zeitouni, 2010, Theorem 4.2.23). The proof stay the same with (Dembo & Zeitouni, 2010, Theorems 4.2.1 & 4.2.16) replaced by their respective upper bounds direction, namely Theorem B.4 and Theorem B.6. B.2 UPPER LDP FOR DEGENERATE DIFFUSION WITH EXTRA CONTROL VARIABLE In this part, we prove a variation of the upper bound direction in the large deviation principle proved in (Dembo & Zeitouni, 2010, Theorem 5.6.7). The notations in this section are self-contained and independent of those from other parts of this paper. Throughout this section we will consider the following setting: U Rn is an open domain, D Rl be a compact set, b : U Rn, σ : U D L(Rd, Rn) and h : R U D Rl are Lipschitz continuous functions, b and σ are bounded, and h is bounded on [0, ϵ0] U D for some ϵ0 > 0. We will fix ϵ0 and a common upper bound H for b, σ and h respectively on these domains. Consider the following families of stochastic differential equations on (Y, Z) U Rl and ϵ > 0: d Y ϵ t = b(Y ϵ t , Zϵ t )dt + ϵσ(Y ϵ t , Zϵ t )d Bd t ; (16) d Zϵ t = h(ϵ, Y ϵ t , Zϵ t )dt. 
(17) The main difference of our setting from that in (Dembo & Zeitouni, 2010) is the existence of the additional control variable Zϵ t , whose evolution depends on ϵ in a less prescribed way. We will assume throughout this section, in addition to Assumption B.8, that: Published as a conference paper at ICLR 2024 Assumption B.8. For all ϵ > 0 and initial values (y0, z0) U D, the solution (Y ϵ t , Zϵ t ) to (16) and (17) starting at (y0, z0) almost surely remains in U D for all t > 0. Definition B.9. Given the functions b, σ, the upper bound H on |h|, and T > 0. The associated path space ST is defined as the family of triples (f, g, u) with f C0([0, T], U), g CLip H ([0, T], D) and u W 1,2([0, T], Rd) that ft = y0 + Z t 0 b(fs, gs)ds + Z t 0 σ(fs, gs)dus. (18) Here CLip H ([0, T], D) is the subspace in C0([0, T], D) of functions g with Lipschitz constant bounded by H, i.e. that satisfy And W 1,2 is the square integrable Sobolev space of first order differentiability. Lemma B.10. CLip H ([0, T], D) is compact in C0([0.T], D). Proof. As C0([0.T], D) is a metric space, it suffices to show any sequence g(k) in CLip H ([0, T], D) has a convergent subsequence with limit in CLip H ([0, T], D) Because D is a bounded domain, the g(k) s are uniformly bounded. As they are also uniformly Lipschitz with Lipschitz constant bounded by H, by Arzel a-Ascoli Theorem we may assume g(k) converges in C0 to some g C0([0, T], D). Then g would be H-Lipschitz continuous as well. From now on, CLip H ([0, T], D) will be equipped with the C0 topology without further notice. Definition B.11. Given a function f C0([0, T], U), the corresponding energy functional is ΦT (f) := inf (g,u) such that (f,g,u) ST 1 2| ut|2dt. By convention, ΦT (f) = if ST is empty. Theorem B.12. Given a closed subset A of the metric space A C0([0, T], U) and an initial value y0, for the solution (Y ϵ t , Zϵ t ) to (16) and (17) with initial value (y0, z0), the following inequality holds: lim sup ϵ 0 ϵ sup z0 D log P(Y ϵ 0 ,Zϵ 0)=(y0,z0)(Y ϵ t A) inf f A f0=y0 ΦT (f). In order to prove Theorem B.12, we need more notations to study the Y ϵ t while keeping the path Zϵ t fixed. For this purpose, we define some distributions of (Zϵ t , ϵBd t ) and Y ϵ t respectively. Definition B.13. Denote by µϵ (y0,z0),T the joint distribution of (Zϵ t , ϵBd t ) on CLip H ([0, T], D) C0([0, T], Rd), and νϵ (y0,z0),T for the distribution of Y ϵ t in C0([0, T], U), where (Y ϵ t , Zϵ t ) is solution to (16), (17) with initial value (y0, z0) at t = 0. Write λϵ for the distribution of ϵBd t on C0([0, T], Rd). Definition B.14. Denote by Mϵ T the space of all probability distributions µ on CLip H ([0, T], D) C0([0, T], Rd) whose projection to the second coordinate is λϵ. For all initial values (y0, z0) U D, Y ϵ t , Zϵ t are progressively measurable processes. In consequence, µϵ (y0,z0),T Mϵ T . (20) Lemma B.15. The function 1 2| ut|2dt if u W 1,2([0, 1], Rd) Published as a conference paper at ICLR 2024 is a good rate function on CLip H ([0, 1], D) C0([0, 1], Rd). Moreover, any family of probability measures {µϵ} on CLip H ([0, 1], D) C0([0, 1], Rd) such that µϵ Mϵ 1 satisfies upper LDP with rate function I. Proof. We first check that I is a rate function. That is, it is lower semicountinuous on CLip H ([0, 1], D) C0([0, 1], Rd). It suffices to show that that I(g, u) lim inf k I(g(k), u(k)) if (g(k), u(k)) (g, u) in C0 norm. By Lemma B.10, g is in CLip H ([0, 1], D), and thus by definition I(g, u) = I0(u) in this case. 
It was known by Schilder s Theorem ((Dembo & Zeitouni, 2010, Theorem 5.2.3)) that the function 1 2| ut|2dt if u W 1,2([0, 1], Rd); is a good rate function, and in particular lower semicountinuous. Thus lim inf k I(g(k), u(k)) = lim inf k I0(u(k)) I0(u) = I(g, u). Thus I is a good rate function and we want to show that it is good, i.e. I 1([0, a]) is compact for all finite a. Remark that I 1([0, a]) = CLip H ([0, 1], D) I 1 0 ([0, a]) and the second factor is compact as I0 is a good rate function. Note that I0(u) = I(g, u). So it suffices to know that CLip H ([0, 1], D) is a compact space in C0 topology, which is the assertion of Lemma B.10. Finally we need to show that (13) holds for {µϵ} and I. The same inequality holds for {λϵ} and I0, i.e. for all A0 C0([0, 1], Rd), lim sup ϵ 0 ϵ log λϵ(A0) inf u A0 I0(u). For all measurable set A CLip H ([0, 1], D) C0([0, 1], Rd), let A0 denote its projection to C0([0, 1], Rd). Then lim sup ϵ 0 ϵ log µϵ(A) lim sup ϵ 0 ϵ log µϵ CLip H ([0, 1], D) A0 = lim sup ϵ 0 ϵ log λϵ(A0) inf u A0 I0(u) inf (g,u) A I(g, u). In the last inequality, we used the fact that A0 contains the projection of A and that I(g, u) = I0(u). This completes the proof. For all (g, u) CLip H ([0, 1], D) C0([0, 1], Rd), denote by Y ϵ (g,u),t the solution on [0, 1], with initial value Y ϵ (g,u),0 = y0 to the stochastic differential equation d Y ϵ (g,u),t = b(Y ϵ (g,u),t, gt)dt + σ(Y ϵ (g,u),t, gt)dut. (21) In addition, given an integer m N, we also define Y ϵ,m (g,u),t as the solution on [0, 1], also with initial value y0, to the following stochastic differential equation d Y ϵ,m (g,u),t = b(Y ϵ,m (g,u), mt m , gt)dt + σ(Y ϵ,m (g,u), mt m , gt)dut. (22) We emphasize that Y ϵ (g,u),t and Y ϵ,m (g,u),t are deterministic once the pair (g, u) are given and all randomness comes from µϵ (y0,z0),T . Lemma B.16. For all any δ > 0 and all initial values y0 D, lim m lim sup ϵ 0 ϵ sup µ Mϵ 1 log P(g,u) µ( sup t [0,1] |Y ϵ,m (g,u),t Y ϵ (g,u),t| δ) = . Published as a conference paper at ICLR 2024 Proof of Lemma B.16. Fix δ > 0. Let ϵ,m (g,u),t := Y ϵ,m (g,u),t Y ϵ (g,u),t , for any ρ > 0, define the stopping time τ ϵ,m,ρ := min(inf{t : |Y ϵ,m (g,u),t Y ϵ,m (g,u), [mt] m | ρ}, 1). τ ϵ,m,ρ depends on (g, u), but we skip it to simplify the notation. The process ϵ,n (g,u),t satisfies the SDE d ϵ,n (g,u),t = b t dt + ϵσ t dut with coefficients b t := b(Y ϵ,m (g,u), [mt] m , gt) b(Y ϵ (g,u),t, gt), σ t := σ(Y ϵ,m (g,u), [mt] m , gt) σ(Y ϵ (g,u),t, gt). By the uniform Lipschitz continuity of b and σ, and the boundedness of gt D, there is a constant C such that for all t τ ϵ,m,ρ, max(|b t |, |σ t |) C(| ϵ,n (g,u),t| + ρ). By (Dembo & Zeitouni, 2010, Lemma 5.6.18), for all ϵ (0, 1), δ > 0, ϵ sup µ Mϵ 1 log P(g,u) µ( sup t [0,τ ϵ,m,ρ] | ϵ,m (g,u),t| δ) K + log ρ2 where K is a constant independent of δ, ϵ, ρ, µ and m. Then lim ρ 0 sup m 1 lim ϵ 0 ϵ sup µ Mϵ 1 log P(g,u) µ( sup t [0,τ ϵ,m,ρ] | ϵ,m (g,u),t| δ) = . Since the event {supt [0,1] | ϵ,m (g,u),t| δ} is contained in the union {τ ϵ,m,ρ < 1} { sup t [0,τ ϵ,m,ρ] | ϵ,m (g,u),t| δ}, the lemma is proved if we show for all ρ > 0, lim m lim ϵ 0 ϵ sup µ Mϵ 1 log P(g,u) µ( sup t [0,1] |Y ϵ,m (g,u),t Y ϵ,m (g,u), [mt] m | ρ) = . (23) To prove this, recall that |b| and |σ| are bounded by a constant H, and |Y ϵ,m (g,u),t Y ϵ,m (g,u), [mt] m + ϵ max k=0,...,m 1 sup 0 s 1 Therefore, for m > H sup µ Mϵ 1 log P(g,u) µ( sup 0 t 1 |Y ϵ,m (g,u),t Y ϵ,m (g,u), [mt] m P( sup 0 s 1 m )2/2dϵC2. This guarantees (23) and proves the lemma. Proof of Theorem B.12. 
First of all, notice that one can assume without loss of generality that T = 1 by rescaling the time interval [0, T] to [0, 1]. To see this, notice that the rescaled Brownian motion Bd T t is equivalent to TBd t on t [0, 1] and if we make the change of variable u T t = Tvt accordingly, then Z T 1 2| ut|2dt = 1 dtu T t 2dt = Z 1 1 2| vt|2dt. We will only deal with the [0, 1] interval below. Published as a conference paper at ICLR 2024 Define maps F m, F : CLip H ([0, 1], D) C0([0, 1], Rd) C0([0, 1], U) as follows. For F m, the image f m = F m(g, u) satisfies f m 0 = y0 and f m t = f m k m + Z t k m b(f m k m , gs)ds + Z t k m σ(f m k m , gs) usds, (24) m ], k = 0, ..., m 1. For F, the image f = F(g, u) instead satisfies f0 = y0 and ft = f0 + Z t 0 b(fs, gs)ds + Z t 0 σ(fs, gs) usds (25) for t [0, 1]. It is not hard to check by Lipschitz boundedness of b, σ, and the compactnes of D that F m and F send CLip H ([0, 1], D) C0([0, 1], Rd) to C0([0, 1], U). Moreover, (21) and (22) can be reformulated as Y ϵ,m (g,u),t = F m(g, u) and Y ϵ (g,u),t = F(g, u) respectively. (26) For (g, u), (g , u ) CLip H ([0, 1], D) C0([0, 1], Rd), let Θm := |F m(g, u) F m(g , u )|. Θm is a function of t, and by the Lipschitz bounds on b and σ, m ] Θm(t) C(Θm( k m) + (g, u) (g , u ) C0). Since Θm(0) = 0, we derive the continuity of the operator F m by iterating the argument with k = 0, 1, ..., m 1. A similar argument guarantees the continuity of the operator F. By (20) and Lemma B.16, for all family of initial values {zϵ 0 D}ϵ>0, lim m lim sup ϵ 0 ϵ log P(g,u) µϵ (y0,zϵ 0),1( sup t [0,1] |Y ϵ,m (g,u),t Y ϵ (g,u),t| δ) = . In other words, the families {µϵ (y0,zϵ 0),1 (F m) 1} are exponentially good approximations of the distribution of Y ϵ (g,u),t with (g, u) µϵ (y0,zϵ 0),1. Thus if we can prove: for all a < , lim m sup (g,u) CLip H ([0,1],D) C0([0,1],Rd): R 1 0 1 2 | ut|2dt a F m(g, u) F(g, u) C0 = 0, (27) then by Lemma B.7 and Lemma B.15, the distribution of Y ϵ (g,u),t with (g, u) µϵ (y0,zϵ 0),1 satisfies upper LDP with good rate function y 7 inf F (g,u)=y I(g, u), which is exactly Φ1(y). Because for each different value of ϵ, zϵ 0 is arbitrarily chosen in D and the distribution of Y ϵ (g,u),t for (g, u) µϵ (y0,zϵ 0),1 coincides with the distribution νϵ (y0,zϵ 0),1 of Y ϵ t with initial condition (Y ϵ 0 , Zϵ 0) = (y0, zϵ 0) in (16), (17), this exactly yields the statement of the theorem. It remains to show (27). Let m = |F m(g, u) F(g, u)|. Note that if Z 1 1 2| ut|2dt a holds, then for f m = F m(g, u) and f = F(g, u) given in (24), (25), sup t [0,1] max |f m t f m [mt] m |, |ft f [mt] Published as a conference paper at ICLR 2024 for some constant C0 by the bound on the coefficients and Cauchy-Schwarz inequality. Write ηm for the right hand side in (28). Then by Lipschitz continuity of b and σ, for some other constant C, m t =|f m t ft| m , gs) b(fs, gs) ds + Z t m , gs) σ(fs, gs) usds 0 C|f m [ms] m fs|ds + Z t 0 C|f m [ms] m fs| | us|ds 0 C( m s + ηm)(1 + | us|)ds By Cauchy-Schwarz inequality, for all t [0, 1] 0 ( m s + ηm)2ds Z t 0 (1 + | us|)2ds 0 2 ( m s )2 + (ηm)2 ds Z t 0 2(1 + | us|2)ds 4C2(1 + a) Z t 0 ( m s )2ds + (ηm)2 . Thus by Gronwall s inequality, ( m t )2 4C2(1 + a)e4C2(1+a)t(ηm)2. The equality (27) follows by letting m , This completes the proof. The following is a strengthen version of Theorem B.12. Theorem B.17. 
The following is a strengthened version of Theorem B.12.

Theorem B.17. Given a closed subset $A$ of the metric space $C^0([0,T],U)$ and an initial value $(y^*,z^*)\in U\times V$, for solutions $(Y^\epsilon_t,Z^\epsilon_t)$ to (16) and (17), the following inequality holds:
$$\limsup_{\substack{\epsilon\to0\\ y_0\to y^*}}\ \sup_{z_0\in D}\ \epsilon\log\mathbb{P}_{(Y^\epsilon_0,Z^\epsilon_0)=(y_0,z_0)}\big(\{Y^\epsilon_t\}_{t\in[0,T]}\in A\big)\le-\inf_{\substack{f\in A\\ f_0=y^*}}\Phi_T(f).$$

Proof of Theorem B.17. As in the proof of Theorem B.12, we can assume $T=1$. By Theorem B.6, it suffices to prove that for any family of points $y^\epsilon_0\to y^*$ as $\epsilon\to0$ and arbitrary $z^\epsilon_0\in D$, the family of distributions $\{\nu^\epsilon_{(y^\epsilon_0,z^\epsilon_0),1}\}$ is exponentially equivalent to $\{\nu^\epsilon_{(y^*,z^\epsilon_0),1}\}$ in the sense of Definition B.5. Write
$$\Delta^\epsilon_t=\begin{pmatrix}Y^\epsilon_{(y^\epsilon_0,z^\epsilon_0),t}-Y^\epsilon_{(y^*,z^\epsilon_0),t}\\[2pt] Z^\epsilon_{(y^\epsilon_0,z^\epsilon_0),t}-Z^\epsilon_{(y^*,z^\epsilon_0),t}\end{pmatrix}.$$
Then $\Delta^\epsilon_t$ satisfies $d\Delta^\epsilon_t=\alpha^\epsilon_t\,dt+\beta^\epsilon_t\,dB^d_t$, where
$$\alpha^\epsilon_t=\begin{pmatrix}b(Y^\epsilon_{(y^\epsilon_0,z^\epsilon_0),t},Z^\epsilon_{(y^\epsilon_0,z^\epsilon_0),t})-b(Y^\epsilon_{(y^*,z^\epsilon_0),t},Z^\epsilon_{(y^*,z^\epsilon_0),t})\\[2pt] h(Y^\epsilon_{(y^\epsilon_0,z^\epsilon_0),t},Z^\epsilon_{(y^\epsilon_0,z^\epsilon_0),t})-h(Y^\epsilon_{(y^*,z^\epsilon_0),t},Z^\epsilon_{(y^*,z^\epsilon_0),t})\end{pmatrix},\qquad
\beta^\epsilon_t=\begin{pmatrix}\sigma(Y^\epsilon_{(y^\epsilon_0,z^\epsilon_0),t},Z^\epsilon_{(y^\epsilon_0,z^\epsilon_0),t})-\sigma(Y^\epsilon_{(y^*,z^\epsilon_0),t},Z^\epsilon_{(y^*,z^\epsilon_0),t})\\[2pt] 0\end{pmatrix},$$
and $\Delta^\epsilon_0=\begin{pmatrix}y^\epsilon_0-y^*\\ 0\end{pmatrix}$. Remark that the coefficients $\alpha^\epsilon_t,\beta^\epsilon_t$ are progressively measurable processes with respect to the filtration generated by the Brownian motion $\{B^d_t\}$. Moreover, by the Lipschitz continuity of $b,\sigma,h$, for some constant $C_0$, $\max(|\alpha^\epsilon_t|,|\beta^\epsilon_t|)\le C_0|\Delta^\epsilon_t|$. By applying (Dembo & Zeitouni, 2010, Lemma 5.6.18) with $\tau_1=1$, we know that there is another constant $C$ such that for all $\rho>0$, $\delta>0$,
$$\epsilon\log\mathbb{P}\Big(\sup_{t\in[0,1]}|\Delta^\epsilon_t|\ge\delta\Big)\le C+\log\frac{\rho^2+|y^\epsilon_0-y^*|^2}{\delta^2}.$$
By letting $\rho\to0$ first and then $\epsilon\to0$ (so that $|y^\epsilon_0-y^*|\to0$), it follows that $\limsup_{\epsilon\to0}\epsilon\log\mathbb{P}\big(\sup_{t\in[0,1]}|\Delta^\epsilon_t|\ge\delta\big)=-\infty$. This shows $\{\nu^\epsilon_{(y^\epsilon_0,z^\epsilon_0),1}\}$ is exponentially equivalent to $\{\nu^\epsilon_{(y^*,z^\epsilon_0),1}\}$, which suffices to conclude the proof.

Corollary B.18. Given a closed subset $A$ of the metric space $C^0([0,T],U)$ and a compact set $K\subseteq U$, for solutions $(Y^\epsilon_t,Z^\epsilon_t)$ to (16) and (17), the following inequality holds:
$$\limsup_{\epsilon\to0}\ \epsilon\log\sup_{(y_0,z_0)\in K\times D}\mathbb{P}_{(Y^\epsilon_0,Z^\epsilon_0)=(y_0,z_0)}\big(\{Y^\epsilon_t\}_{t\in[0,T]}\in A\big)\le-\inf_{\substack{f\in A\\ f_0\in K}}\Phi_T(f).$$

Proof. Let $M\in[0,\infty)$ be a finite value strictly less than $\inf_{f\in A,\,f_0\in K}\Phi_T(f)\in[0,\infty]$. By Theorem B.17, for all $y\in K$ there is a value $\epsilon_y$ such that for all $y_0\in B_{\epsilon_y}(y)$ and $0<\epsilon<\epsilon_y$,
$$\sup_{z_0\in D}\ \epsilon\log\mathbb{P}_{(Y^\epsilon_0,Z^\epsilon_0)=(y_0,z_0)}\big(\{Y^\epsilon_t\}_{t\in[0,T]}\in A\big)\le-M.$$
Cover $K$ by finitely many balls of the form $B_{\epsilon_{y_i}}(y_i)$; then for all $0<\epsilon<\min_i\epsilon_{y_i}$,
$$\sup_{y_0\in K}\ \sup_{z_0\in D}\ \epsilon\log\mathbb{P}_{(Y^\epsilon_0,Z^\epsilon_0)=(y_0,z_0)}\big(\{Y^\epsilon_t\}_{t\in[0,T]}\in A\big)\le-M.$$
The proof is completed by letting $M\to\inf_{f\in A,\,f_0\in K}\Phi_T(f)$.

B.3 EXITING TIME FROM BASIN

In this section, it will be assumed, in addition to Assumption B.8, that:

Assumption B.19. There are a $C^2$ function $L:U\to[0,\infty)$ and a bounded open set $V\subseteq U$ such that: (1) $\nabla_{b(y,z)}L(y)\le0$ for all $(y,z)\in V\times D$, and equality holds if and only if $L(y)=0$; (2) $L$ is strictly positive on $\partial V$.

Write $V_q=L^{-1}([0,q))\cap V$ and choose $q_0$ sufficiently small such that $\overline{V_{q_0}}$ lies in the interior of $V$. In particular, for all $q\in[0,q_0]$, $L|_{\partial V_q}\equiv q$.

Definition B.20. Suppose $0<q<Q\le q_0$. For a solution $(Y^\epsilon_t,Z^\epsilon_t)$ of (16), (17) with initial value in $V_Q\times D$, denote by $\tau^\epsilon_{q,Q}$ the first time $Y^\epsilon_t$ hits $\partial V_q\cup\partial V_Q$, and by $\tau^\epsilon_Q$ the first time $Y^\epsilon_t$ hits $\partial V_Q$.

Our goal is to prove the following main theorem:

Theorem B.21. Under Assumptions B.8 and B.19, for all $0<q<Q<q_0$ there exists $I_Q>0$ such that
$$\lim_{\epsilon\to0}\ \inf_{(y_0,z_0)\in V_q\times D}\mathbb{P}_{(Y^\epsilon_0,Z^\epsilon_0)=(y_0,z_0)}\big(\tau^\epsilon_Q>e^{I_Q/\epsilon}\big)=1.$$
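Before turning to the supporting lemmas, here is a hedged toy illustration of the scaling asserted by Theorem B.21; it is not the paper's model. For the one-dimensional small-noise SDE $dY=-L'(Y)\,dt+\sqrt{\epsilon}\,dB_t$ with $L(y)=\tfrac12y^2$, Freidlin–Wentzell theory predicts that the exit time from the sublevel set $\{L<Q\}$ grows like $e^{I_Q/\epsilon}$ (here $I_Q=2Q$), so $\epsilon\log(\text{exit time})$ should stabilize near a positive constant as $\epsilon$ decreases. All constants and names in the snippet are illustrative assumptions.

```python
import numpy as np

# Toy illustration of the exponential exit-time scaling (not the paper's model):
# exit times from the basin {L < Q} for dY = -L'(Y) dt + sqrt(eps) dB, L(y) = y^2/2.

rng = np.random.default_rng(1)

def mean_exit_time(eps, Q=0.5, dt=1e-3, n_trials=10, t_max=1e6):
    times = []
    for _ in range(n_trials):
        y, t = 0.0, 0.0
        while 0.5 * y * y < Q and t < t_max:   # stop when L(y) reaches the level Q
            y += -y * dt + np.sqrt(eps * dt) * rng.normal()
            t += dt
        times.append(t)
    return float(np.mean(times))

for eps in (0.5, 0.4, 0.3, 0.25):
    tau = mean_exit_time(eps)
    print(f"eps={eps:.2f}  mean exit time={tau:8.1f}  eps*log(tau)={eps * np.log(tau):.3f}")
# eps * log(tau) is expected to level off near the quasipotential I_Q (= 2Q here).
```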
The following lemma is an analogue of (Dembo & Zeitouni, 2010, Lemma 5.7.22).

Lemma B.22. For all $0<q<Q<Q'\le q_0$, the stopping time $\tau^\epsilon_{q,Q'}$ satisfies
$$\lim_{\epsilon\to0}\ \inf_{(y_0,z_0)\in\overline{V_Q}\times D}\mathbb{P}_{(Y^\epsilon_0,Z^\epsilon_0)=(y_0,z_0)}\Big(\tau^\epsilon_{q,Q'}<\infty\ \text{and}\ Y^\epsilon_{\tau^\epsilon_{q,Q'}}\in\partial V_q\Big)=1.$$

Proof. Consider the solution $Y^0_t$ of the deterministic flow
$$dY^0_t=b(Y^0_t,z_0)\,dt\qquad(29)$$
starting at $y_0$. By Assumption B.19, $L(Y^0_t)$ is decreasing and must converge to $0$ as $t\to\infty$, and $Y^0_t$ remains in $\overline{V_Q}$. Denote by $\bar T$ the first moment $L(Y^0_t)$ reaches $\frac q2$, and let
$$\delta:=\|L\|_{C^1}^{-1}\min\Big(\frac q4,\ \frac{Q'-Q}2\Big).$$
Notice that $\sup_{t\in[0,\bar T]}\mathrm{dist}\big((Y^\epsilon_t,Z^\epsilon_t),(Y^0_t,z_0)\big)<\delta$ implies
$$\sup_{t\in[0,\bar T]}\big|L(Y^\epsilon_t)-L(Y^0_t)\big|\le\min\Big(\frac q4,\ \frac{Q'-L(y_0)}2\Big).$$
Since $L(Y^0_t)\le L(y_0)\le Q$ and $L(Y^0_{\bar T})=\frac q2$, this guarantees that $L(Y^\epsilon_t)\le L(y_0)+\|L\|_{C^1}\delta<Q'$ and $L(Y^\epsilon_{\bar T})\le\frac q2+\|L\|_{C^1}\delta<q$. It would in turn follow that $\tau^\epsilon_{q,Q'}\le\bar T<\infty$ and $Y^\epsilon_{\tau^\epsilon_{q,Q'}}\in\partial V_q$. Hence it suffices to prove
$$\lim_{\epsilon\to0}\mathbb{P}_{(Y^\epsilon_0,Z^\epsilon_0)=(y_0,z_0)}\Big(\sup_{t\in[0,\bar T]}\mathrm{dist}\big((Y^\epsilon_t,Z^\epsilon_t),(Y^0_t,z_0)\big)<\delta\Big)=1.\qquad(30)$$
For simplicity, write $J^\epsilon_t:=\mathrm{dist}\big((Y^\epsilon_t,Z^\epsilon_t),(Y^0_t,z_0)\big)$. Let $M<\infty$ be a common upper bound on the $C^0$ and Lipschitz norms of the maps $b,\sigma,h$ over the domain $\epsilon\in[0,1]$, $y\in\overline{V_{q_0}}$, $z\in D$. Then $J^\epsilon_0=0$ and
$$J^\epsilon_t\le\int_0^t2MJ^\epsilon_s\,ds+\sqrt{\epsilon}\,\Big|\int_0^t\sigma(Y^\epsilon_s,Z^\epsilon_s)\,dB^d_s\Big|.$$
By Gronwall's inequality,
$$\sup_{t\in[0,\bar T]}J^\epsilon_t\le\sqrt{\epsilon}\,e^{2M\bar T}\sup_{t\in[0,\bar T]}\Big|\int_0^t\sigma(Y^\epsilon_s,Z^\epsilon_s)\,dB^d_s\Big|.$$
Therefore, in order to have $\sup_{t\in[0,\bar T]}J^\epsilon_t\ge\delta$, we must have
$$\sup_{t\in[0,\bar T]}\Big|\int_0^t\sigma(Y^\epsilon_s,Z^\epsilon_s)\,dB^d_s\Big|\ge\delta\,\epsilon^{-\frac12}e^{-2M\bar T},$$
and hence
$$\sup_{t\in[0,\bar T]}\Big|\int_0^t\sigma_k(Y^\epsilon_s,Z^\epsilon_s)\,dB^d_s\Big|\ge\frac{\delta\,\epsilon^{-\frac12}e^{-2M\bar T}}d$$
must hold for at least one of the row vectors $\sigma_k$ of $\sigma$. The process $G^\epsilon_{k,t}=\int_0^t\sigma_k(Y^\epsilon_s,Z^\epsilon_s)\,dB^d_s$ is a continuous martingale with increasing process $\langle G^\epsilon_k\rangle_t\le M^2t$ almost surely; recall that $\langle G^\epsilon_k\rangle_t$ is defined as $\int_0^t|\sigma_k(Y^\epsilon_s,Z^\epsilon_s)|^2\,ds$. By the Burkholder–Davis–Gundy inequality (see e.g. (Dembo & Zeitouni, 2010, Chapter E)),
$$\mathbb{E}_{(Y^\epsilon_0,Z^\epsilon_0)=(y_0,z_0)}\Big[\sup_{t\in[0,\bar T]}|G^\epsilon_{k,t}|^2\Big]\le C\,\mathbb{E}_{(Y^\epsilon_0,Z^\epsilon_0)=(y_0,z_0)}\big[\langle G^\epsilon_k\rangle_{\bar T}\big]\le CM^2\bar T$$
for an absolute constant $C$. Thus, by Chebyshev's inequality,
$$\mathbb{P}_{(Y^\epsilon_0,Z^\epsilon_0)=(y_0,z_0)}\Big(\sup_{t\in[0,\bar T]}\Big|\int_0^t\sigma_k(Y^\epsilon_s,Z^\epsilon_s)\,dB^d_s\Big|\ge\frac{\delta\,\epsilon^{-\frac12}e^{-2M\bar T}}d\Big)\le CM^2\bar T\Big(\frac{d\,\epsilon^{\frac12}\,e^{2M\bar T}}{\delta}\Big)^2.$$
From the earlier discussion, after summing over all $k$'s, we obtain that
$$\mathbb{P}_{(Y^\epsilon_0,Z^\epsilon_0)=(y_0,z_0)}\Big(\sup_{t\in[0,\bar T]}J^\epsilon_t\ge\delta\Big)\le CM^2\bar T\,d^3\,\epsilon\,\delta^{-2}\,e^{4M\bar T}.\qquad(31)$$
We deduce (30) from (31) by letting $\epsilon\to0$, which concludes the proof.

The following lemma is the analogue of (Dembo & Zeitouni, 2010, Lemma 5.7.23).

Lemma B.23. For all $\delta,a>0$ and every bounded set $K\subseteq U$, there exists a constant $T_0=T_0(\delta,a,K)>0$ such that
$$\limsup_{\epsilon\to0}\ \epsilon\log\sup_{(y_0,z_0)\in K\times D}\mathbb{P}_{(Y^\epsilon_0,Z^\epsilon_0)=(y_0,z_0)}\Big(\sup_{t\in[0,T_0]}\big|(Y^\epsilon_t-y_0,Z^\epsilon_t-z_0)\big|\ge\delta\Big)<-a.$$

Proof. Without loss of generality, assume $\epsilon,\delta\in[0,1]$ with $\delta$ fixed and $\epsilon$ varying. Let the stopping time $\zeta^\epsilon$ be the first time such that $|(Y^\epsilon_t-y_0,Z^\epsilon_t-z_0)|\ge\delta$. Then for every $t\in[0,\zeta^\epsilon]$,
$$|b(Y^\epsilon_t,Z^\epsilon_t)|\le\max_{\overline{B_\delta(K)}\times D}|b|+\|b\|_{\mathrm{Lip}}\delta,\qquad|\sigma(Y^\epsilon_t,Z^\epsilon_t)|\le\max_{\overline{B_\delta(K)}\times D}|\sigma|+\|\sigma\|_{\mathrm{Lip}}\delta,\qquad|h(\epsilon,Y^\epsilon_t,Z^\epsilon_t)|\le\max_{[0,1]\times\overline{B_\delta(K)}\times D}|h|+\|h\|_{\mathrm{Lip}}\delta,$$
and thus they are all uniformly bounded by a constant $M$. For all $0\le t\le\min(\zeta^\epsilon,\frac{\delta}{4M})$, we have $|Z^\epsilon_t-z_0|\le\frac\delta4$ and $|Y^\epsilon_t-y_0|\le\frac\delta4+\big|\int_0^t\sqrt{\epsilon}\,\sigma(Y^\epsilon_s,Z^\epsilon_s)\,dB^d_s\big|$. Hence for any $T_0\le\frac{\delta}{4M}$ we have
$$\mathbb{P}_{(Y^\epsilon_0,Z^\epsilon_0)=(y_0,z_0)}(\zeta^\epsilon\le T_0)\le\mathbb{P}_{(Y^\epsilon_0,Z^\epsilon_0)=(y_0,z_0)}\Big(\zeta^\epsilon\le T_0\ \text{and}\ \sqrt{\epsilon}\sup_{t\in[0,T_0]}\Big|\int_0^t\sigma(Y^\epsilon_s,Z^\epsilon_s)\,dB^d_s\Big|\ge\frac\delta2\Big).$$
As in the proof of (Dembo & Zeitouni, 2010, Lemma 5.7.23), it suffices to consider each row vector $\sigma_k$ of $\sigma$. The stochastic process $\int_0^t\sigma_k(Y^\epsilon_s,Z^\epsilon_s)\,dB^d_s$ is equivalent to $B^1_{\tau^\epsilon_{k,t}}$ by the time change theorem (see (Dembo & Zeitouni, 2010, Chapter E.2)), where $\tau^\epsilon_{k,t}=\int_0^t|\sigma_k(Y^\epsilon_s,Z^\epsilon_s)|^2\,ds$. The function $\tau^\epsilon_{k,t}$ is increasing in $t$ and satisfies $\tau^\epsilon_{k,t}\le M^2t$ almost surely for $t\le\min(T_0,\zeta^\epsilon)$, as the $C^0$ and Lipschitz norms of $b,\sigma,h$ are bounded by $M$ before the stopping time $\zeta^\epsilon$.
By (Dembo & Zeitouni, 2010, Lemma 5.2.1),
$$\mathbb{P}_{(Y^\epsilon_0,Z^\epsilon_0)=(y_0,z_0)}\Big(\zeta^\epsilon\le T_0\ \text{and}\ \sqrt{\epsilon}\sup_{t\in[0,T_0]}\Big|\int_0^t\sigma_k(Y^\epsilon_s,Z^\epsilon_s)\,dB^d_s\Big|\ge\frac{\delta}{2d}\Big)\le\mathbb{P}_{(Y^\epsilon_0,Z^\epsilon_0)=(y_0,z_0)}\Big(\zeta^\epsilon\le T_0\ \text{and}\ \sqrt{\epsilon}\sup_{t\in[0,T_0]}\big|B^1_{\tau^\epsilon_{k,t}}\big|\ge\frac{\delta}{2d}\Big)\le\mathbb{P}_{(Y^\epsilon_0,Z^\epsilon_0)=(y_0,z_0)}\Big(\sqrt{\epsilon}\sup_{\tau\in[0,M^2T_0]}\big|B^1_\tau\big|\ge\frac{\delta}{2d}\Big)\le4e^{-(\frac{\delta}{2d})^2/(2\epsilon M^2T_0)}.$$
Summing over the different $k$'s, we conclude from the two inequalities above that for all $T_0\in(0,\frac{\delta}{4M}]$,
$$\mathbb{P}_{(Y^\epsilon_0,Z^\epsilon_0)=(y_0,z_0)}(\zeta^\epsilon\le T_0)\le d\cdot4e^{-(\frac{\delta}{2d})^2/(2\epsilon M^2T_0)}=4d\,e^{-\big(\frac18\delta^2d^{-2}M^{-2}T_0^{-1}\big)/\epsilon}.$$
It now suffices to take $T_0=\min\big(\frac{\delta}{4M},\frac{\delta^2}{8d^2M^2a}\big)$.

Definition B.24. For $y_0,y_1\in U$ and $T>0$, define an energy cost $\Psi_T(y_0,y_1)$ by
$$\Psi_T(y_0,y_1)=\inf\Big\{\int_0^T\frac12|\dot u_s|^2\,ds:\ (f,g,u)\in\mathcal S_T,\ f_0=y_0,\ f_T=y_1\Big\}=\inf\big\{\Phi(f):\ f\in C^0([0,T],U),\ f_0=y_0,\ f_T=y_1\big\}.$$
The following lemma claims that there is a minimum energy cost required to increase the loss function $L$ between different levels.

Lemma B.25. For all $0<q<Q\le q_0$,
$$\inf_{T\ge0}\ \inf_{(y_0,y_1)\in\partial V_q\times\partial V_Q}\Psi_T(y_0,y_1)>0.\qquad(32)$$
Moreover, there is a constant $T_{q,Q}>0$ such that
$$\inf_{T\ge T_{q,Q}}\ \inf_{y_0,y_1\in\overline{V_Q}\setminus V_q}\Psi_T(y_0,y_1)>0.\qquad(33)$$

Proof. Suppose for now that $y_0,y_1\in\overline{V_Q}\setminus V_q$, and $(f,g,u)\in\mathcal S_T$ with $f_0=y_0$, $f_T=y_1$. Then by (18),
$$dL(f_s)=\nabla L(f_s)\cdot df_s=\nabla L(f_s)\cdot b(f_s,g_s)\,ds+\nabla L(f_s)\cdot\sigma(f_s,g_s)\,du_s.$$
Because $q>0$, by Assumption B.19 we have $\nabla L(y)\cdot b(y,z)=\nabla_{b(y,z)}L(y)<0$ for all $y\in\overline{V_Q}\setminus V_q$ and $z\in D$. In particular, this also shows $\nabla L(y)\ne0$ for all $y\in\overline{V_Q}\setminus V_q$. Since both $\overline{V_Q}\setminus V_q$ and $D$ are compact, there exist positive constants $\kappa=\kappa(q,Q)>0$ and $\eta=\eta(q,Q)>0$ such that
$$\nabla L(y)\cdot b(y,z)\le-\kappa|\nabla L(y)|^2\quad\text{and}\quad|\nabla L(y)|\ge\eta\quad\text{for }(y,z)\in\big(\overline{V_Q}\setminus V_q\big)\times D.$$
Integrating from $0$ to $T$, we get
$$L(y_1)-L(y_0)=L(f_T)-L(f_0)=\int_0^T\nabla L(f_s)\cdot b(f_s,g_s)\,ds+\int_0^T\nabla L(f_s)\cdot\sigma(f_s,g_s)\,du_s\le-\kappa\int_0^T|\nabla L(f_s)|^2\,ds+\Big(\int_0^T|\nabla L(f_s)|^2\,ds\Big)^{\frac12}\|\sigma\|_{C^0(\overline{V_Q}\times D)}\Big(\int_0^T|\dot u_s|^2\,ds\Big)^{\frac12}.\qquad(34)$$
In order to show (32), note that if $y_0\in\partial V_q$ and $y_1\in\partial V_Q$, then $L(y_1)-L(y_0)=Q-q>0$. By (34) and the elementary bound $-\kappa x^2+cx\le\frac{c^2}{4\kappa}$,
$$Q-q\le-\kappa\int_0^T|\nabla L(f_s)|^2\,ds+\Big(\int_0^T|\nabla L(f_s)|^2\,ds\Big)^{\frac12}\|\sigma\|_{C^0(\overline{V_Q}\times D)}\Big(\int_0^T|\dot u_s|^2\,ds\Big)^{\frac12}\le\frac1{4\kappa}\|\sigma\|^2_{C^0(\overline{V_Q}\times D)}\int_0^T|\dot u_s|^2\,ds.$$
Since $\sigma$ is continuous, $\|\sigma\|_{C^0(\overline{V_Q}\times D)}<\infty$. It follows that
$$\int_0^T\frac12|\dot u_s|^2\,ds\ge\frac{2\kappa(Q-q)}{\|\sigma\|^2_{C^0(\overline{V_Q}\times D)}}>0.$$
This provides a positive lower bound for $\Psi_T(y_0,y_1)$ that is uniform over $T>0$, $y_0\in\partial V_q$ and $y_1\in\partial V_Q$, and proves (32).

We now prove (33). Note that for all $y_0,y_1\in\overline{V_Q}\setminus V_q$, $L(y_1)-L(y_0)\ge q-Q$. By (34),
$$\Big(\int_0^T|\dot u_s|^2\,ds\Big)^{\frac12}\ge\|\sigma\|^{-1}_{C^0(\overline{V_Q}\times D)}\bigg(\kappa\Big(\int_0^T|\nabla L(f_s)|^2\,ds\Big)^{\frac12}-(Q-q)\Big(\int_0^T|\nabla L(f_s)|^2\,ds\Big)^{-\frac12}\bigg)\ge\|\sigma\|^{-1}_{C^0(\overline{V_Q}\times D)}\Big(\kappa\eta T^{\frac12}-(Q-q)\eta^{-1}T^{-\frac12}\Big).$$
The last expression is uniformly positive for $T\ge T_{q,Q}:=2(Q-q)\kappa^{-1}\eta^{-2}$. Thus $\Psi_T(y_0,y_1)$ is uniformly positive over the region given by $T\ge T_{q,Q}$, $y_0,y_1\in\overline{V_Q}\setminus V_q$. This proves (33).

Lemma B.26. For all $0<q<Q\le q_0$, the stopping time $\tau^\epsilon_{q,Q}$ satisfies
$$\lim_{T\to\infty}\limsup_{\epsilon\to0}\ \epsilon\log\sup_{(y_0,z_0)\in\overline{V_Q}\times D}\mathbb{P}_{(Y^\epsilon_0,Z^\epsilon_0)=(y_0,z_0)}\big(\tau^\epsilon_{q,Q}>T\big)=-\infty.$$

Proof. Note that $\tau^\epsilon_{q,Q}>T$ implies $Y^\epsilon_t\in\overline{V_Q}\setminus V_q$ for all $t\in[0,T]$; in other words, $\{Y^\epsilon_t\}_{t\in[0,T]}\in C^0\big([0,T],\overline{V_Q}\setminus V_q\big)$. Therefore, by Corollary B.18, for each fixed $T$,
$$\limsup_{\epsilon\to0}\ \epsilon\log\sup_{(y_0,z_0)\in\overline{V_Q}\times D}\mathbb{P}_{(Y^\epsilon_0,Z^\epsilon_0)=(y_0,z_0)}\big(\tau^\epsilon_{q,Q}>T\big)\le-\inf_{f\in C^0([0,T],\overline{V_Q}\setminus V_q)}\Phi(f),$$
and in consequence it suffices to show
$$\lim_{T\to\infty}\ \inf_{f\in C^0([0,T],\overline{V_Q}\setminus V_q)}\Phi(f)=\infty.\qquad(36)$$
Assume, for the sake of contradiction, that (36) is false. Then for some $M<\infty$ and all $k\in\mathbb{N}$, there exists $f_k\in C^0\big([0,kT_{q,Q}],\overline{V_Q}\setminus V_q\big)$ with $\Phi(f_k)\le M$, where $T_{q,Q}$ is given by Lemma B.25.
After breaking $f_k$ into $k$ segments on subintervals of length $T_{q,Q}$, it follows that there exists $f'_k\in C^0\big([0,T_{q,Q}],\overline{V_Q}\setminus V_q\big)$ such that $\Phi(f'_k)\le\frac Mk$. By taking a limit in the $C^0$ norm (which is permitted by Lemma B.10) and using the lower semicontinuity of the good rate function $\Phi$, there exists $f^*\in C^0\big([0,T_{q,Q}],\overline{V_Q}\setminus V_q\big)$ with $\Phi(f^*)=0$. This implies $\Psi_{T_{q,Q}}(f^*_0,f^*_{T_{q,Q}})=0$, which contradicts inequality (33) of Lemma B.25. This completes the proof.

Denote by $I_{q,Q}>0$ the left-hand side of (32).

Lemma B.27. The solutions to (16) and (17) satisfy
$$\lim_{q\to0}\limsup_{\epsilon\to0}\ \epsilon\log\sup_{(y_0,z_0)\in\overline{V_{2q}}\times D}\mathbb{P}_{(Y^\epsilon_0,Z^\epsilon_0)=(y_0,z_0)}\big(Y^\epsilon_{\tau^\epsilon_{q,Q}}\in\partial V_Q\big)\le-\lim_{q\to0}I_{q,Q}.\qquad(37)$$

Proof. Fix an arbitrarily small $\delta$, and let $\hat I_{Q,\delta}:=\min\big(\lim_{q\to0}I_{q,Q}-\delta,\ \frac1\delta\big)$. Note that the right-hand side of (37) always exists because $I_{q,Q}$ is a decreasing function of $q$ by construction. In particular, $I_{2q,Q}\ge\hat I_{Q,\delta}$ when $q$ is sufficiently small depending on $Q$ and $\delta$. By Lemma B.26, there exists a large $T^*=T^*(q,Q,\delta)<\infty$ such that
$$\limsup_{\epsilon\to0}\ \epsilon\log\sup_{(y_0,z_0)\in\overline{V_{2q}}\times D}\mathbb{P}_{(Y^\epsilon_0,Z^\epsilon_0)=(y_0,z_0)}\big(\tau^\epsilon_{q,Q}>T^*\big)\le-\hat I_{Q,\delta}.\qquad(38)$$
In addition,
$$\inf_{\substack{f\in C^0([0,T^*],U):\\ f_0\in\overline{V_{2q}},\ \sup_{t\in[0,T^*]}L(f_t)\ge Q}}\Phi(f)\ \ge\ I_{2q,Q}\ \ge\ \hat I_{Q,\delta},$$
and thus by Corollary B.18,
$$\limsup_{\epsilon\to0}\ \epsilon\log\sup_{(y_0,z_0)\in\overline{V_{2q}}\times D}\mathbb{P}_{(Y^\epsilon_0,Z^\epsilon_0)=(y_0,z_0)}\Big(\sup_{t\in[0,T^*]}L(Y^\epsilon_t)\ge Q\Big)\le-\hat I_{Q,\delta}.\qquad(39)$$
Note that the event $\{Y^\epsilon_{\tau^\epsilon_{q,Q}}\in\partial V_Q\}$ is contained in the union of the events $\{\tau^\epsilon_{q,Q}>T^*\}$ and $\{\sup_{t\in[0,T^*]}L(Y^\epsilon_t)\ge Q\}$. Therefore, combining (38) and (39), we obtain that the inequality
$$\limsup_{\epsilon\to0}\ \epsilon\log\sup_{(y_0,z_0)\in\overline{V_{2q}}\times D}\mathbb{P}_{(Y^\epsilon_0,Z^\epsilon_0)=(y_0,z_0)}\big(Y^\epsilon_{\tau^\epsilon_{q,Q}}\in\partial V_Q\big)\le-\hat I_{Q,\delta}\qquad(40)$$
holds for all sufficiently small $q$. The lemma then follows by letting $\delta\to0$.

We are now ready to establish Theorem B.21.

Proof of Theorem B.21. Choose $q<\frac Q2$ to be sufficiently small. Define a sequence of stopping times $\theta_0<\tau_0<\theta_1<\tau_1<\cdots$ recursively by letting
$$\tau_m:=\inf\{t>\theta_m:\ Y^\epsilon_t\in\partial V_q\cup\partial V_Q\},\qquad\theta_m:=\inf\{t>\tau_{m-1}:\ Y^\epsilon_t\in\partial V_{2q}\}.$$
By Lemma B.22, all these stopping times are finite almost surely. Write $\tilde Y_m=Y^\epsilon_{\tau_m}$, which is a Markov chain. Recall that by Lemma B.25, $I_{q,Q}>0$ for all $0<q<Q$, and it is a decreasing function of $q$. Let $I_Q:=\frac12\lim_{q\to0}I_{q,Q}>0$ and fix $0<\alpha<\frac14I_Q$. By Lemma B.27, if we fix a sufficiently small $q$, then
$$\limsup_{\epsilon\to0}\ \epsilon\log\sup_{(y,z)\in\overline{V_{2q}}\times D}\mathbb{P}_{(Y^\epsilon_0,Z^\epsilon_0)=(y,z)}\big(Y^\epsilon_{\tau^\epsilon_{q,Q}}\in\partial V_Q\big)\le-I_Q-3\alpha.$$
By plugging in $(Y^\epsilon_{\theta_m},Z^\epsilon_{\theta_m})$ as the value of $(y,z)$ and using the strong Markov property, we deduce that there exists $\epsilon_0>0$ such that for all $0<\epsilon<\epsilon_0$ and $m\ge1$,
$$\sup_{(y_0,z_0)\in V_Q\times D}\epsilon\log\mathbb{P}_{(Y^\epsilon_0,Z^\epsilon_0)=(y_0,z_0)}\big(\tau^\epsilon_Q=\tau_m\big)\le-I_Q-2\alpha.\qquad(41)$$
On the other hand, assuming $\epsilon_0$ is sufficiently small, applying Lemma B.23 with $a=I_Q+2\alpha$, $\delta=\frac12q$ and $K=\overline{V_Q}$ yields that, for some fixed $T_0>0$ depending on $q$ and $Q$, and all $\epsilon\in(0,\epsilon_0]$,
$$\sup_{(y_0,z_0)\in V_Q\times D}\epsilon\log\mathbb{P}_{(Y^\epsilon_0,Z^\epsilon_0)=(y_0,z_0)}\big(\theta_m-\tau_{m-1}\le T_0\big)\le\sup_{(y,z)\in V_Q\times D}\epsilon\log\mathbb{P}_{(Y^\epsilon_0,Z^\epsilon_0)=(y,z)}\Big(\sup_{t\in[0,T_0]}\big|(Y^\epsilon_t-y,Z^\epsilon_t-z)\big|\ge\frac12q\Big)\le-I_Q-2\alpha.\qquad(42)$$
For a given $M$, the event $\{\tau^\epsilon_Q\le MT_0\}$ is contained in the union of the events $\bigcup_{m=0}^M\{\tau^\epsilon_Q=\tau_m\}$ and $\bigcup_{m=1}^M\{\theta_m-\tau_{m-1}\le T_0\}$. Combining this fact with the inequalities (41), (42) yields
$$\sup_{(y_0,z_0)\in V_q\times D}\mathbb{P}_{(Y^\epsilon_0,Z^\epsilon_0)=(y_0,z_0)}\big(\tau^\epsilon_Q\le MT_0\big)\le\sup_{(y_0,z_0)\in V_q\times D}\mathbb{P}_{(Y^\epsilon_0,Z^\epsilon_0)=(y_0,z_0)}\big(\tau^\epsilon_Q=\tau_0\big)+\sum_{m=1}^M\sup_{(y_0,z_0)\in V_q\times D}\mathbb{P}_{(Y^\epsilon_0,Z^\epsilon_0)=(y_0,z_0)}\big(\tau^\epsilon_Q=\tau_m\big)+\sum_{m=1}^M\sup_{(y_0,z_0)\in V_q\times D}\mathbb{P}_{(Y^\epsilon_0,Z^\epsilon_0)=(y_0,z_0)}\big(\theta_m-\tau_{m-1}\le T_0\big)\le\sup_{(y_0,z_0)\in V_q\times D}\mathbb{P}_{(Y^\epsilon_0,Z^\epsilon_0)=(y_0,z_0)}\big(\tau^\epsilon_Q=\tau_0\big)+2Me^{-\frac{I_Q+2\alpha}{\epsilon}}.$$
Set $M=\lceil T_0^{-1}e^{I_Q/\epsilon}\rceil$, which makes $MT_0\ge e^{I_Q/\epsilon}$ for all sufficiently small $\epsilon$. Moreover, as $\epsilon\to0$, $\sup_{(y_0,z_0)\in V_q\times D}\mathbb{P}_{(Y^\epsilon_0,Z^\epsilon_0)=(y_0,z_0)}(\tau^\epsilon_Q=\tau_0)\to0$ by Lemma B.22; and $2Me^{-\frac{I_Q+2\alpha}{\epsilon}}\approx2T_0^{-1}e^{-\frac{2\alpha}{\epsilon}}\to0$ by the choice of $M$.
Thus
$$\sup_{(y_0,z_0)\in V_q\times D}\mathbb{P}_{(Y^\epsilon_0,Z^\epsilon_0)=(y_0,z_0)}\big(\tau^\epsilon_Q\le e^{I_Q/\epsilon}\big)\le\sup_{(y_0,z_0)\in V_q\times D}\mathbb{P}_{(Y^\epsilon_0,Z^\epsilon_0)=(y_0,z_0)}\big(\tau^\epsilon_Q\le MT_0\big),$$
which tends to $0$ as $\epsilon\to0$. This completes the proof of Theorem B.21.

B.4 PROOF OF PROPOSITION 4.2

Recall that the movement of (2) stays in $\Gamma_\#$, i.e. $|X_t|\in[R_-,R_+]$, and can be characterized by the new model (6), (5). After applying the time change $\bar T=(\eta\lambda)^{\frac12}t$ and writing $\bar\gamma_{\bar T}=\eta\lambda\gamma_t=\lambda\eta^{-1}|X_t|^4$, we get the following system of equations on $(X_{\bar T},\bar\gamma_{\bar T})\in S^{d-1}\times\mathbb{R}^+$:
$$dX_{\bar T}=-\bar\gamma_{\bar T}^{-\frac12}\nabla L(X_{\bar T})\,d\bar T+(\eta\lambda)^{\frac14}\bar\gamma_{\bar T}^{-\frac12}\sigma(X_{\bar T})\,dB^d_{\bar T};\qquad(44)$$
$$d\bar\gamma_{\bar T}=(\eta\lambda)^{\frac12}\big(-4\bar\gamma_{\bar T}+2\operatorname{Tr}\Sigma(X_{\bar T})\big)\,d\bar T.\qquad(45)$$
To deduce (44), we used the standard fact that for any $a>0$, $a^{\frac12}B^d_{a^{-1}T}$ and $B^d_T$ are equivalent as Wiener processes. Note that $\bar\gamma_{\bar T}$ stays in the fixed interval $[\bar\gamma_-,\bar\gamma_+]=[\lambda\eta^{-1}R_-^4,\lambda\eta^{-1}R_+^4]$ as long as $X_t\in\Gamma_\#$.

As there are only finitely many basins, we may fix an index $i$ without affecting Proposition B.28. In addition, without loss of generality, assume $L=0$ on $\Gamma^i$. For our purpose it is more convenient to measure the distance to $\Gamma^i$ on $U^i_{\#,p_0}$, where $p_0>p_1$ is sufficiently small; in particular, the sets $U^j_{\#,p_0}$ are disjoint for distinct $j$'s. For $q\ge0$ we will write $V^i_{r,q}=\{x\in U^i_{r,p_0}:L(x)<q\}$ and $V^i_{\#,q}=\{x\in U^i_{\#,p_0}:L(x)<q\}$. By Assumption 1.2.(ii), one may manipulate $p_0,q_0,p_1,q_1$ (with $p_0>p_1$, $q_0>q_1$) such that
$$U^i_{1,p_1}\subseteq V^i_{1,q_1}\subseteq V^i_{1,q_0}\subseteq U^i_{1,p_0},$$
and thus reformulate Proposition 4.2 as:

Proposition B.28. There exists $c>0$ such that if $q_0>q_1>0$ are fixed, but $q_1$ is sufficiently small compared to $q_0$, then in the regime $\eta\le O(\lambda)\le O(1)$,
$$\lim_{\eta\lambda\to0}\mathbb{P}_{(X_0,\bar\gamma_0)=(y_0,z_0)}\Big(X_{\bar T}\text{ remains in }V^i_{1,q_0}\text{ for all }\bar T\in\big[0,e^{c(\eta\lambda)^{-1/2}}\big]\Big)=1,$$
and the convergence is uniform with respect to $(y_0,z_0)\in V^i_{1,q_1}\times[\bar\gamma_-,\bar\gamma_+]$.

Proof. In order to apply Theorem B.21 with domain $V^i_{1,q_0}$ and control domain $D=[\bar\gamma_-,\bar\gamma_+]$, we first make the following remark. Theorem B.21 is stated for a domain that is an open neighborhood in $\mathbb{R}^n$, while our current $V^i_{1,q_0}$ is a neighborhood in the sphere. This is not a problem: unless $L$ is a constant function, $\Gamma^i_1$ is a proper subset of the sphere $S^{d-1}$, and $V^i_{1,q}$ is also a proper subset of $S^{d-1}$ for small $q$. One can then change coordinates and identify $V^i_{1,q}$ with a subset of Euclidean space. This converts the problem into the equations (16), (17) with the following dictionary: $\epsilon=(\eta\lambda)^{\frac12}$; $X_{\bar T}$ and $\bar\gamma_{\bar T}$ play the roles of $Y^\epsilon_t$ and $Z^\epsilon_t$ respectively; $b(y,z)=-z^{-\frac12}\nabla L(y)$; $\sigma(y,z)=z^{-\frac12}\sigma(y)$; and $h(\epsilon,y,z)=\epsilon\big(-4z+2\operatorname{Tr}\Sigma(y)\big)$.

It remains to check Assumptions B.8 and B.19. We start with the latter. The strict positivity of $L$ on the boundary follows directly from the construction of $L$ and of the neighborhood $V^i_{1,q_0}$. Property (1) in Assumption B.19 holds because $b$ is negatively proportional to the gradient of $L$ with respect to the spherical coordinates. Unfortunately, Assumption B.8 does not automatically hold, as trajectories may escape from $V^i_{1,q_0}$. In order to adapt to this, we smoothly modify the value of $\sigma$ on $V^i_{1,q_0}$ so that it remains unchanged on $V^i_{1,q_1}$ and vanishes near $\partial V^i_{1,q_0}$. Then near the boundary the SDEs (16), (17) become deterministic and, because of Assumption B.19.(1), the trajectories do not escape from $V^i_{1,q_0}$. Moreover, we know that $\bar\gamma_t$ remains in $D=[\bar\gamma_-,\bar\gamma_+]$. This verifies Assumption B.8 for the modified model.
We conclude by applying Theorem B.21 that, for some fixed $I>0$ and for all initial positions in $V^i_{1,q_1}\times[\bar\gamma_-,\bar\gamma_+]$, the probability that a trajectory $(X_{\bar T},\bar\gamma_{\bar T})$ of the modified model leaves $V^i_{1,q_0}\times[\bar\gamma_-,\bar\gamma_+]$ before $\bar T=e^{I(\eta\lambda)^{-1/2}}=e^{O((\eta\lambda)^{-1/2})}$ converges to $0$ as $\eta\lambda\to0$. The convergence is in addition uniform with respect to the initial position. Since such modifications only take place outside $V^i_{1,q_1}\times[\bar\gamma_-,\bar\gamma_+]$, the same statement also holds for the original model. As $V^i_{1,q_1}\subseteq V^i_{1,q_0}$, we obtain the statement of Proposition B.28 after reparametrization of variables.

C UNIQUE KATZENBERGER LIMIT INSIDE EACH BASIN

The results from Appendix B, stated in the form of Proposition 4.2, guarantee that, after discarding an exponentially small subset of random instances, the trajectories of (2) stay inside the basin that contains their initial position for an exponentially long time $O\big(e^{C(\eta\lambda)^{-1/2}}\big)$. We now justify Proposition 4.3.

We restart (2) from an initial point, still written as $x_0$ by abuse of notation, in some $U^i_{\#,p_1}$. We now apply a uniform approximation theorem, which is a stronger version of (Li et al., 2022a, Theorem 4.6). We can prove this uniform approximation result because we can strengthen (Katzenberger, 1991, Theorem 6.3) into a compactness theorem that is uniform with respect to the initial points of the SDE.

To describe this result, let $(\Omega_n,\mathcal F_n,\{\mathcal F^n_t\}_{t\ge0},P)$ be a filtered probability space, $Z_n$ an $\mathbb{R}^e$-valued cadlag $\{\mathcal F^n_t\}$-semimartingale with $Z_n(0)=0$, and $A_n$ a real-valued cadlag $\{\mathcal F^n_t\}$-adapted nondecreasing process with $A_n(0)=0$. Let $\sigma_n:U\to M(d,e)$ be continuous with $\sigma_n\to\sigma$ uniformly on compact subsets of $U$. Let $X_n$ be an $\mathbb{R}^d$-valued cadlag $\{\mathcal F^n_t\}$-semimartingale satisfying, for all compact $K\subseteq U$,
$$X_n(t)=X_n(0)+\int_0^t\sigma_n(X_n)\,dZ_n-\int_0^t\nabla L(X_n)\,dA_n\qquad(46)$$
for all $t\le\lambda_n(K)$, where $\lambda_n(K)=\inf\{t\ge0:X_n(t-)\notin K\ \text{or}\ X_n(t)\notin K\}$ is the stopping time of $X_n$ leaving $K$.

Theorem C.1. Suppose $X_n(0)\in U$ and that Assumptions 3.1, 3.2 and Conditions B.2, B.3, B.4, B.5 from (Li et al., 2022a) hold. For any compact $K\subseteq U$ define $\mu_n(K)=\inf\{t\ge0:Y_n(t-)\notin K\ \text{or}\ Y_n(t)\notin K\}$; then the sequence $\{(Y_n^{\mu_n(K)},Z_n^{\mu_n(K)},\mu_n(K))\}$ is relatively compact in $D_{\mathbb{R}^{d+e}}[0,\infty)\times[0,\infty)$. If $(Y,Z,\mu)$ is a limit point of this sequence under the Skorohod metric, then $(Y,Z)$ is a continuous semimartingale, $Y(t)\in\Gamma$ for every $t\ge0$ a.s., $\mu\le\inf\{t\ge0:Y(t)\notin K\}$ a.s., and $Y(t)$ admits
$$Y(t)=Y(0)+\int_0^{t\wedge\mu}\partial\Phi(Y(s))\,\sigma(Y(s))\,dZ(s)+\frac12\int_0^{t\wedge\mu}\sum_{i,j,k,l}\partial_{ij}\Phi(Y(s))\,\sigma(Y(s))_{ik}\,\sigma(Y(s))_{jl}\,d[Z_k,Z_l](s).\qquad(47)$$
We present Assumptions 3.1, 3.2 and Conditions B.2, B.3, B.4, B.5 from (Li et al., 2022a) below, restated as Assumptions C.3, C.4 and Conditions C.5–C.8. The main difference between Theorem C.1 and (Katzenberger, 1991, Theorem 6.3) is that we allow the initial point $X_n(0)$ to vary within $U$.

Theorem C.2. Let the manifold $\Gamma$ and its open neighborhood $U$ satisfy Assumptions 3.1 and 3.2. Let $K\subseteq U$ be any compact set and $x_{n,0}\in K$ be a sequence of initial points. Consider the SGD formulated in (46) with $X_{\eta_n}(0)=x_{n,0}$. Define $Y_{\eta_n}(t)=X_{\eta_n}(t)-\Phi(X_{\eta_n}(0),A_{\eta_n}(t))+\Phi(X_{\eta_n}(0))$ and $\mu_{\eta_n}(K)=\min\{t\in\mathbb{N}:Y_{\eta_n}(t)\notin K\}$. Then the sequence $\{(Y^{\mu_{\eta_n}(K)}_{\eta_n},Z^{\mu_{\eta_n}(K)}_{\eta_n},\mu_{\eta_n}(K))\}_{n\ge1}$ is relatively compact in $D_{\mathbb{R}^{d+e}}[0,\infty)\times[0,\infty)$. If $(Y,Z,\mu)$ is a limit point of this sequence, it holds that $Y(t)\in\Gamma$ a.s. for all $t\ge0$, $\mu\le\inf\{t\ge0:Y(t)\notin K\}$, and $Y(t)$ admits
$$Y(t)=Y(0)+\int_0^{t\wedge\mu}\partial\Phi(Y(s))\,\sigma(Y(s))\,dW(s)+\frac12\int_0^{t\wedge\mu}\sum_{i,j=1}^d\partial_{ij}\Phi(Y(s))\big(\sigma(Y(s))\sigma(Y(s))^{\top}\big)_{ij}\,ds\qquad(48)$$
where $\{W(s)\}_{s\ge0}$ is the standard Brownian motion.
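For intuition about the objects $\Phi$, $\partial\Phi$ and $\partial_{ij}\Phi$ entering (47) and (48), the following is a hedged toy computation that is not taken from the paper: take $d=2$ and $L(x)=\tfrac12(|x|-1)^2$, so that $\Gamma$ is the unit circle and the gradient-flow limit map is $\Phi(x)=x/|x|$ on $\mathbb{R}^2\setminus\{0\}$.

```latex
% Toy example (illustrative assumption, not from the paper): d = 2,
% L(x) = (|x| - 1)^2 / 2, Gamma = unit circle, Phi(x) = x / |x| on R^2 \ {0}.
% Differentiating Phi gives a scaled tangential projection:
\[
  \partial\Phi(x) \;=\; \frac{1}{|x|}\Big(I - \frac{x x^{\top}}{|x|^{2}}\Big),
  \qquad
  \partial\Phi(y) \;=\; I - y y^{\top}\quad\text{for } y \in \Gamma .
\]
% Hence on Gamma the martingale part of (48) is the tangential component of
% sigma(Y) dW, while the second-order term built from the Hessian of Phi is the
% Ito correction that keeps the limiting diffusion Y(t) on Gamma.
```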
Proof. The proof of Theorem C.2 follows the proof of (Li et al., 2022a, Theorem B.8), which uses (Li et al., 2022a, Lemma B.6) and the standard Katzenberger theorem (Katzenberger, 1991, Theorem 6.3). One difference is that here not all trajectories stay inside one basin. However, we claim that the probability that trajectories escape the basin goes to zero as $\eta\lambda$ tends to zero. Once this claim is proved, Theorem C.2 is an immediate consequence of (Li et al., 2022a, Lemma B.6) and Theorem C.1.

To prove the claim, we adopt the same idea as in the proof of Proposition 4.2 in Appendix B.4. Although trajectories may escape from the level set $V^i_{1,q_0}$, we can smoothly modify the value of $\sigma$ on $V^i_{1,q_0}$ so that it remains unchanged on $V^i_{1,q_1}$ and vanishes near the boundary $\partial V^i_{1,q_0}$. Near the boundary the SDEs become deterministic, and thus the trajectories of the modified model do not escape through $\partial V^i_{1,q_0}$. Now, as $\eta\lambda$ tends to zero, by Theorem B.21 the probability that a trajectory of the modified model leaves $V^i_{1,q_1}\times[\bar\gamma_-,\bar\gamma_+]$ before any fixed time $T$ tends to $0$. Since such modifications only take place outside $V^i_{1,q_1}\times[\bar\gamma_-,\bar\gamma_+]$, the same statement holds for the original model. This finishes the proof of the claim.

Proof of Proposition 4.3. The above uniform version of Katzenberger's theorem guarantees that, starting from different initial points in the same compact neighborhood of the basin, the distribution of trajectories associated with (2) is still close to that of the Katzenberger SDE (47). By Proposition 3.1, the latter mixes towards a unique equilibrium $\nu_i$. Note that even though in Section 3 we have only proved this in the one-basin case, Theorem 4.1 shows that with large probability the trajectories do not escape from the basin. Those trajectories satisfy a modified SDE as before, for which no trajectory escapes the basin, and at this point we can directly apply Proposition 3.1. It follows that within any polynomial time window under consideration, the distributions of trajectories associated with (2) also mix towards $\nu_i$. This proves Proposition 4.3.

Assumption C.3. ((Li et al., 2022a, Assumption 3.1)) Assume that the loss $L:\mathbb{R}^d\to\mathbb{R}$ is a $C^3$ function and that $\Gamma$ is a $(d-M)$-dimensional $C^2$-submanifold of $\mathbb{R}^d$ for some integer $0\le M\le d$, where for all $x\in\Gamma$, $x$ is a local minimizer of $L$ and $\operatorname{rank}(\nabla^2L(x))=M$.

Assumption C.4. ((Li et al., 2022a, Assumption 3.2)) Assume that $U$ is an open neighborhood of $\Gamma$ satisfying that gradient flow starting in $U$ converges to some point in $\Gamma$, i.e., $\forall x\in U$, $\Phi(x)\in\Gamma$. (Then $\Phi$ is $C^2$ on $U$ by (Falconer, 1983).)

Condition C.5. ((Li et al., 2022a, Lemma B.2)) The integrator sequence $\{A_n\}_{n\ge1}$ is asymptotically continuous: $\sup_{t>0}|A_n(t)-A_n(t-)|\to0$, where $A_n(t-)=\lim_{s\uparrow t}A_n(s)$ is the left limit of $A_n$ at $t$.

Condition C.6. ((Li et al., 2022a, Lemma B.3)) The integrator sequence $\{A_n\}_{n\ge1}$ increases infinitely fast: $\forall\epsilon>0$, $\inf_{t\ge0}\big(A_n(t+\epsilon)-A_n(t)\big)\to\infty$.

Condition C.7. ((Katzenberger, 1991, Equation 5.1), (Li et al., 2022a, Lemma B.4)) For every $T>0$, the quantity appearing in (Katzenberger, 1991, Equation 5.1), a supremum over $0\le t\le T$, tends to $0$ as $n\to\infty$.

Condition C.8. ((Li et al., 2022a, Lemma B.5)) For every $\delta>0$ (allowing $\delta=\infty$) and every $n\ge1$, there exist stopping times $\{\tau^m_n:m\ge1\}$ and a decomposition of $Y_n-J_\delta(Y_n)$ into a local martingale $M_n$ plus a finite-variation process $F_n$ such that $P[\tau^m_n\le m]\le\frac1m$, the family $\{[M_n](t\wedge\tau^m_n)+T_{t\wedge\tau^m_n}(F_n)\}_{n\ge1}$ is uniformly integrable for every $t\ge0$ and $m\ge1$, and
$$\lim_{\gamma\to0}\limsup_{n\to\infty}P\Big[\sup_{0\le t\le T}\big(T_{t+\gamma}(F_n)-T_t(F_n)\big)>\epsilon\Big]=0$$
for every $\epsilon>0$ and $T>0$, where $T_t(\cdot)$ denotes the total variation on the interval $[0,t]$.
It was shown in (Li et al., 2022a, Lemma B.6) that, for the SGD formulated in (46), the sequences $\{A_n\}_{n\ge1}$ and $\{Z_n\}_{n\ge1}$ satisfy Conditions C.5, C.6, C.7, and C.8, and that the landscape of $L$ satisfies Assumptions C.3 and C.4. Thus the Katzenberger theorem applies in our setting.
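As a hedged illustration of Conditions C.5 and C.6 (the exact integrator used by (Li et al., 2022a) may differ), consider the step-function integrator $A_n(t)=\eta_n\lfloor t/\eta_n^2\rfloor$: its largest jump is $\eta_n\to0$ (Condition C.5), while its increment over any window of length $\epsilon$ is of order $\epsilon/\eta_n\to\infty$ (Condition C.6). The snippet below checks both properties numerically; all names and constants are illustrative assumptions.

```python
import numpy as np

# Hedged toy check of Conditions C.5 and C.6 for the illustrative integrator
# A_n(t) = eta_n * floor(t / eta_n**2); the integrator of (Li et al., 2022a)
# may differ, this only demonstrates what the two conditions ask for.

def A(t, eta):
    return eta * np.floor(t / eta**2)

eps = 0.05
for eta in (0.1, 0.03, 0.01):
    ts = np.linspace(0.0, 1.0, 200001)
    max_jump = np.max(np.diff(A(ts, eta)))              # ~ eta       (Condition C.5)
    min_growth = np.min(A(ts + eps, eta) - A(ts, eta))  # ~ eps / eta (Condition C.6)
    print(f"eta={eta}: largest jump ~ {max_jump:.3f}, "
          f"min increment over eps={eps} ~ {min_growth:.2f}")
```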